FlowDB Dataset

Introduction

FlowDB is a dataset of hourly river flow and precipitation data for more than 9,000 rivers in the United States. The dataset was created to give insight into flash floods and droughts and to study how climate change affects watersheds.

Dataset Creation

Part I

Dataset Construction Notebook (I): In this notebook the initial meta-data files are built for each river.

Dataset Construction Notebook (II): This is the notebook where the bulk of the data is gathered.

USGS Station Meta-Data Part I (includes longitude and latitude): This file was originally retrieved

USGS River flow data retrieved from site

Re-adding the meta-data (a notebook that documents how we later added lat/lon data).

Data Format

Metadata format:

Metadata is in the following format:

gage_id: The USGS gage id. This number may have a leading zero that was accidentally omitted, so in the example below the actual id would be 06324500, which corresponds to the

stations: A list of the closest weather stations, ordered by ascending distance.

cat: The category of weather station.

dist: The distance in miles between the gage and the weather station.

missing_precip: The number of missing precipitation values between 2014 and 2018.

missing_temp: The number of missing temperature values.

station_id: The identifier of the weather station (GCC in the example below).

{'gage_id': 6324500, 'latitude': 45.0571972, 'longitude': -105.8783778, 'stations': [{'cat': 'ASOS', 'dist': 83.18954430440758, 'missing_precip': 268566, 'missing_temp': 276833, 'station_id': 'GCC'}, {'cat': 'ASOS',
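Because gage_id is stored as a number, the leading zero has to be restored before the id can be matched against USGS site numbers or file names. The following is a minimal sketch of one way to do that, assuming eight-digit ids as in the examples in this document. The helper names normalize_gage_id and closest_station are hypothetical, and the record is abridged from the example above to its first station.

```python
def normalize_gage_id(gage_id, width=8):
    """Return the gage id as a zero-padded string (the width of 8 is an assumption)."""
    return str(gage_id).zfill(width)


def closest_station(meta):
    """Return the station entry with the smallest 'dist' value."""
    return min(meta["stations"], key=lambda station: station["dist"])


# Record abridged from the example above; the remaining stations are omitted.
meta = {
    "gage_id": 6324500,
    "latitude": 45.0571972,
    "longitude": -105.8783778,
    "stations": [
        {
            "cat": "ASOS",
            "dist": 83.18954430440758,
            "missing_precip": 268566,
            "missing_temp": 276833,
            "station_id": "GCC",
        },
    ],
}

print(normalize_gage_id(meta["gage_id"]))
print(closest_station(meta)["station_id"])
```

Ids that are already eight characters or longer pass through unchanged, so the helper is safe to apply to ids that kept their leading zero.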

File naming convention

Files are named as follows: gage_id_station_id.csv. For example, in 01037380KRKD_flow.csv, 01037380 is the USGS gage id and KRKD is the weather station id. Note that in this example the two ids are concatenated without a separator and are followed by _flow.
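To recover the two ids from a file name, a small parser can help. The sketch below follows the shape of the example file name (digits, then a station id beginning with a letter, then _flow.csv); this pattern is an assumption drawn from that single example, not a confirmed specification, and parse_flow_filename is a hypothetical helper.

```python
import re

# Pattern inferred from the example file name only:
# digits (gage id), a station id starting with a letter, then "_flow.csv".
FLOW_FILE_RE = re.compile(
    r"(?P<gage_id>\d+)(?P<station_id>[A-Za-z][A-Za-z0-9]*)_flow\.csv"
)


def parse_flow_filename(name):
    """Split a file name into (gage_id, station_id); return None if it does not match."""
    match = FLOW_FILE_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("gage_id"), match.group("station_id")


print(parse_flow_filename("01037380KRKD_flow.csv"))
```

The gage id is kept as a string so that its leading zero survives. A station id that begins with a digit would be ambiguous under this pattern.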

River data format:

The river data is stored as comma-separated (.csv) files.

Unnamed: 0_x | hour_updated | p01m | valid | tmpf | Unnamed: 0_y | agency_cd | site_no | datetime | tz_cd | 69512_00060 | 69512_00060_cd | cfs
5.0 | 2014-01-01 06:00:00+00:00 | 0.0 | 2014-01-01 05:58 | 28.94 | 5.0 | USGS | 1491000 | 2014-01-01 06:00:00+00:00 | EST | 418 | A | 418.0
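As an illustration of working with these columns, the sketch below parses one row with Python's standard csv module and converts the values to typed fields, adding metric equivalents for temperature and discharge. The in-memory sample is rebuilt from a subset of the example row above; the exact layout of the real files, the parse_row helper, and the output field names are assumptions.

```python
import csv
import io
from datetime import datetime

# One row rebuilt from a subset of the example above; real files may differ.
SAMPLE_CSV = (
    "hour_updated,p01m,tmpf,site_no,datetime,tz_cd,cfs\n"
    "2014-01-01 06:00:00+00:00,0.0,28.94,1491000,"
    "2014-01-01 06:00:00+00:00,EST,418.0\n"
)

CFS_TO_CMS = 0.3048 ** 3  # one cubic foot expressed in cubic metres


def parse_row(row):
    """Convert the string fields of one CSV row into typed values."""
    temp_f = float(row["tmpf"])
    flow_cfs = float(row["cfs"])
    return {
        "hour_updated": datetime.fromisoformat(row["hour_updated"]),
        "precip_mm": float(row["p01m"]),
        "temp_f": temp_f,
        "temp_c": (temp_f - 32.0) * 5.0 / 9.0,
        "flow_cfs": flow_cfs,
        "flow_cms": flow_cfs * CFS_TO_CMS,
    }


rows = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0])
```

This sketch does not handle missing values; an empty field would make float() raise, so rows with gaps would need to be filtered or filled first.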

Data dictionary

Column name | Type | Description
hour_updated | datetime | The weather station time. This datetime was originally in UTC (we left it as is).
datetime | datetime | The USGS datetime (which has also been converted to UTC).
p01m | float | The precipitation in millimeters that occurred during the past hour.
tmpf | float | The temperature in Fahrenheit (average over the past hour).
cfs | float | The discharge in cubic feet per second (the target variable).
tz_cd | string | The original time zone code of the USGS data.
agency_cd | string | The code of the agency in charge. Not really helpful: always USGS.
site_no | string | The USGS site number of the gage (1491000 in the example row).
Unnamed: 0_x | string |

River flow information

Some rivers are dam-fed, whereas others depend entirely on natural flows. Additionally, the way flow is measured has been altered on some rivers.

Accessing the dataset

Data is currently stored in Google Cloud Storage buckets and can be accessed using the gsutil tool or by downloading files manually.

This will download all the temporal data:

gsutil cp -r gs://aistream-datasets/day_addition .

This will download all the meta-data:

gsutil cp -r gs://aistream-datasets/flow/meta_data .

See this colaboratory notebook if you don’t know how to install/use gsutil.
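For scripted downloads, the same gsutil commands can be driven from Python. This is only a sketch: it assumes gsutil is already installed and on PATH, it uses only the cp -r form shown above, and the helper names gsutil_cp_command and download are hypothetical.

```python
import shutil
import subprocess


def gsutil_cp_command(source, destination="."):
    """Build the argument list for `gsutil cp -r SOURCE DESTINATION`."""
    return ["gsutil", "cp", "-r", source, destination]


def download(source, destination="."):
    """Run the copy, raising if gsutil is missing or the command fails."""
    if shutil.which("gsutil") is None:
        raise RuntimeError("gsutil was not found on PATH")
    subprocess.run(gsutil_cp_command(source, destination), check=True)


# Show the command that would be run for the meta-data bucket path.
print(" ".join(gsutil_cp_command("gs://aistream-datasets/flow/meta_data")))
```

Calling download("gs://aistream-datasets/flow/meta_data") would then copy the meta-data into the current directory.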

Dataset Versions and Planned Updates

V.1.0: Released 9/2/20: River flow data, precipitation data, and station distance and latitude/longitude meta-data. This can be found at gs://aistream-datasets/day_addition

V.1.1: In Progress 1/20/20: Expanding data to include additional years 2000-2020. Including additional USGS gage data where available such as height, sediment, etc. Finalizing a cleaned up subset to post to Kaggle.

V.1.2: Planned, TBD: Adding aerial imagery of river basins and data from + finalizing flash flood linkage.

V.2.0: Adding soil moisture data, snowpack data, yearly mean, humidity, and slope data.