Scope freeze — reservoir_cyano_water_quality YearLocationdex (6th YearLocationdex kind)
Decided 2026-06-27 (Mike pasted the data.gov catalog URL). Source = EPA ScienceHub dataset "1987-2018 Cyanobacteria and Water Quality Data for 20 Reservoirs" (DOI 10.23719/1503175), the data behind Smucker, Beaulieu, Nietch & Young, "Increasingly severe cyanobacterial blooms and deep water hypoxia coincide with warming water temperatures in reservoirs," Global Change Biology 27(11):2507-2519 (2021), https://doi.org/10.1111/gcb.15618.
This is the 6th YearLocationdex kind (slot = place × period cell), after drought and the four
NARS water-quality kinds (nla_water_quality, ncca_water_quality, nrsa_water_quality,
nwca_wetland_chemistry). Same family, same shape: a fixed waterbody measured over repeated
periods, one cell per (place, period). The grain follows the dataset's own analytical unit, the
reservoir-year row (the merged master table is exactly 20 reservoirs × 32 years = 640 rows).
Slot
One (reservoir, year) cell holding that reservoir-year's MEASURED summer water-quality
indicators: the year's maximum cyanobacteria cell density, plus chlorophyll, Secchi clarity,
nutrients (surface + inflow), summer precipitation, May-Oct surface water temperatures, deep
water temperatures, and deep dissolved oxygen. 20 US Army Corps of Engineers reservoirs in
Kentucky / Indiana / Ohio × years 1987-2018 = 640 cells.
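The cell grid described above can be enumerated directly from the roster and the year range. This is a minimal sketch, not the build script's code: build_cell_keys is an illustrative name, and the R01 through R20 labels are hypothetical placeholders standing in for the real roster abbreviations.

```python
from itertools import product

FIRST_YEAR, LAST_YEAR = 1987, 2018  # closed study period stated in the note


def build_cell_keys(reservoirs):
    """Return every (reservoir, year) key, ordered by reservoir then year."""
    years = range(FIRST_YEAR, LAST_YEAR + 1)  # inclusive range: 32 years
    return list(product(sorted(reservoirs), years))


# Hypothetical placeholder abbreviations for the 20 roster entries.
demo_reservoirs = [f"R{i:02d}" for i in range(1, 21)]
cell_keys = build_cell_keys(demo_reservoirs)
print(len(cell_keys))  # 20 reservoirs x 32 years
```

Because the keys are generated from the full grid rather than from the rows that happen to exist, a reservoir-year with no measurements still gets a key.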
Source
Two EPA ScienceHub tables (multi-source cited cell, the universal rule):
- Roster (Reservoir_information.xlsx) — the place: abbreviation, name, type (Forest / Ag / Urban), stratification, latitude, longitude, year filled, watershed area, forest fraction, surface area, storage volume, max/mean depth, Zmean:Zmax.
- Master measurements (CyanoMaxCD_environmental_vars_FINAL.xlsx, sheet Data) — the merged per-reservoir-year measured indicators (cyanobacteria max cell density + environmental variables) the study used for its correlations and multivariate analysis.
The cyanobacteria cell densities derive from samples the US Army Corps of Engineers collected at each reservoir's deepest station (station 20001), counted and provided to EPA in October 2019.
Measured reality — IN / OUT (bright line feedback_measured_reality_only)
- IN — the raw per-reservoir-per-year measurements: cyanobacteria maximum cell density (cells/mL), chlorophyll-a (µg/L), Secchi clarity (cm), total phosphorus and dissolved phosphorus (ppb), TKN / NH3 / NOx nitrogen forms (ppm), TOC (ppm), alkalinity (ppm), the inflow nutrient set, summer (Jun-Aug) precipitation (inches), May/Jun/Jul/Aug surface water temperatures (°C), May/Jun deep dissolved oxygen (mg/L), and May/Jun/Jul/Aug deep water temperatures (°C). Carried per cell, null where a reservoir-year did not measure it.
- OUT — the study's GAM fits (GAM_for_stratifying_reservoirs.xlsx, GAM_for_nonstratifying_reservoirs.xlsx) are Generalized Additive Model output (smoothed fitted trends), not measurements → not ingested. The merged table's logCyanoMax (a log transform) and Summer_precip_Z-score (a standardization) are skipped as redundant derivatives of measurements we already carry, not as a bright-line exclusion.
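The skip rule for derived columns can be sketched as a simple filter over one master-table row. Only the two skipped column names come from this note; measured_fields and the other column names in the example row are hypothetical, and the values are made up for illustration.

```python
# Derived columns named in the scope note; skipped because the measurements
# they are computed from are already carried.
SKIPPED_DERIVED = {"logCyanoMax", "Summer_precip_Z-score"}


def measured_fields(row):
    """Return a copy of one master-table row without the derived columns."""
    return {name: value for name, value in row.items() if name not in SKIPPED_DERIVED}


# Hypothetical row: "Reservoir", "Year" and "CyanoMax" are illustrative names,
# and every value here is invented for the example.
example_row = {
    "Reservoir": "R01",
    "Year": 2005,
    "CyanoMax": 125000.0,
    "logCyanoMax": 5.09691,
    "Summer_precip_Z-score": -0.4,
}
kept = measured_fields(example_row)
```

The filter returns a new dict, so the source row is left intact.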
"data is data" (feedback_data_is_data_partial_coverage)
216 of the 640 cells have no cyanobacteria count (424 do), and many cells miss individual analytes. Every cell is carried and the missing fields nulled, rather than dropping sparse reservoir-years. A reservoir's thinner early record is not a rejection reason.
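A minimal sketch of the carry-everything rule, assuming measurements arrive as a sparse mapping keyed by (reservoir, year). The function names, field names, and toy values are all hypothetical; the point is only that a missing measurement becomes None and never removes the cell.

```python
def carry_all_cells(cell_keys, measured, fields):
    """Build one record per key; fields with no measurement stay None."""
    cells = []
    for key in cell_keys:
        found = measured.get(key, {})  # a wholly unmeasured cell is still emitted
        cells.append({"key": key, **{f: found.get(f) for f in fields}})
    return cells


def count_with(cells, field):
    """Count the cells that carry a non-null value for one field."""
    return sum(1 for cell in cells if cell[field] is not None)


# Hypothetical toy grid: 2 reservoirs x 2 years, illustrative field names.
keys = [("R01", 1987), ("R01", 1988), ("R02", 1987), ("R02", 1988)]
measured = {
    ("R01", 1988): {"cyano_max": 52000.0, "secchi_cm": 110.0},
    ("R02", 1987): {"secchi_cm": 95.0},
}
cells = carry_all_cells(keys, measured, ["cyano_max", "secchi_cm"])
```

Run against the real table, the same kind of count is what would be expected to reproduce the 424-of-640 cyanobacteria coverage figure.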
Decisions / gotchas frozen here
- Grain = reservoir-year, not reservoir. With only 20 places this could have been a Locationdex (20 slots each embedding a 32-year series). It is filed as YearLocationdex because the dataset's unit of analysis IS the reservoir-year (its 640-row master table), matching the NARS precedent (site × survey-year) and the data model's "one slot = one place's one year."
- One master table covers the cell. The merged CyanoMaxCD_environmental_vars_FINAL already carries cyano max + every environmental variable per reservoir-year, so it (plus the roster for place attributes) fully populates each cell; the granular sub-annual tables are deferred.
- 'na' = null. The master table uses the string na for missing; clean_num coerces na / blank / None to null and keeps real numbers (unit-tested).
- XLSX, not CSV (read via openpyxl); curl, not urllib/httpx — the known sandbox hang.
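The body of clean_num is not shown in this note, so the following is only a sketch of the coercion as described, not the project's implementation. Matching na case-insensitively, returning floats, and raising on unparseable text are all assumptions.

```python
def clean_num(value):
    """Coerce the 'na' marker, blanks and None to None; keep real numbers."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)  # numeric spreadsheet cells pass through unchanged in value
    text = str(value).strip()
    if text == "" or text.lower() == "na":  # case-insensitive match is an assumption
        return None
    return float(text)  # assumption: anything else unparseable raises ValueError


print([clean_num(v) for v in ("na", "", None, 12, "3.5")])
```

Raising on unexpected strings, rather than silently nulling them, keeps a typo in the source table from being mistaken for a missing measurement; the real helper may choose differently.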
Storage
YearLocationdex pattern: ONE place-sorted spine-parquet
(data/yearlocation_storehouse/reservoir_cyano_water_quality/reservoir_cyano_water_quality_spine.parquet)
sorted by reservoir, year, plus a by-year index JSON. Built by
scripts/build_reservoir_cyano_water_quality_yearlocationdex.py. 640 reservoir-year cells
(20 reservoirs × 1987-2018), 26 measured indicators per cell, 424 cells with a cyano count.
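The place-sorted spine and its by-year index can be sketched with the standard library alone. The real build writes parquet; this sketch uses plain lists and dicts so it runs without extra dependencies. The index layout (year mapped to row positions in the spine) and the names build_spine and build_year_index are assumptions, since the note does not specify them.

```python
import json
from collections import defaultdict


def build_spine(cells):
    """Sort cells place-first: by reservoir, then by year."""
    return sorted(cells, key=lambda c: (c["reservoir"], c["year"]))


def build_year_index(spine):
    """Map each year to the row positions of its cells in the sorted spine."""
    index = defaultdict(list)
    for position, cell in enumerate(spine):
        index[str(cell["year"])].append(position)  # JSON object keys are strings
    return dict(index)


# Hypothetical toy cells, deliberately out of order.
cells = [
    {"reservoir": "R02", "year": 1988},
    {"reservoir": "R01", "year": 1988},
    {"reservoir": "R02", "year": 1987},
    {"reservoir": "R01", "year": 1987},
]
spine = build_spine(cells)
year_index_json = json.dumps(build_year_index(spine), sort_keys=True)
```

Sorting by place keeps each reservoir's 32-year series contiguous, while the index recovers a single year's cross-section without scanning the whole spine.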
Deferred (not in v1)
- Sub-annual measured series: the per-sample cyanobacteria and cyanotoxin/taxa station tables (Cyanobacteria_data.xlsx, Cyanotoxin_taxa_data.xlsx) — finer than the yearly cell.
- Monthly / depth-profile measured tables: the standardized surface/deep temperature and deep-DO depth profiles and the monthly nutrient-trend tables (additional cell layers).
- NLCD watershed land-cover as a static per-reservoir covariate.
- Live edge: none — this is a closed 1987-2018 study dataset, not a live feed.