Scope freeze — reservoir_cyano_water_quality YearLocationdex (6th YearLocationdex kind)

Decided 2026-06-27 (Mike pasted the data.gov catalog URL). Source = EPA ScienceHub dataset "1987-2018 Cyanobacteria and Water Quality Data for 20 Reservoirs" (DOI 10.23719/1503175), the data behind Smucker, Beaulieu, Nietch & Young, "Increasingly severe cyanobacterial blooms and deep water hypoxia coincide with warming water temperatures in reservoirs," Global Change Biology 27(11):2507-2519 (2021), https://doi.org/10.1111/gcb.15618.

This is the 6th YearLocationdex kind (slot = place × period cell), after drought and the four NARS water-quality kinds (nla_water_quality, ncca_water_quality, nrsa_water_quality, nwca_wetland_chemistry). Same family, same shape: a fixed waterbody measured over repeated periods, one cell per (place, period). The grain follows the dataset's own analytical unit, the reservoir-year row (the merged master table is exactly 20 reservoirs × 32 years = 640 rows).

Slot

One (reservoir, year) cell holding that reservoir-year's MEASURED summer water-quality indicators: the year's maximum cyanobacteria cell density, plus chlorophyll, Secchi clarity, nutrients (surface + inflow), summer precipitation, May-Oct surface water temperatures, deep water temperatures, and deep dissolved oxygen. 20 US Army Corps of Engineers reservoirs in Kentucky / Indiana / Ohio × years 1987-2018 = 640 cells.
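
The slot grid can be sketched as a plain cross product of places and periods. This is only an illustration: the reservoir IDs below are hypothetical placeholders (the real abbreviations live in the roster), and cell_keys is not a name from the build script.

```python
from itertools import product

# Hypothetical placeholder IDs; the real abbreviations come from the roster table.
RESERVOIRS = [f"R{i:02d}" for i in range(1, 21)]
YEARS = range(1987, 2019)  # 1987-2018 inclusive, 32 years


def cell_keys(reservoirs, years):
    """Return every (reservoir, year) slot key, sorted by place, then period."""
    return sorted(product(reservoirs, years))


keys = cell_keys(RESERVOIRS, YEARS)
print(len(keys))  # 20 reservoirs x 32 years
```

Sorting by place first matches the place-sorted spine used for storage.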

Source

Two EPA ScienceHub tables (multi-source cited cell, the universal rule):

  • Roster Reservoir_information.xlsx — the place: abbreviation, name, type (Forest / Ag / Urban), stratification, latitude, longitude, year filled, watershed area, forest fraction, surface area, storage volume, max/mean depth, Zmean:Zmax.
  • Master measurements CyanoMaxCD_environmental_vars_FINAL.xlsx (sheet Data) — the merged per-reservoir-year measured indicators (cyanobacteria max cell density + environmental variables) the study used for its correlations and multivariate analysis.

The cyanobacteria cell densities derive from samples the US Army Corps of Engineers collected at each reservoir's deepest station (station 20001), counted and provided to EPA in October 2019.

Measured reality — IN / OUT (bright line feedback_measured_reality_only)

  • IN — the raw per-reservoir-per-year measurements: cyanobacteria maximum cell density (cells/mL), chlorophyll-a (µg/L), Secchi clarity (cm), total phosphorus and dissolved phosphorus (ppb), TKN / NH3 / NOx nitrogen forms (ppm), TOC (ppm), alkalinity (ppm), the inflow nutrient set, summer (Jun-Aug) precipitation (inches), May/Jun/Jul/Aug surface water temperatures (°C), May/Jun deep dissolved oxygen (mg/L), and May/Jun/Jul/Aug deep water temperatures (°C). Carried per cell, null where a reservoir-year did not measure it.
  • OUT — the study's GAM fits (GAM_for_stratifying_reservoirs.xlsx, GAM_for_nonstratifying_reservoirs.xlsx) are generalized additive model (GAM) output, i.e. smoothed fitted trends, not measurements → not ingested. The merged table's logCyanoMax (a log transform) and Summer_precip_Z-score (a standardization) are skipped as redundant derivatives of measurements we already carry, not as a bright-line exclusion.
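
A minimal sketch of the derived-column skip, assuming each master-table row is held as a dict keyed by column name. Only logCyanoMax and Summer_precip_Z-score are names from the merged table; the other keys, the sample values, and measured_only are hypothetical.

```python
# The two derived columns named above; skipped because the underlying
# measurements are already carried.
DERIVED_COLUMNS = {"logCyanoMax", "Summer_precip_Z-score"}


def measured_only(row):
    """Return a copy of the row without the derived columns."""
    return {name: value for name, value in row.items() if name not in DERIVED_COLUMNS}


# Hypothetical row: column names other than the two derived ones are made up.
row = {
    "CyanoMax": 12000.0,
    "logCyanoMax": 4.079,
    "Summer_precip": 11.2,
    "Summer_precip_Z-score": -0.4,
}
kept = measured_only(row)
print(sorted(kept))
```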

"data is data" (feedback_data_is_data_partial_coverage)

216 of the 640 cells have no cyanobacteria count (424 do), and many cells lack individual analytes. Every cell is carried with its missing fields set to null, rather than dropping sparse reservoir-years. A reservoir's thinner early record is not a rejection reason.
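
The carry-everything rule amounts to filling the full grid and leaving gaps as null. A rough sketch under assumed names (build_cells, the field names, and the sample values are all hypothetical):

```python
def build_cells(keys, measured, fields):
    """Emit one cell per key; any field with no measurement stays None."""
    cells = []
    for reservoir, year in keys:
        values = measured.get((reservoir, year), {})
        cell = {"reservoir": reservoir, "year": year}
        for field in fields:
            cell[field] = values.get(field)  # None when not measured
        cells.append(cell)
    return cells


# Hypothetical two-cell example: the 1987 reservoir-year has no cyano count.
demo_keys = [("R01", 1987), ("R01", 1988)]
demo_measured = {
    ("R01", 1987): {"chlorophyll_a": 14.0},
    ("R01", 1988): {"cyano_max": 52000.0, "chlorophyll_a": 9.5},
}
demo_cells = build_cells(demo_keys, demo_measured, ["cyano_max", "chlorophyll_a"])
with_count = sum(1 for cell in demo_cells if cell["cyano_max"] is not None)
```

The sparse 1987 cell is still emitted; only its cyano_max field is null.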

Decisions / gotchas frozen here

  • Grain = reservoir-year, not reservoir. With only 20 places this could have been a Locationdex (20 slots each embedding a 32-year series). It is filed as YearLocationdex because the dataset's unit of analysis IS the reservoir-year (its 640-row master table), matching the NARS precedent (site × survey-year) and the data model's "one slot = one place's one year."
  • One master table covers the cell. The merged CyanoMaxCD_environmental_vars_FINAL already carries cyano max + every environmental variable per reservoir-year, so it (plus the roster for place attributes) fully populates each cell; the granular sub-annual tables are deferred.
  • 'na' = null. The master table uses the string na for missing; clean_num coerces na / blank / None to null and keeps real numbers (unit-tested).
  • XLSX, not CSV: the source files are read via openpyxl. Downloads go through curl, not urllib/httpx, because those hit the known sandbox hang.
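
For reference, one way clean_num could behave. This is a sketch of the stated contract (na / blank / None become null, real numbers are kept), not the unit-tested implementation itself; the handling of booleans, NaN, and unparseable text is an assumption.

```python
import math


def clean_num(value):
    """Coerce 'na', blank, and None to None; return real numbers as float."""
    if value is None or isinstance(value, bool):
        return None  # assumption: a boolean cell is not a measurement
    if isinstance(value, (int, float)):
        number = float(value)
    else:
        text = str(value).strip()
        if text == "" or text.lower() == "na":
            return None
        try:
            number = float(text)
        except ValueError:
            return None  # assumption: unparseable text counts as missing
    # assumption: NaN is treated as missing too
    return None if math.isnan(number) else number


samples = ["na", "", None, 3.5, " 12 ", 0]
cleaned = [clean_num(sample) for sample in samples]
```

Note that a measured zero survives as 0.0 rather than being mistaken for missing.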

Storage

YearLocationdex pattern: ONE place-sorted spine-parquet (data/yearlocation_storehouse/reservoir_cyano_water_quality/reservoir_cyano_water_quality_spine.parquet) sorted by reservoir, year, plus a by-year index JSON. Built by scripts/build_reservoir_cyano_water_quality_yearlocationdex.py. 640 reservoir-year cells (20 reservoirs × 1987-2018), 26 measured indicators per cell, 424 cells with a cyano count.
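
The by-year index layout is not spelled out here. One plausible shape, shown purely as an assumption, maps each year to the row offsets of its cells in the place-sorted spine; by_year_index and the two-reservoir sample are hypothetical, and the parquet write itself is omitted.

```python
import json


def by_year_index(sorted_cells):
    """Map each year (as a string key) to the spine row offsets of its cells."""
    index = {}
    for offset, cell in enumerate(sorted_cells):
        index.setdefault(str(cell["year"]), []).append(offset)
    return index


# Hypothetical mini-spine: two reservoirs, two years, sorted by reservoir, year.
spine = sorted(
    [{"reservoir": r, "year": y} for r in ("A", "B") for y in (1988, 1987)],
    key=lambda cell: (cell["reservoir"], cell["year"]),
)
index = by_year_index(spine)
index_json = json.dumps(index, sort_keys=True)
```

With a place-sorted spine, one reservoir's full series is a contiguous run of rows, while a single year's cells are scattered, which is what the by-year index is for.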

Deferred (not in v1)

  • Sub-annual measured series: the per-sample cyanobacteria and cyanotoxin/taxa station tables (Cyanobacteria_data.xlsx, Cyanotoxin_taxa_data.xlsx) — finer than the yearly cell.
  • Monthly / depth-profile measured tables: the standardized surface/deep temperature and deep-DO depth profiles and the monthly nutrient-trend tables (additional cell layers).
  • NLCD watershed land-cover as a static per-reservoir covariate.
  • Live edge: none — this is a closed 1987-2018 study dataset, not a live feed.