Scope freeze — nla_water_quality YearLocationdex (2nd YearLocationdex kind)

Decided 2026-06-27 (Mike). Source = EPA National Aquatic Resource Surveys (NARS). Mike's shape call when the URL came in: organize each sampled site visit into a site × survey-year grid (YearLocationdex) and pilot on the National Lakes Assessment (NLA). This is the 2nd YearLocationdex kind, after drought (docs/yearlocationdex-framework.md).

Slot

One (lake site, survey year) cell holding that visit's MEASURED water-quality indicators.

Source

https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys. NLA field-samples on the order of a thousand US lakes each cycle. Pilot scope = cycles 2007 + 2012, the two with clean raw measured tables on that page. (NLA 2017/2022 raw chemistry lives on EPA's newer data portal, not this page → deferred backfill per "data is data".)

Measured reality — IN / OUT (bright line feedback_measured_reality_only)

  • IN — raw per-site lab/field measurements: total nitrogen, total phosphorus, turbidity, acid-neutralizing capacity, dissolved organic carbon, conductivity, chlorophyll-a, Secchi clarity, pH. Carried per cell, null where a cycle did not measure it ("data is data").
  • OUT — the survey's design-based condition estimates. The *_conditionestimates tables carry WGT/STRATUM/PANEL/MDCATY survey weights that extrapolate the sampled lakes to a national population ("% of US lakes in good condition"). That is a computed population estimate, not a measurement of a physical thing. Also OUT: derived index scores (multimetric indices / condition classes).
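The IN / OUT line above can be enforced mechanically at load time. A minimal sketch, assuming tables are identified by filename and that design columns can be recognized by prefix; both helpers are hypothetical and are not taken from the build script, and real column names may differ from the bare WGT/STRATUM/PANEL/MDCATY labels used in this note.

```python
# Prefixes of the survey-design columns named in this note. Matching by
# prefix is an assumption; real column names may carry suffixes.
DESIGN_PREFIXES = ("WGT", "STRATUM", "PANEL", "MDCATY")


def is_measured_table(filename: str) -> bool:
    """Reject the design-based *_conditionestimates tables by name."""
    return "conditionestimates" not in filename.lower()


def measured_columns(columns: list[str]) -> list[str]:
    """Keep only columns that do not look like survey weights or strata."""
    return [c for c in columns if not c.upper().startswith(DESIGN_PREFIXES)]
```

Stripping design columns instead of raising keeps a raw table usable even when it happens to carry a weight column alongside measurements.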

Cell indicators (canonical units)

ptl_ugl (µg/L) · ntl_ugl (µg/L) · turb_ntu (NTU) · anc_ueql (µeq/L) · doc_mgl (mg/L) · cond_uscm (µS/cm) · chla_ugl (µg/L) · secchi_m (m) · ph (unitless). Plus lat, lon, state. pH is 2012-only (the 2007 chem table omits it) → null for 2007.
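For reference, the cell can be pictured as a flat record with nullable indicators. A sketch only: the field names and units come from the list above, site_id and year from the Storage section, and the NlaCell class itself is a hypothetical illustration, not the build's actual type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NlaCell:
    """One (lake site, survey year) cell; None = not measured that cycle."""
    site_id: str
    year: int
    lat: Optional[float] = None
    lon: Optional[float] = None
    state: Optional[str] = None
    ptl_ugl: Optional[float] = None    # total phosphorus, µg/L
    ntl_ugl: Optional[float] = None    # total nitrogen, µg/L
    turb_ntu: Optional[float] = None   # turbidity, NTU
    anc_ueql: Optional[float] = None   # acid-neutralizing capacity, µeq/L
    doc_mgl: Optional[float] = None    # dissolved organic carbon, mg/L
    cond_uscm: Optional[float] = None  # conductivity, µS/cm
    chla_ugl: Optional[float] = None   # chlorophyll-a, µg/L
    secchi_m: Optional[float] = None   # Secchi clarity, m
    ph: Optional[float] = None         # unitless; stays None for 2007


# The measurement value here is made up for illustration, not real data.
example = NlaCell(site_id="NLA06608-0001", year=2007, ptl_ugl=12.0)
```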

Decisions / gotchas frozen here

  • Unit harmonization (load-bearing). 2012 reports total nitrogen in mg/L, 2007 in µg/L. The build converts 2012 NTL ×1000 → µg/L so the two cycles share a unit; otherwise the cross-cycle series would show a fake 1000× shift. Verified against the 2012 *_UNITS columns; every other carried analyte already shares units across cycles. Sanity check after build: 2007 TN median 568 µg/L vs 2012 615.5 µg/L (aligned).
  • CR-only line endings. The 2012 secchi CSV uses lone-CR (\r) row terminators; polars reads it as one 17,109-column row unless normalized. The build normalizes \r\n and \r to \n before parsing (parse_csv_bytes).
  • One cell per site-year = the index visit. NLA revisits a subset of sites (VISIT_NO==2) for QA; we keep VISIT_NO==1 as the cell. Revisits deferred.
  • Site IDs differ across cycles. NLA draws a (mostly) fresh probability sample each cycle, so 2007 IDs (NLA06608-0001) and 2012 IDs (NLA12_CA-143) do not align; the place axis is sparse across cycles. Cross-cycle same-lake linkage needs NLA's resample crosswalk (deferred). Coordinates are carried so a lake can be re-linked spatially later.
  • Downloads go through curl, not urllib/httpx, which hang in the sandbox (known issue).
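The gotchas above chain into a small load path: fetch with curl, normalize line endings, parse, keep the index visit, harmonize units. A stdlib-only sketch under stated assumptions: the build itself uses polars and its own parse_csv_bytes, so these functions are hypothetical stand-ins; the NTL_UNITS column name is inferred from the "*_UNITS" pattern, and UTF-8 encoding is assumed.

```python
import csv
import io
import subprocess


def fetch_bytes(url: str, timeout_s: int = 120) -> bytes:
    """Download via the curl binary rather than urllib/httpx."""
    result = subprocess.run(
        ["curl", "-fsSL", "--max-time", str(timeout_s), url],
        check=True,
        capture_output=True,
    )
    return result.stdout


def normalize_newlines(raw: bytes) -> bytes:
    """Map CRLF and lone CR to LF so every row terminator is the same."""
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def parse_rows(raw: bytes) -> list[dict]:
    """Parse CSV bytes into dict rows after newline normalization."""
    text = normalize_newlines(raw).decode("utf-8")  # encoding is an assumption
    return list(csv.DictReader(io.StringIO(text)))


def index_visits(rows: list[dict]) -> list[dict]:
    """Keep VISIT_NO == 1 (the index visit); drop QA revisits."""
    return [r for r in rows if r.get("VISIT_NO", "").strip() == "1"]


def to_ugl(value: float, units: str) -> float:
    """Convert a concentration to µg/L based on its reported units string."""
    key = units.strip().lower().replace("µ", "u").replace("μ", "u")
    if key == "mg/l":
        return value * 1000.0
    if key == "ug/l":
        return value
    raise ValueError(f"unexpected units: {units!r}")
```

Driving the conversion from the units string rather than from the survey year means a mislabeled or newly added cycle fails loudly instead of silently shifting the series by 1000×.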

Storage

YearLocationdex pattern (drought): ONE place-sorted spine-parquet (data/yearlocation_storehouse/nla_water_quality/nla_water_quality_spine.parquet) sorted by site_id, year, plus a by-year index JSON. Built by scripts/build_nla_water_quality_yearlocationdex.py. 2,443 lake-year cells (2007: 1,157; 2012: 1,286).
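A rough picture of the spine ordering and the by-year index, in plain Python. The index layout shown here (year → row offsets into the sorted spine) is an assumption made for illustration; the actual JSON shape is whatever the build script and the framework doc define, and the parquet write itself is omitted.

```python
import json


def sort_spine(cells: list[dict]) -> list[dict]:
    """Place-sorted spine: order by site_id, then year."""
    return sorted(cells, key=lambda c: (c["site_id"], c["year"]))


def by_year_index(spine: list[dict]) -> dict[str, list[int]]:
    """Map each survey year to the row offsets holding that year's cells."""
    index: dict[str, list[int]] = {}
    for row_no, cell in enumerate(spine):
        index.setdefault(str(cell["year"]), []).append(row_no)
    return index


# Two site IDs quoted in this note; no measurements attached.
spine = sort_spine([
    {"site_id": "NLA12_CA-143", "year": 2012},
    {"site_id": "NLA06608-0001", "year": 2007},
])
index_json = json.dumps(by_year_index(spine), sort_keys=True)
```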

Deferred (not in v1)

  • NLA 2017 + 2022 raw chemistry from the newer EPA NARS portal (extends the time axis).
  • The other three NARS surveys as sibling kinds: Rivers & Streams (NRSA 2013-14/2018-19/2023-24), Coastal (NCCA 2010), Wetlands (NWCA 2011). Each its own phenomenon/kind.
  • VISIT_NO==2 revisits; biology (benthic / phytoplankton / zooplankton counts) and full physical-habitat measures as additional cell layers.
  • Cross-cycle resample crosswalk (same-lake linkage across cycles).
  • Live edge: none — NARS is a periodic survey posted in cycle batches.