Scope freeze — nla_water_quality YearLocationdex (2nd YearLocationdex kind)
Decided 2026-06-27 (Mike). Source = EPA National Aquatic Resource Surveys (NARS).
Mike's shape call when the URL came in: organize a sampled site-visit as a site × survey-year
grid (YearLocationdex), pilot on the National Lakes Assessment (NLA). The 2nd YearLocationdex
kind after drought (docs/yearlocationdex-framework.md).
Slot
One (lake site, survey year) cell holding that visit's MEASURED water-quality indicators.
Source
https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys. NLA field-samples thousands of US lakes each cycle. Pilot scope = cycles 2007 + 2012, the two with clean raw measured tables on that page. (NLA 2017/2022 raw chemistry lives on EPA's newer data portal, not this page → deferred backfill per "data is data".)
Measured reality — IN / OUT (bright line feedback_measured_reality_only)
- IN: raw per-site lab/field measurements (total nitrogen, total phosphorus, turbidity, acid-neutralizing capacity, dissolved organic carbon, conductivity, chlorophyll-a, Secchi clarity, pH). Carried per cell, null where a cycle did not measure it ("data is data").
- OUT: the survey's design-based condition estimates. The *_conditionestimates tables carry WGT/STRATUM/PANEL/MDCATY survey weights that extrapolate the sampled lakes to a national population ("% of US lakes in good condition"). That is a computed population estimate, not a measurement of a physical thing. Also OUT: derived index scores (multimetric indices / condition classes).
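A minimal sketch of how the IN / OUT line could be enforced at build time. It assumes the design columns appear in a CSV header under exactly the names listed above (case-insensitive); the real tables may use variants. The helper names design_columns_in and assert_measured_only are hypothetical, not taken from the build script.

```python
# Column names come from the note above; exact (case-insensitive) matching is an assumption.
DESIGN_COLUMNS = frozenset({"WGT", "STRATUM", "PANEL", "MDCATY"})


def design_columns_in(header):
    """Return the survey-design columns found in a CSV header, in header order."""
    return [name for name in header if name.strip().upper() in DESIGN_COLUMNS]


def assert_measured_only(header):
    """Raise if a table about to feed the cell carries design-based columns."""
    leaked = design_columns_in(header)
    if leaked:
        raise ValueError(f"design-based columns must stay OUT of the cell: {leaked}")
```

Calling assert_measured_only on each source table's header before it is merged would make a leaked weight column fail loudly instead of landing in the spine.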
Cell indicators (canonical units)
ptl_ugl (µg/L) · ntl_ugl (µg/L) · turb_ntu (NTU) · anc_ueql (µeq/L) · doc_mgl (mg/L)
· cond_uscm (µS/cm) · chla_ugl (µg/L) · secchi_m (m) · ph (unitless). Plus lat, lon,
state. pH is 2012-only (the 2007 chem table omits it) → null for 2007.
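A sketch of one cell as a typed record, to make the null-where-unmeasured rule concrete. The field names are the canonical indicator names above plus site_id and year; the class name LakeYearCell and the helper null_indicators are hypothetical, and the actual spine is a parquet table rather than Python objects.

```python
from dataclasses import dataclass
from typing import List, Optional

# Canonical indicator names, in the order listed in this note.
INDICATORS = (
    "ptl_ugl", "ntl_ugl", "turb_ntu", "anc_ueql", "doc_mgl",
    "cond_uscm", "chla_ugl", "secchi_m", "ph",
)


@dataclass(frozen=True)
class LakeYearCell:
    # Cell key and location.
    site_id: str
    year: int
    lat: float
    lon: float
    state: str
    # Measured indicators; None means the cycle did not measure it.
    ptl_ugl: Optional[float] = None
    ntl_ugl: Optional[float] = None
    turb_ntu: Optional[float] = None
    anc_ueql: Optional[float] = None
    doc_mgl: Optional[float] = None
    cond_uscm: Optional[float] = None
    chla_ugl: Optional[float] = None
    secchi_m: Optional[float] = None
    ph: Optional[float] = None  # expected to be None for every 2007 cell


def null_indicators(cell: LakeYearCell) -> List[str]:
    """List the indicators this cell does not carry."""
    return [name for name in INDICATORS if getattr(cell, name) is None]
```

A 2007 cell built without ph would report "ph" in null_indicators, which is the expected state rather than an error.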
Decisions / gotchas frozen here
- Unit harmonization (load-bearing). 2012 reports total nitrogen in mg/L, 2007 in µg/L. The build converts 2012 NTL ×1000 → µg/L so the two cycles share a unit; otherwise the cross-cycle series would show a fake 1000× shift. Verified against the 2012 *_UNITS columns; every other carried analyte already shares units across cycles. Sanity check after build: 2007 TN median 568 µg/L vs 2012 615.5 µg/L (aligned).
- CR-only line endings. The 2012 secchi CSV uses lone-CR (\r) row terminators; polars reads it as one 17,109-column row unless normalized. The build normalizes \r\n and \r to \n before parsing (parse_csv_bytes).
- One cell per site-year = the index visit. NLA revisits a subset of sites (VISIT_NO==2) for QA; we keep VISIT_NO==1 as the cell. Revisits deferred.
- Site IDs differ across cycles. NLA draws a (mostly) fresh probability sample each cycle, so 2007 IDs (NLA06608-0001) and 2012 IDs (NLA12_CA-143) do not align; the place axis is sparse across cycles. Cross-cycle same-lake linkage needs NLA's resample crosswalk (deferred). Coordinates are carried so a lake can be re-linked spatially later.
- curl, not urllib/httpx (the known sandbox hang).
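The first three gotchas above can be sketched together. This is a standard-library illustration, not the build's code: it uses csv instead of polars so it runs on its own, the function names (normalize_newlines, parse_rows, ntl_to_ugl, index_visits) are hypothetical, and the column names SITE_ID and NTL_UNITS are assumptions (only NTL, VISIT_NO and the *_UNITS pattern appear in this note).

```python
import csv
import io
from typing import Dict, List, Optional


def normalize_newlines(raw: bytes) -> bytes:
    """Map CRLF and lone CR to LF.

    CRLF is replaced first so that a single CRLF never becomes two newlines.
    """
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def parse_rows(raw: bytes) -> List[Dict[str, str]]:
    """Parse CSV bytes into dict rows after normalizing row terminators."""
    text = normalize_newlines(raw).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text, newline="")))


def ntl_to_ugl(value: str, units: str) -> Optional[float]:
    """Return total nitrogen in µg/L; an empty value stays null."""
    if value.strip() == "":
        return None
    number = float(value)
    # Only mg/L needs scaling; anything else is assumed to already be µg/L.
    if units.strip().lower() == "mg/l":
        return number * 1000.0
    return number


def index_visits(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the index visit (VISIT_NO == 1) for each site-year."""
    return [row for row in rows if row["VISIT_NO"].strip() == "1"]
```

Reading the unit from the row rather than hard-coding the cycle year is a design choice of this sketch: it keeps the conversion correct if a later cycle changes units again.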
Storage
YearLocationdex pattern (drought): ONE place-sorted spine-parquet
(data/yearlocation_storehouse/nla_water_quality/nla_water_quality_spine.parquet) sorted by
site_id, year, plus a by-year index JSON. Built by
scripts/build_nla_water_quality_yearlocationdex.py. 2,443 lake-year cells (2007: 1,157;
2012: 1,286).
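A rough sketch of the storage pattern: sort cells by (site_id, year), then derive a by-year index over the sorted rows. The index layout shown here (year string mapped to row positions in the spine) is an assumption; this note does not specify the actual JSON schema, and the real build writes parquet rather than Python lists. The names build_spine and build_year_index are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List


def build_spine(cells: List[dict]) -> List[dict]:
    """Return cells place-sorted: by site_id, then by year."""
    return sorted(cells, key=lambda cell: (cell["site_id"], cell["year"]))


def build_year_index(spine: List[dict]) -> Dict[str, List[int]]:
    """Map each survey year to the row positions of its cells in the spine."""
    index = defaultdict(list)
    for row_number, cell in enumerate(spine):
        index[str(cell["year"])].append(row_number)
    return dict(index)


def year_index_json(spine: List[dict]) -> str:
    """Serialize the by-year index with stable key order."""
    return json.dumps(build_year_index(spine), sort_keys=True)
```

With the place-sorted spine, one lake's history is contiguous on disk, and the by-year index recovers a single survey cycle without scanning every row.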
Deferred (not in v1)
- NLA 2017 + 2022 raw chemistry from the newer EPA NARS portal (extends the time axis).
- The other three NARS surveys as sibling kinds: Rivers & Streams (NRSA 2013-14/2018-19/2023-24), Coastal (NCCA 2010), Wetlands (NWCA 2011). Each its own phenomenon/kind.
- VISIT_NO==2 revisits; biology (benthic / phytoplankton / zooplankton counts) and full physical-habitat measures as additional cell layers.
- Cross-cycle resample crosswalk (same-lake linkage across cycles).
- Live edge: none — NARS is a periodic survey posted in cycle batches.