Scope freeze — nrsa_water_quality YearLocationdex (4th YearLocationdex kind)
Decided 2026-06-27 (Mike re-pasted the NARS data page as the go-ahead for the next sibling).
Source = EPA National Aquatic Resource Surveys (NARS), the National Rivers and Streams
Assessment (NRSA). The rivers-and-streams sibling of nla_water_quality (lakes) and
ncca_water_quality (coastal). Same shape Mike approved for NARS: a site × survey-cycle grid
(YearLocationdex). The 4th YearLocationdex kind (docs/yearlocationdex-framework.md).
Categorization settled by the NLA/NCCA precedent, so this was built directly.
Slot
One (river/stream site, survey cycle) cell holding that visit's MEASURED water-quality
indicators. Unlike NCCA (one cycle), NRSA gives a real multi-cycle year axis.
Source / scope
https://www.epa.gov/national-aquatic-resource-surveys/data-national-aquatic-resource-surveys.
Pilot scope = the three modern wide-format cycles 2013-14, 2018-19, 2023-24 (the 2023-24
data was posted 2026-06). The older 2008-09 cycle uses a different schema and is a deferred
backfill that would extend the axis to 2008 ("data is data"). year = the cycle's nominal start
year (2013/2018/2023); a cycle label ("2013-14") is carried alongside.
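The year rule above can be sketched in a few lines. This is a minimal illustration, not the build script's actual code: the helper name cycle_start_year and the dict layout are assumptions made here for clarity.

```python
import re

# Cycles in pilot scope, as listed in this note.
PILOT_CYCLES = ("2013-14", "2018-19", "2023-24")


def cycle_start_year(cycle_label: str) -> int:
    """Return the nominal start year of a 'YYYY-YY' cycle label."""
    match = re.fullmatch(r"(\d{4})-(\d{2})", cycle_label)
    if match is None:
        raise ValueError(f"unrecognized cycle label: {cycle_label!r}")
    return int(match.group(1))


# Each cell carries both the integer year and the original cycle label.
cells = [{"cycle": c, "year": cycle_start_year(c)} for c in PILOT_CYCLES]
```

The same helper would place the deferred 2008-09 backfill at year 2008 without any change.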
Measured reality — IN / OUT (bright line feedback_measured_reality_only)
- IN — raw per-site lab measurements: total nitrogen, total phosphorus, chlorophyll-a, conductivity, pH, turbidity, dissolved organic carbon, acid-neutralizing capacity.
- OUT — the survey's design-based condition estimates (the *_allcondrollups carry population weights that extrapolate the sampled sites to "% of US river miles in good condition" = a computed population estimate) and the MMI / index scores.
Cell indicators (canonical units)
ptl_ugl (µg/L) · ntl_ugl (µg/L) · chla_ugl (µg/L) · cond_uscm (µS/cm) · ph ·
turb_ntu (NTU) · doc_mgl (mg/L) · anc_ueql (µeq/L). This is the full NLA vocabulary
minus Secchi (rivers are not Secchi-sampled); the strongest cross-kind overlap of the three
water kinds.
Decisions / gotchas frozen here
- Unit harmonization (load-bearing, and asymmetric vs NLA/NCCA). NRSA total N is mg/L → ×1000 to µg/L (shares ntl_ugl). NRSA total P is ALREADY µg/L (no scaling), unlike NLA and NCCA where PTL was mg/L. Getting this backwards would inflate river TP 1000×. CHLA (µg/L), COND (µS/cm), TURB (NTU), DOC (mg/L), ANC (µeq/L), pH already share units across all three cycles (verified from each cycle's *_UNITS columns). Sanity after build: TP median 59 µg/L, TN median 620 µg/L (aligned with NLA freshwater ≈600).
- Wide chem, one cell column per analyte. All three cycles store <ANALYTE>_RESULT columns (1314 uses _RESULT_UNITS, 1819/2324 use _UNITS). The mapper (map_wide_chem) is schema-uniform: an analyte absent in a cycle yields its null cell column.
- Chlorophyll split. 1819/2324 carry CHLA_RESULT in the chem file; 1314 keeps it in a separate widewchl file, joined on UID.
- latin-1 bytes. Some NRSA site files carry non-UTF-8 bytes (accented place names); the parser uses encoding="utf8-lossy" so a stray byte doesn't abort the whole file. CR-only / CRLF endings normalized as in the other NARS kinds. curl, not urllib/httpx (sandbox hang).
- One cell per (site, cycle). A handful of NRSA reference (RF) sites carry two index records within a cycle; the build dedups on (site_id, year) keeping the first (40 of 5,944 rows).
- Site IDs differ across cycles. Fresh probability sample each cycle → the place axis is sparse across cycles; coords carried for later spatial re-linking.
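Three of the rules above (lossy decoding, unit harmonization, first-wins dedup) can be sketched with the standard library alone. This is an illustration under assumptions, not the real map_wide_chem. The helper names are hypothetical, and the source column names are assumed from the <ANALYTE>_RESULT pattern (in particular NTL_RESULT and PH_RESULT are guesses). The errors="replace" decode is a stdlib stand-in for the parser's encoding="utf8-lossy" setting.

```python
from typing import Optional

# canonical cell column -> (assumed source column, scale factor).
# Only total N is scaled (mg/L -> ug/L); total P is already ug/L in NRSA.
NRSA_SCALE = {
    "ptl_ugl": ("PTL_RESULT", 1.0),
    "ntl_ugl": ("NTL_RESULT", 1000.0),
    "chla_ugl": ("CHLA_RESULT", 1.0),
    "cond_uscm": ("COND_RESULT", 1.0),
    "ph": ("PH_RESULT", 1.0),
    "turb_ntu": ("TURB_RESULT", 1.0),
    "doc_mgl": ("DOC_RESULT", 1.0),
    "anc_ueql": ("ANC_RESULT", 1.0),
}


def decode_lossy(raw: bytes) -> str:
    """Decode as UTF-8, replacing bad bytes, and normalize line endings."""
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_float(raw: Optional[str]) -> Optional[float]:
    """Parse a raw CSV field; blank or missing becomes None."""
    if raw is None or raw.strip() == "":
        return None
    return float(raw)


def harmonize_row(raw_row: dict) -> dict:
    """Map one wide chem row to canonical cell columns.

    An analyte column absent from the row yields a None cell column,
    so every cycle produces the same set of output columns.
    """
    cell = {}
    for cell_col, (src_col, factor) in NRSA_SCALE.items():
        value = to_float(raw_row.get(src_col))
        cell[cell_col] = None if value is None else value * factor
    return cell


def dedup_first(rows: list) -> list:
    """Keep the first row seen for each (site_id, year) key."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["site_id"], row["year"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept
```

Keeping the scale factors in one table makes the TN/TP asymmetry visible at a glance, which is the point where a copy-paste from the NLA or NCCA mapper would go wrong.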
Storage
YearLocationdex pattern: ONE place-sorted spine-parquet
(data/yearlocation_storehouse/nrsa_water_quality/nrsa_water_quality_spine.parquet) sorted by
site_id, year, plus a by-year index JSON. Built by
scripts/build_nrsa_water_quality_yearlocationdex.py. 5,904 river-site-cycle cells
(2013-14: 2,069; 2018-19: 1,919; 2023-24: 1,916).
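The storage pattern can be illustrated with a small sketch. The by-year index format is not specified in this note, so the layout below (year mapped to row offsets in the place-sorted spine) is an assumption, as are the function name build_by_year_index and the placeholder site IDs. The real build writes parquet; plain lists stand in for it here.

```python
import json
from collections import defaultdict


def build_by_year_index(rows):
    """Sort rows by (site_id, year) and index row offsets by year."""
    spine = sorted(rows, key=lambda r: (r["site_id"], r["year"]))
    index = defaultdict(list)
    for offset, row in enumerate(spine):
        # JSON object keys must be strings, so the year is stringified.
        index[str(row["year"])].append(offset)
    return spine, dict(index)


# Placeholder rows; real site IDs differ.
rows = [
    {"site_id": "SITE-B", "year": 2018},
    {"site_id": "SITE-A", "year": 2023},
    {"site_id": "SITE-A", "year": 2013},
]
spine, by_year = build_by_year_index(rows)
index_json = json.dumps(by_year, sort_keys=True)

# Cell counts reported in this note: the per-cycle totals sum to the spine
# size, which equals the 5,944 input rows minus the 40 dropped duplicates.
CYCLE_COUNTS = {"2013-14": 2069, "2018-19": 1919, "2023-24": 1916}
assert sum(CYCLE_COUNTS.values()) == 5904 == 5944 - 40
```

Because the spine is sorted by place first, one year's rows are not contiguous, so this sketch stores offset lists rather than a single start/end range per year.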
Deferred (not in v1)
- NRSA 2008-09 (older schema; extends the year axis back to 2008).
- NWCA (Wetlands) as the last NARS sibling kind. (NLA / NCCA / NRSA done.)
- Revisits (VISIT_NO==2); benthic / fish biology, physical-habitat, enterococci, fish-tissue mercury as additional cell layers.
- Cross-cycle resample crosswalk (same-site linkage across cycles).
- Live edge: none — NARS is a periodic survey posted in cycle batches.