Scope freeze — ncca_water_quality YearLocationdex (3rd YearLocationdex kind)
Decided 2026-06-27 (Mike pasted the source URL). Source = EPA National Aquatic Resource
Surveys (NARS), the National Coastal Condition Assessment (NCCA). The coastal sibling of
nla_water_quality that the NLA scope freeze pre-flagged as a deferred kind. Same shape Mike
already approved for NARS: a site × survey-year grid (YearLocationdex). The 3rd
YearLocationdex kind after drought and nla_water_quality
(docs/yearlocationdex-framework.md). Categorization was settled by the NLA precedent, so this
was built directly rather than re-asked.
Slot
One (coastal site, survey year) cell holding that visit's MEASURED water-quality indicators.
Source
Data.gov record:
https://catalog.data.gov/dataset/national-coastal-condition-assessment-2015-datafiles-for-report-national-coastal-condition.
Two "alldatafiles" bundles on EPA's pasteur host (estuarine + Great Lakes), folded into one
kind and distinguished by the region column (NCCA_REG: Northeast / Southeast / Gulf / West /
Great Lakes). Pilot scope = the 2015 cycle (the cycle behind that record). NCCA 2010 is a
deferred backfill that extends the year axis ("data is data"); a single-year grid now is fine.
Measured reality — IN / OUT (bright line feedback_measured_reality_only)
- IN: raw per-site lab/field measurements: total nitrogen, total phosphorus, chlorophyll-a, conductivity, pH, dissolved inorganic nitrogen, ammonia-nitrogen, soluble reactive phosphorus, Secchi clarity. Carried per cell, null where not measured.
- OUT: the *_DATA_FOR_POPESTIMATES_*.xlsx files. Like NLA's condition estimates, they carry WGT / STRATUM / PANEL survey weights that extrapolate the sampled sites to a national population ("% of US coastal waters in good condition"). That is a computed population estimate, not a measurement.
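The IN / OUT line can be sketched as a filename filter. This is a minimal illustration, not the build script's actual logic: the helper name is_measured_file and the example filenames are hypothetical, and only the *_DATA_FOR_POPESTIMATES_*.xlsx pattern comes from this freeze.

```python
from fnmatch import fnmatchcase

# Pattern from the scope freeze, upper-cased so the match ignores case.
POPESTIMATE_PATTERN = "*_DATA_FOR_POPESTIMATES_*.XLSX"


def is_measured_file(filename: str) -> bool:
    """Return True unless the file is a population-estimate workbook."""
    return not fnmatchcase(filename.upper(), POPESTIMATE_PATTERN)


# Hypothetical bundle listing; real filenames may differ.
candidates = [
    "ncca_2015_water_chemistry.csv",
    "ncca_2015_DATA_FOR_POPESTIMATES_estuarine.xlsx",
    "ncca_2015_site_information.csv",
]
kept = [name for name in candidates if is_measured_file(name)]
```

fnmatchcase is used instead of fnmatch so the result does not depend on the platform's case rules.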
Cell indicators (canonical units)
Shared with nla_water_quality (same names + units, for cross-kind reads):
ptl_ugl (µg/L) · ntl_ugl (µg/L) · chla_ugl (µg/L) · cond_uscm (µS/cm) · ph ·
secchi_m (m). Coastal extras (native mg/L): din_mgl · ammonia_n_mgl · srp_mgl.
Decisions / gotchas frozen here
- Unit harmonization (load-bearing). NCCA reports total N and total P in mg/L; the NLA kind's canonical unit is µg/L. The build scales NTL/PTL ×1000 so the two NARS kinds share ntl_ugl/ptl_ugl. Sanity after build: TN median 446 µg/L, TP median 38 µg/L (same order as NLA freshwater ≈580 µg/L TN, no fake 1000× offset).
- Conductivity is correctly high. Median cond_uscm ≈ 21,500 µS/cm, which is right for the estuarine/saline mix (seawater ≈ 50,000 µS/cm); Great Lakes freshwater sites pull the low end. Not an error.
- Long-format water chemistry. NCCA water chem is one row per analyte (ANALYTE / RESULT / RESULT_UNITS); the build pivots UID × ANALYTE → RESULT (pivot_water_chem) before mapping.
- One cell per site-year = the index visit. Keyed on UID (unique per site-visit), filtered to VISIT_NO==1 (equivalently INDEX_NCCA15=='Y'). VISIT_NO==2 revisits are deferred.
- Site IDs differ across cycles. NCCA draws a fresh probability sample each cycle, so 2015 site IDs will not align with 2010's; the place axis is sparse across cycles. Coords are carried for later spatial re-linking.
- CR-only / CRLF line endings are normalized before parse (parse_csv_bytes), the shared NARS quirk.
- Downloads use curl, not urllib/httpx (the known sandbox hang).
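Several of the decisions above (line-ending normalization, the index-visit filter, the long-to-wide pivot, and the ×1000 scaling) can be sketched together in plain Python. This is a hedged illustration, not the code in the build script: the function names normalize_newlines and pivot_index_visits are stand-ins for parse_csv_bytes and pivot_water_chem, the analyte codes other than NTL/PTL are placeholders, and the sample assumes VISIT_NO sits in the same table as the analyte rows, which in the real files may need a join on UID.

```python
import csv
import io

# Hypothetical analyte-code -> (canonical field, scale factor) mapping.
# Only total N and total P are scaled from mg/L to µg/L; the DIN code is
# a placeholder for illustration.
ANALYTE_MAP = {
    "NTL": ("ntl_ugl", 1000.0),
    "PTL": ("ptl_ugl", 1000.0),
    "DIN": ("din_mgl", 1.0),
}


def normalize_newlines(raw: bytes) -> str:
    """Decode bytes and rewrite CRLF and bare-CR line endings as LF."""
    text = raw.decode("utf-8")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def pivot_index_visits(raw: bytes) -> dict:
    """Pivot one-row-per-analyte CSV bytes into one cell per UID."""
    reader = csv.DictReader(io.StringIO(normalize_newlines(raw)))
    cells = {}
    for row in reader:
        if row["VISIT_NO"] != "1":  # keep the index visit, drop revisits
            continue
        mapped = ANALYTE_MAP.get(row["ANALYTE"])
        if mapped is None:  # analyte not carried in the cell
            continue
        field, factor = mapped
        # Every cell starts with all indicators null (None).
        cell = cells.setdefault(
            row["UID"], {name: None for name, _ in ANALYTE_MAP.values()}
        )
        if row["RESULT"].strip():
            cell[field] = float(row["RESULT"]) * factor
    return cells


# Made-up sample mixing bare-CR and CRLF endings, with one revisit row.
SAMPLE = (
    b"UID,VISIT_NO,ANALYTE,RESULT\r"
    b"101,1,NTL,0.446\r"
    b"101,1,PTL,0.038\r\n"
    b"102,2,NTL,0.500\r\n"
    b"103,1,DIN,\r\n"
)

if __name__ == "__main__":
    for uid, cell in pivot_index_visits(SAMPLE).items():
        print(uid, cell)
```

Keeping the scale factor next to the field name in one mapping makes the mg/L to µg/L conversion hard to apply twice or forget, which is the failure the sanity medians are meant to catch.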
Storage
YearLocationdex pattern: ONE place-sorted spine-parquet
(data/yearlocation_storehouse/ncca_water_quality/ncca_water_quality_spine.parquet) sorted by
site_id, year, plus a by-year index JSON. Built by
scripts/build_ncca_water_quality_yearlocationdex.py. 1,093 coastal-site-year cells (2015;
by region: Great Lakes 361, Northeast 253, Gulf 237, West 124, Southeast 118).
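The storage layout can be sketched as a place-first sort plus a year lookup. The shape of the by-year index JSON is not specified here, so the year → row-position layout below is an assumption, as are the helper names and the sample cells; the parquet write itself is left out.

```python
import json


def build_spine(cells):
    """Return cells sorted place-first: by site_id, then year."""
    return sorted(cells, key=lambda c: (c["site_id"], c["year"]))


def build_year_index(spine):
    """Map each year (as a string key) to its row positions in the spine.

    Assumed layout; the real index JSON may store something else.
    """
    index = {}
    for pos, cell in enumerate(spine):
        index.setdefault(str(cell["year"]), []).append(pos)
    return index


# Made-up cells; site IDs and values are placeholders.
cells = [
    {"site_id": "SITE-B", "year": 2015, "ph": 8.1},
    {"site_id": "SITE-A", "year": 2015, "ph": 7.9},
    {"site_id": "SITE-C", "year": 2010, "ph": 8.0},
]
spine = build_spine(cells)
year_index_json = json.dumps(build_year_index(spine), sort_keys=True)
```

With the spine sorted by place, a per-site read is a contiguous slice, and the year index covers the other access direction without a second copy of the data.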
Deferred (not in v1)
- NCCA 2010 raw data (extends the time axis to a real two-cycle grid).
- The other NARS surveys as sibling kinds: Rivers & Streams (NRSA), Wetlands (NWCA). Each its own phenomenon/kind. (NLA done; NCCA done.)
- VISIT_NO==2 revisits; sediment chemistry, fish-tissue contaminants, benthic biology, enterococci as additional cell layers.
- Cross-cycle resample crosswalk (same-site linkage across cycles).
- Live edge: none — NARS is a periodic survey posted in cycle batches.