
Scope freeze — ncca_water_quality YearLocationdex (3rd YearLocationdex kind)

Decided 2026-06-27 (Mike pasted the source URL). Source = EPA National Aquatic Resource Surveys (NARS), specifically the National Coastal Condition Assessment (NCCA). The NLA scope freeze pre-flagged this coastal sibling of nla_water_quality as a deferred kind. Same shape Mike already approved for NARS: a site × survey-year grid (YearLocationdex). It is the 3rd YearLocationdex kind after drought and nla_water_quality (docs/yearlocationdex-framework.md). Categorization was settled by the NLA precedent, so this kind was built directly without re-asking.

Slot

One (coastal site, survey year) cell holding that visit's MEASURED water-quality indicators.

Source

Data.gov record: https://catalog.data.gov/dataset/national-coastal-condition-assessment-2015-datafiles-for-report-national-coastal-condition. Two "alldatafiles" bundles on EPA's pasteur host (estuarine + Great Lakes), folded into one kind and distinguished by the region column (NCCA_REG: Northeast / Southeast / Gulf / West / Great Lakes). Pilot scope = the 2015 cycle (the cycle behind that record). NCCA 2010 is a deferred backfill that extends the year axis ("data is data"); a single-year grid now is fine.

Measured reality — IN / OUT (bright line feedback_measured_reality_only)

  • IN — raw per-site lab/field measurements: total nitrogen, total phosphorus, chlorophyll-a, conductivity, pH, dissolved inorganic nitrogen, ammonia-nitrogen, soluble reactive phosphorus, Secchi clarity. Carried per cell, null where not measured.
  • OUT — the *_DATA_FOR_POPESTIMATES_*.xlsx files. Like NLA's condition estimates, they carry WGT/STRATUM/PANEL survey weights that extrapolate the sampled sites to a national population ("% of US coastal waters in good condition"). That is a computed population estimate, not a measurement.
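A minimal sketch of how the IN / OUT line could be enforced when walking a bundle's file list. The helper names and the example filenames are hypothetical; only the *_DATA_FOR_POPESTIMATES_*.xlsx pattern comes from the note above.

```python
from fnmatch import fnmatchcase
from typing import List

# Pattern taken from the OUT bullet; matched case-insensitively as a precaution.
POPESTIMATES_GLOB = "*_DATA_FOR_POPESTIMATES_*.XLSX"


def is_population_estimate_file(filename: str) -> bool:
    """True for the weighted population-estimate workbooks (OUT)."""
    return fnmatchcase(filename.upper(), POPESTIMATES_GLOB)


def measured_files(filenames: List[str]) -> List[str]:
    """Keep only files that may carry raw per-site measurements (IN)."""
    return [f for f in filenames if not is_population_estimate_file(f)]


# Hypothetical bundle listing, for illustration only.
bundle = [
    "example_water_chem.csv",
    "example_DATA_FOR_POPESTIMATES_example.xlsx",
]
kept = measured_files(bundle)
```

Filtering on the filename keeps the exclusion explicit and auditable; it does not by itself prove that the remaining files contain only measured values.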

Cell indicators (canonical units)

Shared with nla_water_quality (same names + units, for cross-kind reads): ptl_ugl (µg/L) · ntl_ugl (µg/L) · chla_ugl (µg/L) · cond_uscm (µS/cm) · ph · secchi_m (m). Coastal extras (native mg/L): din_mgl · ammonia_n_mgl · srp_mgl.
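One way to express the canonical names and units is a small lookup table from source analyte code to (canonical column, multiplier). This is a sketch, not the build script's actual table: NTL and PTL with their ×1000 scaling come from the unit-harmonization decision in this note, while the DIN and PH source codes and the to_cell helper are hypothetical placeholders.

```python
from typing import Dict, Optional, Tuple

# source analyte code -> (canonical column, multiplier into the canonical unit)
CANONICAL: Dict[str, Tuple[str, float]] = {
    "NTL": ("ntl_ugl", 1000.0),  # mg/L -> µg/L, per the note
    "PTL": ("ptl_ugl", 1000.0),  # mg/L -> µg/L, per the note
    "DIN": ("din_mgl", 1.0),     # hypothetical source code; stays in mg/L
    "PH": ("ph", 1.0),           # hypothetical source code; unitless
}


def to_cell(wide_row: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Map one pivoted row to canonical columns; null where not measured."""
    cell: Dict[str, Optional[float]] = {}
    for code, (column, factor) in CANONICAL.items():
        value = wide_row.get(code)
        cell[column] = None if value is None else value * factor
    return cell


# Made-up values, for illustration only.
cell = to_cell({"NTL": 0.5, "PTL": 0.25, "PH": 8.0})
```

With these made-up inputs, total N of 0.5 mg/L becomes 500 µg/L and the unmeasured DIN stays null, which is the behaviour the shared ntl_ugl / ptl_ugl columns rely on.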

Decisions / gotchas frozen here

  • Unit harmonization (load-bearing). NCCA reports total N and total P in mg/L; the NLA kind's canonical unit is µg/L. The build scales NTL/PTL ×1000 so the two NARS kinds share ntl_ugl / ptl_ugl. Sanity check after build: TN median 446 µg/L, TP median 38 µg/L (same order of magnitude as NLA freshwater TN ≈ 580 µg/L, so no spurious 1000× offset).
  • Conductivity is correctly high. Median cond_uscm ≈ 21,500 — that is right for the estuarine/saline mix (seawater ≈ 50,000 µS/cm); Great Lakes freshwater sites pull the low end. Not an error.
  • Long-format water chemistry. NCCA water chem is one row per analyte (ANALYTE/RESULT/RESULT_UNITS); the build pivots UID × ANALYTE → RESULT (pivot_water_chem) before mapping.
  • One cell per site-year = the index visit. Keyed on UID (unique per site-visit), filtered to VISIT_NO==1 (equivalently INDEX_NCCA15=='Y'). VISIT_NO==2 revisits deferred.
  • Site IDs differ across cycles. NCCA draws a fresh probability sample each cycle, so 2015 site IDs will not align with 2010's; the place axis is sparse across cycles. Coords are carried for later spatial re-linking.
  • CR-only / CRLF line endings are normalized before parsing (parse_csv_bytes), the shared NARS quirk. Downloads use curl, not urllib/httpx, which hit the known sandbox hang.
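The parsing decisions above (line-ending normalization, index-visit filter, long-to-wide pivot) can be sketched with the standard library alone. These functions are illustrative stand-ins, not the real parse_csv_bytes / pivot_water_chem. Assumptions: files are UTF-8, VISIT_NO lives in a per-visit site table keyed by UID, and the sample rows are made up.

```python
import csv
import io
from typing import Dict, List, Optional, Set


def parse_csv_bytes_sketch(raw: bytes) -> List[Dict[str, str]]:
    """Normalize CR-only and CRLF endings to LF, then parse as CSV."""
    text = raw.decode("utf-8-sig")  # assumption: UTF-8, optional BOM
    # CRLF first, so the lone-CR pass does not double the line breaks.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return list(csv.DictReader(io.StringIO(text)))


def index_visit_uids(site_rows: List[Dict[str, str]]) -> Set[str]:
    """UIDs of index visits only (VISIT_NO == 1); revisits are dropped."""
    return {r["UID"] for r in site_rows if int(r["VISIT_NO"]) == 1}


def pivot_water_chem_sketch(
    chem_rows: List[Dict[str, str]], keep_uids: Set[str]
) -> Dict[str, Dict[str, Optional[float]]]:
    """Pivot one-row-per-analyte records into UID -> {ANALYTE: RESULT}."""
    wide: Dict[str, Dict[str, Optional[float]]] = {}
    for r in chem_rows:
        if r["UID"] not in keep_uids:
            continue
        result = r["RESULT"].strip()
        wide.setdefault(r["UID"], {})[r["ANALYTE"]] = (
            float(result) if result else None  # blank RESULT -> null
        )
    return wide


# Made-up inputs: a CR-only site table and a CRLF chemistry table.
sites_raw = b"UID,VISIT_NO\r1001,1\r1002,2\r"
chem_raw = (
    b"UID,ANALYTE,RESULT,RESULT_UNITS\r\n"
    b"1001,NTL,0.5,mg/L\r\n"
    b"1001,PTL,,mg/L\r\n"
    b"1002,NTL,0.75,mg/L\r\n"
)
wide = pivot_water_chem_sketch(
    parse_csv_bytes_sketch(chem_raw),
    index_visit_uids(parse_csv_bytes_sketch(sites_raw)),
)
```

In this sketch the revisit (UID 1002, VISIT_NO 2) is filtered out before the pivot, so each surviving UID yields exactly one wide row, matching the one-cell-per-site-year rule.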

Storage

YearLocationdex pattern: ONE place-sorted spine-parquet (data/yearlocation_storehouse/ncca_water_quality/ncca_water_quality_spine.parquet) sorted by site_id, year, plus a by-year index JSON. Built by scripts/build_ncca_water_quality_yearlocationdex.py. 1,093 coastal-site-year cells (2015; by region: Great Lakes 361, Northeast 253, Gulf 237, West 124, Southeast 118).
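A rough sketch of the spine ordering and the by-year index, leaving out the parquet write itself. The index layout (year → row offsets into the sorted spine) is an assumption, since this note does not specify the JSON structure; the site IDs and the extra 2010 cell are made up for illustration.

```python
import json
from typing import Any, Dict, List


def sort_spine(cells: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Place-sorted spine: order by site_id first, then year."""
    return sorted(cells, key=lambda c: (c["site_id"], c["year"]))


def build_year_index(spine: List[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Assumed layout: year (as a string) -> row offsets in the sorted spine."""
    index: Dict[str, List[int]] = {}
    for offset, cell in enumerate(spine):
        index.setdefault(str(cell["year"]), []).append(offset)
    return index


# Made-up cells; the site IDs are hypothetical.
cells = [
    {"site_id": "SITE-B", "year": 2015},
    {"site_id": "SITE-A", "year": 2015},
    {"site_id": "SITE-A", "year": 2010},
]
spine = sort_spine(cells)
year_index_json = json.dumps(build_year_index(spine), sort_keys=True)
```

Sorting by place keeps each site's years contiguous for per-site reads, while the separate index keeps by-year reads from scanning the whole spine.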

Deferred (not in v1)

  • NCCA 2010 raw data (extends the time axis to a real two-cycle grid).
  • The other NARS surveys as sibling kinds: Rivers & Streams (NRSA), Wetlands (NWCA). Each its own phenomenon/kind. (NLA done; NCCA done.)
  • VISIT_NO==2 revisits; sediment chemistry, fish-tissue contaminants, benthic biology, enterococci as additional cell layers.
  • Cross-cycle resample crosswalk (same-site linkage across cycles).
  • Live edge: none — NARS is a periodic survey posted in cycle batches.