
Tornado Cross-Source Enrichment (NCEI Storm Events → tor) — Scope + Frozen Settings

Status: FROZEN 2026-06-16 (Mike chose "tornado-enrich first" over a new severe-weather kind or full Storm Events completeness) · Owner: Mike + Claude (engine room) · Parent: docs/event-spine-framework.md (Eventdex), docs/yeardex-framework.md Part 2 (the universal multi-source cited-slot rule), docs/scope-tornado-eventdex.md (the tor kind this enriches).

This is the first same-event cross-source merge on the platform. Every prior multi-source slot (the entry kind's four fireball catalogs) partitioned a slot space: different physical events, one per slot, the source just a prefix. This one is harder: two catalogs describe the same physical tornado, and the job is to fold the second catalog's columns into the existing slot, cited, without minting a duplicate and without overwriting the primary.

It does not create a new kind. It enriches the 73,458 existing tor slots in place.

Why this exists

The tor spine (spc_tornado_history, the SPC tornado database) carries the numbers of every surveyed US tornado, 1950–2025: EF rating, begin/end points, track length, width, casualties, loss. What it does not carry is prose — the National Weather Service survey narrative, the episode that groups a tornado with the rest of its outbreak, and NCEI's own event identifiers. NOAA's Storm Events Database (the NCEI product) carries exactly that. The two share upstream provenance (both descend from NWS storm data), so for a given tornado they should agree closely on time and place. Enrichment attaches NCEI's narrative + episode + identifiers to the matching tor slot as a cited block, leaving SPC as the primary record.

The second source

  • Product: NCEI Storm Events Database, annual detail files StormEvents_details-ftp_v1.0_d{YYYY}_c{YYYYMMDD}.csv.gz under https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/. (The PG-staged stormevents-csvfiles rows are AutoSense catalog_csv_row stubs — the crawler listed the directory, never parsed a record. This is a fresh one-time parse, the same move FEMA and landuse made.)
  • Filter (this build): EVENT_TYPE = 'Tornado' only. Hail, wind, flood, lightning, winter, and the other ~45 event types are out of scope for the enrichment build — they are new slots in a future severe-weather kind, the deferred follow-on Mike named. This document is tornado-only.
  • Per-record riches we want (NCEI detail columns): EVENT_ID, EPISODE_ID, EVENT_NARRATIVE, EPISODE_NARRATIVE, CZ_NAME (county), BEGIN_DATE_TIME + CZ_TIMEZONE, BEGIN_LAT/BEGIN_LON, END_LAT/END_LON, STATE, and NCEI's own TOR_F_SCALE, TOR_LENGTH, TOR_WIDTH, INJURIES_DIRECT, DEATHS_DIRECT, DAMAGE_PROPERTY.
  • Coverage (lat/lon completeness, especially pre-1996) is verified at pull time in Brick B, not asserted here.
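The tornado-only filter above can be sketched with the standard library alone. This is a minimal illustration, not the parser in scripts/match_stormevents_tornadoes.py: the function name tornado_rows and the WANTED list are hypothetical, and only the column names come from the list above.

```python
import csv
import gzip

# Columns this build keeps, as listed in "Per-record riches we want".
WANTED = [
    "EVENT_ID", "EPISODE_ID", "EVENT_NARRATIVE", "EPISODE_NARRATIVE",
    "CZ_NAME", "BEGIN_DATE_TIME", "CZ_TIMEZONE",
    "BEGIN_LAT", "BEGIN_LON", "END_LAT", "END_LON", "STATE",
    "TOR_F_SCALE", "TOR_LENGTH", "TOR_WIDTH",
    "INJURIES_DIRECT", "DEATHS_DIRECT", "DAMAGE_PROPERTY",
]


def tornado_rows(path_or_file):
    """Yield one dict per EVENT_TYPE == 'Tornado' row of a gzipped detail file.

    Hypothetical helper: accepts a path or a binary file object.
    Columns absent from a given file come back as None.
    """
    with gzip.open(path_or_file, mode="rt", newline="") as text:
        for row in csv.DictReader(text):
            if row.get("EVENT_TYPE") == "Tornado":
                yield {col: row.get(col) for col in WANTED}
```

Every other event type is skipped at read time, so nothing outside the tornado scope reaches the matcher.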

The match — LOAD-BEARING (engine-room call, Mike may veto the tolerances)

There is no shared key between the two catalogs (SPC's om/tornado_number does not appear in NCEI; NCEI's EVENT_ID does not appear in SPC). The match is therefore on physics, and the tolerances are the one genuinely judgmental call in this build. They are set tight because the two catalogs share provenance, yet generous enough to absorb known rounding and the timezone conversion.

A candidate NCEI tornado record matches a tor slot when all hold:

  1. Same state (2-letter postal).
  2. Same calendar day ± 1 day — NCEI BEGIN_DATE_TIME is in local standard time (CZ_TIMEZONE, e.g. CST-6); convert to UTC first. The ±1-day band absorbs the midnight-rollover that the local→UTC shift can cause.
  3. Begin time within ± 30 minutes after the local→UTC conversion.
  4. Point within 15 km — NCEI begin-point within 15 km of the SPC begin point OR the SPC end point (older NCEI lat/lon is rounded to 2 decimals; the begin-or-end test absorbs which endpoint each catalog calls "begin" and the county-segmentation offset).
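The four criteria above can be sketched as a single predicate. This is an illustration under stated assumptions, not the shipped matcher: the TorSlot and NceiSegment shapes are hypothetical, both begin times are assumed to already be timezone-aware UTC, and great-circle distance uses the haversine formula with a mean Earth radius of 6371 km.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

MAX_KM = 15.0                    # criterion 4
MAX_DT = timedelta(minutes=30)   # criterion 3
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (degrees)."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


@dataclass(frozen=True)
class TorSlot:                   # hypothetical shape of an SPC slot
    slot_id: str
    state: str
    begin_utc: datetime
    begin: tuple                 # (lat, lon)
    end: tuple                   # (lat, lon)


@dataclass(frozen=True)
class NceiSegment:               # hypothetical shape of an NCEI segment
    event_id: str
    state: str
    begin_utc: datetime
    begin: tuple                 # (lat, lon)


def match(seg, slot):
    """Return {distance_km, dt_minutes, endpoint} when all four hold, else None."""
    if seg.state != slot.state:                                  # 1. same state
        return None
    day_gap = abs((seg.begin_utc.date() - slot.begin_utc.date()).days)
    if day_gap > 1:                                              # 2. day +/- 1
        return None
    dt = abs(seg.begin_utc - slot.begin_utc)
    if dt > MAX_DT:                                              # 3. +/- 30 min
        return None
    d_begin = haversine_km(*seg.begin, *slot.begin)
    d_end = haversine_km(*seg.begin, *slot.end)
    endpoint, dist = ("begin", d_begin) if d_begin <= d_end else ("end", d_end)
    if dist > MAX_KM:                                            # 4. 15 km
        return None
    return {"distance_km": dist, "dt_minutes": dt.total_seconds() / 60,
            "endpoint": endpoint}
```

Once both times are in UTC, criterion 3 already implies criterion 2; the day check is kept explicit here only so the sketch mirrors the frozen list.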

One-to-many is expected and kept. NCEI splits a multi-county tornado into one EVENT_ID per county segment; SPC may carry the same tornado as a single track. A tor slot therefore matches a list of NCEI segments — all of them are attached, ordered by begin time, each cited.

Ambiguity rule (frozen): if one NCEI segment is within tolerance of two different tor slots (two same-day, same-state tornadoes close together), it binds to the nearest in space, then nearest in time; a tie that survives both is logged as ambiguous and attached to neither, never guessed.
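The ambiguity rule reads naturally as a sort followed by a tie check. The sketch below is hypothetical: bind_segment and its tuple layout are illustrative, and "tie" is taken to mean exact equality on both distance and time difference, which the source does not define numerically.

```python
def bind_segment(candidates):
    """Pick the slot for one NCEI segment.

    candidates: list of (slot_id, distance_km, dt_minutes) tuples, each
    already within tolerance. Returns ("bound", slot_id),
    ("ambiguous", None), or ("unmatched", None).
    """
    if not candidates:
        return ("unmatched", None)
    # Nearest in space first, then nearest in time.
    ranked = sorted(candidates, key=lambda c: (c[1], abs(c[2])))
    if len(ranked) > 1:
        best, runner_up = ranked[0], ranked[1]
        # Assumption: a tie is exact equality on both keys.
        if (best[1], abs(best[2])) == (runner_up[1], abs(runner_up[2])):
            return ("ambiguous", None)   # logged, attached to neither
    return ("bound", ranked[0][0])
```

A segment that comes back "ambiguous" is never attached; per the rule above it is logged instead of guessed.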

What a matched slot gains (cited block)

The tor dossier keeps every existing field unchanged and gains:

  • sources: ["spc_tornado_history", "ncei_storm_events"] (was implicitly SPC-only).
  • A storm_events block citing NCEI:
    • source: "ncei_storm_events" and the file/url provenance.
    • ncei_event_ids: [...], episode_id, n_segments.
    • episode_narrative, event_narratives: [...] (the prose SPC lacks; the headline value).
    • cz_names: [...] (the counties NCEI walks the track through).
    • ncei_figures: NCEI's own f_scale/length_miles/width_yards/injuries/deaths/damage_property, carried as NCEI's reading for cross-source comparison: flagged, never overwriting the SPC primary values the slot already displays.
    • match: {distance_km, dt_minutes, endpoint} — the provenance of the match itself.

SPC stays primary: the slot's displayed EF, points, track, and casualties remain the SPC figures. NCEI is additive and clearly attributed.
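A minimal sketch of the additive fold, assuming dossiers and segments are plain dicts. The enrich function, the segment key names, and the url field are hypothetical; the roll-up (casualties summed across segments, track-max rating) follows what Brick C/D records, and length, width, damage and the match block are omitted for brevity. Taking episode_id from the first segment is an assumption.

```python
import copy


def rating_rank(scale):
    """Order ratings by their digit: 'EF5' -> 5, 'F3' -> 3, unknown -> -1."""
    digits = "".join(ch for ch in (scale or "") if ch.isdigit())
    return int(digits) if digits else -1


def enrich(dossier, segments, source_url):
    """Return a copy of the dossier with a cited storm_events block added.

    No existing key is overwritten except sources, which grows to name
    both catalogs. The input dossier is not mutated.
    """
    out = copy.deepcopy(dossier)
    ordered = sorted(segments, key=lambda s: s["begin_utc"])  # by begin time
    out["sources"] = ["spc_tornado_history", "ncei_storm_events"]
    out["storm_events"] = {
        "source": "ncei_storm_events",
        "url": source_url,
        "ncei_event_ids": [s["event_id"] for s in ordered],
        "episode_id": ordered[0]["episode_id"],  # assumption: shared by all
        "n_segments": len(ordered),
        "episode_narrative": ordered[0]["episode_narrative"],
        "event_narratives": [s["event_narrative"] for s in ordered],
        "cz_names": [s["cz_name"] for s in ordered],
        "ncei_figures": {
            # NCEI's reading, kept apart from the SPC primary values.
            "f_scale": max((s["f_scale"] for s in ordered), key=rating_rank),
            "injuries": sum(s["injuries"] for s in ordered),
            "deaths": sum(s["deaths"] for s in ordered),
        },
    }
    return out
```

Because the NCEI figures live only under storm_events.ncei_figures, a disagreement between the catalogs shows up as a visible comparison instead of a silent overwrite.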

Unmatched records (frozen handling)

  • Unmatched tor slot (no NCEI counterpart within tolerance): slot is left exactly as is — no storm_events block, sources stays SPC-only. Expected for pre-1996 if NCEI lat/lon is sparse there; the match rate by decade is reported in Brick B.
  • Unmatched NCEI tornado segment (no SPC slot within tolerance): not minted as a new tor slot in this build. SPC is the authoritative tornado spine; an NCEI-only "tornado" with no SPC row is almost always a segmentation artifact or a catalog discrepancy, which is worth a human look, not an automatic new event. These are written to a sidecar report (data/stormevents_tornado_unmatched.parquet) and counted, never discarded (consistent with the framework's "an unmatched record is never thrown away"). Whether any of them deserve to become slots is a deferred question, explicitly out of this enrichment build.

Measured-reality — IN (no new bright-line question)

Both catalogs are post-event storm surveys: a real tornado that physically happened, walked and rated by NWS survey crews. NCEI's narrative is the crew's written account of measured damage. This is the same class as the tor spine already shipped and as FEMA's administrative disaster records. There is no model output here, so [[feedback_measured_reality_only]] raises no new ruling — the enrichment inherits the spine's standing IN.

No new sweep

Enrichment adds the cited NCEI block only. The tor slot's existing sensor sweep (lightning / GLM / NWS alerts / IGRA, at 150 km, ±6 h) is unchanged and not re-run — the narrative and episode are catalog metadata, not a new sensor layer. Re-sweeping is a separate concern if the spine itself ever reloads.

Build bricks

  • Brick A — freeze. THIS DOC (2026-06-16): tornado-only enrichment of the existing tor kind, NCEI Storm Events as the cited second source, the four-part physics match (state / ±1 day / ±30 min / 15 km begin-or-end), one-to-many segment attachment, SPC-primary cited block, unmatched-NCEI to a sidecar (not minted), measured-reality IN, no new sweep.
  • Brick B — pull + match. DONE 2026-06-16: scripts/match_stormevents_tornadoes.py pulled all 77 NCEI detail files 1950–2026 (curl-seeded data/stormevents_cache/; httpx hangs against the NCEI index in this environment, so the script reads the cache and falls back to httpx only when empty), kept 79,216 tornado segments (1,186 dropped for genuinely missing coords/time), converted local→UTC, and matched against the 73,458-row tor spine. 74,463 matches → 69,231 of 73,458 surveyed slots enriched (94.2%), 4,753 unmatched + 813 ambiguous to the sidecar. Per-decade match rate 0.91–0.97, flat across eras. Joplin 2011 matched at 2.1 km. Gotcha banked: older detail files carry a bare CZ_TIMEZONE (CST) with no numeric offset while modern files carry CST-6; the first pass wrongly skipped 52,499 pre-2000 records (which DO have coordinates) until the bare-abbreviation map was added. Also, pl.DataFrame(rows) needs infer_schema_length=None because the earliest matched rows have null narratives.
  • Brick C/D — enrich + verify. DONE 2026-06-16: src/terrapulse/monitor/tor_stormevents_enrich.py folded the cited storm_events block into all 69,231 matched dossiers (0 missing), sources: [spc_tornado_history, ncei_storm_events], SPC primary fields untouched, casualties summed across county segments, narratives attached, track-max NCEI rating in the roll-up (Joplin: Newton EF2 + Jasper EF5 → EF5). Index rebuilt (census ~144,869). 4 unit tests (tests/test_monitor/test_tor_stormevents_enrich.py) green. Joplin now carries the NWS survey narrative its SPC record lacked. The tornado cross-source enrichment is COMPLETE (A+B+C/D) — the platform's first same-event cross-source merge.
  • First report: deferred. Engine room, not paper mode.
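The CZ_TIMEZONE gotcha banked in Brick B can be sketched as follows. This is an illustration, not the shipped code: BARE_OFFSETS is a partial, hypothetical map covering only four US standard-time abbreviations, and the function names are made up. The one design point it shows is that an unrecognised value raises instead of being skipped silently.

```python
import re
from datetime import datetime, timedelta, timezone

# Partial illustrative map for bare abbreviations with no numeric offset.
BARE_OFFSETS = {"EST": -5, "CST": -6, "MST": -7, "PST": -8}


def utc_offset_hours(cz_timezone):
    """Return the UTC offset in hours for 'CST-6' style or bare 'CST' values."""
    tz = cz_timezone.strip().upper()
    m = re.fullmatch(r"[A-Z]+\s*(-?\d+)", tz)
    if m:                          # modern style: abbreviation plus offset
        return int(m.group(1))
    if tz in BARE_OFFSETS:         # older style: bare abbreviation
        return BARE_OFFSETS[tz]
    raise ValueError(f"unrecognised CZ_TIMEZONE: {cz_timezone!r}")


def local_to_utc(naive_local, cz_timezone):
    """Convert a naive local-standard-time datetime to timezone-aware UTC."""
    offset = timedelta(hours=utc_offset_hours(cz_timezone))
    return (naive_local - offset).replace(tzinfo=timezone.utc)
```

A late-evening local time such as 23:30 in CST lands on the next UTC calendar day, which is the midnight rollover the ±1-day band in the match is there to absorb.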