Tornado Cross-Source Enrichment (NCEI Storm Events → tor) — Scope + Frozen Settings
Status: FROZEN 2026-06-16 (Mike chose "tornado-enrich first" over a new severe-weather
kind or full Storm Events completeness) · Owner: Mike + Claude (engine room)
Parent: docs/event-spine-framework.md (Eventdex), docs/yeardex-framework.md Part 2
(the universal multi-source cited-slot rule), docs/scope-tornado-eventdex.md (the tor
kind this enriches).
This is the first same-event cross-source merge on the platform. Every prior multi-source
slot (the entry kind's four fireball catalogs) partitioned a slot space: different physical
events, one per slot, the source just a prefix. This one is harder: two catalogs describe the
same physical tornado, and the job is to fold the second catalog's columns into the existing
slot, cited, without minting a duplicate and without overwriting the primary.
It does not create a new kind. It enriches the 73,458 existing tor slots in place.
Why this exists
The tor spine (spc_tornado_history, the SPC tornado database) carries the numbers of every
surveyed US tornado, 1950–2025: EF rating, begin/end points, track length, width, casualties,
loss. What it does not carry is prose — the National Weather Service survey narrative, the
episode that groups a tornado with the rest of its outbreak, and NCEI's own event identifiers.
NOAA's Storm Events Database (the NCEI product) carries exactly that. The two share upstream
provenance (both descend from NWS storm data), so for a given tornado they should agree closely on
time and place. Enrichment attaches NCEI's narrative + episode + identifiers to the matching tor
slot as a cited block, leaving SPC as the primary record.
The second source
- Product: NCEI Storm Events Database, annual detail files StormEvents_details-ftp_v1.0_d{YYYY}_c{YYYYMMDD}.csv.gz under https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/. (The PG-staged stormevents-csvfiles rows are AutoSense catalog_csv_row stubs: the crawler listed the directory, never parsed a record. This is a fresh one-time parse, the same move FEMA and landuse made.)
- Filter (this build): EVENT_TYPE = 'Tornado' only. Hail, wind, flood, lightning, winter, and the other ~45 event types are out of scope for the enrichment build; they are new slots in a future severe-weather kind, the deferred follow-on Mike named. This document is tornado-only.
- Per-record riches we want (NCEI detail columns): EVENT_ID, EPISODE_ID, EVENT_NARRATIVE, EPISODE_NARRATIVE, CZ_NAME (county), BEGIN_DATE_TIME + CZ_TIMEZONE, BEGIN_LAT / BEGIN_LON, END_LAT / END_LON, STATE, and NCEI's own TOR_F_SCALE, TOR_LENGTH, TOR_WIDTH, INJURIES_DIRECT, DEATHS_DIRECT, DAMAGE_PROPERTY.
- Coverage (lat/lon completeness, especially pre-1996) is verified at pull time in Brick B, not asserted here.
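The filter and column selection above can be sketched with the standard library alone. This is a minimal illustration, not the code in scripts/match_stormevents_tornadoes.py: the function names tornado_rows and read_detail_file are hypothetical, and the sketch assumes the detail file has a header row using exactly the column names listed above.

```python
import csv
import gzip

# Columns named in this document; anything absent from a given file comes back as None.
WANTED = [
    "EVENT_ID", "EPISODE_ID", "EVENT_NARRATIVE", "EPISODE_NARRATIVE",
    "CZ_NAME", "BEGIN_DATE_TIME", "CZ_TIMEZONE",
    "BEGIN_LAT", "BEGIN_LON", "END_LAT", "END_LON", "STATE",
    "TOR_F_SCALE", "TOR_LENGTH", "TOR_WIDTH",
    "INJURIES_DIRECT", "DEATHS_DIRECT", "DAMAGE_PROPERTY",
]


def tornado_rows(lines):
    """Yield one dict per tornado record from an iterable of CSV text lines."""
    for row in csv.DictReader(lines):
        # Tornado-only filter; every other event type is out of scope here.
        if row.get("EVENT_TYPE") == "Tornado":
            yield {col: row.get(col) for col in WANTED}


def read_detail_file(path):
    """Stream tornado records out of one gzipped annual detail file."""
    with gzip.open(path, "rt", newline="") as fh:
        yield from tornado_rows(fh)
```

Streaming row by row keeps memory flat across the annual files; values stay as strings at this stage, so numeric parsing and the local→UTC conversion remain separate steps.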
The match — LOAD-BEARING (engine-room call, Mike may veto the tolerances)
There is no shared key between the two catalogs (SPC's om/tornado_number does not appear
in NCEI; NCEI's EVENT_ID does not appear in SPC). The match is therefore on physics, and the
tolerances are the one genuinely judgmental call in this build. They are set tight because the two
catalogs share provenance, generous enough to absorb known rounding and the timezone conversion:
A candidate NCEI tornado record matches a tor slot when all hold:
- Same state (2-letter postal).
- Same calendar day ± 1 day: NCEI BEGIN_DATE_TIME is in local standard time (CZ_TIMEZONE, e.g. CST-6); convert to UTC first. The ±1-day band absorbs the midnight rollover that the local→UTC shift can cause.
- Begin time within ± 30 minutes after the local→UTC conversion.
- Point within 15 km — NCEI begin-point within 15 km of the SPC begin point OR the SPC end point (older NCEI lat/lon is rounded to 2 decimals; the begin-or-end test absorbs which endpoint each catalog calls "begin" and the county-segmentation offset).
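The four conditions above can be expressed as a single predicate. The sketch below is one possible reading, not the shipped matcher: the TorSlot and NceiSegment types and the find_match function are hypothetical, both begin times are assumed to be already converted to timezone-aware UTC, and great-circle (haversine) distance is an assumed choice of distance measure.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


@dataclass(frozen=True)
class TorSlot:            # hypothetical shape of an SPC-backed slot
    state: str
    begin_utc: datetime
    begin: tuple          # (lat, lon)
    end: tuple            # (lat, lon)


@dataclass(frozen=True)
class NceiSegment:        # hypothetical shape of one NCEI county segment
    state: str
    begin_utc: datetime
    begin: tuple          # (lat, lon)


def find_match(slot, seg, max_km=15.0, max_minutes=30.0, max_days=1):
    """Return match provenance if all four conditions hold, else None."""
    if slot.state != seg.state:
        return None
    if abs((seg.begin_utc.date() - slot.begin_utc.date()).days) > max_days:
        return None
    dt_minutes = abs((seg.begin_utc - slot.begin_utc).total_seconds()) / 60
    if dt_minutes > max_minutes:
        return None
    # Begin-or-end test: take whichever SPC endpoint is closer.
    d_begin = haversine_km(*seg.begin, *slot.begin)
    d_end = haversine_km(*seg.begin, *slot.end)
    endpoint, dist = ("begin", d_begin) if d_begin <= d_end else ("end", d_end)
    if dist > max_km:
        return None
    return {"distance_km": round(dist, 2),
            "dt_minutes": round(dt_minutes, 1),
            "endpoint": endpoint}
```

Returning the distance, time offset, and endpoint rather than a bare boolean means the same call can feed both the ambiguity rule and the match provenance recorded on the slot.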
One-to-many is expected and kept. NCEI splits a multi-county tornado into one EVENT_ID per
county segment; SPC may carry the same tornado as a single track. A tor slot therefore matches a
list of NCEI segments — all of them are attached, ordered by begin time, each cited.
Ambiguity rule (frozen): if one NCEI segment is within tolerance of two different tor slots
(two same-day, same-state tornadoes close together), it binds to the nearest in space, then
nearest in time; a tie that survives both is logged as ambiguous and attached to neither, never
guessed.
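A minimal sketch of the ambiguity rule, assuming each candidate has already passed the tolerance test and is reduced to a (slot_id, distance_km, dt_minutes) tuple. The function name bind_segment and the tuple shape are hypothetical, and the sketch treats a tie as exact equality on both values; a real build might compare after rounding.

```python
def bind_segment(candidates):
    """Bind one NCEI segment to at most one tor slot.

    candidates: list of (slot_id, distance_km, dt_minutes), all within tolerance.
    Returns (status, slot_id) where status is "matched", "ambiguous" or "unmatched".
    """
    if not candidates:
        return ("unmatched", None)
    # Nearest in space first, then nearest in time.
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]))
    if len(ranked) > 1 and ranked[0][1:] == ranked[1][1:]:
        # The tie survives both criteria: log it, attach to neither slot.
        return ("ambiguous", None)
    return ("matched", ranked[0][0])
```

Because the rule is applied per segment, several segments can still bind to the same slot, which preserves the one-to-many attachment described above.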
What a matched slot gains (cited block)
The tor dossier keeps every existing field unchanged and gains:
- sources: ["spc_tornado_history", "ncei_storm_events"] (was implicitly SPC-only).
- A storm_events block citing NCEI:
  - source: "ncei_storm_events" and the file/url provenance.
  - ncei_event_ids: [...], episode_id, n_segments.
  - episode_narrative, event_narratives: [...] (the prose SPC lacks, the headline value).
  - cz_names: [...] (the counties NCEI walks the track through).
  - ncei_figures: NCEI's own f_scale / length_miles / width_yards / injuries / deaths / damage_property, carried as NCEI's reading for cross-source comparison; flagged, never overwriting the SPC primary values the slot already displays.
  - match: {distance_km, dt_minutes, endpoint}, the provenance of the match itself.
SPC stays primary: the slot's displayed EF, points, track, and casualties remain the SPC figures. NCEI is additive and clearly attributed.
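As an illustration of the additive, SPC-primary shape, the sketch below builds a storm_events block from a slot's matched segments and attaches it to a copy of the dossier. It is not the code in tor_stormevents_enrich.py: the function names and the per-segment dict keys are hypothetical, taking episode fields from the earliest segment is an assumption, and only the roll-ups this document states (casualties summed across segments, track-maximum rating) are computed; file/url provenance and the length, width, and damage figures are left out.

```python
import re


def rating_rank(scale):
    """Order ratings such as 'F2' or 'EF5' by trailing digit; unknowns sort lowest."""
    m = re.search(r"(\d)$", scale or "")
    return int(m.group(1)) if m else -1


def build_storm_events_block(segments, match):
    """Fold matched NCEI segments into one cited block, ordered by begin time."""
    ordered = sorted(segments, key=lambda s: s["begin_utc"])
    return {
        "source": "ncei_storm_events",
        "ncei_event_ids": [s["event_id"] for s in ordered],
        "episode_id": ordered[0]["episode_id"],          # assumption: first segment
        "n_segments": len(ordered),
        "episode_narrative": ordered[0]["episode_narrative"],
        "event_narratives": [s["event_narrative"] for s in ordered],
        "cz_names": [s["cz_name"] for s in ordered],
        "ncei_figures": {
            "f_scale": max((s["f_scale"] for s in ordered), key=rating_rank),
            "injuries": sum(s["injuries"] for s in ordered),
            "deaths": sum(s["deaths"] for s in ordered),
        },
        "match": match,
    }


def enrich(dossier, segments, match):
    """Return a copy of the dossier with the cited block added; SPC fields untouched."""
    enriched = dict(dossier)
    enriched["sources"] = ["spc_tornado_history", "ncei_storm_events"]
    enriched["storm_events"] = build_storm_events_block(segments, match)
    return enriched
```

Working on a copy and writing only the sources and storm_events keys makes the "never overwrite the primary" rule structural: no SPC field is ever assigned in this path.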
Unmatched records (frozen handling)
- Unmatched tor slot (no NCEI counterpart within tolerance): slot is left exactly as is, with no storm_events block; sources stays SPC-only. Expected for pre-1996 if NCEI lat/lon is sparse there; the match rate by decade is reported in Brick B.
- Unmatched NCEI tornado segment (no SPC slot within tolerance): not minted as a new tor slot in this build. SPC is the authoritative tornado spine; an NCEI-only "tornado" with no SPC row is almost always a segmentation artifact or a catalog discrepancy, which is worth a human look, not an automatic new event. These are written to a sidecar report (data/stormevents_tornado_unmatched.parquet) and counted, never discarded (consistent with the framework's "an unmatched record is never thrown away"). Whether any of them deserve to become slots is a deferred question, explicitly out of this enrichment build.
Measured-reality — IN (no new bright-line question)
Both catalogs are post-event storm surveys: a real tornado that physically happened, walked and
rated by NWS survey crews. NCEI's narrative is the crew's written account of measured damage. This
is the same class as the tor spine already shipped and as FEMA's administrative disaster records.
There is no model output here, so [[feedback_measured_reality_only]] raises no new ruling — the
enrichment inherits the spine's standing IN.
No new sweep
Enrichment adds the cited NCEI block only. The tor slot's existing sensor sweep
(lightning / GLM / NWS alerts / IGRA, at 150 km, ±6 h) is unchanged and not re-run — the
narrative and episode are catalog metadata, not a new sensor layer. Re-sweeping is a separate
concern if the spine itself ever reloads.
Build bricks
- Brick A: freeze. THIS DOC (2026-06-16): tornado-only enrichment of the existing tor kind, NCEI Storm Events as the cited second source, the four-part physics match (state / ±1 day / ±30 min / 15 km begin-or-end), one-to-many segment attachment, SPC-primary cited block, unmatched-NCEI to a sidecar (not minted), measured-reality IN, no new sweep.
- Brick B: pull + match. DONE 2026-06-16: scripts/match_stormevents_tornadoes.py pulled all 77 NCEI detail files 1950–2026 (curl-seeded data/stormevents_cache/; httpx hangs against the NCEI index in this environment, so the script reads the cache and falls back to httpx only when empty), kept 79,216 tornado segments (1,186 dropped for genuinely missing coords/time), converted local→UTC, and matched against the 73,458-row tor spine. 74,463 matches → 69,231 of 73,458 surveyed slots enriched (94.2%), 4,753 unmatched + 813 ambiguous to the sidecar. Per-decade match rate 0.91–0.97, flat across eras. Joplin 2011 matched at 2.1 km. Gotcha banked: older detail files carry a bare CZ_TIMEZONE (CST) with no numeric offset while modern files carry CST-6; the first pass wrongly skipped 52,499 pre-2000 records (which DO have coordinates) until the bare-abbreviation map was added. Also, pl.DataFrame(rows) needs infer_schema_length=None because the earliest matched rows have null narratives.
- Brick C/D: enrich + verify. DONE 2026-06-16: src/terrapulse/monitor/tor_stormevents_enrich.py folded the cited storm_events block into all 69,231 matched dossiers (0 missing), sources: [spc_tornado_history, ncei_storm_events], SPC primary fields untouched, casualties summed across county segments, narratives attached, track-max NCEI rating in the roll-up (Joplin: Newton EF2 + Jasper EF5 → EF5). Index rebuilt (census ~144,869). 4 unit tests (tests/test_monitor/test_tor_stormevents_enrich.py) green. Joplin now carries the NWS survey narrative its SPC record lacked. The tornado cross-source enrichment is COMPLETE (A+B+C/D), the platform's first same-event cross-source merge.
- First report: deferred. Engine room, not paper mode.
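The CZ_TIMEZONE gotcha from Brick B can be illustrated with a small parser that accepts both the modern form (CST-6) and the bare abbreviation (CST). This is a sketch only: the function names are hypothetical, and the abbreviation map shown covers four US standard-time zones as an example, not the full map the script actually carries.

```python
import re
from datetime import datetime, timedelta, timezone

# Illustrative subset of bare abbreviations -> UTC offset in hours (standard time).
BARE_OFFSETS = {"EST": -5, "CST": -6, "MST": -7, "PST": -8}

_WITH_OFFSET = re.compile(r"^[A-Za-z]+\s*([+-]\d{1,2})$")


def offset_hours(cz_timezone):
    """Return the UTC offset in hours for 'CST-6' or bare 'CST'; None if unknown."""
    tz = (cz_timezone or "").strip().upper()
    m = _WITH_OFFSET.match(tz)
    if m:
        return int(m.group(1))       # explicit numeric offset wins
    return BARE_OFFSETS.get(tz)      # fall back to the bare-abbreviation map


def local_to_utc(naive_local, cz_timezone):
    """Convert a naive local-standard-time datetime to aware UTC, or None."""
    hours = offset_hours(cz_timezone)
    if hours is None:
        return None                  # caller decides; do not silently guess
    return (naive_local - timedelta(hours=hours)).replace(tzinfo=timezone.utc)
```

Returning None for an unrecognized value, rather than dropping the record inside the parser, keeps the skip visible to the caller, which is how a silent loss like the 52,499 skipped records would surface in a count.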