Tornado Eventdex — Scope + Frozen Settings
Status: rule FROZEN 2026-06-11 (floor decided by Mike: no floor — full surveyed catalog) · Owner: Mike + Claude (engine room)
Parent: docs/event-spine-framework.md (the Eventdex framework; this is kind #4 after tc, gst, and eq)
This document scopes the tornado kind of the event storehouse. Per the pre-registration discipline: the settings below get frozen before any backfill and are not tuned afterward. Like the earthquake kind, no detection rule needs inventing — the spine is an authoritative external catalog — but the tornado catalog has two wrinkles the eq kind did not have: the event IDs are only year-unique, and the live edge is a different product from the spine (preliminary reports vs. surveyed tracks), with no shared ID between them.
All coverage claims below were verified against PG16 on 2026-06-11.
The spine
- Catalog: the SPC tornado database (surveyed tornado tracks, damage-rated), already
loaded locally as
spc_tornado_history— 70,022 rows, 1950-01-03 → 2023-12-19, fetcher_classBackfillOnly, sourcehttps://www.spc.noaa.gov/wcm/. This was loaded once as a curated backlog source; the Eventdex promotes it to a spine. - Per-row riches (verified in
extra_json): EF scale (asvalue), begin lat/lon (the observation point),end_lat/end_lon,length_miles,width_yards,injuries,fatalities,loss_millions,state,tornado_number(SPC "om" number). - Spot checks pass: Joplin 2011 (EF5, 158 fatalities), Moore 2013 (EF5), El Reno 2011 (EF5, 63.1-mile track) all present with correct figures.
- Slot ID (frozen, amended same-day — see Amendments):
{year}-{om}from the current SPC release. The om number is unique per year, not globally (3 collisions in the 2025 release: 1995/2001/2002 — frozen tiebreak: collide → order by timestamp, append_2to the later slot; exact duplicate rows collapse to one slot). om numbers are NOT stable across SPC releases — the 2025 release renumbered every 2007+ tornado to a datetime scheme (YYMMDDhhmm-NN; Joplin:296616→1105221634-01); 1950–2006 numbers were unchanged. The spine therefore mirrors the current release wholesale, and slot IDs are re-issued with it. - EFU is real data: 1,024 rows have no EF rating, all 2016–2023 — the modern "EFU" category (tornado confirmed, nothing in its path to damage-rate). Any EF-based floor must state explicitly whether EFU is in or out.
The live edge (different product — stated honestly)
spc_reports(live, 8 ticks/day): preliminary local storm reports — 2,166 tornado reports since 2026-03-26 (plus hail/wind). These have time, point location, county, comments — no EF rating, no track, no om number. A report is an eyewitness/radar confirmation, not a survey.- Consequence: a current-season tornado cannot be keyed to its eventual surveyed entry. The spine and the live edge meet only when SPC publishes the next annual surveyed file.
- Two-tier slot model (frozen):
- Surveyed slots (
{year}-{om}) — definitive, from the annual SPC database file. - Provisional slots (
prelim-{YYYYMMDD}-{hhmm}-{lat}x{lon}) — fromspc_reports, explicitly marked"provisional": truein the dossier. When the annual surveyed file for that year lands, the year's provisional slots are retired wholesale (moved aside, not deleted — slots are never deleted by automation) and replaced by surveyed slots. No row-level matching is attempted — preliminary reports many-to-one onto real tornadoes, and a fuzzy match would be a model-adjacent inference. Wholesale replacement is the honest rule.
- Surveyed slots (
- This is the paper-taxonomy provisional→definitive dynamic in its purest form yet: ComCat revised in place under one ID; tornadoes get re-issued under new IDs annually.
Amendments (2026-06-11, same evening — Brick B gap-audit findings)
The first gap-audit, run hours after the freeze, found three defects in the original 2023-release load. None touches the frozen floor or sweep settings; all three were recorded here before any data was changed (Mike's go: "reload it").
- Upstream ID renumbering. The draft asserted stable IDs. Wrong across releases: SPC's 2025 release renumbered all 2007–2023 tornadoes. Rule adopted: the spine is replaced wholesale from the current release on each annual re-pull, slot IDs re-issued with it; superseded dossier slots are retired, not deleted. 1950–2006 IDs (unchanged across these releases) stay stable in practice.
- Timestamps were CST stored as UTC. The original loader read the file's times
(timezone code 3 = CST on 99.95% of rows) as UTC, making every spine timestamp 6 h
early. Verified against Joplin's documented 22:34 UTC touchdown (file: 16:34 CST).
The reload stores true UTC (tz 3 → +6 h, tz 6 = MST → +7 h, tz 9 = GMT → +0; tz 0 =
unknown → +6 h CST assumption, flagged). The
tz_codeis kept per row. - Loss units changed upstream. Old release: millions of dollars (1996+); 2025
release: whole dollars. Pre-1996 the field is a 1–9 category code in both. The
reload stores
loss_rawverbatim +loss_millions(dollars/1e6) for 1996+ only; the old load's pre-1996 "loss_millions" values were category codes mislabeled as millions and are not carried forward.
Back-year revisions also confirmed: every year 2007–2022 gained rows in the 2025 release (2014 +42, 2022 +24); 2023 unchanged. The wholesale-reload rule absorbs these by construction. Catalog size after reload: 73,458 tornadoes, 1950 → 2025.
History depth + gap
- Spine on disk ends 2023-12-19. SPC has since published the updated database through
at least 2024 (annual release each spring at
spc.noaa.gov/wcm/). Gap-fill brick: re-pull the current 1950→latest file and load 2024 (and 2025 if published). This is a re-pull of an existing curated source, not a new fetcher — the framework doc's "needs an NCEI Storm Events fetcher" estimate was wrong; verified 2026-06-11 that the spine already exists locally. - Years after the last surveyed release run on provisional slots until their file lands.
- 1950s–1970s undercount (weak tornadoes underreported before systematic surveys) is a known property of the catalog, recorded here, not corrected for — the Eventdex stores what the catalog measured.
The decision: EF floor
Counts from the local spine (1950–2023):
| Floor | Slots | Modern rate (≈/yr) | Note |
|---|---|---|---|
| None (all surveyed) | 70,022 | ~1,200 | every slot is a damage-surveyed real event |
| EF1+ | 36,780 | ~500 | drops EF0 (32,218) and EFU (1,024) |
| EF2+ ("significant") | 12,998 | ~120 | exactly TC/eq Eventdex scale |
| EF3+ ("intense") | 3,231 | ~35 | violent-tornado tier only |
Considerations, both directions:
- Unlike the M4.5-global quake case (250k slots, mostly meaningless), there is no slot spam here: every row in the SPC database is a confirmed, surveyed tornado. The catalog is the floor.
- Pre-sensor-era slots are index-only either way (lightning depth starts 2026-04-20), so the marginal cost of an EF0 card is one small JSON file. 70k cards ≈ the eq backfill twice over, minutes not hours.
- The counterargument: 70k is 5× the TC kind, and EF0s are 46% of the catalog — short, weak, often <1 mile of track. If the Eventdex is "every card big enough that a sweep layer plausibly registers it," EF1+ or EF2+ is the analog of the M6.0 floor.
DECISION (Mike, 2026-06-11, FROZEN): no floor — every surveyed tornado in the SPC database gets a slot, EFU included. The survey process is the floor: every row is a confirmed, damage-surveyed event, so there is no slot-spam failure mode here. 70,022 slots at freeze time, growing ~1,200/yr. A weak tornado does not get dropped; a provisional report does not get grandfathered into a surveyed slot.
The sweep (per-kind settings, frozen)
- Geometry: short-track event — two fixes (begin point, end point; identical when no end coords). Sweep both, TC-style, dedup hits across the pair.
- Radius: 150 km flat (no tiering). Rationale: the relevant context — lightning activity in the parent storm, neighboring reports, warning polygons — lives at mesocyclone-to-supercell scale, not at magnitude-scaled basin scale.
- Window:
[touchdown − 6 h, touchdown + 6 h]. Pre-window catches the parent storm's electrification ramp (the lightning-jump literature lives in the tens of minutes before); post-window catches the storm's evolution past the tornado. - Sweep layers (verified live 2026-06-11, with local depth):
blitzortung_lightning— ground-network strokes (since 2026-04-20, 22.2M rows)goes18_glm_flashes/goes19_glm_flashes— satellite optical flashes (since 2026-04-22)spc_reports— sibling reports: hail/wind/other tornado reports around the tracknws_alerts— warning polygons active over the track window (since 2021; issuance records — operational facts, not forecasts, but recorded here as context layers)igra_soundings— profile sensor, launch-collapsed, pre-storm environment (since 2024-12-31)
- Consequence, stated up front: every surveyed slot (1950–2023) is index-only by construction. Sensor depth begins 2026-04. The first surveyed year with lightning coverage will be 2026, published ~spring 2027. Until then the sweep earns its keep on provisional slots only. The slots exist; the data grows into them — same posture as pre-2024 TC and pre-2025 gst/eq.
Dossier shape
Same pattern as eq: per-sensor aggregates + hits parquet for the spatially-swept layers.
Spine metadata in the slot: EF scale, begin/end coords, length, width, injuries,
fatalities, loss, state, provisional flag. Kind directory: data/event_storehouse/tor/.
Build bricks
- Brick A — freeze. DONE 2026-06-11: no floor (full surveyed catalog, EFU included), slot-ID + two-tier model + sweep settings frozen above.
- Brick B — spine reload. DONE 2026-06-11: wholesale reload from the 2025
release (
scripts/reload_tornado_spine.py) — 73,458 rows 1950→2025, per-year counts match the file exactly (76/76 years), 3 known ID collisions, true-UTC timestamps (Joplin 22:34 UTC ✓), Greenfield 2024 EF4 spot-checked. Audit findings that drove the reload-not-append decision are in Amendments above. Old release archived in DuckDB (raw_eventstable); 2025 release inraw_events_2025_release. - Brick C — kind registration + live sweep. DONE 2026-06-11:
tor_sweep.pyKindConfig + 30-min scheduler job beside the other three kinds. Provisional-slot path live (first tick: 121 provisional dossiers from the trailing 14 days, 938k sensor hits; spc_reports sibling layer deduplicated ~3× and self-excluded). Sweep covers both track endpoints with closer-hit dedup. - Brick D — dossier backfill. DONE 2026-06-11:
scripts/backfill_tor_dossiers.py, 73,458 surveyed dossiers in 1,543 s (decades: 1950s 4,793 / 1960s 6,811 / 1970s 8,579 / 1980s 8,195 / 1990s 12,137 / 2000s 12,779 / 2010s 12,070 / 2020s 8,094; EF: 0 33,329 / 1 25,179 / 2 10,079 / 3 2,669 / 4 596 / 5 60 / U 1,546). 2,749 cards carry sensor hits (2021+ nws_alerts depth + 2025 IGRA); Joplin 2011 surfaced independently as deadliest (158 fatalities). Count matches the spine exactly. Known coverage gap recorded: nws_alerts has no 2025 rows locally (2021–2024 + live 2026 only) — gap-fill candidate, not a sweep defect. - First report: deferred. Engine room, not paper mode.