Tornado Eventdex — Scope + Frozen Settings

Status: rule FROZEN 2026-06-11 (floor decided by Mike: no floor, full surveyed catalog) · Owner: Mike + Claude (engine room) · Parent: docs/event-spine-framework.md (the Eventdex framework; this is kind #4 after tc, gst, and eq)

This document scopes the tornado kind of the event storehouse. Per the pre-registration discipline: the settings below get frozen before any backfill and are not tuned afterward. Like the earthquake kind, no detection rule needs inventing — the spine is an authoritative external catalog — but the tornado catalog has two wrinkles the eq kind did not have: the event IDs are only year-unique, and the live edge is a different product from the spine (preliminary reports vs. surveyed tracks), with no shared ID between them.

All coverage claims below were verified against PG16 on 2026-06-11.


The spine

  • Catalog: the SPC tornado database (surveyed tornado tracks, damage-rated), already loaded locally as spc_tornado_history: 70,022 rows, 1950-01-03 → 2023-12-19, fetcher_class BackfillOnly, source https://www.spc.noaa.gov/wcm/. This was loaded once as a curated backlog source; the Eventdex promotes it to a spine.
  • Per-row riches (verified in extra_json): EF scale (as value), begin lat/lon (the observation point), end_lat/end_lon, length_miles, width_yards, injuries, fatalities, loss_millions, state, tornado_number (SPC "om" number).
  • Spot checks pass: Joplin 2011 (EF5, 158 fatalities), Moore 2013 (EF5), El Reno 2011 (EF5, 63.1-mile track) all present with correct figures.
  • Slot ID (frozen, amended same-day; see Amendments): {year}-{om} from the current SPC release. The om number is unique only within a year, not globally, and not perfectly even then: the 2025 release has 3 within-year collisions (1995/2001/2002). Frozen tiebreak: on collision, order by timestamp and append _2 to the later slot; exact duplicate rows collapse to one slot. om numbers are NOT stable across SPC releases: the 2025 release renumbered every 2007+ tornado to a datetime scheme (YYMMDDhhmm-NN; Joplin: 296616 → 1105221634-01); 1950–2006 numbers were unchanged. The spine therefore mirrors the current release wholesale, and slot IDs are re-issued with it.
  • EFU is real data: 1,024 rows have no EF rating, all 2016–2023 — the modern "EFU" category (tornado confirmed, nothing in its path to damage-rate). Any EF-based floor must state explicitly whether EFU is in or out.
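The slot-ID rule and its collision tiebreak can be sketched as follows. This is a minimal illustration, not the loader's actual code: the row keys (year, om, timestamp) and the function name are hypothetical, and extending the suffix past _2 for a three-way collision is an assumption the frozen rule does not cover.

```python
from collections import defaultdict


def assign_slot_ids(rows):
    """Map slot IDs of the form {year}-{om} to rows.

    Each row is a dict with hypothetical keys 'year', 'om', 'timestamp'.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["year"], row["om"])].append(row)

    slots = {}
    for (year, om), members in groups.items():
        # Exact duplicate rows collapse to one slot.
        unique = []
        for row in members:
            if row not in unique:
                unique.append(row)
        # Collision: order by timestamp; the later row gets the suffix.
        # Suffixes beyond _2 are an assumption, not part of the frozen rule.
        unique.sort(key=lambda r: r["timestamp"])
        for i, row in enumerate(unique):
            suffix = "" if i == 0 else f"_{i + 1}"
            slots[f"{year}-{om}{suffix}"] = row
    return slots
```

Grouping before sorting keeps the rule deterministic: the same release always yields the same slot IDs, whatever the row order in the file.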

The live edge (different product — stated honestly)

  • spc_reports (live, 8 ticks/day): preliminary local storm reports — 2,166 tornado reports since 2026-03-26 (plus hail/wind). These have time, point location, county, comments — no EF rating, no track, no om number. A report is an eyewitness/radar confirmation, not a survey.
  • Consequence: a current-season tornado cannot be keyed to its eventual surveyed entry. The spine and the live edge meet only when SPC publishes the next annual surveyed file.
  • Two-tier slot model (frozen):
    • Surveyed slots ({year}-{om}) — definitive, from the annual SPC database file.
    • Provisional slots (prelim-{YYYYMMDD}-{hhmm}-{lat}x{lon}): from spc_reports, explicitly marked "provisional": true in the dossier. When the annual surveyed file for that year lands, the year's provisional slots are retired wholesale (moved aside, not deleted; slots are never deleted by automation) and replaced by surveyed slots. No row-level matching is attempted: preliminary reports map many-to-one onto real tornadoes, and a fuzzy match would be a model-adjacent inference. Wholesale replacement is the honest rule.
  • This is the paper-taxonomy provisional→definitive dynamic in its purest form yet: ComCat revised in place under one ID; tornadoes get re-issued under new IDs annually.
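A minimal sketch of the provisional tier, under stated assumptions: the source fixes the ID template but not the coordinate precision or the time zone of the timestamp, so the two-decimal rounding and the UTC input here are illustrative choices, and both function names are hypothetical.

```python
from datetime import datetime, timezone


def provisional_slot_id(when_utc, lat, lon, decimals=2):
    """Build prelim-{YYYYMMDD}-{hhmm}-{lat}x{lon}.

    `decimals` is an assumption; the document does not fix the precision.
    """
    return (
        f"prelim-{when_utc:%Y%m%d}-{when_utc:%H%M}-"
        f"{lat:.{decimals}f}x{lon:.{decimals}f}"
    )


def retire_provisional_year(active, retired, year):
    """Move every provisional slot of `year` from `active` to `retired`.

    No row-level matching against surveyed slots is attempted, and
    nothing is deleted: retired slots are only moved aside.
    """
    prefix = f"prelim-{year}"
    for slot_id in [s for s in active if s.startswith(prefix)]:
        retired[slot_id] = active.pop(slot_id)
```

For example, a report at 22:15 UTC on 2026-05-01 at (35.2, -97.5) would be keyed prelim-20260501-2215-35.20x-97.50 under these assumptions.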

Amendments (2026-06-11, same evening — Brick B gap-audit findings)

The first gap-audit, run hours after the freeze, found three defects in the original 2023-release load. None touches the frozen floor or sweep settings; all three were recorded here before any data was changed (Mike's go: "reload it").

  1. Upstream ID renumbering. The draft asserted stable IDs. Wrong across releases: SPC's 2025 release renumbered all 2007–2023 tornadoes. Rule adopted: the spine is replaced wholesale from the current release on each annual re-pull, slot IDs re-issued with it; superseded dossier slots are retired, not deleted. 1950–2006 IDs (unchanged across these releases) stay stable in practice.
  2. Timestamps were CST stored as UTC. The original loader read the file's local times (timezone code 3 = CST on 99.95% of rows) as UTC, so every CST row's spine timestamp was 6 h early. Verified against Joplin's documented 22:34 UTC touchdown (file: 16:34 CST). The reload stores true UTC (tz 3 = CST → +6 h, tz 6 = MST → +7 h, tz 9 = GMT → +0; tz 0 = unknown → +6 h under a CST assumption, flagged). The tz_code is kept per row.
  3. Loss units changed upstream. Old release: millions of dollars (1996+); 2025 release: whole dollars. Pre-1996 the field is a 1–9 category code in both. The reload stores loss_raw verbatim + loss_millions (dollars/1e6) for 1996+ only; the old load's pre-1996 "loss_millions" values were category codes mislabeled as millions and are not carried forward.
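The timestamp and loss rules in items 2 and 3 can be sketched as below. The function names are hypothetical and this is not the code in scripts/reload_tornado_spine.py; it only restates the stated offsets and the 1996 cutoff, and it rejects any tz code the amendment does not list.

```python
from datetime import datetime, timedelta

# Hours added to the file's local time to reach UTC, keyed by SPC tz code.
# Code 0 (unknown) is treated as CST and flagged by the caller.
TZ_OFFSET_HOURS = {3: 6, 6: 7, 9: 0, 0: 6}


def to_utc(local_time, tz_code):
    """Return (utc_time, assumed) for a naive timestamp from the file."""
    if tz_code not in TZ_OFFSET_HOURS:
        raise ValueError(f"unhandled tz code: {tz_code}")
    utc = local_time + timedelta(hours=TZ_OFFSET_HOURS[tz_code])
    return utc, tz_code == 0


def normalize_loss(year, loss_raw):
    """Return (loss_raw, loss_millions) under 2025-release semantics.

    1996 onward the raw field is whole dollars; before 1996 it is a
    1-9 category code, so no dollar figure is derived.
    """
    loss_millions = loss_raw / 1e6 if year >= 1996 else None
    return loss_raw, loss_millions


# Joplin check from the amendment: 16:34 CST in the file is 22:34 UTC.
print(to_utc(datetime(2011, 5, 22, 16, 34), 3)[0])  # 2011-05-22 22:34:00
```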

Back-year revisions also confirmed: every year 2007–2022 gained rows in the 2025 release (2014 +42, 2022 +24); 2023 unchanged. The wholesale-reload rule absorbs these by construction. Catalog size after reload: 73,458 tornadoes, 1950 → 2025.

History depth + gap

  • Spine on disk ends 2023-12-19. SPC has since published the updated database through at least 2024 (annual release each spring at spc.noaa.gov/wcm/). Gap-fill brick: re-pull the current 1950→latest file and load 2024 (and 2025 if published). This is a re-pull of an existing curated source, not a new fetcher — the framework doc's "needs an NCEI Storm Events fetcher" estimate was wrong; verified 2026-06-11 that the spine already exists locally.
  • Years after the last surveyed release run on provisional slots until their file lands.
  • 1950s–1970s undercount (weak tornadoes underreported before systematic surveys) is a known property of the catalog, recorded here, not corrected for — the Eventdex stores what the catalog measured.

The decision: EF floor

Counts from the local spine (1950–2023):

Floor                  Slots    Modern rate (per yr)   Note
None (all surveyed)    70,022   ~1,200                 every slot is a damage-surveyed real event
EF1+                   36,780   ~500                   drops EF0 (32,218) and EFU (1,024)
EF2+ ("significant")   12,998   ~120                   matches the TC/eq Eventdex scale
EF3+ ("intense")       3,231    ~35                    intense and violent tornadoes only

Considerations, both directions:

  • Unlike the M4.5-global quake case (250k slots, mostly meaningless), there is no slot spam here: every row in the SPC database is a confirmed, surveyed tornado. The catalog is the floor.
  • Pre-sensor-era slots are index-only either way (lightning depth starts 2026-04-20), so the marginal cost of an EF0 card is one small JSON file. 70k cards ≈ the eq backfill twice over, minutes not hours.
  • The counterargument: 70k is 5× the TC kind, and EF0s are 46% of the catalog — short, weak, often <1 mile of track. If the Eventdex is "every card big enough that a sweep layer plausibly registers it," EF1+ or EF2+ is the analog of the M6.0 floor.
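The figures quoted above are mutually consistent, which a short check makes explicit. All numbers are taken from the floor table; only the variable names are introduced here.

```python
total = 70_022            # all surveyed slots, 1950-2023
ef0, efu = 32_218, 1_024  # rows dropped by an EF1+ floor

ef1_plus = total - ef0 - efu
ef0_share = ef0 / total

print(ef1_plus)            # 36780
print(f"{ef0_share:.0%}")  # 46%
```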

DECISION (Mike, 2026-06-11, FROZEN): no floor — every surveyed tornado in the SPC database gets a slot, EFU included. The survey process is the floor: every row is a confirmed, damage-surveyed event, so there is no slot-spam failure mode here. 70,022 slots at freeze time, growing ~1,200/yr. A weak tornado does not get dropped; a provisional report does not get grandfathered into a surveyed slot.

The sweep (per-kind settings, frozen)

  • Geometry: short-track event — two fixes (begin point, end point; identical when no end coords). Sweep both, TC-style, dedup hits across the pair.
  • Radius: 150 km flat (no tiering). Rationale: the relevant context — lightning activity in the parent storm, neighboring reports, warning polygons — lives at mesocyclone-to-supercell scale, not at magnitude-scaled basin scale.
  • Window: [touchdown − 6 h, touchdown + 6 h]. Pre-window catches the parent storm's electrification ramp (the lightning-jump literature lives in the tens of minutes before); post-window catches the storm's evolution past the tornado.
  • Sweep layers (verified live 2026-06-11, with local depth):
    • blitzortung_lightning — ground-network strokes (since 2026-04-20, 22.2M rows)
    • goes18_glm_flashes / goes19_glm_flashes — satellite optical flashes (since 2026-04-22)
    • spc_reports — sibling reports: hail/wind/other tornado reports around the track
    • nws_alerts — warning polygons active over the track window (since 2021; issuance records — operational facts, not forecasts, but recorded here as context layers)
    • igra_soundings — profile sensor, launch-collapsed, pre-storm environment (since 2024-12-31)
  • Consequence, stated up front: every surveyed slot (1950–2023) is index-only for the lightning layers by construction, because lightning depth begins 2026-04. Of the other layers, only nws_alerts (since 2021) overlaps the surveyed years. The first surveyed year with lightning coverage will be 2026, published ~spring 2027. Until then the lightning sweep earns its keep on provisional slots only. The slots exist; the data grows into them, the same posture as pre-2024 TC and pre-2025 gst/eq.
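The frozen geometry, radius, and window can be sketched as one function. This is an illustration under assumptions, not tor_sweep.py: observations are taken as hypothetical (id, time, lat, lon) tuples, distances use a spherical haversine, and the dedup across the two fixes keeps each observation once at its closer distance.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

RADIUS_KM = 150.0                  # flat radius, no tiering
HALF_WINDOW = timedelta(hours=6)   # [touchdown - 6 h, touchdown + 6 h]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlmb = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def sweep(touchdown, begin, end, observations):
    """Return {obs_id: distance_km} for hits around a tornado track.

    `begin` and `end` are (lat, lon) fixes; `end` may be None.
    An observation near both fixes is counted once, at the closer one.
    """
    fixes = [begin] if end is None or end == begin else [begin, end]
    hits = {}
    for obs_id, when, lat, lon in observations:
        if abs(when - touchdown) > HALF_WINDOW:
            continue  # outside the closed +/- 6 h window
        d = min(haversine_km(lat, lon, f_lat, f_lon) for f_lat, f_lon in fixes)
        if d <= RADIUS_KM:
            hits[obs_id] = d
    return hits
```

Taking the minimum over the fixes before the radius test is what makes the dedup implicit: there is no second pass to merge per-fix hit lists.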

Dossier shape

Same pattern as eq: per-sensor aggregates + hits parquet for the spatially-swept layers. Spine metadata in the slot: EF scale, begin/end coords, length, width, injuries, fatalities, loss, state, provisional flag. Kind directory: data/event_storehouse/tor/.
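As a rough picture of the spine metadata carried in a slot, here is a hypothetical dataclass. Field names follow the extra_json keys listed under "The spine" where the document gives them; slot_id, ef_scale, begin_lat, begin_lon, the class name, and the one-JSON-file-per-slot naming are assumptions, and the per-sensor aggregates and hits parquet are left out.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import PurePosixPath
from typing import Optional

KIND_DIR = PurePosixPath("data/event_storehouse/tor")


@dataclass
class TornadoSlot:
    slot_id: str
    ef_scale: Optional[int]          # None for EFU (unrated)
    begin_lat: float
    begin_lon: float
    end_lat: Optional[float]         # None when the file has no end coords
    end_lon: Optional[float]
    length_miles: float
    width_yards: float
    injuries: int
    fatalities: int
    loss_millions: Optional[float]   # None before 1996 (category codes)
    state: str
    provisional: bool = False

    def card_path(self):
        # File naming is an assumption; only the kind directory is given.
        return KIND_DIR / f"{self.slot_id}.json"

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)
```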

Build bricks

  • Brick A — freeze. DONE 2026-06-11: no floor (full surveyed catalog, EFU included), slot-ID + two-tier model + sweep settings frozen above.
  • Brick B — spine reload. DONE 2026-06-11: wholesale reload from the 2025 release (scripts/reload_tornado_spine.py) — 73,458 rows 1950→2025, per-year counts match the file exactly (76/76 years), 3 known ID collisions, true-UTC timestamps (Joplin 22:34 UTC ✓), Greenfield 2024 EF4 spot-checked. Audit findings that drove the reload-not-append decision are in Amendments above. Old release archived in DuckDB (raw_events table); 2025 release in raw_events_2025_release.
  • Brick C — kind registration + live sweep. DONE 2026-06-11: tor_sweep.py KindConfig + 30-min scheduler job beside the other three kinds. Provisional-slot path live (first tick: 121 provisional dossiers from the trailing 14 days, 938k sensor hits; spc_reports sibling layer deduplicated ~3× and self-excluded). Sweep covers both track endpoints with closer-hit dedup.
  • Brick D — dossier backfill. DONE 2026-06-11: scripts/backfill_tor_dossiers.py, 73,458 surveyed dossiers in 1,543 s (decades: 1950s 4,793 / 1960s 6,811 / 1970s 8,579 / 1980s 8,195 / 1990s 12,137 / 2000s 12,779 / 2010s 12,070 / 2020s 8,094; EF: 0 33,329 / 1 25,179 / 2 10,079 / 3 2,669 / 4 596 / 5 60 / U 1,546). 2,749 cards carry sensor hits (2021+ nws_alerts depth + 2025 IGRA); Joplin 2011 surfaced independently as deadliest (158 fatalities). Count matches the spine exactly. Known coverage gap recorded: nws_alerts has no 2025 rows locally (2021–2024 + live 2026 only) — gap-fill candidate, not a sweep defect.
  • First report: deferred. Engine room, not paper mode.
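The Brick D breakdowns can be cross-checked against the spine total with plain arithmetic. The numbers below are copied from the Brick B and Brick D entries; the variable names are illustrative.

```python
SPINE_ROWS = 73_458  # rows after the 2025-release reload (Brick B)

by_decade = {
    "1950s": 4_793, "1960s": 6_811, "1970s": 8_579, "1980s": 8_195,
    "1990s": 12_137, "2000s": 12_779, "2010s": 12_070, "2020s": 8_094,
}
by_ef = {
    "0": 33_329, "1": 25_179, "2": 10_079, "3": 2_669,
    "4": 596, "5": 60, "U": 1_546,
}

# Both breakdowns must account for every surveyed dossier.
assert sum(by_decade.values()) == SPINE_ROWS
assert sum(by_ef.values()) == SPINE_ROWS
```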