Backfill gap matrix

For each curated source: how much of its theoretically-available history TerraPulse currently holds, the gap to its theoretical maximum, and a rough backfill-effort estimate. Use this as a prioritization frame for "what should we backfill next?" rather than "is this source loading right now."

Built 2026-05-13 from the live observations table. Coverage figures are the actual MIN(timestamp_utc) / MAX(timestamp_utc) per source_id at extract time. Theoretical-max columns are sourced from the public docs of each provider.
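
The coverage extraction described above can be sketched as a single grouped query. This is a minimal illustration, assuming the observations table exposes `source_id` and `timestamp_utc` columns as named above. The in-memory SQLite database and the sample rows are stand-ins, not the production store.

```python
import sqlite3

def coverage_by_source(conn):
    """Return {source_id: (first_ts, last_ts, row_count)} from observations."""
    rows = conn.execute(
        """
        SELECT source_id,
               MIN(timestamp_utc),
               MAX(timestamp_utc),
               COUNT(*)
        FROM observations
        GROUP BY source_id
        ORDER BY source_id
        """
    ).fetchall()
    return {sid: (first, last, n) for sid, first, last, n in rows}

# Stand-in data (hypothetical rows). ISO-8601 strings in one consistent
# format sort chronologically, so MIN/MAX on text behaves as expected.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (source_id TEXT, timestamp_utc TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [
        ("usdm_drought", "2010-01-04T00:00:00Z"),
        ("usdm_drought", "2026-05-11T00:00:00Z"),
        ("nasa_donki", "2025-09-16T00:00:00Z"),
    ],
)
coverage = coverage_by_source(conn)
```

The row count is included because the per-source entries below quote it next to the date range.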

Convention:

  • "Already covered" — current_start ≤ theoretical_start, no meaningful gap
  • "Modest gap" — < 5 years missing
  • "Major gap" — ≥ 5 years missing and the missing years are useful for active or queued research
  • "Effort" is rated S/M/L: sub-day, a day or two, or a week or more
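
The gap labels in this convention can be expressed as a small function. This is a sketch only: the function name, the `useful_for_research` flag, and the fallback label for long gaps with low research value are assumptions, since the convention does not name that last case.

```python
from datetime import date

def classify_gap(current_start, theoretical_start, useful_for_research=True):
    """Label a source by how much early history is missing (approximate years)."""
    if current_start <= theoretical_start:
        return "Already covered"
    missing_years = (current_start - theoretical_start).days / 365.25
    if missing_years < 5:
        return "Modest gap"
    if useful_for_research:
        return "Major gap"
    # Assumption: the convention leaves this case unnamed.
    return "Unclassified (>= 5 years missing, low research value)"

# Dates taken from the silso_sunspots and usdm_drought entries in this file.
silso_label = classify_gap(date(1817, 12, 31), date(1818, 1, 1))
usdm_label = classify_gap(date(2010, 1, 4), date(2000, 1, 4))
```

The function only measures missing history at the start of the record. A source such as spc_tornado_history, whose gap is at the recent end, would need a separate check against the provider's latest release.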

This file is partial — it lists sources tied to active research arcs and high-traffic ingestion. The long tail of one-shot EU/EPA catalog sources is omitted (they're static datasets, no meaningful backfill question). Add rows when a new arc begins or when a source's coverage shifts.


Severe weather

nws_alerts

  • Current coverage: 2021-01-01 → today (708k rows)
  • Theoretical max: ~2007 (iNWS Common Alerting Protocol rollout); some pre-CAP feeds go back further
  • Gap: ~14 years pre-2021
  • Backfill feasibility: Iowa Environmental Mesonet archives NWS alerts via API back to ~2005. Already used for SPC tornado backfill (#174). Same path works for general NWS alert types.
  • Effort: M — needs alert-type stratification and dedup logic
  • Active arc relevance: High. Underpins severe-weather climatology and cross-source cascade studies.

spc_reports (live)

  • Current coverage: 2026-03-22 → today (49k rows)
  • Theoretical max: SPC online catalog from 1950 (storm reports), continuously updated
  • Gap: 76 years
  • Backfill feasibility: SPC publishes annual CSVs. spc_tornado_history already covers tornadoes 1950-2023. Hail and wind reports same source, separate CSVs.
  • Effort: M for hail+wind history; tornadoes already done (#174)
  • Active arc relevance: High. Negative-class controls for WSPR-tornado V5/V2-of-V3.

spc_tornado_history (backfill-only source)

  • Current coverage: 1950-01-03 → 2023-12-19 (70k rows)
  • Theoretical max: Same; SPC publishes 2024 + 2025 in annual updates
  • Gap: the 2 most recent years (2024, 2025)
  • Backfill feasibility: Annual CSV publish cycle. Re-pull 2024 + 2025 when SPC issues them.
  • Effort: S
  • Active arc relevance: High but already largely done.

spc_outlook

  • Current coverage: 2026-03-23 → today (3.3k rows)
  • Theoretical max: SPC has archived day-1/day-2/day-8 convective outlooks since 2002 (varies by product)
  • Gap: 24 years
  • Backfill feasibility: SPC archives outlooks as shapefiles. Path exists but heavier — need shapefile parsing + KML/GeoJSON conversion.
  • Effort: L
  • Active arc relevance: Medium. Useful for forecast-verification studies (#75).

Space weather

dscovr_solar_wind

  • Current coverage: 2025-10-02 → today (6.9M rows)
  • Theoretical max: DSCOVR launched 2015-02; L1 solar wind data continuous since mid-2016
  • Gap: ~9 years
  • Backfill feasibility: NOAA hosts DSCOVR archive on their CDF servers. Bulk download + per-day parse.
  • Effort: M
  • Active arc relevance: High. Solar wind → Kp lag (#131) needs deep history for proper statistics.

goes_xray

  • Current coverage: 2026-03-17 → today (13M rows)
  • Theoretical max: GOES X-ray flux archive goes back to 1986 (GOES-6 onward). Continuous across satellites.
  • Gap: 40 years
  • Backfill feasibility: NOAA SWPC ftp archive. Annual files. Many satellite generations to harmonize.
  • Effort: L (~week)
  • Active arc relevance: High. Solar-flare-watch papers (#28 in spec queue, x-flare-watch workspaces already done at small scale).

nasa_donki

  • Current coverage: 2025-09-16 → today (30k rows)
  • Theoretical max: NASA DONKI catalog goes back to 2010-08-01
  • Gap: 15 years
  • Backfill feasibility: DONKI REST API supports historical date range queries. Walk the date window.
  • Effort: S (single script, well-defined API)
  • Active arc relevance: Medium. Cascade triggers paper #76 wants this.
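
"Walk the date window" could look like the sketch below: split the missing range into fixed-size windows and build one request URL per window. The 30-day window size, the helper names, and the use of `DEMO_KEY` are assumptions for illustration. The code only builds URLs and performs no network calls, and the `startDate` / `endDate` / `api_key` parameters should be checked against the current DONKI documentation before use.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

DONKI_BASE = "https://api.nasa.gov/DONKI"

def date_windows(start, end, days=30):
    """Yield inclusive, non-overlapping (window_start, window_end) pairs."""
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)

def donki_url(event_type, window_start, window_end, api_key="DEMO_KEY"):
    """Build a request URL for one event type (e.g. FLR) and one window."""
    query = urlencode({
        "startDate": window_start.isoformat(),
        "endDate": window_end.isoformat(),
        "api_key": api_key,
    })
    return f"{DONKI_BASE}/{event_type}?{query}"

# Missing range: catalog start through the day before current coverage begins.
windows = list(date_windows(date(2010, 8, 1), date(2025, 9, 15)))
first_url = donki_url("FLR", *windows[0])
```

Inclusive, non-overlapping windows keep each event in exactly one request, which limits the dedup work needed at load time.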

dst_index

  • Current coverage: 2026-03-03 → ~30 days ago (2.2k rows). Wall-clock-to-MAX(timestamp_utc) shows ~30 days; this is the WDC Kyoto intrinsic provisional-validation lag, not an ingestion bug (verified #192).
  • Theoretical max: Dst index 1957-onwards from WDC Kyoto
  • Gap: ~69 years pre-2026; the 30-day "lag" is a property of the dataset, not a fixable gap
  • Backfill feasibility: WDC Kyoto distributes hourly Dst as text files. Straightforward backfill.
  • Effort: S
  • Active arc relevance: Medium. Geomagnetic storm onset timing (#131). Note that any analysis using Dst implicitly excludes the most recent ~30 days, because those values have not been published yet.

silso_sunspots

  • Current coverage: 1817-12-31 → ~1 day behind today's date (754k rows). The current MAX(timestamp_utc) reflects SILSO's own publishing cadence (~1-day lag) rather than an ingestion bug (verified #192).
  • Theoretical max: SILSO daily v2.0 series goes back to 1818 (essentially fully covered)
  • Gap: None
  • Backfill feasibility: N/A — already covered
  • Effort: N/A

intermagnet

  • Current coverage: 2026-04-12 → 2026-05-11 (11k rows)
  • Theoretical max: INTERMAGNET network goes back to 1991; varies per station
  • Gap: ~35 years
  • Backfill feasibility: INTERMAGNET requires registration for bulk historical pulls. Per-station, per-day files.
  • Effort: L (registration + parsing across ~30 stations × decades)
  • Active arc relevance: Medium. Ground-magnetometer cross-validation for ionospheric studies.

Atmospheric (radiosondes)

igra_soundings

  • Current coverage: 2024-12-31 → today (103M rows, ~2,747 launches across ~89 stations)
  • Theoretical max: NCEI IGRA2 goes back to 1905 for some stations; most US stations from 1947+
  • Gap: ~78 years for most US stations (1947 baseline); up to ~120 years for the oldest stations (1905)
  • Backfill feasibility: NCEI publishes "data-por" (period-of-record) zip files per station alongside the y2d files we use. Same parser works.
  • Effort: L — ~78 years × 89 stations × decompress+parse. Probably 6-12 hours of run time.
  • Active arc relevance: Very high. Would unblock historical V-arc replications (e.g., 2011 Super Outbreak CAPE/LI vs WSPR cross-checks).
  • Note: Current y2d coverage was just backfilled (2026-05-06). POR backfill is the natural next step but a real time investment.

Hurricanes

ibtracs_hurricanes

  • Current coverage: 1842-10-24 → today (766k rows)
  • Theoretical max: IBTrACS v04 goes back to 1842 (essentially fully covered)
  • Gap: None
  • Backfill feasibility: N/A
  • Note: Already covered.

Hydrology

usgs_water

  • Current coverage: 2025-09-16 → today (2.6M rows)
  • Theoretical max: USGS NWIS has streamflow back to early 1900s for some sites; modern instantaneous-values data back to ~2007 systemwide
  • Gap: ~18 years for IV data; longer for daily
  • Backfill feasibility: NWIS supports date-range parameters. Per-state pull (already used in fetcher).
  • Effort: M
  • Active arc relevance: Medium. Lunar tidal streamflow paper #130, California streamflow workspace.

noaa_tides

  • Current coverage: 2025-09-15 → today (320k rows)
  • Theoretical max: CO-OPS tidal data 100+ years for some major stations
  • Gap: ~100 years for deep-history stations; ~10 years for most modern coverage
  • Backfill feasibility: CO-OPS API supports begin_date / end_date parameters.
  • Effort: M
  • Active arc relevance: Low. No active arc beyond reference data.

usdm_drought

  • Current coverage: 2010-01-04 → 2026-05-11 (13k rows)
  • Theoretical max: USDM has weekly drought maps from 2000-01-04
  • Gap: 10 years pre-2010
  • Backfill feasibility: USDM archives weekly shapefiles back to 2000.
  • Effort: S
  • Active arc relevance: Low.

Radio propagation

wspr

  • Current coverage: 2026-03-31 → today (18k rows; corridor-aggregated)
  • Theoretical max: WSPRnet.org goes back to ~2009
  • Gap: ~16 years
  • Backfill feasibility: Per-month dumps from wspr.live (archived). 21-year-census workspace wspr-21year-census/ already pulled a representative sample. Bulk historical via clickhouse-style queries to wspr.live.
  • Effort: L — but high payoff
  • Active arc relevance: Very high. WSPR-tornado V2 historical (#142) is exactly this gap. The same data unlocks tornado replications back to the 2010s.

hamqsl_propagation

  • Current coverage: 2026-03-29 → today (7.7k rows)
  • Theoretical max: HamQSL publishes current-conditions only; no historical archive
  • Gap: N/A (provider doesn't archive)
  • Backfill feasibility: Not possible from the source.
  • Effort: N/A

lwa_spectra

  • Current coverage: 2026-02-25 only (1.6k rows; effectively one snapshot)
  • Theoretical max: LWA-1 dynamic spectra archive at LWA observatory; multi-year history but bandwidth-limited downloads
  • Gap: ~5+ years
  • Backfill feasibility: Per-day FITS file downloads. Heavy bandwidth.
  • Effort: L
  • Active arc relevance: Medium. Cross-validation for WSPR-tornado D-layer hypothesis (#96, #102).

Seismic

usgs_earthquake

  • Current coverage: 2021-04-12 → today (185k rows after dedup)
  • Theoretical max: USGS comprehensive catalog back to 1900
  • Gap: ~121 years
  • Backfill feasibility: USGS FDSN supports decade-long pulls; magnitude filter helps trim volume. Issue #25 already proposes this work.
  • Effort: M
  • Active arc relevance: Medium. Earthquake-solar-cycle overlay studies (#25 spec).
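
A decade-at-a-time pull with a magnitude filter might be planned as below. This is a sketch under stated assumptions: the 5.0 magnitude floor and the function name are illustrative, and the code only builds query URLs for the public USGS FDSN event endpoint without fetching anything. If the service caps results per request, the more recent windows may need to be narrower than a decade.

```python
from datetime import date
from urllib.parse import urlencode

FDSN_EVENT_URL = "https://earthquake.usgs.gov/fdsnws/event/1/query"

def decade_queries(first_year, end_date, min_magnitude=5.0):
    """Build one query URL per window of at most ten years, up to end_date."""
    urls = []
    start = date(first_year, 1, 1)
    while start < end_date:
        end = min(date(start.year + 10, 1, 1), end_date)
        params = {
            "format": "geojson",
            "starttime": start.isoformat(),
            "endtime": end.isoformat(),
            "minmagnitude": min_magnitude,  # assumed floor, tune per study
        }
        urls.append(f"{FDSN_EVENT_URL}?{urlencode(params)}")
        start = end
    return urls

# Gap from the catalog start (1900) to where current coverage begins.
urls = decade_queries(1900, date(2021, 4, 12))
```

Each window ends exactly where the next begins, so an event on a boundary timestamp could appear in two responses and should be deduplicated on the event ID.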

emsc / gfz_geofon / isc

  • Current coverage: all started 2026-03-31 (negligible history)
  • Theoretical max: EMSC catalog ~1998+; GFZ Geofon ~1992+; ISC bulletin ~1900+
  • Gap: Decades each
  • Backfill feasibility: Each has its own bulk-download path. Mostly redundant with USGS — backfilling all three doesn't add much beyond what a deeper USGS pull would give.
  • Effort: M per source
  • Active arc relevance: Low. Recommend prioritizing USGS deep backfill instead.

Climate / GHG

noaa_co2

  • Current coverage: 1974-05-18 → 2026-05-09 (33k rows)
  • Theoretical max: Mauna Loa CO2 record from 1958, NOAA flask network from ~1968
  • Gap: ~16 years against the Mauna Loa record (1958); ~6 years against the flask network (~1968)
  • Backfill feasibility: NOAA GML hosts the full Mauna Loa archive. Already mostly covered.
  • Effort: S
  • Active arc relevance: Low.

noaa_climate_indices

  • Current coverage: 1949-12-31 → 2025-12-31 (4.7k rows)
  • Theoretical max: Same range; index series typically begin 1948-1950
  • Gap: None historically; recent updates land annually
  • Backfill feasibility: N/A
  • Note: Already covered.

world_bank

  • Current coverage: 1959-12-31 → 2022-12-31 (17k rows)
  • Theoretical max: Same range; World Bank Open Data publishes new year ~Q3 of following year
  • Gap: 2023, 2024, and likely 2025 missing
  • Backfill feasibility: Re-pull when WB publishes annual updates.
  • Effort: S
  • Active arc relevance: Low.

nasa_power

  • Current coverage: 2022-12-31 → 2026-05-03 (11k rows)
  • Theoretical max: NASA POWER dataset goes back to 1981
  • Gap: ~41 years
  • Backfill feasibility: POWER API supports daily aggregates over decadal windows. Per-grid-cell pulls.
  • Effort: L (volume × spatial resolution)
  • Active arc relevance: Low.

Cosmic rays

nmdb_cosmic_rays

  • Current coverage: 2026-04-02 → today (3.8k rows)
  • Theoretical max: NMDB neutron monitors back to 1957 for some stations (Oulu, Apatity)
  • Gap: ~69 years
  • Backfill feasibility: NMDB REST API supports custom date ranges per station.
  • Effort: M
  • Active arc relevance: Medium. Forbush decrease catalog #103 needs this depth.

Astronomy / transients

fink_transients

  • Current coverage: 2026-02-09 → today (10k rows)
  • Theoretical max: ZTF survey ~2017+, Fink archive from ~2020
  • Gap: ~9 years for ZTF; ~6 for Fink-curated alerts
  • Backfill feasibility: Fink hosts a public bulk-download API.
  • Effort: M
  • Active arc relevance: Low-medium. ZTF×space-weather workspace already prototyped.

gwosc_events

  • Current coverage: 2005-11-03 → 2025-01-14 (758 rows)
  • Theoretical max: GWOSC event catalog covers all advanced-LIGO observation runs (O1-O4)
  • Gap: None
  • Note: Already covered.

Recommended priorities

If picking what to backfill next, ranked by value-to-active-arc × effort-inverse:

  1. WSPR historical via wspr.live (#142 path; effort L) — biggest unlock for WSPR-tornado replications across 16 years of historical events. Highest-value target.
  2. USGS earthquake to 1900 (#25; effort M) — opens earthquake-solar-cycle and earthquake-tidal overlay studies.
  3. DSCOVR backfill to 2016 (effort M) — unlocks solar-wind→Kp lag at proper statistics depth (#131).
  4. GOES X-ray to 1986 (effort L) — solar flare watch papers across multiple cycles.
  5. NASA DONKI to 2010 (effort S, easy win) — feeds the cascade-triggers paper (#76).
  6. NMDB cosmic rays decadal (effort M) — Forbush catalog (#103).
  7. IGRA POR (period-of-record) per station (effort L). Opens historical V-arc replications; very high payoff but the biggest time investment.
  8. NWS alerts pre-2021 via IEM (effort M) — generalizes the SPC-tornado backfill to all alert types.

Lower-priority single-shot fixes:

  • world_bank annual update (2023+ missing). Easy refresh.

(The dst_index and silso_sunspots "stale" entries in earlier drafts of this matrix were false positives from confusing MAX(timestamp_utc) with MAX(created_at). Both fetchers are healthy — they just publish on lagged schedules. See #192 closure for the diagnostic walkthrough.)
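
The distinction above can be checked mechanically by comparing the newest observation time with the newest ingestion time per source. The sketch below assumes `created_at` lives on the same observations table as `timestamp_utc`. The 2-day tolerance, the function name, and the sample rows (including the `hypothetical_stalled` source) are illustrative only.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

def diagnose_staleness(conn, now, tolerance=timedelta(days=2)):
    """Separate fetchers that stopped writing from providers that publish late."""
    report = {}
    query = """
        SELECT source_id, MAX(timestamp_utc), MAX(created_at)
        FROM observations
        GROUP BY source_id
    """
    for source_id, last_obs, last_ingest in conn.execute(query):
        ingest_age = now - datetime.fromisoformat(last_ingest)
        obs_age = now - datetime.fromisoformat(last_obs)
        if ingest_age > tolerance:
            report[source_id] = "fetcher stale"   # nothing written recently
        elif obs_age > tolerance:
            report[source_id] = "provider lag"    # writing, but data is old
        else:
            report[source_id] = "current"
    return report

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observations (source_id TEXT, timestamp_utc TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [
        ("dst_index", "2026-04-13T00:00:00+00:00", "2026-05-13T06:00:00+00:00"),
        ("hypothetical_stalled", "2026-04-01T00:00:00+00:00", "2026-04-01T00:05:00+00:00"),
    ],
)
report = diagnose_staleness(conn, datetime(2026, 5, 13, 12, 0, tzinfo=timezone.utc))
```

Checking ingestion age first is the design choice that matters: a source is only flagged as a fetcher problem when nothing has been written recently, regardless of how old the newest observation is.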


What's not in this matrix

  • The ~100 AutoSense-discovered EU/EPA catalog sources. These are static one-shot datasets (e.g., "2009 USVI hurricane survey data"). They're loaded fully on first ingest; there's no time-series backfill question. They show up as a single horizontal row in observations from their ingestion date.
  • Sources with provider-side "current-conditions only" policies (hamqsl_propagation). Provider doesn't archive.
  • Sources where current_coverage already equals theoretical_max (ibtracs_hurricanes, silso_sunspots, noaa_climate_indices, gwosc_events).