Backfill gap matrix
For each curated source: how much of its theoretically-available history TerraPulse currently holds, the gap to its theoretical maximum, and a rough backfill-effort estimate. Use this as a prioritization frame for "what should we backfill next?" rather than "is this source loading right now."
Built 2026-05-13 from the live observations table. Coverage figures are the actual MIN(timestamp_utc) / MAX(timestamp_utc) per source_id at extract time. Theoretical-max columns are sourced from the public docs of each provider.
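The coverage extract described above can be sketched as a single grouped query. This is a minimal illustration, not the actual extract script: it uses an in-memory SQLite table as a stand-in for the live observations table, models only the two columns named here (source_id, timestamp_utc), and the sample rows are made up to mirror two entries in this matrix.

```python
import sqlite3

# Hypothetical stand-in for the live observations table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (source_id TEXT, timestamp_utc TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [
        ("usdm_drought", "2010-01-04T00:00:00Z"),
        ("usdm_drought", "2026-05-11T00:00:00Z"),
        ("nasa_donki", "2025-09-16T00:00:00Z"),
        ("nasa_donki", "2026-05-12T00:00:00Z"),
    ],
)

# ISO-8601 UTC strings sort lexicographically, so MIN/MAX give the
# earliest and latest observation per source.
COVERAGE_SQL = """
SELECT source_id,
       MIN(timestamp_utc) AS current_start,
       MAX(timestamp_utc) AS current_end,
       COUNT(*)           AS n_rows
FROM observations
GROUP BY source_id
ORDER BY source_id
"""

coverage = {row[0]: row[1:] for row in conn.execute(COVERAGE_SQL)}
for source_id, (start, end, n_rows) in coverage.items():
    print(f"{source_id}: {start} -> {end} ({n_rows} rows)")
```

Re-running the same query at a later date is enough to refresh the "Current coverage" lines; the theoretical-max side still has to come from provider docs.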
Convention:
- "Already covered" — current_start ≤ theoretical_start, no meaningful gap
- "Modest gap" — < 5 years missing
- "Major gap" — ≥ 5 years missing and the missing years are useful for active or queued research
- "Effort:" S/M/L (sub-day, day-or-two, week-plus)
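As a rough illustration of the gap convention above, the labels could be assigned like this. The function name, the research-value flag, and the fallback label for large gaps with no research use are assumptions; the convention itself does not name that last case. Only start-side gaps are handled (missing recent years, as with spc_tornado_history, are a separate question).

```python
from datetime import date


def classify_gap(current_start: date, theoretical_start: date,
                 useful_for_research: bool = True) -> str:
    """Label a start-side coverage gap using the convention in this file."""
    if current_start <= theoretical_start:
        return "Already covered"
    missing_years = (current_start - theoretical_start).days / 365.25
    if missing_years < 5:
        return "Modest gap"
    if useful_for_research:
        return "Major gap"
    # Not defined by the convention; hypothetical label for this sketch.
    return "Large gap (low research value)"


# silso_sunspots: holdings start one day before the series officially begins.
print(classify_gap(date(1817, 12, 31), date(1818, 1, 1)))
# dscovr_solar_wind: holdings start 2025-10-02, data continuous since mid-2016.
print(classify_gap(date(2025, 10, 2), date(2016, 7, 1)))
```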
This file is partial — it lists sources tied to active research arcs and high-traffic ingestion. The long tail of one-shot EU/EPA catalog sources is omitted (they're static datasets, no meaningful backfill question). Add rows when a new arc begins or when a source's coverage shifts.
Severe weather
nws_alerts
- Current coverage: 2021-01-01 → today (708k rows)
- Theoretical max: ~2007 (iNWS Common Alerting Protocol rollout); some pre-CAP feeds go back further
- Gap: ~14 years pre-2021
- Backfill feasibility: Iowa Environmental Mesonet archives NWS alerts via API back to ~2005. Already used for SPC tornado backfill (#174). Same path works for general NWS alert types.
- Effort: M — needs alert-type stratification and dedup logic
- Active arc relevance: High. Underpins severe-weather climatology and cross-source cascade studies.
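The "alert-type stratification and dedup logic" mentioned for nws_alerts might look roughly like the sketch below. All field names (event_type, office, event_id, issued_utc) are hypothetical; the real keys depend on what the IEM archive returns and on how the existing fetcher identifies an alert.

```python
from collections import defaultdict


def stratify_and_dedup(alerts):
    """Group alert records by type and drop exact repeats of the same alert.

    Field names are illustrative assumptions, not the actual schema.
    """
    seen = set()
    by_type = defaultdict(list)
    for alert in alerts:
        key = (alert["event_type"], alert["office"],
               alert["event_id"], alert["issued_utc"])
        if key in seen:
            continue  # same alert already ingested (e.g. overlapping pulls)
        seen.add(key)
        by_type[alert["event_type"]].append(alert)
    return dict(by_type)


sample = [
    {"event_type": "Tornado Warning", "office": "DMX", "event_id": 12, "issued_utc": "2011-04-27T20:15Z"},
    {"event_type": "Tornado Warning", "office": "DMX", "event_id": 12, "issued_utc": "2011-04-27T20:15Z"},
    {"event_type": "Flood Warning", "office": "DMX", "event_id": 3, "issued_utc": "2011-04-27T21:00Z"},
]
grouped = stratify_and_dedup(sample)
print({event_type: len(rows) for event_type, rows in grouped.items()})
```

Keeping the first occurrence is an arbitrary choice here; if later copies of an alert carry updates, the key would need to include a revision marker instead.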
spc_reports (live)
- Current coverage: 2026-03-22 → today (49k rows)
- Theoretical max: SPC online catalog from 1950 (storm reports), continuously updated
- Gap: 76 years
- Backfill feasibility: SPC publishes annual CSVs. spc_tornado_history already covers tornadoes 1950-2023. Hail and wind reports come from the same source as separate CSVs.
- Effort: M for hail+wind history; tornadoes already done (#174)
- Active arc relevance: High. Negative-class controls for WSPR-tornado V5/V2-of-V3.
spc_tornado_history (backfill-only source)
- Current coverage: 1950-01-03 → 2023-12-19 (70k rows)
- Theoretical max: Same; SPC publishes 2024 + 2025 in annual updates
- Gap: the 2 most recent years (2024, 2025) are missing
- Backfill feasibility: Annual CSV publish cycle. Re-pull 2024 + 2025 when SPC issues them.
- Effort: S
- Active arc relevance: High but already largely done.
spc_outlook
- Current coverage: 2026-03-23 → today (3.3k rows)
- Theoretical max: SPC has archived day-1/day-2/day-3 convective outlooks since 2002 (varies by product)
- Gap: 24 years
- Backfill feasibility: SPC archives outlooks as shapefiles. Path exists but heavier — need shapefile parsing + KML/GeoJSON conversion.
- Effort: L
- Active arc relevance: Medium. Useful for forecast-verification studies (#75).
Space weather
dscovr_solar_wind
- Current coverage: 2025-10-02 → today (6.9M rows)
- Theoretical max: DSCOVR launched 2015-02; L1 solar wind data continuous since mid-2016
- Gap: ~9 years
- Backfill feasibility: NOAA hosts DSCOVR archive on their CDF servers. Bulk download + per-day parse.
- Effort: M
- Active arc relevance: High. Solar wind → Kp lag (#131) needs deep history for proper statistics.
goes_xray
- Current coverage: 2026-03-17 → today (13M rows)
- Theoretical max: GOES X-ray flux archive goes back to 1986 (GOES-6 onward). Continuous across satellites.
- Gap: 40 years
- Backfill feasibility: NOAA SWPC ftp archive. Annual files. Many satellite generations to harmonize.
- Effort: L (~week)
- Active arc relevance: High. Solar-flare-watch papers (#28 in spec queue, x-flare-watch workspaces already done at small scale).
nasa_donki
- Current coverage: 2025-09-16 → today (30k rows)
- Theoretical max: NASA DONKI catalog goes back to 2010-08-01
- Gap: 15 years
- Backfill feasibility: DONKI REST API supports historical date range queries. Walk the date window.
- Effort: S (single script, well-defined API)
- Active arc relevance: Medium. Cascade triggers paper #76 wants this.
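"Walk the date window" for DONKI could be sketched as below. The 30-day window size, the helper names, and the startDate/endDate parameter names are assumptions to verify against the DONKI API documentation; the start and end dates come from the coverage figures above. No request is made here, the sketch only produces the windows.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int = 30) -> Iterator[Tuple[date, date]]:
    """Yield inclusive, non-overlapping (window_start, window_end) pairs."""
    if days < 1:
        raise ValueError("days must be >= 1")
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=days - 1), end)
        yield cursor, window_end
        cursor = window_end + timedelta(days=1)


def window_params(window_start: date, window_end: date) -> dict:
    # Parameter names are an assumption; confirm against the DONKI docs.
    return {"startDate": window_start.isoformat(), "endDate": window_end.isoformat()}


# Gap to fill: catalog start up to the day before current coverage begins.
windows = list(date_windows(date(2010, 8, 1), date(2025, 9, 15)))
print(len(windows), window_params(*windows[0]))
```

Because the windows are contiguous and inclusive, a failed window can be retried on its own without re-pulling its neighbours.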
dst_index
- Current coverage: 2026-03-03 → ~30 days ago (2.2k rows). Wall-clock-to-MAX(timestamp_utc) shows ~30 days; this is the WDC Kyoto intrinsic provisional-validation lag, not an ingestion bug (verified #192).
- Theoretical max: Dst index 1957-onwards from WDC Kyoto
- Gap: ~69 years pre-2026; the 30-day "lag" is a property of the dataset, not a fixable gap
- Backfill feasibility: WDC Kyoto distributes hourly Dst as text files. Straightforward backfill.
- Effort: S
- Active arc relevance: Medium. Geomagnetic storm onset timing (#131). Note that any analysis using Dst is implicitly working with values lagged ~30 days.
silso_sunspots
- Current coverage: 1817-12-31 → ~1 day behind today's date (754k rows). The current MAX(timestamp_utc) reflects SILSO's own publishing cadence (~1-day lag) rather than an ingestion bug (verified #192).
- Theoretical max: SILSO daily v2.0 series goes back to 1818 (essentially fully covered)
- Gap: None
- Backfill feasibility: N/A — already covered
- Effort: N/A
intermagnet
- Current coverage: 2026-04-12 → 2026-05-11 (11k rows)
- Theoretical max: INTERMAGNET network goes back to 1991; varies per station
- Gap: ~35 years
- Backfill feasibility: INTERMAGNET requires registration for bulk historical pulls. Per-station, per-day files.
- Effort: L (registration + parsing across ~30 stations × decades)
- Active arc relevance: Medium. Ground-magnetometer cross-validation for ionospheric studies.
Atmospheric (radiosondes)
igra_soundings
- Current coverage: 2024-12-31 → today (103M rows, ~2,747 launches across ~89 stations)
- Theoretical max: NCEI IGRA2 goes back to 1905 for some stations; most US stations from 1947+
- Gap: ~78 years for older stations
- Backfill feasibility: NCEI publishes "data-por" (period-of-record) zip files per station alongside the y2d files we use. Same parser works.
- Effort: L — ~78 years × 89 stations × decompress+parse. Probably 6-12 hours of run time.
- Active arc relevance: Very high. Would unblock historical V-arc replications (e.g., 2011 Super Outbreak CAPE/LI vs WSPR cross-checks).
- Note: Current y2d coverage was just backfilled (2026-05-06). POR backfill is the natural next step but a real time investment.
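As a sanity check on the IGRA run-time estimate, the arithmetic behind "~78 years × 89 stations" and "6-12 hours" can be worked through as follows. This treats every station as having the full 78 years, which overstates the volume (most stations start later), so the per-station-year budget is a lower bound on the time available per unit of work.

```python
YEARS = 78
STATIONS = 89
STATION_YEARS = YEARS * STATIONS  # upper bound on station-years to process


def seconds_per_station_year(run_hours: float) -> float:
    """Time budget per station-year implied by a total run time."""
    return run_hours * 3600 / STATION_YEARS


print(STATION_YEARS)  # 6942
for hours in (6, 12):
    print(f"{hours} h -> {seconds_per_station_year(hours):.1f} s per station-year")
```

So the 6-12 hour figure assumes decompress+parse runs at roughly 3 to 6 seconds per station-year; timing the parser on one station's POR file would confirm or adjust that.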
Hurricanes
ibtracs_hurricanes
- Current coverage: 1842-10-24 → today (766k rows)
- Theoretical max: IBTrACS v04 goes back to 1842 (essentially fully covered)
- Gap: None
- Backfill feasibility: N/A
- Note: Already covered.
Hydrology
usgs_water
- Current coverage: 2025-09-16 → today (2.6M rows)
- Theoretical max: USGS NWIS has streamflow back to early 1900s for some sites; modern instantaneous-values data back to ~2007 systemwide
- Gap: ~18 years for IV data; longer for daily
- Backfill feasibility: NWIS supports date-range parameters. Per-state pull (already used in fetcher).
- Effort: M
- Active arc relevance: Medium. Lunar tidal streamflow paper #130, California streamflow workspace.
noaa_tides
- Current coverage: 2025-09-15 → today (320k rows)
- Theoretical max: CO-OPS tidal data 100+ years for some major stations
- Gap: ~100 years for deep-history stations; ~10 years for most modern coverage
- Backfill feasibility: CO-OPS API supports begin_date/end_date parameters.
- Effort: M
- Active arc relevance: Low. No active arc beyond reference data.
usdm_drought
- Current coverage: 2010-01-04 → 2026-05-11 (13k rows)
- Theoretical max: USDM has weekly drought maps from 2000-01-04
- Gap: 10 years pre-2010
- Backfill feasibility: USDM archives weekly shapefiles back to 2000.
- Effort: S
- Active arc relevance: Low.
Radio propagation
wspr
- Current coverage: 2026-03-31 → today (18k rows; corridor-aggregated)
- Theoretical max: WSPRnet.org goes back to ~2009; we have ~16 years missing
- Gap: ~16 years
- Backfill feasibility: Per-month dumps from wspr.live (archived). 21-year-census workspace wspr-21year-census/ already pulled a representative sample. Bulk historical via clickhouse-style queries to wspr.live.
- Effort: L (but high payoff)
- Active arc relevance: Very high. WSPR-tornado V2 historical (#142) is exactly this gap. The same data unlocks tornado replications all the way back to the 2010s.
hamqsl_propagation
- Current coverage: 2026-03-29 → today (7.7k rows)
- Theoretical max: HamQSL publishes current-conditions only; no historical archive
- Gap: N/A (provider doesn't archive)
- Backfill feasibility: Not possible from the source.
- Effort: N/A
lwa_spectra
- Current coverage: 2026-02-25 only (1.6k rows; effectively one snapshot)
- Theoretical max: LWA-1 dynamic spectra archive at LWA observatory; multi-year history but bandwidth-limited downloads
- Gap: ~5+ years
- Backfill feasibility: Per-day FITS file downloads. Heavy bandwidth.
- Effort: L
- Active arc relevance: Medium. Cross-validation for WSPR-tornado D-layer hypothesis (#96, #102).
Seismic
usgs_earthquake
- Current coverage: 2021-04-12 → today (185k rows after dedup)
- Theoretical max: USGS comprehensive catalog back to 1900
- Gap: ~121 years
- Backfill feasibility: USGS FDSN supports decade-long pulls; magnitude filter helps trim volume. Issue #25 already proposes this work.
- Effort: M
- Active arc relevance: Medium. Earthquake-solar-cycle overlay studies (#25 spec).
emsc / gfz_geofon / isc
- Current coverage: all started 2026-03-31 (negligible history)
- Theoretical max: EMSC catalog ~1998+; GFZ Geofon ~1992+; ISC bulletin ~1900+
- Gap: Decades each
- Backfill feasibility: Each has its own bulk-download path. Mostly redundant with USGS — backfilling all three doesn't add much beyond what a deeper USGS pull would give.
- Effort: M per source
- Active arc relevance: Low. Recommend prioritizing USGS deep backfill instead.
Climate / GHG
noaa_co2
- Current coverage: 1974-05-18 → 2026-05-09 (33k rows)
- Theoretical max: Mauna Loa CO2 record from 1958, NOAA flask network from ~1968
- Gap: ~16 years pre-1974 against the Mauna Loa record (from 1958); ~6 years against the flask network (from ~1968)
- Backfill feasibility: NOAA GML hosts the full Mauna Loa archive. Already mostly covered.
- Effort: S
- Active arc relevance: Low.
noaa_climate_indices
- Current coverage: 1949-12-31 → 2025-12-31 (4.7k rows)
- Theoretical max: Same range; index series typically begin 1948-1950
- Gap: None historically; recent updates land annually
- Backfill feasibility: N/A
- Note: Already covered.
world_bank
- Current coverage: 1959-12-31 → 2022-12-31 (17k rows)
- Theoretical max: Same range; World Bank Open Data publishes new year ~Q3 of following year
- Gap: 2023, 2024, and likely 2025 missing
- Backfill feasibility: Re-pull when WB publishes annual updates.
- Effort: S
- Active arc relevance: Low.
nasa_power
- Current coverage: 2022-12-31 → 2026-05-03 (11k rows)
- Theoretical max: NASA POWER dataset goes back to 1981
- Gap: ~41 years
- Backfill feasibility: POWER API supports daily aggregates over decadal windows. Per-grid-cell pulls.
- Effort: L (volume × spatial resolution)
- Active arc relevance: Low.
Cosmic rays
nmdb_cosmic_rays
- Current coverage: 2026-04-02 → today (3.8k rows)
- Theoretical max: NMDB neutron monitors back to 1957 for some stations (Oulu, Apatity)
- Gap: ~69 years
- Backfill feasibility: NMDB REST API supports custom date ranges per station.
- Effort: M
- Active arc relevance: Medium. Forbush decrease catalog #103 needs this depth.
Astronomy / transients
fink_transients
- Current coverage: 2026-02-09 → today (10k rows)
- Theoretical max: ZTF survey ~2017+, Fink archive from ~2020
- Gap: ~9 years for ZTF; ~6 for Fink-curated alerts
- Backfill feasibility: Fink hosts a public bulk-download API.
- Effort: M
- Active arc relevance: Low-medium. ZTF×space-weather workspace already prototyped.
gwosc_events
- Current coverage: 2005-11-03 → 2025-01-14 (758 rows)
- Theoretical max: GWOSC event catalog covers all advanced-LIGO observation runs (O1-O4)
- Gap: None
- Note: Already covered.
Recommended priorities
If picking what to backfill next, ranked by value-to-active-arc × effort-inverse:
- WSPR historical via wspr.live (#142 path; effort L) — biggest unlock for WSPR-tornado replications across 16 years of historical events. Highest-value target.
- USGS earthquake to 1900 (#25; effort M) — opens earthquake-solar-cycle and earthquake-tidal overlay studies.
- DSCOVR backfill to 2016 (effort M) — unlocks solar-wind→Kp lag at proper statistics depth (#131).
- GOES X-ray to 1986 (effort L) — solar flare watch papers across multiple cycles.
- NASA DONKI to 2010 (effort S, easy win) — feeds the cascade-triggers paper (#76).
- NMDB cosmic rays decadal (effort M) — Forbush catalog (#103).
- IGRA POR (period-of-record) per station (effort L) — opens historical V-arc replications. Highest payoff but biggest time investment.
- NWS alerts pre-2021 via IEM (effort M) — generalizes the SPC-tornado backfill to all alert types.
Lower-priority single-shot fixes:
- world_bank annual update (2023+ missing). Easy refresh.
(The dst_index and silso_sunspots "stale" entries in earlier drafts of this matrix were false positives from confusing MAX(timestamp_utc) with MAX(created_at). Both fetchers are healthy — they just publish on lagged schedules. See #192 closure for the diagnostic walkthrough.)
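The distinction between MAX(timestamp_utc) and MAX(created_at) can be illustrated with a small check. This is a sketch of the idea rather than the diagnostic used in #192: the function name and the 2-day ingestion tolerance are assumptions, and the example values are only shaped like the dst_index case.

```python
from datetime import datetime, timedelta, timezone


def diagnose_staleness(max_timestamp_utc, max_created_at, now,
                       ingest_tolerance=timedelta(days=2)):
    """Separate 'fetcher stopped writing' from 'provider publishes late'."""
    ingest_age = now - max_created_at    # time since a row was last written
    data_lag = now - max_timestamp_utc   # age of the newest observation
    if ingest_age > ingest_tolerance:
        status = "ingestion stale"
    else:
        status = "fetcher healthy"
    return status, data_lag.days


now = datetime(2026, 5, 13, tzinfo=timezone.utc)
# Rows were written a few hours ago, but the newest observation is 30 days old.
result = diagnose_staleness(
    max_timestamp_utc=datetime(2026, 4, 13, tzinfo=timezone.utc),
    max_created_at=now - timedelta(hours=3),
    now=now,
)
print(result)
```

A large data lag with a fresh created_at points at the provider's schedule; an old created_at points at the fetcher, whatever the data lag is.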
What's not in this matrix
- The ~100 AutoSense-discovered EU/EPA catalog sources. These are static one-shot datasets (e.g., "2009 USVI hurricane survey data"). They're loaded fully on first ingest; there's no time-series backfill question. They show up as a single horizontal row in observations from their ingestion date.
- Sources with provider-side "current-conditions only" policies (hamqsl_propagation). Provider doesn't archive.
- Sources where current_coverage already equals theoretical_max (ibtracs_hurricanes, silso_sunspots, noaa_climate_indices, gwosc_events).