Yeardex and Multi-Source Cited Slots — Framework Extensions

Status: framework extension · Opened: 2026-06-16 · Owner: Mike + Claude (engine room) · Parent: docs/event-spine-framework.md (the Eventdex framework)

These two ideas are first-class additions to that framework, not a side branch. Mike coined both on 2026-06-16 while deciding what to do with the data-source links he had uploaded through /admin/datasources.

This doc defines two things:

Yeardex — a sibling of the Eventdex treatment where the slot is a calendar year instead of a discrete event. It gives a home to the data that is real and measured but has no discrete events to index: annual statistical series, inventories, and research datasets.
Multi-source cited slots — the rule that any slot, in Eventdex or Yeardex, may carry data from several sources at once, with every datum citing where it came from. This generalizes the entry kind's multi-source slot space to a universal property of the storehouse, and it is how overlapping sources enrich one slot instead of duplicating it.

The motivating discovery: of the ~338 links uploaded by hand through the admin UI, only a couple are discrete-event catalogs (NOAA Storm Events; FEMA, already built as kind #10). The rest are annual statistical tables (USDA land use 1945–2017, milk supply), EPA/DOE research datasets, and hourly station series. Under plain Eventdex those links are "not event-shaped" and get discarded. Mike's reframe: nothing uploaded is wasted. Event-shaped links become Eventdex slots; year-shaped links become Yeardex slots; overlapping links enrich an existing slot, cited.

The measured-reality bright line is unchanged and binding on both extensions: a slot, whether it is an event or a year, accretes measurements of what physically happened, never a model's estimate of what will, might, or "would have" happened.

Part 1 — Yeardex: when the slot is a year

The idea, plainly

An Eventdex slot answers "what happened in this event." A Yeardex slot answers "what was measured in this year." A year is a stable, addressable, revisitable thing with an ID (the year number), exactly like an event has a stable ID. So a year qualifies as a slot under the same rules that let a hurricane or an earthquake be a slot. The difference is only the axis: Eventdex indexes by event, Yeardex indexes by year.

This rescues every link that records a real quantity over time but has no discrete events to point at. "US cropland area, by state, 1945–2017" is not a catalog of events. It is a measurement of a real thing, reported for specific years. Each reported year is a slot; the slot accretes that year's cropland figure (and everything else measured that year). The series stops being un-indexable and becomes a timeline of year-slots.

What qualifies for Yeardex (vs Eventdex)

Use Eventdex when the data is a catalog of discrete, dated, ideally located events with stable IDs (storms, quakes, eruptions, declarations, fireballs). Use Yeardex when the data is a real, measured quantity reported on a regular calendar cadence with no discrete events to index:

Annual statistical series — USDA land-use tables, milk supply and utilization, food imports, bioenergy statistics. One row per year (often per year × state); the year is the natural slot.
Inventories and registries reported periodically — greenhouse-gas emission totals by year, facility counts. The measured quantity is "as of year Y."
Research datasets keyed to a study year — many of the data.gov EPA/DOE links are a single year's field campaign or survey. The study year is the slot.

The qualification test mirrors the spine test from the parent framework:

A stable period ID — the year (or year × jurisdiction). No period, no slot.
Measured-reality provenance — the figure records something that was measured or counted, not projected. A projection of 2030 emissions fails the bright line; a recorded 2017 cropland acreage passes.
A regular cadence — annual is the canonical Yeardex cadence and the one this doc freezes. (Monthly or quarterly series are a later question; default is to roll them up to the year slot and keep the finer grain inside the slot, never to mint sub-year slots without a scope freeze.)
Depth — a historical run worth backfilling, so the timeline has mass on day one (USDA's 1945→ runs are ideal).
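The four-part test above can be sketched as a checklist function. This is a minimal illustration, not the project's code: the SeriesCandidate fields, the failure messages, and the MIN_DEPTH_YEARS threshold are all assumptions (the framework sets no numeric depth floor), and treating a non-annual cadence as a failure is one reading of the roll-up default.

```python
from dataclasses import dataclass
from typing import List

# Illustrative depth threshold; the framework does not set a number.
MIN_DEPTH_YEARS = 10


@dataclass
class SeriesCandidate:
    """Hypothetical reviewer-filled description of one uploaded link."""
    name: str
    years: List[int]   # years the series reports; empty means no period ID
    measured: bool     # True if recorded or counted, False if projected or modeled
    cadence: str       # e.g. "annual", "monthly", "quarterly"


def yeardex_failures(c: SeriesCandidate) -> List[str]:
    """Return the qualification tests the candidate fails (empty list = qualifies)."""
    failures = []
    if not c.years:
        failures.append("no stable period ID")
    if not c.measured:
        failures.append("fails measured-reality bright line")
    if c.cadence != "annual":
        failures.append("cadence is not annual; roll up to the year or scope-freeze first")
    if len(set(c.years)) < MIN_DEPTH_YEARS:
        failures.append("historical run too shallow to backfill")
    return failures


# Hypothetical candidates, for illustration only.
recorded = SeriesCandidate("example recorded series", list(range(2000, 2018)), True, "annual")
projection = SeriesCandidate("2030 emissions projection", [2030], False, "annual")
print(yeardex_failures(recorded))
print(yeardex_failures(projection))
```

Returning the list of failed tests, rather than a bare boolean, keeps the reason for a rejection visible when a link is reviewed.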

Slot ID and granularity (the open scope decision)

The slot key generalizes from (kind, event_id) to (kind, period_id), where period_id is the year. The kind scopes the subject; the year is the slot within it. Two shapes are on the table, and choosing between them is the first decision each Yeardex scope-freeze must make. It is not frozen here; it is a per-kind call, the way the magnitude floor was the per-kind call for eq:

Subject-scoped kinds (recommended default). One kind per coherent subject, slot = year. E.g. kind = us-landuse, slots 1945, 1949, …, 2017, each year accreting that year's land-use figures from every source that reports them, cited. Clean, queryable, and mirrors how eq/tor/vol each own one subject. The risk is proliferation (many thin subject-kinds).
One national timeline. A single kind = us-annual, slot = year, each year accreting all annual US measurements for that year (land use + emissions + agriculture + …), every field cited to its source. Mirrors FEMA's "one timeline of the nation" elegance, but mixes unrelated quantities in one slot and makes the slot a grab-bag. Likely too coarse; recorded here as the considered alternative.

The slot ID string within a kind is the bare year (2017, stored as 2017.json), exactly as entry uses native IDs and fema uses DR-4480. Kind directory: data/year_storehouse/<kind>/ (a sibling of data/event_storehouse/), or a kind-tagged subtree of the same storehouse; this is a build decision for the first Yeardex brick, not a framework freeze.
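The (kind, period_id) key and the bare-year file name can be sketched as follows. The SlotKey type and both helper names are hypothetical; the directory is the separate data/year_storehouse/ option named above, and a kind-tagged subtree would change only the root.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

YEAR_STOREHOUSE = PurePosixPath("data/year_storehouse")


class SlotKey(NamedTuple):
    kind: str       # scopes the subject, e.g. "us-landuse"
    period_id: str  # the bare year as a string, e.g. "2017"


def year_slot_key(kind: str, year: int) -> SlotKey:
    """Build the (kind, period_id) key for one year slot."""
    if year <= 0:
        raise ValueError(f"not a calendar year: {year!r}")
    return SlotKey(kind, str(year))


def slot_path(key: SlotKey) -> PurePosixPath:
    """Map a slot key to its JSON file: <storehouse>/<kind>/<year>.json."""
    return YEAR_STOREHOUSE / key.kind / f"{key.period_id}.json"


key = year_slot_key("us-landuse", 2017)
print(slot_path(key))
```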

How a year slot accretes data

A Yeardex slot is catalog-first, sweep-none. There is no spatial sweep: a year is not a place, so there is nothing to query within a radius. The slot is a dossier that accretes the year's measured figures, each tagged with its source and (where the data carries it) its within-year breakdown (by-state, by-commodity, by-sector). This is the same posture as the cosmic and FEMA kinds: the spine is the product; the slot records the measurement and cites it.

A year slot is therefore always multi-source by nature — a year accretes every annual dataset that reports a figure for it, which is exactly what Part 2 is about.
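Accretion into a year slot might look like the sketch below. The dict layout ("data", "sources", "breakdown"), the source slugs, and the figures are all placeholders chosen for illustration; the real slot-JSON shape is an open build decision.

```python
from typing import Optional


def accrete(slot: dict, source: str, field: str, value,
            breakdown: Optional[dict] = None) -> dict:
    """Append one measured figure to a year slot, tagged with its source."""
    datum = {"field": field, "value": value, "source": source}
    if breakdown:
        # Within-year breakdown, e.g. {"state": "IA"}; kept inside the slot.
        datum["breakdown"] = dict(breakdown)
    slot.setdefault("data", []).append(datum)
    sources = slot.setdefault("sources", [])
    if source not in sources:
        sources.append(source)
    return slot


# Placeholder values, not real figures; source slugs are hypothetical.
slot = {"kind": "us-landuse", "period_id": "2017"}
accrete(slot, "usda-ers", "cropland", 1.0, {"state": "IA"})
accrete(slot, "second-source", "cropland", 2.0, {"state": "IA"})
print(slot["sources"])
```

Two sources reporting the same field sit side by side; nothing is overwritten, so each figure stays traceable to its origin.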

Measured-reality on the year axis

The bright line reads cleanly on years: a recorded annual figure is in (2017 cropland acreage, 2015 state-level GHG emission total, a 2019 field survey's measurements). A projected or modeled annual figure is out (a 2030 emissions projection, a modeled "expected annual loss"). Many of the uploaded EPA/DOE links carry both recorded data and modeled scenarios in the same dataset; the Yeardex slot takes the recorded columns and leaves the modeled ones, the same split FEMA made between the declaration record (in) and the National Risk Index loss models (out).
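The recorded/modeled split can be sketched as a column filter. It assumes each column has already been labeled by a reviewer; the label vocabulary and the column names are hypothetical, and treating any label other than "recorded" as out is a conservative choice made for this sketch.

```python
from typing import Dict, List, Tuple

RECORDED = "recorded"


def split_columns(provenance: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Split a dataset's columns into (taken into the slot, left out).

    Only columns explicitly labeled as recorded pass the bright line;
    modeled, projected, or unlabeled columns stay out.
    """
    kept, left_out = [], []
    for column, label in provenance.items():
        (kept if label == RECORDED else left_out).append(column)
    return kept, left_out


# Hypothetical column names and labels.
columns = {
    "ghg_total_2015": "recorded",
    "ghg_total_2030": "projected",
    "expected_annual_loss": "modeled",
}
kept, left_out = split_columns(columns)
print(kept, left_out)
```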

Part 2 — Multi-source cited slots: the universal rule

The rule

Any slot may carry data from more than one source. Every datum carries a citation to the source it came from. A slot is not owned by one feed; it is an addressable subject (an event, or a year) that multiple sources can each describe, and the slot keeps each contribution attributed.

This was already true in one corner of the storehouse: the entry kind feeds four catalogs (gmn:/sat:/cneos:/met:) into one source-prefixed slot space. Mike's 2026-06-16 generalization promotes that from an entry-only trick to a property of every kind: the prefix/citation machinery that made entry work is the same machinery that lets a tornado slot hold both SPC's survey fields and NOAA Storm Events' fields, each cited.

Two ways sources combine in a slot

There are two distinct cases, and the rule covers both:

Different events from different catalogs, sharing a slot space (the entry case). GMN and CNEOS catalog different fireballs; the source prefix keeps their IDs globally unique inside one kind. No matching is needed; the sources partition the slot space.
The same event described by two catalogs (the new case). SPC and NOAA Storm Events both record the same tornado. Here the sources must be matched so the two descriptions land in one slot, enriching it, rather than creating two slots for one real tornado. This is the harder case and the one Storm Events forces.
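The first case, sources partitioning one slot space, comes down to prefixing each native ID with its catalog. A minimal sketch, using the four entry prefixes named above; the helper names and the native ID are made up for illustration.

```python
from typing import Tuple

ENTRY_PREFIXES = ("gmn", "sat", "cneos", "met")


def prefixed_id(source: str, native_id: str) -> str:
    """Make a catalog-native ID globally unique inside the kind."""
    if source not in ENTRY_PREFIXES:
        raise ValueError(f"unknown source prefix: {source!r}")
    return f"{source}:{native_id}"


def split_id(slot_id: str) -> Tuple[str, str]:
    """Recover (source, native_id) from a prefixed slot ID."""
    source, sep, native_id = slot_id.partition(":")
    if not sep:
        raise ValueError(f"slot ID has no source prefix: {slot_id!r}")
    return source, native_id


# The same native ID in two catalogs yields two distinct slots.
a = prefixed_id("gmn", "0001")
b = prefixed_id("cneos", "0001")
print(a, b, a != b)
```

The second case cannot be solved this way: a prefix keeps records apart, whereas same-event matching has to bring them together.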

Enrichment, not duplication (the tornado worked example)

Our tor kind already holds ~73,600 tornado slots from SPC. NOAA Storm Events also records every US tornado, plus narrative, damage, and injury fields SPC does not carry. Under the multi-source rule we do not build a parallel Storm Events tornado catalog. We match each Storm Events tornado to its existing tor slot and add its fields there, cited to NOAA Storm Events. The tornado slot ends up with SPC's surveyed track (cited to SPC) and Storm Events' narrative/damage (cited to NOAA Storm Events), in one place. Storm Events' non-tornado hazards (hail, wind, flood, lightning, winter) have no existing slot and become new slots in a Storm Events kind.

The matching key for tornadoes is the hard part and will be its own scope decision: SPC and Storm Events use different ID schemes, so the merge keys on date, time window, state, county, and start lat/lon proximity, with unmatched rows recorded as their own slots rather than silently dropped. The principle the framework freezes: when two sources describe the same real event, they share one slot; when a match cannot be made confidently, the unmatched record gets its own slot and is never discarded.
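One possible shape for that merge is sketched below. The real matching key and confidence threshold are open scope decisions, so TIME_WINDOW and MAX_KM are placeholders, the TornadoRecord fields are hypothetical, and the sketch assumes both feeds' timestamps are already normalized to one time zone. An ambiguous match (more than one candidate) is treated as no confident match.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt
from typing import Dict, List, Optional, Tuple

TIME_WINDOW = timedelta(minutes=30)  # placeholder, not a frozen value
MAX_KM = 10.0                        # placeholder, not a frozen value


@dataclass
class TornadoRecord:
    record_id: str
    start: datetime  # assumed normalized to a common time zone
    state: str
    county: str
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres (mean Earth radius 6371 km)."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def find_match(rec: TornadoRecord, existing: List[TornadoRecord]) -> Optional[TornadoRecord]:
    """Return the single confident match, or None if there is none or several."""
    hits = [
        e for e in existing
        if e.state == rec.state and e.county == rec.county
        and abs(e.start - rec.start) <= TIME_WINDOW
        and haversine_km(e.lat, e.lon, rec.lat, rec.lon) <= MAX_KM
    ]
    return hits[0] if len(hits) == 1 else None


def merge(incoming: List[TornadoRecord], existing: List[TornadoRecord]
          ) -> Tuple[Dict[str, List[str]], List[TornadoRecord]]:
    """Map existing slot IDs to the incoming records that enrich them.

    Unmatched incoming records are returned as new slots, never dropped.
    """
    enriched: Dict[str, List[str]] = {}
    new_slots: List[TornadoRecord] = []
    for rec in incoming:
        match = find_match(rec, existing)
        if match is None:
            new_slots.append(rec)
        else:
            enriched.setdefault(match.record_id, []).append(rec.record_id)
    return enriched, new_slots


# Made-up records for illustration.
spc = [TornadoRecord("spc-1", datetime(2020, 5, 1, 15, 0), "OK", "Cleveland", 35.20, -97.40)]
storm_events = [
    TornadoRecord("se-1", datetime(2020, 5, 1, 15, 10), "OK", "Cleveland", 35.21, -97.40),
    TornadoRecord("se-2", datetime(2020, 5, 1, 15, 10), "KS", "Sedgwick", 37.70, -97.30),
]
enriched, new_slots = merge(storm_events, spc)
print(enriched, [r.record_id for r in new_slots])
```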

Citation shape

Each contributed block in a slot carries a source tag (the datasource slug or prefix) so any field can be traced to its origin, the same way every normalized observation already carries duckdb_source_ref. A slot that has been enriched lists its contributing sources. Minimum bar: a reader of any slot can answer "which feed told us this?" for every field. The exact JSON shape is a build decision for the first multi-source-merge brick (Storm Events), not a framework freeze; the requirement is attribution per datum.
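The minimum bar can be expressed as two small checks over a slot. The block layout used here ("blocks", each with "source" and "fields") is an assumption for illustration only, since the JSON shape is deliberately left open; the field values are placeholders.

```python
from typing import Dict, List


def field_origins(slot: dict) -> Dict[str, List[str]]:
    """Answer "which feed told us this?" for every field in the slot."""
    origins: Dict[str, List[str]] = {}
    for block in slot["blocks"]:
        for field in block["fields"]:
            origins.setdefault(field, []).append(block["source"])
    return origins


def check_attribution(slot: dict) -> None:
    """Raise if any block is unattributed or the source list is out of date."""
    for i, block in enumerate(slot["blocks"]):
        if not block.get("source"):
            raise ValueError(f"block {i} has no source tag")
    used = {block["source"] for block in slot["blocks"]}
    if set(slot.get("sources", [])) != used:
        raise ValueError("slot 'sources' does not match its contributing blocks")


# Hypothetical enriched tornado slot; values are placeholders.
tor_slot = {
    "sources": ["spc", "noaa-storm-events"],
    "blocks": [
        {"source": "spc", "fields": {"track": "..."}},
        {"source": "noaa-storm-events", "fields": {"narrative": "...", "damage": "..."}},
    ],
}
check_attribution(tor_slot)
print(field_origins(tor_slot))
```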

How the two extensions combine

Yeardex slots are multi-source by construction: a year accretes every annual dataset that reports a figure for it, each cited. So Part 2 is not optional decoration on Part 1 — it is the mechanism that makes a year slot work. A us-landuse 2017 slot holds USDA's cropland figure (cited to USDA ERS), and if a second source also reports 2017 land use, its figure sits alongside, cited to it. The same rule, the same attribution, on both the event axis and the year axis.

What this unlocks for the uploaded links

With both extensions in the framework, the ~338 admin-uploaded links sort cleanly and none are discarded:

Event-shaped → Eventdex. NOAA Storm Events (the one strong remaining event catalog), and FEMA disasters (already kind #10). Storm Events tornadoes enrich tor; its other hazards become new slots.
Year-shaped → Yeardex. The USDA land-use and milk series, the annual GHG inventories, the year-keyed EPA/DOE research datasets — each a year-slot timeline, recorded columns in, modeled columns out.
Overlapping → enrich, cited. Any uploaded link that re-describes an event or year we already hold adds its fields to that slot, attributed, rather than duplicating it.
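The three-way sort above can be sketched as a routing function. It assumes a reviewer has already judged each link's shape and whether it re-describes a slot we hold; the Route names and the "event"/"year" labels are hypothetical. In practice one link can route in more than one way (Storm Events both enriches tor and mints new slots), so a real router would likely work per record rather than per link.

```python
from enum import Enum


class Route(Enum):
    EVENTDEX = "eventdex"
    YEARDEX = "yeardex"
    ENRICH = "enrich"


def route_link(shape: str, overlaps_existing_slot: bool) -> Route:
    """Pick a destination for one uploaded link; nothing is discarded."""
    if overlaps_existing_slot:
        return Route.ENRICH
    if shape == "event":
        return Route.EVENTDEX
    if shape == "year":
        return Route.YEARDEX
    # Unknown shapes go back for review instead of being dropped.
    raise ValueError(f"link shape needs review: {shape!r}")


print(route_link("year", overlaps_existing_slot=False))
```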

Frozen vs open

Frozen (2026-06-16, Mike):

Yeardex exists as a first-class sibling of Eventdex: the slot may be a calendar year.
The multi-source cited-slot rule is universal: any slot may carry multiple sources, every datum cited; same-event matches share one slot, unmatched records are never discarded.
The measured-reality bright line binds both: recorded figures in, modeled/projected figures out, on the year axis as on the event axis.

Open (decided per kind, in each docs/scope-*.md freeze, not here):

Yeardex subject granularity — subject-scoped kinds (default) vs one national timeline.
Yeardex storehouse layout and slot-JSON shape; whether sub-year cadences roll up or get their own treatment.
The Storm Events ↔ tor tornado matching key and confidence threshold; the per-datum citation JSON shape (settled by the first multi-source-merge build).

Sequencing (proposed, Mike to confirm per his cadence)

This framework doc — defines both extensions. ✅ (this file).
Storm Events Eventdex kind — first real test of same-event cross-source merging (tornadoes → tor, cited; other hazards → new slots). Proves Part 2 on event data.
First Yeardex kind — ✅ 2026-06-16: US Land Use Yeardex (kind = landuse, docs/scope-landuse-yeardex.md), USDA ERS Major Uses of Land, 16 survey-year slots 1945–2017, 16,128 data points across 16 categories × 63 geographies, each value citing its ERS table. Proves Part 1 (slot = year) and Part 2 (multi-source cited slots, here one provider's many category tables) end to end. Measured-reality call: IN, census-anchored accounting (Mike). Stored in a separate data/year_storehouse/.
Scale Yeardex across the remaining year-shaped uploaded links, subject by subject (milk supply, annual emissions inventories, year-keyed research datasets).
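The data-point count quoted for the landuse kind in the list above is consistent with a dense grid, one value per survey year × category × geography. The check below assumes no missing cells.

```python
survey_years = 16
categories = 16
geographies = 63

# One cited value per (survey year, category, geography) cell.
data_points = survey_years * categories * geographies
assert data_points == 16_128
```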

Same operating discipline as every Eventdex kind: scope frozen in a docs/scope-*.md doc before backfill and never tuned; live-forward where a feed exists, history backfill second; strategic per-slot fills, never a blanket obligation.