The TerraPulse Knowledge Graph
"Technically, this entire app is a knowledge graph."
Every fetcher, every workspace, every paper, every finding, every visualization is a node. Every ingestion, every analysis, every citation is an edge. The lab is a graph that produces papers; the papers are nodes that produce more edges.
Current state
| Metric | Value | Source |
|---|---|---|
| Total nodes | 989 | data/graph_cache.json |
| Total edges | 1,481 | same |
| Connected components | 141 | most are tiny orphan datasource bundles |
| Main research cluster | 491 nodes | contains all 54 workspaces' research output |
| Workspaces (papers) | 54 | workspaces/*/workspace.json |
| Datasources | 456 | nearly all the upstream sources TerraPulse has ever ingested |
| Metrics | 196 | the canonical observation types in PostgreSQL |
| Findings | 283 | extracted from data/results.json files |
Node types
| Type | Count | What it represents |
|---|---|---|
| datasource | 456 | An external data source the platform pulls from (USGS, NASA DONKI, EMSC, NMDB, ...) |
| metric | 196 | A canonical observation type stored in PostgreSQL (earthquake_magnitude, wspr_snr_40m, ...) |
| workspace | 54 | A research workspace (/workspaces/{slug}/) — usually maps to a paper |
| finding | 283 | An extracted statistical result (correlation, p-value, effect size) from a workspace's results.json |
Edge types
| Type | Count | Direction | Meaning |
|---|---|---|---|
| produces | 971 | datasource → metric | "USGS produces earthquake_magnitude" |
| uses | 246 | metric → workspace | "wspr-storm-corridor uses wspr_snr_40m" |
| tested | 237 | workspace → finding | "wspr-storm-corridor tested r=-0.29" |
| cites | 27 | workspace → workspace | "this paper references that paper's lab page" |
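For orientation, here is a minimal sketch of how nodes and edges of these types might be laid out in the cache. The key names (`id`, `type`, `source`, `target`) are illustrative assumptions, not the builder's actual schema:

```python
import json
from collections import Counter

# Hypothetical miniature of data/graph_cache.json -- key names are
# assumptions for illustration, not the real schema.
cache = {
    "nodes": [
        {"id": "usgs", "type": "datasource"},
        {"id": "earthquake_magnitude", "type": "metric"},
        {"id": "wspr-storm-corridor", "type": "workspace"},
    ],
    "edges": [
        {"source": "usgs", "target": "earthquake_magnitude", "type": "produces"},
        {"source": "earthquake_magnitude", "target": "wspr-storm-corridor", "type": "uses"},
    ],
}

# Tally edges per type -- the same numbers the table above reports,
# just over a toy two-edge graph.
edge_counts = Counter(e["type"] for e in cache["edges"])
print(json.dumps(edge_counts, indent=2))
```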
Architecture
The graph is rebuilt hourly by scripts/regenerate_graph.py (called from APScheduler) and served from data/graph_cache.json.
Data sources
The graph builder reads from four places:
- PostgreSQL `datasources` table — every active fetcher becomes a `datasource` node
- PostgreSQL `observations.metric` distinct values — every active metric becomes a `metric` node, linked to its datasource via a `produces` edge
- `workspaces/*/workspace.json` — `title`, `metrics[]`, `tags[]`, and `status`. The `metrics[]` declaration is what tells the graph "this workspace uses these metrics."
- `workspaces/*/data/results.json` — extracted statistical findings via pattern-matching on `correlation_r`, `pearson_r`, `p_value`, etc.
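The results.json pattern-matching step could look roughly like this — the recursive walk and the exact key set are illustrative assumptions, not the extractor's actual code:

```python
# Sketch of finding extraction from results.json. The key names
# (correlation_r, pearson_r, p_value) come from the text above; the
# recursive walk over nested dicts/lists is an assumption.
EFFECT_KEYS = ("correlation_r", "pearson_r")
P_KEYS = ("p_value", "p")

def extract_findings(node, path=()):
    """Recursively collect dicts that look like statistical findings."""
    findings = []
    if isinstance(node, dict):
        effect = next((node[k] for k in EFFECT_KEYS if k in node), None)
        p = next((node[k] for k in P_KEYS if k in node), None)
        if effect is not None:
            findings.append({"path": "/".join(path), "effect": effect, "p": p})
        for key, value in node.items():
            findings.extend(extract_findings(value, path + (key,)))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            findings.extend(extract_findings(value, path + (str(i),)))
    return findings

sample = {"short_path_10m": {"pearson_r": -0.61, "p_value": 7e-40}}
print(extract_findings(sample))
# -> [{'path': 'short_path_10m', 'effect': -0.61, 'p': 7e-40}]
```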
Rendering
The frontend at /garden (web/src/pages/garden.astro) uses D3 force-directed simulation:
- Graph data is embedded at SSR time from `data/graph_cache.json` (no client API call)
- D3 v7 is self-hosted at `/d3.v7.min.js` (no CDN dependency)
- Loading spinner shows while D3 initializes
- Click any node → highlight + connection panel
- Filter by node type
- Color-coded: datasources (blue), metrics (orange), workspaces (purple), findings (green)
- Edge colors carry effect size (positive=green, negative=red, null=gray dashed)
What's connected, what isn't
Top connected workspaces (28 edges down to 24):
- Drought-wildfire-AQI cascade (28)
- Magnetic pole drift vs WSPR (26)
- Lunar tidal signal in rivers (25)
- Radon precursor hypothesis (24)
- Solar Flux–Kp Index Time Lag (24)
Orphaned workspaces (3 edges or fewer):
- wspr-21year-census (3) — declared metrics but the graph builder still misses some
- Air Quality–Weather Coupling (2) — never declared metrics
- Radiation Global Baseline (2)
- California Streamflow (1)
- WSPR Ionospheric Geography (0) — workspace exists but nothing in workspace.json points the graph builder at its dependencies
Detached datasource clusters (the 141 - 1 = 140 small components): mostly orphan upstream sources that were ingested once and never picked up by a workspace (FEMA disaster declarations, the bulk of the catalog).
The 2026-04-08 bug
The graph builder used to do substring matching on script files to detect which metrics a workspace used. It scanned scripts/extract.py and scripts/analyze.py for any string that matched a registered metric ID (e.g., wspr_snr_40m).
This silently failed whenever a workspace's analysis script used integer band codes (band == 7) instead of string metric names. The four WSPR papers (census, solar cycle, station-pair, ionospheric geography) wrote their analysis at the integer-band layer, so the graph builder never saw the metric strings, and the entire WSPR ecosystem became a 27-node detached island floating next to the main 487-node research cluster.
The fix was three lines: read the `metrics:` array from workspace.json directly, plus five workspace.json patches to actually declare the metrics. The result: the 27-node WSPR orphan and the 19-node HamQSL orphan collapsed into the main cluster (487 → 540 nodes, 1,455 → 1,593 edges in a single rebuild).
The lesson: a graph is only as good as its declared edges. Make declarations explicit and authoritative; treat substring scans as a fallback, not the source of truth.
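A minimal sketch of that declaration-first policy, assuming the file layout described above (this is an illustrative reconstruction, not the actual code in knowledge_graph.py):

```python
import json
from pathlib import Path

def workspace_metrics(ws_dir: Path, known_metrics: set[str]) -> set[str]:
    """Return the metrics a workspace uses.

    Declared metrics are authoritative; the substring scan over
    scripts is only a fallback -- the reverse of the pre-fix behavior.
    """
    meta = json.loads((ws_dir / "workspace.json").read_text())
    declared = set(meta.get("metrics", []))
    if declared:
        return declared  # explicit declaration wins
    # Fallback: the old (fragile) substring scan that missed the
    # integer-band WSPR scripts.
    found = set()
    for script in (ws_dir / "scripts").glob("*.py"):
        text = script.read_text()
        found |= {m for m in known_metrics if m in text}
    return found
```

Declarations stay the source of truth; the scan only fills gaps for workspaces that never declared anything.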
What's missing today
These are real gaps the graph doesn't currently model:
- Articles ↛ workspaces — the `/articles/*` pages link to workspaces via `<a href="/lab/{slug}">`, but the graph builder doesn't see articles at all. Article exposure is invisible.
- Workspaces ↛ articles — backlinks don't exist in either direction.
- Pulse events ↛ metrics — the live ticker fires events about metrics every minute, but the graph treats events as ephemeral.
- Forbush events ↛ workspaces — the new Forbush detector creates `forbush_event` observations but no workspace is currently bound to that metric.
- Workspace dependencies on each other — only `cites` edges (text mentions from index.md). No "this paper depends on that paper's data extraction."
- The Pulse streamer ↛ datasources — 6 streaming sources (USGS, EMSC, NOAA SWPC, GOES, DSCOVR, CNEOS) push events but they're not modeled as a separate "live source" tier.
- Editor pipeline ↛ workspaces — Mike's reviews and Dana's copy edits are absent. The editorial graph (which papers got which review rounds, who flagged what) is not represented.
Ideas for improvement
Tier 1 — quick wins (1 sprint each)
A. Auto-discover metrics from analyze.py imports/queries
Beyond the workspace.json declaration, scan analyze.py for psql ... 'SELECT ... metric = ' patterns and SQL metric IN (...) clauses. Catch the cases where the author forgot to declare them. Combine with the existing workspace.json source-of-truth.
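A rough sketch of such a scan — the regexes below are assumptions about how the SQL in analyze.py is typically written, so real scripts would need tuning:

```python
import re

# Illustrative patterns for metric IDs referenced in SQL.
# Both regexes are assumptions about query style, not verified
# against the actual analyze.py files.
METRIC_EQ = re.compile(r"metric\s*=\s*'([a-z0-9_]+)'")
METRIC_IN = re.compile(r"metric\s+IN\s*\(([^)]*)\)", re.IGNORECASE)

def metrics_in_script(source: str) -> set[str]:
    """Collect metric IDs mentioned in equality or IN (...) clauses."""
    found = set(METRIC_EQ.findall(source))
    for group in METRIC_IN.findall(source):
        found |= set(re.findall(r"'([a-z0-9_]+)'", group))
    return found

sql = ("SELECT * FROM observations WHERE metric = 'wspr_snr_40m' "
       "OR metric IN ('sunspot_number', 'kp_index')")
print(metrics_in_script(sql))
```

Results from this scan would merge with (never override) the workspace.json declaration.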
B. Bidirectional article ↔ workspace edges
When the hourly graph rebuild runs, also scan web/src/pages/articles/*.astro for /lab/{slug} patterns. Add an article node type and featured_in edges. Render on both the lab page ("As featured in: …") and the article page (already there).
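A hedged sketch of the article scan — the `featured_in` edge dicts and the `LAB_LINK` pattern are hypothetical, matching the `/lab/{slug}` hrefs mentioned above:

```python
import re
from pathlib import Path

# Assumed href shape from the doc: <a href="/lab/{slug}">
LAB_LINK = re.compile(r'href="/lab/([a-z0-9-]+)"')

def article_edges(articles_dir: Path) -> list[dict]:
    """Scan article pages for lab links; emit featured_in edges.

    The edge dict layout mirrors the illustrative cache schema,
    not a confirmed format.
    """
    edges = []
    for page in sorted(articles_dir.glob("*.astro")):
        for slug in LAB_LINK.findall(page.read_text()):
            edges.append({
                "source": page.stem,   # article node id
                "target": slug,        # workspace node id
                "type": "featured_in",
            })
    return edges
```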
C. Workspace status colors
Color workspace nodes by status: complete=green, draft=gray, revised=blue, failed=red. The current uniform purple loses the most important state.
D. Edge weights from effect sizes
tested edges carry verdict but not magnitude. Use Cohen's d or |r| to thicken edges so the strongest findings draw the eye in /garden.
E. Replace mass datasource cluster with categories
The 286-node "datasource galaxy" component is dominated by FEMA/CB catalog entries. Group them into category supernodes ("FEMA disaster declarations", "Campaign Brain catalog"), so the rendering doesn't waste pixels on uninteresting bulk.
Tier 2 — bigger features
F. The Workspace Navigator (template page) — see below
A per-workspace landing page that shows the workspace as the center of its own mini-graph, with all artifacts navigable from one place.
G. Time-axis on the graph
Add a slider showing how the graph evolved month by month. Watch the WSPR cluster bloom in March, the Pulse arrive in April, the Granger network rewire after the HAC fix.
H. Pulse events as ephemeral nodes
When a quake or fireball comes through the Pulse, briefly render it as a node attached to its metric for 30 seconds, then fade it out. The /garden becomes a living organism.
I. Editor history as a layer
Overlay the Mike → Elise → Dana pipeline. Each accepted paper has an edit-distance score from its first draft to the published version, color-coded.
J. Cross-workspace dependency graph
Detect when one workspace's analysis script reads another workspace's data/results.json or data/*.parquet. Build hard depends_on edges (not just text-citation cites).
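A sketch of how such reads might be detected — the path regex is an assumption about how scripts reference sibling workspaces:

```python
import re

# Assumed path shape for cross-workspace reads, e.g.
# "../../workspaces/{slug}/data/spots.parquet".
CROSS_READ = re.compile(r"workspaces/([a-z0-9-]+)/data/[\w.*-]+")

def depends_on(slug: str, script_source: str) -> set[str]:
    """Slugs of other workspaces whose data/ files this script reads."""
    return {m for m in CROSS_READ.findall(script_source) if m != slug}

src = 'df = pd.read_parquet("../../workspaces/wspr-21year-census/data/spots.parquet")'
print(depends_on("wspr-station-pair-validation", src))
```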
Tier 3 — research-grade ambition
K. Findings as queryable nodes
Each finding could carry (metric_a, metric_b, lag, effect, p) as structured fields. Then "show me all the negative correlations between WSPR and any solar metric at lag <= 7 days" becomes a graph query, not a full-text search.
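For illustration, such a query over structured findings might look like this — the `Finding` dataclass and the sample values are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical structured-finding record; fields follow the tuple
# (metric_a, metric_b, lag, effect, p) named in the text.
@dataclass
class Finding:
    metric_a: str
    metric_b: str
    lag_days: int
    effect: float
    p: float

findings = [
    Finding("wspr_snr_10m", "sunspot_number", 0, -0.29, 0.003),
    Finding("wspr_snr_10m", "kp_index", 14, -0.18, 0.04),
    Finding("river_discharge", "lunar_phase", 3, 0.12, 0.01),
]

# "Negative correlations involving a WSPR metric at lag <= 7 days"
# as a structured filter instead of a full-text search.
hits = [
    f for f in findings
    if f.metric_a.startswith("wspr_") and f.effect < 0 and f.lag_days <= 7
]
print(hits)
```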
L. The graph as input to the next paper
When a researcher (or Elise) starts a new paper, the graph already knows which metrics overlap in time, which pairs have been tested, which lags worked. Suggest experiments the graph hasn't tried yet.
Proposal: The Workspace Navigator
A template page that organizes everything around a single workspace — like a research mission control panel.
Goals
- One screen, everything about the workspace at a glance
- A small per-workspace knowledge graph (the workspace + everything it touches, ~30 nodes max)
- Quick navigation to all artifacts (paper PDF, scripts, data files, visualizations, results)
- Surface what's connected: which metrics, which findings, which other workspaces cite this one
- Where appropriate: surface upstream Pulse activity for the workspace's metrics
Layout sketch
┌──────────────────────────────────────────────────────────────────┐
│ WORKSPACE NAVIGATOR — wspr-station-pair-validation │
│ Two Competing Responses Hidden in the 10 m WSPR Anticorrelation │
│ Status: complete · Issue: #104 · Updated: 2026-04-07 │
├───────────────────────────────────┬──────────────────────────────┤
│ │ │
│ ┌─────────────────────────┐ │ ARTIFACTS │
│ │ │ │ ───────── │
│ │ [mini knowledge │ │ 📄 paper.pdf (8 pages) │
│ │ graph: this │ │ 📖 index.md (1.2k lines) │
│ │ workspace + │ │ 🐍 scripts/ │
│ │ immediate │ │ • extract.py │
│ │ neighbors │ │ • pair_one_pass1.py │
│ │ only ~30 nodes] │ │ • pair_one_pass2.py │
│ │ │ │ • analyze.py │
│ │ │ │ 📊 data/ │
│ └─────────────────────────┘ │ • qualifying_pairs.parquet
│ │ • pair_monthly_snr.parquet
│ FINDINGS │ • results.json │
│ ──────── │ • full_band_monthly.parquet
│ • short-path 10m: r=-0.61 │ 📈 www/ │
│ • long-path 10m: r=+0.54 │ • full-vs-filtered-r.html
│ • Fisher z=-13.2 (p≈7e-40) │ • 10m-short-vs-long.html
│ • 12m: filtered r drops to -0.12│ • pair-counts-by-band.html
│ • 20m: positive control passes │ • fisher-forest.html │
│ │ │
├───────────────────────────────────┴──────────────────────────────┤
│ RELATED │
│ ───────── │
│ Cites: wspr-solar-cycle-modulation, wspr-21year-census │
│ Cited by: (none yet) │
│ Featured in: "10.94 Billion Spots, One Wrong Sign" article │
│ │
│ METRICS USED (declared) │
│ ─────────────────────── │
│ wspr_snr_80m wspr_snr_40m wspr_snr_30m wspr_snr_20m │
│ wspr_snr_17m wspr_snr_15m wspr_snr_12m wspr_snr_10m │
│ sunspot_number │
│ │
│ PULSE LIVE (last 24h, filtered to this workspace's metrics) │
│ ────────────────────────────────────────────────────────────── │
│ (would show ham radio events, sunspot updates, etc.) │
└──────────────────────────────────────────────────────────────────┘
Implementation outline
Route: /lab/{slug}/navigator (sibling to the existing /lab/{slug})
Astro page: web/src/pages/lab/[slug]/navigator.astro
Server-side reads (all from filesystem at SSR, like the existing lab pages):
- `workspaces/{slug}/workspace.json` — title, status, issue, metrics, tags
- `workspaces/{slug}/index.md` — first 200 lines for the narrative pane
- `workspaces/{slug}/data/results.json` — extract structured findings
- `workspaces/{slug}/scripts/*.py` — list (with sizes)
- `workspaces/{slug}/data/*.parquet` — list (with sizes)
- `workspaces/{slug}/www/*.html` and `*.png` — list
- `workspaces/{slug}/paper/paper.pdf` — exists check + size
- `data/graph_cache.json` — extract subgraph rooted at this workspace, BFS depth 2
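The depth-2 subgraph extraction can be sketched as a plain BFS over the cached nodes and edges (the dict keys `id`, `source`, `target` are assumptions about the cache schema):

```python
from collections import deque

def subgraph(nodes, edges, root: str, depth: int = 2):
    """BFS to the given depth over an undirected view of the graph.

    Returns the nodes reached and the edges whose endpoints both
    survive -- a sketch of the navigator's mini-graph extraction.
    """
    adj: dict[str, set[str]] = {}
    for e in edges:
        adj.setdefault(e["source"], set()).add(e["target"])
        adj.setdefault(e["target"], set()).add(e["source"])
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue  # don't expand past the depth limit
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return (
        [n for n in nodes if n["id"] in seen],
        [e for e in edges if e["source"] in seen and e["target"] in seen],
    )
```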
Mini graph:
- Center node = the workspace
- Neighbors = metrics it uses, findings it produced, other workspaces it cites or that cite it
- Render with the same D3 simulation as `/garden`, but a smaller container (400×300)
- Click a metric → see all other workspaces that use the same metric
- Click a finding → highlight, show extracted r/p/N
Artifacts panel:
- Each script, data file, viz, paper file links to the API endpoint that serves it
- Sort by type (scripts → data → www → paper)
- Show sizes
Findings panel:
- Pull the top 5-10 findings from `results.json`
- Format as bullets with effect size + p-value
- Color-code by verdict (positive, negative, null)
Related panel:
- `cites` from the graph cache
- "cited by" via reverse lookup
- "featured in" from articles scan
Pulse Live panel (advanced):
- Subscribe to the existing WebSocket
- Filter messages: only show events whose `kind` matches a metric this workspace uses
- E.g., the WSPR navigator would highlight ham/solar events; the seismic navigator would surface earthquakes
- Same chip styling as the global Pulse Ticker, but workspace-scoped
Why this is the right next thing
- The lab page is the funnel. Right now it's a card grid with a thumbnail and 120 chars of description. The navigator turns each workspace into a navigable hub.
- Per-workspace mini-graphs visualize what's connected without overwhelming the user with the full 989-node garden.
- It teaches the graph by example. Anyone landing on the navigator immediately sees that workspaces use metrics, metrics come from datasources, and workspaces produce findings and cite other papers.
- It surfaces orphans. A workspace with no `metrics:` declared will show an empty mini-graph and a warning. Self-correcting documentation.
- It composes with the Pulse. The navigator can subscribe to the WebSocket and filter to its workspace's domain, turning every research page into a live monitor for its area of interest.
Phased delivery
Phase 1 (1 sprint item): Static navigator. Workspace metadata, artifacts list, findings from results.json, no graph yet.
Phase 2 (1 sprint item): Mini graph rendered server-side from data/graph_cache.json subgraph. D3 reuse from /garden.
Phase 3 (1 sprint item): Bidirectional article links + "cited by" reverse lookup.
Phase 4 (stretch): Pulse Live panel subscribed to the WebSocket, filtered by workspace metrics.
Open questions
- Should the navigator replace `/lab/{slug}` or live alongside it as `/lab/{slug}/navigator`?
- Do we want a public-facing "research dashboard" view that's the navigator with social-share affordances (Twitter, RSS, citation export)?
- Should the per-workspace pulse panel persist events to a per-workspace ring buffer the same way the global Pulse does?
Appendix: graph generation in code
src/terrapulse/lab/knowledge_graph.py
├── Node dataclass
├── Edge dataclass
├── KnowledgeGraph class with add_node/add_edge/find_metric_links
└── build_graph() entry point
├── reads PostgreSQL datasources
├── reads PostgreSQL distinct metrics
├── walks workspaces/*/
│ ├── reads workspace.json (title, metrics[]) ← FIXED 2026-04-08
│ ├── reads scripts/*.py (substring fallback)
│ ├── reads index.md (cites detection)
│ └── reads data/results.json (finding extraction)
└── returns KnowledgeGraph
scripts/regenerate_graph.py standalone caller, writes data/graph_cache.json
src/terrapulse/ingestion/scheduler.py APScheduler job, calls regenerate_graph hourly
data/graph_cache.json the live cache, served via SSR embed
web/src/pages/garden.astro the D3 frontend