The TerraPulse Knowledge Graph
"Technically, this entire app is a knowledge graph."
Every fetcher, every workspace, every paper, every finding, every visualization is a node. Every ingestion, every analysis, every citation is an edge. The lab is a graph that produces papers; the papers are nodes that produce more edges.
Current state
| Metric | Value | Source |
|---|---|---|
| Total nodes | 989 | data/graph_cache.json |
| Total edges | 1,481 | same |
| Connected components | 141 | most are tiny orphan datasource bundles |
| Main research cluster | 491 nodes | contains all 54 workspaces' research output |
| Workspaces (papers) | 54 | workspaces/*/workspace.json |
| Datasources | 456 | nearly all the upstream sources TerraPulse has ever ingested |
| Metrics | 196 | the canonical observation types in PostgreSQL |
| Findings | 283 | extracted from data/results.json files |
Node types
| Type | Count | What it represents |
|---|---|---|
| datasource | 456 | An external data source the platform pulls from (USGS, NASA DONKI, EMSC, NMDB, ...) |
| metric | 196 | A canonical observation type stored in PostgreSQL (earthquake_magnitude, wspr_snr_40m, ...) |
| workspace | 54 | A research workspace (/workspaces/{slug}/) — usually maps to a paper |
| finding | 283 | An extracted statistical result (correlation, p-value, effect size) from a workspace's results.json |
Edge types
| Type | Count | Direction | Meaning |
|---|---|---|---|
| produces | 971 | datasource → metric | "USGS produces earthquake_magnitude" |
| uses | 246 | metric → workspace | "wspr-storm-corridor uses wspr_snr_40m" |
| tested | 237 | workspace → finding | "wspr-storm-corridor tested r=-0.29" |
| cites | 27 | workspace → workspace | "this paper references that paper's lab page" |
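For orientation, here is a minimal sketch of how nodes and edges of these types might be laid out in the cache. The key names (`id`, `type`, `source`, `target`) are illustrative assumptions, not the builder's actual schema:

```python
import json
from collections import Counter

# Hypothetical miniature of data/graph_cache.json -- key names are
# assumptions for illustration, not the real schema.
cache = {
    "nodes": [
        {"id": "usgs", "type": "datasource"},
        {"id": "earthquake_magnitude", "type": "metric"},
        {"id": "wspr-storm-corridor", "type": "workspace"},
    ],
    "edges": [
        {"source": "usgs", "target": "earthquake_magnitude", "type": "produces"},
        {"source": "earthquake_magnitude", "target": "wspr-storm-corridor", "type": "uses"},
    ],
}

# Tally edges per type -- the same numbers the table above reports,
# just over a toy two-edge graph.
edge_counts = Counter(e["type"] for e in cache["edges"])
print(json.dumps(edge_counts, indent=2))
```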
Architecture
The graph is rebuilt hourly by scripts/regenerate_graph.py (called from APScheduler) and served from data/graph_cache.json.
Data sources
The graph builder reads from four places:
- PostgreSQL `datasources` table — every active fetcher becomes a `datasource` node
- PostgreSQL `observations.metric` distinct values — every active metric becomes a `metric` node, linked to its datasource via a `produces` edge
- `workspaces/*/workspace.json` — `title`, `metrics[]`, `tags[]`, and `status`. The `metrics[]` declaration is what tells the graph "this workspace uses these metrics."
- `workspaces/*/data/results.json` — extracted statistical findings via pattern-matching on `correlation_r`, `pearson_r`, `p_value`, etc.
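The results.json pattern-matching step could look roughly like this — the recursive walk and the exact key set are illustrative assumptions, not the extractor's actual code:

```python
# Sketch of finding extraction from results.json. The key names
# (correlation_r, pearson_r, p_value) come from the text above; the
# recursive walk over nested dicts/lists is an assumption.
EFFECT_KEYS = ("correlation_r", "pearson_r")
P_KEYS = ("p_value", "p")

def extract_findings(node, path=()):
    """Recursively collect dicts that look like statistical findings."""
    findings = []
    if isinstance(node, dict):
        effect = next((node[k] for k in EFFECT_KEYS if k in node), None)
        p = next((node[k] for k in P_KEYS if k in node), None)
        if effect is not None:
            findings.append({"path": "/".join(path), "effect": effect, "p": p})
        for key, value in node.items():
            findings.extend(extract_findings(value, path + (key,)))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            findings.extend(extract_findings(value, path + (str(i),)))
    return findings

sample = {"short_path_10m": {"pearson_r": -0.61, "p_value": 7e-40}}
print(extract_findings(sample))
# -> [{'path': 'short_path_10m', 'effect': -0.61, 'p': 7e-40}]
```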
Rendering
The frontend at /garden (web/src/pages/garden.astro) uses D3 force-directed simulation:
- Graph data is embedded at SSR time from `data/graph_cache.json` (no client API call)
- D3 v7 is self-hosted at `/d3.v7.min.js` (no CDN dependency)
- Loading spinner shows while D3 initializes
- Click any node → highlight + connection panel
- Filter by node type
- Color-coded: datasources (blue), metrics (orange), workspaces (purple), findings (green)
- Edge colors carry effect size (positive=green, negative=red, null=gray dashed)
What's connected, what isn't
Top connected workspaces (28 edges down to 24):
- Drought-wildfire-AQI cascade (28)
- Magnetic pole drift vs WSPR (26)
- Lunar tidal signal in rivers (25)
- Radon precursor hypothesis (24)
- Solar Flux–Kp Index Time Lag (24)
Orphaned workspaces (3 edges or fewer):
- wspr-21year-census (3) — declared metrics but the graph builder still misses some
- Air Quality–Weather Coupling (2) — never declared metrics
- Radiation Global Baseline (2)
- California Streamflow (1)
- WSPR Ionospheric Geography (0) — workspace exists but nothing in workspace.json points the graph builder at its dependencies
Detached datasource clusters (the 141 - 1 = 140 small components): mostly orphan upstream sources that were ingested once and never picked up by a workspace (FEMA disaster declarations, the bulk of the catalog).
The 2026-04-08 bug
The graph builder used to do substring matching on script files to detect which metrics a workspace used. It scanned scripts/extract.py and scripts/analyze.py for any string that matched a registered metric ID (e.g., wspr_snr_40m).
This silently failed whenever a workspace's analysis script used integer band codes (band == 7) instead of string metric names. The four WSPR papers (census, solar cycle, station-pair, ionospheric geography) wrote their analysis at the integer-band layer, so the graph builder never saw the metric strings, and the entire WSPR ecosystem became a 27-node detached island floating next to the main 487-node research cluster.
The fix was three lines: read the `metrics:` array from workspace.json directly, plus five workspace.json patches to actually declare the metrics. The result: the 27-node WSPR orphan and the 19-node HamQSL orphan collapsed into the main cluster (487 → 540 nodes, 1,455 → 1,593 edges in a single rebuild).
The lesson: a graph is only as good as its declared edges. Make declarations explicit and authoritative; treat substring scans as a fallback, not the source of truth.
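A minimal sketch of that declaration-first policy, assuming the file layout described above (this is an illustrative reconstruction, not the actual code in knowledge_graph.py):

```python
import json
from pathlib import Path

def workspace_metrics(ws_dir: Path, known_metrics: set[str]) -> set[str]:
    """Return the metrics a workspace uses.

    Declared metrics are authoritative; the substring scan over
    scripts is only a fallback -- the reverse of the pre-fix behavior.
    """
    meta = json.loads((ws_dir / "workspace.json").read_text())
    declared = set(meta.get("metrics", []))
    if declared:
        return declared  # explicit declaration wins
    # Fallback: the old (fragile) substring scan that missed the
    # integer-band WSPR scripts.
    found = set()
    for script in (ws_dir / "scripts").glob("*.py"):
        text = script.read_text()
        found |= {m for m in known_metrics if m in text}
    return found
```

Declarations stay the source of truth; the scan only fills gaps for workspaces that never declared anything.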
What's missing today
These are real gaps the graph doesn't currently model:
- Articles ↛ workspaces — the `/articles/*` pages link to workspaces via `<a href="/lab/{slug}">`, but the graph builder doesn't see articles at all. Article exposure is invisible.
- Workspaces ↛ articles — backlinks don't exist in either direction.
- Pulse events ↛ metrics — the live ticker fires events about metrics every minute, but the graph treats events as ephemeral.
- Forbush events ↛ workspaces — the new Forbush detector creates `forbush_event` observations but no workspace is currently bound to that metric.
- Workspace dependencies on each other — only `cites` edges (text mentions from index.md). No "this paper depends on that paper's data extraction."
- The Pulse streamer ↛ datasources — 6 streaming sources (USGS, EMSC, NOAA SWPC, GOES, DSCOVR, CNEOS) push events but they're not modeled as a separate "live source" tier.
- Editor pipeline ↛ workspaces — Mike's reviews and Dana's copy edits are absent. The editorial graph (which papers got which review rounds, who flagged what) is not represented.
Ideas for improvement
Tier 1 — quick wins (1 sprint each)
A. Auto-discover metrics from analyze.py imports/queries
Beyond the workspace.json declaration, scan analyze.py for psql ... 'SELECT ... metric = ' patterns and SQL metric IN (...) clauses. Catch the cases where the author forgot to declare them. Combine with the existing workspace.json source-of-truth.
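A rough sketch of such a scan — the regexes below are assumptions about how the SQL in analyze.py is typically written, so real scripts would need tuning:

```python
import re

# Illustrative patterns for metric IDs referenced in SQL.
# Both regexes are assumptions about query style, not verified
# against the actual analyze.py files.
METRIC_EQ = re.compile(r"metric\s*=\s*'([a-z0-9_]+)'")
METRIC_IN = re.compile(r"metric\s+IN\s*\(([^)]*)\)", re.IGNORECASE)

def metrics_in_script(source: str) -> set[str]:
    """Collect metric IDs mentioned in equality or IN (...) clauses."""
    found = set(METRIC_EQ.findall(source))
    for group in METRIC_IN.findall(source):
        found |= set(re.findall(r"'([a-z0-9_]+)'", group))
    return found

sql = ("SELECT * FROM observations WHERE metric = 'wspr_snr_40m' "
       "OR metric IN ('sunspot_number', 'kp_index')")
print(metrics_in_script(sql))
```

Results from this scan would merge with (never override) the workspace.json declaration.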
B. Bidirectional article ↔ workspace edges
When the hourly graph rebuild runs, also scan web/src/pages/articles/*.astro for /lab/{slug} patterns. Add an article node type and featured_in edges. Render on both the lab page ("As featured in: …") and the article page (already there).
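A hedged sketch of the article scan — the `featured_in` edge dicts and the `LAB_LINK` pattern are hypothetical, matching the `/lab/{slug}` hrefs mentioned above:

```python
import re
from pathlib import Path

# Assumed href shape from the doc: <a href="/lab/{slug}">
LAB_LINK = re.compile(r'href="/lab/([a-z0-9-]+)"')

def article_edges(articles_dir: Path) -> list[dict]:
    """Scan article pages for lab links; emit featured_in edges.

    The edge dict layout mirrors the illustrative cache schema,
    not a confirmed format.
    """
    edges = []
    for page in sorted(articles_dir.glob("*.astro")):
        for slug in LAB_LINK.findall(page.read_text()):
            edges.append({
                "source": page.stem,   # article node id
                "target": slug,        # workspace node id
                "type": "featured_in",
            })
    return edges
```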
C. Workspace status colors
Color workspace nodes by status: complete=green, draft=gray, revised=blue, failed=red. The current uniform purple loses the most important state.
D. Edge weights from effect sizes
tested edges carry verdict but not magnitude. Use Cohen's d or |r| to thicken edges so the strongest findings draw the eye in /garden.
E. Replace mass datasource cluster with categories
The 286-node "datasource galaxy" component is dominated by FEMA/CB catalog entries. Group them into category supernodes ("FEMA disaster declarations", "Campaign Brain catalog"), so the rendering doesn't waste pixels on uninteresting bulk.
Tier 2 — bigger features
F. The Workspace Navigator (template page) — see below
A per-workspace landing page that shows the workspace as the center of its own mini-graph, with all artifacts navigable from one place.
G. Time-axis on the graph
Add a slider showing how the graph evolved month by month. Watch the WSPR cluster bloom in March, the Pulse arrive in April, the Granger network rewire after the HAC fix.
H. Pulse events as ephemeral nodes
When a quake or fireball comes through the Pulse, briefly render it as a node attached to its metric for 30 seconds, then fade it out. The /garden becomes a living organism.
I. Editor history as a layer
Overlay the Mike → Elise → Dana pipeline. Each accepted paper has an edit-distance score from its first draft to the published version, color-coded.
J. Cross-workspace dependency graph
Detect when one workspace's analysis script reads another workspace's data/results.json or data/*.parquet. Build hard depends_on edges (not just text-citation cites).
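A sketch of how such reads might be detected — the path regex is an assumption about how scripts reference sibling workspaces:

```python
import re

# Assumed path shape for cross-workspace reads, e.g.
# "../../workspaces/{slug}/data/spots.parquet".
CROSS_READ = re.compile(r"workspaces/([a-z0-9-]+)/data/[\w.*-]+")

def depends_on(slug: str, script_source: str) -> set[str]:
    """Slugs of other workspaces whose data/ files this script reads."""
    return {m for m in CROSS_READ.findall(script_source) if m != slug}

src = 'df = pd.read_parquet("../../workspaces/wspr-21year-census/data/spots.parquet")'
print(depends_on("wspr-station-pair-validation", src))
```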
Tier 3 — research-grade ambition
K. Findings as queryable nodes
Each finding could carry (metric_a, metric_b, lag, effect, p) as structured fields. Then "show me all the negative correlations between WSPR and any solar metric at lag <= 7 days" becomes a graph query, not a full-text search.
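For illustration, such a query over structured findings might look like this — the `Finding` dataclass and the sample values are hypothetical:

```python
from dataclasses import dataclass

# Hypothetical structured-finding record; fields follow the tuple
# (metric_a, metric_b, lag, effect, p) named in the text.
@dataclass
class Finding:
    metric_a: str
    metric_b: str
    lag_days: int
    effect: float
    p: float

findings = [
    Finding("wspr_snr_10m", "sunspot_number", 0, -0.29, 0.003),
    Finding("wspr_snr_10m", "kp_index", 14, -0.18, 0.04),
    Finding("river_discharge", "lunar_phase", 3, 0.12, 0.01),
]

# "Negative correlations involving a WSPR metric at lag <= 7 days"
# as a structured filter instead of a full-text search.
hits = [
    f for f in findings
    if f.metric_a.startswith("wspr_") and f.effect < 0 and f.lag_days <= 7
]
print(hits)
```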
L. The graph as input to the next paper
When a researcher (or Elise) starts a new paper, the graph already knows which metrics overlap in time, which pairs have been tested, which lags worked. Suggest experiments the graph hasn't tried yet.
Proposal: The Workspace Navigator
A template page that organizes everything around a single workspace — like a research mission control panel.
Goals
- One screen, everything about the workspace at a glance
- A small per-workspace knowledge graph (the workspace + everything it touches, ~30 nodes max)
- Quick navigation to all artifacts (paper PDF, scripts, data files, visualizations, results)
- Surface what's connected: which metrics, which findings, which other workspaces cite this one
- Where appropriate: surface upstream Pulse activity for the workspace's metrics
Layout sketch
┌──────────────────────────────────────────────────────────────────┐
│ WORKSPACE NAVIGATOR — wspr-station-pair-validation │
│ Two Competing Responses Hidden in the 10 m WSPR Anticorrelation │
│ Status: complete · Issue: #104 · Updated: 2026-04-07 │
├───────────────────────────────────┬──────────────────────────────┤
│ │ │
│ ┌─────────────────────────┐ │ ARTIFACTS │
│ │ │ │ ───────── │
│ │ [mini knowledge │ │ 📄 paper.pdf (8 pages) │
│ │ graph: this │ │ 📖 index.md (1.2k lines) │
│ │ workspace + │ │ 🐍 scripts/ │
│ │ immediate │ │ • extract.py │
│ │ neighbors │ │ • pair_one_pass1.py │
│ │ only ~30 nodes] │ │ • pair_one_pass2.py │
│ │ │ │ • analyze.py │
│ │ │ │ 📊 data/ │
│ └─────────────────────────┘ │ • qualifying_pairs.parquet
│ │ • pair_monthly_snr.parquet
│ FINDINGS │ • results.json │
│ ──────── │ • full_band_monthly.parquet
│ • short-path 10m: r=-0.61 │ 📈 www/ │
│ • long-path 10m: r=+0.54 │ • full-vs-filtered-r.html
│ • Fisher z=-13.2 (p≈7e-40) │ • 10m-short-vs-long.html
│ • 12m: filtered r drops to -0.12│ • pair-counts-by-band.html
│ • 20m: positive control passes │ • fisher-forest.html │
│ │ │
├───────────────────────────────────┴──────────────────────────────┤
│ RELATED │
│ ───────── │
│ Cites: wspr-solar-cycle-modulation, wspr-21year-census │
│ Cited by: (none yet) │
│ Featured in: "10.94 Billion Spots, One Wrong Sign" article │
│ │
│ METRICS USED (declared) │
│ ─────────────────────── │
│ wspr_snr_80m wspr_snr_40m wspr_snr_30m wspr_snr_20m │
│ wspr_snr_17m wspr_snr_15m wspr_snr_12m wspr_snr_10m │
│ sunspot_number │
│ │
│ PULSE LIVE (last 24h, filtered to this workspace's metrics) │
│ ────────────────────────────────────────────────────────────── │
│ (would show ham radio events, sunspot updates, etc.) │
└──────────────────────────────────────────────────────────────────┘
Implementation outline
Route: /lab/{slug}/navigator (sibling to the existing /lab/{slug})
Astro page: web/src/pages/lab/[slug]/navigator.astro
Server-side reads (all from filesystem at SSR, like the existing lab pages):
- `workspaces/{slug}/workspace.json` — title, status, issue, metrics, tags
- `workspaces/{slug}/index.md` — first 200 lines for the narrative pane
- `workspaces/{slug}/data/results.json` — extract structured findings
- `workspaces/{slug}/scripts/*.py` — list (with sizes)
- `workspaces/{slug}/data/*.parquet` — list (with sizes)
- `workspaces/{slug}/www/*.html` and `*.png` — list
- `workspaces/{slug}/paper/paper.pdf` — exists check + size
- `data/graph_cache.json` — extract subgraph rooted at this workspace, BFS depth 2
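The depth-2 subgraph extraction can be sketched as a plain BFS over the cached nodes and edges (the dict keys `id`, `source`, `target` are assumptions about the cache schema):

```python
from collections import deque

def subgraph(nodes, edges, root: str, depth: int = 2):
    """BFS to the given depth over an undirected view of the graph.

    Returns the nodes reached and the edges whose endpoints both
    survive -- a sketch of the navigator's mini-graph extraction.
    """
    adj: dict[str, set[str]] = {}
    for e in edges:
        adj.setdefault(e["source"], set()).add(e["target"])
        adj.setdefault(e["target"], set()).add(e["source"])
    seen, queue = {root}, deque([(root, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue  # don't expand past the depth limit
        for nbr in adj.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return (
        [n for n in nodes if n["id"] in seen],
        [e for e in edges if e["source"] in seen and e["target"] in seen],
    )
```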
Mini graph:
- Center node = the workspace
- Neighbors = metrics it uses, findings it produced, other workspaces it cites or that cite it
- Render with the same D3 simulation as `/garden`, but a smaller container (400×300)
- Click a metric → see all other workspaces that use the same metric
- Click a finding → highlight, show extracted r/p/N
Artifacts panel:
- Each script, data file, viz, paper file links to the API endpoint that serves it
- Sort by type (scripts → data → www → paper)
- Show sizes
Findings panel:
- Pull the top 5-10 findings from `results.json`
- Format as bullets with effect size + p-value
- Color-code by verdict (positive, negative, null)
Related panel:
- `cites` from the graph cache
- "cited by" via reverse lookup
- "featured in" from articles scan
Pulse Live panel (advanced):
- Subscribe to the existing WebSocket
- Filter messages: only show events whose `kind` matches a metric this workspace uses
- E.g., the WSPR navigator would highlight ham/solar events; the seismic navigator would surface earthquakes
- Same chip styling as the global Pulse Ticker, but workspace-scoped
Why this is the right next thing
- The lab page is the funnel. Right now it's a card grid with a thumbnail and 120 chars of description. The navigator turns each workspace into a navigable hub.
- Per-workspace mini-graphs visualize what's connected without overwhelming the user with the full 989-node garden.
- It teaches the graph by example. Anyone landing on the navigator immediately sees that workspaces use metrics, metrics come from datasources, and workspaces produce findings and cite other papers.
- It surfaces orphans. A workspace with no `metrics:` declared will show an empty mini-graph and a warning. Self-correcting documentation.
- It composes with the Pulse. The navigator can subscribe to the WebSocket and filter to its workspace's domain, turning every research page into a live monitor for its area of interest.
Phased delivery
Phase 1 (1 sprint item): Static navigator. Workspace metadata, artifacts list, findings from results.json, no graph yet.
Phase 2 (1 sprint item): Mini graph rendered server-side from data/graph_cache.json subgraph. D3 reuse from /garden.
Phase 3 (1 sprint item): Bidirectional article links + "cited by" reverse lookup.
Phase 4 (stretch): Pulse Live panel subscribed to the WebSocket, filtered by workspace metrics.
Open questions
- Should the navigator replace `/lab/{slug}` or live alongside it as `/lab/{slug}/navigator`?
- Do we want a public-facing "research dashboard" view that's the navigator with social-share affordances (Twitter, RSS, citation export)?
- Should the per-workspace pulse panel persist events to a per-workspace ring buffer the same way the global Pulse does?
Appendix: graph generation in code
src/terrapulse/lab/knowledge_graph.py
├── Node dataclass
├── Edge dataclass
├── KnowledgeGraph class with add_node/add_edge/find_metric_links
└── build_graph() entry point
├── reads PostgreSQL datasources
├── reads PostgreSQL distinct metrics
├── walks workspaces/*/
│ ├── reads workspace.json (title, metrics[]) ← FIXED 2026-04-08
│ ├── reads scripts/*.py (substring fallback)
│ ├── reads index.md (cites detection)
│ └── reads data/results.json (finding extraction)
└── returns KnowledgeGraph
scripts/regenerate_graph.py standalone caller, writes data/graph_cache.json
src/terrapulse/ingestion/scheduler.py APScheduler job, calls regenerate_graph hourly
data/graph_cache.json the live cache, served via SSR embed
web/src/pages/garden.astro the D3 frontend