
The TerraPulse Knowledge Graph

"Technically, this entire app is a knowledge graph."

Every fetcher, every workspace, every paper, every finding, every visualization is a node. Every ingestion, every analysis, every citation is an edge. The lab is a graph that produces papers; the papers are nodes that produce more edges.

Current state

| Metric | Value | Source |
| --- | --- | --- |
| Total nodes | 989 | data/graph_cache.json |
| Total edges | 1,481 | same |
| Connected components | 141 | most are tiny orphan datasource bundles |
| Main research cluster | 491 nodes | contains all 54 workspaces' research output |
| Workspaces (papers) | 54 | workspaces/*/workspace.json |
| Datasources | 456 | nearly all the upstream sources TerraPulse has ever ingested |
| Metrics | 196 | the canonical observation types in PostgreSQL |
| Findings | 283 | extracted from data/results.json files |

Node types

| Type | Count | What it represents |
| --- | --- | --- |
| datasource | 456 | An external data source the platform pulls from (USGS, NASA DONKI, EMSC, NMDB, ...) |
| metric | 196 | A canonical observation type stored in PostgreSQL (earthquake_magnitude, wspr_snr_40m, ...) |
| workspace | 54 | A research workspace (/workspaces/{slug}/) — usually maps to a paper |
| finding | 283 | An extracted statistical result (correlation, p-value, effect size) from a workspace's results.json |

Edge types

| Type | Count | Direction | Meaning |
| --- | --- | --- | --- |
| produces | 971 | datasource → metric | "USGS produces earthquake_magnitude" |
| uses | 246 | metric → workspace | "wspr-storm-corridor uses wspr_snr_40m" |
| tested | 237 | workspace → finding | "wspr-storm-corridor tested r=-0.29" |
| cites | 27 | workspace → workspace | "this paper references that paper's lab page" |

Architecture

The graph is rebuilt hourly by scripts/regenerate_graph.py (called from APScheduler) and served from data/graph_cache.json.
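One detail worth making explicit: because the frontend embeds data/graph_cache.json at SSR time, the hourly rebuild should swap the cache file atomically so a reader never sees a half-written JSON. A minimal sketch of the write-temp-then-rename pattern (the actual regenerate_graph.py may or may not do this today):

```python
import json
import os
import tempfile
from pathlib import Path


def write_graph_cache(graph: dict, cache_path: Path) -> None:
    """Write the rebuilt graph atomically so SSR readers never see a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(graph, f)
        # os.replace is atomic on POSIX: readers get the old file or the new
        # one, never a truncated mix.
        os.replace(tmp_name, cache_path)
    except BaseException:
        os.unlink(tmp_name)
        raise
```

The same pattern applies to any cache the scheduler regenerates in place.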

Data sources

The graph builder reads from four places:

  1. PostgreSQL datasources table — every active fetcher becomes a datasource node
  2. PostgreSQL observations.metric distinct values — every active metric becomes a metric node, linked to its datasource via a produces edge
  3. workspaces/*/workspace.json — title, metrics[], tags[], and status. The metrics[] declaration is what tells the graph "this workspace uses these metrics."
  4. workspaces/*/data/results.json — extracted statistical findings via pattern-matching on correlation_r, pearson_r, p_value, etc.
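The finding extraction in step 4 can be sketched as a recursive walk that treats any dict carrying a recognized statistic key as a finding. This assumes results.json is arbitrarily nested dicts/lists; the real key list and schema may be broader than the three keys named above:

```python
# Statistic keys the builder pattern-matches on (the "etc." in the text
# presumably covers more; these three are the ones the document names).
STAT_KEYS = {"correlation_r", "pearson_r", "p_value"}


def extract_findings(obj, path=""):
    """Recursively collect dicts that contain at least one statistic key."""
    findings = []
    if isinstance(obj, dict):
        hits = STAT_KEYS & obj.keys()
        if hits:
            findings.append({"path": path, **{k: obj[k] for k in hits}})
        for k, v in obj.items():
            findings += extract_findings(v, f"{path}/{k}")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            findings += extract_findings(v, f"{path}[{i}]")
    return findings
```

Each collected dict becomes a finding node; the path gives it a stable ID within the workspace.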

Rendering

The frontend at /garden (web/src/pages/garden.astro) uses D3 force-directed simulation:

  • Graph data is embedded at SSR time from data/graph_cache.json (no client API call)
  • D3 v7 is self-hosted at /d3.v7.min.js (no CDN dependency)
  • Loading spinner shows while D3 initializes
  • Click any node → highlight + connection panel
  • Filter by node type
  • Color-coded: datasources (blue), metrics (orange), workspaces (purple), findings (green)
  • Edge colors carry effect size (positive=green, negative=red, null=gray dashed)

What's connected, what isn't

Top connected workspaces (28 edges down to 24):

  • Drought-wildfire-AQI cascade (28)
  • Magnetic pole drift vs WSPR (26)
  • Lunar tidal signal in rivers (25)
  • Radon precursor hypothesis (24)
  • Solar Flux–Kp Index Time Lag (24)

Orphaned workspaces (3 edges or fewer):

  • wspr-21year-census (3) — declared metrics but the graph builder still misses some
  • Air Quality–Weather Coupling (2) — never declared metrics
  • Radiation Global Baseline (2)
  • California Streamflow (1)
  • WSPR Ionospheric Geography (0) — workspace exists but nothing in workspace.json points the graph builder at its dependencies

Detached datasource clusters (the 141 - 1 = 140 small components): mostly orphan upstream sources that were ingested once and never picked up by a workspace (FEMA disaster declarations, the bulk of the catalog).

The 2026-04-08 bug

The graph builder used to do substring matching on script files to detect which metrics a workspace used. It scanned scripts/extract.py and scripts/analyze.py for any string that matched a registered metric ID (e.g., wspr_snr_40m).

This silently failed whenever a workspace's analysis script used integer band codes (band == 7) instead of string metric names. The four WSPR papers (census, solar cycle, station-pair, ionospheric geography) wrote their analysis at the integer-band layer, so the graph builder never saw the metric strings, and the entire WSPR ecosystem became a 27-node detached island floating next to the main 487-node research cluster.

The fix was three lines: read the metrics[] array from workspace.json directly, plus five workspace.json patches to actually declare the metrics. The result: the 27-node WSPR orphan and the 19-node HamQSL orphan collapsed into the main cluster (487 → 540 nodes, 1,455 → 1,593 edges in a single rebuild).

The lesson: a graph is only as good as its declared edges. Make declarations explicit and authoritative; treat substring scans as a fallback, not the source of truth.
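The declaration-first rule can be captured in a few lines. This sketch assumes workspace.json parsed to a dict and script sources as strings; it also shows why the old scan went blind on integer band codes:

```python
def workspace_metrics(workspace_json: dict, script_texts: list,
                      registered_metrics: set) -> set:
    """Declared metrics win; the substring scan is only a fallback."""
    declared = set(workspace_json.get("metrics", []))
    if declared:
        # Authoritative path: intersect with registered IDs to drop typos.
        return declared & registered_metrics
    # Fallback: the pre-2026-04-08 substring scan. Blind to scripts that
    # work on integer band codes (band == 7) instead of metric-name strings.
    return {m for m in registered_metrics
            if any(m in text for text in script_texts)}
```

With no declaration and a band-code-only script, this returns an empty set, which is exactly how the WSPR island formed.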

What's missing today

These are real gaps the graph doesn't currently model:

  1. Articles ↛ workspaces — the /articles/* pages link to workspaces via <a href="/lab/{slug}">, but the graph builder doesn't see articles at all. Article exposure is invisible.
  2. Workspaces ↛ articles — backlinks don't exist in either direction.
  3. Pulse events ↛ metrics — the live ticker fires events about metrics every minute, but the graph treats events as ephemeral.
  4. Forbush events ↛ workspaces — the new Forbush detector creates forbush_event observations but no workspace is currently bound to that metric.
  5. Workspace dependencies on each other — only cites edges (text-mention from index.md). No "this paper depends on that paper's data extraction."
  6. The Pulse streamer ↛ datasources — 6 streaming sources (USGS, EMSC, NOAA SWPC, GOES, DSCOVR, CNEOS) push events but they're not modeled as a separate "live source" tier.
  7. Editor pipeline ↛ workspaces — Mike's reviews and Dana's copy edits are absent. The editorial graph (which papers got which review rounds, who flagged what) is not represented.

Ideas for improvement

Tier 1 — quick wins (1 sprint each)

A. Auto-discover metrics from analyze.py imports/queries

Beyond the workspace.json declaration, scan analyze.py for psql invocations with 'SELECT ... metric = ' patterns and SQL metric IN (...) clauses, catching the cases where the author forgot to declare them. Merge the results with the existing workspace.json source of truth.
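A first cut of that scan is two regexes. The patterns below are illustrative and assume the query style shown in the text (single-quoted metric IDs); real analyze.py scripts may format their SQL differently:

```python
import re

# Equality form: ... WHERE metric = 'wspr_snr_40m'
_EQ = re.compile(r"metric\s*=\s*'([a-z0-9_]+)'")
# Set form: ... WHERE metric IN ('sunspot_number', 'kp_index')
_IN = re.compile(r"metric\s+IN\s*\(([^)]*)\)", re.IGNORECASE)


def metrics_from_sql(script_text: str) -> set:
    """Pull metric IDs out of metric = '...' and metric IN (...) clauses."""
    found = set(_EQ.findall(script_text))
    for group in _IN.findall(script_text):
        found |= set(re.findall(r"'([a-z0-9_]+)'", group))
    return found
```

Anything this finds that is not in the declared metrics[] list is a candidate for a "forgot to declare" warning rather than a silent edge.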

B. Bidirectional article ↔ workspace edges

When the hourly graph rebuild runs, also scan web/src/pages/articles/*.astro for /lab/{slug} patterns. Add an article node type and featured_in edges. Render on both the lab page ("As featured in: …") and the article page (already there).
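The article scan is a small addition to the rebuild. A sketch, assuming the pages live under web/src/pages/articles/*.astro and link with href="/lab/{slug}" as described above; the exact attribute quoting is an assumption:

```python
import re
from pathlib import Path

# Assumed link shape in article pages; adjust if the Astro templates differ.
LAB_LINK = re.compile(r'href="/lab/([a-z0-9-]+)"')


def featured_in_edges(articles_dir: Path) -> list:
    """Scan article pages for /lab/{slug} links; return (article, workspace) pairs."""
    edges = []
    for page in sorted(articles_dir.glob("*.astro")):
        for slug in sorted(set(LAB_LINK.findall(page.read_text()))):
            edges.append((page.stem, slug))
    return edges
```

Each pair becomes a featured_in edge from a new article node to the workspace node, giving both directions for free at render time.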

C. Workspace status colors

Color workspace nodes by status: complete=green, draft=gray, revised=blue, failed=red. The current uniform purple hides the most important state.

D. Edge weights from effect sizes

tested edges carry verdict but not magnitude. Use Cohen's d or |r| to thicken edges so the strongest findings draw the eye in /garden.
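The mapping can be a one-liner on the builder side so the frontend just reads a width field. A sketch, assuming effects are normalized to [0, 1] (|r| already is; Cohen's d would need clamping, which this does anyway):

```python
from typing import Optional


def edge_width(effect: Optional[float], base: float = 1.0, span: float = 5.0) -> float:
    """Map |effect| in [0, 1] to a stroke width; null results stay at the base width."""
    if effect is None:
        return base
    return base + span * min(abs(effect), 1.0)
```

Stored on each tested edge at rebuild time, the D3 layer can bind it directly to stroke-width without recomputing anything client-side.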

E. Replace mass datasource cluster with categories

The 286-node "datasource galaxy" component is dominated by FEMA/CB catalog entries. Group them into category supernodes ("FEMA disaster declarations", "Campaign Brain catalog"), so the rendering doesn't waste pixels on uninteresting bulk.
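The collapse is a standard node-contraction pass: replace each bulk datasource with a category supernode and re-point (and dedupe) its edges. A sketch over the graph_cache.json shape, with the category function left as a caller-supplied assumption:

```python
from collections import defaultdict


def collapse_datasources(nodes, edges, category_of):
    """Replace datasource nodes with one supernode per category; re-point edges."""
    remap, grouped = {}, defaultdict(int)
    for n in nodes:
        if n["type"] == "datasource":
            cat = category_of(n["id"])
            remap[n["id"]] = f"cat:{cat}"
            grouped[cat] += 1
    kept = [n for n in nodes if n["type"] != "datasource"]
    kept += [{"id": f"cat:{c}", "type": "datasource_category", "members": k}
             for c, k in sorted(grouped.items())]
    seen, new_edges = set(), []
    for e in edges:
        src = remap.get(e["source"], e["source"])
        dst = remap.get(e["target"], e["target"])
        if (src, dst, e["type"]) not in seen:  # dedupe collapsed parallel edges
            seen.add((src, dst, e["type"]))
            new_edges.append({**e, "source": src, "target": dst})
    return kept, new_edges
```

The members count can drive supernode radius so the bulk is still visible, just not 286 separate dots.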

Tier 2 — bigger features

F. The Workspace Navigator (template page) — see below

A per-workspace landing page that shows the workspace as the center of its own mini-graph, with all artifacts navigable from one place.

G. Time-axis on the graph

Add a slider showing how the graph evolved month by month. Watch the WSPR cluster bloom in March, the Pulse arrive in April, the Granger network rewire after the HAC fix.

H. Pulse events as ephemeral nodes

When a quake or fireball comes through the Pulse, briefly render it as a node attached to its metric for 30 seconds, then fade it out. The /garden becomes a living organism.

I. Editor history as a layer

Overlay the Mike → Elise → Dana pipeline. Each accepted paper has an edit-distance score from its first draft to the published version, color-coded.

J. Cross-workspace dependency graph

Detect when one workspace's analysis script reads another workspace's data/results.json or data/*.parquet. Build hard depends_on edges (not just text-citation cites).
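Detection can piggyback on the existing script scan: any path reference into another workspace's data/ directory is a hard dependency. A sketch, assuming scripts reference data by the workspaces/{slug}/data/ path convention shown throughout this document:

```python
import re

# Path convention assumed from this document's layout; adjust if scripts
# use relative paths instead.
DATA_REF = re.compile(r"workspaces/([a-z0-9-]+)/data/")


def depends_on_edges(slug: str, script_texts: list) -> set:
    """Hard depends_on edges: this workspace's scripts read another workspace's data/."""
    targets = {m for text in script_texts for m in DATA_REF.findall(text)}
    return {(slug, t) for t in targets if t != slug}  # ignore self-references
```

Unlike cites, these edges are load-bearing: breaking the upstream extraction breaks the downstream paper.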

Tier 3 — research-grade ambition

K. Findings as queryable nodes

Each finding could carry (metric_a, metric_b, lag, effect, p) as structured fields. Then "show me all the negative correlations between WSPR and any solar metric at lag <= 7 days" becomes a graph query, not a full-text search.
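The example query above falls out directly once findings are structured. A sketch with an illustrative schema (field names and the solar-metric set are assumptions, not the current results.json layout):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    metric_a: str
    metric_b: str
    lag_days: int
    effect: Optional[float]  # r or Cohen's d; None for null results
    p: Optional[float]


# Assumed solar-metric grouping; only sunspot_number appears in this document.
SOLAR = {"sunspot_number", "kp_index", "f10_7_flux"}


def wspr_solar_negatives(findings: List[Finding], max_lag: int = 7) -> List[Finding]:
    """Graph-query stand-in: negative WSPR-vs-solar correlations at lag <= max_lag."""
    return [f for f in findings
            if f.metric_a.startswith("wspr_") and f.metric_b in SOLAR
            and f.effect is not None and f.effect < 0 and f.lag_days <= max_lag]
```

The point is that the predicate operates on typed fields, not on regexes over prose.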

L. The graph as input to the next paper

When a researcher (or Elise) starts a new paper, the graph already knows which metrics overlap in time, which pairs have been tested, which lags worked. Suggest experiments the graph hasn't tried yet.


Proposal: The Workspace Navigator

A template page that organizes everything around a single workspace — like a research mission control panel.

Goals

  1. One screen, everything about the workspace at a glance
  2. A small per-workspace knowledge graph (the workspace + everything it touches, ~30 nodes max)
  3. Quick navigation to all artifacts (paper PDF, scripts, data files, visualizations, results)
  4. Surface what's connected: which metrics, which findings, which other workspaces cite this one
  5. Where appropriate: surface upstream Pulse activity for the workspace's metrics

Layout sketch

┌──────────────────────────────────────────────────────────────────┐
│  WORKSPACE NAVIGATOR — wspr-station-pair-validation              │
│  Two Competing Responses Hidden in the 10 m WSPR Anticorrelation │
│  Status: complete  ·  Issue: #104  ·  Updated: 2026-04-07        │
├───────────────────────────────────┬──────────────────────────────┤
│                                   │                              │
│   ┌─────────────────────────┐    │  ARTIFACTS                   │
│   │                         │    │  ─────────                   │
│   │   [mini knowledge       │    │  📄 paper.pdf  (8 pages)     │
│   │    graph: this          │    │  📖 index.md   (1.2k lines)  │
│   │    workspace +          │    │  🐍 scripts/                 │
│   │    immediate            │    │      • extract.py            │
│   │    neighbors            │    │      • pair_one_pass1.py     │
│   │    only ~30 nodes]      │    │      • pair_one_pass2.py     │
│   │                         │    │      • analyze.py            │
│   │                         │    │  📊 data/                    │
│   └─────────────────────────┘    │      • qualifying_pairs.parquet
│                                   │      • pair_monthly_snr.parquet
│   FINDINGS                        │      • results.json          │
│   ────────                        │      • full_band_monthly.parquet
│   • short-path 10m: r=-0.61      │  📈 www/                     │
│   • long-path 10m: r=+0.54       │      • full-vs-filtered-r.html
│   • Fisher z=-13.2 (p≈7e-40)     │      • 10m-short-vs-long.html
│   • 12m: filtered r drops to -0.12│
│   • 20m: positive control passes │      • fisher-forest.html    │
│                                   │                              │
├───────────────────────────────────┴──────────────────────────────┤
│  RELATED                                                          │
│  ─────────                                                        │
│  Cites:    wspr-solar-cycle-modulation, wspr-21year-census       │
│  Cited by: (none yet)                                            │
│  Featured in: "10.94 Billion Spots, One Wrong Sign" article      │
│                                                                   │
│  METRICS USED (declared)                                          │
│  ───────────────────────                                          │
│  wspr_snr_80m  wspr_snr_40m  wspr_snr_30m  wspr_snr_20m          │
│  wspr_snr_17m  wspr_snr_15m  wspr_snr_12m  wspr_snr_10m          │
│  sunspot_number                                                   │
│                                                                   │
│  PULSE LIVE (last 24h, filtered to this workspace's metrics)     │
│  ──────────────────────────────────────────────────────────────  │
│  (would show ham radio events, sunspot updates, etc.)            │
└──────────────────────────────────────────────────────────────────┘

Implementation outline

Route: /lab/{slug}/navigator (sibling to the existing /lab/{slug})

Astro page: web/src/pages/lab/[slug]/navigator.astro

Server-side reads (all from filesystem at SSR, like the existing lab pages):

  • workspaces/{slug}/workspace.json — title, status, issue, metrics, tags
  • workspaces/{slug}/index.md — first 200 lines for the narrative pane
  • workspaces/{slug}/data/results.json — extract structured findings
  • workspaces/{slug}/scripts/*.py — list (with sizes)
  • workspaces/{slug}/data/*.parquet — list (with sizes)
  • workspaces/{slug}/www/*.html and *.png — list
  • workspaces/{slug}/paper/paper.pdf — exists check + size
  • data/graph_cache.json — extract subgraph rooted at this workspace, BFS depth 2
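The depth-2 extraction in the last bullet is an undirected BFS over the cached edge list. A sketch against the assumed graph_cache.json shape ({"nodes": [{"id": ...}], "edges": [{"source": ..., "target": ...}]}):

```python
from collections import deque


def subgraph(graph: dict, root: str, depth: int = 2) -> dict:
    """Keep nodes within `depth` undirected hops of root, plus the edges among them."""
    adj = {}
    for e in graph["edges"]:
        adj.setdefault(e["source"], set()).add(e["target"])
        adj.setdefault(e["target"], set()).add(e["source"])
    dist, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == depth:
            continue  # frontier node: keep it but do not expand further
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    keep = set(dist)
    return {"nodes": [n for n in graph["nodes"] if n["id"] in keep],
            "edges": [e for e in graph["edges"]
                      if e["source"] in keep and e["target"] in keep]}
```

Depth 2 from a workspace reaches its metrics, findings, and citations, then the datasources behind those metrics, which matches the ~30-node budget.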

Mini graph:

  • Center node = the workspace
  • Neighbors = metrics it uses, findings it produced, other workspaces it cites or that cite it
  • Render with the same D3 simulation as /garden, but a smaller container (400×300)
  • Click a metric → see all other workspaces that use the same metric
  • Click a finding → highlight, show extracted r/p/N

Artifacts panel:

  • Each script, data file, viz, paper file links to the API endpoint that serves it
  • Sort by type (scripts → data → www → paper)
  • Show sizes

Findings panel:

  • Pull the top 5-10 findings from results.json
  • Format as bullets with effect size + p-value
  • Color-code by verdict (positive, negative, null)

Related panel:

  • cites from the graph cache
  • "cited by" via reverse lookup
  • "featured in" from articles scan

Pulse Live panel (advanced):

  • Subscribe to the existing WebSocket
  • Filter messages: only show events whose kind matches a metric this workspace uses
  • E.g., the WSPR navigator would highlight ham/solar events; the seismic navigator would surface earthquakes
  • Same chip styling as the global Pulse Ticker, but workspace-scoped
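The filtering step reduces to a predicate closed over the workspace's declared metrics. A sketch that assumes each Pulse event dict carries a metric field naming the observation type it updates; the real payload shape (and the kind-to-metric mapping it may need) could differ:

```python
def workspace_event_filter(declared_metrics):
    """Build a predicate keeping only Pulse events relevant to one workspace.

    Assumes events look like {"kind": ..., "metric": ...}; if the live
    payload only carries `kind`, a kind-to-metric lookup goes here instead.
    """
    metrics = set(declared_metrics)

    def keep(event: dict) -> bool:
        return event.get("metric") in metrics

    return keep
```

The WebSocket handler applies the predicate per message, so the navigator reuses the global stream without any server-side changes.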

Why this is the right next thing

  1. The lab page is the funnel. Right now it's a card grid with a thumbnail and 120 chars of description. The navigator turns each workspace into a navigable hub.
  2. Per-workspace mini-graphs visualize what's connected without overwhelming the user with the full 989-node garden.
  3. It teaches the graph by example. Anyone landing on the navigator immediately sees that workspaces have metrics, metrics come from datasources, findings cite other findings.
  4. It surfaces orphans. A workspace with no declared metrics will show an empty mini-graph and a warning. Self-correcting documentation.
  5. It composes with the Pulse. The navigator can subscribe to the WebSocket and filter to its workspace's domain, turning every research page into a live monitor for its area of interest.

Phased delivery

Phase 1 (1 sprint item): Static navigator. Workspace metadata, artifacts list, findings from results.json, no graph yet.

Phase 2 (1 sprint item): Mini graph rendered server-side from data/graph_cache.json subgraph. D3 reuse from /garden.

Phase 3 (1 sprint item): Bidirectional article links + "cited by" reverse lookup.

Phase 4 (stretch): Pulse Live panel subscribed to the WebSocket, filtered by workspace metrics.

Open questions

  • Should the navigator replace /lab/{slug} or live alongside it as /lab/{slug}/navigator?
  • Do we want a public-facing "research dashboard" view that's the navigator with social-share affordances (Twitter, RSS, citation export)?
  • Should the per-workspace pulse panel persist events to a per-workspace ring buffer the same way the global Pulse does?

Appendix: graph generation in code

src/terrapulse/lab/knowledge_graph.py
  ├── Node                              dataclass
  ├── Edge                              dataclass
  ├── KnowledgeGraph                    class with add_node/add_edge/find_metric_links
  └── build_graph()                     entry point
       ├── reads PostgreSQL datasources
       ├── reads PostgreSQL distinct metrics
       ├── walks workspaces/*/
       │    ├── reads workspace.json (title, metrics[])    ← FIXED 2026-04-08
       │    ├── reads scripts/*.py (substring fallback)
       │    ├── reads index.md (cites detection)
       │    └── reads data/results.json (finding extraction)
       └── returns KnowledgeGraph

scripts/regenerate_graph.py             standalone caller, writes data/graph_cache.json
src/terrapulse/ingestion/scheduler.py   APScheduler job, calls regenerate_graph hourly
data/graph_cache.json                   the live cache, served via SSR embed
web/src/pages/garden.astro              the D3 frontend