Paper Machine Agent — Domain Knowledge Context
This file is the "brain" of the Paper Machine Agent (PMA). It teaches an AI agent how to conduct rigorous cross-domain environmental research on the TerraPulse platform. Load this file into agent context before any research task.
Platform: TerraPulse — 29M observations, 113 metrics, 14 scientific domains
Database: PostgreSQL + PostGIS at localhost:5433
1. Data Access
Connection
Always use synchronous psycopg2 for data extraction scripts. The async SQLAlchemy engine is for the API server only — research scripts run in batch and psycopg2 is simpler and more reliable for one-off queries.
```python
import psycopg2
import numpy as np
import polars as pl
from pathlib import Path

PG_DSN = "postgresql://terrapulse:terrapulse@localhost:5433/terrapulse"
DATA_DIR = Path(__file__).resolve().parent.parent / "data"

conn = psycopg2.connect(PG_DSN)
cur = conn.cursor()
```
Core Query Patterns
List all available metrics with counts:
```python
cur.execute("""
    SELECT metric, COUNT(*) AS n,
           MIN(timestamp_utc) AS first_obs,
           MAX(timestamp_utc) AS last_obs
    FROM observations
    WHERE value IS NOT NULL
    GROUP BY metric
    ORDER BY n DESC
""")
metrics = cur.fetchall()
for m, n, first, last in metrics:
    print(f"  {m:40s} {n:>10,}  {first} → {last}")
```
Daily aggregation (the most common pattern):
```python
cur.execute("""
    SELECT timestamp_utc::date AS day,
           COUNT(*) AS n,
           AVG(CAST(value AS DOUBLE PRECISION)) AS avg_val,
           MAX(CAST(value AS DOUBLE PRECISION)) AS max_val,
           MIN(CAST(value AS DOUBLE PRECISION)) AS min_val
    FROM observations
    WHERE metric = %s AND value IS NOT NULL
    GROUP BY 1
    ORDER BY 1
""", (metric_name,))
rows = cur.fetchall()
df = pl.DataFrame({
    "day": [r[0] for r in rows],
    "n": [r[1] for r in rows],
    "avg": [float(r[2]) for r in rows],
    "max": [float(r[3]) for r in rows],
    "min": [float(r[4]) for r in rows],
})
```
Hourly aggregation:
```python
cur.execute("""
    SELECT date_trunc('hour', timestamp_utc) AS hour,
           COUNT(*) AS n,
           AVG(CAST(value AS DOUBLE PRECISION)) AS avg_val
    FROM observations
    WHERE metric = %s AND value IS NOT NULL
      AND timestamp_utc >= %s AND timestamp_utc < %s
    GROUP BY 1
    ORDER BY 1
""", (metric_name, start_utc, end_utc))
```
Earthquake deduplication (REQUIRED before any earthquake analysis):
Earthquake data contains duplicates from overlapping USGS feed windows. Always deduplicate by rounding to the nearest minute and grouping by lat/lon:
```python
cur.execute("""
    SELECT DISTINCT ON (
        date_trunc('minute', timestamp_utc),
        ROUND(CAST(latitude AS NUMERIC), 2),
        ROUND(CAST(longitude AS NUMERIC), 2)
    )
        timestamp_utc,
        CAST(value AS DOUBLE PRECISION) AS magnitude,
        latitude,
        longitude
    FROM observations
    WHERE metric = 'earthquake_magnitude' AND value IS NOT NULL
    ORDER BY date_trunc('minute', timestamp_utc),
             ROUND(CAST(latitude AS NUMERIC), 2),
             ROUND(CAST(longitude AS NUMERIC), 2),
             timestamp_utc
""")
eq_rows = cur.fetchall()
```
Data Guards
These guards prevent runaway queries, bad data, and out-of-memory errors:
- **LIMIT 1M rows** — Never fetch more than 1 million rows in a single query without explicit justification. For larger datasets, aggregate server-side.
- **Filter nulls** — Always include `WHERE value IS NOT NULL` unless you are specifically studying data completeness.
- **Filter > 0 for continuous metrics** — Many continuous metrics (streamflow, radiation, wave height) have physical meaning only when positive. Zero values often indicate a sensor that is off or no-data:

  ```sql
  WHERE metric = 'streamflow_00060'
    AND value IS NOT NULL
    AND CAST(value AS DOUBLE PRECISION) > 0
  ```

- **Subsample > 100K** — When fitting distributions or running bootstrap tests on datasets larger than 100,000 rows, subsample first:

  ```python
  if len(values) > 100_000:
      values = np.random.default_rng(42).choice(values, 100_000, replace=False)
      print("Subsampled to 100,000 for fitting (seed=42)")
  ```

- **Close connections** — Always close the cursor and connection:

  ```python
  cur.close()
  conn.close()
  ```
Key Metrics Reference Table
| Domain | Metric Name | Records | Notes |
|---|---|---|---|
| Solar / Space Weather | solar_xray_flux | 9.2M | GOES X-ray flux (5-min cadence) |
| | solar_wind_speed | 2.8M | DSCOVR solar wind (km/s) |
| | solar_wind_bz | 2.6M | IMF Bz component (nT, southward = negative) |
| | space_kp_index | 160K | Planetary K-index (0-9 scale) |
| | space_solar_flux_10cm | 84K | F10.7 solar radio flux |
| | sunspot_number | 76K | Daily sunspot number (1818-2026) |
| | donki_cme_speed | 18K | CME speed from NASA DONKI (km/s) |
| | donki_flare_class | 2.5K | Solar flare class (C/M/X encoded) |
| | donki_storm_kp | 788 | Geomagnetic storm peak Kp |
| | donki_high_speed_stream | 580 | High-speed solar wind streams |
| Seismology | earthquake_magnitude | 175K | USGS earthquakes (2021-2026, global) |
| Severe Weather (NWS) | nws_* (40+ types) | 471K | NWS alerts by type (3-day rolling window) |
| Severe Weather (SPC) | spc_tornado | 70K | Historical tornadoes (1950-2023) |
| | spc_hail | 29K | Hail reports |
| | spc_wind | 7.7K | Wind damage reports |
| Meteorology | temperature_2m | 56K | Open-Meteo temperature (10 cities) |
| | geosphere_temperature | 9.6K | GeoSphere Austria stations |
| | nasa_power_t2m | 11K | NASA POWER reanalysis temperature |
| Air Quality | us_aqi | 55K | US AQI from Open-Meteo |
| | geosphere_pm25 | 7.4K | PM2.5 from GeoSphere Austria |
| Hydrology / Oceanography | streamflow_00060 | 2.4M | USGS streamflow (cfs, 15-min) |
| | water_level | 247K | NOAA tide gauge water levels |
| | wave_height | 22K | Open-Meteo wave height (m) |
| | river_discharge | 1.5K | Open-Meteo river discharge |
| Radiation | radiation | 1.4M | Safecast citizen science (CPM) |
| Volcanology | volcanic_eruption | 5K | Smithsonian GVP eruptions |
| | volcano_alert | 5.9K | USGS volcano alerts |
| | holocene_volcano | 1.2K | Holocene volcano catalog |
| Near-Earth Objects | neo_close_approach | 25K | NASA NEO close approaches |
| | neo_fireball | 541 | CNEOS fireball database (1998-2026) |
| Gravitational Waves | gw_merger | 368 | LIGO/Virgo/KAGRA confirmed mergers |
| | gw_observing_run | 2.1K | Detector observing run metadata |
| Optical Transients | ztf_sn_candidate | 1.4K | ZTF supernova candidates |
| | ztf_kilonova_candidate | 395 | ZTF kilonova candidates |
| Climate Indices | climate_oni | 1.8K | Oceanic Nino Index (1949-2025) |
| | climate_mei_v2 | 1.1K | Multivariate ENSO Index (1979-2025) |
| | drought_area_pct | 13K | USDM drought area % by state |
| Atmospheric Composition | co2_daily | 16K | Daily CO2 concentration |
| | ch4_monthly | 509 | Monthly methane concentration |
| Satellite | sar_scene | 7.6K | Sentinel-1 SAR scene catalog |
| Socioeconomic | wb_SP.POP.TOTL | 3.3K | World Bank population |
| | wb_AG.LND.FRST.ZS | 2.8K | World Bank forest area % |
| | wb_EG.USE.PCAP.KG.OE | 2.8K | World Bank energy use per capita |
| Geomagnetic (NEW) | dst_index | 400+ | Dst ring current (1963-present, hourly, CDAWeb HAPI) |
| | emsc_magnitude | ~340/day | EMSC seismicity (M0.8+ Europe/Middle East) |
| | gfz_magnitude | ~5/hr | GFZ GEOFON (M2+ global, focal mechanisms) |
| | isc_magnitude | ~27/6hr | ISC aggregated (496 agencies worldwide) |
| | mag_field_* | varies | INTERMAGNET observatories (X/Y/Z/F nT, hourly) |
| | magnetic_pole_north | 435 | North magnetic pole lat/lon (1590-2025) |
| | magnetic_pole_south | 435 | South magnetic pole lat/lon (1590-2025) |
| Ionosphere (NEW) | wspr_snr_* (18 bands) | ~18/hr | WSPR propagation (mean SNR per band per hour) |
| | ham_band_* (4 bands) | ~4/3hr | HamQSL band conditions (Good/Fair/Poor/Closed) |
| | ham_kindex, ham_solarflux, etc. | ~7/3hr | HamQSL solar/geomagnetic indices |
| UAP | ufo_sighting | 80K | NUFORC sighting reports (1906-2014) |
Data Indices (NEW)
Two pre-built indices enable cross-domain queries without touching PostgreSQL:
- **Timeline Index** (`data/timeline.parquet`): 8,128 rows (one per day, 2004-2026), 23 columns. Query any date range to get earthquake counts, Kp, solar wind, X-ray class, WSPR SNR, NWS alerts, fireballs, radiation, sunspots, volcanoes — all in one row. Refreshed hourly.
- **GeoIndex** (`data/geoindex.parquet`): 9,584 cells at 1° resolution. Query any lat/lon bounding box to find what data exists there and which metrics are available. Useful for spatial context of events.
WSPR Denoised Data (NEW)
For any WSPR study, use the denoised datasets on /mnt/ursa:
- `corridors_denoised.parquet` — 3.3M rows, 6 geographic corridors, hourly, noise-corrected
- `daily_denoised.parquet` — 109K rows, 7 HF bands, 21 years, noise-corrected
- `noise_baseline.parquet` — median noise per band per hour-of-day

Key columns added by denoising: `adjusted_snr` (noise-corrected), `signal_level` (receiver-independent), `noise_anomaly` (deviation from baseline).
2. Statistical Methods
Every method below includes production-ready Python code tested on TerraPulse
data. Copy these directly into analyze.py scripts.
2.1 Correlation (Pearson + Spearman)
Always report BOTH Pearson (linear) and Spearman (monotonic rank) correlations. Disagreement between them indicates nonlinearity.
```python
from scipy import stats as sp_stats

def correlate(x: np.ndarray, y: np.ndarray, label: str = "") -> dict:
    """Compute Pearson and Spearman correlations with sample size.

    Always report both. Disagreement flags nonlinearity.
    """
    # Drop pairs where either is NaN
    mask = ~(np.isnan(x) | np.isnan(y))
    x_clean, y_clean = x[mask], y[mask]
    n = len(x_clean)
    if n < 10:
        print(f"  {label}: N={n} — TOO FEW for correlation")
        return {"n": n, "pearson_r": None, "spearman_rho": None}
    r_pearson, p_pearson = sp_stats.pearsonr(x_clean, y_clean)
    rho_spearman, p_spearman = sp_stats.spearmanr(x_clean, y_clean)
    print(f"  {label}: N={n}")
    print(f"    Pearson  r={r_pearson:.4f}, p={p_pearson:.6f}, R²={r_pearson**2:.4f}")
    print(f"    Spearman ρ={rho_spearman:.4f}, p={p_spearman:.6f}")
    return {
        "n": n,
        "pearson_r": round(float(r_pearson), 4),
        "pearson_p": round(float(p_pearson), 6),
        "pearson_r_squared": round(float(r_pearson**2), 4),
        "spearman_rho": round(float(rho_spearman), 4),
        "spearman_p": round(float(p_spearman), 6),
        "label": label,
    }
```
2.2 Schuster Test (Phase Analysis)
The classical test for periodic triggering (e.g., tidal forcing of earthquakes). Projects event phases onto a unit circle — if events cluster at a preferred phase, the resultant vector is long.
```python
def schuster_test(phases: np.ndarray, label: str = "") -> dict:
    """Schuster test for phase clustering.

    phases: array of phase values in [0, 1) where 0=start, 1=full cycle.
    Under the null (uniform distribution), D²/N ~ exponential(1).

    Returns:
        D²: squared resultant length
        p: exp(-D²/N) — probability of this alignment by chance
        R: mean resultant length (0=uniform, 1=perfect alignment)
        mean_phase: the preferred phase (if significant)
    """
    N = len(phases)
    cos_sum = np.sum(np.cos(2 * np.pi * phases))
    sin_sum = np.sum(np.sin(2 * np.pi * phases))
    D2 = cos_sum**2 + sin_sum**2
    p = np.exp(-D2 / N)
    R = np.sqrt(D2) / N
    mean_phase = np.arctan2(sin_sum, cos_sum) / (2 * np.pi) % 1.0
    print(f"  Schuster test ({label}): N={N:,}")
    print(f"    D² = {D2:.1f}")
    print(f"    p = exp(-D²/N) = {p:.6e}")
    print(f"    R = {R:.6f} (mean resultant length)")
    print(f"    Mean phase = {mean_phase:.4f}")
    print(f"    {'SIGNIFICANT' if p < 0.05 else 'Not significant'}")
    return {
        "n": N,
        "D_squared": round(float(D2), 2),
        "p_value": float(p),
        "R": round(float(R), 6),
        "mean_phase": round(float(mean_phase), 4),
        "significant": p < 0.05,
        "label": label,
    }
```
2.3 Granger Causality (MLE with F-test)
Tests whether lagged values of X improve prediction of Y beyond Y's own history. This is "predictive causality" — not true causation.
```python
def granger_causality_test(x: np.ndarray, y: np.ndarray, max_lag: int = 7) -> dict:
    """Granger causality F-test: does X help predict Y?

    Compares:
        Restricted:   Y_t = a0 + sum(a_i * Y_{t-i}) + e
        Unrestricted: Y_t = a0 + sum(a_i * Y_{t-i}) + sum(b_i * X_{t-i}) + e
    Tests lags 1 through max_lag, returns the best (lowest p-value).
    """
    best = {"lag": 1, "f_stat": 0.0, "p_value": 1.0}
    for lag in range(1, max_lag + 1):
        n = len(y) - lag
        if n < 2 * lag + 10:  # Need sufficient degrees of freedom
            continue
        # Dependent variable
        Y = y[lag:]
        # Restricted model: Y on its own lags + intercept
        Y_lags = np.column_stack([y[lag - i - 1:len(y) - i - 1] for i in range(lag)])
        Y_lags = np.column_stack([np.ones(n), Y_lags])
        try:
            beta_r, _, _, _ = np.linalg.lstsq(Y_lags, Y, rcond=None)
            rss_r = np.sum((Y - Y_lags @ beta_r) ** 2)
        except np.linalg.LinAlgError:
            continue
        # Unrestricted model: Y on Y-lags + X-lags
        X_lags = np.column_stack([x[lag - i - 1:len(x) - i - 1] for i in range(lag)])
        Z = np.column_stack([Y_lags, X_lags])
        try:
            beta_u, _, _, _ = np.linalg.lstsq(Z, Y, rcond=None)
            rss_u = np.sum((Y - Z @ beta_u) ** 2)
        except np.linalg.LinAlgError:
            continue
        # F-test
        df1 = lag              # additional parameters in unrestricted
        df2 = n - 2 * lag - 1  # residual DoF
        if df2 <= 0 or rss_u <= 0:
            continue
        f_stat = ((rss_r - rss_u) / df1) / (rss_u / df2)
        p_value = 1 - sp_stats.f.cdf(f_stat, df1, df2)
        if p_value < best["p_value"]:
            best = {"lag": lag, "f_stat": float(f_stat), "p_value": float(p_value)}
    return best
```
2.4 Superposed Epoch Analysis
Stack multiple events on a common timeline to detect systematic pre- and post-event patterns. The gold standard for "does X respond to Y events?"
```python
def superposed_epoch(event_times: list, series_times: np.ndarray,
                     series_values: np.ndarray, window_days: int = 7,
                     label: str = "") -> dict:
    """Superposed epoch analysis with statistical tests.

    event_times: list of datetime objects (the "key dates")
    series_times: array of datetime objects for the response variable
    series_values: array of float values for the response variable
    window_days: half-window size (±window_days around each event)

    Stacks all events and compares each epoch day to the pre-event baseline.
    """
    # Convert to days relative to each event
    epoch_matrix = []
    for event_t in event_times:
        offsets = []
        vals = []
        for t, v in zip(series_times, series_values):
            delta = (t - event_t).total_seconds() / 86400.0
            if -window_days <= delta <= window_days:
                offsets.append(round(delta))
                vals.append(v)
        if len(offsets) > 0:
            day_dict = {}
            for d, v in zip(offsets, vals):
                day_dict.setdefault(d, []).append(v)
            # Average within each day for this event
            epoch = {d: np.mean(vs) for d, vs in day_dict.items()}
            epoch_matrix.append(epoch)
    if len(epoch_matrix) < 3:
        print(f"  SEA ({label}): Only {len(epoch_matrix)} events stacked — too few")
        return {"n_events": len(epoch_matrix), "significant": False}
    # Build aligned matrix: columns = epoch days, rows = events
    days = list(range(-window_days, window_days + 1))
    matrix = np.full((len(epoch_matrix), len(days)), np.nan)
    for i, epoch in enumerate(epoch_matrix):
        for j, d in enumerate(days):
            if d in epoch:
                matrix[i, j] = epoch[d]
    # Epoch averages
    epoch_means = np.nanmean(matrix, axis=0)
    epoch_stds = np.nanstd(matrix, axis=0)
    # Baseline: average of pre-event days [-window_days, -1]
    pre_cols = [j for j, d in enumerate(days) if -window_days <= d < 0]
    pre_values = matrix[:, pre_cols].flatten()
    pre_values = pre_values[~np.isnan(pre_values)]
    baseline_mean = np.mean(pre_values)
    baseline_std = np.std(pre_values)
    # Test each post-event day against baseline
    results_by_day = {}
    for j, d in enumerate(days):
        if d < 0:
            continue
        col = matrix[:, j]
        col_clean = col[~np.isnan(col)]
        if len(col_clean) < 3:
            continue
        t_stat, t_p = sp_stats.ttest_1samp(col_clean, baseline_mean)
        u_stat, u_p = sp_stats.mannwhitneyu(col_clean, pre_values,
                                            alternative='two-sided')
        results_by_day[d] = {
            "mean": round(float(np.mean(col_clean)), 4),
            "n": len(col_clean),
            "t_stat": round(float(t_stat), 3),
            "t_p": round(float(t_p), 6),
            "mann_whitney_p": round(float(u_p), 6),
        }
    print(f"  SEA ({label}): {len(epoch_matrix)} events stacked, ±{window_days} days")
    print(f"    Baseline (pre-event): mean={baseline_mean:.4f}, std={baseline_std:.4f}")
    for d, r in sorted(results_by_day.items()):
        sig = "***" if r["t_p"] < 0.001 else "**" if r["t_p"] < 0.01 else "*" if r["t_p"] < 0.05 else ""
        print(f"    Day {d:+d}: mean={r['mean']:.4f} t={r['t_stat']:.2f} p={r['t_p']:.4f} {sig}")
    return {
        "n_events": len(epoch_matrix),
        "baseline_mean": round(float(baseline_mean), 4),
        "baseline_std": round(float(baseline_std), 4),
        "epoch_means": {d: round(float(m), 4) for d, m in zip(days, epoch_means)},
        "post_event_tests": results_by_day,
        "label": label,
    }
```
2.5 Distribution Fitting (Clauset Power Law + Alternatives)
Following Clauset et al. (2009): estimate x_min via KS minimization, fit power law + exponential + lognormal + Weibull on the same tail, compare with Vuong likelihood ratio test, and validate with bootstrap GOF.
```python
def fit_power_law(values: np.ndarray, bootstrap_n: int = 500,
                  label: str = "") -> dict:
    """Clauset et al. (2009) power law fitting with alternatives.

    1. Estimate x_min by minimizing the KS statistic
    2. MLE fit for power law, exponential, lognormal, Weibull
    3. Vuong pairwise comparison
    4. Bootstrap goodness-of-fit (p < 0.1 = reject power law)

    Returns dict with all fit parameters and the best model.
    """
    from scipy.special import erfc
    values = values[values > 0]  # Power law requires positive values

    # --- Step 1: Estimate x_min ---
    unique_vals = np.sort(np.unique(values))
    if len(unique_vals) > 500:
        candidates = unique_vals[np.linspace(0, len(unique_vals) - 2, 200).astype(int)]
    else:
        candidates = unique_vals[:-1]
    best_xmin, best_alpha, best_ks = candidates[0], 2.0, 1.0
    for xm in candidates:
        tail = values[values >= xm]
        n = len(tail)
        if n < 10:
            continue
        alpha = 1 + n / np.sum(np.log(tail / xm))
        if alpha <= 1:
            continue
        # KS statistic
        tail_sorted = np.sort(tail)
        cdf_theory = 1 - (tail_sorted / xm) ** (1 - alpha)
        cdf_empirical = np.arange(1, n + 1) / n
        ks = np.max(np.abs(cdf_empirical - cdf_theory))
        if ks < best_ks:
            best_ks, best_xmin, best_alpha = ks, xm, alpha
    tail = values[values >= best_xmin]
    n_tail = len(tail)

    # --- Step 2: Fit alternatives on the same tail ---
    # Exponential MLE: lambda = 1 / mean(x - x_min)
    shifted = tail - best_xmin
    shifted = shifted[shifted > 0]
    exp_lambda = 1.0 / np.mean(shifted) if len(shifted) > 0 else 1.0
    exp_ks = sp_stats.kstest(shifted, 'expon', args=(0, 1 / exp_lambda)).statistic
    # Lognormal MLE
    log_tail = np.log(tail)
    ln_mu, ln_sigma = np.mean(log_tail), np.std(log_tail)
    ln_ks = sp_stats.kstest(tail, 'lognorm', args=(ln_sigma, 0, np.exp(ln_mu))).statistic
    # Weibull MLE
    try:
        wb_params = sp_stats.weibull_min.fit(tail, floc=0)
        wb_k = wb_params[0]
        wb_ks = sp_stats.kstest(tail, 'weibull_min', args=wb_params).statistic
    except Exception:
        wb_k, wb_ks = None, 1.0

    # --- Step 3: Vuong likelihood ratio tests ---
    def vuong(loglik_ratios):
        n = len(loglik_ratios)
        if n < 10:
            return 0.0, 1.0, "inconclusive"
        R = np.sum(loglik_ratios)
        sigma = np.std(loglik_ratios, ddof=1)
        if sigma == 0:
            return R, 1.0, "inconclusive"
        stat = R / (sigma * np.sqrt(n))
        p = float(erfc(abs(stat) / np.sqrt(2)))
        if p > 0.1:
            return float(R), p, "inconclusive"
        return float(R), p, ("power_law" if R > 0 else "alternative")

    # Power law log-likelihoods per point
    pl_ll = (np.log(best_alpha - 1) - np.log(best_xmin)
             - best_alpha * np.log(tail / best_xmin))
    exp_ll = np.log(exp_lambda) - exp_lambda * (tail - best_xmin)
    ln_ll = sp_stats.lognorm.logpdf(tail, ln_sigma, 0, np.exp(ln_mu))
    comparisons = {}
    R_exp, p_exp, fav_exp = vuong(pl_ll - exp_ll)
    comparisons["vs_exponential"] = {"R": round(R_exp, 2), "p": round(p_exp, 4), "favors": fav_exp}
    R_ln, p_ln, fav_ln = vuong(pl_ll - ln_ll)
    comparisons["vs_lognormal"] = {"R": round(R_ln, 2), "p": round(p_ln, 4), "favors": fav_ln}

    # --- Step 4: Bootstrap GOF ---
    observed_ks = best_ks
    frac_tail = n_tail / len(values)
    count_ge = 0
    rng = np.random.default_rng(42)
    for _ in range(bootstrap_n):
        n_synth_tail = rng.binomial(len(values), frac_tail)
        if n_synth_tail < 10:
            continue
        synth_tail = best_xmin * (1 - rng.random(n_synth_tail)) ** (-1 / (best_alpha - 1))
        synth_alpha = 1 + n_synth_tail / np.sum(np.log(synth_tail / best_xmin))
        synth_sorted = np.sort(synth_tail)
        cdf_t = 1 - (synth_sorted / best_xmin) ** (1 - synth_alpha)
        cdf_e = np.arange(1, n_synth_tail + 1) / n_synth_tail
        synth_ks = np.max(np.abs(cdf_e - cdf_t))
        if synth_ks >= observed_ks:
            count_ge += 1
    bootstrap_p = count_ge / bootstrap_n

    # --- Determine best fit ---
    fits = {"power_law": best_ks, "exponential": exp_ks, "lognormal": ln_ks}
    if wb_k is not None:
        fits["weibull"] = wb_ks
    best_fit = min(fits, key=fits.get)
    if bootstrap_p < 0.1:
        best_fit = min({k: v for k, v in fits.items() if k != "power_law"},
                       key=lambda k: fits[k], default=best_fit)
    print(f"  Distribution fit ({label}): N={len(values):,}, tail N={n_tail:,}")
    print(f"    x_min = {best_xmin}")
    print(f"    Power Law:   alpha={best_alpha:.3f}, KS={best_ks:.4f}, bootstrap_p={bootstrap_p:.3f}")
    print(f"    Exponential: lambda={exp_lambda:.4f}, KS={exp_ks:.4f}")
    print(f"    Lognormal:   mu={ln_mu:.3f}, sigma={ln_sigma:.3f}, KS={ln_ks:.4f}")
    if wb_k is not None:
        print(f"    Weibull:     k={wb_k:.3f}, KS={wb_ks:.4f}")
    print(f"    Vuong tests: {comparisons}")
    print(f"    >>> BEST FIT: {best_fit}")
    return {
        "n": len(values),
        "n_tail": n_tail,
        "x_min": float(best_xmin),
        "power_law": {"alpha": round(best_alpha, 3), "ks": round(best_ks, 4),
                      "bootstrap_p": round(bootstrap_p, 3)},
        "exponential": {"lambda": round(exp_lambda, 4), "ks": round(exp_ks, 4)},
        "lognormal": {"mu": round(ln_mu, 3), "sigma": round(ln_sigma, 3),
                      "ks": round(ln_ks, 4)},
        "weibull": {"k": round(wb_k, 3) if wb_k else None,
                    "ks": round(wb_ks, 4) if wb_k else None},
        "comparisons": comparisons,
        "best_fit": best_fit,
        "label": label,
    }
```
2.6 Multiple Comparisons (Bonferroni Correction)
When testing more than 3 pairs, apply Bonferroni correction to control the family-wise error rate. This is non-negotiable.
```python
def bonferroni_threshold(n_pairs: int, alpha: float = 0.05) -> float:
    """Compute the Bonferroni-corrected significance threshold.

    With 72 pairs, the threshold becomes 0.05/72 = 0.000694.
    A result at p=0.01 that "looks significant" is NOT after Bonferroni.
    """
    threshold = alpha / n_pairs
    print(f"  Bonferroni: {n_pairs} tests, alpha={alpha} → threshold={threshold:.6f}")
    return threshold

def apply_bonferroni(results: list[dict], alpha: float = 0.05) -> list[dict]:
    """Mark results as surviving or not surviving Bonferroni correction.

    Each result dict must have a 'p_value' key.
    """
    n = len(results)
    threshold = bonferroni_threshold(n, alpha)
    surviving = 0
    for r in results:
        r["bonferroni_threshold"] = threshold
        r["survives_bonferroni"] = r["p_value"] < threshold
        if r["survives_bonferroni"]:
            surviving += 1
    print(f"  {surviving}/{n} results survive Bonferroni correction")
    return results
```
2.7 Cross-Correlation with Lags
Compute correlation at multiple time lags to detect delayed coupling.
```python
def cross_correlation(x: np.ndarray, y: np.ndarray,
                      max_lag: int = 24, label: str = "") -> dict:
    """Cross-correlation between x and y at lags -max_lag to +max_lag.

    Positive lag means x leads y (x at time t correlates with y at time t+lag).
    Reports Pearson r and p-value at each lag.
    """
    results = []
    for lag in range(-max_lag, max_lag + 1):
        if lag > 0:
            x_shifted = x[:-lag]
            y_shifted = y[lag:]
        elif lag < 0:
            x_shifted = x[-lag:]
            y_shifted = y[:lag]
        else:
            x_shifted = x
            y_shifted = y
        # Drop NaNs
        mask = ~(np.isnan(x_shifted) | np.isnan(y_shifted))
        x_c, y_c = x_shifted[mask], y_shifted[mask]
        n = len(x_c)
        if n < 10:
            continue
        r, p = sp_stats.pearsonr(x_c, y_c)
        results.append({
            "lag": lag,
            "r": round(float(r), 4),
            "p": round(float(p), 6),
            "n": n,
        })
    # Find peak
    if results:
        best = max(results, key=lambda r: abs(r["r"]))
        print(f"  Cross-correlation ({label}):")
        print(f"    Peak: lag={best['lag']}, r={best['r']:.4f}, p={best['p']:.6f}, N={best['n']}")
    return {
        "lags": results,
        "peak_lag": best["lag"] if results else None,
        "peak_r": best["r"] if results else None,
        "peak_p": best["p"] if results else None,
        "label": label,
    }
```
3. Quality Standards
These standards are NON-NEGOTIABLE. They encode hard-won lessons from 30 research workspaces and 10 published papers. Every one of these rules exists because we violated it and had to correct the record.
3.0 Anti-Sycophancy Protocol
When the data says no, SAY NO. Do not soften, hedge, reframe, or find silver linings in null results. Do not agree with dramatic claims from the internet without cross-checking against our data first.
Known cases where enthusiasm outran the numbers:
- Fireball "158x anomaly" → duplicate ingestion bug. Actual March 2026: 3 fireballs (normal).
- AMS "mysterious surge" → CNEOS shows 4 in Q1, exactly the 4.3/quarter baseline.
- X1.5 CME "will produce Kp storm" → CME missed Earth entirely. Most CMEs miss.
- Solar wind r=0.54 → V2 with more data: r=0.09. The effect was 6x smaller than V1 suggested.
Rules:
- Check the data BEFORE agreeing with any claim.
- "That's noise" is a valid and useful response.
- Correct your own errors immediately — do not wait to be asked.
- A null result reported clearly is worth more than a positive result reported vaguely.
- If you catch yourself saying "interesting!" about something that might be an artifact, stop and verify.
3.1 Honest Nulls
A null result IS a finding. Report it prominently, not buried in an appendix. Some of our best papers are null results:
- Solar-terrestrial forcing: "The solar cycle does not modulate geophysical event rates" (205 years of data, zero significant correlations)
- NEO-gravitational: "Gravitational wave events do not correlate with near-Earth object trajectories" (clean null)
- Earthquake tidal forcing: Schuster test p=0.28 — earthquakes do NOT cluster by lunar phase (despite 171K events)
If your analysis finds nothing, SAY SO CLEARLY. Do not hedge with "further investigation needed" when the data is conclusive. A null with N=100,000 is more valuable than a positive with N=50.
3.2 V1 to V2 Discipline
Every surprising positive result must be verified with more data.
The canonical cautionary tale: Solar wind speed vs earthquakes showed r=0.54 (p<0.00001) with 74 hours of data (V1). With 1,800 hours of DSCOVR archive data (V2), the correlation dropped to r=0.09 — still statistically significant but explaining less than 1% of variance.
Rule of thumb: r > 0.5 with N < 500 is a flag for verification, not a finding. Do not publish V1 results without attempting V2 replication.
The V1-to-V2 pipeline:
- V1: Initial analysis on available data. Flag any strong correlations.
- Seek more data: Backfill archives, extend time windows, add sources.
- V2: Re-run the identical analysis on the expanded dataset.
- Report BOTH V1 and V2. The shrinkage itself is scientifically informative.
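A quick simulation shows why the small-N flag exists: under the null of two independent series, the sampling spread of Pearson r shrinks roughly as 1/√N, so an r estimated from 74 points is several times less stable than one from 1,800. Chance alone does not fully explain an r of 0.54, but short, autocorrelated windows shrink the effective N well below the nominal 74, widening the spread further. This sketch is illustrative only, not taken from the V1/V2 papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def r_spread(n: int, trials: int = 2000) -> float:
    """Std of Pearson r between two INDEPENDENT standard-normal series of length n."""
    rs = []
    for _ in range(trials):
        x = rng.standard_normal(n)
        y = rng.standard_normal(n)
        rs.append(np.corrcoef(x, y)[0, 1])
    return float(np.std(rs))

sd_v1 = r_spread(74)    # V1-sized sample: null r's scatter widely
sd_v2 = r_spread(1800)  # V2-sized sample: null r's hug zero
print(f"std of r under the null: N=74 → {sd_v1:.3f}, N=1800 → {sd_v2:.3f}")
```

The theoretical spread is about 1/√(N-1), i.e. roughly 0.12 at N=74 versus 0.024 at N=1,800, which matches the simulation.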
3.3 Effect Sizes
Always report effect size alongside p-values. A p-value alone is meaningless — with enough data, any nonzero effect becomes "significant."
Required effect size metrics:
- r or rho for correlations (with R² = variance explained)
- Cohen's d for group comparisons: `d = (mean1 - mean2) / pooled_std`
- Ratio for rate comparisons: "high-Kp days average 1.76x more earthquakes"
```python
def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
    """Cohen's d effect size for two independent samples."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    if pooled_std == 0:
        return 0.0
    d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return round(float(d), 4)
```
Interpretation: |d| < 0.2 = negligible, 0.2-0.5 = small, 0.5-0.8 = medium, > 0.8 = large.
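A worked example of the formula and the interpretation bands, on invented data (the helper is repeated here so the snippet runs standalone):

```python
import numpy as np

def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
    """Pooled-std Cohen's d, same formula as the helper above."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return round(float((np.mean(group1) - np.mean(group2)) / pooled), 4)

# Invented daily counts for high-Kp vs low-Kp days: means differ by 5
# with sd 10, i.e. a true d of 0.5, right at the small/medium boundary.
rng = np.random.default_rng(1)
high_kp = rng.normal(52.0, 10.0, 2000)
low_kp = rng.normal(47.0, 10.0, 2000)
d = cohens_d(high_kp, low_kp)
band = ("negligible" if abs(d) < 0.2 else "small" if abs(d) < 0.5
        else "medium" if abs(d) < 0.8 else "large")
print(f"d = {d} ({band})")
```

Reporting both the number and the band ("d=0.5, medium") is what the effect-size rule asks for.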
3.4 Sample Sizes
State N for every statistical test. No exceptions. Every correlation, every t-test, every Schuster test — the reader must see N alongside the result.
Pattern:
```
Solar wind vs earthquakes: r=0.09, p=0.0001, N=1,800 hours
Schuster test (all M≥4): p=0.28, N=171,852 earthquakes
Granger F-test: F=4.32, p=0.003, N=181 hours, lag=2h
```
3.5 Bootstrap Before Power Law
When fitting power laws, ALWAYS run the bootstrap goodness-of-fit test. If bootstrap p < 0.1, reject the power law hypothesis — the data is better described by an alternative distribution.
We learned this from the universal-scaling-laws workspace: earthquake magnitudes looked like a power law until bootstrap rejected it (p=0.03). The true distribution was lognormal.
3.6 Deduplicate Earthquakes
USGS earthquake feeds have overlapping windows that produce duplicate records. Before ANY earthquake analysis, run the deduplication query from Section 1. Failure to deduplicate inflates N and biases clustering tests.
3.7 At Least One Null Hypothesis Per Paper
Every paper must explicitly state and test at least one null hypothesis. "We find no evidence for X" is a valid and valuable conclusion. The null should be the default expectation, and the burden of proof is on the alternative.
3.8 Bonferroni for > 3 Pairs
When testing more than 3 pairwise comparisons, apply Bonferroni correction. In the Granger causality network (72 pairs), only 15 of 37 nominal significances survived Bonferroni — more than half were false positives.
Report both: "37 significant at p<0.05, 15 survive Bonferroni (p<0.00069)."
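That reporting line can be generated mechanically from the p-value list. The p-values below are invented for illustration, not the real 72-pair network:

```python
import numpy as np

# 72 hypothetical pairwise p-values: a few strong effects, a block of
# nominally significant ones, and the rest null.
rng = np.random.default_rng(7)
p_values = np.concatenate([
    rng.uniform(1e-6, 5e-4, 10),   # strong effects (below any threshold)
    rng.uniform(0.001, 0.05, 20),  # nominally significant only
    rng.uniform(0.05, 1.0, 42),    # null
])
n = len(p_values)
threshold = 0.05 / n               # Bonferroni-corrected threshold
nominal = int(np.sum(p_values < 0.05))
surviving = int(np.sum(p_values < threshold))
print(f"{nominal} significant at p<0.05, {surviving} survive Bonferroni (p<{threshold:.5f})")
```

With these groups, 30 tests pass p<0.05 but only the 10 strong ones survive the 0.05/72 ≈ 0.00069 cutoff, mirroring the shrinkage seen in the real network.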
3.9 Pre-Flight Checklist (Lessons from Mike's Reviews)
These items are caught by the editor on EVERY paper. Do them BEFORE submitting.
- **Sensitivity analysis.** For any threshold or window parameter (e.g., 3σ, 90-day), report results at ±1 level (2.5σ/3.5σ, 60-day/180-day). A one-row table showing "results are robust" prevents a revision round.
- **Cite prior TerraPulse workspaces.** Run `ls workspaces/` and check whether any prior workspace studied related metrics. If `solar-geomagnetic-lag` measured Kp delay and you're measuring Kp delay, cite it and compare numbers.
- **Report Mann-Whitney alongside the t-test.** For ANY group comparison (SEA post vs pre, high-Kp vs low-Kp), report both parametric (t-test) and non-parametric (Mann-Whitney). If they disagree, the non-parametric result is more reliable, and the disagreement is itself a finding (it indicates non-normality or outlier influence).
- **Table notation.** Never use `d=` for anything other than Cohen's d. Use `peak=` for peak values, `Δ=` for changes, `r=` for correlations. Mixed notation in a single column header guarantees an editorial flag.
- **Explain count discrepancies.** If the issue says 66 events but the paper analyzes 32, explain why in the first paragraph. "Of the 66 events, 32 were available in the exported catalog" prevents confusion.
- **Data availability statement.** Include it as a bold paragraph after the Conclusion, not buried in the text: `\textbf{Data availability.} All data and scripts at \texttt{workspaces/{slug}/}.`
- **Check the math.** Before submission, verify that every number in the paper matches `results.json`. The editor cross-checks and WILL catch mismatches.
4. Paper Template
revtex4-2 LaTeX Skeleton
All TerraPulse papers use this exact template. Do not modify the document class, author block, or affiliation.
```latex
\documentclass[twocolumn,prl,aps,showpacs]{revtex4-2}
\usepackage{amsmath,amssymb}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{booktabs}

\begin{document}

\title{Your Title Here}
\author{M.~Isenbek}
\email{michael@terrapulse.info}
\author{E.~Isenbek}
\author{B.~Isenbek}
\affiliation{TerraPulse Climate Intelligence Platform, \url{https://terrapulse.info}}
\date{\today}

\begin{abstract}
One paragraph. State the question, the data, the method, and the key finding.
Include the most important numbers (N, r, p). End with the implication.
\end{abstract}

\maketitle

\section{Introduction}
Why does this question matter? What is the prior literature? What gap does
this paper fill? 2-3 paragraphs.

\section{Data}
Where does the data come from? What is the temporal and spatial coverage?
Include a table summarizing each data source with record counts and date ranges.
Always cite TerraPulse: \url{https://terrapulse.info}.

\begin{table}[h]
\caption{Data sources and coverage.}
\label{tab:data}
\begin{ruledtabular}
\begin{tabular}{lrr}
\textbf{Source} & \textbf{$N$} & \textbf{Span} \\
\colrule
Source 1 & 100,000 & 2021--2026 \\
Source 2 & 50,000 & 2025--2026 \\
\end{tabular}
\end{ruledtabular}
\end{table}

\section{Method}
What statistical tests are you running? Define them mathematically.
Reference the standard citations (e.g., Clauset et al. 2009 for power laws).

\section{Results}
Present the numbers. Tables and figures. State significance after Bonferroni
where applicable. Report effect sizes, not just p-values.

\begin{table}[h]
\caption{Results summary.}
\label{tab:results}
\begin{ruledtabular}
\begin{tabular}{lrrrl}
\textbf{Test} & \textbf{$r$} & \textbf{$p$} & \textbf{$N$} & \textbf{Sig?} \\
\colrule
Test 1 & +0.09 & 0.0001 & 1,800 & Yes \\
Test 2 & -0.02 & 0.81 & 120 & No \\
\end{tabular}
\end{ruledtabular}
\end{table}

\section{Discussion}
What do the results mean? How do they relate to prior work? What are the
limitations? If a null result, say so clearly and explain why it matters.
If V1 → V2 shrinkage occurred, discuss it honestly.

\section{Conclusion}
2-3 sentences. The main finding. The implication. What should be done next.

\begin{thebibliography}{99}
\bibitem{ref1} Author, ``Title,'' \textit{Journal} \textbf{Vol}, Page (Year).
\end{thebibliography}

\end{document}
```
Compile Command
cd workspaces/<workspace>/paper/
pdflatex paper.tex && pdflatex paper.tex
Run twice to resolve cross-references and table numbering.
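The two-pass compile can also be driven from a script. A sketch, assuming pdflatex is on PATH (helper names are illustrative):

```python
import subprocess
from pathlib import Path

def needs_rerun(log_text: str) -> bool:
    """True if the LaTeX log says another pass is still required."""
    return "Rerun to get cross-references right" in log_text

def compile_paper(paper_dir: Path) -> bool:
    """Run pdflatex twice in paper_dir; return True once references are stable."""
    for _ in range(2):  # second pass resolves \ref{} and table numbering
        subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "paper.tex"],
            cwd=paper_dir, check=True, capture_output=True,
        )
    return not needs_rerun((paper_dir / "paper.log").read_text(errors="replace"))
```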
Target Specifications
- Length: 3-4 pages (two-column revtex4-2 format)
- Tables: 1-2 (data summary + results)
- Figures: 1-2 (referenced from www/ via \includegraphics)
- References: 5-15 (key citations, not exhaustive)
5. Workspace Conventions
Directory Structure
Every research workspace follows this layout:
workspaces/<workspace-name>/
├── index.md # Research narrative (the "lab notebook")
├── workspace.json # Machine-readable metadata
├── scripts/
│ ├── extract.py # Data extraction from PostgreSQL
│ ├── analyze.py # Statistical analysis
│ └── visualize.py # (optional) Separate visualization script
├── data/
│ ├── *.parquet # Extracted datasets (Polars/Parquet)
│ └── *.json # Intermediate results
├── www/
│ ├── *.html # Interactive Plotly visualizations
│ └── *.png # Static images for papers
├── paper/
│ └── paper.tex # LaTeX paper (if publishing)
└── tmp/ # Scratch files, debugging output
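Scaffolding a new workspace in this layout takes only a few lines; a sketch (helper name is hypothetical):

```python
from pathlib import Path

SUBDIRS = ["scripts", "data", "www", "paper", "tmp"]

def scaffold(root: Path, name: str) -> Path:
    """Create the standard workspace layout under root/name."""
    ws = root / name
    for sub in SUBDIRS:
        (ws / sub).mkdir(parents=True, exist_ok=True)
    (ws / "index.md").touch()        # research narrative
    (ws / "workspace.json").touch()  # machine-readable metadata
    return ws
```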
Pipeline Flow
The canonical research pipeline is linear:
extract.py → data/*.parquet
→ analyze.py → data/results.json
→ visualize.py → www/*.html + www/*.png
→ index.md (narrative writeup)
→ paper/paper.tex (if publishing)
extract.py — Queries PostgreSQL, saves Parquet files to data/. Uses psycopg2 + polars. No analysis logic here.
analyze.py — Loads Parquet files, runs statistical tests, writes data/results.json. All numbers in the paper come from this JSON.
Visualizations — Generate both interactive HTML (for the web) and static PNG (for the paper). Usually done at the end of analyze.py or in a separate visualize.py.
index.md — The human-readable research narrative. Written after analysis is complete.
paper/paper.tex — The formal paper. Only created if the findings warrant publication.
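One practical gotcha in analyze.py: json.dump rejects NumPy scalars and arrays, which most statistical results contain. A sketch of a converter (helper names are hypothetical):

```python
import json
import numpy as np

def to_plain(obj):
    """Recursively convert NumPy scalars/arrays to JSON-safe Python types."""
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    if isinstance(obj, dict):
        return {k: to_plain(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_plain(v) for v in obj]
    return obj

def write_results(path, results: dict) -> None:
    """Write results.json with all NumPy types converted first."""
    with open(path, "w") as f:
        json.dump(to_plain(results), f, indent=2)
```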
index.md Format
# Workspace Title
> **Author:** Claude (TerraPulse Lab)
> **Status:** Draft | Complete | Published
> **Created:** YYYY-MM-DD
> **GitHub Issue:** #NN
## Hypothesis
What are we investigating? State it as a testable question.
## Data Sources
Table of metrics used, record counts, date ranges.
## Methodology
Statistical methods, preprocessing steps, parameters.
## Findings
Results with numbers. Tables. References to visualizations in www/.
## References
Related work, data source documentation.
Plotly Visualization Pattern
Always generate both interactive HTML and static PNG:
import plotly.graph_objects as go
from pathlib import Path

# Output directory, mirroring the DATA_DIR convention above.
WWW_DIR = Path(__file__).resolve().parent.parent / "www"
fig = go.Figure()
# ... build your figure ...
fig.update_layout(
title="Descriptive Title (N=1,234)",
xaxis_title="X Label (units)",
yaxis_title="Y Label (units)",
template="plotly_white",
)
# Interactive version for the web
fig.write_html(
str(WWW_DIR / "figure-name.html"),
include_plotlyjs="cdn",
)
# Static version for papers (high DPI; requires the kaleido package)
fig.write_image(
str(WWW_DIR / "figure-name.png"),
scale=2,
)
print(f" Wrote {WWW_DIR / 'figure-name.html'}")
print(f" Wrote {WWW_DIR / 'figure-name.png'}")
workspace.json Format
{
"name": "workspace-name",
"title": "Human-Readable Title",
"status": "complete",
"created": "2026-03-28",
"issue": 43,
"metrics": ["earthquake_magnitude", "space_kp_index"],
"methods": ["correlation", "granger_causality"],
"findings_summary": "One-sentence summary of the key result."
}
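A lightweight validator for this schema can catch problems before publication; a sketch (the required-field set and lowercase status values are assumptions based on the example above):

```python
REQUIRED = {"name", "title", "status", "created", "metrics", "methods"}
STATUSES = {"draft", "complete", "published"}

def validate_workspace(meta: dict) -> list[str]:
    """Return a list of problems; an empty list means the metadata looks valid."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - meta.keys())]
    if meta.get("status") not in STATUSES:
        problems.append(f"bad status: {meta.get('status')!r}")
    if not isinstance(meta.get("metrics"), list):
        problems.append("metrics must be a list of metric names")
    return problems
```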