Incident Postmortem — hard system freeze (2026-06-01)


Crash time	2026-06-01 ~20:55:03 EDT (≈5 h after the PG16 migration cutover)
Down for	~36 h (recovery boot 2026-06-03 10:28; drives cleaned; current boot 16:52)
Severity	High — whole machine froze; both businesses offline until manual recovery
Data loss	None. PG16 (migrated to internal NVMe) recovered clean; 276.6 M obs, ingesting
Root cause	Hard freeze / power-or-hardware event — no logged cause (see §3)
Status	Recovered; broken services fixed; hardware/power investigation OPEN
Related	Second incident in 3 days — see USB-drop postmortem

1. Summary

About 5 hours after the PG16-off-USB migration completed, the entire machine froze instantly at ~20:55:03 EDT on 2026-06-01. The system was healthy and serving normally up to the last logged second; then total silence. No kernel panic, no hardware error, nothing — the signature of a hard hardware/power-level lockup, not a software fault. It stayed down ~36 h until manual recovery.

The migration paid off: PG16 (TerraPulse) was on internal NVMe by then and recovered cleanly with no data loss. Had it still been on the USB drive, this hard crash could have corrupted it as the 2026-05-30 USB drop did.

2. Timeline (EDT)

2026-06-01 15:54 — PG16 migration cutover to internal NVMe completes; healthy.
2026-06-01 15:54–20:55 — Normal operation; PG16 checkpoints every 5 min; ingestion flowing. Last successful checkpoint start 20:54:17.
2026-06-01 20:55:03 — Last log line of any kind. Routine TerraPulse activity (geosphere fetch 200, GLM fetch 200, WSPR query). Then the machine freezes. No shutdown, no panic, no further logs.
2026-06-01 → 06-03 — Machine down ~36 h.
2026-06-03 10:28 — First recovery boot; PG16 logs database system was interrupted; last known up at 2026-06-01 20:53:47.
2026-06-03 (between boots) — Drives cleaned (operator); all disks re-enumerated (nvme0↔nvme1, /mnt/ursa sda1→sdj1, backup-ursa sdq1→sdo1, etc.).
2026-06-03 16:52 — Current boot. UUID-based mounts brought everything back correctly. PG16 online on NVMe; several services left in a failed state (§5).

3. Root cause analysis

Finding: a hard freeze with no recorded cause — points to power/hardware, not software.

Evidence (from on-disk rsyslog logs, which survived; the systemd journal did not):

syslog shows TerraPulse serving normally to the last instant — the final entries at 20:55:02–20:55:03 are successful HTTP fetches and a DB query. The system was not degrading.
kern.log's last line before the gap is a routine [UFW BLOCK] entry; there is no panic, oops, hung_task/blocked for more than Ns, OOM-killer, MCE/Hardware Error, thermal/throttle, ata/nvme/usb reset, or I/O error anywhere in the crash hour.
A software panic, OOM, thermal event, or disk failure all leave traces on disk before/at the event. The complete absence of any trace means the kernel died too abruptly to log — characteristic of a power interruption (loss/brownout/ PSU instability) or a hardware/CPU hard-lock.
This is a different failure mode from 2026-05-30 (that was a USB-drive transport drop with copious ext4/I/O errors). Two unrelated-looking failures in three days, on a heavily loaded box, point to underlying hardware/power fragility.

Honest caveat — a self-implicating possibility. On the afternoon of 2026-06-01 we disabled all power saving and pinned the CPU governor to performance, raising this machine's sustained power draw and heat. This cannot be proven to have contributed, and the evidence is mixed against it:

No thermal-throttle messages were logged; CPU/NVMe temps are normal now under the same performance setting (Tctl ~64 °C, NVMe ~41–52 °C, limit ~82 °C).
The freeze was instantaneous, not the gradual ramp a thermal event produces.
The box was already unstable the day before (USB drop), independent of this change.

But on a box that may already be near its power envelope (17 USB disks + 2 NVMe + multiple NICs/WireGuard tunnels), always-max-power tipping a marginal PSU cannot be ruled out. It stays on the suspect list pending a hardware/power review.

4. What survived / data integrity

PG16 (TerraPulse): intact, on internal NVMe (/var/lib/pgdata/16-main, now nvme1n1p2). Clean crash recovery; 276,626,045 observations; ingestion resumed; NVMe SMART PASSED. No corruption.
PG15 (secondary / nominate.ai): data on /mnt/ursa intact; clean WAL recovery on restart (§5).
Mounts: all disks re-enumerated after the crash + cleaning, but UUID-based fstab entries remounted everything correctly. /mnt/ursa readable.
Systemd journal: lost — /var/log/journal empty, boot list jumps April→today. The independent on-disk rsyslog logs (/var/log/syslog, kern.log) are what made this diagnosis possible; keep them.

5. Broken services found on recovery + fixes applied

ssh.service — was failed. sshd is configured to bind 192.168.1.100:1223; at boot it started before the (USB) network adapter received that IP, so it could not bind and gave up (Cannot assign requested address). The IP is present now. Fixed: restarted (listening again). Hardened: drop-in After=network-online.target + Restart=on-failure/RestartSec=5s so the boot race self-heals.
postgresql@15-main — was failed. Didn't start at boot (likely the USB /mnt/ursa wasn't mounted yet). Fixed: started; clean recovery; serving on 5432.
smartmontools — was failed (timeout). DEVICESCAN probing all 17 USB drives at startup exceeded the default start timeout. Fixed: drop-in TimeoutStartSec=300 (kept enabled — proactive SMART monitoring is worth having given the hardware fragility).

6. Recommendations (OPEN)

[HIGH] Hardware/power investigation. Two failures in three days under heavy load. Check PSU capacity/health and power delivery; put the box on a UPS if it isn't (a brownout would produce exactly this no-log freeze); run memtest.
[HIGH] Instrument for the next freeze. This one left no trace. Enable a hardware/systemd watchdog (auto-reboot on lockup) and/or netconsole or remote syslog, plus kernel.panic/panic_on_oops sysctls, so the next event is captured or auto-recovers instead of sitting dead for 36 h.
[MED] Reconsider the always-performance / no-power-saving profile until the power question is settled, or confirm PSU/cooling headroom. Flagged because we introduced it the same afternoon (§3).
[MED] PG15 still on USB. The secondary DB remains on the failure-prone USB drive (Phase 4 of the migration plan). Decide: slim + co-locate on NVMe, dedicated disk, or hardened USB.
[LOW] journald persistence isn't reliable here — it was set to persistent 2026-05-31 but still lost the crash window. rsyslog is the dependable record; ensure it stays enabled.

7. What went well

The PG16-off-USB migration, done hours earlier, is the reason this hard crash cost zero TerraPulse data — it recovered clean from fast internal NVMe.
UUID-based mounts survived a total device re-enumeration without manual fixup.
On-disk rsyslog logs preserved the crash window even though the journal was wiped.