Incident Postmortem — hard system freeze (2026-06-01)
| Crash time | 2026-06-01 ~20:55:03 EDT (≈5 h after the PG16 migration cutover) |
| Down for | ~36 h (recovery boot 2026-06-03 10:28; drives cleaned; current boot 16:52) |
| Severity | High — whole machine froze; both businesses offline until manual recovery |
| Data loss | None. PG16 (migrated to internal NVMe) recovered clean; 276.6 M obs, ingesting |
| Root cause | Hard freeze / power-or-hardware event — no logged cause (see §3) |
| Status | Recovered; broken services fixed; hardware/power investigation OPEN |
| Related | Second incident in 3 days — see USB-drop postmortem |
1. Summary
About 5 hours after the PG16-off-USB migration completed, the entire machine froze instantly at ~20:55:03 EDT on 2026-06-01. The system was healthy and serving normally up to the last logged second; then total silence. No kernel panic, no hardware error, nothing — the signature of a hard hardware/power-level lockup, not a software fault. It stayed down ~36 h until manual recovery.
The migration paid off: PG16 (TerraPulse) was on internal NVMe by then and recovered cleanly with no data loss. Had it still been on the USB drive, this hard crash could have corrupted it as the 2026-05-30 USB drop did.
2. Timeline (EDT)
- 2026-06-01 15:54 — PG16 migration cutover to internal NVMe completes; healthy.
- 2026-06-01 15:54–20:55 — Normal operation; PG16 checkpoints every 5 min; ingestion flowing. Last successful checkpoint start 20:54:17.
- 2026-06-01 20:55:03 — Last log line of any kind. Routine TerraPulse activity (geosphere fetch 200, GLM fetch 200, WSPR query). Then the machine freezes. No shutdown, no panic, no further logs.
- 2026-06-01 → 06-03 — Machine down ~36 h.
- 2026-06-03 10:28 — First recovery boot; PG16 logs `database system was interrupted; last known up at 2026-06-01 20:53:47`.
- 2026-06-03 (between boots) — Drives cleaned (operator); all disks re-enumerated (nvme0↔nvme1, `/mnt/ursa` sda1→sdj1, backup-ursa sdq1→sdo1, etc.).
- 2026-06-03 16:52 — Current boot. UUID-based mounts brought everything back correctly. PG16 online on NVMe; several services left in a failed state (§5).
3. Root cause analysis
Finding: a hard freeze with no recorded cause — points to power/hardware, not software.
Evidence (from on-disk rsyslog logs, which survived; the systemd journal did not):
- `syslog` shows TerraPulse serving normally to the last instant — the final entries at 20:55:02–20:55:03 are successful HTTP fetches and a DB query. The system was not degrading.
- `kern.log`'s last line before the gap is a routine `[UFW BLOCK]` entry; there is no panic, oops, `hung_task`/`blocked for more than Ns`, OOM-killer, MCE/Hardware Error, thermal/throttle, `ata`/`nvme`/`usb` reset, or I/O error anywhere in the crash hour.
- Software panics, OOM kills, thermal events, and disk failures normally leave traces on disk before or at the event. The complete absence of any trace suggests the kernel died too abruptly to log, which is characteristic of a power interruption (loss/brownout/PSU instability) or a hardware/CPU hard-lock.
- This is a different failure mode from 2026-05-30 (that was a USB-drive transport drop with copious ext4/I/O errors). Two unrelated-looking failures in three days, on a heavily loaded box, point to underlying hardware/power fragility.
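The "no trace in the crash hour" check above can be repeated mechanically. What follows is a minimal sketch, not the procedure actually used for this postmortem: the regexes are illustrative assumptions that approximate the signatures named in the evidence list, and `scan_lines` is a hypothetical helper name.

```python
from __future__ import annotations

import re
from typing import Iterable

# Illustrative patterns for the signatures listed in the evidence above.
# They are assumptions, not an exhaustive catalogue of kernel messages.
CRASH_SIGNATURES = {
    "panic": re.compile(r"Kernel panic", re.I),
    "oops": re.compile(r"\bOops\b"),
    "hung_task": re.compile(r"blocked for more than \d+ seconds"),
    "oom": re.compile(r"Out of memory|oom-killer", re.I),
    "mce": re.compile(r"Machine check|Hardware Error|\bmce:", re.I),
    "thermal": re.compile(r"thermal|throttl", re.I),
    "reset": re.compile(r"\b(?:ata\d+|nvme\d+|usb \d[\d.-]*).*\breset", re.I),
    "io_error": re.compile(r"I/O error", re.I),
}


def scan_lines(lines: Iterable[str]) -> dict[str, list[str]]:
    """Return the matching lines per signature; an empty dict means no trace."""
    hits: dict[str, list[str]] = {name: [] for name in CRASH_SIGNATURES}
    for line in lines:
        for name, pattern in CRASH_SIGNATURES.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    # Drop signatures that never matched so "no evidence" is simply {}.
    return {name: found for name, found in hits.items() if found}
```

Feeding it the crash-hour slice of `/var/log/kern.log` (for example, `scan_lines(open("/var/log/kern.log", errors="replace"))` after filtering by timestamp) should return an empty dict if the finding above holds. An empty result only shows that nothing was logged, not that nothing happened.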
Honest caveat — a self-implicating possibility. On the afternoon of 2026-06-01 we disabled all power saving and pinned the CPU governor to `performance`, raising this machine's sustained power draw and heat. This cannot be proven to have contributed, and most of the evidence points against it:
- No thermal-throttle messages were logged; CPU/NVMe temps are normal now under the same `performance` setting (Tctl ~64 °C, NVMe ~41–52 °C, limit ~82 °C).
- The freeze was instantaneous, not the gradual ramp a thermal event produces.
- The box was already unstable the day before (USB drop), independent of this change.
But on a box that may already be near its power envelope (17 USB disks + 2 NVMe + multiple NICs/WireGuard tunnels), always-max-power tipping a marginal PSU cannot be ruled out. It stays on the suspect list pending a hardware/power review.
4. What survived / data integrity
- PG16 (TerraPulse): intact, on internal NVMe (`/var/lib/pgdata/16-main`, now `nvme1n1p2`). Clean crash recovery; 276,626,045 observations; ingestion resumed; NVMe SMART PASSED. No corruption.
- PG15 (secondary / nominate.ai): data on `/mnt/ursa` intact; clean WAL recovery on restart (§5).
- Mounts: all disks re-enumerated after the crash + cleaning, but UUID-based fstab entries remounted everything correctly. `/mnt/ursa` readable.
- Systemd journal: lost — `/var/log/journal` empty, boot list jumps April→today. The independent on-disk rsyslog logs (`/var/log/syslog`, `kern.log`) are what made this diagnosis possible; keep them.
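Because the re-enumeration renamed nearly every disk, any fstab entry keyed on a kernel device name would have mounted the wrong disk or failed. A small audit along the following lines could confirm that no such entries remain. This is a sketch under stated assumptions: `unstable_fstab_devices` is a hypothetical helper, and the regex only covers `sdX` and `nvmeXnYpZ` style names.

```python
from __future__ import annotations

import re

# Kernel-assigned names that can change between boots (assumed pattern;
# covers /dev/sdX[N] and /dev/nvmeXnY[pZ] only).
UNSTABLE_NAME = re.compile(r"^/dev/(?:sd[a-z]+\d*|nvme\d+n\d+(?:p\d+)?)$")


def unstable_fstab_devices(fstab_text: str) -> list[str]:
    """Return device specs in fstab that rely on enumeration order."""
    risky: list[str] = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line; ignore rather than guess
        if UNSTABLE_NAME.match(fields[0]):
            risky.append(fields[0])
    return risky
```

Running it on the contents of `/etc/fstab` should return an empty list on this machine, given that every mount came back by UUID. Entries such as `UUID=...`, `LABEL=...`, or `/dev/disk/by-uuid/...` are deliberately not flagged.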
5. Broken services found on recovery + fixes applied
- ssh.service — was failed. sshd is configured to bind `192.168.1.100:1223`; at boot it started before the (USB) network adapter received that IP, so it could not bind and gave up (`Cannot assign requested address`). The IP is present now. Fixed: restarted (listening again). Hardened: drop-in `After=network-online.target` + `Restart=on-failure`/`RestartSec=5s` so the boot race self-heals.
- postgresql@15-main — was failed. Didn't start at boot (likely the USB `/mnt/ursa` wasn't mounted yet). Fixed: started; clean recovery; serving on 5432.
- smartmontools — was failed (timeout). `DEVICESCAN` probing all 17 USB drives at startup exceeded the default start timeout. Fixed: drop-in `TimeoutStartSec=300` (kept enabled — proactive SMART monitoring is worth having given the hardware fragility).
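The ssh.service failure comes down to one condition: a socket cannot bind to an IP address that no interface holds yet. The sketch below reproduces that condition as a diagnostic. It is not the fix that was applied (that was the systemd drop-in described above), and `address_is_bindable` and `wait_for_address` are hypothetical helper names.

```python
import errno
import socket
import time


def address_is_bindable(address: str) -> bool:
    """True if a TCP socket can bind to `address` on an ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Port 0 lets the kernel pick a free port, so only the address
            # is being tested, not whether a specific port is in use.
            sock.bind((address, 0))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                # "Cannot assign requested address": no interface has it.
                return False
            raise
        return True


def wait_for_address(address: str, timeout: float = 30.0,
                     interval: float = 1.0) -> bool:
    """Poll until `address` is bindable or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if address_is_bindable(address):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

On this machine, `address_is_bindable("192.168.1.100")` would have returned `False` at the moment sshd started during the failed boot and should return `True` now that the adapter holds the address.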
6. Recommendations (OPEN)
- [HIGH] Hardware/power investigation. Two failures in three days under heavy load. Check PSU capacity/health and power delivery; put the box on a UPS if it isn't (a brownout would produce exactly this no-log freeze); run memtest.
- [HIGH] Instrument for the next freeze. This one left no trace. Enable a hardware/systemd watchdog (auto-reboot on lockup) and/or netconsole or remote syslog, plus `kernel.panic`/`panic_on_oops` sysctls, so the next event is captured or auto-recovers instead of sitting dead for 36 h.
- [MED] Reconsider the always-`performance` / no-power-saving profile until the power question is settled, or confirm PSU/cooling headroom. Flagged because we introduced it the same afternoon (§3).
- [MED] PG15 still on USB. The secondary DB remains on the failure-prone USB drive (Phase 4 of the migration plan). Decide: slim + co-locate on NVMe, dedicated disk, or hardened USB.
- [LOW] journald persistence isn't reliable here — it was set to persistent 2026-05-31 but still lost the crash window. rsyslog is the dependable record; ensure it stays enabled.
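As a starting point for the instrumentation item, the current panic-related sysctls can be read back from `/proc/sys` to see whether the kernel would reboot by itself after a panic. This is a sketch under stated assumptions: `read_sysctls` and `recovery_gaps` are hypothetical helpers. These settings only help when the kernel is still alive enough to panic, so a true hard lock or power loss still needs the watchdog or UPS.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterable, Optional


def read_sysctls(names: Iterable[str],
                 proc_sys: Path = Path("/proc/sys")) -> dict[str, Optional[int]]:
    """Read integer sysctls such as 'kernel.panic'; None if missing or non-numeric."""
    values: dict[str, Optional[int]] = {}
    for name in names:
        # 'kernel.panic' maps to /proc/sys/kernel/panic
        path = proc_sys.joinpath(*name.split("."))
        try:
            values[name] = int(path.read_text().strip())
        except (FileNotFoundError, ValueError):
            values[name] = None
    return values


def recovery_gaps(values: dict[str, Optional[int]]) -> list[str]:
    """List settings that would leave the box halted after a kernel panic or oops."""
    gaps: list[str] = []
    if values.get("kernel.panic") in (None, 0):
        # 0 means: stay halted after a panic instead of rebooting.
        gaps.append("kernel.panic is 0 or unreadable: no automatic reboot after a panic")
    if values.get("kernel.panic_on_oops") in (None, 0):
        # 0 means: try to continue after an oops rather than panic.
        gaps.append("kernel.panic_on_oops is 0 or unreadable: an oops will not trigger a panic")
    return gaps
```

Calling `recovery_gaps(read_sysctls(["kernel.panic", "kernel.panic_on_oops"]))` on the live machine returns an empty list once both are set to non-zero values. The `proc_sys` parameter exists only so the functions can be exercised against a temporary directory.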
7. What went well
- The PG16-off-USB migration, done hours earlier, is the reason this hard crash cost zero TerraPulse data — it recovered clean from fast internal NVMe.
- UUID-based mounts survived a total device re-enumeration without manual fixup.
- On-disk rsyslog logs preserved the crash window even though the journal was wiped.