Incident Postmortem — `/mnt/ursa` USB drop + Postgres corruption


Incident date	2026-05-30 21:04 EDT (onset)
Recovery window	2026-05-31 ~13:00–14:00 EDT
Severity	High — both Postgres clusters down; data filesystem actively corrupting writes
Data loss	None confirmed (264.7M observations intact; valid pre-crash + post-recovery backups)
Status	Resolved; root cause identified; hardening applied; one architectural action item open
Author	Claude (managing/ops), with Mike (EIC) on the decision gates

1. Summary

The 7.3 TB disk mounted at /mnt/ursa — a consumer WD easystore USB 3.0 external drive that holds everything on this multi-tenant box (both Postgres clusters' data, every service's Python/node runtime environment for two businesses, and the nightly backups) — dropped off the USB bus at 21:04 EDT. The drive re-enumerated under a new device name, its ext4 filesystem was left corrupt and actively discarding writes, and both databases were down or unusable until a full-box maintenance window the next afternoon: stop everything → e2fsck → remount → recover both clusters.

The drive hardware is healthy. The failure was a USB transport disconnect, not a dying disk. The systemic risk is the architecture: production data for two businesses on a single consumer USB drive.

2. Impact

PG16 (5433, TerraPulse primary): down ~17 h (PANIC at crash → recovered 13:42).
PG15 (5432, secondary): running-but-unusable (every read returned I/O error), then required a separate fix for a corrupt replication-checkpoint file.
TerraPulse API/site: served degraded (static/cached 200s) but the data layer was broken; ingestion stopped at 21:04 and resumed ~13:57.
nominate.ai / Campaign Brain stack: taken offline during the maintenance window (cbcomfy, cbcomfy-api, dragonfli, cbcron, cjgaldescom, cbmcp, cbmobile, etc.).
Nightly backup: the 03:32 run on 05-31 failed (0-byte) because the DB was down.
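The 0-byte dump in the last item is the kind of failure a post-run sanity check can catch. The following is a minimal sketch, not the contents of the actual backup script: the function names and the `min_bytes` threshold are hypothetical, and the only external tool used is `pg_restore --list`, which this report also uses for validation in §7.

```python
import os
import subprocess

def dump_is_plausible(path, min_bytes=1):
    """Return True if the dump file exists and holds at least min_bytes."""
    try:
        return os.path.getsize(path) >= min_bytes
    except OSError:
        # Missing file or unreadable path counts as a failed backup.
        return False

def archive_is_readable(path):
    """Return True if pg_restore can list the archive's table of contents."""
    try:
        result = subprocess.run(
            ["pg_restore", "--list", path],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # pg_restore is not installed or not on PATH.
        return False
    return result.returncode == 0
```

A backup job could call both functions after the dump finishes and exit nonzero if either returns False, so that an empty or unreadable dump is reported as a failed run.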

3. Root cause

Proximate cause: a USB transport disconnect of the /mnt/ursa drive.

Evidence:

lsblk -d -o ... TRAN: the disk is usb — a WD easystore 264D, USB 3.0 (5 Gbps), driven by uas (USB Attached SCSI).
The device re-enumerated /dev/sdl1 → /dev/sda1 (identical filesystem UUID a74383e2-0b13-4e6e-806e-ebe069ac92d3). A device changing kernel names mid-life is the signature of a disconnect/reconnect, not a media fault.
During the drop, writes failed and the ext4 metadata was corrupted. On recovery attempts the kernel logged, on every write: EXT4-fs (sda1): Delayed block allocation failed ... error 117 / This should not happen!! Data will be lost (errno 117 = EUCLEAN, "Structure needs cleaning"), and the Postgres checkpoint PANIC'd with: could not flush dirty data: Structure needs cleaning.
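To show how the errno in the kernel message quoted above maps to a name, here is a small sketch that pulls the device and error number out of lines of that shape and looks the number up. The regular expression and function names are illustrative assumptions. The name and message returned depend on the platform's errno table, so no output is asserted here.

```python
import errno
import os
import re

# Matches the "EXT4-fs (<device>): ... error <n>" shape quoted in this report.
EXT4_ERROR = re.compile(r"EXT4-fs \((?P<dev>[^)]+)\): .*?\berror (?P<code>\d+)")

def ext4_errors(log_text):
    """Return (device, errno) pairs for each matching ext4 error line."""
    return [
        (m.group("dev"), int(m.group("code")))
        for m in EXT4_ERROR.finditer(log_text)
    ]

def describe_errno(code):
    """Return (symbolic name, message) for an errno on the current platform."""
    return errno.errorcode.get(code, "unknown"), os.strerror(code)
```

Feeding it the message as quoted in this report, `ext4_errors("EXT4-fs (sda1): Delayed block allocation failed ... error 117")` yields the pair `("sda1", 117)`, and `describe_errno(117)` can then be used to confirm the EUCLEAN reading on the box itself.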

The drive itself is NOT failing:

smartctl -H /dev/sda: PASSED.
Reallocated sectors 0, Current-pending 0, Offline-uncorrectable 0, Raw-read-error 0, UDMA-CRC-error 0. Power-on hours ~2,910 (≈4 months). Drive SMART error log: No Errors Logged.
USB autosuspend was already disabled for this specific drive (power/control=on), so autosuspend was ruled out as the trigger.
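The reasoning behind "not a dying disk" can be written down as a check: every counter that would indicate a media or SATA-link fault is zero. The sketch below assumes the raw values have already been collected into a dict. The attribute names follow the labels smartctl commonly prints and are an assumption here, as is the helper name. The values are the ones reported in this section.

```python
# Counters whose nonzero raw value would point at the media or the SATA link.
FAULT_COUNTERS = (
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)

def nonzero_fault_counters(raw_values):
    """Return only the watched counters that are present and nonzero."""
    return {
        name: raw_values[name]
        for name in FAULT_COUNTERS
        if raw_values.get(name, 0) != 0
    }

# Values as reported in this postmortem (all fault counters zero).
reported = {name: 0 for name in FAULT_COUNTERS}
reported["Power_On_Hours"] = 2910

print(nonzero_fault_counters(reported))        # empty dict: nothing flagged
print(round(reported["Power_On_Hours"] / 24))  # power-on time in days
```

An empty result supports the transport-fault reading. The power-on figure works out to about 121 days, which matches the "≈4 months" above.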

Most likely trigger: a uas-level reset/hang. The UAS driver has well-documented reset bugs with certain USB-SATA bridge chips under heavy I/O; a UAS reset drops the device and re-enumerates it — exactly what we observed. A full system reboot also occurred at 21:01:24 (≈3 min before the disk errors); whether that was cause or coincidence is unconfirmed because the crash-window kernel logs had already rotated out of the (then-volatile) journal.

Underlying/systemic cause: production databases for two businesses run from a single consumer USB external drive with no redundancy, and (until this incident) the backups lived on that same drive.

4. Timeline (EDT)

05-30 21:01:24 — Box reboots (cause unconfirmed).
05-30 21:04 — /mnt/ursa USB drive drops; PG16 PANICs on WAL write I/O error; drive re-enumerates sdl1→sda1 and is remounted on top of the dead sdl1 (stacked "ghost" mount). PG15 keeps running but reads fail (fds pinned to dead device). crawwwl-recovery.service had already failed at 21:03 (unrelated, see §8).
05-31 ~13:00 — Investigation begins (Mike: "make sure all services and DBs recovered"). Diagnosis: USB drop, stacked ghost mount, both DBs down, fs flagged.
05-31 ~13:05 — Attempt to start PG16 proves the fs is actively corrupt (checkpoint PANIC + repeating "Data will be lost"). Decision escalates from "DB-only recovery" to a full-box e2fsck window.
05-31 ~13:08 — Clean pre-crash dump copied to a separate disk before any destructive step.
05-31 ~13:10–13:25 — Quiesce the entire box; clear the ghost mount; free the device (incl. the dragonfli shared-mount propagation gotcha, §6).
05-31 13:29–13:40 — e2fsck -f -y -D /dev/sda1 (exit 1, errors corrected); re-run to a clean exit 0.
05-31 13:42 — Remount; PG16 starts, checkpoint completes, ready.
05-31 13:56 — PG15 recovered after the replication-checkpoint fix (§5).
05-31 13:57 — TerraPulse services back; ingestion writing live.
05-31 ~14:00 — nominate.ai stack back; site 200, health {"db":true}.
05-31 14:33–14:56 — Hardening (power-saving disabled, backups repointed), validated backup to the new disk (6.0 GB, restorable).
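The outage durations quoted in §2 follow directly from the timestamps in this timeline. A short worked calculation, using only times stated above (all EDT, so naive datetimes are sufficient):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

onset = datetime.strptime("2026-05-30 21:04", FMT)
pg16_ready = datetime.strptime("2026-05-31 13:42", FMT)
ingestion_back = datetime.strptime("2026-05-31 13:57", FMT)

def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)

print(hours_minutes(pg16_ready - onset))      # (16, 38)
print(hours_minutes(ingestion_back - onset))  # (16, 53)
```

PG16 was therefore down for 16 h 38 min, which §2 rounds to ~17 h, and ingestion was interrupted for 16 h 53 min.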

5. Recovery procedure (what was executed)

Protect data first — cp the clean pre-crash dump to a separate physical disk (/mnt/backup-ursa, sdq1) before touching anything.
Quiesce the box — stop every service pinning /mnt/ursa:
- TerraPulse: terrapulse, terrapulse-web, blitzortung, glm, pulse; backup.timer
- nominate.ai system units: cbcron, cjgaldescom, dragonfli
- nominate.ai user units (systemctl --user): cbmcp, cbmcp-inspector, cbmobile-dev, cbmobile-web
- orphaned/stray holders (a 2nd cbmcp, vite/next-server/mcp-inspector trees)
Clear the stacked ghost mount and confirm /dev/sda1 is fully released (jbd2/sda1 thread gone, no holders in any mount namespace).
Repair — e2fsck -f -y -D /dev/sda1 → corrected extent trees, free inode/block counts, directory counts, inode-bitmap padding. No unattached or deleted inodes. Re-ran twice until a clean exit 0.
Remount — single clean mount; write test OK (no EUCLEAN); no new ext4 errors.
PG16 — crash recovery replayed WAL and the end-of-recovery checkpoint completed (the exact step that PANIC'd pre-fsck). 264.7M observations intact.
PG15 — separate crash artifact: PANIC: replication checkpoint has wrong magic. Confirmed PG15 has no logical replication slots and never ran an apply worker (zero subscriptions), so the corrupt pg_logical/replorigin_checkpoint was moved aside (preserved as .corrupt-2026-05-31, not deleted). Started clean.
Restart all services, both businesses; verify API {"status":"ok","db":true}, site 200, live ingestion.
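The repair step relied on reading e2fsck's exit status, which is a bitmask rather than a simple pass/fail. The sketch below decodes it using the values documented in the e2fsck man page. The function name and message wording are illustrative.

```python
# Exit status bits as documented in e2fsck(8).
FSCK_FLAGS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "canceled by user request",
    128: "shared library error",
}

def decode_fsck_exit(code):
    """Return the list of conditions encoded in an e2fsck exit status."""
    if code == 0:
        return ["no errors"]
    return [message for bit, message in FSCK_FLAGS.items() if code & bit]

print(decode_fsck_exit(1))  # ['file system errors corrected']
print(decode_fsck_exit(0))  # ['no errors']
```

In this incident the first pass returned 1 (errors corrected) and the check was repeated until it returned 0. A status with bit 4 set would have meant the filesystem was still not safe to mount read-write.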

6. Notable gotchas (for the next responder)

Stacked "ghost" mount. After re-enumeration the live sda1 was mounted on top of the dead sdl1. umount /mnt/ursa pops the live one first and exposes the dead one underneath — fully clearing it requires removing the live mount, which needs every holder stopped.
dragonfli.service shared-mount propagation. It mounts /mnt/ursa with shared propagation; stopping it propagated the mount back into the host namespace instead of tearing it down. Fix: mount --make-private /mnt/ursa before umount.
Holders hide in multiple places. Plain processes, a systemd --user tree, AND a private mount namespace (dragonfli) all pinned the device. e2fsck's O_EXCL check correctly refused to run until the very last holder was gone — scan /proc/*/mountinfo across namespaces, not just /proc/mounts.
Two independent corruptions. The filesystem (ext4) AND a Postgres-internal file (PG15 replorigin) were both damaged. Fixing one didn't fix the other.
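The mount-related gotchas above can be checked mechanically by reading every process's mountinfo rather than only /proc/mounts. The following is a sketch under stated assumptions: the function and type names are hypothetical, and it relies only on the field layout described in proc(5), with optional fields such as shared:N appearing before the "-" separator. It finds mount-namespace holders only. Open file descriptors still need a tool such as lsof or fuser.

```python
import glob
from typing import NamedTuple

class Mount(NamedTuple):
    mount_id: int
    parent_id: int
    mount_point: str
    propagation: tuple  # optional fields, e.g. ("shared:55",)
    fs_type: str
    source: str

def parse_mountinfo(text):
    """Parse the contents of a /proc/<pid>/mountinfo file into Mount records."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue
        sep = fields.index("-")  # separator between optional fields and fs type
        if sep < 6 or len(fields) < sep + 3:
            continue
        mounts.append(Mount(int(fields[0]), int(fields[1]), fields[4],
                            tuple(fields[6:sep]), fields[sep + 1], fields[sep + 2]))
    return mounts

def mounts_at(mounts, mount_point):
    """Return every mount stacked on the given mount point."""
    return [m for m in mounts if m.mount_point == mount_point]

def is_shared(mount):
    """True if the mount has shared propagation."""
    return any(field.startswith("shared:") for field in mount.propagation)

def holders(mount_point, proc_root="/proc"):
    """Map pid -> mounts of mount_point visible in that process's namespace."""
    found = {}
    for path in glob.glob(f"{proc_root}/[0-9]*/mountinfo"):
        try:
            with open(path) as fh:
                hits = mounts_at(parse_mountinfo(fh.read()), mount_point)
        except OSError:
            continue  # process exited, or permission denied
        if hits:
            found[int(path.split("/")[-2])] = hits
    return found
```

More than one result from `mounts_at` for /mnt/ursa in a single namespace indicates a stacked mount. A hit for which `is_shared` returns True is a candidate for `mount --make-private` before unmounting. Run `holders` as root so that other users' processes are readable.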

7. Remediation completed (2026-05-31)

Power management fully disabled (this is a server):
- sleep/suspend/hibernate/hybrid-sleep targets masked (systemctl suspend now refused).
- USB autosuspend off for all devices: power/control=on and usbcore.autosuspend=-1, applied at runtime, via a udev rule, and on the GRUB kernel command line.
- CPU governor pinned to performance (runtime + cpu-performance.service).
- SATA link power max_performance; disk APM/spindown disabled via udev.
- upower disabled; logind set to ignore idle/suspend/lid.
- GRUB also gains pcie_aspm=off (applies on next reboot).
Backups moved off the data drive — scripts/backup_postgres.sh now writes the DB dump and the Claude-memory snapshot to /mnt/backup-ursa (separate disk, sdq1), not /mnt/ursa. Validated end-to-end: 6.0 GB dump, pg_restore --list reads a valid 592-entry archive.
fstab self-heal — /mnt/ursa and /mnt/backup-ursa passno 0→2 so an unclean filesystem is checked/repaired on boot instead of silently mounted corrupt.
journald Storage=persistent — so the next incident's crash-window kernel logs survive (this time the 21:04 evidence had already rotated out of the volatile store).
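The USB autosuspend setting listed above can be re-verified after a reboot or a device re-plug by reading each device's power/control file in sysfs. The function name below is hypothetical. The sysfs layout (/sys/bus/usb/devices/<device>/power/control containing "on" or "auto") is the standard kernel interface.

```python
from pathlib import Path

def devices_allowing_autosuspend(sysfs_usb="/sys/bus/usb/devices"):
    """Return names of USB devices whose power/control is not 'on'."""
    offenders = []
    for control in sorted(Path(sysfs_usb).glob("*/power/control")):
        try:
            value = control.read_text().strip()
        except OSError:
            continue  # device vanished or attribute unreadable
        if value != "on":
            # control -> power -> <device directory>
            offenders.append(control.parent.parent.name)
    return offenders
```

An empty list means every enumerated device is pinned to "on". Any name returned is a device that the udev rule did not cover.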

8. Action items (open)

[HIGH] Migrate Postgres data off the USB drive. This is the real fix. Move both clusters' data directories to an internal SATA/NVMe disk, ideally with RAID/mirroring. USB transport drops are inherent to the current setup and will recur. Until then, the box is one USB glitch away from another corruption event.
[MED] If it must stay on USB: quirk the drive out of UAS into usb-storage (BOT) mode if resets recur — usb-storage.quirks=<VID>:<PID>:u on the kernel cmdline (more stable, slightly slower).
[MED] Pre-existing fstab risk: findmnt --verify flags /mnt/joker (and one other entry) as required-on-boot without nofail. If those disks are absent at boot the box can drop to emergency mode. Add nofail or remove the stale entries before any planned reboot.
[LOW] crawwwl-recovery.service fails 216/GROUP because it sets User=/Group= in a systemd --user unit (invalid). Pre-existing (failed a minute before the crash). Remove those two lines from ~/.config/systemd/user/crawwwl-recovery.service.
[LOW] Cleanup: delete pg_logical/replorigin_checkpoint.corrupt-2026-05-31 once PG15 has run healthy for a few days.
[LOW] Investigate the 21:01 reboot — confirm whether it was a kernel panic / watchdog / power event (now that journald persists, the next one is diagnosable).
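The two fstab concerns in this report (entries without nofail, and passno 0 on ext filesystems) can be approximated with a small audit. This is only a sketch: it does not replace findmnt --verify, the function name and finding strings are hypothetical, and any fstab text passed to it in examples is made up rather than copied from the box.

```python
def audit_fstab(text):
    """Return (mount point, finding) pairs for risky fstab entries."""
    findings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        target, fstype, options = fields[1], fields[2], fields[3].split(",")
        # The sixth field is fs_passno; a missing field is treated as 0.
        passno = int(fields[5]) if len(fields) > 5 and fields[5].isdigit() else 0
        if target != "/" and fstype != "swap":
            if "nofail" not in options and "noauto" not in options:
                findings.append((target, "required at boot (no nofail)"))
        if fstype.startswith("ext") and passno == 0:
            findings.append((target, "never checked at boot (passno 0)"))
    return findings
```

Running it over /etc/fstab before a planned reboot gives a quick list of entries to review. An entry for a disk that may be absent should not appear with the "required at boot" finding.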

9. What went well

No data loss: a clean pre-crash dump was secured before any destructive step, and both DBs recovered with full row counts.
e2fsck's O_EXCL safety check and the decision to observe PG16's startup (rather than assume) caught the active filesystem corruption before it was trusted with writes.
Recovery decisions were gated through the operator at each escalation rather than guessed.
