Incident Postmortem — /mnt/ursa USB drop + Postgres corruption
| Incident date | 2026-05-30 21:04 EDT (onset) |
| Recovery window | 2026-05-31 ~13:00–14:00 EDT |
| Severity | High — both Postgres clusters down; data filesystem actively corrupting writes |
| Data loss | None confirmed (264.7M observations intact; valid pre-crash + post-recovery backups) |
| Status | Resolved; root cause identified; hardening applied; one architectural action item open |
| Author | Claude (managing/ops), with Mike (EIC) on the decision gates |
1. Summary
The 7.3 TB disk mounted at /mnt/ursa — a consumer WD easystore USB 3.0 external
drive that holds everything on this multi-tenant box (both Postgres clusters'
data, every service's Python/node runtime environment for two businesses, and the
nightly backups) — dropped off the USB bus at 21:04 EDT. The drive re-enumerated
under a new device name, its ext4 filesystem was left corrupt and actively
discarding writes, and both databases were down or unusable until a full-box
maintenance window the next afternoon: stop everything → e2fsck → remount →
recover both clusters.
The drive hardware is healthy. The failure was a USB transport disconnect, not a dying disk. The systemic risk is the architecture: production data for two businesses on a single consumer USB drive.
2. Impact
- PG16 (5433, TerraPulse primary): down ~17 h (PANIC at crash → recovered 13:42).
- PG15 (5432, secondary): running-but-unusable (every read returned I/O error), then required a separate fix for a corrupt replication-checkpoint file.
- TerraPulse API/site: served degraded (static/cached 200s) but the data layer was broken; ingestion stopped at 21:04 and resumed ~13:57.
- nominate.ai / Campaign Brain stack: taken offline during the maintenance window (cbcomfy, cbcomfy-api, dragonfli, cbcron, cjgaldescom, cbmcp, cbmobile, etc.).
- Nightly backup: the 03:32 run on 05-31 failed (0-byte) because the DB was down.
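The outage figures above follow directly from the timestamps in this report. A minimal sketch of the arithmetic, assuming the onset, PG16-ready, and ingestion-resumed times listed in the Timeline (the last one is approximate in the source); the `outage` helper is illustrative only:

```python
from datetime import datetime

# Timestamps taken from the Impact and Timeline sections (all EDT).
ONSET = datetime(2026, 5, 30, 21, 4)
PG16_READY = datetime(2026, 5, 31, 13, 42)
INGESTION_RESUMED = datetime(2026, 5, 31, 13, 57)  # reported as "~13:57"


def outage(start: datetime, end: datetime) -> str:
    """Format the gap between two timestamps as 'Hh MMm'."""
    minutes = int((end - start).total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


print("PG16 down:    ", outage(ONSET, PG16_READY))         # 16h 38m
print("Ingestion gap:", outage(ONSET, INGESTION_RESUMED))  # 16h 53m
```

Both values round to the "~17 h" quoted for PG16.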
3. Root cause
Proximate cause: a USB transport disconnect of the /mnt/ursa drive.
Evidence:
- lsblk -d -o ... TRAN: the disk's transport is usb (a WD easystore 264D, USB 3.0 at 5 Gbps), driven by uas (USB Attached SCSI).
- The device re-enumerated /dev/sdl1 → /dev/sda1 (identical filesystem UUID a74383e2-0b13-4e6e-806e-ebe069ac92d3). A device changing kernel names mid-life is the signature of a disconnect/reconnect, not a media fault.
- During the drop, writes failed and the ext4 metadata was corrupted. On recovery attempts the kernel logged, on every write, "EXT4-fs (sda1): Delayed block allocation failed ... error 117" and "This should not happen!! Data will be lost" (errno 117 = EUCLEAN, "Structure needs cleaning"), and the Postgres checkpoint PANIC'd with "could not flush dirty data: Structure needs cleaning".
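Because the kernel name changed while the filesystem UUID did not, the UUID is the stable handle for this disk. The sketch below shows one way to notice a rename by resolving the udev-maintained symlink under /dev/disk/by-uuid. The function names and the `last_seen` argument are illustrative assumptions, not part of the tooling used during the incident.

```python
import os
from typing import Optional

# Filesystem UUID reported in the evidence above; it stayed the same across
# the sdl1 -> sda1 rename.
URSA_UUID = "a74383e2-0b13-4e6e-806e-ebe069ac92d3"


def current_device(uuid: str, by_uuid_dir: str = "/dev/disk/by-uuid") -> Optional[str]:
    """Return the device node a filesystem UUID currently resolves to.

    udev keeps one symlink per filesystem UUID in /dev/disk/by-uuid.
    Returns None when no such filesystem is visible.
    """
    link = os.path.join(by_uuid_dir, uuid)
    if not os.path.islink(link):
        return None
    return os.path.realpath(link)


def has_renamed(uuid: str, last_seen: str, by_uuid_dir: str = "/dev/disk/by-uuid") -> bool:
    """True when the UUID is present but now lives under a different device node."""
    now = current_device(uuid, by_uuid_dir)
    return now is not None and now != last_seen


if __name__ == "__main__":
    print(current_device(URSA_UUID))
```

A monitor could call `has_renamed(URSA_UUID, "/dev/sdl1")` periodically and alert when it returns True, which would have flagged the 21:04 re-enumeration.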
The drive itself is NOT failing:
- smartctl -H /dev/sda: PASSED.
- Reallocated sectors 0, Current-pending 0, Offline-uncorrectable 0, Raw-read-error 0, UDMA-CRC-error 0. Power-on hours ~2,910 (≈4 months). Drive SMART error log: No Errors Logged.
- USB autosuspend was already disabled for this specific drive (power/control=on), so autosuspend was ruled out as the trigger.
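The "all counters at zero" check above can be automated. This is a rough sketch that assumes the attribute table comes from `smartctl -A` (the report itself only names `smartctl -H`) and that the table has the usual ten-column layout. `SAMPLE` is illustrative text shaped like that output, not captured from the drive, and Raw_Read_Error_Rate is left out because its raw encoding is vendor-specific.

```python
from typing import Dict

# Attribute IDs whose raw value should stay at zero on a healthy disk.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# Illustrative sample only: same shape as `smartctl -A`, zeros as reported above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""


def nonzero_watched(smartctl_a_output: str) -> Dict[int, int]:
    """Return {attribute_id: raw_value} for watched attributes that are non-zero."""
    bad = {}
    for line in smartctl_a_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attr_id, raw = int(parts[0]), parts[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) != 0:
            bad[attr_id] = int(raw)
    return bad


print(nonzero_watched(SAMPLE))  # {} means no watched counter has moved
```

An empty result is consistent with a transport fault rather than failing media. A non-zero UDMA_CRC_Error_Count would instead point at the cable or bridge.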
Most likely trigger: a uas-level reset/hang. The UAS driver has well-documented
reset bugs with certain USB-SATA bridge chips under heavy I/O; a UAS reset drops the
device and re-enumerates it — exactly what we observed. A full system reboot also
occurred at 21:01:24 (≈3 min before the disk errors); whether that was cause or
coincidence is unconfirmed because the crash-window kernel logs had already rotated
out of the (then-volatile) journal.
Underlying/systemic cause: production databases for two businesses run from a single consumer USB external drive with no redundancy, and (until this incident) the backups lived on that same drive.
4. Timeline (EDT)
- 05-30 21:01:24 - Box reboots (cause unconfirmed).
- 05-30 21:04 - /mnt/ursa USB drive drops; PG16 PANICs on WAL write I/O error; drive re-enumerates sdl1 → sda1 and is remounted on top of the dead sdl1 (stacked "ghost" mount). PG15 keeps running but reads fail (fds pinned to dead device). crawwwl-recovery.service had already failed at 21:03 (unrelated, see §8).
- 05-31 ~13:00 - Investigation begins (Mike: "make sure all services and DBs recovered"). Diagnosis: USB drop, stacked ghost mount, both DBs down, fs flagged.
- 05-31 ~13:05 - Attempt to start PG16 proves the fs is actively corrupt (checkpoint PANIC + repeating "Data will be lost"). Decision escalates from "DB-only recovery" to a full-box e2fsck window.
- 05-31 ~13:08 - Clean pre-crash dump copied to a separate disk before any destructive step.
- 05-31 ~13:10–13:25 - Quiesce the entire box; clear the ghost mount; free the device (incl. the dragonfli shared-mount propagation gotcha, §6).
- 05-31 13:29–13:40 - e2fsck -f -y -D /dev/sda1 (exit 1, errors corrected); re-run to a clean exit 0.
- 05-31 13:42 - Remount; PG16 starts, checkpoint completes, ready.
- 05-31 13:56 - PG15 recovered after the replication-checkpoint fix (§5).
- 05-31 13:57 - TerraPulse services back; ingestion writing live.
- 05-31 ~14:00 - nominate.ai stack back; site 200, health {"db":true}.
- 05-31 14:33–14:56 - Hardening (power-saving disabled, backups repointed), validated backup to the new disk (6.0 GB, restorable).
5. Recovery procedure (what was executed)
- Protect data first: cp the clean pre-crash dump to a separate physical disk (/mnt/backup-ursa, sdq1) before touching anything.
- Quiesce the box by stopping every service pinning /mnt/ursa:
  - TerraPulse: terrapulse, terrapulse-web, blitzortung, glm, pulse; backup.timer
  - nominate.ai system units: cbcron, cjgaldescom, dragonfli
  - nominate.ai user units (systemctl --user): cbmcp, cbmcp-inspector, cbmobile-dev, cbmobile-web
  - orphaned/stray holders (a 2nd cbmcp, vite/next-server/mcp-inspector trees)
- Clear the stacked ghost mount and confirm /dev/sda1 is fully released (jbd2/sda1 thread gone, no holders in any mount namespace).
- Repair: e2fsck -f -y -D /dev/sda1 → corrected extent trees, free inode/block counts, directory counts, inode-bitmap padding. No unattached or deleted inodes. Re-ran twice until a clean exit 0.
- Remount: single clean mount; write test OK (no EUCLEAN); no new ext4 errors.
- PG16: crash recovery replayed WAL and the end-of-recovery checkpoint completed (the exact step that PANIC'd pre-fsck). 264.7M observations intact.
- PG15: separate crash artifact, "PANIC: replication checkpoint has wrong magic". Confirmed PG15 has no logical replication slots and never ran an apply worker (zero subscriptions), so the corrupt pg_logical/replorigin_checkpoint was moved aside (preserved as .corrupt-2026-05-31, not deleted). Started clean.
- Restart all services, both businesses; verify API {"status":"ok","db":true}, site 200, live ingestion.
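The post-remount write test can be scripted so that the result is an explicit errno name rather than a visual check of the kernel log. The following is a minimal sketch, not the command that was actually run. `write_test` and the probe-file naming are assumptions made for illustration.

```python
import errno
import os
import uuid


def write_test(mount_point: str, size: int = 1 << 20) -> str:
    """Write, fsync, read back and delete a probe file under mount_point.

    Returns "ok", "MISMATCH", or the symbolic errno name (for example
    "EUCLEAN" or "EIO") of the first OSError raised.
    """
    probe = os.path.join(mount_point, f".write-test-{uuid.uuid4().hex}")
    payload = os.urandom(size)
    try:
        with open(probe, "wb") as fh:
            fh.write(payload)
            fh.flush()
            # fsync forces the data to the device now. Without it, delayed
            # allocation can defer the failure until well after write() returned.
            os.fsync(fh.fileno())
        with open(probe, "rb") as fh:
            if fh.read() != payload:
                return "MISMATCH"
        return "ok"
    except OSError as exc:
        return errno.errorcode.get(exc.errno, f"errno {exc.errno}")
    finally:
        try:
            os.unlink(probe)
        except OSError:
            pass  # probe was never created, or the filesystem is refusing writes


if __name__ == "__main__":
    print(write_test("/mnt/ursa"))
```

The read-back is likely served from the page cache, so it confirms the write path rather than the on-disk bytes. The `fsync` call is the part that would surface an EUCLEAN failure like the one seen before the repair.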
6. Notable gotchas (for the next responder)
- Stacked "ghost" mount. After re-enumeration the live sda1 was mounted on top of the dead sdl1. umount /mnt/ursa pops the live one first and exposes the dead one underneath; fully clearing it requires removing the live mount, which needs every holder stopped.
- dragonfli.service shared-mount propagation. It mounts /mnt/ursa with shared propagation; stopping it propagated the mount back into the host namespace instead of tearing it down. Fix: mount --make-private /mnt/ursa before umount.
- Holders hide in multiple places. Plain processes, a systemd --user tree, AND a private mount namespace (dragonfli) all pinned the device. e2fsck's O_EXCL check correctly refused to run until the very last holder was gone; scan /proc/*/mountinfo across namespaces, not just /proc/mounts.
- Two independent corruptions. The filesystem (ext4) AND a Postgres-internal file (PG15 replorigin) were both damaged. Fixing one didn't fix the other.
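The cross-namespace scan mentioned in the gotchas can be sketched as below. It reads every process's mountinfo file, reports which PIDs still see a mount at a given path, and exposes the propagation tag (shared:N) that caused the dragonfli surprise. The `Mount`, `parse_mountinfo`, and `holders` names are illustrative. Paths containing spaces appear octal-escaped in mountinfo and are not decoded here.

```python
import glob
from typing import Dict, List, NamedTuple


class Mount(NamedTuple):
    mount_point: str
    source: str
    propagation: str  # e.g. "shared:5", "master:2", or "private" when untagged


def parse_mountinfo(text: str) -> List[Mount]:
    """Parse the contents of one /proc/<pid>/mountinfo file."""
    mounts = []
    for line in text.splitlines():
        # Layout: six fixed fields, optional fields, " - ", fstype, source, options.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        fields, tail = left.split(), right.split()
        if len(fields) < 6 or len(tail) < 2:
            continue
        optional = fields[6:]
        mounts.append(Mount(fields[4], tail[1], " ".join(optional) or "private"))
    return mounts


def holders(mount_point: str, proc_root: str = "/proc") -> Dict[str, List[Mount]]:
    """Map each PID to the mounts it sees at mount_point, in whatever namespace it lives."""
    found = {}
    for path in glob.glob(f"{proc_root}/[0-9]*/mountinfo"):
        try:
            with open(path) as fh:
                hits = [m for m in parse_mountinfo(fh.read()) if m.mount_point == mount_point]
        except OSError:
            continue  # process exited mid-scan, or we lack permission
        if hits:
            found[path.split("/")[-2]] = hits
    return found


if __name__ == "__main__":
    for pid, hits in sorted(holders("/mnt/ursa").items(), key=lambda kv: int(kv[0])):
        print(pid, [(m.source, m.propagation) for m in hits])
```

Two entries for the same mount point under one PID indicate a stacked mount. This covers mounts only, so processes holding open files or a working directory on the device still need a separate check (for example with fuser or lsof).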
7. Remediation completed (2026-05-31)
- Power management fully disabled (this is a server):
  - sleep/suspend/hibernate/hybrid-sleep targets masked (systemctl suspend now refused).
  - USB autosuspend off: all devices power/control=on, usbcore.autosuspend=-1 (runtime + udev rule + GRUB usbcore.autosuspend=-1).
  - CPU governor pinned to performance (runtime + cpu-performance.service).
  - SATA link power max_performance; disk APM/spindown disabled via udev.
  - upower disabled; logind set to ignore idle/suspend/lid.
  - GRUB also gains pcie_aspm=off (applies on next reboot).
- Backups moved off the data drive: scripts/backup_postgres.sh now writes the DB dump and the Claude-memory snapshot to /mnt/backup-ursa (separate disk, sdq1), not /mnt/ursa. Validated end-to-end: 6.0 GB dump, pg_restore --list reads a valid 592-entry archive.
- fstab self-heal: /mnt/ursa and /mnt/backup-ursa passno 0 → 2 so an unclean filesystem is checked/repaired on boot instead of silently mounted corrupt.
- journald Storage=persistent, so the next incident's crash-window kernel logs survive (this time the 21:04 evidence had already rotated out of the volatile store).
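The fstab change can be guarded against regression with a small linter. The sketch below is illustrative, not part of the deployed hardening. It flags ext filesystems whose sixth field (fs_passno) is 0, and non-root entries that are neither nofail nor noauto and would therefore block boot if the disk were absent.

```python
from typing import List, Tuple


def lint_fstab(text: str) -> List[Tuple[str, str]]:
    """Return (mount_point, problem) pairs for risky fstab entries."""
    problems = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed entry; out of scope for this sketch
        mount_point, fstype = fields[1], fields[2]
        options = fields[3].split(",")
        # fs_passno is the 6th field and defaults to 0 when omitted.
        passno = int(fields[5]) if len(fields) > 5 else 0
        if not fstype.startswith("ext"):
            continue  # only ext2/3/4 data filesystems are checked here
        if passno == 0:
            problems.append((mount_point, "passno=0"))
        if mount_point != "/" and "nofail" not in options and "noauto" not in options:
            problems.append((mount_point, "missing nofail"))
    return problems


if __name__ == "__main__":
    with open("/etc/fstab") as fh:
        for mount_point, problem in lint_fstab(fh.read()):
            print(f"{mount_point}: {problem}")
```

Run from a pre-reboot checklist, an empty result would mean every ext mount is checked at boot and no optional disk can push the box into emergency mode.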
8. Action items (open)
- [HIGH] Migrate Postgres data off the USB drive. This is the real fix. Move both clusters' data directories to an internal SATA/NVMe disk, ideally with RAID/mirroring. USB transport drops are inherent to the current setup and will recur. Until then, the box is one USB glitch away from another corruption event.
- [MED] If it must stay on USB: quirk the drive out of UAS into usb-storage (BOT) mode if resets recur, using usb-storage.quirks=<VID>:<PID>:u on the kernel cmdline (more stable, slightly slower).
- [MED] Pre-existing fstab risk: findmnt --verify flags /mnt/joker (and one other entry) as required-on-boot without nofail. If those disks are absent at boot the box can drop to emergency mode. Add nofail or remove the stale entries before any planned reboot.
- [LOW] crawwwl-recovery.service fails 216/GROUP because it sets User=/Group= in a systemd --user unit (invalid). Pre-existing (failed a minute before the crash). Remove those two lines from ~/.config/systemd/user/crawwwl-recovery.service.
- [LOW] Cleanup: delete pg_logical/replorigin_checkpoint.corrupt-2026-05-31 once PG15 has run healthy for a few days.
- [LOW] Investigate the 21:01 reboot: confirm whether it was a kernel panic / watchdog / power event (now that journald persists, the next one is diagnosable).
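The crawwwl-recovery.service item generalizes: any unit under ~/.config/systemd/user that sets User= or Group= will fail in the same way. A minimal sketch of a check for such lines, assuming simple line-oriented unit files; `invalid_user_unit_lines` is a hypothetical helper name:

```python
from typing import List


def invalid_user_unit_lines(unit_text: str) -> List[str]:
    """Return the User=/Group= lines found in the [Service] section of a unit file."""
    section = None
    hits = []
    for raw in unit_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "Service" and line.split("=", 1)[0].strip() in ("User", "Group"):
            hits.append(line)
    return hits


if __name__ == "__main__":
    import glob
    import os

    pattern = os.path.expanduser("~/.config/systemd/user/*.service")
    for path in glob.glob(pattern):
        with open(path) as fh:
            for line in invalid_user_unit_lines(fh.read()):
                print(f"{path}: {line}")
```

Commented-out lines (starting with # or ;) are ignored because their key no longer matches User or Group exactly.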
9. What went well
- No data loss: a clean pre-crash dump was secured before any destructive step, and both DBs recovered with full row counts.
- e2fsck's O_EXCL safety check and the decision to observe PG16's startup (rather than assume) caught the active filesystem corruption before it was trusted with writes.
- Recovery decisions were gated through the operator at each escalation rather than guessed.