
Incident Postmortem — /mnt/ursa USB drop + Postgres corruption

Incident date: 2026-05-30 21:04 EDT (onset)
Recovery window: 2026-05-31 ~13:00–14:00 EDT
Severity: High — both Postgres clusters down; data filesystem actively corrupting writes
Data loss: None confirmed (264.7M observations intact; valid pre-crash + post-recovery backups)
Status: Resolved; root cause identified; hardening applied; one architectural action item open
Author: Claude (managing/ops), with Mike (EIC) on the decision gates

1. Summary

The 7.3 TB disk mounted at /mnt/ursa — a consumer WD easystore USB 3.0 external drive that holds everything on this multi-tenant box (both Postgres clusters' data, every service's Python/node runtime environment for two businesses, and the nightly backups) — dropped off the USB bus at 21:04 EDT. The drive re-enumerated under a new device name, its ext4 filesystem was left corrupt and actively discarding writes, and both databases were down or unusable until a full-box maintenance window the next afternoon: stop everything → e2fsck → remount → recover both clusters.

The drive hardware is healthy. The failure was a USB transport disconnect, not a dying disk. The systemic risk is the architecture: production data for two businesses on a single consumer USB drive.

2. Impact

  • PG16 (5433, TerraPulse primary): down ~17 h (PANIC at crash → recovered 13:42).
  • PG15 (5432, secondary): running-but-unusable (every read returned I/O error), then required a separate fix for a corrupt replication-checkpoint file.
  • TerraPulse API/site: served degraded (static/cached 200s) but the data layer was broken; ingestion stopped at 21:04 and resumed ~13:57.
  • nominate.ai / Campaign Brain stack: taken offline during the maintenance window (cbcomfy, cbcomfy-api, dragonfli, cbcron, cjgaldescom, cbmcp, cbmobile, etc.).
  • Nightly backup: the 03:32 run on 05-31 failed (0-byte) because the DB was down.
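
The outage durations above follow from the timeline timestamps. A minimal sketch of the arithmetic, assuming both ends of each interval are in the same timezone (EDT) so naive datetimes are enough; the function and variable names are illustrative:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

def outage(start: str, end: str):
    """Return the elapsed time between two local timestamps as a timedelta."""
    return datetime.strptime(end, FMT) - datetime.strptime(start, FMT)

# Timestamps taken from this postmortem (all EDT).
pg16_down = outage("2026-05-30 21:04", "2026-05-31 13:42")
ingest_gap = outage("2026-05-30 21:04", "2026-05-31 13:57")

print(pg16_down)   # 16:38:00
print(ingest_gap)  # 16:53:00
```

Both figures round to the ~17 h quoted for PG16.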

3. Root cause

Proximate cause: a USB transport disconnect of the /mnt/ursa drive.

Evidence:

  • lsblk -d -o ... TRAN: the disk is usb — a WD easystore 264D, USB 3.0 (5 Gbps), driven by uas (USB Attached SCSI).
  • The device re-enumerated /dev/sdl1 → /dev/sda1 (identical filesystem UUID a74383e2-0b13-4e6e-806e-ebe069ac92d3). A device changing kernel names mid-life is the signature of a disconnect/reconnect, not a media fault.
  • During the drop, writes failed and the ext4 metadata was corrupted. On recovery attempts the kernel logged, on every write: EXT4-fs (sda1): Delayed block allocation failed ... error 117 / This should not happen!! Data will be lost (errno 117 = EUCLEAN, "structure needs cleaning"), and the Postgres checkpoint PANIC'd with: could not flush dirty data: Structure needs cleaning.
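
The errno mapping in the last bullet can be checked mechanically. The following is a rough sketch, not the tooling used during the incident: it pulls the device and error number out of kernel lines shaped like the one quoted above and translates the number with Python's errno table. The regex and the function name are assumptions, and the sample line is assembled from the fragments in this report with the elided middle left as-is.

```python
import errno
import os
import re

# Matches the message shape quoted above: "EXT4-fs (<dev>): ... error <N>".
EXT4_ERR = re.compile(r"EXT4-fs \((?P<dev>[^)]+)\):.*?\berror (?P<code>\d+)")

def decode_ext4_errors(lines):
    """Yield (device, errno_name, description) for each matching ext4 error line."""
    for line in lines:
        m = EXT4_ERR.search(line)
        if m:
            code = int(m.group("code"))
            # errno names are platform-specific; on Linux 117 is EUCLEAN.
            name = errno.errorcode.get(code, "UNKNOWN")
            yield m.group("dev"), name, os.strerror(code)

sample = [
    # Illustrative input only, not a verbatim journal capture.
    "EXT4-fs (sda1): Delayed block allocation failed ... error 117",
    "placeholder for an unrelated kernel line",
]
decoded = list(decode_ext4_errors(sample))
for dev, name, desc in decoded:
    print(dev, name, desc)
```

On a Linux host this should report EUCLEAN for sda1; on other platforms the name lookup may fall back to UNKNOWN.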

The drive itself is NOT failing:

  • smartctl -H /dev/sda: PASSED.
  • Reallocated sectors 0, Current-pending 0, Offline-uncorrectable 0, Raw-read-error 0, UDMA-CRC-error 0. Power-on hours ~2,910 (≈4 months). Drive SMART error log: No Errors Logged.
  • USB autosuspend was already disabled for this specific drive (power/control=on), so autosuspend was ruled out as the trigger.
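
The SMART checks above lend themselves to a repeatable script. A minimal sketch, assuming the report comes from smartctl's JSON output mode (smartctl --json -H -A on the device) and that the drive exposes ATA attributes under ata_smart_attributes.table; the watched attribute set mirrors the counters listed above, and the sample dict is a hand-written stand-in, not a capture from this drive:

```python
# SMART attribute IDs whose raw value is expected to be zero on a healthy disk.
WATCHED = {
    1: "Raw_Read_Error_Rate",
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

def smart_problems(report: dict) -> list:
    """Return human-readable findings; an empty list means nothing was flagged."""
    problems = []
    if not report.get("smart_status", {}).get("passed", False):
        problems.append("overall health check did not pass")
    for attr in report.get("ata_smart_attributes", {}).get("table", []):
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED and raw != 0:
            problems.append(f"{WATCHED[attr['id']]} raw={raw}")
    return problems

# Hand-written stand-in for parsed smartctl JSON (illustrative values only).
sample = {
    "smart_status": {"passed": True},
    "ata_smart_attributes": {"table": [
        {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
        {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 0}},
        {"id": 199, "name": "UDMA_CRC_Error_Count", "raw": {"value": 0}},
    ]},
}
print(smart_problems(sample))  # []
```

Some vendors report a non-zero raw read error rate on healthy drives, so the watched set may need adjusting per model.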

Most likely trigger: a uas-level reset/hang. The UAS driver has well-documented reset bugs with certain USB-SATA bridge chips under heavy I/O; a UAS reset drops the device and re-enumerates it — exactly what we observed. A full system reboot also occurred at 21:01:24 (≈3 min before the disk errors); whether that was cause or coincidence is unconfirmed because the crash-window kernel logs had already rotated out of the (then-volatile) journal.

Underlying/systemic cause: production databases for two businesses run from a single consumer USB external drive with no redundancy, and (until this incident) the backups lived on that same drive.

4. Timeline (EDT)

  • 05-30 21:01:24 — Box reboots (cause unconfirmed).
  • 05-30 21:04 — /mnt/ursa USB drive drops; PG16 PANICs on WAL write I/O error; drive re-enumerates sdl1 → sda1 and is remounted on top of the dead sdl1 (stacked "ghost" mount). PG15 keeps running but reads fail (fds pinned to dead device). crawwwl-recovery.service had already failed at 21:03 (unrelated, see §8).
  • 05-31 ~13:00 — Investigation begins (Mike: "make sure all services and DBs recovered"). Diagnosis: USB drop, stacked ghost mount, both DBs down, fs flagged.
  • 05-31 ~13:05 — Attempt to start PG16 proves the fs is actively corrupt (checkpoint PANIC + repeating "Data will be lost"). Decision escalates from "DB-only recovery" to a full-box e2fsck window.
  • 05-31 ~13:08 — Clean pre-crash dump copied to a separate disk before any destructive step.
  • 05-31 ~13:10–13:25 — Quiesce the entire box; clear the ghost mount; free the device (incl. the dragonfli shared-mount propagation gotcha, §6).
  • 05-31 13:29–13:40 — e2fsck -f -y -D /dev/sda1 (exit 1, errors corrected); re-run to a clean exit 0.
  • 05-31 13:42 — Remount; PG16 starts, checkpoint completes, ready.
  • 05-31 13:56 — PG15 recovered after the replication-checkpoint fix (§5).
  • 05-31 13:57 — TerraPulse services back; ingestion writing live.
  • 05-31 ~14:00 — nominate.ai stack back; site 200, health {"db":true}.
  • 05-31 14:33–14:56 — Hardening (power-saving disabled, backups repointed), validated backup to the new disk (6.0 GB, restorable).

5. Recovery procedure (what was executed)

  1. Protect data first — cp the clean pre-crash dump to a separate physical disk (/mnt/backup-ursa, sdq1) before touching anything.
  2. Quiesce the box — stop every service pinning /mnt/ursa:
    • TerraPulse: terrapulse, terrapulse-web, blitzortung, glm, pulse; backup.timer
    • nominate.ai system units: cbcron, cjgaldescom, dragonfli
    • nominate.ai user units (systemctl --user): cbmcp, cbmcp-inspector, cbmobile-dev, cbmobile-web
    • orphaned/stray holders (a 2nd cbmcp, vite/next-server/mcp-inspector trees)
  3. Clear the stacked ghost mount and confirm /dev/sda1 is fully released (jbd2/sda1 thread gone, no holders in any mount namespace).
  4. Repair — e2fsck -f -y -D /dev/sda1 → corrected extent trees, free inode/block counts, directory counts, inode-bitmap padding. No unattached or deleted inodes. Re-ran twice until a clean exit 0.
  5. Remount — single clean mount; write test OK (no EUCLEAN); no new ext4 errors.
  6. PG16 — crash recovery replayed WAL and the end-of-recovery checkpoint completed (the exact step that PANIC'd pre-fsck). 264.7M observations intact.
  7. PG15 — separate crash artifact: PANIC: replication checkpoint has wrong magic. Confirmed PG15 has no logical replication slots and never ran an apply worker (zero subscriptions), so the corrupt pg_logical/replorigin_checkpoint was moved aside (preserved as .corrupt-2026-05-31, not deleted). Started clean.
  8. Restart all services, both businesses; verify API {"status":"ok","db":true}, site 200, live ingestion.
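
Step 7 moved the corrupt file aside rather than deleting it. A minimal sketch of that pattern, assuming the cluster is stopped and the file has already been confirmed safe to set aside; the helper name is hypothetical and the demo runs against a throwaway temp directory, not a real data directory:

```python
import tempfile
from pathlib import Path

def move_aside(path: Path, tag: str) -> Path:
    """Rename `path` to `<name>.corrupt-<tag>` in the same directory; never overwrite."""
    target = path.with_name(f"{path.name}.corrupt-{tag}")
    if target.exists():
        raise FileExistsError(f"refusing to overwrite {target}")
    path.rename(target)  # same directory, so the rename stays on one filesystem
    return target

# Demo on a scratch file standing in for pg_logical/replorigin_checkpoint.
with tempfile.TemporaryDirectory() as scratch:
    victim = Path(scratch) / "replorigin_checkpoint"
    victim.write_bytes(b"not a real checkpoint")
    moved = move_aside(victim, "2026-05-31")
    moved_name = moved.name
    original_gone = not victim.exists()

print(moved_name)  # replorigin_checkpoint.corrupt-2026-05-31
```

Keeping the renamed file next to the original makes the change trivially reversible until the cleanup action item is closed.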

6. Notable gotchas (for the next responder)

  • Stacked "ghost" mount. After re-enumeration the live sda1 was mounted on top of the dead sdl1. umount /mnt/ursa pops the live one first and exposes the dead one underneath — fully clearing it requires removing the live mount, which needs every holder stopped.
  • dragonfli.service shared-mount propagation. It mounts /mnt/ursa with shared propagation; stopping it propagated the mount back into the host namespace instead of tearing it down. Fix: mount --make-private /mnt/ursa before umount.
  • Holders hide in multiple places. Plain processes, a systemd --user tree, AND a private mount namespace (dragonfli) all pinned the device. e2fsck's O_EXCL check correctly refused to run until the very last holder was gone — scan /proc/*/mountinfo across namespaces, not just /proc/mounts.
  • Two independent corruptions. The filesystem (ext4) AND a Postgres-internal file (PG15 replorigin) were both damaged. Fixing one didn't fix the other.
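
The cross-namespace scan recommended above can be sketched as follows. This is an illustration under assumptions, not the commands run during the incident: it relies on the /proc/<pid>/mountinfo field layout documented in proc(5), it needs root to read other users' processes, and the sample line is hand-written rather than captured from this box.

```python
import glob
import os
from collections import defaultdict

def parse_mountinfo(text):
    """Parse mountinfo text into (mount_point, fs_type, source) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if "-" not in fields:
            continue
        sep = fields.index("-")  # optional fields end at the lone "-"
        if sep < 6 or len(fields) < sep + 3:
            continue
        entries.append((fields[4], fields[sep + 1], fields[sep + 2]))
    return entries

def namespaces_with_mount(mount_point, proc="/proc"):
    """Map each mount-namespace id to the PIDs that still see `mount_point`."""
    found = defaultdict(list)
    for path in glob.glob(os.path.join(proc, "[0-9]*", "mountinfo")):
        pid_dir = os.path.dirname(path)
        try:
            with open(path) as fh:
                entries = parse_mountinfo(fh.read())
            ns = os.readlink(os.path.join(pid_dir, "ns", "mnt"))
        except OSError:
            continue  # process exited, or not permitted to inspect it
        if any(mp == mount_point for mp, _, _ in entries):
            found[ns].append(int(os.path.basename(pid_dir)))
    return {ns: sorted(pids) for ns, pids in found.items()}

# Hand-written sample line in mountinfo format (illustrative values).
SAMPLE = "42 25 8:1 / /mnt/ursa rw,relatime shared:5 - ext4 /dev/sda1 rw"
print(parse_mountinfo(SAMPLE))  # [('/mnt/ursa', 'ext4', '/dev/sda1')]
```

Calling namespaces_with_mount("/mnt/ursa") as root should return one key per mount namespace that still has the mount; any key other than the host's points at a private namespace holder such as the dragonfli one described above. A stacked mount shows up as two entries with the same mount point in a single mountinfo file.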

7. Remediation completed (2026-05-31)

  • Power management fully disabled (this is a server):
    • sleep/suspend/hibernate/hybrid-sleep targets masked (systemctl suspend now refused).
    • USB autosuspend off — all devices power/control=on and usbcore.autosuspend=-1 (applied at runtime, via a udev rule, and on the GRUB kernel cmdline).
    • CPU governor pinned to performance (runtime + cpu-performance.service).
    • SATA link power max_performance; disk APM/spindown disabled via udev.
    • upower disabled; logind set to ignore idle/suspend/lid.
    • GRUB also gains pcie_aspm=off (applies on next reboot).
  • Backups moved off the data drive — scripts/backup_postgres.sh now writes the DB dump and the Claude-memory snapshot to /mnt/backup-ursa (separate disk, sdq1), not /mnt/ursa. Validated end-to-end: 6.0 GB dump, pg_restore --list reads a valid 592-entry archive.
  • fstab self-heal — /mnt/ursa and /mnt/backup-ursa passno 0 → 2 so an unclean filesystem is checked/repaired on boot instead of silently mounted corrupt.
  • journald Storage=persistent — so the next incident's crash-window kernel logs survive (this time the 21:04 evidence had already rotated out of the volatile store).
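
The USB autosuspend setting above can be re-verified after any reboot or hardware change. A minimal sketch, assuming the standard sysfs layout where each USB device directory under /sys/bus/usb/devices exposes power/control; the function name is hypothetical, and the demo builds a fake tree in a temp directory with made-up device names so it runs anywhere:

```python
import tempfile
from pathlib import Path

def devices_not_forced_on(sysfs_root="/sys/bus/usb/devices"):
    """Return (device, value) pairs whose power/control is anything but 'on'."""
    offenders = []
    for ctl in sorted(Path(sysfs_root).glob("*/power/control")):
        device = ctl.parent.parent.name
        if ":" in device:
            continue  # names like 1-1:1.0 are interfaces, not devices
        try:
            value = ctl.read_text().strip()
        except OSError:
            continue  # device vanished or attribute unreadable
        if value != "on":
            offenders.append((device, value))
    return offenders

# Demo against a fake sysfs tree (device names "1-4" and "2-1" are made up).
with tempfile.TemporaryDirectory() as root:
    for name, value in (("2-1", "on"), ("1-4", "auto")):
        power = Path(root) / name / "power"
        power.mkdir(parents=True)
        (power / "control").write_text(value + "\n")
    offenders = devices_not_forced_on(root)

print(offenders)  # [('1-4', 'auto')]
```

On the real box the function would be called with its default path, and an empty list is the expected result after the hardening above.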

8. Action items (open)

  1. [HIGH] Migrate Postgres data off the USB drive. This is the real fix. Move both clusters' data directories to an internal SATA/NVMe disk, ideally with RAID/mirroring. USB transport drops are inherent to the current setup and will recur. Until then, the box is one USB glitch away from another corruption event.
  2. [MED] If it must stay on USB: quirk the drive out of UAS into usb-storage (BOT) mode if resets recur — usb-storage.quirks=<VID>:<PID>:u on the kernel cmdline (more stable, slightly slower).
  3. [MED] Pre-existing fstab risk: findmnt --verify flags /mnt/joker (and one other entry) as required-on-boot without nofail. If those disks are absent at boot the box can drop to emergency mode. Add nofail or remove the stale entries before any planned reboot.
  4. [LOW] crawwwl-recovery.service fails 216/GROUP because it sets User=/Group= in a systemd --user unit (invalid). Pre-existing (failed a minute before the crash). Remove those two lines from ~/.config/systemd/user/crawwwl-recovery.service.
  5. [LOW] Cleanup: delete pg_logical/replorigin_checkpoint.corrupt-2026-05-31 once PG15 has run healthy for a few days.
  6. [LOW] Investigate the 21:01 reboot — confirm whether it was a kernel panic / watchdog / power event (now that journald persists, the next one is diagnosable).
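
Action item 3 can be checked before a planned reboot with a small fstab audit. A rough sketch that complements findmnt --verify rather than replacing it: it flags entries lacking both nofail and noauto, assuming the six-field layout from fstab(5). The sample entries are hypothetical; only the mount point names come from this report, and the device specs, filesystem types, and options are placeholders.

```python
def entries_missing_nofail(fstab_text):
    """Return mount points that would block boot if their device were absent."""
    flagged = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed entry; leave it to findmnt --verify
        _spec, mount_point, fs_type, options = fields[:4]
        if mount_point == "/" or fs_type == "swap":
            continue  # the root filesystem and swap are out of scope here
        opts = options.split(",")
        if "nofail" not in opts and "noauto" not in opts:
            flagged.append(mount_point)
    return flagged

# Hypothetical fstab content (placeholder specs and options).
SAMPLE = """\
# <spec> <mount point> <type> <options> <dump> <pass>
LABEL=placeholder-ursa  /mnt/ursa   ext4  defaults,nofail  0  2
LABEL=placeholder-joker /mnt/joker  ext4  defaults         0  2
"""
print(entries_missing_nofail(SAMPLE))  # ['/mnt/joker']
```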

9. What went well

  • No data loss: a clean pre-crash dump was secured before any destructive step, and both DBs recovered with full row counts.
  • e2fsck's O_EXCL safety check and the decision to observe PG16's startup (rather than assume) caught the active filesystem corruption before it was trusted with writes.
  • Recovery decisions were gated through the operator at each escalation rather than guessed.