Computational Fragility Assessment — 2026-06-07

Author: Claude (managing-editor / ops)
Context: Requested after two server crashes in three days (2026-05-30 USB transport drop, 2026-06-01 hard freeze). Mike's question: how fragile is the system, and what is the risk that running workloads causes another crash?
Method: Live measurement (not estimation) on impera, 2026-06-07 ~12:18 local. Read-only diagnostics: free, df, lsblk, vmstat, sensors, ps, dmesg, systemctl.

Revised 2026-06-07 ~12:45 after challenge: the original draft ranked disk fullness as Fragility #1 (HIGH). That was anchoring on the percentage. On review it is a latent constraint, not an active fragility — see the correction in §"Fragility #1" and the revised TL;DR. The hardware watchdog (§4) was implemented and verified this session.

Related: docs/incident-2026-05-30-ursa-io-crash.md, docs/incident-2026-06-01-hard-freeze.md, docs/migration-pg-off-usb.md.


TL;DR

The box is healthy this minute. The fear of "compute job exhausts memory and locks the box" is not supported — RAM is the least of the worries (57 GiB free). The genuine fragilities, in order:

  1. Un-root-caused hard freeze (2026-06-01) — power/hardware signature, never explained. The one real unknown.
  2. 16-drive USB I/O surface — the transport that caused crash #1; /mnt/ursa (which dropped once) still hosts PG15.
  3. No hardware watchdog → now FIXED this session (§4). This is why the last freeze cost 36 h instead of an auto-reboot.

Disk fullness is NOT in this list. Root sits at 96% but it is static data with nothing actively writing to it (PostgreSQL is on its own mount; DuckDB staging is 0 bytes; logs are capped/rotated). A near-full filesystem is only dangerous if something is filling it or a job dumps a large output there. Neither happens in normal operation. It is a latent constraint — "don't write 19 GB to root" — not an active fragility. See §"Fragility #1" for the corrected reasoning.


What is NOT a problem (measured)

  • RAM: 62 GiB total, 4.7 GiB used, 57 GiB available. No workload under consideration comes close.
  • OOM: zero out-of-memory / oom-kill events this boot.
  • Active swap thrash: vmstat shows si/so ≈ 0 — not currently paging. (But see swap note below.)
  • Kernel errors: dmesg --level=err,crit,alert,emerg empty since the 2026-06-03 recovery.
  • USB resets/disconnects: 0 this boot.
  • Failed systemd units: none.
  • Thermals: CPU Tctl 74.2 °C against an 85.8 °C critical threshold: warm under load, not dangerous. NVMe composite 44–48 °C, fine.

Conclusion: nothing is on fire. The fragility is structural, not a present fault.
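
The RAM figure above came from free; a comparable reading can be taken straight from /proc/meminfo. The following is a minimal sketch, not the procedure used for this assessment. The function names parse_meminfo and available_gib are hypothetical, and it assumes the usual "Name: value kB" line layout.

```python
def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of integers (KiB for sized fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_gib(text):
    """MemAvailable converted from KiB to GiB."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)
```

On the box itself this would be called as available_gib(open("/proc/meminfo").read()). MemAvailable is the kernel's own estimate of memory usable without swapping, which is the same quantity free reports in its "available" column.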


Fragility #1 — Disk fullness (DOWNGRADED: latent constraint, NOT active fragility)

Correction (2026-06-07): the original draft called this HIGH and "the most likely way a job we run causes a failure." That over-weighted the raw percentage. A near-full filesystem is only a problem if (a) something is actively growing on it, or (b) a job dumps a large output there. Checking both against evidence:

  • (a) Active growth on root? No. PostgreSQL 16 is on its own mount (/var/lib/pgdata, 39%); PG15 is on /mnt/ursa; DuckDB staging is 0 bytes; the journal is capped and /var/log is rotated. Root's 417G is static — checked-out repos, AI models, build/source trees. Static data at 96% just sits there; it is not a countdown.
  • (b) Will a job fill the last 19G? Only if a heavy output is written to a root-resident path — the relevant one is workspaces/ (5.3G, on root). That is self-inflicted and trivially avoided: write heavy outputs to /mnt/ursa (4.2 TB free).

So this is a latent constraint, not a fragility. No relocation of data is required. The mitigation is a habit ("heavy writes → roomy mount"), not a project. The fill snapshot is retained below for reference.

Filesystems at or near full:

Mount          Device        Use%   Free     Notes
/              nvme1n1p3     96%    19 GB    DuckDB staging + logs write here
/mnt/blue      sdl1 (USB)    99%    20 GB
/mnt/backup    sdp1 (USB)    97%    31 GB
/mnt/storage   nvme0n1       95%    23 GB
/mnt/shared    sdb (USB)     95%    45 GB
/mnt/working   sdh1 (USB)    88%    210 GB

Roomy mounts for heavy writes: /mnt/ursa (4.2 TB free), /mnt/marzano (1.2 TB free), /mnt/nom01 (2.9 TB free), /mnt/nom02 (2.8 TB free).

Note on "freeing" root: investigated and abandoned. Safe cleanups (journal vacuum, crash dumps, apt cache) freed <1 GB on disk — there is no junk. The 417G is real data, and TerraPulse is only 6.3G of it; the bulk belongs to other tenants on the shared box (projects/nominate 74G, home/lib 84G of AI models/build trees). Relocating that would touch live non-TerraPulse services for no real safety gain. Not worth doing.

Actual mitigation (a habit, not a task):

  • Redirect heavy job outputs to /mnt/ursa (4.2 TB) or /mnt/marzano (1.2 TB), never /.
  • Glance at df / before any deliberately large write. That's it.
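
The "glance at df" habit can also be automated as a pre-flight check inside a job. This is a rough sketch under stated assumptions: the names fits_with_reserve and can_write are hypothetical, and the 5% default reserve is an illustrative choice, not a measured threshold from this assessment.

```python
import shutil


def fits_with_reserve(free_bytes, total_bytes, write_bytes, reserve_fraction=0.05):
    """True if write_bytes fits while keeping reserve_fraction of the filesystem free."""
    reserve = total_bytes * reserve_fraction  # assumed safety margin
    return free_bytes - write_bytes >= reserve


def can_write(path, write_bytes, reserve_fraction=0.05):
    """Apply the same check to the filesystem that holds `path`, using live numbers."""
    usage = shutil.disk_usage(path)
    return fits_with_reserve(usage.free, usage.total, write_bytes, reserve_fraction)
```

A job expecting roughly 10 GB of output could call can_write("/mnt/ursa", 10 * 10**9) before starting and refuse to run if the answer is False. The check is advisory only: it cannot account for other tenants writing to the same filesystem while the job runs.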

Fragility #2 — Un-root-caused hard freeze (MEDIUM, unknown, treat as live)

The 2026-06-01 freeze left no log entry — healthy right up to the instant it locked. That is a power or hardware signature, not software. It has not been root-caused.

Current load context that bears on this:

  • 12-core AMD, load average 6.40 / 5.62 / 4.02 (rising at time of measurement).
  • CPU Tctl 74 °C under that load.
  • 16 USB drives all drawing power from this one chassis.

The honest position: we cannot prove what froze it. Precisely because we can't, a max-parallel compute job that simultaneously spikes CPU and spins every drive is the exact transient power draw an aging/undersized PSU could choke on. Two crashes in three days with no root cause ⇒ be conservative with heavy parallel workloads until the hardware is checked.
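
To put the load figures above on a per-core basis, a small helper is enough. This is an illustrative sketch: load_per_core and current_load_per_core are hypothetical names, and load average is only a rough proxy for how busy the CPU is. It says nothing about electrical power draw.

```python
import os


def load_per_core(load_average, cores):
    """Load average divided by core count; values near or above 1.0 mean saturation."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def current_load_per_core():
    """Live 1-minute figure. os.cpu_count() counts logical CPUs, which can
    exceed the physical core count when SMT is enabled."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)
```

With the measured values, load_per_core(6.40, 12) is about 0.53, so the CPU was roughly half occupied at measurement time.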

Real fix (needs Mike + physical access): PSU age/capacity check, memtest86+ pass, consider a UPS (also guards against the dirty-power transient).


Fragility #3 — USB I/O surface (MEDIUM, mostly NOT compute-driven)

16 USB-transport block devices on one box. Crash #1 was a USB transport drop (/mnt/ursa fell off the bus → ext4 corruption). That is a hardware property of the drives/hub/cabling, largely independent of compute load — but heavy I/O to a flaky drive raises the odds of a reset, and /mnt/ursa (the drive that already dropped once) still hosts PG15 (migration Phase 4 pending).

Mitigation: finish PG15 → NVMe (Phase 4); reduce the number of always-mounted USB drives if any are dormant.


Fragility #4 — No hardware watchdog → FIXED 2026-06-07

Originally: /dev/watchdog absent, RuntimeWatchdogSec=0. Not a crash cause, but why the 06-01 freeze became a ~36-hour outage — nothing auto-rebooted the box.

Implemented and verified this session:

  • Loaded the AMD chipset watchdog module sp5100_tco (SP5100/SB800 TCO timer). /dev/watchdog now present.
  • Persisted at boot: /etc/modules-load.d/sp5100_tco.conf.
  • Wired systemd: /etc/systemd/system.conf.d/10-watchdog.conf sets RuntimeWatchdogSec=30s, RebootWatchdogSec=10min. Applied via daemon-reexec.
  • Confirmed in journal: "Using hardware watchdog 'SP5100 TCO timer'... Set hardware watchdog to 30s." systemd now pets /dev/watchdog ~every 15 s; if systemd or the kernel hangs >30 s, the box hard-resets.

Net effect: a future hard freeze triggers a hardware reset within ~30 s (plus normal boot time) instead of waiting for a manual power cycle.
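
The relationship between the configured timeout and the ~15 s ping interval can be checked from the drop-in file itself. The sketch below is illustrative only. It assumes the drop-in places its keys under a [Manager] section (the section systemd uses in system.conf), handles only a small subset of systemd's time-span syntax, and uses the hypothetical names span_to_seconds and watchdog_settings.

```python
import configparser
import re

# Subset of systemd time-span units; the full syntax accepts more forms.
_UNIT_SECONDS = {"s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600}


def span_to_seconds(value):
    """Convert a simple span such as '30s' or '10min' to whole seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value)
    if not match:
        raise ValueError(f"unsupported time span: {value!r}")
    number, unit = match.groups()
    if unit == "":
        return int(number)  # a bare number is read as seconds
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return int(number) * _UNIT_SECONDS[unit]


def watchdog_settings(conf_text):
    """Parse the watchdog keys from the text of a system.conf drop-in."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep systemd's CamelCase key names
    parser.read_string(conf_text)
    manager = parser["Manager"]  # assumed section name
    runtime = span_to_seconds(manager["RuntimeWatchdogSec"])
    reboot = span_to_seconds(manager["RebootWatchdogSec"])
    # systemd pings the device at half the runtime timeout.
    return {"runtime_s": runtime, "reboot_s": reboot, "ping_interval_s": runtime / 2}
```

Fed the values listed above, this yields a 30 s timeout, a 600 s reboot timeout, and a 15 s ping interval, matching the journal message. A separate check such as os.path.exists("/dev/watchdog") would confirm that the device node is present.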


Minor — swap + swappiness

6.3 GiB sits in swap despite 57 GiB free RAM, and vm.swappiness = 60. Not dangerous (no active thrash), but 60 is the stock kernel default, tuned for general-purpose use; on a server with this much RAM a lower value such as 10 is the usual choice. Trivial fix.
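
A read-only check of the current value, plus the line a persistent setting would need, can be sketched as follows. The helper names read_swappiness and sysctl_line are hypothetical. Applying the change is left out on purpose: it needs root and a sysctl.d drop-in, and the file name for that drop-in is not specified here.

```python
from pathlib import Path


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


def sysctl_line(target=10):
    """Text of the line a sysctl.d drop-in would carry for the target value."""
    return f"vm.swappiness = {target}"
```

On the box, read_swappiness() would be expected to return 60 before the change and 10 afterwards.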


Recommended order (revised)

  1. Free root → dropped. Investigated; no junk, all real data, nothing actively writing to root. Not a real task. (Habit only: heavy writes → /mnt/ursa.)
  2. Enable hardware watchdog — ✅ DONE 2026-06-07 (see §4).
  3. vm.swappiness 60 → 10 — trivial, persistent via sysctl. Optional.
  4. Hardware/power review — PSU age/capacity, memtest86+, consider a UPS. The only real fix for crash #2 (the genuine unknown). Needs Mike + physical access. ← now the top open item.
  5. PG15 off USB (migration Phase 4) — shrinks the USB failure surface and removes the second DB from the drive that already dropped once.

Workload guidance until the hardware/power review (item 4 above) is done

  • Safe now (low draw, low disk): editorial/paper work, UI, git, read-only PG queries, small extracts. Go anywhere.
  • Hold for the hardware review: large backfills, ≥10k-permutation jobs, anything multi-process parallel, anything writing GB to / or /mnt/ursa.

Aside (not TerraPulse)

dragonfli-cache (PID observed at 190% CPU) was the dominant load driver at measurement time. That is a nominate.ai / Campaign Brain process, not TerraPulse — flagged for Mike as a possible runaway on that side; out of scope for this assessment.
