Computational Fragility Assessment — 2026-06-07
Author: Claude (managing-editor / ops)
Context: Requested after two server crashes in three days (2026-05-30 USB transport drop, 2026-06-01 hard freeze). Mike's question: how fragile is the system, and what is the risk that running workloads causes another crash?
Method: Live measurement (not estimation) on impera, 2026-06-07 ~12:18 local. Read-only diagnostics: free, df, lsblk, vmstat, sensors, ps, dmesg, systemctl.
Revised 2026-06-07 ~12:45 after challenge: the original draft ranked disk fullness as Fragility #1 (HIGH). That was anchoring on the percentage. On review it is a latent constraint, not an active fragility — see the correction in §"Fragility #1" and the revised TL;DR. The hardware watchdog (§4) was implemented and verified this session.
Related: docs/incident-2026-05-30-ursa-io-crash.md, docs/incident-2026-06-01-hard-freeze.md, docs/migration-pg-off-usb.md.
TL;DR
The box is healthy this minute. The fear of "compute job exhausts memory and locks the box" is not supported — RAM is the least of the worries (57 GiB free). The genuine fragilities, in order:
- Un-root-caused hard freeze (2026-06-01) — power/hardware signature, never explained. The one real unknown.
- 16-drive USB I/O surface: the transport that caused crash #1; /mnt/ursa (which dropped once) still hosts PG15.
- No hardware watchdog: now FIXED this session (§4). This is why the last freeze cost 36 h instead of an auto-reboot.
Disk fullness is NOT in this list. Root sits at 96% but it is static data with nothing actively writing to it (PostgreSQL is on its own mount; DuckDB staging is 0 bytes; logs are capped/rotated). A near-full filesystem is only dangerous if something is filling it or a job dumps a large output there. Neither happens in normal operation. It is a latent constraint — "don't write 19 GB to root" — not an active fragility. See §"Fragility #1" for the corrected reasoning.
What is NOT a problem (measured)
- RAM: 62 GiB total, 4.7 GiB used, 57 GiB available. No workload under consideration comes close.
- OOM: zero out-of-memory / oom-kill events this boot.
- Active swap thrash: vmstat shows si/so ≈ 0, so the box is not currently paging. (But see swap note below.)
- Kernel errors: dmesg --level=err,crit,alert,emerg is empty since the 2026-06-03 recovery.
- USB resets/disconnects: 0 this boot.
- Failed systemd units: none.
- Thermals: CPU Tctl 74.2 °C against an 85.8 °C critical threshold; warm under load, not dangerous. NVMe composite 44–48 °C, fine.
Conclusion: nothing is on fire. The fragility is structural, not a present fault.
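The RAM and swap figures above can be re-checked without shelling out to free. The following is a minimal sketch, assuming the standard Linux /proc/meminfo layout (MemAvailable, SwapTotal, SwapFree reported in kB); the function names are illustrative and not part of any existing tooling on impera.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            # Most fields are in kB; hugepage counters are plain counts.
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(info):
    """Return available RAM and used swap, both in GiB."""
    kib_per_gib = 1024 * 1024
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "available_gib": info.get("MemAvailable", 0) / kib_per_gib,
        "swap_used_gib": swap_used / kib_per_gib,
    }
```

To use it on a live box, read the file once and pass the text in: `memory_summary(parse_meminfo(open("/proc/meminfo").read()))`. Keeping the parser separate from the file read makes it easy to check against a saved snapshot.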
Fragility #1 — Disk fullness (DOWNGRADED: latent constraint, NOT active fragility)
Correction (2026-06-07): the original draft called this HIGH and "the most likely way a job we run causes a failure." That over-weighted the raw percentage. A near-full filesystem is only a problem if (a) something is actively growing on it, or (b) a job dumps a large output there. Checking both against evidence:
- (a) Active growth on root? No. PostgreSQL 16 is on its own mount (/var/lib/pgdata, 39%); PG15 is on /mnt/ursa; DuckDB staging is 0 bytes; the journal is capped and /var/log is rotated. Root's 417G is static: checked-out repos, AI models, build/source trees. Static data at 96% just sits there; it is not a countdown.
- (b) Will a job fill the last 19G? Only if a heavy output is written to a root-resident path; the relevant one is workspaces/ (5.3G, on root). That is self-inflicted and trivially avoided: write heavy outputs to /mnt/ursa (4.2 TB free).
So this is a latent constraint, not a fragility. No relocation of data is required. The mitigation is a habit ("heavy writes → roomy mount"), not a project. The fill snapshot is retained below for reference.
Filesystems at or near full:
| Mount | Device | Use% | Free | Notes |
|---|---|---|---|---|
| / | nvme1n1p3 | 96% | 19 GB | DuckDB staging + logs write here |
| /mnt/blue | sdl1 (USB) | 99% | 20 GB | |
| /mnt/backup | sdp1 (USB) | 97% | 31 GB | |
| /mnt/storage | nvme0n1 | 95% | 23 GB | |
| /mnt/shared | sdb (USB) | 95% | 45 GB | |
| /mnt/working | sdh1 (USB) | 88% | 210 GB | |
Roomy mounts for heavy writes: /mnt/ursa (4.2 TB free), /mnt/marzano (1.2 TB free), /mnt/nom01 (2.9 TB free), /mnt/nom02 (2.8 TB free).
Note on "freeing" root: investigated and abandoned. Safe cleanups (journal vacuum, crash dumps, apt cache) freed <1 GB on disk — there is no junk. The 417G is real data, and TerraPulse is only 6.3G of it; the bulk belongs to other tenants on the shared box (projects/nominate 74G, home/lib 84G of AI models/build trees). Relocating that would touch live non-TerraPulse services for no real safety gain. Not worth doing.
Actual mitigation (a habit, not a task):
- Redirect heavy job outputs to /mnt/ursa (4.2 TB) or /mnt/marzano (1.2 TB), never /.
- Glance at df / before any deliberately large write. That's it.
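The "glance at df before a large write" habit can be made mechanical. The sketch below is one possible way to do it, assuming Python's standard shutil.disk_usage; the function names and the 5 GiB reserve are hypothetical choices for illustration, not values taken from this assessment.

```python
import shutil


def has_headroom(path, needed_bytes, reserve_bytes=5 * 1024**3,
                 usage=shutil.disk_usage):
    """Return True if writing needed_bytes to path still leaves reserve_bytes free.

    reserve_bytes is an illustrative safety margin (assumption).
    usage is injectable so the check can be exercised without a real mount.
    """
    free = usage(path).free
    return free - needed_bytes >= reserve_bytes


def pick_target(candidates, needed_bytes, usage=shutil.disk_usage):
    """Return the first candidate mount with enough headroom, or None."""
    for path in candidates:
        if has_headroom(path, needed_bytes, usage=usage):
            return path
    return None
```

A job wrapper could call `pick_target(["/mnt/ursa", "/mnt/marzano"], expected_output_bytes)` and refuse to start when it returns None. Leaving / out of the candidate list encodes the "never /" rule directly.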
Fragility #2 — Un-root-caused hard freeze (MEDIUM, unknown, treat as live)
The 2026-06-01 freeze left no log entry — healthy right up to the instant it locked. That is a power or hardware signature, not software. It has not been root-caused.
Current load context that bears on this:
- 12-core AMD, load average 6.40 / 5.62 / 4.02 (rising at time of measurement).
- CPU Tctl 74 °C under that load.
- 16 USB drives all drawing power from this one chassis.
The honest position: we cannot prove what froze it. Because we can't, the conservative assumption is power: a max-parallel compute job that simultaneously spikes CPU and spins every drive produces exactly the kind of transient power draw an aging or undersized PSU could choke on. Two crashes in three days with no root cause ⇒ be conservative with heavy parallel workloads until the hardware is checked.
Real fix (needs Mike + physical access): PSU age/capacity check, memtest86+ pass, consider a UPS (also guards against the dirty-power transient).
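Until the hardware review happens, "be conservative with heavy parallel workloads" can be turned into a concrete worker cap. This is a rough sketch of one possible policy, not an established rule: the 50% ceiling (max_fraction) and the function names are assumptions chosen for illustration.

```python
import os


def conservative_workers(cores, load1, max_fraction=0.5):
    """Suggest a worker count that leaves headroom on a box under suspicion.

    max_fraction is a hypothetical policy knob: use at most this share of the
    cores, then subtract the cores already busy per the 1-minute load average.
    Always returns at least 1.
    """
    budget = int(cores * max_fraction)
    busy = int(load1 + 0.5)  # round the load average to whole cores
    return max(1, budget - busy)


def suggest_now(max_fraction=0.5):
    """Apply the policy to the live machine (Unix only: uses os.getloadavg)."""
    cores = os.cpu_count() or 1
    return conservative_workers(cores, os.getloadavg()[0], max_fraction)
```

With the figures measured above (12 cores, 1-minute load 6.40), this policy allows a single worker, which matches the guidance to hold multi-process jobs for now.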
Fragility #3 — USB I/O surface (MEDIUM, mostly NOT compute-driven)
16 USB-transport block devices on one box. Crash #1 was a USB transport drop (/mnt/ursa fell off the bus → ext4 corruption). That is a hardware property of the drives/hub/cabling, largely independent of compute load — but heavy I/O to a flaky drive raises the odds of a reset, and /mnt/ursa (the drive that already dropped once) still hosts PG15 (migration Phase 4 pending).
Mitigation: finish PG15 → NVMe (Phase 4); reduce the number of always-mounted USB drives if any are dormant.
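The "USB resets/disconnects: 0 this boot" check is worth repeating after any heavy I/O to the USB drives. Below is a minimal sketch for counting those events in saved kernel log text. The regular expressions follow the usual wording of Linux USB core messages and should be treated as assumptions; adjust them to whatever dmesg actually prints on impera.

```python
import re

# Assumed message shapes, e.g.:
#   usb 3-2: USB disconnect, device number 4
#   usb 3-2: reset SuperSpeed USB device number 4 using xhci_hcd
DISCONNECT = re.compile(r"usb \S+: USB disconnect")
RESET = re.compile(r"usb \S+: reset .*USB device")


def count_usb_events(log_text):
    """Count USB disconnect and reset lines in kernel log text."""
    counts = {"disconnect": 0, "reset": 0}
    for line in log_text.splitlines():
        if DISCONNECT.search(line):
            counts["disconnect"] += 1
        elif RESET.search(line):
            counts["reset"] += 1
    return counts
```

Feeding it the output of dmesg before and after a large job gives a quick signal of whether the workload provoked the transport, without needing to read the log by eye.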
Fragility #4 — No hardware watchdog → FIXED 2026-06-07
Originally: /dev/watchdog absent, RuntimeWatchdogSec=0. Not a crash cause, but why the 06-01 freeze became a ~36-hour outage — nothing auto-rebooted the box.
Implemented and verified this session:
- Loaded the AMD chipset watchdog module sp5100_tco (SP5100/SB800 TCO timer). /dev/watchdog now present.
- Persisted at boot: /etc/modules-load.d/sp5100_tco.conf.
- Wired systemd: /etc/systemd/system.conf.d/10-watchdog.conf → RuntimeWatchdogSec=30s, RebootWatchdogSec=10min. Applied via daemon-reexec.
- Confirmed in journal: "Using hardware watchdog 'SP5100 TCO timer'... Set hardware watchdog to 30s." systemd now pets /dev/watchdog ~every 15 s; if systemd or the kernel hangs >30 s, the box hard-resets.
Net effect: a future hard freeze triggers an automatic hard reset after ~30 s instead of waiting for a manual power cycle.
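To guard against the drop-in being edited or lost later, the watchdog settings can be verified from the file itself. This is a sketch under two assumptions: that the drop-in places its keys in the standard [Manager] section, and that values use simple single-unit spans such as 30s or 10min (systemd accepts richer forms that this parser does not handle). The function names are illustrative.

```python
import configparser

UNITS = {"s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600}


def parse_span(value):
    """Convert a simple time span such as '30s' or '10min' to seconds."""
    value = value.strip()
    digits = value.rstrip("abcdefghijklmnopqrstuvwxyz")
    unit = value[len(digits):] or "s"  # a bare number is taken as seconds
    return int(digits) * UNITS[unit]


def watchdog_settings(conf_text):
    """Extract watchdog timeouts from systemd drop-in text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the key case systemd uses
    parser.read_string(conf_text)
    manager = parser["Manager"]
    runtime = parse_span(manager["RuntimeWatchdogSec"])
    return {
        "runtime_s": runtime,
        "reboot_s": parse_span(manager["RebootWatchdogSec"]),
        # systemd pings the device at roughly half the runtime timeout
        "ping_every_s": runtime / 2,
    }
```

For the values recorded above, this yields a 30 s runtime timeout, a 600 s reboot timeout, and a ping interval of 15 s, consistent with the ~15 s petting interval noted in the journal check.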
Minor — swap + swappiness
6.3 GiB sits in swap despite 57 GiB free RAM, and vm.swappiness = 60. Not dangerous (no active thrash), but 60 is a desktop default; a server with this much RAM should run 10. Trivial fix.
Recommended order (revised)
1. Free root: dropped. Investigated; no junk, all real data, nothing actively writing to root. Not a real task. (Habit only: heavy writes → /mnt/ursa.)
2. Enable hardware watchdog: ✅ DONE 2026-06-07 (see §4).
3. vm.swappiness 60 → 10: trivial, persistent via sysctl. Optional.
4. Hardware/power review: PSU age/capacity, memtest86+, consider a UPS. The only real fix for crash #2 (the genuine unknown). Needs Mike + physical access. ← now the top open item.
5. PG15 off USB (migration Phase 4): shrinks the USB failure surface and removes the second DB from the drive that already dropped once.
Workload guidance until #4 is done
- Safe now (low draw, low disk): editorial/paper work, UI, git, read-only PG queries, small extracts. Go anywhere.
- Hold for the hardware review: large backfills, ≥10k-permutation jobs, anything multi-process parallel, anything writing GB to / or /mnt/ursa.
Aside (not TerraPulse)
dragonfli-cache (PID observed at 190% CPU) was the dominant load driver at measurement time. That is a nominate.ai / Campaign Brain process, not TerraPulse — flagged for Mike as a possible runaway on that side; out of scope for this assessment.