Migration plan — move Postgres off the USB drive onto internal SSD

Goal: get TerraPulse's Postgres data off /mnt/ursa (the consumer WD easystore USB drive that dropped and corrupted on 2026-05-30, see incident postmortem) and onto internal NVMe, eliminating the USB-transport drop risk for the primary database.

Decision (Mike, 2026-06-01): reclaim /mnt/whoru for the landing zone, archive its /home first. Migrate PG16 (TerraPulse, 153 GB) first.


Situation / constraints (verified 2026-06-01)

  • Postgres data: PG16 (5433, TerraPulse) = 153 GB; PG15 (5432, secondary / nominate.ai) = 268 GB. Total 421 GB.
  • Internal storage = 2 NVMe only, both ~full:
    • Samsung 980 1 TB: / (19 GB free) + /mnt/whoru (456 GB, 100% full).
    • WD SN520 512 GB: /mnt/storage (23 GB free).
    • Every other disk on the box (17 of them) is USB.
  • Landing zone = /mnt/whoru (nvme0n1p2, UUID 381fb52a-f773-4dda-b774-1647189209d5). It is a dormant OS clone of host "akira" (its own root tree + fstab; /home for bisenbek+ubuntu; last modified 2024-12; nothing written in 2025–2026). It is listed in the current fstab with nofail.
  • Capacity math: reclaimed 456 GB fits PG16 (153 GB) with ~300 GB headroom. It does NOT safely fit PG16+PG15 (421 of 456 GB leaves no room for WAL, temp files, vacuum, or growth). So PG16-only on internal; PG15 is a separate later decision (Phase 4).
  • Backups: a valid 6 GB dump from 2026-05-31 is on /mnt/backup-ursa (separate disk), verified restorable. This is the safety net if anything goes wrong.
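
The capacity math above can be checked with a few lines of shell arithmetic. This is only a sketch using the sizes quoted in this plan; the helper names headroom_gb and fill_pct are illustrative and not part of any existing tooling. A freshly formatted ext4 volume also loses some space to metadata and reserved blocks, so real headroom will be somewhat lower than these figures.

```shell
# Headroom left on a volume after placing data on it, in whole GB.
# headroom_gb and fill_pct are hypothetical helper names; sizes come from this plan.
headroom_gb() {
  local capacity_gb="$1" data_gb="$2"
  echo $(( capacity_gb - data_gb ))
}

# Percentage of the volume the data would occupy (integer, rounded down).
fill_pct() {
  local capacity_gb="$1" data_gb="$2"
  echo $(( data_gb * 100 / capacity_gb ))
}

echo "PG16 only:   $(headroom_gb 456 153) GB free, $(fill_pct 456 153)% full"
echo "PG16 + PG15: $(headroom_gb 456 421) GB free, $(fill_pct 456 421)% full"
```

With the plan's numbers this reports 303 GB free (33% full) for PG16 alone and 35 GB free (92% full) for both clusters, which is why PG15 is kept out of this landing zone.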

Archive target for whoru's /home (380 GB)

Needs a non-critical USB disk with ≥380 GB free. Candidates (free space): /mnt/marzano (1.6 TB, 8% used) ← proposed, /mnt/nom01 (2.9 TB), /mnt/nom02 (2.8 TB), /mnt/tm (1.4 TB). NOT /mnt/backup-ursa (reserved for PG backups). Mike to confirm/redirect.
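
Before committing to a candidate, the free-space requirement can be checked mechanically. A minimal sketch, assuming POSIX df output; free_gb and has_room are hypothetical helper names, and the result is in 1024-based GB, so treat it as approximate near the threshold.

```shell
# Free space on the filesystem holding a path, in whole GB.
# Uses POSIX df output (-P) in 1 KiB blocks (-k); column 4 is "Available".
free_gb() {
  df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if the path has at least the required number of GB free.
has_room() {
  local path="$1" need_gb="$2"
  [ "$(free_gb "$path")" -ge "$need_gb" ]
}
```

Example use against the proposed target:

    has_room /mnt/marzano 380 && echo "marzano has room for the archive"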


Phase 0 — Pre-flight (no downtime, no risk)

  1. Confirm fresh valid backup exists on /mnt/backup-ursa (already true; re-run terrapulse-backup.service if stale).
  2. Confirm no symlinks/tablespaces point outside the PG16 data dir:
    sudo -u postgres psql -p 5433 -Atc "select spcname, pg_tablespace_location(oid) from pg_tablespace;"
    
    (Expect only pg_default/pg_global = inside data dir. If external tablespaces exist, add them to the rsync set.)
  3. Record current data_directory + row-count baseline for post-migration validation:
    sudo -u postgres psql -p 5433 -d terrapulse -Atc "select count(*) from observations;"  # expect ~264.7M
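
The tablespace check in step 2 can be turned into a pass/fail filter instead of an eyeball check. This is a sketch, not existing tooling: external_tablespaces is a hypothetical helper that reads the "spcname|location" lines psql -At prints and lists any tablespace located outside the data dir. It relies on the built-in tablespaces reporting an empty location.

```shell
# Read "spcname|location" lines (psql -At format) on stdin and print the
# names of tablespaces located outside the given data directory.
# pg_default and pg_global report an empty location and are skipped.
external_tablespaces() {
  local datadir="${1%/}" name location
  while IFS='|' read -r name location; do
    [ -z "$location" ] && continue
    case "$location" in
      "$datadir"/*) ;;        # inside the data dir, already covered by rsync
      *) echo "$name" ;;      # outside: must be added to the rsync set
    esac
  done
}
```

Example use (empty output means nothing extra to copy):

    sudo -u postgres psql -p 5433 -Atc "select spcname, pg_tablespace_location(oid) from pg_tablespace;" \
      | external_tablespaces /mnt/ursa/data/terrapulse/postgres/16/main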
    

Phase 1 — Archive & reclaim /mnt/whoru (no PG downtime)

  1. Archive akira's /home to the chosen USB disk (verify-on-copy):
    sudo mkdir -p /mnt/marzano/akira-archive-2026-06
    sudo rsync -aHAX --info=progress2 /mnt/whoru/home/ /mnt/marzano/akira-archive-2026-06/home/
    # plus a small reference tar of system config
    sudo tar -czf /mnt/marzano/akira-archive-2026-06/etc-var-log.tar.gz -C /mnt/whoru etc var/log 2>/dev/null || true
    
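  2. Verify the archive (file counts must match, and the checksum dry-run must list no differences):
    # -i itemizes every difference; without it a dry run prints nothing even when files differ
    sudo rsync -aHAXni --checksum /mnt/whoru/home/ /mnt/marzano/akira-archive-2026-06/home/ | head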
    diff <(sudo find /mnt/whoru/home -type f | wc -l) <(sudo find /mnt/marzano/akira-archive-2026-06/home -type f | wc -l)
    
    DO NOT proceed until the archive verifies clean.
  3. Unmount + remove from fstab (it's nofail, safe):
    sudo umount /mnt/whoru
    sudo sed -i '/381fb52a-f773-4dda-b774-1647189209d5/d' /etc/fstab
    
  4. Reformat the partition fresh and mount as the PG data home:
    sudo mkfs.ext4 -L pgdata /dev/nvme0n1p2
    sudo mkdir -p /var/lib/pgdata
    echo 'LABEL=pgdata /var/lib/pgdata ext4 defaults 0 2' | sudo tee -a /etc/fstab
    sudo systemctl daemon-reload && sudo mount /var/lib/pgdata
    sudo chown postgres:postgres /var/lib/pgdata && sudo chmod 700 /var/lib/pgdata
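
Since step 4 destroys the only other copy of akira's /home, an independent cross-check of the archive before reformatting may be worth the extra read pass. The sketch below assumes GNU find, sort, xargs and sha256sum (standard on Ubuntu); manifest, tree_digest and trees_match are hypothetical helper names. It compares file contents only and ignores ownership, permissions, symlinks and empty directories, so it complements the rsync dry-run rather than replacing it.

```shell
# Print a sorted "checksum  ./relative/path" manifest of every regular file under a directory.
manifest() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# Collapse a manifest into one digest so two trees can be compared cheaply.
tree_digest() {
  manifest "$1" | sha256sum | cut -d' ' -f1
}

# Succeed only if both directories exist and hold the same files with identical contents.
trees_match() {
  [ -d "$1" ] && [ -d "$2" ] || return 1
  [ "$(tree_digest "$1")" = "$(tree_digest "$2")" ]
}
```

Example use, run as root before the umount in step 3 so every file is readable:

    trees_match /mnt/whoru/home /mnt/marzano/akira-archive-2026-06/home && echo "archive matches"

If the digests differ, saving both manifest outputs to files and diffing them shows which paths disagree.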
    

Phase 2 — Migrate PG16 (minimal downtime)

  1. Pre-sync LIVE (PG still serving — copies the bulk of 153 GB now):
    sudo rsync -aHAX --delete --info=progress2 \
      /mnt/ursa/data/terrapulse/postgres/16/main/ /var/lib/pgdata/16-main/
    
  2. Maintenance window (downtime starts — target a few minutes):
    # quiesce writers + API, then stop the cluster
    sudo systemctl stop terrapulse terrapulse-blitzortung terrapulse-glm terrapulse-pulse
    sudo systemctl stop postgresql@16-main
    # final delta sync (fast — only what changed since pre-sync)
    sudo rsync -aHAX --delete --info=progress2 \
      /mnt/ursa/data/terrapulse/postgres/16/main/ /var/lib/pgdata/16-main/
    sudo chown -R postgres:postgres /var/lib/pgdata/16-main && sudo chmod 700 /var/lib/pgdata/16-main
    # repoint the cluster
    sudo sed -i "s#^data_directory = .*#data_directory = '/var/lib/pgdata/16-main'#" \
      /etc/postgresql/16/main/postgresql.conf
    sudo systemctl start postgresql@16-main
    
  3. Validate before declaring success:
    • log shows "database system is ready to accept connections" and a completed checkpoint (not the EUCLEAN PANIC from the incident).
    • select count(*) from observations; matches the Phase-0 baseline (~264.7M).
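    • show data_directory; returns /var/lib/pgdata/16-main. (select pg_relation_filepath('observations'); only returns a path relative to the data dir, so it cannot confirm the location by itself.)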
  4. Bring services back + verify: start terrapulse units; API health {"status":"ok","db":true}; site 200; fresh observation timestamp advancing.
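
The row-count and location checks in step 3 can be wrapped into a single gate that must pass before the services in step 4 are restarted. This is a sketch under assumptions: validate_cutover is a hypothetical helper, and it expects the caller to pass in the Phase-0 baseline, the current count, and the data directory the running server reports.

```shell
# Succeed only if the row count equals the baseline exactly and the server
# reports the expected data directory (trailing slashes are ignored).
validate_cutover() {
  local baseline="$1" current="$2" datadir="$3" expected="$4"
  [ "$current" -eq "$baseline" ] || {
    echo "row count $current != baseline $baseline" >&2
    return 1
  }
  [ "${datadir%/}" = "${expected%/}" ] || {
    echo "data_directory is $datadir, expected $expected" >&2
    return 1
  }
  echo "cutover validated"
}
```

Example use, with baseline holding the exact number recorded in Phase 0 (writers are stopped during the window, so the counts should be identical, not just close):

    current=$(sudo -u postgres psql -p 5433 -d terrapulse -Atc "select count(*) from observations;")
    datadir=$(sudo -u postgres psql -p 5433 -Atc "show data_directory;")
    validate_cutover "$baseline" "$current" "$datadir" /var/lib/pgdata/16-main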

Phase 3 — Soak & reclaim (days later)

  1. Leave the old USB data dir intact, renamed, for a soak period:
    sudo mv /mnt/ursa/data/terrapulse/postgres/16/main /mnt/ursa/data/terrapulse/postgres/16/main.PRE-MIGRATION
    
    (Do this only after Phase 2 validates; the rename makes accidental use obvious.)
  2. After ~3–7 days of healthy operation on internal NVMe, delete the old dir to reclaim USB space.

Phase 4 — PG15 (separate decision, later)

268 GB won't fit on the reclaimed SSD alongside PG16 with safe headroom. Options: (a) slim PG15 then co-locate; (b) dedicated hardware; (c) leave on a hardened USB setup (usb-storage not uas, power-saving already off). This is the nominate.ai DB; lower priority for TerraPulse's risk. Decide after PG16 is settled.


Rollback

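Until Phase 3's delete, the contents of the original USB data dir are never modified; Phase 3 step 1 only renames it to main.PRE-MIGRATION. If PG16 won't start on internal, repoint the cluster at the old dir (if the rename has already happened, move it back to main first, or use the main.PRE-MIGRATION path as data_directory):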

sudo systemctl stop postgresql@16-main
sudo sed -i "s#^data_directory = .*#data_directory = '/mnt/ursa/data/terrapulse/postgres/16/main'#" \
  /etc/postgresql/16/main/postgresql.conf
sudo systemctl start postgresql@16-main   # back on USB, zero data loss

Last-resort: restore from the verified dump on /mnt/backup-ursa.
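
Because Phase 3 step 1 renames the old directory, a rollback during the soak period has to target main.PRE-MIGRATION rather than main. A small sketch of that decision, where rollback_datadir is a hypothetical helper name:

```shell
# Print the old USB data dir to roll back to: the original name if it still
# exists, otherwise the soak-period name from Phase 3 step 1.
rollback_datadir() {
  local base="${1%/}"
  if [ -d "$base" ]; then
    echo "$base"
  elif [ -d "$base.PRE-MIGRATION" ]; then
    echo "$base.PRE-MIGRATION"
  else
    echo "no old data dir found at $base" >&2
    return 1
  fi
}
```

Example use, feeding the result into the same sed as above:

    old=$(rollback_datadir /mnt/ursa/data/terrapulse/postgres/16/main) || exit 1
    sudo sed -i "s#^data_directory = .*#data_directory = '$old'#" /etc/postgresql/16/main/postgresql.conf

Note that rolling back after the cluster has accepted writes on internal NVMe discards those writes, so this is only lossless when PG16 never came up cleanly on the new location.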

Risk notes

  • rsync of a live PG data dir (Phase 2 pre-sync) is safe ONLY because a final delta sync runs after the cluster is stopped; the stopped-state copy is the authoritative one.
  • The internal Samsung 980 is fast NVMe, so after a good pre-sync the final delta sync of the 153 GB cluster should take seconds to minutes. The pre-sync itself is limited by USB read speed.
  • Keep the box's power-saving disabled (done 2026-05-31) so nothing suspends mid-copy.
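
To set expectations for the pre-sync, the USB-bound copy time can be estimated as size divided by sustained read rate. The sketch below uses decimal units and rounds up; eta_minutes is a hypothetical helper, and the 100 MB/s in the example is a placeholder assumption, not a measurement of /mnt/ursa.

```shell
# Minutes needed to move size_gb at a sustained rate of mb_per_s
# (decimal units: 1 GB = 1000 MB), rounded up to the next whole minute.
eta_minutes() {
  local size_gb="$1" mb_per_s="$2"
  echo $(( (size_gb * 1000 + mb_per_s * 60 - 1) / (mb_per_s * 60) ))
}

# 153 GB at an assumed 100 MB/s
eta_minutes 153 100
```

At the assumed 100 MB/s this prints 26; substituting a measured read rate for the drive gives a more realistic figure for planning the pre-sync.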

FLAG FOR BRAD — PG15 (nominatim) is still on the USB drive (2026-06-09)

This part is not TerraPulse's to migrate — it's yours. Raising it because it's a real, unresolved risk on shared hardware.

What: PG15 on port 5432 (data dir /mnt/ursa/data/postgresql, the nominatim database backing nominate.ai / geo.campaignbrain.dev) is still living on /mnt/ursa — the consumer WD easystore USB drive whose transport drop corrupted the ext4 filesystem and took both businesses down on 2026-05-30.

Why it matters: TerraPulse already moved its own DB (PG16) off that drive onto internal NVMe on 2026-06-01, which is the only reason TerraPulse survived the 2026-06-01 hard-freeze with zero data loss. PG15/nominatim did not get that protection — it's still exposed to the same USB-transport-drop failure mode. If the drive drops again, nominate.ai's geocoding DB is the thing that corrupts.

Why TerraPulse isn't doing it: that cluster is yours (Campaign Brain / nominate.ai). TerraPulse only consumes it over HTTP at geo.campaignbrain.dev; we never touch the cluster directly, and per the box's tenancy boundary we don't move other tenants' data. Hence: flag, not action.

Constraints worth knowing (from the PG16 migration above):

  • PG15 nominatim is ~268 GB. The internal NVMe landing zone used for PG16 does not have room for both (421 GB of PG vs 456 GB reclaimed). PG15 needs its own internal target, or a cleanup, before it can move.
  • /mnt/ursa is a SHARED mount; dragonfli.service propagates it into the host namespace (mount --make-private before any umount). Quiesce both businesses' services before touching the mount.
  • A verified-restorable backup pattern and the full PG16 playbook are above — the same Phase 1–3 approach applies to PG15 if/when you decide to move it.

Ask: decide whether to migrate PG15/nominatim off the USB drive (recommended, same rationale that saved PG16), and if so, where its ~268 GB lands.
