hey all, hoping someone can point me at a recovery path before i do something stupid.
i had a bulk ingest go sideways and now both my big tables (`quotes` ~30B rows / 2 TB, `trades` ~1.5B / 95 GB) are suspended with:
```
Unrecoverable storage corruption detected: column version mismatch
[table=quotes~9, txnVersion=0, actualFileVersion=12460]
[table=trades~8, txnVersion=7769, actualFileVersion=8007]
```
partition files on disk are all there and untouched. it’s just the sequencer that’s hosed. a third table on the same instance (`bars`) is totally fine, daily ingest into it still works.
main question: **is there a way to forcibly resync the sequencer to the actual column versions on disk?** something like an `ALTER … RECOVER`, or a documented edit of `_txn` / `_txnlog`? i’ve got a 37 MB metadata-only snapshot so i can try stuff safely.
environment:
- QuestDB 9.3.3 (windows runtime bundle)
- Windows 11 Pro for Workstations
- E: drive, NTFS, 3.7 TB
- python questdb client 4.1.0, ILP-over-HTTP
what i was doing: bulk loading two historical tick files into the existing `quotes` and `trades` tables via the python ILP client:
- daxtq.csv.gz (20 GB compressed, FDXc1)
- stxtq.csv.gz (45 GB compressed, STXEc1)
config on the Sender: `auto_flush_rows=200000, auto_flush_interval=15000, request_timeout=60000, max_buf_size=209715200`
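for reference, this is roughly how those settings map onto a client config string. it’s a sketch, not my actual loader: the `build_conf` helper and the `localhost:9000` address are placeholders i made up for this post.

```python
# Sketch: assemble an ILP-over-HTTP config string from the settings above.
# `build_conf` and the localhost:9000 address are illustrative placeholders.

SETTINGS = {
    "auto_flush_rows": 200_000,
    "auto_flush_interval": 15_000,   # milliseconds
    "request_timeout": 60_000,       # milliseconds
    "max_buf_size": 209_715_200,     # bytes, i.e. 200 MiB
}


def build_conf(host: str, port: int, settings: dict) -> str:
    """Return a conf string of the form 'http::addr=host:port;key=value;...'."""
    params = "".join(f"{key}={value};" for key, value in settings.items())
    return f"http::addr={host}:{port};{params}"


conf = build_conf("localhost", 9000, SETTINGS)
print(conf)

# With the questdb package installed, the string would be used like this:
#   from questdb.ingress import Sender
#   with Sender.from_conf(conf) as sender:
#       ...
```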
on the first attempt, after ~20 min, every `flush()` started hitting the client-side timeout. the server log was spamming:
```
WalPurgeJob broad sweep failed [table=quotes~9, msg=Transaction read timeout [src=writer, timeout=1000ms]]
```
that line repeated every 30s. it looks like the WAL apply / purge jobs couldn’t grab the writer lock because the bulk writer was always holding it.
killed, retried with smaller batches. same thing. eventually the HTTP listener itself stopped accepting connections (java process up, port 9000 timing out). by that point `quotes~9` had 39 unapplied WAL dirs piled up (wal32 → wal70).
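in case anyone wants to check the same thing on their own box, this is a small sketch of how the `walN` dirs under a table directory can be counted. the helper name `list_wal_dirs` is just something i picked, and it only looks at directory names, it doesn’t know which ones are applied or not.

```python
import re
from pathlib import Path

# Matches directory names such as wal32 or wal70.
WAL_DIR = re.compile(r"^wal(\d+)$")


def list_wal_dirs(table_dir: Path) -> list[int]:
    """Return the sorted numeric ids of walN directories directly under table_dir."""
    ids = []
    for entry in table_dir.iterdir():
        match = WAL_DIR.match(entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)


# Example usage (path taken from my install):
#   ids = list_wal_dirs(Path(r"E:\questdb\data\db\quotes~9"))
#   print(len(ids), ids[0], ids[-1])
```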
## what i did to “fix” it (and probably made it worse)
honest hands-up here, this is the bit i need you to tell me how to recover from:
1. stopped questdb
2. **deleted `quotes~9/wal32` through `wal70`** thinking they were just stuck flush attempts
3. briefly moved `quotes~9/_todo_` aside, restarted, then put it back and restarted again
4. added to server.conf:
```
cairo.spin.lock.timeout=60000
cairo.wal.recreate.distressed.sequencer.attempts=10
cairo.writer.alter.busy.wait.timeout=60000
```
on the next startup, instead of the writer-timeout errors i now get:
```
could not process table sequencer [table=quotes~9, errno=0,
error=Transaction read timeout [src=writer, timeout=60000ms]]
skipping table during write tracker hydration [table=quotes~9, …]
```
and on `ALTER TABLE quotes RESUME WAL`:
```
could not open [table=quotes~9, thread=82,
msg=Unrecoverable storage corruption detected: column version mismatch
[table=quotes~9, txnVersion=0, actualFileVersion=12460]]
ApplyWal2TableJob job failed, table suspended
```
same for trades (txnVersion=7769, actualFileVersion=8007).
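to put the two mismatches side by side, here’s a quick sketch that pulls the numbers out of the error text and prints the gap per table. the regex is only written against the two lines i’ve pasted, so treat it as an assumption about the log format rather than anything official.

```python
import re

# Written against the error lines quoted in this post; other log variants may differ.
PATTERN = re.compile(
    r"table=(?P<table>[^,\]]+), txnVersion=(?P<txn>\d+), actualFileVersion=(?P<file>\d+)"
)


def parse_mismatches(log_text: str) -> dict[str, tuple[int, int, int]]:
    """Map table dir name -> (txnVersion, actualFileVersion, gap)."""
    found = {}
    for m in PATTERN.finditer(log_text):
        txn, actual = int(m["txn"]), int(m["file"])
        found[m["table"]] = (txn, actual, actual - txn)
    return found


LOG = """
[table=quotes~9, txnVersion=0, actualFileVersion=12460]
[table=trades~8, txnVersion=7769, actualFileVersion=8007]
"""

result = parse_mismatches(LOG)
for table, (txn, actual, gap) in result.items():
    print(f"{table}: txnVersion={txn} actualFileVersion={actual} gap={gap}")
```

so `quotes` is off by the full 12460 (txnVersion is literally 0), while `trades` is off by 238.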
## on disk now
`E:\questdb\data\db\quotes~9\`:
```
_cv          64 KB   Mar 18   (original)
_meta        64 KB   May 22   (rewritten on startup)
_meta.prev   64 KB   Mar 24
_name        17 B    Mar 18
_todo_       64 KB   May 20   (restored after my brief rename)
_txn         64 KB   Mar 18   (mtime is original?!)
ric.c/k/o    64 KB   May 20
ric.v        16 MB   May 21
txn_seq/     (3.1 MB, last write Mar 23)
2019-01.30850 …
… (~7 years of monthly partitions, intact)
```
no `wal*` dirs anymore (i removed them, see above).
## my theory
- `_txn` mtime is march, so i think it’s just initial state and the live txn is in `txn_seq/_txnlog`
- by deleting wal32-70 i killed the WAL segments the sequencer needed to replay
- the “column version mismatch” is downstream of that — sequencer thinks txn=0, column files were last written at version 12460, they can’t reconcile because the bridge (the deleted WALs) is gone
am i reading that right?
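one way i can think of to test the theory is to compare what the writer and the sequencer each believe, via `wal_tables()` over the REST `/exec` endpoint. sketch below. assumptions on my side: the instance answers on `http://localhost:9000`, and the result has `writerTxn` / `sequencerTxn` columns as i remember from the docs. the helper names are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def exec_url(base_url: str, sql: str) -> str:
    """Build a REST /exec URL for a SQL statement."""
    return f"{base_url.rstrip('/')}/exec?{urlencode({'query': sql})}"


def rows_as_dicts(payload: dict) -> list[dict]:
    """Turn a {'columns': [...], 'dataset': [...]} response into a list of row dicts."""
    names = [col["name"] for col in payload["columns"]]
    return [dict(zip(names, row)) for row in payload["dataset"]]


def wal_status(base_url: str = "http://localhost:9000") -> list[dict]:
    """Fetch wal_tables() over HTTP. Needs a running instance."""
    url = exec_url(base_url, "SELECT * FROM wal_tables()")
    with urlopen(url, timeout=30) as resp:
        return rows_as_dicts(json.load(resp))


# Example usage against a live server:
#   for row in wal_status():
#       print(row["name"], row["suspended"], row["writerTxn"], row["sequencerTxn"])
```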
## snapshot
before posting i took `E:\questdb\snapshot_2026-05-22\` — every non-partition file from both broken tables + server.conf. 37 MB total. happy to share if anyone wants to look.
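if anyone wants to take the same kind of metadata-only copy before experimenting, it boils down to something like the sketch below. not the exact script i ran: `snapshot_metadata` is a made-up name, and the regex for spotting partition dirs (names like `2019-01.30850`) is my assumption about the naming.

```python
import re
import shutil
from pathlib import Path

# Assumption: partition dirs are named like 2019-01 or 2019-01.30850.
PARTITION_DIR = re.compile(r"^\d{4}(-\d{2}){0,2}(T\d{2})?(\.\d+)?$")


def snapshot_metadata(table_dir: Path, dest_dir: Path) -> list[str]:
    """Copy everything directly under table_dir except partition directories."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(table_dir.iterdir()):
        if entry.is_dir() and PARTITION_DIR.match(entry.name):
            continue  # skip partition data, it is the bulk of the table
        target = dest_dir / entry.name
        if entry.is_dir():
            shutil.copytree(entry, target, dirs_exist_ok=True)  # e.g. txn_seq/
        else:
            shutil.copy2(entry, target)  # copy2 preserves modification times
        copied.append(entry.name)
    return copied


# Example usage (paths from my install):
#   snapshot_metadata(Path(r"E:\questdb\data\db\quotes~9"),
#                     Path(r"E:\questdb\snapshot_2026-05-22\quotes~9"))
```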
## what i’d love to know
1. is there any way to forcibly resync the sequencer to the actual column file version? `ALTER … RECOVER`, manual `_txn` rewrite, anything?
2. can `txn_seq/` be rebuilt from the partition state on disk?
3. worst case — if recovery is dead, can i read the partition files directly (DuckDB? a python tool?) to dump them as CSV and re-ingest into a fresh table? would really like to avoid losing 30B rows of tick data.
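to make question 3 concrete, this is the kind of thing i have in mind. big assumption on my part: that fixed-width columns (doubles, longs, timestamps) sit in their column files as plain little-endian arrays, which i have not verified. i also don’t know how to get a trustworthy row count per partition without a working `_txn`, so `row_count` is something the caller has to supply. `read_fixed_column` and the `bid.d` file name in the usage comment are hypothetical.

```python
import struct
from pathlib import Path


def read_fixed_column(path: Path, row_count: int, fmt: str = "d") -> list:
    """Read row_count little-endian fixed-width values from the start of a file.

    fmt is a struct code: 'd' = 8-byte float, 'q' = 8-byte signed integer.
    Bytes beyond row_count values are ignored, since they may be padding.
    """
    width = struct.calcsize("<" + fmt)
    with path.open("rb") as fh:
        raw = fh.read(row_count * width)
    if len(raw) < row_count * width:
        raise ValueError(f"{path} holds fewer than {row_count} values")
    return list(struct.unpack(f"<{row_count}{fmt}", raw))


# Example usage (hypothetical column file name and row count):
#   values = read_fixed_column(Path(r"E:\questdb\data\db\quotes~9\2019-01.30850\bid.d"), 1000)
```

symbol and string columns would clearly need more than this, so it’s only a starting point for the numeric ones.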
partition files weren’t touched in any of this so the actual data should all be there, i just need a way to get questdb to talk to it again.
thanks!