QuestDB Recovery

hey all, hoping someone can point me at a recovery path before i do something stupid.

i had a bulk ingest go sideways and now both my big tables (`quotes` ~30B rows / 2 TB, `trades` ~1.5B / 95 GB) are suspended with:

```
Unrecoverable storage corruption detected: column version mismatch
[table=quotes~9, txnVersion=0, actualFileVersion=12460]
[table=trades~8, txnVersion=7769, actualFileVersion=8007]
```

partition files on disk are all there and untouched. it’s just the sequencer that’s hosed. a third table on the same instance (`bars`) is totally fine, daily ingest into it still works.

main question: **is there a way to forcibly resync the sequencer to the actual column versions on disk?** something like an `ALTER … RECOVER`, or a documented edit of `_txn` / `_txnlog`? i’ve got a 37 MB metadata-only snapshot so i can try stuff safely.

- QuestDB 9.3.3 (windows runtime bundle)
- Windows 11 Pro for Workstations
- E: drive, NTFS, 3.7 TB
- python questdb client 4.1.0, ILP-over-HTTP

bulk loading two historical tick files into the existing `quotes` and `trades` tables via the python ILP client:

- daxtq.csv.gz (20 GB compressed, FDXc1)
- stxtq.csv.gz (45 GB compressed, STXEc1)

config on the Sender: `auto_flush_rows=200000, auto_flush_interval=15000, request_timeout=60000, max_buf_size=209715200`
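for reference, roughly how those settings map onto a client configuration string. this is a sketch, not my actual loader: `build_conf` and the `localhost:9000` address are made up for illustration, only the four settings are the real ones.

```python
# Sketch only: build_conf and the address are illustrative, not from the real loader.
def build_conf(addr: str, **settings: int) -> str:
    """Assemble an ILP-over-HTTP configuration string for the QuestDB Python client."""
    parts = [f"http::addr={addr}"]
    parts += [f"{key}={value}" for key, value in settings.items()]
    return ";".join(parts) + ";"


CONF = build_conf(
    "localhost:9000",              # assumed address, not stated in the post
    auto_flush_rows=200_000,       # flush after this many buffered rows
    auto_flush_interval=15_000,    # ...or after this many milliseconds
    request_timeout=60_000,        # per-request timeout in milliseconds
    max_buf_size=209_715_200,      # 200 MiB cap on the send buffer
)

# With the client installed, the string would be used like this:
#   from questdb.ingress import Sender
#   with Sender.from_conf(CONF) as sender:
#       ...
```

with 200k rows per flush and a 60 s request timeout, a single slow server-side commit is enough to trip the client timeout, which matches what i saw.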

first attempt, after ~20 min every `flush()` started hitting the client-side timeout. server log was spamming:

```
WalPurgeJob broad sweep failed [table=quotes~9, msg=Transaction read timeout [src=writer, timeout=1000ms]]
```

every 30s. seems like WAL apply / purge couldn’t grab the writer lock because the bulk writer was always holding it.

killed, retried with smaller batches. same thing. eventually the HTTP listener itself stopped accepting connections (java process up, port 9000 timing out). by that point `quotes~9` had 39 unapplied WAL dirs piled up (wal32 → wal70).
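a read-only way to count those directories before touching anything, as a minimal sketch. it assumes WAL directories are named `wal<N>` and sit directly under the table directory, which is how they looked here; `list_wal_dirs` is just an illustrative helper name.

```python
import re
from pathlib import Path

# Assumption: WAL directories are named wal<N> directly under the table directory.
WAL_DIR = re.compile(r"^wal(\d+)$")


def list_wal_dirs(table_dir):
    """Return the numeric ids of wal<N> directories under table_dir, sorted ascending."""
    ids = []
    for entry in Path(table_dir).iterdir():
        match = WAL_DIR.match(entry.name)
        if match and entry.is_dir():
            ids.append(int(match.group(1)))
    return sorted(ids)
```

for example `list_wal_dirs(r"E:\questdb\data\db\quotes~9")` would have returned the ids 32 through 70, i.e. 39 entries. it only lists, it never deletes.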

## what i did to “fix” it (and probably made it worse)

honest hands-up here, this is the bit i need you to tell me how to recover from:

1. stopped questdb

2. **deleted `quotes~9/wal32` through `wal70`** thinking they were just stuck flush attempts

3. briefly moved `quotes~9/_todo_` aside, restarted, then put it back and restarted again

4. added to server.conf:

```
cairo.spin.lock.timeout=60000
cairo.wal.recreate.distressed.sequencer.attempts=10
cairo.writer.alter.busy.wait.timeout=60000
```

on the next startup, instead of the writer-timeout errors i now get:

```
could not process table sequencer [table=quotes~9, errno=0,
error=Transaction read timeout [src=writer, timeout=60000ms]]
skipping table during write tracker hydration [table=quotes~9, …]
```

and on `ALTER TABLE quotes RESUME WAL`:

```
could not open [table=quotes~9, thread=82,
msg=Unrecoverable storage corruption detected: column version mismatch
     [table=quotes~9, txnVersion=0, actualFileVersion=12460]]
ApplyWal2TableJob job failed, table suspended
```

same for trades (txnVersion=7769, actualFileVersion=8007).
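to put numbers on how far apart the two versions are, a small sketch that pulls them out of the log text. the regex is tied to the exact message shape quoted above and nothing else; `version_gaps` is an illustrative name.

```python
import re

# Matches the "[table=..., txnVersion=..., actualFileVersion=...]" fragment quoted above.
MISMATCH = re.compile(
    r"table=(?P<table>[^,\]]+), txnVersion=(?P<txn>\d+), "
    r"actualFileVersion=(?P<actual>\d+)"
)


def version_gaps(log_text):
    """Map each table to (txnVersion, actualFileVersion, actual - txn)."""
    gaps = {}
    for m in MISMATCH.finditer(log_text):
        txn, actual = int(m["txn"]), int(m["actual"])
        gaps[m["table"]] = (txn, actual, actual - txn)
    return gaps


LOG = """
[table=quotes~9, txnVersion=0, actualFileVersion=12460]
[table=trades~8, txnVersion=7769, actualFileVersion=8007]
"""
print(version_gaps(LOG))
# {'quotes~9': (0, 12460, 12460), 'trades~8': (7769, 8007, 238)}
```

so `trades` is 238 versions behind, while `quotes` looks like it has been reset to zero entirely.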

## on disk now

`E:\questdb\data\db\quotes~9\`:

```
_cv            64 KB   Mar 18  (original)
_meta          64 KB   May 22  (rewritten on startup)
_meta.prev     64 KB   Mar 24
_name          17 B    Mar 18
_todo_         64 KB   May 20  (restored after my brief rename)
_txn           64 KB   Mar 18  (mtime is original?!)
ric.c/k/o      64 KB   May 20
ric.v          16 MB   May 21
txn_seq/       (3.1 MB, last write Mar 23)
2019-01.30850  …
…              (~7 years of monthly partitions, intact)
```

no `wal*` dirs anymore (i removed them, see above).

## my theory

- `_txn` mtime is march, so i think it’s just initial state and the live txn is in `txn_seq/_txnlog`

- by deleting wal32-70 i killed the WAL segments the sequencer needed to replay

- the “column version mismatch” is downstream of that — sequencer thinks txn=0, column files were last written at version 12460, they can’t reconcile because the bridge (the deleted WALs) is gone

am i reading that right?

## snapshot

before posting i took `E:\questdb\snapshot_2026-05-22\` — every non-partition file from both broken tables + server.conf. 37 MB total. happy to share if anyone wants to look.
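if anyone wants to take the same kind of metadata-only copy, here is a minimal sketch. it is not the exact commands i ran. it assumes everything that matters besides the partitions is either a top-level file or the `txn_seq/` directory, as in the listing above; `snapshot_metadata` and `keep_dirs` are illustrative names.

```python
import shutil
from pathlib import Path


def snapshot_metadata(table_dir, dest_dir, keep_dirs=("txn_seq",)):
    """Copy top-level files plus the named subdirectories; skip all other directories.

    Assumption: partition data lives in subdirectories, so skipping directories
    that are not listed in keep_dirs leaves the partitions out of the snapshot.
    """
    src = Path(table_dir)
    dest = Path(dest_dir) / src.name
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src.iterdir()):
        if entry.is_file():
            shutil.copy2(entry, dest / entry.name)  # copy2 keeps mtimes
        elif entry.is_dir() and entry.name in keep_dirs:
            shutil.copytree(entry, dest / entry.name, dirs_exist_ok=True)
        else:
            continue  # partition directory or anything else: leave it alone
        copied.append(entry.name)
    return copied
```

run it with questdb stopped, once per broken table, then copy `server.conf` next to the result. `copy2` is used so the file mtimes survive, since those are part of the evidence here.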

## what i’d love to know

1. is there any way to forcibly resync the sequencer to the actual column file version? `ALTER … RECOVER`, manual `_txn` rewrite, anything?

2. can `txn_seq/` be rebuilt from the partition state on disk?

3. worst case — if recovery is dead, can i read the partition files directly (DuckDB? a python tool?) to dump them as CSV and re-ingest into a fresh table? would really like to avoid losing 30B rows of tick data.

partition files weren’t touched in any of this so the actual data should all be there, i just need a way to get questdb to talk to it again.

thanks!

Hi. Deleting WAL files (or anything else under the QuestDB database folder) is a recipe for disaster. We advise doing this only after contacting the QuestDB team via the community forum or Slack.

We don’t have a good solution for this, but I created a branch of the questdb project with a new utils script that MIGHT be able to get data from all the unapplied wal files and create a parquet file for each. Then you can just ingest those parquet files again into questdb after resuming your tables?

By default, you can just point it to the db root folder and provide an output folder. The tool will try its best to convert to parquet. It might be the case there are some missing files with important information, like symbol mappings, and some files need to be skipped.
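For the re-ingest step, a rough sketch of how the statements could be generated once the tool has produced its output. This is an illustration under assumptions, not part of the tool: it assumes the output folder contains `.parquet` files, that QuestDB's `read_parquet()` SQL function can see them (as far as I know it resolves paths relative to the configured import root, so the files may need to be copied there), and that the parquet columns line up with the target table. The helper name `reingest_statements` is hypothetical, and how the output files map to tables is something you will have to check by hand.

```python
from pathlib import Path


def reingest_statements(output_dir, table):
    """Build one INSERT ... SELECT per .parquet file in output_dir, sorted by file name.

    Assumptions: the files are readable through read_parquet() by file name,
    and their column order matches the target table.
    """
    statements = []
    for path in sorted(Path(output_dir).glob("*.parquet")):
        file_name = path.name.replace("'", "''")  # escape quotes for the SQL literal
        statements.append(
            f"INSERT INTO {table} SELECT * FROM read_parquet('{file_name}');"
        )
    return statements
```

It only builds strings, so you can review every statement before running anything against the resumed tables.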

I hope it helps, but I cannot guarantee it. Let me know how it goes.