How to configure O3?

Hello,
Would it be possible to configure O3 for the following example script?
It generates random data that is sent every 5 seconds. In the first case, the timestamps are chosen randomly over the last 4 seconds and the disk usage is very low. In the second case, by changing the 4 to a 6 in the pd.date_range call, I generate the timestamps randomly over the last 6 seconds, so there is an overlap between batches. In this case, the disk usage is high (~1 MB/s). Even a small overlap causes the problem.

How can this be optimized to avoid such high disk usage? Could I configure the O3 buffer to be longer? The O3 parameters are currently the defaults in QDB 8.2.1.

Thank you,
Gabriel

import pandas as pd
import numpy as np
import datetime
import time
import random

from questdb.ingress import Sender, TimestampNanos

def send(n):
    symbols = ['ETH-USD', 'BTC-USD']
    sides = ['sell', 'buy']
    price = 1 + 9999 * np.random.rand(n)
    amount = np.random.rand(n)

    isymbols = np.random.randint(0, len(symbols), size=n)
    isides = np.random.randint(0, len(sides), size=n)
    symbol = np.array(symbols)[isymbols]
    side = np.array(sides)[isides]

    now = pd.Timestamp.now()

    # create a list of timestamps spread over the last 4 (or 6) seconds
    date_range = list(pd.date_range(end=now, periods=n, freq=str(4/n) + 's'))

    random.shuffle(date_range)

    df = pd.DataFrame({
        'symbol': pd.Categorical(symbol),
        'side': pd.Categorical(side),
        'price': price,
        'amount': amount,
        'timestamp': date_range})

    conf = 'http::addr=localhost:9000;'
    with Sender.from_conf(conf) as sender:
        for i in range(len(df)):
            line = df.iloc[i]

            sender.row('test_table',
                        symbols={'symbol': line['symbol'], 'side': line['side']},
                        columns={'price': float(line['price']), 'amount': float(line['amount'])},
                        at=line['timestamp'])
        sender.flush()
    
total_time = 10 * 60  # 10 minutes

# Start time
t0 = time.time()

while time.time() - t0 < total_time:
    send(5000)
    time.sleep(5)

Hi @Gabriel ,

Note that some of the O3 parameters in the config apply to non-WAL tables only.

You could try increasing cairo.o3.lag.calculation.windows.size and cairo.o3.max.lag. You could also try lowering the partition size, i.e. if you have day partitions, try hourly partitions.
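For reference, both of those are server.conf properties; the values below are only placeholders to show the format, not tuned recommendations, and cairo.o3.max.lag is in milliseconds:

# server.conf - placeholder values for illustration only
cairo.o3.lag.calculation.windows.size=8
cairo.o3.max.lag=600000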

The O3 process is copy-on-write, so some write cost is expected. To avoid it entirely, you can also split your time-series data into separate tables so that each one is append-only. If you try this route, you should reduce cairo.writer.data.append.page.size.
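Here is a rough sketch of the split-table idea based on your script. The per-symbol split and the table names are made up for illustration, and it only avoids O3 if each individual stream is itself in timestamp order:

# sketch only: route each feed/symbol to its own table so every table
# only ever receives increasing timestamps (table names are made up)
from questdb.ingress import Sender

TABLE_FOR_SYMBOL = {'ETH-USD': 'trades_eth', 'BTC-USD': 'trades_btc'}

def send_split(df):
    with Sender.from_conf('http::addr=localhost:9000;') as sender:
        for sym, group in df.groupby('symbol', observed=True):
            # sorting keeps each batch in order; the table stays append-only
            # as long as successive batches for it do not overlap in time
            group = group.sort_values('timestamp')
            for _, line in group.iterrows():
                sender.row(TABLE_FOR_SYMBOL[sym],
                           symbols={'side': line['side']},
                           columns={'price': float(line['price']),
                                    'amount': float(line['amount'])},
                           at=line['timestamp'])
        sender.flush()

The reason to lower cairo.writer.data.append.page.size with this layout is that each column writer allocates append pages of that size, which adds up once you have many small tables.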

Hi @nwoolmer ,

I wish you a happy new year.

I can't find the parameter cairo.o3.lag.calculation.windows.size in our config file, and it is not described in the documentation (Configuration | QuestDB).

Could you confirm the name and the corresponding units? Is it a time, a size, or a number of rows?

I tested with both WAL and non-WAL tables, with the same behavior. Partitions are already pretty small (most are one page of 1 MB), but rewriting 1 MB on each column every 5 seconds is costly for our application. As we have about 80 columns to write, it adds up to a continuous 16 MB/s on the disk.

Thank you,
Gabriel

To you too!

It is unitless, and defaults to 4. The value corresponds to the number of timestamps tracked within an O3 window. Essentially, the database tracks recent timestamps during the O3 process. It then checks the min and max and compares them to the timestamps in the table. This determines how accurate the delay was before commit, and whether it needs to be increased.

Since you send data every 5 seconds and choose timestamps from the last 4 seconds, this is more or less an append-only pattern. When you send every 5 seconds choosing from the last 6 seconds, it may become an O3 process if you get a timestamp in the 5->6 second range, since that is likely to be earlier than the latest timestamp already in your table.

Therefore, I was hoping that by increasing the number of tracked timestamps, the lag might increase a little, and also stabilise.

Ultimately, swapping an append-only process for an always-O3 process will always be expensive, and ideally you would change your data layout or sending pattern to avoid this - or send in larger batches to minimise write amplification (i.e. rewrite smaller partitions, less frequently).
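As a very rough sketch of the batching idea (the 60-second interval and the generate_batch() helper are placeholders, and it assumes your client version has Sender.dataframe; otherwise the same row() loop works):

# sketch only: accumulate several 5-second batches client-side and
# commit once a minute, so each partition is rewritten far less often
import time
import pandas as pd
from questdb.ingress import Sender

def run_batched(generate_batch, flush_every_s=60):
    pending = []
    last_flush = time.time()
    while True:
        pending.append(generate_batch())   # e.g. the df built in your send()
        if time.time() - last_flush >= flush_every_s:
            df = pd.concat(pending).sort_values('timestamp')
            with Sender.from_conf('http::addr=localhost:9000;') as sender:
                sender.dataframe(df, table_name='test_table',
                                 symbols=['symbol', 'side'], at='timestamp')
            pending.clear()
            last_flush = time.time()
        time.sleep(5)

The trade-off is latency: data only becomes queryable after each flush.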

I tried increasing it up to 10000 without any visible impact.
I also set cairo.o3.min.lag=10000 and tried both ILP/TCP and ILP/HTTP.

I logged the data without a designated timestamp into one table to see how much they were out of order, and they are only a few rows off. Most of the time it is less than 5 rows out of order, and less than 10 ms.

I’m hoping to find the right parameters to handle this, as otherwise we would need to add an intermediate buffer to reorder the data. The data is naturally slightly, but regularly, out of order.

What kind of batch size are you using?

If it resolves within 10ms, then you could use something like a 10ms sliding window, paired with a 10ms lag in your sender.

Therefore, at T=20, you send data for T=(0, 10).
Then at T=30, you send data for T=(10, 19)

etc.
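A minimal sketch of that idea, assuming a 10 ms lag is enough (the lag value and buffer structure here are purely illustrative): hold rows back briefly and only forward the ones that can no longer be overtaken by late arrivals.

# sketch only: buffer rows for a short lag, then forward them in order
import pandas as pd

LAG = pd.Timedelta('10ms')
buffer = []   # dicts with 'symbol', 'side', 'price', 'amount', 'timestamp'

def flush_ready(sender):
    cutoff = pd.Timestamp.now() - LAG
    ready = sorted((r for r in buffer if r['timestamp'] <= cutoff),
                   key=lambda r: r['timestamp'])
    for r in ready:
        sender.row('test_table',
                   symbols={'symbol': r['symbol'], 'side': r['side']},
                   columns={'price': float(r['price']),
                            'amount': float(r['amount'])},
                   at=r['timestamp'])
    sender.flush()
    buffer[:] = [r for r in buffer if r['timestamp'] > cutoff]

You would call flush_ready() on whatever cadence you already send at; anything newer than the cutoff stays buffered until the next call.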

I had a third party integration in a system, which could give us O3 data up to 30 mins after real-time. Since the data is not guaranteed to be correct for 30 mins, I would wait 30 mins and then send data for (-35, -30). Then after the next 5 mins had passed, I’d send (-30, -25), and so on.

It worked great!

P.S. In case it wasn’t clear from the above: since you receive data up to 10 ms late or so, you can’t trust (or accurately query) the data within that window until the 10 ms has passed. So for ‘correctness’, you would already have to wait anyway.

Batches can contain anywhere from a few entries up to a few thousand. They are sent every ~5 seconds, like in my test script.

We could add a reordering buffer on our side, but that implies implementing it in several pieces of software (in different languages). I was expecting the O3 system of QDB to be capable of handling this data with just configuration.

In fact, I don’t understand the purpose of the cairo.o3.min.lag parameter if its goal is not to handle such situations.

That parameter is used for ILP/TCP to determine when to commit the data. Since the data is being streamed continuously over the connection, the database must decide when to commit it.

With ILP/HTTP, each request is self-contained and represents a single WAL commit. These connections are independent and written to separate WALs, in whole, or thrown away entirely.

I will feed this back and see what is said!

Hi @Gabriel ,

So here’s the situation.

Firstly, we have some WIP to optimise small transactions (a handful of rows), in a process that groups them up to speed up WAL application. This helps but doesn’t fix your problem, i.e. avoiding O3 entirely.

We have another piece of work lined up to optimise WAL usage and avoid applying O3 transactions eagerly, i.e. allowing you to query committed data without having to apply it to the table (the slow part). This would help your situation.

One thing to try is to swap to ILP/TCP. With ILP/HTTP, each request (per-table) becomes its own transaction, and we do not split these by time. With ILP/TCP, there is a dynamic commit lag. If you are lucky, this will sometimes expand to cover your scenario. Other times, it will not and you will still get O3. However, you might find that this performs better than what you currently have.
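For what it’s worth, the swap itself is just the configuration string in your script (9009 is the default ILP/TCP port); commits are then decided server-side by the lag settings rather than per request:

# same sending code as before, but over ILP/TCP instead of HTTP
from questdb.ingress import Sender

conf = 'tcp::addr=localhost:9009;'
with Sender.from_conf(conf) as sender:
    ...  # same sender.row(...) calls as in your script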

Otherwise, this is going to remain a todo for the time being. Hopefully you can ensure that your block storage or other device has enough headroom to handle the O3.

Hi Nick,

Thank you for the answer. I will follow the todos :slight_smile: If the data can be queried before being applied to the table, that would indeed be ideal.

I tested with ILP/TCP earlier and the results were indeed not reproducible: sometimes there was no overhead in the write process, and sometimes there was no difference from ILP/HTTP.

Gabriel
