Hi Ingmar,
You are right that if you go down the telegraf sparse-to-dense migration route, you would need to pick one of these three options:
- Keep telegraf in-between (not ideal)
- Change your ingestion method, so data is sent in a dense format. This would be my preference, if possible, as it avoids any further processing. InfluxDB themselves recommend this approach when moving from v2 to v3, since in v3 they followed the QuestDB architecture and went with a dense row format (see the sketch right after this list). However, this might not always be easy, which takes me to option three
- Take advantage of some of the latest additions to QuestDB: Time-To-Live (TTL) for data in your tables, plus materialized views.
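Just to illustrate what I mean by "dense" for option two (all names and values below are made up): with a sparse setup, telegraf typically emits one ILP line per reading, each carrying a single field:
sensors,vehicle_id=truck-17 sensor1=3.2 1748822400120000000
sensors,vehicle_id=truck-17 sensor3=17.9 1748822400250000000
A dense setup would instead send a single ILP line carrying all the fields for that vehicle and timestamp:
sensors,vehicle_id=truck-17 sensor1=3.2,sensor2=41i,sensor3=17.9,sensor4=0.4,status="ok" 1748822400250000000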
Let me tell you more about option three. The problem with sparse data is that it can be hard to query and inefficient to store, so we prefer a dense format. To move from sparse to dense, we basically need to accumulate the data for the same logical series and the same timestamp (the exact same timestamp, or around the same time, for example within the same second) and output a single row of data. A query might look like this:
SELECT timestamp, vehicle_id,
  last(sensor1) AS sensor1, last(sensor2) AS sensor2, last(sensor3) AS sensor3,
  last(sensor4) AS sensor4, last(status) AS status
FROM sensors
SAMPLE BY 500T;
The query would return a row for each vehicle and each 500 millisecond interval (provided some sensor sent data for that vehicle within that 500 millisecond bucket). The output would be a dense row with the last value received within that 500 ms interval for each of the sensors in the vehicle.
Of course you could do this at 1 second intervals, at 1 millisecond intervals, or at whatever resolution makes sense for you.
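To make this concrete with made-up values: say these three sparse rows arrive for the same vehicle within the same 500 ms bucket:
2025-06-02T00:00:00.120000Z  truck-17  sensor1=3.2
2025-06-02T00:00:00.250000Z  truck-17  sensor2=41
2025-06-02T00:00:00.400000Z  truck-17  status='ok'
The query above would collapse them into one dense row for the bucket starting at 2025-06-02T00:00:00.000000Z, with sensor1=3.2, sensor2=41, status='ok', and sensor3/sensor4 left NULL because nothing arrived for them in that bucket. One caveat: since every sparse row leaves most columns NULL, you want the aggregation to pick the latest non-null value per column; if plain last() gives you NULLs here on your QuestDB version, swap it for last_not_null(), which skips NULLs explicitly.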
So, if you ingest all your sparse data directly into a table named sensors and execute the query above, you would see a dense result. But of course that might get slow as the table grows, and it is also inconvenient to have to run that query every time. Which is why I would suggest using materialized views.
With materialized views, you can create a view over a table that is kept up to date continuously.
Suppose you have this table:
CREATE TABLE raw_sensors (
timestamp TIMESTAMP,
vehicle_id SYMBOL CAPACITY 20000,
sensor1 FLOAT,
sensor2 LONG,
sensor3 FLOAT,
sensor4 FLOAT,
status VARCHAR
)
timestamp(timestamp) PARTITION BY DAY TTL 7 DAYS;
That would create a table where you can ingest your raw (sparse) data, and it uses TTL 7 DAYS to automatically remove data once it is more than 7 days old. Why am I doing this? Well, because I am suggesting you could also create this:
CREATE MATERIALIZED VIEW sensors AS (
  SELECT timestamp, vehicle_id,
    last(sensor1) AS sensor1, last(sensor2) AS sensor2, last(sensor3) AS sensor3,
    last(sensor4) AS sensor4, last(status) AS status
  FROM raw_sensors
  SAMPLE BY 500T
) PARTITION BY DAY;
This materialized view would be refreshed every time there is a transaction on your raw_sensors table, automatically storing the result of the query into a “table” named sensors, which would have the dense format.
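You would then query sensors exactly like any other table, for example (vehicle id made up):
SELECT * FROM sensors
WHERE vehicle_id = 'truck-17' AND timestamp IN '2025-06-02'
LIMIT 10;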
If you don’t need the view to materialize in real time, and you are OK with it materializing, for example, every minute, you can instead create the view as:
CREATE MATERIALIZED VIEW sensors REFRESH START '2025-06-02T00:00:00.000000Z' EVERY 1m AS (
  SELECT timestamp, vehicle_id,
    last(sensor1) AS sensor1, last(sensor2) AS sensor2, last(sensor3) AS sensor3,
    last(sensor4) AS sensor4, last(status) AS status
  FROM raw_sensors
  SAMPLE BY 500T
) PARTITION BY DAY;
So, the raw table would get the sparse data from your sensors, the materialized view would (immediately or on a timer) materialize the results as dense rows, and after 7 days the original data would be removed from the raw table. That way you keep the sparse data around for a while in case there are any errors and you need to manually fix anything.
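By the way, if you ever want to double-check that the TTL is doing its job, you can inspect the raw table partitions with:
SHOW PARTITIONS FROM raw_sensors;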
If you want, you can also add TTL to the materialized view to remove the dense data after a while.
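For example, if you only needed the dense data for a month, the last line of the view definition above would become something like this (the TTL clause on materialized views mirrors the one on tables, but please double-check the exact syntax in the docs for your version):
) PARTITION BY DAY TTL 4 WEEKS;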