Migration from InfluxDB

I’m using Home Assistant’s InfluxDB integration to write data to a time-series database using the Influx Python client.

Currently, this writes to an InfluxDB 2.0 installation, but I’m planning to migrate this to QuestDB. I found Datamigration from InfluxDB to QuestDB with Telegraf - #3 by Armin_Diehl and am confident that splitting the influxd inspect export-lp output into smaller chunks that can be ingested through Telegraf into QuestDB will work.

However, I’m wondering whether the aggregation to transform a sparse to a dense data set will also be needed after the initial import. That is, once I’m done with ingesting all the existing data from an export, will future writes from HA create more sparse rows if sent directly to QuestDB rather than Telegraf? I would prefer not to add another component to the stack permanently.

Hi Ingmar,

You are right that if you go down the Telegraf sparse-to-dense migration route, you would need to choose one of these three options:

  • Keep Telegraf in between (not ideal)
  • Change your ingestion method so data is sent in a dense format (see the short sketch right after this list). This would be my preference, if possible, as it avoids any further processing. InfluxDB themselves recommend this approach when moving from v2 to v3, since in v3 they followed the QuestDB architecture and went with a dense row format. However, this might not always be easy, which takes me to option three.
  • Take advantage of some of the latest additions to QuestDB: Time-To-Live (TTL) for data in your tables, plus materialized views.
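
To make option two concrete, here is a minimal sketch of what dense versus sparse rows look like. The table and values are hypothetical and only meant to illustrate the row shapes (in your setup the rows would arrive from Home Assistant rather than via INSERT):

-- hypothetical table, purely for illustration
CREATE TABLE readings (
  timestamp TIMESTAMP,
  vehicle_id SYMBOL,
  sensor1 DOUBLE,
  sensor2 DOUBLE,
  status VARCHAR
) TIMESTAMP(timestamp) PARTITION BY DAY;

-- dense: one row carries all values for a given timestamp
INSERT INTO readings (timestamp, vehicle_id, sensor1, sensor2, status)
VALUES ('2025-06-02T00:00:00.000000Z', 'v42', 21.5, 0.9, 'ok');

-- sparse: each write fills only one column, leaving the others null
INSERT INTO readings (timestamp, vehicle_id, sensor1)
VALUES ('2025-06-02T00:00:00.000000Z', 'v42', 21.5);
INSERT INTO readings (timestamp, vehicle_id, sensor2)
VALUES ('2025-06-02T00:00:00.100000Z', 'v42', 0.9);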

Let me tell you more about this. The problem with sparse data is that it can be hard to query and inefficient to store, so we prefer a dense format. To move from sparse to dense, we basically need to accumulate data for the same logical series and the same timestamp (either the exact same timestamp or around the same time, for example within the same second) and output a single row of data. A query might look like this:

SELECT timestamp, vehicle_id,
       last(sensor1) AS sensor1, last(sensor2) AS sensor2,
       last(sensor3) AS sensor3, last(sensor4) AS sensor4,
       last(status) AS status
FROM sensors
SAMPLE BY 500T;

The query would return a row for each vehicle and each 500 millisecond interval (supposing any sensor sent data within that 500 millisecond bucket for that vehicle). The output would be a dense row with the last value within that 500 millisecond interval for each of the sensors in the vehicle.

Of course you could do this at 1 second intervals, at 1 millisecond intervals, or at whatever resolution makes sense for you.
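
For instance, the same aggregation at a one-second resolution only needs a different SAMPLE BY unit:

SELECT timestamp, vehicle_id,
       last(sensor1) AS sensor1, last(sensor2) AS sensor2,
       last(sensor3) AS sensor3, last(sensor4) AS sensor4,
       last(status) AS status
FROM sensors
SAMPLE BY 1s;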

So, if you ingest all your sparse data directly into a table named sensors and execute the query above, you would see a dense result. But of course that might get slow if the table grows, and it is also inconvenient to query. Which is why I would suggest using materialized views.

With materialized views, you can create a view over a table, and the view updates continuously as new data arrives.

Suppose you have this table

CREATE TABLE raw_sensors (
  timestamp TIMESTAMP,
  vehicle_id SYMBOL CAPACITY 20000,
  sensor1 FLOAT,
  sensor2 LONG,
  sensor3 FLOAT,
  sensor4 FLOAT,
  status VARCHAR
) TIMESTAMP(timestamp) PARTITION BY DAY TTL 7 DAYS;

That would create a table where you can ingest your raw (sparse) data, and it uses TTL 7 DAYS to remove data after 7 days. Why am I doing this? Well, because I am suggesting you could also create this:

CREATE MATERIALIZED VIEW sensors AS (
  SELECT timestamp, vehicle_id,
         last(sensor1) AS sensor1, last(sensor2) AS sensor2,
         last(sensor3) AS sensor3, last(sensor4) AS sensor4,
         last(status) AS status
  FROM raw_sensors
  SAMPLE BY 500T
) PARTITION BY DAY;

This materialized view would execute every time there is a transaction on your raw_sensors table, and would automatically store the result of the query into a “table” named sensors, which would have the dense format.

If you don’t need the view to materialize in real time, and you are OK with materializing, for example, every minute, you can instead create the view as:

CREATE MATERIALIZED VIEW sensors REFRESH START '2025-06-02T00:00:00.000000Z' EVERY 1m AS (
  SELECT timestamp, vehicle_id,
         last(sensor1) AS sensor1, last(sensor2) AS sensor2,
         last(sensor3) AS sensor3, last(sensor4) AS sensor4,
         last(status) AS status
  FROM raw_sensors
  SAMPLE BY 500T
) PARTITION BY DAY;

So, the raw table would receive the sparse data from your sensors, the materialized view would (immediately or on a timer) materialize the results as dense, and after 7 days the original data would be removed from the raw table. That way the sparse data stays around for a while in case there are any errors and you need to manually fix anything.
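
Once that is in place, the dense sensors view can be queried like any other table. A rough sketch (the date is just an example):

SELECT timestamp, vehicle_id, sensor1, sensor2
FROM sensors
WHERE timestamp IN '2025-06-02';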

If you want, you can also add TTL to the materialized view to remove the dense data after a while.
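
For example, assuming your QuestDB version accepts TTL on materialized views the same way it does on tables, the view could keep the dense data for four weeks:

-- same view as above, but dropping dense data after four weeks
CREATE MATERIALIZED VIEW sensors AS (
  SELECT timestamp, vehicle_id,
         last(sensor1) AS sensor1, last(sensor2) AS sensor2,
         last(sensor3) AS sensor3, last(sensor4) AS sensor4,
         last(status) AS status
  FROM raw_sensors
  SAMPLE BY 500T
) PARTITION BY DAY TTL 4 WEEKS;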

Hi Javier,

Thanks so much for the response!

Let me go through the options one-by-one:

  1. If I had to introduce Telegraf in between, I would probably just stick with InfluxDB 2 to keep the number of moving pieces minimal.
  2. I can’t really change the ingestion method other than by contributing to Home Assistant’s InfluxDB integration.
  3. A problem here is that the schema is rather dynamic. HA adds measurements per unit (think: W, Hz, GiB, MBit/s, USD, …) and the columns (tags) can be all kinds of entity properties, which come and go as integrations are updated, removed, or added. That means I will not be able to produce a fixed schema for a materialized view.

So, none of the options is a clear winner, unfortunately.
The cleanest would likely be #2. Looking at core/homeassistant/components/influxdb/__init__.py at 6692b9b71fb8fd991d37ddb6f68e878c06d0900b · home-assistant/core · GitHub, it might actually be the case already. The integration listens on HA’s event bus for STATE_CHANGED events and writes the full updated state to Influx (as opposed to writing every single changed attribute).

So just to double-check: if data is written in a dense way, no Telegraf is needed?

The main thing you want to avoid is a wide schema with lots of nulls.

If the data for all sensors comes in one row, i.e. timestamp, sensor1, sensor2, sensor3 etc., that’s great.

If it comes in multiple rows like timestamp, symbol, value, that is suboptimal, but also fine.

If it comes like this:

| timestamp         | sensor1      | sensor2      | sensor3     |
| ----------------- | ------------ | ------------ | ----------- |
|    1us            |  345         |  null        | null        |
|    2us            |  null        |  567         | null        |
|    3us            |  null        |  null        | 789         |

That is less ideal, and it is where you might want to do what Javier described: group the values into a common approximate timestamp. If you can do that at the source, even better!
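
As a sketch, applying the same SAMPLE BY trick Javier showed to that little table (assuming it is called sensors, as in the earlier examples) groups the three sparse rows into a single 1-millisecond bucket and returns one dense row:

SELECT timestamp,
       last(sensor1) AS sensor1,
       last(sensor2) AS sensor2,
       last(sensor3) AS sensor3
FROM sensors
SAMPLE BY 1T;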

Traditionally, Influx connectors tended to send data in a row-modelled way, which could then lead to sparse wide-column tables. This fit in with the prior indexing model, which was generally per-symbol.

Nulls take space to store in QuestDB. They can be compressed, but the sparse layout still inflates the row count, which may not be necessary!