"symbol" type is invalid of query

CREATE TABLE 'sys_telemetry' ( 
	timestamp TIMESTAMP,
	deviceId INT,
	propId INT,
	propValue FLOAT,
	maxValue FLOAT,
	maxValueTime TIMESTAMP,
	minValue FLOAT,
	minValueTime TIMESTAMP,
	avgValue FLOAT
) timestamp(timestamp) PARTITION BY MONTH WAL
WITH maxUncommittedRows=500000, o3MaxLag=600000000us;
 select timestamp, propId, first(propValue), max(maxValue), min(minValue), avg(avgValue) from sys_telemetry
 where (propId=1 or propId=2 or propId=3 or propId=4 or propId=5 or propId=6) and timestamp in '2024'
 SAMPLE by 1M

There are approximately 800 million rows.

When propId is of the INT type, the first query takes approximately 5 seconds.

After I changed propId to the SYMBOL type, the query became slower. Why is that? I have tried adding an index on propId (symbol), but the result was still not satisfactory.

I believe that the SYMBOL type should improve query efficiency, but it doesn't seem to be working here.

Hi,

I think the query is a little faster with a condition like this (with propId of the INT type):

WHERE (propId>=1 AND propId<=6) AND timestamp >= '2024' AND timestamp < '2025'

or

WHERE propId IN (1, 2, 3, 4, 5, 6) AND YEAR(timestamp) = 2024

WHERE propId IN (…) should be preferred over OR.
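For example, the original query rewritten with those suggestions (a sketch, keeping the column names and the 1M sample interval from the query above):

select timestamp, propId, first(propValue), max(maxValue), min(minValue), avg(avgValue)
from sys_telemetry
where propId in (1, 2, 3, 4, 5, 6) and timestamp >= '2024' and timestamp < '2025'
SAMPLE by 1M;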

Whether you use SYMBOL or INT, you should get a JIT filter (on an x86-64 machine).

I checked the plan, and it correctly generates the filter and interval scan.

Adding an index is unlikely to speed up this query, since it is an aggregation over multiple keys. A parallel scan will almost certainly be faster (and scale better with more cores).

When you swapped to a symbol, did you set the symbol capacity appropriately? It should be greater than or equal to the max number of unique values in the column. Not doing this can cause some performance degradation, especially on the ingestion side (high CPU usage).
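If not, one way is to declare the capacity when the column is created, for example (a sketch based on your table definition above; 256 is just a placeholder, set it to at least the number of distinct propId values you expect):

CREATE TABLE 'sys_telemetry' (
	timestamp TIMESTAMP,
	deviceId INT,
	propId SYMBOL CAPACITY 256 CACHE,
	propValue FLOAT,
	maxValue FLOAT,
	maxValueTime TIMESTAMP,
	minValue FLOAT,
	minValueTime TIMESTAMP,
	avgValue FLOAT
) timestamp(timestamp) PARTITION BY MONTH WAL
WITH maxUncommittedRows=500000, o3MaxLag=600000000us;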

If ingestion starts taking up more CPU time, that would slow down your queries.

I have tried the demo. For 2024 there are approximately 400 million rows.

 select count(1) from trades 
 where  symbol in ('DOGE-USD','SOL-USDT','BTC-USD','ETH-USDT','SOL-USD','ETH-USD') and timestamp in '2024'

Then I run the full aggregation query, which takes approximately 3 seconds.

 select timestamp, symbol, first(price), max(price), min(price), avg(price) from trades
 where symbol in ('DOGE-USD','SOL-USDT','BTC-USD','ETH-USDT','SOL-USD','ETH-USD') and timestamp in '2024'
 SAMPLE by 1M

The demo table only has 400 million rows for 2024, while my full-year 2024 data is approximately 2 billion rows.

So, how should the table be designed in order to achieve the maximum query efficiency?

Has anyone tried splitting the data into one table per device, named something like "sys_telemetry_{deviceId}"?

Can you please run it with EXPLAIN and post the plan?
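For reference, you just put EXPLAIN in front of the query, e.g. (a sketch using the trades query from above):

EXPLAIN
select timestamp, symbol, first(price), max(price), min(price), avg(price) from trades
where symbol in ('DOGE-USD','SOL-USDT','BTC-USD','ETH-USDT','SOL-USD','ETH-USD') and timestamp in '2024'
SAMPLE by 1M;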

If you re-run it, is it faster? 3s seems reasonable at face value. On the demo trades table, 2 billion rows is about 66 GB. So you might be filtering and grouping a similar amount?

The demo is from this link: QuestDB Demo | Production

  select timestamp, symbol, first(price), max(price), min(price), avg(price) from trades
  where symbol in ('DOGE-USD','SOL-USDT','BTC-USD','ETH-USDT','SOL-USD','ETH-USD') and timestamp in '2024'
  SAMPLE by 1M

This is the EXPLAIN:

What I want to convey is that processing 400 million rows currently takes approximately 3 seconds. If the amount of data increases to 2 billion, the query will become even slower. Do you have any suggestions for a better table design?

Because my project is expected to generate 2 billion rows in the coming year, I am also running tests. I have already inserted 1.3 billion rows locally.

This is my demo:

Aggregating the 1.3 billion rows with this query takes approximately 28 seconds.

This is my explain:

There are 400 million records in 2024 in that table.

The first time you run it, if the pages are not cached in RAM, they will need to be read from disk. On subsequent runs, it will be faster. For example, after re-running a couple of times, it can go to <2 seconds execution time on that box:

Execute: 1.95s Network: 71.23ms Total: 2.02s

If you run the query on the full dataset (2 billion rows):

Execute: 6.83s Network: 102.3ms Total: 6.93s

However, the long-term solution to this is simpler: don't query such a huge time range. Instead, use a materialized view to store the data pre-aggregated. By way of example, there is a view on trades called trades_OHLC_15m.

CREATE MATERIALIZED VIEW 'trades_OHLC_15m' WITH BASE 'trades' REFRESH IMMEDIATE AS (
  SELECT
    timestamp, symbol,
    first(price) AS open,
    max(price) AS high,
    min(price) AS low,
    last(price) AS close,
    sum(amount) AS volume
  FROM trades
  SAMPLE BY 15m
) PARTITION BY DAY;

If you run the underlying query, it takes:

Execute: 6.38s Network: 54.68ms Total: 6.44s 

But if we query the pre-aggregated view:

trades_OHLC_15m;

it takes milliseconds:

Execute: 1.51ms Network: 64.49ms Total: 66ms

They will give the same answer for your historical data. The views are also refreshed automatically.
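For your own table, an equivalent view could look something like this (a sketch based on your original query; the view name and the 1-minute sample interval are just examples, adjust them to whatever granularity you actually chart):

CREATE MATERIALIZED VIEW 'sys_telemetry_1m' WITH BASE 'sys_telemetry' REFRESH IMMEDIATE AS (
  SELECT
    timestamp, propId,
    first(propValue) AS firstValue,
    max(maxValue) AS maxValue,
    min(minValue) AS minValue,
    avg(avgValue) AS avgValue
  FROM sys_telemetry
  SAMPLE BY 1m
) PARTITION BY DAY;

Queries over 2024 would then hit the pre-aggregated rows instead of scanning 2 billion raw rows.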