Efficient Querying of Multiple Time Ranges from High-Volume Sensor Data in QuestDB

Dear QuestDB community,

We are currently evaluating QuestDB for our company, and I need some help determining the most suitable table structure for our use case.

Our application consists of 30 sensors in total, each sampling at an interval of approximately 4 ms (roughly 250 Hz) over periods of several weeks. Each sensor generates a high-precision (HP) value as well as 5 peaks, with each peak consisting of a distance and an amplitude value.

The sensors are distinguished by their site (A or B), as well as by Position_B and Position_L. The measurement type is either “Measurement” (M) or “Reference” (R).
Currently, I’ve set up two separate tables — one for the HP data and one for the peak data — as this seemed to be the most reasonable structure for filtering.

For our data processing, it is important that we can efficiently retrieve data for specific sensors in Python (e.g., HP values from site A, Position_B 1 to 3, for values between 1 and 3). Efficient, targeted retrieval is our first major requirement.

From the processed data, we identify time intervals ranging between 5 and 10 seconds (e.g., 2025-05-06T00:01:18.395000Z to 2025-05-06T00:01:22.519000Z). There can be as many as 1,000 such intervals per day.
Our next requirement is to be able to query these intervals as efficiently as possible from the database.
I’ve experimented with different approaches, such as nested queries that batch the intervals using OR conditions. This method has shown the most promise so far, but with several trillion rows, it’s still too slow.

SELECT timestamp, value
FROM HP_Values_Table
WHERE (timestamp >= '2025-05-06T00:00:00Z' AND timestamp <= '2025-05-06T23:59:59Z') AND
  ((timestamp BETWEEN '2025-05-06 18:55:17.456000+00:00' AND '2025-05-06 18:55:21.976000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 18:55:30.744000+00:00' AND '2025-05-06 18:55:35.304000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 18:55:38.384000+00:00' AND '2025-05-06 18:55:42.480000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 18:55:47.712000+00:00' AND '2025-05-06 18:55:52.312000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 18:55:57.160000+00:00' AND '2025-05-06 18:55:58.784000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 18:56:18.344000+00:00' AND '2025-05-06 18:56:22.732000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 19:20:42.044000+00:00' AND '2025-05-06 19:20:53.756000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 19:21:03.316000+00:00' AND '2025-05-06 19:21:17.676000+00:00') OR 
   (timestamp BETWEEN '2025-05-06 19:21:44.484000+00:00' AND '2025-05-06 19:21:45.332000+00:00')) AND
  value >= 1.8 AND
  value <= 2.3 AND
  Position_B = 1 AND
  site = 'A';

Another idea was to add a new boolean flag column to mark the relevant time intervals, but updating the table turned out to be very inefficient.
I also considered storing the relevant data in a new table — but I would prefer to avoid that redundancy.
Ideally, the flexibility to define intervals using different algorithms should be preserved, which is why I would prefer to query the intervals directly via SQL.

Apologies if my explanation is a bit confusing — English is not my first language. I would be grateful for any input from you!

Sam

Hi @Sam,

Welcome, and thanks for reaching out!

When we parse queries that have timestamp filters without OR, we can compile them into interval scan nodes. These nodes quickly skip to the correct timestamp region, and slice out the data in the interval.

Let’s talk about the execution plan for your query:

SelectedRecord
    Async Filter workers: 8
      filter: ((((((((((timestamp between 1746557717456000 and 1746557721976000 or timestamp between 1746557730744000 and 1746557735304000) or timestamp between 1746557738384000 and 1746557742480000) or timestamp between 1746557747712000 and 1746557752312000) or timestamp between 1746557757160000 and 1746557758784000) or timestamp between 1746557778344000 and 1746557782732000) or timestamp between 1746559242044000 and 1746559253756000) or timestamp between 1746559263316000 and 1746559277676000) or timestamp between 1746559304484000 and 1746559305332000) and value>=1.8 and 2.3>=value and Position_B=1) and site='A') [pre-touch]
        PageFrame
            Row forward scan
            Interval forward scan on: HP_VALUES_Table
              intervals: [("2025-05-06T00:00:00.000000Z","2025-05-06T23:59:59.000000Z")]

In this plan, you can see that we limit the scanned interval to 2025-05-06. However, the rest of the query will still scan the whole partition for anything that matches the complex filter above.
For the fastest result, you want each of those intervals to be part of an interval scan.

With a simpler filter like this:

SELECT timestamp, value
FROM HP_Values_Table
WHERE timestamp BETWEEN '2025-05-06 18:55:17.456000+00:00' AND '2025-05-06 18:55:21.976000+00:00'
AND value >= 1.8 AND
  value <= 2.3 AND
  Position_B = 1 AND
  site = 'A';
The plan for this query is:

SelectedRecord
    Async Filter workers: 8
      filter: (value>=1.8 and 2.3>=value and Position_B=1 and site='A') [pre-touch]
        PageFrame
            Row forward scan
            Interval forward scan on: HP_VALUES_Table
              intervals: [("2025-05-06T18:55:17.456000Z","2025-05-06T18:55:21.976000Z")]

You can see that we will only touch the very tiny timestamp range you’ve identified, and run the subfilter on that.

This kind of query will likely use a JIT-compiled filter on x86-64 (amd64) hardware (I am using an ARM Mac, so it doesn't kick in here), because the filter only uses the timestamp, integers and symbols.

So the first thing to try is simply running each small query on its own, with a few workers in parallel. Since there are only 1,000 intervals or so, it shouldn't take long, as each query will be individually quick.

The next option is to try combining the intervals with UNION ALL instead of OR:

SELECT timestamp, value
FROM HP_Values_Table
WHERE timestamp BETWEEN '2025-05-06 18:55:17.456000+00:00' AND '2025-05-06 18:55:21.976000+00:00'
AND value >= 1.8 AND
  value <= 2.3 AND
  Position_B = 1 AND
  site = 'A'
UNION ALL
SELECT timestamp, value
FROM HP_Values_Table
WHERE timestamp BETWEEN '2025-05-06 18:55:30.744000+00:00' AND '2025-05-06 18:55:35.304000+00:00'
AND value >= 1.8 AND
  value <= 2.3 AND
  Position_B = 1 AND
  site = 'A';

You can see this gives you two separate interval scans:

Union All
    SelectedRecord
        Async Filter workers: 8
          filter: (value>=1.8 and 2.3>=value and Position_B=1 and site='A')
            PageFrame
                Row forward scan
                Interval forward scan on: HP_VALUES_Table
                  intervals: [("2025-05-06T18:55:17.456000Z","2025-05-06T18:55:21.976000Z")]
    SelectedRecord
        Async Filter workers: 8
          filter: (value>=1.8 and 2.3>=value and Position_B=1 and site='A')
            PageFrame
                Row forward scan
                Interval forward scan on: HP_VALUES_Table
                  intervals: [("2025-05-06T18:55:30.744000Z","2025-05-06T18:55:35.304000Z")]

After this, you can check your hardware sizing. If the data is cached in RAM, fetching it will be fast. Let's say there is 20 GB of data in the 2025-05-06 partition. The first time you scan it, we lift it from disk into RAM. Subsequent queries will hit the hot data and won't incur as much I/O overhead.

Before running the above queries, you can ‘prefetch’ the data from disk, so it will be hot already. We have a function called touch to do this:

SELECT touch(SELECT timestamp, value FROM HP_Values_Table WHERE timestamp IN '2025-05-06');

That will prefetch data for that day, and then speed up subsequent queries.

Other than the above, it'd be great to get some info about your hardware setup, OS, etc., as well as some hard numbers on query latency, to help quantify whether any of the above steps show improvements, i.e. how fast it was before and after.

As a general note, getting data out to Python is probably fastest using ConnectorX (https://github.com/sfu-db/connector-x).
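
A minimal sketch of how that could look against your table (untested; the connection details below are QuestDB's PGWire defaults):

import connectorx as cx

# QuestDB speaks the PostgreSQL wire protocol, so a postgres:// URL works
conn = "postgresql://admin:quest@localhost:8812/qdb"

query = """
SELECT timestamp, value
FROM HP_Values_Table
WHERE timestamp BETWEEN '2025-05-06T18:55:17.456000Z' AND '2025-05-06T18:55:21.976000Z'
  AND Position_B = 1 AND site = 'A'
"""

# Returns a pandas DataFrame by default
df = cx.read_sql(conn, query)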

We will be improving connectivity to Polars soon too.

Let me know how you get on, and I’m happy to answer any other questions you may have about your use case or the database in general.

Hi Nick,

Wow, thank you so much for your quick and very detailed response.

I ran a few tests, and using UNION ALL in the query seems to significantly improve performance.

Would you mind taking a look at how I implemented multithreading in Python? I’m not sure if I’m unintentionally creating a bottleneck or if the performance is actually within the expected range.

For my test, I created a table in QuestDB with dummy values (test_long). It includes 6 sensors, split evenly across sites A and B, with a sampling interval of 4 ms, spanning from 2025-01-01 00:00:00 to 2025-01-04 00:00:00, for a total of 388,800,000 rows. Within this timeframe, I generated about 3,000 periods to query, each ranging between 5 and 10 seconds. These are stored in a DataFrame called df_transitions.

The function build_union_queries_from_transitions automatically creates a list of queries from this DataFrame, grouped into a defined number of batches. Here’s an example with 3 batches:

SELECT timestamp, value, site, Position_B FROM test_long 
WHERE timestamp BETWEEN '2025-01-01T00:00:00' AND '2025-01-01T00:00:05.499145036' 
AND Position_B = 1 AND site = 'A'
UNION ALL
SELECT timestamp, value, site, Position_B FROM test_long
WHERE timestamp BETWEEN '2025-01-01T00:01:19.813226215' AND '2025-01-01T00:01:27.375849409'
AND Position_B = 1 AND site = 'A'
UNION ALL
SELECT timestamp, value, site, Position_B FROM test_long
WHERE timestamp BETWEEN '2025-01-01T00:03:08.000918300' AND '2025-01-01T00:03:14.820363808'
AND Position_B = 1 AND site = 'A'
UNION ALL
...
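
In case it helps, a simplified version of that helper looks roughly like this (the real one also applies the optional value_min/value_max filter; the column names start and end stand in for my actual interval columns):

def build_union_queries_from_transitions(df_transitions, batch_size, table_name,
                                         value_min=None, value_max=None,
                                         position_b=1, site='A'):
    """Build one UNION ALL query string per batch of interval rows."""
    selects = []
    for row in df_transitions.itertuples():
        selects.append(
            f"SELECT timestamp, value, site, Position_B FROM {table_name} "
            f"WHERE timestamp BETWEEN '{row.start}' AND '{row.end}' "
            f"AND Position_B = {position_b} AND site = '{site}'"
        )
    # value_min / value_max handling omitted in this simplified sketch
    return [
        "\nUNION ALL\n".join(selects[i:i + batch_size])
        for i in range(0, len(selects), batch_size)
    ]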

With a batch_size of 50, the query takes about 9 seconds to run. The resulting DataFrame df_all has 5,609,788 rows and 4 columns.
With a batch_size of 1, meaning individual queries only, it takes around 14 seconds.

import time
import pandas as pd
import threading
from sqlalchemy import create_engine
from concurrent.futures import ThreadPoolExecutor, as_completed

# Database connection configuration
host = 'localhost'
port = 8812
user = 'admin'
password = 'quest'
database = 'qdb'
url = f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"

# Create SQLAlchemy engine with connection pool
engine = create_engine(
    url,
    pool_size=50,
    max_overflow=30,
    pool_timeout=60,
)

# Limit concurrent queries to avoid overloading the database
MAX_PARALLEL_QUERIES = 32
semaphore = threading.BoundedSemaphore(value=MAX_PARALLEL_QUERIES)
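
# NOTE: query_questdb_to_df is defined elsewhere in my script; a minimal
# equivalent using the pooled engine above would be something like this:
def query_questdb_to_df(query):
    """Run a single SQL query and return the result as a pandas DataFrame."""
    with engine.connect() as conn:
        return pd.read_sql(query, conn)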

def run_batch_query_with_limiter(query, max_retries=3, delay=0.05):
    """
    Execute a SQL query with a semaphore to limit concurrency.
    Retries in case of failure.
    """
    for attempt in range(1, max_retries + 1):
        with semaphore:
            try:
                df = query_questdb_to_df(query)
                if df is not None:
                    return df
                raise ValueError("Result is None")
            except Exception as e:
                print(f"Attempt {attempt} failed: {e}")
                if attempt < max_retries:
                    print(f"Retrying in {delay} seconds...")
                    time.sleep(delay)
                else:
                    print(f"Query failed after {max_retries} attempts.")
                    return None

# Parameters for query construction
batch_size = 50

# Build UNION ALL queries from transition DataFrame
queries_UNION = build_union_queries_from_transitions(
    df_transitions,
    batch_size=batch_size,
    table_name='test_long',
    value_min=None,
    value_max=None,
    position_b=1,
    site='A'
)

# Prepare for parallel execution
all_results = []
failed_queries = []
total = len(queries_UNION)
completed = 0

print(f"Starting batch query execution for {total} queries...")

# Run all queries in parallel with limited workers
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_QUERIES) as executor:
    future_to_index = {
        executor.submit(run_batch_query_with_limiter, q): (i, q)
        for i, q in enumerate(queries_UNION)
    }

    for future in as_completed(future_to_index):
        idx, query_str = future_to_index[future]
        try:
            df_part = future.result()
            if df_part is not None:
                all_results.append(df_part)
            else:
                print(f"Batch {idx} could not be loaded.")
                failed_queries.append((idx, query_str))
        except Exception as e:
            print(f"Unexpected error in batch {idx}: {e}")
            failed_queries.append((idx, query_str))
        completed += 1
        print(f"Progress: {completed}/{total} batches completed")

# Combine all results into a single DataFrame
df_all = pd.concat(all_results, ignore_index=True) if all_results else None

print("All queries completed.")

if failed_queries:
    print(f"{len(failed_queries)} batches failed. These should be reviewed or retried:")
    for idx, _ in failed_queries:
        print(f" - Batch {idx}")

As for my hardware: I’m using a Lenovo ThinkStation P3 Tower with a 13th Gen Intel Core i9-13900K, 128GB RAM, NVIDIA RTX A1000, running Windows 11 Pro. QuestDB is running locally on this machine.

Thanks again for all your help!

How much faster was the union version versus the original?

It is hard to say where the bottleneck is; frankly, it could be on the Python side rather than the database side. Even if the query executes fast, the data must still be transferred to your Python program, which then adds it to the DataFrame.

Have you tried ConnectorX, which I mentioned above? I think it will pull data in parallel for you, behind the scenes. Many UNION ALL batches plus ConnectorX might well beat manual parallel execution on the Python side.
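
If I remember correctly, read_sql also accepts a list of queries and fetches them over separate connections in parallel, so a sketch like this might replace the ThreadPoolExecutor entirely (untested; queries_UNION is the list of batched queries you already build):

import connectorx as cx

conn = "postgresql://admin:quest@localhost:8812/qdb"

# Each query in the list is fetched on its own connection, in parallel,
# and the results are concatenated into a single DataFrame.
df_all = cx.read_sql(conn, queries_UNION)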

df_all = pd.concat(all_results, ignore_index=True) if all_results else None

Stuff like this will create full copies of each DataFrame, so it is likely to be slow.