SELECT from multiple timeseries tables with the same query

Hello! I have a timeseries database with financial data in which each table represents a financial asset's price over time. All tables have the same structure (symbol, timestamp, bid and ask).

This database is part of a simulation software where ticks are streamed from the db to the application tick by tick over a period of time. If I want to stream a single asset I can just run a normal SELECT from the asset's table, but if I want to stream multiple assets things get hard, because two or more tables need to be joined and ordered by timestamp… what would be the best way to query that data?

I’m using the native Java CairoEngine, since the application is Java itself


Hi @fgnm ,

I think your best option is to use ASOF JOIN.

You could also consider saving all prices into a single table; that might make it easier to query prices for multiple instruments in a single query.

Thanks! I initially thought to save all data in a single table; however, calculating the OHLC with something like:

SELECT timestamp AS tick_time, first(bid) AS open, last(bid) AS close, min(bid) AS low, max(bid) AS high
FROM tableName
WHERE timestamp < 1706745600000000
SAMPLE BY 1h ALIGN TO CALENDAR
ORDER BY tick_time DESC
LIMIT 1002

seems to be very slow as the size increases (300/400 million rows)… but I might have got the table format wrong… this is the structure:

CREATE TABLE 'tableName' (
  symbol SYMBOL capacity 256 CACHE index capacity 256,
  timestamp TIMESTAMP,
  bid DOUBLE,
  ask DOUBLE
) timestamp (timestamp) PARTITION BY DAY WAL;

I’ve also tried the ASOF JOIN but I think I didn’t get how to use it properly…

WITH 
first AS (SELECT * FROM broker_ETHUSD WHERE timestamp > 1706745600000000 AND timestamp < 1730246400000000),
second AS (SELECT * FROM broker_EURUSD WHERE timestamp > 1706745600000000 AND timestamp < 1730246400000000)
SELECT
   *
FROM first ASOF JOIN second;

The result is something like this:


Which is wrong: both assets end up in the same row, but I need different assets in different rows; they should be read sequentially from both tables.

Hi @fgnm,

Please can you run the query with EXPLAIN and post the plan?

Chances are you will be better off without the index for this kind of query. Also, without an index, the query will speed up with more cores and more RAM.

Also, this query will most likely be sampling the whole interval before selecting the last rows.

If you want to select 1000 hours of rows, then it should help to adjust your interval to cover that range explicitly in your WHERE clause.
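For instance, here is a rough sketch of your earlier OHLC query with an explicit lower bound added. The lower bound is just 1002 hours before your upper bound, expressed in microseconds, so adjust it to your real range:

SELECT timestamp AS tick_time, first(bid) AS open, last(bid) AS close, min(bid) AS low, max(bid) AS high
FROM tableName
WHERE timestamp >= 1703138400000000   -- illustrative lower bound: 1002 hours before the upper bound
  AND timestamp < 1706745600000000
SAMPLE BY 1h ALIGN TO CALENDAR
ORDER BY tick_time DESC
LIMIT 1002;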

ASOF JOIN joins two time series together based on the nearest timestamp. It merges rows from the two separate tables into a single row, like a horizontal merge.

If you need them as separate rows from the two tables, then you probably need to use two separate queries combined with a UNION (ALL). This will perform a vertical merge and effectively stack the result sets on top of each other. Make sure the column counts and types match, and you may need to sort afterwards to ensure that the final result set is ordered by your timestamp.
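For example, a rough sketch (untested) reusing the two tables and bounds from your ASOF JOIN attempt, stacking the rows and then sorting the combined set:

SELECT * FROM (
  (SELECT symbol, timestamp, bid, ask FROM broker_ETHUSD WHERE timestamp > 1706745600000000 AND timestamp < 1730246400000000)
  UNION ALL
  (SELECT symbol, timestamp, bid, ask FROM broker_EURUSD WHERE timestamp > 1706745600000000 AND timestamp < 1730246400000000)
) ORDER BY timestamp;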

Hi! Thanks for the reply, this is the original query plan:


I built a big table with more than 600 million rows (several assets combined) and running that query takes more than 5 seconds (i9 12th gen)…

I've added your suggestion to also include a lower bound in the WHERE clause and the same query now runs in 400ms. This is a huge improvement, and I might consider moving back to a single table if performance stays constant even when the table grows to, let's say, 2/3 billion rows.


I'd like to try your other suggestion of removing the index (on symbol) but I'm not sure how to test it.

You can drop the index using this:

ALTER TABLE pepperstone_live ALTER COLUMN symbol DROP INDEX

If you feel the need to add it back later, you can run a similar query:

ALTER TABLE pepperstone_live ALTER COLUMN symbol ADD INDEX

How many rows are present with that filter? i.e. run a SELECT count FROM pepperstone_live and add in your WHERE filter.
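For example (the bounds and symbol here are only illustrative):

SELECT count()
FROM pepperstone_live
WHERE timestamp >= 1703138400000000 AND timestamp <= 1706745600000000
  AND symbol = 'EURUSD';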

If it's more than a handful of rows, then removing the index should help.

Once removed, you can check the query plan again using EXPLAIN. You should see some nodes with Async appear, e.g. Async Group By or Async Filter. These nodes run in parallel. As your table grows in size and queries slow a little, you can keep speeding them up with more cores/RAM.
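For instance, something like this (a sketch with illustrative bounds and symbol):

EXPLAIN
SELECT timestamp AS tick_time, first(bid) AS open, last(bid) AS close, min(bid) AS low, max(bid) AS high
FROM pepperstone_live
WHERE timestamp >= 1703138400000000 AND timestamp <= 1706745600000000
  AND symbol = 'EURUSD'
SAMPLE BY 1h ALIGN TO CALENDAR;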

If you want to run some test benchmarks on similar data, you can use the trades table on demo. This has over 1.2 billion rows of data, over about 1000 days. It consists of crypto trading data, which isn’t quite the same as your order book data, but not a million miles away:

CREATE TABLE 'trades' (
  symbol SYMBOL,
  side SYMBOL,
  price DOUBLE,
  amount DOUBLE,
  timestamp TIMESTAMP
) timestamp (timestamp) PARTITION BY DAY WAL;

I don’t recall the exact setup, but demo runs somewhere in the range of 24-48 cores.

Fwiw, we have many users with order book data, often in the billions of rows per table. But whether that makes sense for you depends on several factors. Sometimes, smaller tables with a subset of symbols that you often query together can make sense, especially if you only query a few tables at a time (more RAM to go around!)

Yup, dropping the index boosts the query a lot! Selecting the count in the interval can return more than 200 million rows, so I guess removing it is fine… I don't even need other complex queries, I just need to stream ticks from Quest to my application, which is a simulator running on a VPS with 12 vCores (I know, not too much, so I'm trying to keep things as efficient as possible)…

UNION is not a solution btw, it's incredibly slow: almost 30 seconds to merge 2000 minutes of series, and most of the time it times out.

So I think the final solution will be to make a big table for a single broker and put all symbols there, drop the index and use a proper WHERE clause… Thanks a lot for the help!

May I ask you to review my snippet for executing queries? I think it's fine but maybe I have forgotten some important performance consideration :slight_smile:

long startTime = currentTime - (timeframe.millis * (count + 1));

String query = "SELECT timestamp AS tick_time, first(bid) AS open, last(bid) AS close, min(bid) AS low, max(bid) AS high FROM " + symbol.dataSource + " WHERE timestamp >= " + startTime + " AND timestamp <= " + currentTime + "000 AND symbol = '" + symbol.symbol + "' SAMPLE BY " + tf + " ALIGN TO CALENDAR ORDER BY tick_time DESC LIMIT " + count;
System.out.println(query);
int cores = Runtime.getRuntime().availableProcessors();
System.out.println("Number of cores: " + cores);
try (SqlExecutionContext ctx = new SqlExecutionContextImpl(cairoEngine, cores)
        .with(AllowAllSecurityContext.INSTANCE, null)) {
    try (RecordCursorFactory factory = cairoEngine.select(query, ctx)) {
        try (RecordCursor cursor = factory.getCursor(ctx)) {
            final Record record = cursor.getRecord();
            while (cursor.hasNext()) {
                Period period = new Period();
                period.period = timeframe;
                period.symbol = symbolId;

                period.startDate = record.getLong(0) / 1000;
                period.endDate = period.startDate + period.period.millis;

                period.open = record.getDouble(1);
                period.close = record.getDouble(2);
                period.low = record.getDouble(3);
                period.high = record.getDouble(4);

                periods.insert(0, period);
            }
        }
    }
} catch (SqlException e) {
    throw new RuntimeException(e);
}

Sounds good!

Re: union, that’s interesting. It’d be handy if you could post the query and plan for that, it may have something strange going on!

Your queries should remain quick so long as you don't forget to give a good time interval for the query. In general, if your plan contains Interval scan and Async Group By, it should be quick!

Re: your code snippet, in general, we discourage embedded usage like this except in very specific scenarios. I appreciate that you are using this as part of some wider software suite (which we would be interested to hear more about, by the way!). When inserting and querying via the internal APIs, there can be all sorts of gotchas that are not obvious (nor documented, as it's internal!)

However, long-term, it’d be worth considering using it as a standalone DBMS. We don’t guarantee stability of internal APIs. We also design the database to greedily use available resources, so it could cause ‘noisy neighbour’ issues with the rest of your application.

This is the UNION I tried, as seen here: UNION EXCEPT INTERSECT keywords | QuestDB

pepperstone_live_EURUSD UNION pepperstone_live_GBPUSD WHERE timestamp > 1659336669000000 AND timestamp < 1669881069000000

and this is the explain:


I can see the Interval scan keyword, but the query is very slow to execute… If the interval is small it works, but with a large interval it ends in a timeout.

About the usage, I tried in the past to use Quest as a standalone database over a PG driver connection, but RAM usage and performance were not acceptable at all… The PG driver isn't GC friendly at all and it was using more than 30GB of RAM while running the simulation… When I moved to the native Java API, RAM usage went down to 1GB!

I'd say this is very handy for my use case. I've built a high-performance algotrading software with backtest simulation included, completely written in Java. QuestDB is its only dependency, and it just fits wonderfully, especially for the GC. Just to give you an order of magnitude, common trading software runs simulations of 2/3 years in days or even weeks, and they don't even respect full market dynamics. Thanks to QuestDB, I was able to do the same with much more precision in just a few minutes (depending on the hardware).

My goal is now to move from a single-asset simulation to a multi-asset simulation while keeping performance high. I started using QuestDB at 7.1, and I saw that the PG integration was enhanced in the last release, which I still haven't tried (I'm on 8.1.2 right now), so maybe things could have changed. But I really like the native API solution and don't think I want to change… My use case is pretty simple btw, I just need SELECT queries because ingestion is done through COPY and CSV files 2 or 3 times per year :slight_smile:

Re: union slowness, there are two scans here - a full table scan over EURUSD and an interval scan over GBPUSD. You need to push the WHERE down on both sides (it's only applying to the rhs of the UNION):

(pepperstone_live_EURUSD WHERE timestamp > 1659336669000000 AND timestamp < 1669881069000000)
UNION
(pepperstone_live_GBPUSD WHERE timestamp > 1659336669000000 AND timestamp < 1669881069000000)

In general, using subqueries to filter or limit data before it's used elsewhere is a good way to create better optimised queries!

Re: PG driver woes, that sounds surprising - perhaps you were running more queries simultaneously with that driver, and now run fewer simultaneously since you are using it embedded.

We have actually just released a re-written PG wire connector that is much faster and more efficient. Small inserts are much faster when using pipelining, for example.

I saw that the PG integration was enhanced in the last release

Having read ahead now, I see that you knew this already!

It may also be that query plans etc. have changed since then - small changes to your query can make a dramatic difference in runtime, memory usage etc. - so when you get stuck, it's always best to reach out!

Re: use case, some of our roadmap should be beneficial for you:

  • ADBC support - more efficient server-client data transfer, improved compatibility with analysis libraries
  • Parquet support - query parquet files directly, natively compressed, import Parquet into database (better than CSV!)

Okay, got it for the UNION, now it works in a reasonable amount of time, so I'll do some benchmarks with the full dataset to understand which approach better fits my data.

About the PG driver, no, I didn't run any simultaneous queries, just a SELECT * FROM table… The problem was in the official Java PG driver, which allocates a lot of garbage while converting timestamps (long to Java object) and strings… But I really don't care to have timestamps in a Date format, I'm more than happy to work with longs :smiley:
I remember I profiled that part a lot, and the problem was always in the PG library, not in Quest nor in my app.

I’ll keep my eyes on parquet format :wink:

Gotcha! Keep us posted with how you get on :slight_smile:

We have utilities in io.questdb.std.datetime.microtime.Timestamps which make it easier to work with microsecond timestamps - add arbitrary units (like months, days), convert timezones. It also has constants to help with unit conversions, and you can format the timestamp into ISO strings with toString().

It might be helpful since you are using this embedded and not able to work with Date directly.


Nope, UNION isn't usable…

(pepperstone_live_EURUSD WHERE timestamp > 1577836800000000 AND timestamp < 1695859200000000)
UNION
(pepperstone_live_GBPUSD WHERE timestamp > 1577836800000000 AND timestamp < 1695859200000000)

Those are 3 years of data: the query times out, and the results aren't even sorted!

While this query completes successfully in the range of 1-6 seconds, producing 254+ million rows:

SELECT * FROM pepperstone_live WHERE timestamp > 1577836800000000 AND timestamp < 1695859200000000 AND (symbol = 'EURUSD' OR symbol = 'GBPUSD')

Currently, the pepperstone_live table contains 2.1 billion rows… A few seconds is acceptable for me.

I'm executing directly on the VPS, so I have 12 vCores and around 40GB of RAM, and the total usage of Quest + my application is around 3.7GB… I'd say the big table solution seems to be the most effective one.

I have only one last question: if I run a query it takes 6/7 seconds, and if I run the same query again it takes 300/400ms… I guess Quest internally caches the queries, but after a few minutes, if I run the same query again, it takes 6/7 seconds again, I guess because this cache has a short TTL. Since my db changes just a few times per year and I have a lot of RAM, is there a way I can keep more data in the cache?

I have only one last question: if I run a query it takes 6/7 seconds, and if I run the same query again it takes 300/400ms…

The first time you run the query, it’ll be compiled and we will cache the query plan. We do not cache result sets.

However, when we read the files, we map them into memory with mmap. Therefore, the OS will cache some of the pages in memory.

On subsequent runs, the data is already ‘hot’ and in-memory, so they are quick!

You can make a table hot using touch(): Touch function | QuestDB
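For example, something along these lines should warm the pages for the interval you are about to query (check the linked docs for the exact syntax; the bounds here are illustrative):

SELECT touch(
  SELECT * FROM pepperstone_live
  WHERE timestamp > 1577836800000000 AND timestamp < 1695859200000000
);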

Other than that, the database is at the mercy of the OS as to when it decides to page data in and out of memory. If you aren't hitting RAM limits, you should have better luck with the data staying in the page cache for longer!

P.S. On the web console, queries are run with LIMIT 10000 and thereafter paged - so don't get confused about query speeds in the web console versus other APIs if you aren't comparing like-for-like.

Coming back here because, focused on fixing this stuff, I completely forgot to thank you for the support! Apologies :smiling_face_with_tear:

In the end the solution was perfect, I had just got the startTime wrong in the query: I was using milliseconds instead of microseconds… In a table with more than 3 billion rows the query executes with an amazing speed of 100ms on average!

Thanks again for everything

Thanks for following up with the great news! Keep us posted with how you get on :slight_smile: