Poor table-to-table insert performance, but extremely fast read performance

Due to the sensitive nature of my workplace, I am using a minimal example that reproduces the performance issue. Read speed is phenomenal, but write speed seems off.

Specs:

  • Storage: 240 GB SATA SSD, ~300 MB/s tested (for now)

  • Filesystem: ZFS with LZ4 compression
    sudo zpool create \
    -m none \
    -o ashift=12 \
    -o autoexpand=on \
    -o feature@lz4_compress=enabled \
    -O compression=lz4 \
    -O atime=off \
    -O recordsize=128K \
    -O logbias=throughput \
    -O sync=disabled \
    db /dev/sdb1

    sudo zfs create -o mountpoint=/mnt/db db/questdb

  • CPU: Rockchip RK3588S

  • RAM: 8GB

  • OS: Armbian Ubuntu v26.2.1

  • Kernel: 6.18.21-current-rockchip64

Create two tables to insert into, refined_catalogue_1 and refined_catalogue_2:

CREATE TABLE refined_catalogue_X (
  _id                         STRING,
  _correlation_id             SYMBOL CAPACITY 5000,
  _created_utc                TIMESTAMP,
  _loaded_utc                 TIMESTAMP,
  _version                    SYMBOL,
  banner_name                 SYMBOL,
  store_name                  SYMBOL,
  slug                        STRING,
  name                        STRING,
  item_name                   STRING,
  item_id                     STRING,
  page_url                    STRING,
  image_uri                   STRING,
  category                    STRING,
  item_category               STRING,
  item_category2              STRING,
  item_category3              STRING,
  badge                       STRING,
  currency                    STRING,
  measurement                 STRING,
  unit                        STRING,
  price                       DOUBLE,
  price_original              DOUBLE
) TIMESTAMP(_created_utc) PARTITION BY DAY WAL;

I then fill up refined_catalogue_1 with 1,253,258 rows:

insert into refined_catalogue_1
select * from transform_raw_to_refined_catalogue

-- View: transform_raw_to_refined_catalogue
CREATE OR REPLACE VIEW 'transform_raw_to_refined_catalogue' AS ( 
  DECLARE OVERRIDABLE
    @competitor_banner_name := 'Abc',
    @scraper_method := 'Def'

  SELECT
    _id AS _id,
    _correlation_id AS _correlation_id,
    _created_utc AS _created_utc,
    NOW() AS _loaded_utc,
    _version AS _version,
    @competitor_banner_name AS banner_name,
    ...
  FROM raw_table
  WHERE scraper_method = @scraper_method
  ORDER BY _created_utc ASC
);

Followed by the slow insert query from one table into another:

insert into refined_catalogue_2
select * from refined_catalogue_1

Why does this take a few minutes? The data is already ordered, and I can copy the entire database containing all 22M rows in less time than this single insert takes.

Hi @Nom, could you please share the plan for the query inside your view? You can extract it and run EXPLAIN on it.
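For example, a sketch using the view name from your post (prefix the extracted SELECT, or the view itself, with EXPLAIN):

EXPLAIN
SELECT * FROM transform_raw_to_refined_catalogue;

The output shows the chosen execution plan rather than rows, which should reveal where the time is going.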

It may simply be the overhead of decompressing and recompressing all those strings. Can we optimise your schema a bit?

  • Anything that is a STRING, make a VARCHAR. It takes roughly 50% of the storage space for ASCII data, since STRING is stored as UTF-16 and VARCHAR as UTF-8.
  • _correlation_id is fine, but it's no longer necessary to set a capacity; it will auto-scale as needed.
  • slug…unit → can we choose better types for these? At minimum VARCHAR, as mentioned, but I'm sure the %category%, currency, measurement, unit fields etc. could instead be symbols, chars, or numbers.
  • Same story for the URIs: if they are not unbounded in cardinality, and are reasonably repetitive, make them symbols.
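If the data is already loaded, you may not need to rebuild the tables: recent QuestDB versions support converting a column's type in place. A sketch, assuming your version supports ALTER COLUMN TYPE and using column names from your schema:

ALTER TABLE refined_catalogue_1 ALTER COLUMN slug TYPE VARCHAR;
ALTER TABLE refined_catalogue_1 ALTER COLUMN currency TYPE SYMBOL;

Each statement rewrites that column's data, so run them during a quiet period and one at a time.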

The goal is to store as much as possible in fixed-size types, i.e. numbers or symbols, and drop STRING entirely in favour of VARCHAR.

The savings in storage and more efficient types should also reduce any slowdown caused by LZ4 roundtrips.

A revised schema might look like this (the SYMBOL choices assume bounded, repetitive values — verify against your actual data):

CREATE TABLE refined_catalogue_X (
  _id                         VARCHAR,
  _correlation_id             SYMBOL, -- capacity auto-scales, no need to set it
  _created_utc                TIMESTAMP,
  _loaded_utc                 TIMESTAMP,
  _version                    SYMBOL,
  banner_name                 SYMBOL,
  store_name                  SYMBOL,
  slug                        VARCHAR,
  name                        VARCHAR,
  item_name                   VARCHAR,
  item_id                     VARCHAR,
  page_url                    VARCHAR, -- or SYMBOL if reasonably repetitive
  image_uri                   VARCHAR, -- or SYMBOL if reasonably repetitive
  category                    SYMBOL,
  item_category               SYMBOL,
  item_category2              SYMBOL,
  item_category3              SYMBOL,
  badge                       SYMBOL,
  currency                    SYMBOL,
  measurement                 SYMBOL,
  unit                        SYMBOL,
  price                       DOUBLE,
  price_original              DOUBLE
) TIMESTAMP(_created_utc) PARTITION BY DAY WAL;