SELECT t1.id,
t2.n,
COUNT(*) AS cnt
FROM "ami_sb_insights"."ami_sbe_config"."biac_audit_session" t1
CROSS JOIN UNNEST(sequence(1, 10000)) AS t2(n)
GROUP BY t1.id, t2.n
LIMIT 25000000;
It ran for ~15 minutes but still showed as non-spooled.
A few quick questions:
What are the main benefits of using spooling + spill together in a cluster?
What’s the best way to verify that spooling and spill are actually active (logs, UI, metrics, specific flags)?
Is there a recommended stress query that will clearly force spooling + spill so we can validate the setup?
How can I confirm that S3 spool segments are being created and later cleaned up?
Generally, as called out in Spill to disk — Trino 478 Documentation , spilling to disk is not recommended. If needed, it is suggested to configure fault-tolerant execution.
That said, spill helps with stage-to-stage memory issues while spooling helps with returning large results at the end of the query execution. They are solving different things.
I’m thinking a low TTL and manually verifying the files spooling creates are there initially and then manually verify they are gone after the TTL time has elapsed.
We have two AMI Dev clusters — one in us-east and one in us-central-122. Spooling is enabled only on the central-122 cluster, and I can see the log message “spooling is enabled with …” which confirms that the configuration is loaded correctly.
Now I’d like to compare the two clusters (spooling ON vs OFF) to understand the real execution differences. I’m trying to validate:
How query execution changes with and without spooling
The impact on memory usage, performance, and result handling
Whether spooling is actually being triggered when running large result-set queries through the trino-python-client
Here is the test query I’m using:
WITH base AS (
SELECT *
FROM "ami_sb_insights"."ami_sbe_config"."biac_audit_session"
LIMIT 400000
)
SELECT b.*, t.n
FROM base AS b
CROSS JOIN UNNEST(sequence(1,10000)) AS t(n);
This produces ~4 billion rows, so it should be large enough to trigger spooling if it’s active.
Could you guide me on the best way to validate spooling behavior across both clusters? For example:
Is this a suitable query to show a clear difference between “spooling enabled” and “spooling disabled”?
Should I run the same query on both clusters and compare worker memory, S3 spool segment creation, or runtime?
Are there specific metrics, logs, or UI pages you recommend monitoring while testing?
I highly recommend you check out the demo in the following Trino Community Broadcast…
You will see a few things. Since you really are just trying to see how much the spooling protocol helps, I would just do a very simple query (demo is selecting all from the tpch.sf10.lineitem table limiting results to 1 million rows) and you can see that using the Command line interface — Trino 478 Documentation allows you to do a little property “magic” to disable spooling so you can actually compare it with and without on the same cluster.