Originally posted on the Trino slack by Moh. Hac.
Hi all, I have a question regarding trino/iceberg, I might be missing some basic understanding of iceberg, would be great to either help with an explanation or point to right article(s)
Let’s say we have an iceberg table in trino named iceberg.core.orders , which is partitioned by report_date. This table is managed by dbt through delete+insert strategy (e.g. we delete the last 10 days of orders and reinsert again incld. latest report_date)
My understanding is that the “$files” metadata table contains all files related to the current snapshot. Nevertheless, running a count distinct on the hidden column “$path”, in the affected orders table, is showing a lower count of files than by querying the “$files” metadata table
select count(distinct file_path)
from iceberg.core."orders$files" f
;
-- higher file count than from the next query
select count(distinct "$path") as file_path
FROM iceberg.core.orders
;
-- much lower count
What am I missing/misunderstanding? How to use only “$files” metadata table to run statistics on file size, size deviation, etc..
Background: We are running statistics on data files for current data and want to use to avoid running full table scan on data tables





