Presto Iceberg Day Partitioning Issue: Wrong Results
Introduction
This article addresses a peculiar issue encountered while working with Iceberg tables in Presto, specifically when using the day(timestamp) partition transform. The problem manifests as incorrect query results when filtering data based on the timestamp column used in the day partitioning. This can lead to significant data discrepancies and impact the reliability of data analysis. This article provides a detailed exploration of the issue, including the environment setup, observed behavior, steps to reproduce, and potential solutions.
Your Environment
Understanding the environment in which the issue occurs is crucial for effective troubleshooting. Here's a breakdown of the setup used to reproduce the bug:
- Presto version: master
- Storage: HDFS
- Data source and connector: Iceberg
- Deployment: Local debug
Expected Behavior
When querying an Iceberg table partitioned by day(ts), where ts is a timestamp column, an equality filter on ts should return every row whose ts value equals the specified timestamp. Partitioning by day only determines how rows are grouped into files; it must not change which rows a predicate matches. For instance, if a row contains a timestamp of 2022-05-06 07:08:09, then a query filtering for ts = cast('2022-05-06 07:08:09' as timestamp) should return that row.
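To make the expectation concrete, here is a minimal Python sketch of the two things involved: the day transform, which the Iceberg specification defines as the number of days from 1970-01-01, and the row-level meaning of an equality predicate. The helper names iceberg_day and rows_matching are hypothetical and exist only for this illustration; neither comes from Presto or Iceberg.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def iceberg_day(ts: datetime) -> int:
    """Days from 1970-01-01 to the calendar day of ts (the day transform)."""
    return (ts.date() - EPOCH).days


def rows_matching(rows, literal):
    """Row-level meaning of `ts = literal`: keep rows whose ts equals it."""
    return [row for row in rows if row["ts"] == literal]


# The two identical rows inserted in the reproduction steps.
rows = [
    {"id": 11, "category": "cc1", "ts": datetime(2022, 5, 6, 7, 8, 9)},
    {"id": 11, "category": "cc1", "ts": datetime(2022, 5, 6, 7, 8, 9)},
]
literal = datetime(2022, 5, 6, 7, 8, 9)

print(iceberg_day(literal))               # 19118
print(len(rows_matching(rows, literal)))  # 2
```

The partition value (19118) only describes where the rows are stored; the predicate itself should still match both of them.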
Current Behavior
Instead of returning the expected rows, the query returns an empty result set. This indicates a mismatch between the query's filtering logic and how Iceberg handles the day(timestamp) partition transform. The following steps illustrate the issue:
- Create an Iceberg table:

    create table iceberg.default.iceberg_ts1 (
        id bigint,
        category varchar,
        ts timestamp
    ) WITH (
        location = 'hdfs://localhost:8020/warehouse/iceberg/db_name/iceberg_ts1',
        partitioning = ARRAY['bucket(id,8)','day(ts)','category']
    );

  This statement creates an Iceberg table named iceberg_ts1 with columns id, category, and ts (timestamp). The table is partitioned by bucket(id,8), day(ts), and category.

- Insert data into the table:

    insert into iceberg.default.iceberg_ts1 select 11, 'cc1', cast('2022-05-06 07:08:09' as timestamp);
    insert into iceberg.default.iceberg_ts1 select 11, 'cc1', cast('2022-05-06 07:08:09' as timestamp);

  These statements insert two identical rows into the iceberg_ts1 table. Each row contains an id of 11, a category of 'cc1', and a timestamp of 2022-05-06 07:08:09.

- Query the table with a timestamp filter:

    select * from iceberg.default.iceberg_ts1 where ts=cast('2022-05-06 07:08:09' as timestamp);

  This query attempts to retrieve all rows where the ts column matches the specified timestamp. It should return the two rows that were previously inserted. However, the actual result is an empty set, indicating a problem with the filtering mechanism.
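One way to reason about the empty result is partition pruning: a predicate on ts can be projected onto the day partition, so that only partitions whose day range could contain the literal are scanned. The following Python sketch shows that check under the assumption that a day partition value d covers the half-open interval from midnight of day d to midnight of day d + 1. The function names and the list of partition values are hypothetical, and this is not the planner code of Presto or Iceberg.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def day_bounds(day_ordinal: int):
    """Half-open timestamp range [start, end) covered by a day partition."""
    start = EPOCH + timedelta(days=day_ordinal)
    return start, start + timedelta(days=1)


def partition_may_contain(day_ordinal: int, literal: datetime) -> bool:
    """True if a row with ts = literal could live in this day partition."""
    start, end = day_bounds(day_ordinal)
    return start <= literal < end


literal = datetime(2022, 5, 6, 7, 8, 9)

# Hypothetical partition values; only 19118 holds the inserted rows.
partition_days = [19117, 19118, 19119]
kept = [d for d in partition_days if partition_may_contain(d, literal)]
print(kept)  # [19118]
```

If pruning worked as sketched, the partition holding the two inserted rows would be kept. An empty result therefore suggests that either the partition value or the literal is being interpreted differently somewhere between planning and scanning.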
Possible Solution
The root cause of this issue might stem from how Presto and Iceberg interact when handling timestamp comparisons with day partitioning. Here are a few potential solutions and areas to investigate:
- Timestamp Precision: Investigate whether Presto and Iceberg handle timestamp precision differently. Iceberg stores timestamps with microsecond precision, whereas the Presto timestamp type uses millisecond precision. The day(ts) transform reduces the timestamp to a day value, while the query filter compares the full timestamp. Ensure that both sides of the comparison use the same unit and granularity.
- Type Conversion: Review the type conversion between the timestamp literal in the query and the ts column in the Iceberg table. Implicit or explicit type conversions might introduce subtle differences that cause the comparison to fail.
- Iceberg Metadata: Examine the Iceberg metadata to understand how the data is partitioned and how the day(ts) transform is represented. Inconsistencies in the recorded partition values could affect query planning and filtering.
- Presto Version: The issue was reproduced on master, so check whether a fix has landed since that build. Review the release notes for bug fixes or improvements related to Iceberg integration and timestamp handling.
- Connector Configuration: Review the Iceberg connector configuration in Presto. Specific configuration options might affect how timestamp values are handled during query execution.
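The precision point can be illustrated with a small Python sketch. It assumes, purely for illustration, that a millisecond value is passed to code that expects microseconds; this shows the class of bug worth looking for and is not a confirmed diagnosis of this issue. The function day_from_micros is a hypothetical helper.

```python
from datetime import date, datetime, timedelta

MICROS_PER_DAY = 86_400 * 1_000_000


def day_from_micros(epoch_micros: int) -> int:
    """Day transform for a value expressed in microseconds since the epoch."""
    return epoch_micros // MICROS_PER_DAY


ts = datetime(2022, 5, 6, 7, 8, 9)
ts_millis = (ts - datetime(1970, 1, 1)) // timedelta(milliseconds=1)
ts_micros = ts_millis * 1_000

correct_day = day_from_micros(ts_micros)  # value in the expected unit
wrong_day = day_from_micros(ts_millis)    # milliseconds mistaken for microseconds

print(correct_day, date(1970, 1, 1) + timedelta(days=correct_day))  # 19118 2022-05-06
print(wrong_day, date(1970, 1, 1) + timedelta(days=wrong_day))      # 19 1970-01-20
```

With a unit mismatch of this kind, the filter would target a day partition that holds no data, and the query would return an empty result even though the matching rows exist.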
Steps to Reproduce
To reproduce this bug, follow these steps:
- Set up a Presto environment with the Iceberg connector configured to use HDFS storage.
- Create an Iceberg table with a timestamp column and partition it using day(ts). Ensure that the location is valid.
- Insert a row into the table with a specific timestamp value.
- Query the table using a WHERE clause to filter the data based on the timestamp column, using the same timestamp value used during insertion.
- Observe that the query returns an empty result set, even though a matching row exists in the table.
Context
This issue significantly impacts data analysis workflows that rely on accurate timestamp-based filtering. When users are unable to retrieve the correct data based on timestamp criteria, it leads to distrust in the data and hinders decision-making. This can be particularly problematic in time-series analysis, event tracking, and other applications where timestamp precision is critical. Resolving this issue is essential for ensuring the reliability and usability of Presto with Iceberg tables.
Conclusion
The discrepancy between expected and actual results when querying Iceberg tables partitioned by day(timestamp) in Presto is a serious issue. It undermines data integrity and can lead to flawed insights. Addressing this bug requires a careful examination of timestamp precision, type conversion, Iceberg metadata, Presto versions, and connector configurations. By systematically investigating these areas, a solution can be found to ensure accurate and reliable timestamp-based filtering in Presto with Iceberg tables.
For further information on Apache Iceberg and its features, see the official Apache Iceberg website.