Presto Iceberg Day Partitioning Issue: Wrong Results

Alex Johnson

Introduction

This article describes a bug encountered when querying Iceberg tables in Presto that use the day(timestamp) partition transform: filtering on the timestamp column that drives the day partition returns incorrect results. Because the query succeeds and simply returns no rows, the error is easy to miss and can silently skew any analysis built on the table. The sections below cover the environment, the expected and observed behavior, the steps to reproduce, and areas to investigate for a fix.

Your Environment

Understanding the environment in which the issue occurs is crucial for effective troubleshooting. Here's a breakdown of the setup used to reproduce the bug:

  • Presto version: master
  • Storage: HDFS
  • Data source and connector: Iceberg
  • Deployment: Local debug

Expected Behavior

When querying an Iceberg table partitioned by day(ts), where ts is a timestamp column, an equality filter on ts should return every row whose ts value matches the literal exactly. The day partition transform should only narrow the set of files that are scanned (here, to the partition for 2022-05-06); it should never change which rows match. For instance, if a row contains a timestamp of 2022-05-06 07:08:09, then a query filtering for ts = cast('2022-05-06 07:08:09' as timestamp) should return that row.

Current Behavior

Instead of returning the expected rows, the query returns an empty result set. This indicates a mismatch between the query's filtering logic and how Iceberg handles the day(timestamp) partition transform. The following steps illustrate the issue:

  1. Create an Iceberg table:

    create table iceberg.default.iceberg_ts1(
     id bigint,
     category varchar,
     ts timestamp
    ) WITH (
     location = 'hdfs://localhost:8020/warehouse/iceberg/db_name/iceberg_ts1',
     partitioning = ARRAY['bucket(id,8)','day(ts)','category']
    );
    

    This statement creates an Iceberg table named iceberg_ts1 with columns id, category, and ts (timestamp). The table is partitioned by bucket(id,8), day(ts), and category.

  2. Insert data into the table:

    insert into iceberg.default.iceberg_ts1 select 11, 'cc1', cast('2022-05-06 07:08:09' as timestamp);
    insert into iceberg.default.iceberg_ts1 select 11, 'cc1', cast('2022-05-06 07:08:09' as timestamp);
    

    These statements insert two identical rows into the iceberg_ts1 table. Each row contains an id of 11, a category of 'cc1', and a timestamp of 2022-05-06 07:08:09.

  3. Query the table with a timestamp filter:

    select * from iceberg.default.iceberg_ts1 where ts=cast('2022-05-06 07:08:09' as timestamp);
    

    This query attempts to retrieve all rows where the ts column matches the specified timestamp. Ideally, this query should return the two rows that were previously inserted. However, the actual result is an empty set, indicating a problem with the filtering mechanism.
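To confirm that the rows exist and that only the equality predicate on ts loses them, the failing query can be compared against a few variations. The statements below are a sketch that assumes the table from step 1. Whether the range and date() variants return rows depends on where the bug actually sits, so no expected output is shown.

```sql
-- Control: no filter on ts, so no pruning on the day partition is involved.
select count(*) from iceberg.default.iceberg_ts1;

-- Range predicate covering the whole day of the inserted rows.
select * from iceberg.default.iceberg_ts1
where ts >= timestamp '2022-05-06 00:00:00'
  and ts <  timestamp '2022-05-07 00:00:00';

-- Filter on a value derived from ts rather than on the raw column.
-- Wrapping the column in a function may keep the predicate from being
-- turned into a partition filter (an assumption, not verified here).
select * from iceberg.default.iceberg_ts1
where date(ts) = date '2022-05-06';
```

If the unfiltered count is 2 while the equality query stays empty, the data was written correctly and the problem is on the read path.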

Possible Solution

The root cause of this issue might stem from how Presto and Iceberg interact when handling timestamp comparisons with day partitioning. Here are a few potential solutions and areas to investigate:

  1. Timestamp Precision: Investigate whether Presto and Iceberg disagree on timestamp precision. Iceberg stores timestamps with microsecond precision, while the Presto query compares the full timestamp literal, possibly at millisecond precision. If the literal is converted to a day partition value using the wrong unit, partition pruning could skip the files that hold the matching rows. Ensure that both sides of the comparison use the same unit and granularity.

  2. Type Conversion: Review the type conversion between the timestamp literal in the query and the ts column in the Iceberg table. Implicit or explicit type conversions might be introducing subtle differences that cause the comparison to fail.

  3. Iceberg Metadata: Examine the Iceberg metadata to understand how the data is partitioned and how the day(ts) transform is represented. There might be inconsistencies in the metadata that affect the query planning and filtering process.

  4. Presto Version: The bug was reproduced on a build of master, so check whether a later commit or release has already addressed it. Review the release notes and recent changes for fixes related to the Iceberg connector and timestamp handling.

  5. Connector Configuration: Review the Iceberg connector configuration in Presto. There might be specific configuration options that affect how timestamp values are handled during query execution.
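Several of the points above can be checked directly from a Presto session. The queries below are a minimal diagnostic sketch, not a fix. They assume the build exposes the Iceberg connector's hidden metadata tables named "$partitions" and "$files"; column names and availability may differ between versions.

```sql
-- Partition values Iceberg recorded for the inserted rows,
-- including the value produced by the day(ts) transform.
select * from iceberg.default."iceberg_ts1$partitions";

-- Data files backing the table, with per-file metadata.
select * from iceberg.default."iceberg_ts1$files";

-- Plan for the failing query; inspect the constraint handed to the table scan.
explain
select * from iceberg.default.iceberg_ts1
where ts = cast('2022-05-06 07:08:09' as timestamp);

-- Session time zone, in case the literal and the stored value
-- are interpreted in different zones.
select current_timezone();
```

Comparing the day value shown in the partitions table with the constraint in the plan should indicate whether the predicate is being translated to a different day, or to a different unit, than the one the data was written under.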

Steps to Reproduce

To reproduce this bug, follow these steps:

  1. Set up a Presto environment with the Iceberg connector configured to use HDFS storage.
  2. Create an Iceberg table with a timestamp column and partition it using day(ts). Ensure that the location is valid.
  3. Insert a row into the table with a specific timestamp value.
  4. Query the table using a WHERE clause to filter the data based on the timestamp column, using the same timestamp value used during insertion.
  5. Observe that the query returns an empty result set, even though a matching row exists in the table.
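To isolate the day(ts) transform as the trigger, the same steps can be repeated against a control table that omits it. This is a sketch under stated assumptions: the table name iceberg_ts1_control and its location path are hypothetical, chosen to mirror the original table.

```sql
-- Control table (hypothetical name and location): same columns and
-- partitioning as iceberg_ts1, except that day(ts) is left out.
create table iceberg.default.iceberg_ts1_control(
 id bigint,
 category varchar,
 ts timestamp
) WITH (
 location = 'hdfs://localhost:8020/warehouse/iceberg/db_name/iceberg_ts1_control',
 partitioning = ARRAY['bucket(id,8)','category']
);

-- Same row as in the original reproduction.
insert into iceberg.default.iceberg_ts1_control
select 11, 'cc1', cast('2022-05-06 07:08:09' as timestamp);

-- Same equality filter as the failing query.
select * from iceberg.default.iceberg_ts1_control
where ts = cast('2022-05-06 07:08:09' as timestamp);
```

If this query returns the row while the query against iceberg_ts1 does not, the difference points at the handling of the day(ts) partition rather than at timestamp comparison in general.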

Screenshots (if appropriate)

No screenshots are available. The SQL statements and the empty result described under Current Behavior capture everything a screenshot would show.

Context

This issue significantly impacts data analysis workflows that rely on accurate timestamp-based filtering. When users are unable to retrieve the correct data based on timestamp criteria, it leads to distrust in the data and hinders decision-making. This can be particularly problematic in time-series analysis, event tracking, and other applications where timestamp precision is critical. Resolving this issue is essential for ensuring the reliability and usability of Presto with Iceberg tables.

Conclusion

The discrepancy between expected and actual results when querying Iceberg tables partitioned by day(timestamp) in Presto is a serious issue. It undermines data integrity and can lead to flawed insights. Addressing this bug requires a careful examination of timestamp precision, type conversion, Iceberg metadata, Presto versions, and connector configurations. By systematically investigating these areas, a solution can be found to ensure accurate and reliable timestamp-based filtering in Presto with Iceberg tables.

For further information on Apache Iceberg and its features, see the official Apache Iceberg website.
