TensorRT 10.14: Accuracy Bug In Pdist With Dynamic Shapes

Alex Johnson

Introduction

This article describes an accuracy regression in pdist when it is run through TensorRT 10.14.1.48 or tensorrt_rtx 1.2 with dynamic shapes. Automated tests caught the problem: the output differs from the reference by more than the allowed tolerance, and the failure appears only when all dimensions are configured as dynamic. The same test passed with TensorRT 10.13 and tensorrt_rtx 1.0, so this is a regression in the newer releases. The sections below cover the failure itself, the steps to reproduce it, the expected behavior, the environment details to collect, and additional context worth reporting. A patch or workaround is needed before pdist results for fully dynamic inputs can be relied on in these versions.

Bug Description

The pdist test with dynamic shapes fails on TensorRT 10.14.1.48 and tensorrt_rtx 1.2 with an AssertionError: the output tensor is not within tolerance of the reference. The failing case is TestDynamicShapePdistConverter::test_pdist_float_4_dim0_dynamic_dim1_dynamic_p_other in conversion/test_pdist_aten.py. Four of the six output elements are mismatched, and both the greatest absolute difference (about 0.29) and the greatest relative difference (about 0.24) are far above the allowed 0.005. Differences of this size suggest a problem in how pdist is handled with dynamic shapes rather than ordinary floating-point noise. The failure can be reproduced with the command printed in the error output, run from the base repository directory.

The specific error message is:

FAILED conversion/test_pdist_aten.py::TestDynamicShapePdistConverter::test_pdist_float_4_dim0_dynamic_dim1_dynamic_p_other - AssertionError: Tensor-likes are not close!

Mismatched elements: 4 / 6 (66.7%)
Greatest absolute difference: 0.291101336479187 at index (5,) (up to 0.005 allowed)
Greatest relative difference: 0.23916198313236237 at index (5,) (up to 0.005 allowed)

To execute this test, run the following from the base repo dir:

python test_pdist_aten.py TestDynamicShapePdistConverter.test_pdist_float_4_dim0_dynamic_dim1_dynamic_p_other

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

==== 1 failed, 1857 passed, 50 skipped, 9757 warnings in 102.78s (0:01:42) =====
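The "Tensor-likes are not close!" report is the format produced by torch.testing.assert_close, which treats an element as close when |actual - expected| <= atol + rtol * |expected|. The following is a minimal pure-Python sketch of that rule, assuming atol = rtol = 0.005 as the "up to 0.005 allowed" lines suggest. The function name mismatch_report and the sample values are hypothetical and are not taken from the test suite.

```python
def mismatch_report(actual, expected, atol=0.005, rtol=0.005):
    """Summarize element-wise closeness using the rule
    |actual - expected| <= atol + rtol * |expected|.

    Hypothetical helper for illustration; it is not part of PyTorch.
    """
    mismatched = 0
    max_abs = 0.0
    max_rel = 0.0
    for a, e in zip(actual, expected):
        abs_diff = abs(a - e)
        # An element counts as mismatched when it exceeds the combined tolerance.
        if abs_diff > atol + rtol * abs(e):
            mismatched += 1
        max_abs = max(max_abs, abs_diff)
        # Relative difference is undefined for a zero reference value; skip it.
        if e != 0:
            max_rel = max(max_rel, abs_diff / abs(e))
    return {
        "mismatched": mismatched,
        "total": len(expected),
        "max_abs": max_abs,
        "max_rel": max_rel,
    }


# Made-up values: only the second element is outside tolerance.
report = mismatch_report(actual=[1.0, 2.5, 3.001], expected=[1.0, 2.0, 3.0])
print(report)
```

Read this way, the log says that four of six distances fall outside the combined tolerance, and that the worst element is off by roughly a quarter of its reference value.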

The failure occurs with both tensorrt 10.14 and tensorrt_rtx 1.2, and only when all dimensions are configured as dynamic. The same test passed with tensorrt 10.13 and tensorrt_rtx 1.0, so a change introduced between those releases is the likely cause. Until the regression is fixed, pdist output for fully dynamic inputs cannot be trusted on the newer versions.

Steps to reproduce the behavior:

  1. Set up an environment with TensorRT 10.14.1.48 or tensorrt_rtx 1.2.
  2. Use a PyTorch model that calls pdist, with all input dimensions configured as dynamic. The test case in step 3 already does this.
  3. From the base repo dir, run: python test_pdist_aten.py TestDynamicShapePdistConverter.test_pdist_float_4_dim0_dynamic_dim1_dynamic_p_other

Expected behavior

With dynamic shapes, pdist run through TensorRT should produce the same output as PyTorch, within the test's tolerance. In the failing case that tolerance is 0.005 for both the absolute and the relative difference, as shown in the error message. The test test_pdist_float_4_dim0_dynamic_dim1_dynamic_p_other should therefore pass, as it did with TensorRT 10.13 and tensorrt_rtx 1.0, confirming that the distances are computed correctly even when every input dimension is dynamic.
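For reference, torch.nn.functional.pdist computes the p-norm distance between every pair of rows of a 2-D input, so N rows yield N * (N - 1) / 2 distances. The sketch below is a pure-Python stand-in for that computation, intended only to show what a correct result looks like. The name pdist_reference and the sample rows are hypothetical, and the actual input and p value used by the failing test are not shown in the report.

```python
from itertools import combinations


def pdist_reference(rows, p=2.0):
    """Pairwise p-norm distances between rows, in (i, j) order with i < j.

    Illustrative stand-in for torch.nn.functional.pdist, not its implementation.
    """
    if p < 0:
        raise ValueError("p must be non-negative")
    distances = []
    for x, y in combinations(rows, 2):
        diffs = [abs(a - b) for a, b in zip(x, y)]
        if p == float("inf"):
            # Infinity norm: largest coordinate difference.
            distances.append(max(diffs, default=0.0))
        elif p == 0:
            # p = 0: number of coordinates that differ.
            distances.append(float(sum(d != 0 for d in diffs)))
        else:
            distances.append(sum(d ** p for d in diffs) ** (1.0 / p))
    return distances


# Made-up input with four rows and two columns.
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]]
print(len(pdist_reference(rows)))   # number of row pairs
print(pdist_reference(rows, p=3.0))
```

Four rows give six distances, which is consistent with the six elements in the failure report if the test input has four rows. That is an inference from the output size and is not stated in the report.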

Environment

The environment in which the bug was observed should be recorded using the fields below. Specific values are not given here, so the placeholders are kept as they appear in the report template:

  • Torch-TensorRT Version: (e.g. 1.0.0)
  • PyTorch Version: (e.g. 1.0)
  • CPU Architecture:
  • OS: (e.g., Linux)
  • How you installed PyTorch: (conda, pip, libtorch, source)
  • Build command you used: (if compiling from source)
  • Are you using local sources or building from archives:
  • Python version:
  • CUDA version:
  • GPU models and configuration:
  • Any other relevant information:

Exact versions and configurations are what make the issue reproducible. The CUDA version and GPU model can expose compatibility or driver-related problems, the PyTorch installation method (conda, pip, libtorch, source) can point to differences in dependencies or library versions, and the build command, when compiling from source, shows any custom flags that might influence TensorRT's behavior. The more complete this information is, the easier it is to pinpoint the root cause.
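Some of these fields can be collected automatically. The sketch below uses only the Python standard library to report the Python version, OS, CPU architecture, and installed package versions. The distribution names in PACKAGES are assumptions about how the packages are published and may need adjusting for a given installation; collect_environment is a hypothetical helper name.

```python
import platform
import sys
from importlib import metadata

# Assumed distribution names; adjust them to match the local installation.
PACKAGES = ("torch", "torch-tensorrt", "tensorrt", "tensorrt-rtx")


def collect_environment(packages=PACKAGES):
    """Return a dict of basic environment facts and package versions."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.system(),
        "cpu_arch": platform.machine(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

PyTorch also ships its own collector, python -m torch.utils.collect_env, which additionally reports CUDA and GPU details when PyTorch is installed.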

Additional context

Beyond the bug description and environment details, any context that narrows down the problem is worth including: the use case, the model architecture in which pdist appears, and any custom modifications to the TensorRT setup. It also helps to state how dynamic shapes are used, for example to handle variable-length sequences or input images of different sizes, since that can limit the scope of the investigation. Performance characteristics such as inference time or memory usage may be relevant as well. Finally, any workaround that has been tried should be documented, whether it was a change to the model architecture, a modification of the input data, or an adjustment to the TensorRT configuration, together with its outcome.
