cBioPortal Validator: Fixing NCBI Build Hardcoding

Alex Johnson

Nov 12, 2025


Bug Report: NCBI Build Hardcoded to GRCh37 in Validator

A recent bug report describes a problem in the cBioPortal validator: the NCBI build is hardcoded. This article covers the problem, its impact, and the proposed fix, which would let the validator interpret genomic data against the correct reference build.

Description:

The validator currently hardcodes the NCBI build to GRCh37. It does not read the build from the study under validation, whether the study information comes through the API or from offline JSON dumps. As a result, a study that uses a different reference build triggers validation errors even when its data is internally consistent. Beyond the false errors themselves, this adds unnecessary complexity to data preparation and submission for researchers who rely on cBioPortal.

According to comments in the code, the validator previously read this value from the ncbi.build property in application.properties. That behavior is no longer active, so the validator is left with the default GRCh37 build. When a MAF file sets the NCBI_Build column to GRCh38, the validator reports a validation error even though the value is correct for the study. These false errors cause delays and confusion for researchers who use more recent builds.
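The failure mode can be illustrated with a minimal sketch. The names `HARDCODED_BUILD` and `check_ncbi_build` are hypothetical and are not taken from the cBioPortal codebase; the sketch only assumes that the check compares each NCBI_Build value against a fixed GRCh37 default.

```python
import csv

# Hypothetical constant standing in for the validator's hardcoded default.
HARDCODED_BUILD = "GRCh37"


def check_ncbi_build(maf_text, expected_build=HARDCODED_BUILD):
    """Return one error message per data row whose NCBI_Build differs.

    Rows with no NCBI_Build column, or with an empty value, are skipped,
    which is why omitting the column avoids the error.
    """
    errors = []
    # Drop comment lines so the first remaining line is the column header.
    lines = [line for line in maf_text.splitlines() if not line.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        value = (row.get("NCBI_Build") or "").strip()
        if value and value != expected_build:
            errors.append(
                f"data row {row_number}: NCBI_Build is {value!r}, "
                f"expected {expected_build!r}"
            )
    return errors


# A one-row MAF excerpt from a study that uses GRCh38.
maf = "Hugo_Symbol\tNCBI_Build\tChromosome\nTP53\tGRCh38\t17\n"
print(check_ncbi_build(maf))
```

With the fixed default, the GRCh38 row is reported as an error. Passing `expected_build="GRCh38"` makes the same row pass, which is the behavior the bug report asks for once the build comes from the study itself.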

Currently, the only ways to avoid this validation issue are:

Set NCBI_Build to GRCh37, or
Omit the column entirely.

Neither workaround is ideal: the first misrepresents the data, and the second removes useful information. Both add friction to the data submission process. A proper solution would retrieve the NCBI build from the study metadata so that the validator adapts to the build each study actually uses.

💡 Expected Behavior

The validator should read the NCBI build of the study under validation rather than rely on a hardcoded default. With the build taken from the study metadata, a MAF file whose NCBI_Build matches the study's reference build should pass validation, and no manual workaround should be needed.

Knowing the correct build also lets the validator interpret the genomic coordinates and annotations in MAF files in the right context. That reduces the risk of false positives and false negatives and allows more accurate checks for data consistency across studies that use different builds.

✅ Proposed Fix

The proposed fix is to make the validator retrieve the NCBI build from the study metadata. This information is already available in the database and should be used to determine the correct reference build at validation time, which removes the need for manual workarounds.
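One way the lookup could work is sketched below. This is an illustration under stated assumptions, not the actual patch: the functions `resolve_ncbi_build` and `build_matches`, the JSON key `ncbiBuild`, and the study ID `example_study` are all hypothetical placeholders, and the real metadata field may be named differently.

```python
import json

# Fallback used only when the study metadata does not name a build.
DEFAULT_BUILD = "GRCh37"


def resolve_ncbi_build(study_metadata, default=DEFAULT_BUILD):
    """Return the reference build declared by the study, else the default.

    `study_metadata` is a dict parsed from an API response or from an
    offline JSON dump. The key "ncbiBuild" is a placeholder for this
    sketch, not a documented cBioPortal field.
    """
    build = str(study_metadata.get("ncbiBuild") or "").strip()
    return build or default


def build_matches(maf_value, study_build):
    """True when a MAF NCBI_Build value agrees with the study's build."""
    value = str(maf_value).strip()
    if not value:
        return True  # an empty value is not checked
    return value == study_build


# Hypothetical study record, as it might appear in an offline JSON dump.
study = json.loads('{"studyId": "example_study", "ncbiBuild": "GRCh38"}')
study_build = resolve_ncbi_build(study)

print(study_build)
print(build_matches("GRCh38", study_build))
print(build_matches("GRCh37", study_build))
```

Keeping the lookup separate from the comparison means the same check works whether the metadata came from the API or from a JSON dump, and the GRCh37 default only applies when a study does not declare a build.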

This change would resolve the current false validation errors and let the validator handle studies that use different reference builds. It would also improve the quality of the data submitted to cBioPortal, supporting more reliable and reproducible research findings.

For more information on NCBI builds, visit the NCBI website.
