Enhance Data Contract Generation: Add File Format Support

Alex Johnson

Currently, the contract generator primarily supports CSV files. While CSV is a widely used format, modern data pipelines often involve a variety of file types. To make our contract generator more versatile and efficient, we need to extend its capabilities to handle additional file formats.

The Problem: Limited File Format Support

Our current contract generator is limited to CSV files, which presents a challenge for modern data pipelines that frequently work with various other formats, including:

  • JSON: A common format for APIs, configurations, and semi-structured data.
  • Parquet: A column-oriented format that's a standard for data lakes and warehouses.
  • Excel/XLSX: A ubiquitous format in business contexts.
  • Avro: A self-describing schema format, commonly used in Kafka and streaming systems.

This forces users to convert their data to CSV before generating contracts, which adds extra steps, discards format-specific metadata, and introduces data fidelity concerns. Supporting a broader range of file formats removes these workarounds, streamlines the process, and keeps the data's native structure intact.

Why Support Additional File Formats?

Expanding file format support offers numerous benefits, enhancing the contract generator's utility and efficiency. By directly accommodating formats like JSON, Parquet, Excel/XLSX, and Avro, we eliminate the need for intermediate conversions to CSV. This reduces complexity, minimizes potential data loss, and accelerates the contract generation process. Furthermore, native support allows us to leverage format-specific features, such as schema introspection for Parquet and schema extraction for Avro, improving the accuracy and richness of the generated contracts.

Ultimately, this enhancement lets users generate contracts directly from the formats they already use, improving both the user experience and data integrity.

Use Cases

Let's explore some specific use cases for each of the file formats we plan to support:

JSON Files

  1. Newline-Delimited JSON (NDJSON): Often used for log files and streaming data.
  2. JSON Arrays: Commonly found in API responses and data exports.
  3. Nested JSON: Used for complex hierarchical data structures.
  4. Schema Inference: Automatically detect data types from multiple JSON samples to create accurate contracts.

Parquet Files

  1. Schema Introspection: Directly read the built-in schema of Parquet files.
  2. Column Statistics: Utilize min/max values and null counts for data validation.
  3. Partition Awareness: Recognize and handle Hive-style partitions for efficient data processing.
  4. Compression Detection: Automatically detect compression types like Snappy and Gzip.

Excel/XLSX Files

  1. Multiple Sheets: Treat each sheet as a separate data source.
  2. Header Detection: Automatically identify header rows, even with merged cells or multiple header rows.
  3. Type Inference: Convert Excel data types to appropriate contract types.
  4. Formula Evaluation: Read and use calculated values from Excel formulas.

Avro Files

  1. Schema Extraction: Use the embedded Avro schema for contract generation.
  2. Union Types: Properly handle Avro's nullable pattern.
  3. Complex Types: Support records, arrays, and maps within Avro files.
  4. Schema Evolution: Optionally track schema changes across versions of Avro data.

Proposed Solution

To address the limitations of our current contract generator, we propose the addition of format-specific contract generation tools. These tools will allow users to directly generate contracts from JSON, Parquet, Excel/XLSX, and Avro files, without the need for intermediate conversions. Each tool will be tailored to the specific characteristics of its respective file format, ensuring accurate and efficient contract generation.

Here's a breakdown of the proposed solutions for each file format:

1. JSON File Support

generate_json_source_contract(
    source_id: str,
    source_path: str,
    json_format: str = "ndjson",  # "ndjson", "array", "single"
    sample_size: int = 100,
    flatten_nested: bool = False,  # Flatten nested objects
    max_depth: int = 5,  # Max nesting depth to analyze
    config: dict[str, Any] | None = None
) -> SourceContract

generate_json_destination_contract(
    destination_id: str,
    destination_path: str,
    json_format: str = "ndjson",
    pretty_print: bool = False,
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# NDJSON file (one JSON object per line)
contract = generate_json_source_contract(
    source_id="user_events",
    source_path="/data/events.jsonl",
    json_format="ndjson",
    flatten_nested=True  # Flatten {"user": {"id": 1}} -> {"user_id": 1}
)

# JSON array
contract = generate_json_source_contract(
    source_id="api_response",
    source_path="/data/users.json",
    json_format="array"  # Expects: [{"id": 1}, {"id": 2}, ...]
)

2. Parquet File Support

generate_parquet_source_contract(
    source_id: str,
    source_path: str,
    read_schema_only: bool = False,  # Don't read data, just schema
    sample_size: int = 1000,
    config: dict[str, Any] | None = None
) -> SourceContract

generate_parquet_destination_contract(
    destination_id: str,
    destination_path: str,
    compression: str = "snappy",  # "snappy", "gzip", "none"
    partition_cols: list[str] | None = None,
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# Read Parquet schema
contract = generate_parquet_source_contract(
    source_id="transactions",
    source_path="/data/transactions.parquet",
    read_schema_only=True  # Fast - just read schema
)

# Partitioned Parquet destination
contract = generate_parquet_destination_contract(
    destination_id="events_partitioned",
    destination_path="/data/events/",
    partition_cols=["year", "month", "day"]  # Hive-style partitioning
)

3. Excel/XLSX Support

generate_excel_source_contract(
    source_id: str,
    source_path: str,
    sheet_name: str | int | None = None,  # None = first sheet
    header_row: int = 0,  # 0-indexed
    skip_rows: int = 0,
    sample_size: int = 100,
    config: dict[str, Any] | None = None
) -> SourceContract

generate_excel_destination_contract(
    destination_id: str,
    destination_path: str,
    sheet_name: str = "Sheet1",
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# Read specific sheet
contract = generate_excel_source_contract(
    source_id="sales_data",
    source_path="/data/Q4_Sales.xlsx",
    sheet_name="Regional Sales",
    header_row=2,  # Header is the third row (0-indexed value 2)
    skip_rows=1  # Skip title row
)

# Multiple sheets (generate multiple contracts)
for sheet in ["North", "South", "East", "West"]:
    contract = generate_excel_source_contract(
        source_id=f"sales_{sheet.lower()}",
        source_path="/data/Regional_Sales.xlsx",
        sheet_name=sheet
    )

4. Avro File Support

generate_avro_source_contract(
    source_id: str,
    source_path: str,
    use_embedded_schema: bool = True,  # Use Avro's built-in schema
    sample_size: int = 100,
    config: dict[str, Any] | None = None
) -> SourceContract

generate_avro_destination_contract(
    destination_id: str,
    destination_path: str,
    codec: str = "snappy",  # "snappy", "deflate", "null"
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# Avro file (schema is embedded)
contract = generate_avro_source_contract(
    source_id="kafka_messages",
    source_path="/data/messages.avro",
    use_embedded_schema=True  # Fast - read from file
)

Technical Considerations

JSON Files

1. Format Detection

  • Auto-detect NDJSON vs JSON array (see the detection sketch below)
  • Handle both .json and .jsonl extensions
  • Detect nested vs flat structures
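
As a rough illustration of how auto-detection could work (the helper name detect_json_format is hypothetical, not part of the proposed API): peek at the first non-whitespace character and, failing that, check whether the first two lines parse independently.

import json

def detect_json_format(path: str) -> str:
    """Guess whether a file is a JSON array, NDJSON, or a single object."""
    with open(path, "r", encoding="utf-8") as f:
        head = f.read(4096).lstrip()
        if head.startswith("["):
            return "array"
        f.seek(0)
        first = f.readline().strip()
        second = f.readline().strip()
        if second:
            try:
                json.loads(first)
                json.loads(second)
                return "ndjson"  # two independently parseable lines
            except json.JSONDecodeError:
                pass
        return "single"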

2. Type Inference

  • Sample multiple records for accurate types
  • Handle union types (a field that is sometimes a string, sometimes null; see the sketch below)
  • Detect arrays of primitives vs objects
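
A minimal sketch of sample-based inference (the helper and the returned type names are illustrative; real mappings would follow the type tables later in this proposal):

def infer_field_type(values: list) -> tuple[str, bool]:
    """Return (contract_type, nullable) inferred from sampled values of one field."""
    nullable = any(v is None for v in values)
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "text", True  # no evidence; default to nullable text
    kinds = {type(v) for v in non_null}
    if kinds <= {bool}:
        return "boolean", nullable
    if kinds <= {int}:
        return "integer", nullable
    if kinds <= {int, float}:
        return "float", nullable
    if kinds <= {dict, list}:
        return "json", nullable
    return "text", nullable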

3. Nested Object Handling

Options for nested structures:

  • Flatten: {"user": {"id": 1}} → {"user_id": 1} (see the sketch after this list)
  • JSON column: Store nested object as JSON string
  • Array of structs: Represent as structured type
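
A flattening helper along these lines could back the flatten_nested option (a sketch; the separator and depth handling are illustrative):

def flatten(obj: dict, parent: str = "", sep: str = "_", max_depth: int = 5) -> dict:
    """Flatten nested dicts: {"user": {"id": 1}} -> {"user_id": 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and max_depth > 0:
            flat.update(flatten(value, name, sep, max_depth - 1))
        else:
            flat[name] = value
    return flat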

4. Large Files

  • Stream reading for large JSON arrays
  • Memory-efficient NDJSON processing

Parquet Files

1. Schema Reading

  • PyArrow or fastparquet for reading
  • Direct schema extraction (fast; see the sketch below)
  • Column statistics from metadata
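
With PyArrow, the schema and per-column statistics are available from file metadata without reading any rows; a sketch against the public pyarrow.parquet API:

import pyarrow.parquet as pq

pf = pq.ParquetFile("/data/transactions.parquet")
schema = pf.schema_arrow  # column names and Arrow types, no data read
meta = pf.metadata        # row count, row groups, compression codec, etc.

for i, field in enumerate(schema):
    stats = meta.row_group(0).column(i).statistics  # may be None
    print(field.name, field.type, stats.null_count if stats else "n/a")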

2. Type Mapping

# Parquet -> Contract types
"int32" → "integer"
"int64" → "bigint"
"float" → "float"
"double" → "double"
"string", "binary" → "text"
"bool" → "boolean"
"timestamp" → "datetime"
"date" → "date"
"list" → "array[type]"
"struct" → "json" or flatten

3. Partitioning

  • Detect Hive-style partitions (year=2024/month=01/); see the sketch below
  • Include partition columns in schema
  • Respect partitioning for performance
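
Partition columns can be recovered from the directory names themselves; a minimal sketch (the helper is hypothetical):

from pathlib import Path

def detect_partition_columns(dataset_root: str) -> list[str]:
    """Collect Hive-style partition keys (e.g. year=2024/month=01) from file paths."""
    keys: list[str] = []
    for parquet_file in Path(dataset_root).rglob("*.parquet"):
        for part in parquet_file.relative_to(dataset_root).parts[:-1]:
            if "=" in part:
                key = part.split("=", 1)[0]
                if key not in keys:
                    keys.append(key)
    return keys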

4. Compression

  • Detect compression codec from metadata
  • Include in destination contracts

Excel Files

1. Library Choice

  • openpyxl - For .xlsx (Office Open XML)
  • xlrd - For legacy .xls
  • pandas - High-level interface

2. Header Detection

  • Auto-detect header row (see the heuristic sketch below)
  • Handle merged cells
  • Skip title/info rows
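
One possible heuristic, assuming rows have already been read as tuples of cell values (the helper is hypothetical and ignores merged cells):

def detect_header_row(rows: list[tuple], max_scan: int = 10) -> int:
    """Pick the first scanned row whose cells are all non-empty strings."""
    for idx, row in enumerate(rows[:max_scan]):
        if row and all(isinstance(cell, str) and cell.strip() for cell in row):
            return idx
    return 0  # fall back to the first row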

3. Type Inference

Excel types to contract types (see the sketch after this list):

  • Numbers → numeric
  • Dates → date (Excel stores as numbers)
  • Text → text
  • Formulas → evaluate and infer type
  • Empty cells → nullable
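
Because openpyxl returns evaluated cell values as native Python objects when a workbook is opened with data_only=True, inference can key off those Python types. A rough sketch, assuming for brevity that the header is the first row:

import datetime
from openpyxl import load_workbook

def infer_excel_column_types(path: str, sheet: str, sample_rows: int = 100) -> dict[str, str]:
    wb = load_workbook(path, read_only=True, data_only=True)  # data_only: formula results, not formulas
    ws = wb[sheet]
    rows = ws.iter_rows(min_row=1, max_row=sample_rows + 1, values_only=True)
    header = next(rows)
    types: dict[str, str] = {}
    for row in rows:
        for name, value in zip(header, row):
            if value is None:
                continue  # empty cells only affect nullability
            if isinstance(value, bool):
                types.setdefault(name, "boolean")
            elif isinstance(value, (datetime.date, datetime.datetime)):
                types.setdefault(name, "date")
            elif isinstance(value, (int, float)):
                types.setdefault(name, "numeric")
            else:
                types.setdefault(name, "text")
    wb.close()
    return types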

4. Multiple Sheets

  • List available sheets
  • Generate contract per sheet
  • Or combine sheets with matching schema

Avro Files

1. Schema Extraction

  • Read embedded Avro schema (see the sketch below)
  • Convert to contract schema
  • Handle schema evolution
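
With fastavro, the embedded writer schema is available directly from the file reader; a sketch for a record schema:

from fastavro import reader

with open("/data/messages.avro", "rb") as fo:
    avro_reader = reader(fo)
    schema = avro_reader.writer_schema  # parsed dict form of the embedded schema
    for field in schema["fields"]:
        print(field["name"], field["type"])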

2. Type Mapping

# Avro -> Contract types
"int", "long" → "integer", "bigint"
"float", "double" → "float", "double"
"string" → "text"
"bytes" → "binary"
"boolean" → "boolean"
"null" → (nullable modifier)
"array" → "array[type]"
"map" → "json"
"record" → "json" or flatten
"union" → (handle nullable)

3. Union Types

Avro's nullable pattern:

["null", "string"]  // Nullable string

Convert to contract's nullable field.
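
One possible conversion, treating a two-branch union containing "null" as a nullable scalar and anything more complex as JSON (a sketch):

def avro_union_to_contract(union: list) -> tuple[str, bool]:
    """Map an Avro union such as ["null", "string"] to (avro_type, nullable)."""
    branches = [b for b in union if b != "null"]
    nullable = len(branches) < len(union)
    if len(branches) == 1 and isinstance(branches[0], str):
        return branches[0], nullable  # single named type, e.g. "string"
    return "json", nullable  # multi-type or complex unions fall back to JSON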

Implementation Plan

Phase 1: JSON Support (MVP)

  • ✅ NDJSON format support
  • ✅ JSON array format support
  • ✅ Basic type inference
  • ✅ Nested object detection (represent as JSON)
  • ✅ Source and destination contracts

Phase 2: Parquet Support

  • ✅ Schema reading with PyArrow
  • ✅ Type mapping
  • ✅ Column statistics
  • ✅ Partition detection
  • ✅ Source and destination contracts

Phase 3: Excel Support

  • ✅ Single sheet support (.xlsx)
  • ✅ Header detection
  • ✅ Type inference
  • ✅ Multiple sheet handling
  • ✅ Source and destination contracts

Phase 4: Avro Support

  • ✅ Schema extraction
  • ✅ Type mapping
  • ✅ Union type handling
  • ✅ Source and destination contracts

Phase 5: Advanced Features

  • Nested object flattening strategies
  • Schema registry integration (for Avro)
  • Excel formula evaluation
  • JSON schema validation
  • Parquet predicate pushdown optimization

Dependencies

Required Libraries

[project.optional-dependencies]
json = []  # Built-in
parquet = ["pyarrow>=10.0.0"]
excel = ["openpyxl>=3.0.0", "xlrd>=2.0.0"]
avro = ["fastavro>=1.7.0"]

Acceptance Criteria

JSON:

  • [ ] Can generate source contract from NDJSON file
  • [ ] Can generate source contract from JSON array file
  • [ ] Correctly infers types from nested JSON
  • [ ] Handles large files efficiently (streaming)
  • [ ] Supports flattening nested objects

Parquet:

  • [ ] Can generate source contract from Parquet file
  • [ ] Reads schema without loading data
  • [ ] Correctly maps Parquet types to contract types
  • [ ] Detects partition columns
  • [ ] Includes column statistics in quality metrics

Excel:

  • [ ] Can generate source contract from XLSX file
  • [ ] Correctly detects header row
  • [ ] Handles multiple sheets
  • [ ] Maps Excel types to contract types
  • [ ] Handles merged cells gracefully

Avro:

  • [ ] Can generate source contract from Avro file
  • [ ] Uses embedded schema
  • [ ] Handles union types (nullable)
  • [ ] Maps Avro types to contract types

All:

  • [ ] Documentation updated with examples
  • [ ] Tests cover happy path and error cases
  • [ ] Performance tested with large files

Alternative Approaches

Use pandas for Everything

  • Pros: Uniform interface, well-tested
  • Cons: Heavy dependency, not always most efficient

Recommendation: Use format-specific libraries for best performance and features.

Streaming vs Loading

For large files:

  • Streaming: Memory-efficient, slower type inference
  • Loading: Faster type inference, memory-intensive

Recommendation: Stream for sampling, load when files are small.
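
For NDJSON, sampling can stay streaming-friendly by reading only the first N records; a minimal sketch using itertools.islice:

import json
from itertools import islice

def sample_ndjson(path: str, sample_size: int = 100) -> list[dict]:
    """Read at most sample_size records without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, sample_size) if line.strip()]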

Nice to Have

  • Auto-detect file format from extension or content
  • Schema evolution detection between file versions
  • Compression level optimization
  • Row group size optimization (Parquet)
  • JSON schema generation
  • Excel template generation from destination contract

Related Issues

  • #[number] - Support API endpoints
  • #[number] - Support database connections
  • #[number] - Support object storage (S3, GCS)
  • #[number] - Data quality profiling

Labels

enhancement, file-formats, data-source, data-destination, high-priority

