Enhance Data Contract Generation: Add File Format Support

Alex Johnson

Currently, the contract generator primarily supports CSV files. While CSV is a widely used format, modern data pipelines often involve a variety of file types. To make our contract generator more versatile and efficient, we need to extend its capabilities to handle additional file formats.

The Problem: Limited File Format Support

Our current contract generator is limited to CSV files, which presents a challenge for modern data pipelines that frequently work with various other formats, including:

  • JSON: A common format for APIs, configurations, and semi-structured data.
  • Parquet: A column-oriented format that's a standard for data lakes and warehouses.
  • Excel/XLSX: A ubiquitous format in business contexts.
  • Avro: A self-describing schema format, commonly used in Kafka and streaming systems.

This forces users to convert their data to CSV before generating contracts, which adds extra steps, discards format-specific metadata, and introduces data fidelity concerns. Supporting a broader range of file formats removes these workarounds, streamlines the process, and keeps the data's native structure intact.

Why Support Additional File Formats?

Expanding file format support offers numerous benefits, enhancing the contract generator's utility and efficiency. By directly accommodating formats like JSON, Parquet, Excel/XLSX, and Avro, we eliminate the need for intermediate conversions to CSV. This reduces complexity, minimizes potential data loss, and accelerates the contract generation process. Furthermore, native support allows us to leverage format-specific features, such as schema introspection for Parquet and schema extraction for Avro, improving the accuracy and richness of the generated contracts.

Ultimately, this enhancement lets users generate contracts directly from the formats they already use, improving both the user experience and data integrity.

Use Cases

Let's explore some specific use cases for each of the file formats we plan to support:

JSON Files

  1. Newline-Delimited JSON (NDJSON): Often used for log files and streaming data.
  2. JSON Arrays: Commonly found in API responses and data exports.
  3. Nested JSON: Used for complex hierarchical data structures.
  4. Schema Inference: Automatically detect data types from multiple JSON samples to create accurate contracts.

Parquet Files

  1. Schema Introspection: Directly read the built-in schema of Parquet files.
  2. Column Statistics: Utilize min/max values and null counts for data validation.
  3. Partition Awareness: Recognize and handle Hive-style partitions for efficient data processing.
  4. Compression Detection: Automatically detect compression types like Snappy and Gzip.

Excel/XLSX Files

  1. Multiple Sheets: Treat each sheet as a separate data source.
  2. Header Detection: Automatically identify header rows, even with merged cells or multiple header rows.
  3. Type Inference: Convert Excel data types to appropriate contract types.
  4. Formula Evaluation: Read and use calculated values from Excel formulas.

Avro Files

  1. Schema Extraction: Use the embedded Avro schema for contract generation.
  2. Union Types: Properly handle Avro's nullable pattern.
  3. Complex Types: Support records, arrays, and maps within Avro files.
  4. Schema Evolution: Optionally track schema changes across versions of Avro data.

Proposed Solution

To address the limitations of our current contract generator, we propose the addition of format-specific contract generation tools. These tools will allow users to directly generate contracts from JSON, Parquet, Excel/XLSX, and Avro files, without the need for intermediate conversions. Each tool will be tailored to the specific characteristics of its respective file format, ensuring accurate and efficient contract generation.

Here's a breakdown of the proposed solutions for each file format:

1. JSON File Support

generate_json_source_contract(
    source_id: str,
    source_path: str,
    json_format: str = "ndjson",  # "ndjson", "array", "single"
    sample_size: int = 100,
    flatten_nested: bool = False,  # Flatten nested objects
    max_depth: int = 5,  # Max nesting depth to analyze
    config: dict[str, Any] | None = None
) -> SourceContract

generate_json_destination_contract(
    destination_id: str,
    destination_path: str,
    json_format: str = "ndjson",
    pretty_print: bool = False,
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# NDJSON file (one JSON object per line)
contract = generate_json_source_contract(
    source_id="user_events",
    source_path="/data/events.jsonl",
    json_format="ndjson",
    flatten_nested=True  # Flatten {"user": {"id": 1}} -> {"user_id": 1}
)

# JSON array
contract = generate_json_source_contract(
    source_id="api_response",
    source_path="/data/users.json",
    json_format="array"  # Expects: [{"id": 1}, {"id": 2}, ...]
)

2. Parquet File Support

generate_parquet_source_contract(
    source_id: str,
    source_path: str,
    read_schema_only: bool = False,  # Don't read data, just schema
    sample_size: int = 1000,
    config: dict[str, Any] | None = None
) -> SourceContract

generate_parquet_destination_contract(
    destination_id: str,
    destination_path: str,
    compression: str = "snappy",  # "snappy", "gzip", "none"
    partition_cols: list[str] | None = None,
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# Read Parquet schema
contract = generate_parquet_source_contract(
    source_id="transactions",
    source_path="/data/transactions.parquet",
    read_schema_only=True  # Fast - just read schema
)

# Partitioned Parquet destination
contract = generate_parquet_destination_contract(
    destination_id="events_partitioned",
    destination_path="/data/events/",
    partition_cols=["year", "month", "day"]  # Hive-style partitioning
)

3. Excel/XLSX Support

generate_excel_source_contract(
    source_id: str,
    source_path: str,
    sheet_name: str | int | None = None,  # None = first sheet
    header_row: int = 0,  # 0-indexed
    skip_rows: int = 0,
    sample_size: int = 100,
    config: dict[str, Any] | None = None
) -> SourceContract

generate_excel_destination_contract(
    destination_id: str,
    destination_path: str,
    sheet_name: str = "Sheet1",
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# Read specific sheet
contract = generate_excel_source_contract(
    source_id="sales_data",
    source_path="/data/Q4_Sales.xlsx",
    sheet_name="Regional Sales",
    header_row=2,  # Header is the third row (0-indexed value 2)
    skip_rows=1  # Skip title row
)

# Multiple sheets (generate multiple contracts)
for sheet in ["North", "South", "East", "West"]:
    contract = generate_excel_source_contract(
        source_id=f"sales_{sheet.lower()}",
        source_path="/data/Regional_Sales.xlsx",
        sheet_name=sheet
    )

4. Avro File Support

generate_avro_source_contract(
    source_id: str,
    source_path: str,
    use_embedded_schema: bool = True,  # Use Avro's built-in schema
    sample_size: int = 100,
    config: dict[str, Any] | None = None
) -> SourceContract

generate_avro_destination_contract(
    destination_id: str,
    destination_path: str,
    codec: str = "snappy",  # "snappy", "deflate", "null"
    config: dict[str, Any] | None = None
) -> DestinationContract

Example:

# Avro file (schema is embedded)
contract = generate_avro_source_contract(
    source_id="kafka_messages",
    source_path="/data/messages.avro",
    use_embedded_schema=True  # Fast - read from file
)

Technical Considerations

JSON Files

1. Format Detection

  • Auto-detect NDJSON vs JSON array (see the detection sketch below)
  • Handle both .json and .jsonl extensions
  • Detect nested vs flat structures
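
As a rough illustration of how auto-detection could work (the helper name detect_json_format is hypothetical, not part of the proposed API): peek at the first non-whitespace character and, failing that, check whether the first two lines parse independently.

import json

def detect_json_format(path: str) -> str:
    """Guess whether a file is a JSON array, NDJSON, or a single object."""
    with open(path, "r", encoding="utf-8") as f:
        head = f.read(4096).lstrip()
        if head.startswith("["):
            return "array"
        f.seek(0)
        first = f.readline().strip()
        second = f.readline().strip()
        if second:
            try:
                json.loads(first)
                json.loads(second)
                return "ndjson"  # two independently parseable lines
            except json.JSONDecodeError:
                pass
        return "single"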

2. Type Inference

  • Sample multiple records for accurate types
  • Handle union types (a field that is sometimes a string, sometimes null; see the sketch below)
  • Detect arrays of primitives vs objects
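
A minimal sketch of sample-based inference (the helper and the returned type names are illustrative; real mappings would follow the type tables later in this proposal):

def infer_field_type(values: list) -> tuple[str, bool]:
    """Return (contract_type, nullable) inferred from sampled values of one field."""
    nullable = any(v is None for v in values)
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "text", True  # no evidence; default to nullable text
    kinds = {type(v) for v in non_null}
    if kinds <= {bool}:
        return "boolean", nullable
    if kinds <= {int}:
        return "integer", nullable
    if kinds <= {int, float}:
        return "float", nullable
    if kinds <= {dict, list}:
        return "json", nullable
    return "text", nullable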

3. Nested Object Handling

Options for nested structures:

  • Flatten: {"user": {"id": 1}} → {"user_id": 1} (see the sketch after this list)
  • JSON column: Store nested object as JSON string
  • Array of structs: Represent as structured type
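
A flattening helper along these lines could back the flatten_nested option (a sketch; the separator and depth handling are illustrative):

def flatten(obj: dict, parent: str = "", sep: str = "_", max_depth: int = 5) -> dict:
    """Flatten nested dicts: {"user": {"id": 1}} -> {"user_id": 1}."""
    flat = {}
    for key, value in obj.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict) and max_depth > 0:
            flat.update(flatten(value, name, sep, max_depth - 1))
        else:
            flat[name] = value
    return flat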

4. Large Files

  • Stream reading for large JSON arrays
  • Memory-efficient NDJSON processing

Parquet Files

1. Schema Reading

  • PyArrow or fastparquet for reading
  • Direct schema extraction (fast; see the sketch below)
  • Column statistics from metadata
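
With PyArrow, the schema and per-column statistics are available from file metadata without reading any rows; a sketch against the public pyarrow.parquet API:

import pyarrow.parquet as pq

pf = pq.ParquetFile("/data/transactions.parquet")
schema = pf.schema_arrow  # column names and Arrow types, no data read
meta = pf.metadata        # row count, row groups, compression codec, etc.

for i, field in enumerate(schema):
    stats = meta.row_group(0).column(i).statistics  # may be None
    print(field.name, field.type, stats.null_count if stats else "n/a")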

2. Type Mapping

# Parquet -> Contract types
"int32" → "integer"
"int64" → "bigint"
"float" → "float"
"double" → "double"
"string", "binary" → "text"
"bool" → "boolean"
"timestamp" → "datetime"
"date" → "date"
"list" → "array[type]"
"struct" → "json" or flatten

3. Partitioning

  • Detect Hive-style partitions (year=2024/month=01/); see the sketch below
  • Include partition columns in schema
  • Respect partitioning for performance
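
Partition columns can be recovered from the directory names themselves; a minimal sketch (the helper is hypothetical):

from pathlib import Path

def detect_partition_columns(dataset_root: str) -> list[str]:
    """Collect Hive-style partition keys (e.g. year=2024/month=01) from file paths."""
    keys: list[str] = []
    for parquet_file in Path(dataset_root).rglob("*.parquet"):
        for part in parquet_file.relative_to(dataset_root).parts[:-1]:
            if "=" in part:
                key = part.split("=", 1)[0]
                if key not in keys:
                    keys.append(key)
    return keys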

4. Compression

  • Detect compression codec from metadata
  • Include in destination contracts

Excel Files

1. Library Choice

  • openpyxl - For .xlsx (Office Open XML)
  • xlrd - For legacy .xls
  • pandas - High-level interface

2. Header Detection

  • Auto-detect header row (see the heuristic sketch below)
  • Handle merged cells
  • Skip title/info rows
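
One possible heuristic, assuming rows have already been read as tuples of cell values (the helper is hypothetical and ignores merged cells):

def detect_header_row(rows: list[tuple], max_scan: int = 10) -> int:
    """Pick the first scanned row whose cells are all non-empty strings."""
    for idx, row in enumerate(rows[:max_scan]):
        if row and all(isinstance(cell, str) and cell.strip() for cell in row):
            return idx
    return 0  # fall back to the first row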

3. Type Inference

Excel types to contract types (see the sketch after this list):

  • Numbers → numeric
  • Dates → date (Excel stores as numbers)
  • Text → text
  • Formulas → evaluate and infer type
  • Empty cells → nullable
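
Because openpyxl returns evaluated cell values as native Python objects when a workbook is opened with data_only=True, inference can key off those Python types. A rough sketch, assuming for brevity that the header is the first row:

import datetime
from openpyxl import load_workbook

def infer_excel_column_types(path: str, sheet: str, sample_rows: int = 100) -> dict[str, str]:
    wb = load_workbook(path, read_only=True, data_only=True)  # data_only: formula results, not formulas
    ws = wb[sheet]
    rows = ws.iter_rows(min_row=1, max_row=sample_rows + 1, values_only=True)
    header = next(rows)
    types: dict[str, str] = {}
    for row in rows:
        for name, value in zip(header, row):
            if value is None:
                continue  # empty cells only affect nullability
            if isinstance(value, bool):
                types.setdefault(name, "boolean")
            elif isinstance(value, (datetime.date, datetime.datetime)):
                types.setdefault(name, "date")
            elif isinstance(value, (int, float)):
                types.setdefault(name, "numeric")
            else:
                types.setdefault(name, "text")
    wb.close()
    return types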

4. Multiple Sheets

  • List available sheets
  • Generate contract per sheet
  • Or combine sheets with matching schema

Avro Files

1. Schema Extraction

  • Read embedded Avro schema (see the sketch below)
  • Convert to contract schema
  • Handle schema evolution
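
With fastavro, the embedded writer schema is available directly from the file reader; a sketch for a record schema:

from fastavro import reader

with open("/data/messages.avro", "rb") as fo:
    avro_reader = reader(fo)
    schema = avro_reader.writer_schema  # parsed dict form of the embedded schema
    for field in schema["fields"]:
        print(field["name"], field["type"])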

2. Type Mapping

# Avro -> Contract types
"int", "long" → "integer", "bigint"
"float", "double" → "float", "double"
"string" → "text"
"bytes" → "binary"
"boolean" → "boolean"
"null" → (nullable modifier)
"array" → "array[type]"
"map" → "json"
"record" → "json" or flatten
"union" → (handle nullable)

3. Union Types

Avro's nullable pattern:

["null", "string"]  // Nullable string

Convert to contract's nullable field.
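
One possible conversion, treating a two-branch union containing "null" as a nullable scalar and anything more complex as JSON (a sketch):

def avro_union_to_contract(union: list) -> tuple[str, bool]:
    """Map an Avro union such as ["null", "string"] to (avro_type, nullable)."""
    branches = [b for b in union if b != "null"]
    nullable = len(branches) < len(union)
    if len(branches) == 1 and isinstance(branches[0], str):
        return branches[0], nullable  # single named type, e.g. "string"
    return "json", nullable  # multi-type or complex unions fall back to JSON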

Implementation Plan

Phase 1: JSON Support (MVP)

  • ✅ NDJSON format support
  • ✅ JSON array format support
  • ✅ Basic type inference
  • ✅ Nested object detection (represent as JSON)
  • ✅ Source and destination contracts

Phase 2: Parquet Support

  • ✅ Schema reading with PyArrow
  • ✅ Type mapping
  • ✅ Column statistics
  • ✅ Partition detection
  • ✅ Source and destination contracts

Phase 3: Excel Support

  • ✅ Single sheet support (.xlsx)
  • ✅ Header detection
  • ✅ Type inference
  • ✅ Multiple sheet handling
  • ✅ Source and destination contracts

Phase 4: Avro Support

  • ✅ Schema extraction
  • ✅ Type mapping
  • ✅ Union type handling
  • ✅ Source and destination contracts

Phase 5: Advanced Features

  • Nested object flattening strategies
  • Schema registry integration (for Avro)
  • Excel formula evaluation
  • JSON schema validation
  • Parquet predicate pushdown optimization

Dependencies

Required Libraries

[project.optional-dependencies]
json = []  # Built-in
parquet = ["pyarrow>=10.0.0"]
excel = ["openpyxl>=3.0.0", "xlrd>=2.0.0"]
avro = ["fastavro>=1.7.0"]

Acceptance Criteria

JSON:

  • [ ] Can generate source contract from NDJSON file
  • [ ] Can generate source contract from JSON array file
  • [ ] Correctly infers types from nested JSON
  • [ ] Handles large files efficiently (streaming)
  • [ ] Supports flattening nested objects

Parquet:

  • [ ] Can generate source contract from Parquet file
  • [ ] Reads schema without loading data
  • [ ] Correctly maps Parquet types to contract types
  • [ ] Detects partition columns
  • [ ] Includes column statistics in quality metrics

Excel:

  • [ ] Can generate source contract from XLSX file
  • [ ] Correctly detects header row
  • [ ] Handles multiple sheets
  • [ ] Maps Excel types to contract types
  • [ ] Handles merged cells gracefully

Avro:

  • [ ] Can generate source contract from Avro file
  • [ ] Uses embedded schema
  • [ ] Handles union types (nullable)
  • [ ] Maps Avro types to contract types

All:

  • [ ] Documentation updated with examples
  • [ ] Tests cover happy path and error cases
  • [ ] Performance tested with large files

Alternative Approaches

Use pandas for Everything

  • Pros: Uniform interface, well-tested
  • Cons: Heavy dependency, not always most efficient

Recommendation: Use format-specific libraries for best performance and features.

Streaming vs Loading

For large files:

  • Streaming: Memory-efficient, slower type inference
  • Loading: Faster type inference, memory-intensive

Recommendation: Stream for sampling, load when files are small.
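
For NDJSON, sampling can stay streaming-friendly by reading only the first N records; a minimal sketch using itertools.islice:

import json
from itertools import islice

def sample_ndjson(path: str, sample_size: int = 100) -> list[dict]:
    """Read at most sample_size records without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in islice(f, sample_size) if line.strip()]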

Nice to Have

  • Auto-detect file format from extension or content
  • Schema evolution detection between file versions
  • Compression level optimization
  • Row group size optimization (Parquet)
  • JSON schema generation
  • Excel template generation from destination contract

Related Issues

  • #[number] - Support API endpoints
  • #[number] - Support database connections
  • #[number] - Support object storage (S3, GCS)
  • #[number] - Data quality profiling

Labels

enhancement, file-formats, data-source, data-destination, high-priority

