Enhance Data Contract Generation: Add File Format Support
Currently, the contract generator primarily supports CSV files. While CSV is a widely used format, modern data pipelines often involve a variety of file types. To make our contract generator more versatile and efficient, we need to extend its capabilities to handle additional file formats.
The Problem: Limited File Format Support
Our current contract generator is limited to CSV files, which presents a challenge for modern data pipelines that frequently work with various other formats, including:
- JSON: A common format for APIs, configurations, and semi-structured data.
- Parquet: A column-oriented format that's a standard for data lakes and warehouses.
- Excel/XLSX: A ubiquitous format in business contexts.
- Avro: A self-describing schema format, commonly used in Kafka and streaming systems.
This limitation forces users to convert their data to CSV before generating contracts, which adds extra steps, slows the workflow, and discards format-specific metadata such as embedded schemas and column statistics. Supporting these formats natively removes the conversion step and preserves that information.
Why Support Additional File Formats?
Native support for JSON, Parquet, Excel/XLSX, and Avro eliminates intermediate conversions to CSV, which reduces complexity, avoids data loss, and speeds up contract generation. It also lets the generator use format-specific features, such as schema introspection for Parquet and schema extraction for Avro, producing more accurate and richer contracts.
In short, users can generate contracts from data in its original format, which improves both the workflow and data integrity.
Use Cases
Let's explore some specific use cases for each of the file formats we plan to support:
JSON Files
- Newline-Delimited JSON (NDJSON): Often used for log files and streaming data.
- JSON Arrays: Commonly found in API responses and data exports.
- Nested JSON: Used for complex hierarchical data structures.
- Schema Inference: Automatically detect data types from multiple JSON samples to create accurate contracts.
Parquet Files
- Schema Introspection: Directly read the built-in schema of Parquet files.
- Column Statistics: Utilize min/max values and null counts for data validation.
- Partition Awareness: Recognize and handle Hive-style partitions for efficient data processing.
- Compression Detection: Automatically detect compression types like Snappy and Gzip.
Excel/XLSX Files
- Multiple Sheets: Treat each sheet as a separate data source.
- Header Detection: Automatically identify header rows, even with merged cells or multiple header rows.
- Type Inference: Convert Excel data types to appropriate contract types.
- Formula Evaluation: Read and use calculated values from Excel formulas.
Avro Files
- Schema Extraction: Use the embedded Avro schema for contract generation.
- Union Types: Properly handle Avro's nullable pattern.
- Complex Types: Support records, arrays, and maps within Avro files.
- Schema Evolution: Optionally handle evolving Avro schemas across file versions.
Proposed Solution
To address the limitations of our current contract generator, we propose the addition of format-specific contract generation tools. These tools will allow users to directly generate contracts from JSON, Parquet, Excel/XLSX, and Avro files, without the need for intermediate conversions. Each tool will be tailored to the specific characteristics of its respective file format, ensuring accurate and efficient contract generation.
Here's a breakdown of the proposed solutions for each file format:
1. JSON File Support
generate_json_source_contract(
source_id: str,
source_path: str,
json_format: str = "ndjson", # "ndjson", "array", "single"
sample_size: int = 100,
flatten_nested: bool = False, # Flatten nested objects
max_depth: int = 5, # Max nesting depth to analyze
config: dict[str, Any] | None = None
) -> SourceContract
generate_json_destination_contract(
destination_id: str,
destination_path: str,
json_format: str = "ndjson",
pretty_print: bool = False,
config: dict[str, Any] | None = None
) -> DestinationContract
Example:
# NDJSON file (one JSON object per line)
contract = generate_json_source_contract(
source_id="user_events",
source_path="/data/events.jsonl",
json_format="ndjson",
flatten_nested=True # Flatten {"user": {"id": 1}} -> {"user_id": 1}
)
# JSON array
contract = generate_json_source_contract(
source_id="api_response",
source_path="/data/users.json",
json_format="array" # Expects: [{
"id": 1
}, {
"id": 2
}, ...]
)
2. Parquet File Support
generate_parquet_source_contract(
source_id: str,
source_path: str,
read_schema_only: bool = False, # Don't read data, just schema
sample_size: int = 1000,
config: dict[str, Any] | None = None
) -> SourceContract
generate_parquet_destination_contract(
destination_id: str,
destination_path: str,
compression: str = "snappy", # "snappy", "gzip", "none"
partition_cols: list[str] | None = None,
config: dict[str, Any] | None = None
) -> DestinationContract
Example:
# Read Parquet schema
contract = generate_parquet_source_contract(
source_id="transactions",
source_path="/data/transactions.parquet",
read_schema_only=True # Fast - just read schema
)
# Partitioned Parquet destination
contract = generate_parquet_destination_contract(
destination_id="events_partitioned",
destination_path="/data/events/",
partition_cols=["year", "month", "day"] # Hive-style partitioning
)
3. Excel/XLSX Support
generate_excel_source_contract(
source_id: str,
source_path: str,
sheet_name: str | int | None = None, # None = first sheet
header_row: int = 0, # 0-indexed
skip_rows: int = 0,
sample_size: int = 100,
config: dict[str, Any] | None = None
) -> SourceContract
generate_excel_destination_contract(
destination_id: str,
destination_path: str,
sheet_name: str = "Sheet1",
config: dict[str, Any] | None = None
) -> DestinationContract
Example:
# Read specific sheet
contract = generate_excel_source_contract(
source_id="sales_data",
source_path="/data/Q4_Sales.xlsx",
sheet_name="Regional Sales",
header_row=2, # Header is on row 3 (0-indexed)
skip_rows=1 # Skip title row
)
# Multiple sheets (generate multiple contracts)
for sheet in ["North", "South", "East", "West"]:
contract = generate_excel_source_contract(
source_id=f"sales_{sheet.lower()}",
source_path="/data/Regional_Sales.xlsx",
sheet_name=sheet
)
4. Avro File Support
generate_avro_source_contract(
source_id: str,
source_path: str,
use_embedded_schema: bool = True, # Use Avro's built-in schema
sample_size: int = 100,
config: dict[str, Any] | None = None
) -> SourceContract
generate_avro_destination_contract(
destination_id: str,
destination_path: str,
codec: str = "snappy", # "snappy", "deflate", "null"
config: dict[str, Any] | None = None
) -> DestinationContract
Example:
# Avro file (schema is embedded)
contract = generate_avro_source_contract(
source_id="kafka_messages",
source_path="/data/messages.avro",
use_embedded_schema=True # Fast - read from file
)
Technical Considerations
JSON Files
1. Format Detection
- Auto-detect NDJSON vs JSON array
- Handle both .json and .jsonl extensions
- Detect nested vs flat structures
2. Type Inference
- Sample multiple records for accurate types
- Handle union types (field sometimes string, sometimes null)
- Detect arrays of primitives vs objects
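As a rough sketch of this sampling approach (not the proposed API), the helper below tallies the Python types observed per field across sampled records and marks a field nullable when None appears; the fallback to "text" for mixed-type fields is an assumption.
from collections import defaultdict

def infer_field_types(records: list[dict]) -> dict[str, str]:
    """Infer one type per field from sampled records; None values mark the field nullable."""
    seen: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    inferred = {}
    for field, names in seen.items():
        nullable = "NoneType" in names
        names.discard("NoneType")
        base = names.pop() if len(names) == 1 else "text"  # mixed or all-null fields fall back to "text"
        inferred[field] = f"{base} (nullable)" if nullable else base
    return inferred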
3. Nested Object Handling
Options for nested structures:
- Flatten: {"user": {"id": 1}} → {"user_id": 1}
- JSON column: Store nested object as JSON string
- Array of structs: Represent as structured type
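A minimal sketch of the flatten strategy, assuming an underscore separator and leaving the other two strategies to the contract layer:
def flatten_json(obj: dict, parent_key: str = "", sep: str = "_") -> dict:
    """Flatten nested objects, e.g. {"user": {"id": 1}} -> {"user_id": 1}."""
    flat: dict = {}
    for key, value in obj.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten_json(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat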
4. Large Files
- Stream reading for large JSON arrays
- Memory-efficient NDJSON processing
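For NDJSON, a bounded sample can be read line by line with the standard library, so type inference never needs the whole file in memory (a sketch, with infer_field_types from above as the hypothetical consumer):
import itertools
import json

def iter_ndjson_sample(path: str, sample_size: int = 100):
    """Yield up to sample_size parsed records without loading the full file."""
    with open(path) as fh:
        for line in itertools.islice(fh, sample_size):
            line = line.strip()
            if line:
                yield json.loads(line)

# e.g. infer_field_types(list(iter_ndjson_sample("/data/events.jsonl")))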
Parquet Files
1. Schema Reading
- PyArrow or fastparquet for reading
- Direct schema extraction (fast)
- Column statistics from metadata
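A sketch of metadata-only reading with PyArrow (one of the candidate libraries above); the returned shape is illustrative:
import pyarrow.parquet as pq

def read_parquet_metadata(path: str):
    """Read the Arrow schema and per-column min/max/null-count stats without loading data."""
    pf = pq.ParquetFile(path)
    schema = pf.schema_arrow
    stats: dict[str, list[tuple]] = {}
    meta = pf.metadata
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            if chunk.statistics is not None:
                stats.setdefault(chunk.path_in_schema, []).append(
                    (chunk.statistics.min, chunk.statistics.max, chunk.statistics.null_count)
                )
    return schema, stats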
2. Type Mapping
# Parquet -> Contract types
"int32" → "integer"
"int64" → "bigint"
"float" → "float"
"double" → "double"
"string", "binary" → "text"
"bool" → "boolean"
"timestamp" → "datetime"
"date" → "date"
"list" → "array[type]"
"struct" → "json" or flatten
3. Partitioning
- Detect Hive-style partitions (year=2024/month=01/)
- Include partition columns in schema
- Respect partitioning for performance
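Partition columns can be discovered by comparing the dataset schema with and without Hive partitioning enabled; a sketch using pyarrow.dataset:
import pyarrow.dataset as ds

def discover_partition_columns(path: str) -> list[str]:
    """Return Hive-style partition columns (e.g. year=2024/month=01/) found under path."""
    partitioned = ds.dataset(path, format="parquet", partitioning="hive").schema
    file_only = ds.dataset(path, format="parquet").schema
    return [name for name in partitioned.names if name not in file_only.names]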
4. Compression
- Detect compression codec from metadata
- Include in destination contracts
Excel Files
1. Library Choice
- openpyxl - For .xlsx (Office Open XML)
- xlrd - For legacy .xls
- pandas - High-level interface
2. Header Detection
- Auto-detect header row
- Handle merged cells
- Skip title/info rows
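One possible heuristic, sketched with openpyxl: treat the first fully populated all-text row near the top of the sheet as the header (title rows and rows broken up by merged cells usually fail this check):
from openpyxl import load_workbook

def detect_header_row(path: str, sheet_name: str | None = None, scan_rows: int = 10) -> int:
    """Return the 0-indexed row whose cells are all non-empty strings, else 0."""
    wb = load_workbook(path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active
    for idx, row in enumerate(ws.iter_rows(min_row=1, max_row=scan_rows, values_only=True)):
        if row and all(isinstance(cell, str) and cell.strip() for cell in row):
            return idx
    return 0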
3. Type Inference
Excel types to contract types:
- Numbers → numeric
- Dates → date (Excel stores as numbers)
- Text → text
- Formulas → evaluate and infer type
- Empty cells → nullable
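With data_only=True, openpyxl returns plain Python values (formulas come back as their cached results), so type inference reduces to a value-type check; a sketch with assumed contract type names:
import datetime

def infer_excel_cell_type(value) -> str | None:
    """Map an openpyxl cell value to an assumed contract type; None means empty/nullable."""
    if value is None:
        return None
    if isinstance(value, bool):               # check bool before int (bool subclasses int)
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    if isinstance(value, datetime.datetime):  # check datetime before date (datetime subclasses date)
        return "datetime"
    if isinstance(value, datetime.date):
        return "date"
    return "text"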
4. Multiple Sheets
- List available sheets
- Generate contract per sheet
- Or combine sheets with matching schema
Avro Files
1. Schema Extraction
- Read embedded Avro schema
- Convert to contract schema
- Handle schema evolution
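Reading the embedded schema is nearly a one-liner with fastavro (the dependency proposed below); a sketch:
import fastavro

def read_embedded_avro_schema(path: str) -> dict:
    """Return the writer schema stored in the Avro container file's header."""
    with open(path, "rb") as fo:
        return fastavro.reader(fo).writer_schema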
2. Type Mapping
# Avro -> Contract types
"int", "long" → "integer", "bigint"
"float", "double" → "float", "double"
"string" → "text"
"bytes" → "binary"
"boolean" → "boolean"
"null" → (nullable modifier)
"array" → "array[type]"
"map" → "json"
"record" → "json" or flatten
"union" → (handle nullable)
3. Union Types
Avro's nullable pattern:
["null", "string"] // Nullable string
Convert to contract's nullable field.
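A sketch of that conversion; how genuine multi-type unions are represented in the contract is still open, so they are passed through here:
def resolve_avro_union(avro_type) -> tuple[object, bool]:
    """Collapse Avro's ["null", X] pattern into (X, nullable=True)."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        nullable = len(non_null) < len(avro_type)
        if len(non_null) == 1:
            return non_null[0], nullable
        return non_null, nullable   # multi-type union, handled later
    return avro_type, False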
Implementation Plan
Phase 1: JSON Support (MVP)
- ✅ NDJSON format support
- ✅ JSON array format support
- ✅ Basic type inference
- ✅ Nested object detection (represent as JSON)
- ✅ Source and destination contracts
Phase 2: Parquet Support
- ✅ Schema reading with PyArrow
- ✅ Type mapping
- ✅ Column statistics
- ✅ Partition detection
- ✅ Source and destination contracts
Phase 3: Excel Support
- ✅ Single sheet support (.xlsx)
- ✅ Header detection
- ✅ Type inference
- ✅ Multiple sheet handling
- ✅ Source and destination contracts
Phase 4: Avro Support
- ✅ Schema extraction
- ✅ Type mapping
- ✅ Union type handling
- ✅ Source and destination contracts
Phase 5: Advanced Features
- Nested object flattening strategies
- Schema registry integration (for Avro)
- Excel formula evaluation
- JSON schema validation
- Parquet predicate pushdown optimization
Dependencies
Required Libraries
[project.optional-dependencies]
json = [] # Built-in
parquet = ["pyarrow>=10.0.0"]
excel = ["openpyxl>=3.0.0", "xlrd>=2.0.0"]
avro = ["fastavro>=1.7.0"]
Acceptance Criteria
JSON:
- [ ] Can generate source contract from NDJSON file
- [ ] Can generate source contract from JSON array file
- [ ] Correctly infers types from nested JSON
- [ ] Handles large files efficiently (streaming)
- [ ] Supports flattening nested objects
Parquet:
- [ ] Can generate source contract from Parquet file
- [ ] Reads schema without loading data
- [ ] Correctly maps Parquet types to contract types
- [ ] Detects partition columns
- [ ] Includes column statistics in quality metrics
Excel:
- [ ] Can generate source contract from XLSX file
- [ ] Correctly detects header row
- [ ] Handles multiple sheets
- [ ] Maps Excel types to contract types
- [ ] Handles merged cells gracefully
Avro:
- [ ] Can generate source contract from Avro file
- [ ] Uses embedded schema
- [ ] Handles union types (nullable)
- [ ] Maps Avro types to contract types
All:
- [ ] Documentation updated with examples
- [ ] Tests cover happy path and error cases
- [ ] Performance tested with large files
Alternative Approaches
Use pandas for Everything
- Pros: Uniform interface, well-tested
- Cons: Heavy dependency, not always most efficient
Recommendation: Use format-specific libraries for best performance and features.
Streaming vs Loading
For large files:
- Streaming: Memory-efficient, slower type inference
- Loading: Faster type inference, memory-intensive
Recommendation: Stream for sampling, load when files are small.
Nice to Have
- Auto-detect file format from extension or content
- Schema evolution detection between file versions
- Compression level optimization
- Row group size optimization (Parquet)
- JSON schema generation
- Excel template generation from destination contract
Related Issues
- #[number] - Support API endpoints
- #[number] - Support database connections
- #[number] - Support object storage (S3, GCS)
- #[number] - Data quality profiling
Labels
enhancement, file-formats, data-source, data-destination, high-priority