Refactor Large Files For Better Modularity

Alex Johnson

This document outlines a plan to refactor source files in the lazy-fortran and fortfront projects that exceed a 500-line soft limit. While these files are under the 1000-line hard limit, refactoring will improve modularity and maintainability.

Problem: Too Many Lines of Code!

Currently, 61 source files exceed the recommended 500-line soft limit, as defined in CLAUDE.md. Here's a breakdown:

  • 6 files: 900-999 lines (approaching the hard limit – highest priority)
  • 9 files: 800-899 lines
  • 15 files: 700-799 lines
  • 31 files: 500-699 lines
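
A breakdown like the one above could be reproduced with a short audit script. The following is a minimal sketch in Python, not the projects' actual tooling. The function names, the `*.f90` glob, and the choice to count every physical line (blank lines and comments included) are assumptions. Only the 500/1000 limits and the band boundaries come from this document.

```python
from collections import Counter
from pathlib import Path

SOFT_LIMIT = 500   # soft limit stated in CLAUDE.md
HARD_LIMIT = 1000  # hard limit stated in CLAUDE.md

# Bands used in the breakdown above, as (low, high) inclusive line counts.
BANDS = [(900, 999), (800, 899), (700, 799), (500, 699)]


def count_lines(path):
    """Count physical lines in a file (assumption: blanks and comments count)."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def band_for(line_count):
    """Return the (low, high) band a line count falls into, or None."""
    for low, high in BANDS:
        if low <= line_count <= high:
            return (low, high)
    return None


def audit(root, pattern="*.f90"):
    """Tally files under root per band.

    Files outside every band (below the soft limit, or at or above
    the hard limit) are not tallied.
    """
    tally = Counter()
    for path in Path(root).rglob(pattern):
        band = band_for(count_lines(path))
        if band is not None:
            tally[band] += 1
    return tally
```

Calling `audit("src")` on a hypothetical source directory returns a `Counter` keyed by band, which can be compared against the counts listed above.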

Files Approaching the Hard Limit (900-999 lines)

These files require immediate attention:

  1. input_validation.f90 - 999 lines (utilities/) - One line short of the hard limit. Currently mitigated by moving code into input_validation_part1.inc (894 lines).
  2. semantic_analyzer_expressions.f90 - 983 lines (semantic/analyzers/) - Already split across 3 .inc files.
  3. codegen_expressions.f90 - 977 lines (codegen/expressions/) - Uses .inc files.
  4. parser_expressions.f90 - 952 lines (parser/expressions/) - Uses .inc files.
  5. semantic_context.f90 - 928 lines (semantic/) - Uses .inc files.
  6. parse_declaration_helpers.f90 - 922 lines (parser/declarations/) - Uses .inc files.

Important Note: All six of these files already use .inc files to stay under the 1000-line limit. This keeps each .f90 file under the limit, but it only moves lines elsewhere; the module itself is no smaller or better organized.

Root Cause: Why Are These Files So Big?

These large files often act as aggregation modules. They serve as:

  1. A unified interface to subsystems.
  2. Inclusion points for multiple .inc files containing implementation details.
  3. Routers to specialized helper modules.

The size is driven by:

  • A large number of operator types in expression parsing/codegen.
  • The need to handle many node types during semantic analysis.
  • The complexity of input validation, which needs to cover numerous edge cases.

Current Mitigation: The .inc File Approach

As mentioned, all six files nearing the hard limit currently use .inc files. For example, input_validation.f90 uses input_validation_part1.inc (894 lines), and semantic_analyzer_expressions.f90 uses semantic_analyzer_expressions_part{1,2,3}.inc. This strategy prevents hard limit violations but introduces its own set of problems (see #2365 for a deeper dive into .inc file issues).

Proposed Solution: A Two-Phase Approach to Improve Modularity

We'll tackle this issue in two phases:

Phase 1: Split Files Near the Limit (900-999 lines)

For the six files closest to the 1000-line hard limit, we have two options:

Option A: Continue Using .inc Files (Short-Term, Safe)

  • Split the files into even more .inc files to get them under 800 lines.
  • Maintain the current architecture.
  • This is a quick and safe mitigation, with no API changes.

Option B: Proper Module Decomposition (Better, Riskier)

  • Use submodules (Fortran 2008) instead of .inc files. This offers better encapsulation and organization.
  • Extract logical components into separate modules.
  • This requires careful API design and carries more risk.

Recommendation: Start with Option A for immediate safety, then migrate to Option B as part of #2365.

Phase 2: Systematic Refactoring (500-899 lines)

For the remaining 55 files, we'll take a more deliberate approach:

Evaluate Each File: Ask these questions:

  1. Is it a legitimate aggregation module? (If so, its current size might be acceptable.)
  2. Can it be split into logical submodules?
  3. Would splitting improve or hurt the API?

Important: Don't split files just to meet an arbitrary line count. Only split if it genuinely improves:

  • Module cohesion (keeping related functionality together).
  • Interface clarity (creating a cleaner public API).
  • Testability (making components easier to test).

Detailed Breakdown by Module Type

Parser Modules (17 files, 500-952 lines):

These files tend to be large because they handle many different statement and expression types. Many already use .inc files. Submodule decomposition could be beneficial here, but careful planning is essential.

Semantic Analyzer Modules (12 files, 500-983 lines):

These files are large because they handle many AST node types across multiple analysis passes. They are good candidates for submodules organized by analysis phase.

Code Generator Modules (15 files, 500-977 lines):

Like the parser modules, these files are large because they emit many construct types. Grouping code by construct category is a plausible way to split them.

Standardizer Modules (8 files, 500-850 lines):

These files are large due to the complexity of the transformations they perform. Each handles a specific standardization phase. It's possible they are already optimally organized, so refactoring should be approached cautiously to avoid unnecessary complexity.

Other Modules (9 files):

This category includes utilities, frontend orchestration, etc. Review each file case by case to decide whether refactoring would improve it.

Acceptance Criteria

Phase 1 - Files Approaching Hard Limit:

  • [ ] All 6 files (900-999 lines) reduced to <800 lines
  • [ ] Use .inc splitting OR proper modularization
  • [ ] Maintain 100% test pass rate
  • [ ] No API breakage
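
The first Phase 1 criterion can be checked mechanically. Below is a minimal sketch in Python, assuming line counts have already been collected into a dictionary. The function name `phase1_failures` is hypothetical; the file names, current sizes, and the 800-line target are taken from this document. Whether lines in .inc files should count toward a file's total is not specified here, so the sketch looks only at the .f90 files.

```python
PHASE1_TARGET = 800  # Phase 1 goal from this document: fewer than 800 lines


def phase1_failures(line_counts, target=PHASE1_TARGET):
    """Return the files, largest first, that still have >= target lines."""
    over = {name: n for name, n in line_counts.items() if n >= target}
    return sorted(over, key=lambda name: over[name], reverse=True)


# Current sizes of the six Phase 1 files, as listed in this document.
current = {
    "input_validation.f90": 999,
    "semantic_analyzer_expressions.f90": 983,
    "codegen_expressions.f90": 977,
    "parser_expressions.f90": 952,
    "semantic_context.f90": 928,
    "parse_declaration_helpers.f90": 922,
}

# Before any refactoring, all six files fail the check.
print(phase1_failures(current))
```

An empty list from `phase1_failures` would mean the size criterion is met. The test pass rate and API stability criteria still need to be verified separately.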

Phase 2 - General Refactoring:

  • [ ] All 61 files evaluated for refactoring necessity
  • [ ] Files split only if it improves modularity (not just line count)
  • [ ] Maintain or improve API clarity
  • [ ] Document decisions for files kept large

Priority

  • P1 - High: 6 files approaching the 1000-line hard limit (900-999 lines)
  • P2 - Medium: 55 files exceeding the soft limit but well under the hard limit

Related Resources

  • #2365 - .inc file investigation (many of these files use .inc)
  • CLAUDE.md limits: soft limit 500 lines, hard limit 1000 lines
  • Related to module organization strategy

Important Notes

Don't blindly split files. Some modules are legitimately large aggregators. Focus on:

  1. Safety: Files near the hard limit must be addressed immediately.
  2. Quality: Split only when it demonstrably improves the design.
  3. Maintainability: Prefer clear APIs over arbitrary line counts. Strive for a balance between code length and readability.

For general refactoring practices and techniques, see Refactoring Guru.
