Manual Snapshot Sync: SYNC Command Design

Alex Johnson

Introduction: Taking Control of Snapshots

Manual Snapshot Synchronization, exposed through the SYNC command, gives operators precise control over the initial snapshot build in MygramDB. Its primary goal is to prevent unexpected load on the MySQL master server: instead of building snapshots automatically at startup, the server waits until an operator explicitly triggers synchronization. This design document covers the requirements, architecture, and command specification, along with conflict detection, state management, and shutdown handling. Compared to the previous automatic build process, this approach is more controlled and transparent, and it improves both operational safety and overall system performance.

Core Objectives

  • Operational Safety: Prevent excessive load on the MySQL master server during startup.
  • Flexibility: Enable operators to choose when snapshot synchronization occurs.
  • Transparency: Provide clear feedback on synchronization progress and replication status.
  • Multi-table Support: Facilitate independent synchronization for various tables.
  • Data Safety: Protect data integrity by managing conflicts effectively.

Requirements: Ensuring a Robust Implementation

The requirements for the SYNC command are multifaceted, addressing both functional and non-functional aspects. These requirements ensure that the command is not only powerful but also safe, efficient, and user-friendly. Functional requirements dictate what the system must do, while non-functional requirements focus on how well it performs these actions.

Functional Requirements: What the SYNC command must do

  1. Configuration Control: Allow the auto_initial_snapshot flag to disable automatic snapshot building.
  2. Manual Trigger: Implement the SYNC command for manual snapshot building.
  3. Progress Monitoring: Provide the SYNC STATUS command to monitor the synchronization process.
  4. Asynchronous Execution: Ensure the SYNC command returns immediately and runs in the background.
  5. Replication Management: Automatically start or restart binlog replication after the SYNC command.
  6. Conflict Prevention: Block conflicting operations like DUMP LOAD and REPLICATION START during the SYNC process.
  7. Multi-table Support: Enable independent SYNC operations for various tables.

Non-Functional Requirements: How well the SYNC command performs

  • Non-blocking: The SYNC command should not block other client connections.
  • Safe by Default: The default configuration must prevent unexpected MySQL load.
  • Graceful Shutdown: The system should cleanly cancel ongoing SYNC operations during server shutdown.
  • Memory Safe: Ensure memory availability before starting SYNC operations.
  • Thread-safe: Support concurrent SYNC operations on different tables.
  • Idempotent: Running SYNC multiple times should be safe with proper checks in place.

Architecture: Design and Component Interactions

The architecture of the SYNC command is designed to provide a smooth and efficient process for manual snapshot synchronization. It involves several components that interact to ensure data integrity and operational control. The default behavior change is a cornerstone of this new system, moving away from automatic snapshots on startup to manual control.

Default Behavior Change

| Aspect | Current Behavior | New Behavior (Default) |
|---|---|---|
| Startup | Auto-build snapshots | Skip snapshot build |
| MySQL Load | Immediate on startup | Only when operator runs SYNC |
| Replication | Auto-start after snapshot | Start after manual SYNC |
| Configuration | No control | auto_initial_snapshot: false |

High-Level SYNC Flow

The high-level flow illustrates the process from the client's perspective down to the background thread's execution:

  • Client: Issues SYNC articles command.
  • TcpServer::HandleSyncCommand():
    • Checks if the table is already syncing; rejects if it is.
    • Checks memory health and rejects if critical.
    • Marks the table as syncing.
    • Launches a background thread.
    • Returns "OK SYNC STARTED".
  • Background Thread: BuildSnapshotAsync():
    • Creates a MySQL connection.
    • Builds a snapshot using SnapshotBuilder.
    • Updates progress, including rows processed and the processing rate.
    • Captures the GTID (Global Transaction Identifier).
    • Starts or restarts the BinlogReader with the GTID.
    • Marks the process as completed.
    • Removes the table from the syncing set.

Component Interactions

The mermaid diagram below visualizes the component interactions:

graph TD
    A[Client] -->|SYNC articles| B[TcpServer]
    B --> C{Memory OK?}
    C -->|No| D[Error: Critical Memory]
    C -->|Yes| E{Already Syncing?}
    E -->|Yes| F[Error: SYNC in progress]
    E -->|No| G[Launch Background Thread]
    G --> H[BuildSnapshotAsync]
    H --> I[SnapshotBuilder]
    I --> J[MySQL Connection]
    J --> K[Capture GTID]
    K --> L[BinlogReader::StartFromGTID]
    L --> M[Mark Completed]

    N[Client] -->|SYNC STATUS| B
    B --> O[Return Progress]

    P[Client] -->|DUMP LOAD| Q[DumpHandler]
    Q --> R{Any Table Syncing?}
    R -->|Yes| S[Error: Cannot LOAD during SYNC]
    R -->|No| T[Proceed with LOAD]

    style H fill:#4ecdc4
    style L fill:#51cf66
    style S fill:#ff6b6b

Components

  1. SyncHandler: Manages the SYNC and SYNC STATUS commands.
  2. SyncState: Tracks the synchronization state for each table, including progress, status, and GTID.
  3. BuildSnapshotAsync(): The background thread function that builds the snapshot.
  4. Conflict Detector: Checks for syncing tables within the DumpHandler and ReplicationHandler.
  5. Shutdown Canceller: Cancels active SYNC operations during server shutdown.

Command Specification: Interacting with the SYNC Command

The SYNC and SYNC STATUS commands provide the primary interface for users to manage snapshot synchronization. These commands offer clear syntax and responses to ensure ease of use and effective monitoring of the process.

SYNC Command

  • Syntax:

    SYNC [table_name]
    
  • Examples:

    # Sync all tables
    SYNC
    
    # Sync specific table
    SYNC articles
    
  • Response (Success):

    OK SYNC STARTED table=articles job_id=1
    
  • Response (Error):

    ERROR SYNC already in progress for table 'articles'
    ERROR Memory critically low. Cannot start SYNC. Check system memory.
    ERROR Table 'products' not found in configuration
    

SYNC STATUS Command

  • Syntax:

    SYNC STATUS
    
  • Response (In Progress):

    table=articles status=IN_PROGRESS progress=10000/25000 rows (40%) rate=5000 rows/s
    
  • Response (Completed):

    table=articles status=COMPLETED rows=25000 time=5.2s gtid=xxxx:123 replication=STARTED
    
  • Response (Idle):

    status=IDLE message="No sync operation performed"
    
  • Response (Failed):

    table=articles status=FAILED rows=5000 error="MySQL connection lost"
    

Response Fields

The table below details the fields returned by the SYNC STATUS command:

| Field | Description | Example |
|---|---|---|
| table | Table name being synced | articles |
| status | Current status | IN_PROGRESS, COMPLETED, FAILED, IDLE, CANCELLED |
| progress | Current/total rows processed | 10000/25000 rows (40%) |
| rate | Processing rate | 5000 rows/s |
| rows | Total rows processed | 25000 |
| time | Total processing time | 5.2s |
| gtid | Captured snapshot GTID | xxxx:123 |
| replication | Replication status | STARTED, ALREADY_RUNNING, DISABLED, FAILED |
| error | Error message (if failed) | MySQL connection lost |

Replication Status Values

The replication field provides the status of the binlog replication:

  • STARTED: Binlog replication started from snapshot GTID (success).
  • ALREADY_RUNNING: Replication was already running (GTID not updated).
  • DISABLED: Replication is disabled in the configuration.
  • FAILED: Snapshot succeeded, but replication failed to start; check the logs.
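
The mapping from a replication start attempt onto these status values is implied rather than spelled out. The sketch below shows one way to produce them; BinlogReader::StartFromGTID comes from the component diagram, while IsRunning(), the binlog_reader_ member, and the helper name are assumptions for illustration only:

// Sketch only: maps the binlog start attempt onto the replication status
// strings listed above. Follows the ALREADY_RUNNING semantics (GTID not
// updated when a stream is already active).
std::string TcpServer::ReplicationStatusAfterSync(const std::string& gtid) {
  if (!replication_enabled_) {
    return "DISABLED";
  }
  if (binlog_reader_ && binlog_reader_->IsRunning()) {
    return "ALREADY_RUNNING";  // keep the existing stream; GTID not updated
  }
  if (binlog_reader_ && binlog_reader_->StartFromGTID(gtid)) {
    return "STARTED";
  }
  return "FAILED";  // snapshot succeeded, but replication did not start
}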

Conflict Detection: Ensuring Data Integrity

Conflict detection is a critical component of the SYNC command. It prevents data corruption by carefully managing how other operations interact with the synchronization process. This involves blocking conflicting commands and providing appropriate warnings.

Command Behavior During SYNC

The following table outlines how different commands behave during a SYNC operation:

| Command | Behavior | Rationale |
|---|---|---|
| SEARCH | ✅ Execute normally | Returns partial results (data loaded so far) |
| COUNT | ✅ Execute normally | Returns the current count (increases as sync progresses) |
| GET | ✅ Execute normally | Returns the document if loaded, or an error if not yet loaded |
| INFO | ✅ Execute normally | Shows current server state, including sync progress |
| OPTIMIZE | ⚠️ Allowed with caution | Safe but increases memory usage during sync |
| DUMP SAVE | ⚠️ Warning logged | Allowed but logs a warning (saving an incomplete snapshot) |
| DUMP LOAD | ❌ Error | Cannot load a dump while SYNC is in progress (potential data corruption) |
| SYNC (same table) | ❌ Error | SYNC already in progress for this table |
| SYNC (different table) | ✅ Start new sync | Multi-table support, independent operations |
| REPLICATION START | ❌ Error | Cannot start replication while SYNC is in progress; SYNC handles replication start |
| REPLICATION STOP | ✅ Execute normally | Independent operation |

Implementation: Syncing Tables Tracker

The syncing_tables_ set is used to track which tables are currently syncing to prevent conflicts.

Add to tcp_server.h:

// Track which tables are currently syncing (for conflict detection)
std::unordered_set<std::string> syncing_tables_;
std::mutex syncing_tables_mutex_;

Update server_types.h HandlerContext:

struct HandlerContext {
  // ... existing fields ...
  std::unordered_set<std::string>& syncing_tables;
  std::mutex& syncing_tables_mutex;
};
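
Because the new fields are references, TcpServer must bind them when it constructs the context. A minimal sketch of that wiring, with the construction site and the existing fields assumed:

// Sketch: inside TcpServer, when building the context passed to handlers.
// The elided fields stand in for the existing ones in the struct above.
HandlerContext ctx{
  // ... existing fields ...
  syncing_tables_,        // binds to HandlerContext::syncing_tables
  syncing_tables_mutex_,  // binds to HandlerContext::syncing_tables_mutex
};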

DumpHandler: Block DUMP LOAD

std::string DumpHandler::HandleDumpLoad(const query::Query& query) {
  // Check if any table is currently syncing
  {
    std::lock_guard<std::mutex> lock(ctx_.syncing_tables_mutex);
    if (!ctx_.syncing_tables.empty()) {
      std::ostringstream oss;
      oss << "Cannot load dump while SYNC is in progress for tables: ";
      for (const auto& table : ctx_.syncing_tables) {
        oss << table << " ";
      }
      return ResponseFormatter::FormatError(oss.str());
    }
  }
  // ... existing DUMP LOAD logic ...
}

DumpHandler: Warn on DUMP SAVE

std::string DumpHandler::HandleDumpSave(const query::Query& query) {
  // Warn if any table is currently syncing
  {
    std::lock_guard<std::mutex> lock(ctx_.syncing_tables_mutex);
    if (!ctx_.syncing_tables.empty()) {
      spdlog::warn("DUMP SAVE executed while SYNC is in progress. Snapshot may be incomplete.");
    }
  }
  // ... existing DUMP SAVE logic ...
}

ReplicationHandler: Block REPLICATION START

std::string ReplicationHandler::HandleReplicationStart(const query::Query& query) {
  // Check if any table is currently syncing
  {
    std::lock_guard<std::mutex> lock(ctx_.syncing_tables_mutex);
    if (!ctx_.syncing_tables.empty()) {
      return ResponseFormatter::FormatError(
        "Cannot start replication while SYNC is in progress. "
        "SYNC will automatically start replication when complete.");
    }
  }
  // ... existing REPLICATION START logic ...
}
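
The same lock-and-check pattern appears in all three handlers above. It could be consolidated into a small helper; the sketch below is a suggestion, not part of the current design:

#include <mutex>
#include <sstream>
#include <string>

// Hypothetical helper: returns true and reports the affected tables if any
// SYNC is active. Consolidates the repeated lock-and-check pattern above.
bool AnySyncInProgress(HandlerContext& ctx, std::string* tables_out) {
  std::lock_guard<std::mutex> lock(ctx.syncing_tables_mutex);
  if (ctx.syncing_tables.empty()) {
    return false;
  }
  if (tables_out != nullptr) {
    std::ostringstream oss;
    for (const auto& table : ctx.syncing_tables) {
      oss << table << " ";
    }
    *tables_out = oss.str();
  }
  return true;
}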

State Management: Tracking the Synchronization Process

State management is critical for the SYNC command to track and report the synchronization progress accurately. The SyncState structure holds all necessary information about each synchronization operation.

SyncState Structure

struct SyncState {
  std::atomic<bool> is_running{false};
  std::string table_name;
  uint64_t total_rows = 0;
  std::atomic<uint64_t> processed_rows{0};
  std::chrono::steady_clock::time_point start_time;
  std::string status;  // "IDLE", "IN_PROGRESS", "COMPLETED", "FAILED", "CANCELLED"
  std::string error_message;
  std::string gtid;
  std::string replication_status;
};
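
The atomics make SyncState non-copyable, so status reporting should read it in place. As an illustration, here is a sketch of how the IN_PROGRESS line from the command specification could be assembled from this structure (the function name and exact layout are assumptions):

#include <chrono>
#include <cstdint>
#include <sstream>
#include <string>

// Sketch only: renders one table's SyncState into the SYNC STATUS line
// shown in the command specification.
std::string FormatSyncStatus(const SyncState& state) {
  std::ostringstream oss;
  if (state.status == "IN_PROGRESS") {
    const uint64_t processed = state.processed_rows.load();
    const double elapsed = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - state.start_time).count();
    const uint64_t rate =
        elapsed > 0.0 ? static_cast<uint64_t>(processed / elapsed) : 0;
    const int pct = state.total_rows > 0
        ? static_cast<int>((100 * processed) / state.total_rows) : 0;
    oss << "table=" << state.table_name << " status=IN_PROGRESS"
        << " progress=" << processed << "/" << state.total_rows
        << " rows (" << pct << "%) rate=" << rate << " rows/s";
  } else {
    oss << "table=" << state.table_name << " status=" << state.status;
  }
  return oss.str();
}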

Add to TcpServer

// Per-table sync state
std::unordered_map<std::string, SyncState> sync_states_;
std::mutex sync_mutex_;

// Active snapshot builders (for shutdown cancellation)
std::unordered_map<std::string, storage::SnapshotBuilder*> active_snapshot_builders_;
std::mutex snapshot_builders_mutex_;

// Shutdown flag
std::atomic<bool> shutdown_requested_{false};

HandleSyncCommand Implementation

std::string TcpServer::HandleSyncCommand(const query::Query& query) {
  std::lock_guard<std::mutex> lock(sync_mutex_);

  const std::string& table_name = query.table;

  // Reject unknown tables (matches the "not found in configuration"
  // error in the command specification)
  if (table_contexts_.find(table_name) == table_contexts_.end()) {
    return ResponseFormatter::FormatError(
      "Table '" + table_name + "' not found in configuration");
  }

  // Check if already running for this table
  if (sync_states_[table_name].is_running) {
    return ResponseFormatter::FormatError(
      "SYNC already in progress for table '" + table_name + "'");
  }

  // Check memory health before starting
  auto memory_health = utils::GetMemoryHealthStatus();
  if (memory_health == utils::MemoryHealthStatus::CRITICAL) {
    return ResponseFormatter::FormatError(
      "Memory critically low. Cannot start SYNC. Check system memory.");
  }

  // Mark table as syncing
  {
    std::lock_guard<std::mutex> sync_lock(syncing_tables_mutex_);
    syncing_tables_.insert(table_name);
  }

  // Initialize state
  sync_states_[table_name].is_running = true;
  sync_states_[table_name].status = "STARTING";
  sync_states_[table_name].table_name = table_name;

  // Launch async snapshot build
  std::thread([this, table_name]() {
    BuildSnapshotAsync(table_name);
  }).detach();

  return "OK SYNC STARTED table=" + table_name + " job_id=1";
}

BuildSnapshotAsync Implementation

void TcpServer::BuildSnapshotAsync(const std::string& table_name) {
  // Look up the state entry under the lock; std::unordered_map references
  // remain valid across later insertions, so the reference is safe to keep.
  SyncState* state_ptr = nullptr;
  {
    std::lock_guard<std::mutex> lock(sync_mutex_);
    state_ptr = &sync_states_[table_name];
  }
  auto& state = *state_ptr;
  state.status = "IN_PROGRESS";
  state.start_time = std::chrono::steady_clock::now();

  // RAII guard to ensure cleanup even on exceptions
  struct SyncGuard {
    TcpServer* server;
    std::string table;
    explicit SyncGuard(TcpServer* s, std::string t) : server(s), table(std::move(t)) {}
    ~SyncGuard() {
      // Remove from syncing set
      std::lock_guard<std::mutex> lock(server->syncing_tables_mutex_);
      server->syncing_tables_.erase(table);
    }
  };
  SyncGuard guard(this, table_name);

  try {
    // Create MySQL connection
    auto mysql_conn = CreateMysqlConnection();
    if (!mysql_conn || !mysql_conn->IsConnected()) {
      state.status = "FAILED";
      state.error_message = "Failed to connect to MySQL";
      state.is_running = false;
      return;
    }

    // Get table context
    auto* ctx = table_contexts_[table_name];

    // Build snapshot with cancellation support
    storage::SnapshotBuilder builder(*mysql_conn, *ctx->index, *ctx->doc_store,
                                     ctx->config, build_config_);

    // Store builder pointer for shutdown cancellation
    {
      std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
      active_snapshot_builders_[table_name] = &builder;
    }

    bool success = builder.Build([&](const auto& progress) {
      state.total_rows = progress.total_rows;
      state.processed_rows = progress.processed_rows;

      // Check for shutdown signal
      if (shutdown_requested_) {
        builder.Cancel();
      }
    });

    // Clear builder pointer
    {
      std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
      active_snapshot_builders_.erase(table_name);
    }

    // Check if cancelled by shutdown
    if (shutdown_requested_) {
      state.status = "CANCELLED";
      state.error_message = "Server shutdown requested";
      state.is_running = false;
      return;
    }

    if (success) {
      state.status = "COMPLETED";
      state.gtid = builder.GetSnapshotGTID();
      state.processed_rows = builder.GetProcessedRows();

      // Start or restart binlog replication if enabled
      if (replication_enabled_ && !state.gtid.empty()) {
        bool repl_started = StartOrRestartBinlogReplication(table_name, state.gtid);
        if (repl_started) {
          state.replication_status = "STARTED";
        } else {
          state.replication_status = "FAILED";
          state.error_message = "Snapshot succeeded but replication failed to start";
        }
      } else {
        state.replication_status = "DISABLED";
      }

      spdlog::info("SYNC completed for table: {} (rows={}, gtid={}, replication={})",
                   table_name, state.processed_rows.load(), state.gtid,
                   state.replication_status);
    } else {
      state.status = "FAILED";
      state.error_message = builder.GetLastError();
      spdlog::error("SYNC failed for table: {} - {}", table_name, state.error_message);
    }
  } catch (const std::exception& e) {
    state.status = "FAILED";
    state.error_message = e.what();
    spdlog::error("SYNC exception for table: {} - {}", table_name, e.what());
  }

  state.is_running = false;
}

Shutdown Handling: Ensuring a Clean Exit

Shutdown handling is essential to gracefully terminate the SYNC operations when the server receives a shutdown signal. This ensures that no data is left in an inconsistent state and prevents potential issues during the shutdown process.

Problem

Active SYNC operations must be cancelled cleanly when the server receives a shutdown signal (SIGTERM/SIGINT).

Solution

  1. Set the shutdown_requested_ flag.
  2. Cancel all active SnapshotBuilder instances.
  3. Wait for the SYNC operations to complete (with a timeout).
  4. Proceed with the normal shutdown sequence.

Implementation

void TcpServer::Stop() {
  shutdown_requested_ = true;

  // Cancel all active SYNC operations
  {
    std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
    for (auto& [table_name, builder] : active_snapshot_builders_) {
      spdlog::info("Cancelling SYNC for table: {}", table_name);
      builder->Cancel();
    }
  }

  // Wait for SYNC operations to complete (with timeout)
  auto wait_start = std::chrono::steady_clock::now();
  constexpr int kSyncWaitTimeoutSec = 30;

  while (true) {
    {
      // syncing_tables_ is shared with SYNC threads; check it under the lock
      std::lock_guard<std::mutex> lock(syncing_tables_mutex_);
      if (syncing_tables_.empty()) {
        break;
      }
    }

    auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - wait_start).count();

    if (elapsed > kSyncWaitTimeoutSec) {
      spdlog::warn("Timeout waiting for SYNC operations to complete");
      break;
    }

    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  // ... existing Stop() logic ...
}

Configuration: Customizing SYNC Behavior

Configuration settings allow operators to customize the SYNC command's behavior according to the specific needs of their deployments. This section outlines the new configuration field and its integration within the system.

New Configuration Field

The auto_initial_snapshot setting controls whether snapshots are automatically built on server startup or if the SYNC command must be used. This provides flexibility and control over the snapshot process.

Add to ReplicationConfig (in config.h):

struct ReplicationConfig {
  bool enable = true;
  bool auto_initial_snapshot = false;  // NEW: Default false (manual sync required)
  uint32_t server_id = 0;
  std::string start_from = "snapshot";
  int queue_size = defaults::kReplicationQueueSize;
  int reconnect_backoff_min_ms = defaults::kReconnectBackoffMinMs;
  int reconnect_backoff_max_ms = defaults::kReconnectBackoffMaxMs;
};
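
For context, here is a sketch of how server startup might consult this flag. The OnStartup name and the members used here are assumptions, and the syncing_tables_ bookkeeping performed by HandleSyncCommand is elided for brevity:

// Sketch: startup path honoring auto_initial_snapshot.
void TcpServer::OnStartup() {
  if (config_.replication.enable && config_.replication.auto_initial_snapshot) {
    // Pre-1.X behavior: build snapshots for all configured tables now.
    for (const auto& entry : table_contexts_) {
      const std::string table_name = entry.first;
      std::thread([this, table_name]() { BuildSnapshotAsync(table_name); }).detach();
    }
  } else {
    spdlog::info("auto_initial_snapshot=false; waiting for manual SYNC");
  }
}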

Configuration Schema

The configuration schema defines the structure and types of the configuration settings.

Add to config-schema.json:

{
  "replication": {
    "properties": {
      "auto_initial_snapshot": {
        "type": "boolean",
        "default": false,
        "description": "Automatically build snapshots on server startup (default: false). Set to true to restore pre-1.0 behavior. When false, use SYNC command to manually trigger snapshot build."
      }
    }
  }
}

Configuration Examples

Here are some examples of how to configure the SYNC command using both YAML and JSON:

YAML (examples/config.yaml):

replication:
  enable: true
  auto_initial_snapshot: false  # Default: false (safe by default)
                                 # Set to true for automatic snapshot on startup
  server_id: 12345
  start_from: "snapshot"

JSON (examples/config.json):

{
  "replication": {
    "enable": true,
    "auto_initial_snapshot": false,
    "server_id": 12345,
    "start_from": "snapshot"
  }
}

Migration Guide: Transitioning to Manual Snapshot Synchronization

This guide outlines the steps for migrating to the manual snapshot synchronization process using the SYNC command. This ensures a smooth transition and minimizes any potential disruptions.

Breaking Change Notice

Starting from version 1.X.X, MygramDB no longer automatically builds snapshots on startup by default. This change prevents unexpected load on the MySQL master server.

Migration Options

There are two main options for migrating to the manual snapshot synchronization:

Option 1: Enable Auto Snapshot (Restore Old Behavior)

To restore the pre-1.X behavior, update the config.yaml:

replication:
  enable: true
  auto_initial_snapshot: true  # Restore pre-1.X behavior
  server_id: 12345
  start_from: "snapshot"

  • Use Case: Ideal for existing deployments that require automatic snapshots on startup.

Option 2: Use Manual SYNC (Recommended)

Follow these steps to use the manual SYNC command:

  1. Update the configuration to use the default auto_initial_snapshot: false setting.

  2. Start the MygramDB server.

  3. When ready, execute the SYNC command:

    # Option A: Use CLI client
    mygram-cli SYNC
    
    # Option B: Use telnet
    telnet localhost 11016
    SYNC articles
    
    # Monitor progress
    mygram-cli SYNC STATUS
    
  • Use Case: Recommended for new deployments and production environments where load control is critical.

Migration Checklist

The following checklist helps ensure a smooth migration:

  • Review the current startup behavior.
  • Update the configuration file.
  • Test the changes in a staging environment.
  • Update operational runbooks.
  • Train operators on the SYNC command usage.
  • Monitor the first production deployment.

Future Enhancements: Expanding the SYNC Command

This section outlines potential future enhancements for the SYNC command. These improvements aim to add greater functionality and flexibility to the system.

Phase 2 Features

  1. SYNC CANCEL Command (see the sketch after this list):
    • Abort the ongoing SYNC operation.
    • Clean up any partial data.
    • Release resources.
  2. SYNC ALL Command:
    • Sync all configured tables in parallel.
    • Provide aggregated progress reporting.
  3. Resume from Checkpoint:
    • Save progress periodically.
    • Resume from the last checkpoint on failure.
    • Useful for synchronizing very large tables.
  4. Scheduled SYNC:
    • Configuration option: sync_schedule: "0 2 * * *" (cron format).
    • Automatically run the SYNC command at specified times.
    • Useful for daily data refresh.
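
As a purely speculative sketch of the SYNC CANCEL idea, reusing the active_snapshot_builders_ map introduced earlier (the handler name, signature, and response text are assumptions):

// Speculative Phase 2 sketch: request cancellation of a running SYNC.
// SnapshotBuilder::Cancel() is the same call used in Stop(); the background
// thread observes the cancellation and finalizes the sync state.
std::string TcpServer::HandleSyncCancel(const std::string& table_name) {
  std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
  auto it = active_snapshot_builders_.find(table_name);
  if (it == active_snapshot_builders_.end()) {
    return ResponseFormatter::FormatError(
      "No SYNC in progress for table '" + table_name + "'");
  }
  it->second->Cancel();
  return "OK SYNC CANCEL REQUESTED table=" + table_name;
}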

Phase 3 Features

  1. Prometheus Metrics:
    • mygramdb_sync_in_progress{table="articles"} (gauge)
    • mygramdb_sync_rows_total{table="articles"} (counter)
    • mygramdb_sync_duration_seconds{table="articles"} (histogram)
    • mygramdb_sync_errors_total{table="articles"} (counter)
  2. SYNC LIST Command:
    • Display sync history.
    • Show past sync operations with timestamps.
    • Useful for auditing.
  3. Progress Percentage in INFO:
    • Add sync progress to the INFO command output.
    • Display progress across all tables.
    • Example: sync_progress: articles=45% products=completed.
  4. Parallel Batch Processing:
    • Split large tables into ranges.
    • Build multiple batches in parallel.
    • Merge results at the end.
    • Significant speedup for multi-million row tables.

Design Decisions Summary: Key Considerations

This summary highlights the critical design decisions made during the development of the SYNC command. These decisions were made to provide a balance of performance, safety, and operational ease.

1. Async Non-blocking Execution

  • Decision: The SYNC command runs in a background thread and returns immediately.
  • Rationale: Provides better user experience, allows progress monitoring, and prevents blocking other clients.
  • Tradeoff: Involves more complex state management.

2. Per-table Conflict Detection

  • Decision: Use the syncing_tables_ set to track active SYNC operations.
  • Rationale: Enables multi-table support and prevents data corruption.
  • Implementation: Checks in DumpHandler and ReplicationHandler.

3. Memory Health Check

  • Decision: Check memory availability before starting SYNC (similar to OPTIMIZE).
  • Rationale: Snapshot building can be memory-intensive for large tables.
  • Implementation: Use existing utils::GetMemoryHealthStatus().

4. Shutdown Cancellation

  • Decision: Cancel active SYNC operations during server shutdown.
  • Rationale: Ensures clean shutdown and prevents MySQL connection hangs.
  • Implementation: Store active SnapshotBuilder pointers and call Cancel() on shutdown.

5. BinlogReader Management

  • Decision: The SYNC command automatically starts or restarts the BinlogReader with the new GTID.
  • Rationale: Provides seamless replication setup and eliminates manual steps.
  • Tradeoff: Requires implementing BinlogReader restart logic.

6. DUMP SAVE Behavior

  • Decision: Allow DUMP SAVE during SYNC but log a warning.
  • Rationale: Not strictly dangerous, but the snapshot may be incomplete.
  • Alternative: Could block DUMP SAVE entirely (more restrictive).

7. Default to Manual Sync

  • Decision: auto_initial_snapshot: false by default.
  • Rationale: Safe by default, prevents unexpected MySQL master load.
  • Migration: This is a breaking change, and existing deployments require configuration updates.

The manual snapshot synchronization with the SYNC command provides a robust and flexible way to manage snapshots in MygramDB, ensuring both operational efficiency and data integrity.

For background on GTIDs and binlog replication, consult the official MySQL documentation.
