Manual Snapshot Sync: SYNC Command Design
Introduction: Taking Control of Snapshots
Manual Snapshot Synchronization, exposed as the SYNC command, gives operators precise control over the initial snapshot build in MygramDB. Its primary goal is to prevent unexpected load on the MySQL master server during startup: instead of the previous automatic build, which could strain the master every time the server started, the operator decides when synchronization runs. This document covers the requirements, architecture, and command specification, along with conflict detection, state management, and shutdown handling, to ensure data safety and operational flexibility.
Core Objectives
- Operational Safety: Prevent excessive load on the MySQL master server during startup.
- Flexibility: Enable operators to choose when snapshot synchronization occurs.
- Transparency: Provide clear feedback on synchronization progress and replication status.
- Multi-table Support: Facilitate independent synchronization for various tables.
- Data Safety: Protect data integrity by managing conflicts effectively.
Requirements: Ensuring a Robust Implementation
The requirements for the SYNC command are multifaceted, addressing both functional and non-functional aspects. These requirements ensure that the command is not only powerful but also safe, efficient, and user-friendly. Functional requirements dictate what the system must do, while non-functional requirements focus on how well it performs these actions.
Functional Requirements: What the SYNC command must do
- Configuration Control: Allow the auto_initial_snapshot flag to disable automatic snapshot building.
- Manual Trigger: Implement the SYNC command for manual snapshot building.
- Progress Monitoring: Provide the SYNC STATUS command to monitor the synchronization process.
- Asynchronous Execution: Ensure the SYNC command returns immediately and runs in the background.
- Replication Management: Automatically start or restart binlog replication after the SYNC command.
- Conflict Prevention: Block conflicting operations like DUMP LOAD and REPLICATION START during the SYNC process.
- Multi-table Support: Enable independent SYNC operations for various tables.
Non-Functional Requirements: How well the SYNC command performs
- Non-blocking: The SYNC command should not block other client connections.
- Safe by Default: The default configuration must prevent unexpected MySQL load.
- Graceful Shutdown: The system should cleanly cancel ongoing SYNC operations during server shutdown.
- Memory Safe: Ensure memory availability before starting SYNC operations.
- Thread-safe: Support concurrent SYNC operations on different tables.
- Idempotent: Running SYNC multiple times should be safe with proper checks in place.
Architecture: Design and Component Interactions
The architecture of the SYNC command is designed to provide a smooth and efficient process for manual snapshot synchronization. It involves several components that interact to ensure data integrity and operational control. The default behavior change is a cornerstone of this new system, moving away from automatic snapshots on startup to manual control.
Default Behavior Change
| Aspect | Current Behavior | New Behavior (Default) |
|---|---|---|
| Startup | Auto-build snapshots | Skip snapshot build |
| MySQL Load | Immediate on startup | Only when operator runs SYNC |
| Replication | Auto-start after snapshot | Start after manual SYNC |
| Configuration | No control | auto_initial_snapshot: false |
High-Level SYNC Flow
The high-level flow illustrates the process from the client's perspective down to the background thread's execution:
- Client: Issues the SYNC articles command.
- TcpServer::HandleSyncCommand():
  - Checks if the table is already syncing; rejects if it is.
  - Checks memory health and rejects if critical.
  - Marks the table as syncing.
  - Launches a background thread.
  - Returns "OK SYNC STARTED".
- Background Thread: BuildSnapshotAsync():
  - Creates a MySQL connection.
  - Builds a snapshot using SnapshotBuilder.
  - Updates progress, including rows processed and the processing rate.
  - Captures the GTID (Global Transaction Identifier).
  - Starts or restarts the BinlogReader with the GTID.
  - Marks the process as completed.
  - Removes the table from the syncing set.
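Seen from the client, the flow above plays out as a session like the following (responses reuse the example values from the command specification later in this document; timings and row counts are illustrative):

SYNC articles
OK SYNC STARTED table=articles job_id=1
SYNC STATUS
table=articles status=IN_PROGRESS progress=10000/25000 rows (40%) rate=5000 rows/s
SYNC STATUS
table=articles status=COMPLETED rows=25000 time=5.2s gtid=xxxx:123 replication=STARTED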
Component Interactions
The mermaid diagram below visualizes the component interactions:
graph TD
A[Client] -->|SYNC articles| B[TcpServer]
B --> C{Memory OK?}
C -->|No| D[Error: Critical Memory]
C -->|Yes| E{Already Syncing?}
E -->|Yes| F[Error: SYNC in progress]
E -->|No| G[Launch Background Thread]
G --> H[BuildSnapshotAsync]
H --> I[SnapshotBuilder]
I --> J[MySQL Connection]
J --> K[Capture GTID]
K --> L[BinlogReader::StartFromGTID]
L --> M[Mark Completed]
N[Client] -->|SYNC STATUS| B
B --> O[Return Progress]
P[Client] -->|DUMP LOAD| Q[DumpHandler]
Q --> R{Any Table Syncing?}
R -->|Yes| S[Error: Cannot LOAD during SYNC]
R -->|No| T[Proceed with LOAD]
style H fill:#4ecdc4
style L fill:#51cf66
style S fill:#ff6b6b
Components
- SyncHandler: Manages the SYNC and SYNC STATUS commands.
- SyncState: Tracks the synchronization state for each table, including progress, status, and GTID.
- BuildSnapshotAsync(): The background thread function that builds the snapshot.
- Conflict Detector: Checks for syncing tables within the DumpHandler and ReplicationHandler.
- Shutdown Canceller: Cancels active SYNC operations during server shutdown.
Command Specification: Interacting with the SYNC Command
The SYNC and SYNC STATUS commands provide the primary interface for users to manage snapshot synchronization. These commands offer clear syntax and responses to ensure ease of use and effective monitoring of the process.
SYNC Command
- Syntax:
  SYNC [table_name]
- Examples:
  # Sync all tables
  SYNC
  # Sync specific table
  SYNC articles
- Response (Success):
  OK SYNC STARTED table=articles job_id=1
- Response (Error):
  ERROR SYNC already in progress for table 'articles'
  ERROR Memory critically low. Cannot start SYNC. Check system memory.
  ERROR Table 'products' not found in configuration
SYNC STATUS Command
- Syntax:
  SYNC STATUS
- Response (In Progress):
  table=articles status=IN_PROGRESS progress=10000/25000 rows (40%) rate=5000 rows/s
- Response (Completed):
  table=articles status=COMPLETED rows=25000 time=5.2s gtid=xxxx:123 replication=STARTED
- Response (Idle):
  status=IDLE message="No sync operation performed"
- Response (Failed):
  table=articles status=FAILED rows=5000 error="MySQL connection lost"
Response Fields
The table below details the fields returned by the SYNC STATUS command:
| Field | Description | Example |
|---|---|---|
| table | Table name being synced | articles |
| status | Current status | IN_PROGRESS, COMPLETED, FAILED, IDLE, CANCELLED |
| progress | Current/total rows processed | 10000/25000 rows (40%) |
| rate | Processing rate | 5000 rows/s |
| rows | Total rows processed | 25000 |
| time | Total processing time | 5.2s |
| gtid | Captured snapshot GTID | xxxx:123 |
| replication | Replication status | STARTED, ALREADY_RUNNING, DISABLED, FAILED |
| error | Error message (if failed) | MySQL connection lost |
Replication Status Values
The replication field provides the status of the binlog replication:
- STARTED: Binlog replication started from snapshot GTID (success).
- ALREADY_RUNNING: Replication was already running (GTID not updated).
- DISABLED: Replication is disabled in the configuration.
- FAILED: Snapshot succeeded, but replication failed to start; check the logs.
Conflict Detection: Ensuring Data Integrity
Conflict detection is a critical component of the SYNC command. It prevents data corruption by carefully managing how other operations interact with the synchronization process. This involves blocking conflicting commands and providing appropriate warnings.
Command Behavior During SYNC
The following table outlines how different commands behave during a SYNC operation:
| Command | Behavior | Rationale |
|---|---|---|
| SEARCH | ✅ Execute normally | Returns partial results (data loaded so far) |
| COUNT | ✅ Execute normally | Returns the current count (increases as sync progresses) |
| GET | ✅ Execute normally | Returns the document if loaded, or an error if not yet loaded |
| INFO | ✅ Execute normally | Shows current server state, including sync progress |
| OPTIMIZE | ⚠️ Allowed with caution | Safe but increases memory usage during sync |
| DUMP SAVE | ⚠️ Warning logged | Allowed but logs a warning (saving an incomplete snapshot) |
| DUMP LOAD | ❌ Error | Cannot load a dump while SYNC is in progress (potential data corruption) |
| SYNC (same table) | ❌ Error | SYNC already in progress for this table |
| SYNC (different table) | ✅ Start new sync | Multi-table support, independent operations |
| REPLICATION START | ❌ Error | Cannot start replication while SYNC is in progress; SYNC handles replication start |
| REPLICATION STOP | ✅ Execute normally | Independent operation |
Implementation: Syncing Tables Tracker
The syncing_tables_ set is used to track which tables are currently syncing to prevent conflicts.
Add to tcp_server.h:
// Track which tables are currently syncing (for conflict detection)
std::unordered_set<std::string> syncing_tables_;
std::mutex syncing_tables_mutex_;
Update server_types.h HandlerContext:
struct HandlerContext {
  // ... existing fields ...
  std::unordered_set<std::string>& syncing_tables;
  std::mutex& syncing_tables_mutex;
};
DumpHandler: Block DUMP LOAD
std::string DumpHandler::HandleDumpLoad(const query::Query& query) {
  // Check if any table is currently syncing
  {
    std::lock_guard<std::mutex> lock(ctx_.syncing_tables_mutex);
    if (!ctx_.syncing_tables.empty()) {
      std::ostringstream oss;
      oss << "Cannot load dump while SYNC is in progress for tables: ";
      for (const auto& table : ctx_.syncing_tables) {
        oss << table << " ";
      }
      return ResponseFormatter::FormatError(oss.str());
    }
  }
  // ... existing DUMP LOAD logic ...
}
DumpHandler: Warn on DUMP SAVE
std::string DumpHandler::HandleDumpSave(const query::Query& query) {
  // Warn if any table is currently syncing
  {
    std::lock_guard<std::mutex> lock(ctx_.syncing_tables_mutex);
    if (!ctx_.syncing_tables.empty()) {
      spdlog::warn("DUMP SAVE executed while SYNC is in progress. Snapshot may be incomplete.");
    }
  }
  // ... existing DUMP SAVE logic ...
}
ReplicationHandler: Block REPLICATION START
std::string ReplicationHandler::HandleReplicationStart(const query::Query& query) {
  // Check if any table is currently syncing
  {
    std::lock_guard<std::mutex> lock(ctx_.syncing_tables_mutex);
    if (!ctx_.syncing_tables.empty()) {
      return ResponseFormatter::FormatError(
          "Cannot start replication while SYNC is in progress. "
          "SYNC will automatically start replication when complete.");
    }
  }
  // ... existing REPLICATION START logic ...
}
State Management: Tracking the Synchronization Process
State management is critical for the SYNC command to track and report the synchronization progress accurately. The SyncState structure holds all necessary information about each synchronization operation.
SyncState Structure
struct SyncState {
  std::atomic<bool> is_running{false};
  std::string table_name;
  uint64_t total_rows = 0;
  std::atomic<uint64_t> processed_rows{0};
  std::chrono::steady_clock::time_point start_time;
  std::string status;  // "IDLE", "IN_PROGRESS", "COMPLETED", "FAILED", "CANCELLED"
  std::string error_message;
  std::string gtid;
  std::string replication_status;
};
Add to TcpServer
// Per-table sync state
std::unordered_map<std::string, SyncState> sync_states_;
std::mutex sync_mutex_;
// Active snapshot builders (for shutdown cancellation)
std::unordered_map<std::string, storage::SnapshotBuilder*> active_snapshot_builders_;
std::mutex snapshot_builders_mutex_;
// Shutdown flag
std::atomic<bool> shutdown_requested_{false};
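The design specifies the SYNC STATUS responses but does not list the handler that produces them. The sketch below shows one way to render them from sync_states_ under sync_mutex_; the HandleSyncStatusCommand() name, the multi-line output, and the omission of the time field are assumptions, and headers are omitted to match the other listings.
std::string TcpServer::HandleSyncStatusCommand() {
  std::lock_guard<std::mutex> lock(sync_mutex_);
  if (sync_states_.empty()) {
    return "status=IDLE message=\"No sync operation performed\"";
  }
  std::ostringstream oss;
  for (const auto& [table_name, state] : sync_states_) {
    oss << "table=" << table_name << " status=" << state.status;
    const uint64_t processed = state.processed_rows.load();
    if (state.status == "IN_PROGRESS") {
      // Progress and rate are derived from counters updated by BuildSnapshotAsync()
      oss << " progress=" << processed << "/" << state.total_rows << " rows";
      if (state.total_rows > 0) {
        oss << " (" << (processed * 100 / state.total_rows) << "%)";
      }
      double elapsed = std::chrono::duration<double>(
          std::chrono::steady_clock::now() - state.start_time).count();
      if (elapsed > 0) {
        oss << " rate=" << static_cast<uint64_t>(processed / elapsed) << " rows/s";
      }
    } else if (state.status == "COMPLETED") {
      oss << " rows=" << processed
          << " gtid=" << state.gtid
          << " replication=" << state.replication_status;
    } else if (state.status == "FAILED") {
      oss << " rows=" << processed
          << " error=\"" << state.error_message << "\"";
    }
    oss << "\n";
  }
  return oss.str();
}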
HandleSyncCommand Implementation
std::string TcpServer::HandleSyncCommand(const query::Query& query) {
  std::lock_guard<std::mutex> lock(sync_mutex_);
  const std::string& table_name = query.table;
  // Check if already running for this table
  if (sync_states_[table_name].is_running) {
    return ResponseFormatter::FormatError(
        "SYNC already in progress for table " + table_name);
  }
  // Check memory health before starting
  auto memory_health = utils::GetMemoryHealthStatus();
  if (memory_health == utils::MemoryHealthStatus::CRITICAL) {
    return ResponseFormatter::FormatError(
        "Memory critically low. Cannot start SYNC. Check system memory.");
  }
  // Mark table as syncing
  {
    std::lock_guard<std::mutex> sync_lock(syncing_tables_mutex_);
    syncing_tables_.insert(table_name);
  }
  // Initialize state
  sync_states_[table_name].is_running = true;
  sync_states_[table_name].status = "STARTING";
  sync_states_[table_name].table_name = table_name;
  // Launch async snapshot build
  std::thread([this, table_name]() {
    BuildSnapshotAsync(table_name);
  }).detach();
  return "OK SYNC STARTED table=" + table_name + " job_id=1";
}
BuildSnapshotAsync Implementation
void TcpServer::BuildSnapshotAsync(const std::string& table_name) {
  auto& state = sync_states_[table_name];
  state.status = "IN_PROGRESS";
  state.start_time = std::chrono::steady_clock::now();
  // RAII guard to ensure cleanup even on exceptions
  struct SyncGuard {
    TcpServer* server;
    std::string table;
    explicit SyncGuard(TcpServer* s, std::string t) : server(s), table(std::move(t)) {}
    ~SyncGuard() {
      // Remove from syncing set
      std::lock_guard<std::mutex> lock(server->syncing_tables_mutex_);
      server->syncing_tables_.erase(table);
    }
  };
  SyncGuard guard(this, table_name);
  try {
    // Create MySQL connection
    auto mysql_conn = CreateMysqlConnection();
    if (!mysql_conn || !mysql_conn->IsConnected()) {
      state.status = "FAILED";
      state.error_message = "Failed to connect to MySQL";
      state.is_running = false;
      return;
    }
    // Get table context
    auto* ctx = table_contexts_[table_name];
    // Build snapshot with cancellation support
    storage::SnapshotBuilder builder(*mysql_conn, *ctx->index, *ctx->doc_store,
                                     ctx->config, build_config_);
    // Store builder pointer for shutdown cancellation
    {
      std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
      active_snapshot_builders_[table_name] = &builder;
    }
    bool success = builder.Build([&](const auto& progress) {
      state.total_rows = progress.total_rows;
      state.processed_rows = progress.processed_rows;
      // Check for shutdown signal
      if (shutdown_requested_) {
        builder.Cancel();
      }
    });
    // Clear builder pointer
    {
      std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
      active_snapshot_builders_.erase(table_name);
    }
    // Check if cancelled by shutdown
    if (shutdown_requested_) {
      state.status = "CANCELLED";
      state.error_message = "Server shutdown requested";
      state.is_running = false;
      return;
    }
    if (success) {
      state.status = "COMPLETED";
      state.gtid = builder.GetSnapshotGTID();
      state.processed_rows = builder.GetProcessedRows();
      // Start or restart binlog replication if enabled
      if (replication_enabled_ && !state.gtid.empty()) {
        bool repl_started = StartOrRestartBinlogReplication(table_name, state.gtid);
        if (repl_started) {
          state.replication_status = "STARTED";
        } else {
          state.replication_status = "FAILED";
          state.error_message = "Snapshot succeeded but replication failed to start";
        }
      } else {
        state.replication_status = "DISABLED";
      }
      spdlog::info("SYNC completed for table: {} (rows={}, gtid={}, replication={})",
                   table_name, state.processed_rows.load(), state.gtid,
                   state.replication_status);
    } else {
      state.status = "FAILED";
      state.error_message = builder.GetLastError();
      spdlog::error("SYNC failed for table: {} - {}", table_name, state.error_message);
    }
  } catch (const std::exception& e) {
    state.status = "FAILED";
    state.error_message = e.what();
    spdlog::error("SYNC exception for table: {} - {}", table_name, e.what());
  }
  state.is_running = false;
}
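BuildSnapshotAsync() calls StartOrRestartBinlogReplication(), whose body is not specified above. A possible shape is sketched below, assuming a single binlog_reader_ member with IsRunning() and Stop() methods (assumptions) alongside the StartFromGTID() entry point shown in the component diagram.
bool TcpServer::StartOrRestartBinlogReplication(const std::string& table_name,
                                                const std::string& gtid) {
  // Assumed member: binlog_reader_ is the server's BinlogReader instance.
  if (!binlog_reader_) {
    spdlog::error("No BinlogReader configured; cannot start replication for {}", table_name);
    return false;
  }
  // If replication is already running, stop it before repositioning at the snapshot GTID.
  if (binlog_reader_->IsRunning()) {
    binlog_reader_->Stop();
  }
  // Resume the binlog stream from the GTID captured at snapshot time so that no
  // change between snapshot completion and replication start is lost.
  return binlog_reader_->StartFromGTID(gtid);
}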
Shutdown Handling: Ensuring a Clean Exit
Shutdown handling is essential to gracefully terminate the SYNC operations when the server receives a shutdown signal. This ensures that no data is left in an inconsistent state and prevents potential issues during the shutdown process.
Problem
Active SYNC operations must be cancelled cleanly when the server receives a shutdown signal (SIGTERM/SIGINT).
Solution
- Set the shutdown_requested_ flag.
- Cancel all active SnapshotBuilder instances.
- Wait for the SYNC operations to complete (with a timeout).
- Proceed with the normal shutdown sequence.
Implementation
void TcpServer::Stop() {
  shutdown_requested_ = true;
  // Cancel all active SYNC operations
  {
    std::lock_guard<std::mutex> lock(snapshot_builders_mutex_);
    for (auto& [table_name, builder] : active_snapshot_builders_) {
      spdlog::info("Cancelling SYNC for table: {}", table_name);
      builder->Cancel();
    }
  }
  // Wait for SYNC operations to complete (with timeout)
  auto wait_start = std::chrono::steady_clock::now();
  constexpr int kSyncWaitTimeoutSec = 30;
  while (true) {
    {
      // Take the lock for the emptiness check; SyncGuard erases entries from other threads.
      std::lock_guard<std::mutex> lock(syncing_tables_mutex_);
      if (syncing_tables_.empty()) {
        break;
      }
    }
    auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(
        std::chrono::steady_clock::now() - wait_start).count();
    if (elapsed > kSyncWaitTimeoutSec) {
      spdlog::warn("Timeout waiting for SYNC operations to complete");
      break;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  // ... existing Stop() logic ...
}
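How SIGTERM/SIGINT actually reaches Stop() is outside the scope of this design; one conventional wiring is sketched below. RunUntilShutdown and the polling interval are illustrative, and the handler only sets a flag because little else is async-signal-safe.
#include <chrono>
#include <csignal>
#include <thread>

namespace {
volatile std::sig_atomic_t g_shutdown_signal = 0;

void HandleShutdownSignal(int /*signum*/) {
  // Only set a flag here; the heavy lifting happens on a normal thread.
  g_shutdown_signal = 1;
}
}  // namespace

// Called from main() after the server has been constructed and started.
void RunUntilShutdown(TcpServer& server) {
  std::signal(SIGTERM, HandleShutdownSignal);
  std::signal(SIGINT, HandleShutdownSignal);
  while (g_shutdown_signal == 0) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  // Stop() cancels active SnapshotBuilders and waits for SYNC threads to drain.
  server.Stop();
}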
Configuration: Customizing SYNC Behavior
Configuration settings allow operators to customize the SYNC command's behavior according to the specific needs of their deployments. This section outlines the new configuration field and its integration within the system.
New Configuration Field
The auto_initial_snapshot setting controls whether snapshots are automatically built on server startup or if the SYNC command must be used. This provides flexibility and control over the snapshot process.
Add to ReplicationConfig (in config.h):
struct ReplicationConfig {
  bool enable = true;
  bool auto_initial_snapshot = false;  // NEW: Default false (manual sync required)
  uint32_t server_id = 0;
  std::string start_from = "snapshot";
  int queue_size = defaults::kReplicationQueueSize;
  int reconnect_backoff_min_ms = defaults::kReconnectBackoffMinMs;
  int reconnect_backoff_max_ms = defaults::kReconnectBackoffMaxMs;
};
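Server startup is then expected to consult this flag before building any snapshot. The following is a minimal sketch of that check; Server::Initialize and BuildInitialSnapshots are illustrative names for this document, not existing APIs.
void Server::Initialize(const Config& config) {
  // ... load table configuration, open listeners, etc. ...
  if (config.replication.enable && config.replication.auto_initial_snapshot) {
    // Pre-1.X behavior: build snapshots for all configured tables on startup
    // and start binlog replication afterwards.
    BuildInitialSnapshots();
  } else {
    // Default behavior: skip the snapshot build; the operator triggers it later
    // with SYNC, which also starts replication from the captured GTID.
    spdlog::info("auto_initial_snapshot is disabled; waiting for manual SYNC");
  }
}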
Configuration Schema
The configuration schema defines the structure and types of the configuration settings.
Add to config-schema.json:
{
  "replication": {
    "properties": {
      "auto_initial_snapshot": {
        "type": "boolean",
        "default": false,
        "description": "Automatically build snapshots on server startup (default: false). Set to true to restore pre-1.0 behavior. When false, use SYNC command to manually trigger snapshot build."
      }
    }
  }
}
Configuration Examples
Here are some examples of how to configure the SYNC command using both YAML and JSON:
YAML (examples/config.yaml):
replication:
  enable: true
  auto_initial_snapshot: false  # Default: false (safe by default)
  # Set to true for automatic snapshot on startup
  server_id: 12345
  start_from: "snapshot"
JSON (examples/config.json):
{
  "replication": {
    "enable": true,
    "auto_initial_snapshot": false,
    "server_id": 12345,
    "start_from": "snapshot"
  }
}
Migration Guide: Transitioning to Manual Snapshot Synchronization
This guide outlines the steps for migrating to the manual snapshot synchronization process using the SYNC command. This ensures a smooth transition and minimizes any potential disruptions.
Breaking Change Notice
Starting from version 1.X.X, MygramDB no longer automatically builds snapshots on startup by default. This change prevents unexpected load on the MySQL master server.
Migration Options
There are two main options for migrating to the manual snapshot synchronization:
Option 1: Enable Auto Snapshot (Restore Old Behavior)
To restore the pre-1.X behavior, update the config.yaml:
replication:
  enable: true
  auto_initial_snapshot: true  # Restore pre-1.X behavior
  server_id: 12345
  start_from: "snapshot"
- Use Case: Ideal for existing deployments that require automatic snapshots on startup.
Option 2: Use Manual SYNC (Recommended)
Follow these steps to use the manual SYNC command:
- Update the configuration to use the default auto_initial_snapshot: false setting.
- Start the MygramDB server.
- When ready, execute the SYNC command:
  # Option A: Use CLI client
  mygram-cli SYNC
  # Option B: Use telnet
  telnet localhost 11016
  SYNC articles
  # Monitor progress
  mygram-cli SYNC STATUS
- Use Case: Recommended for new deployments and production environments where load control is critical.
Migration Checklist
The following checklist helps ensure a smooth migration:
- Review the current startup behavior.
- Update the configuration file.
- Test the changes in a staging environment.
- Update operational runbooks.
- Train operators on SYNC command usage.
- Monitor the first production deployment.
Future Enhancements: Expanding the SYNC Command
This section outlines potential future enhancements for the SYNC command. These improvements aim to add greater functionality and flexibility to the system.
Phase 2 Features
- SYNC CANCEL Command:
  - Abort the ongoing SYNC operation.
  - Clean up any partial data.
  - Return resources.
- SYNC ALL Command:
  - Sync all configured tables in parallel.
  - Provide aggregated progress reporting.
- Resume from Checkpoint:
  - Save progress periodically.
  - Resume from the last checkpoint on failure.
  - Useful for synchronizing very large tables.
- Scheduled SYNC:
  - Configuration option: sync_schedule: "0 2 * * *" (cron format).
  - Automatically run the SYNC command at specified times.
  - Useful for daily data refresh.
Phase 3 Features
- Prometheus Metrics:
  - mygramdb_sync_in_progress{table="articles"} (gauge)
  - mygramdb_sync_rows_total{table="articles"} (counter)
  - mygramdb_sync_duration_seconds{table="articles"} (histogram)
  - mygramdb_sync_errors_total{table="articles"} (counter)
- SYNC LIST Command:
  - Display sync history.
  - Show past sync operations with timestamps.
  - Useful for auditing.
- Progress Percentage in INFO:
  - Add sync progress to the INFO command output.
  - Display progress across all tables.
  - Example: sync_progress: articles=45% products=completed.
- Parallel Batch Processing:
  - Split large tables into ranges.
  - Build multiple batches in parallel.
  - Merge results at the end.
  - Significant speedup for multi-million row tables.
Design Decisions Summary: Key Considerations
This summary highlights the critical design decisions made during the development of the SYNC command. These decisions were made to provide a balance of performance, safety, and operational ease.
1. Async Non-blocking Execution
- Decision: The SYNC command runs in a background thread and returns immediately.
- Rationale: Provides better user experience, allows progress monitoring, and prevents blocking other clients.
- Tradeoff: Involves more complex state management.
2. Per-table Conflict Detection
- Decision: Use the syncing_tables_ set to track active SYNC operations.
- Rationale: Enables multi-table support and prevents data corruption.
- Implementation: Checks in DumpHandler and ReplicationHandler.
3. Memory Health Check
- Decision: Check memory availability before starting SYNC (similar to OPTIMIZE).
- Rationale: Snapshot building can be memory-intensive for large tables.
- Implementation: Use the existing utils::GetMemoryHealthStatus().
4. Shutdown Cancellation
- Decision: Cancel active SYNC operations during server shutdown.
- Rationale: Ensures clean shutdown and prevents MySQL connection hangs.
- Implementation: Store active SnapshotBuilder pointers and call Cancel() on shutdown.
5. BinlogReader Management
- Decision: The SYNC command automatically starts or restarts the BinlogReader with the new GTID.
- Rationale: Provides seamless replication setup and eliminates manual steps.
- Tradeoff: Requires implementing BinlogReader restart logic.
6. DUMP SAVE Behavior
- Decision: Allow DUMP SAVE during SYNC but log a warning.
- Rationale: Not strictly dangerous, but the snapshot may be incomplete.
- Alternative: Could block DUMP SAVE entirely (more restrictive).
7. Default to Manual Sync
- Decision: auto_initial_snapshot: false by default.
- Rationale: Safe by default, prevents unexpected MySQL master load.
- Migration: This is a breaking change, and existing deployments require configuration updates.
The manual snapshot synchronization with the SYNC command provides a robust and flexible way to manage snapshots in MygramDB, ensuring both operational efficiency and data integrity.
For additional information, you can consult the official MySQL Documentation.