Fixing NEAR Node Sync Issues After Delayed Start
When operating a NEAR node, encountering synchronization problems, especially after a delayed start, can be a significant hurdle. This article delves into an issue where a NEAR node fails to sync after a delayed start, potentially stemming from state synchronization or retention issues. We'll explore the background, user story, and solutions to ensure your node remains reliable and resilient. This issue is particularly relevant in Multi-Party Computation (MPC) node setups where timely synchronization is crucial for network integrity.
Background: Understanding the Sync Failure
In the realm of blockchain technology, node synchronization is the process by which a node catches up with the current state of the network. This involves downloading and processing blocks of transactions, ensuring the node has an up-to-date copy of the blockchain ledger. When a node starts with a delay, it needs to retrieve a backlog of blocks to reach the current state. However, if the node cannot access these historical blocks, synchronization fails.
During tests of the MPC node startup sequence, it was observed that if a node isn't started immediately after the local network setup, it might fail to synchronize correctly. This issue persists even with recent fixes implemented, such as those from PR #1414, which aimed to address similar problems. The primary symptom is that the node synchronizes block headers but not the actual block data. This leads to errors like:
```
Failed to get balance. Waiting for sync? account=signer err=Account sam.test.near does not exist while viewing at block #0
error reading TEE accounts from chain: Account mpc-contract.test.near does not exist while viewing at block #0
INFO indexer: # 0 | Blocks processing: 0 | Blocks done: 0. Bps 0.00 b/s
```
These errors indicate that the node cannot access the account states at specific block heights, suggesting that the required historical data is missing. Discussions within the community point to the default `neard` configuration as a potential cause. By default, `neard` retains only a limited number of historical blocks (approximately 3-5 epochs). This limitation prevents nodes that start later from syncing because the necessary historical data has been pruned by garbage collection (GC).
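Before changing anything, it helps to confirm what the node is currently configured to retain. A minimal sketch, assuming the default config location of `~/.near/config.json` and that the `jq` tool is installed, prints the two settings discussed in this article:

```bash
# Print the node's current history-retention and state-sync settings.
# Assumes the default neard home directory (~/.near) and jq installed.
jq '{gc_num_epochs_to_keep, state_sync_enabled}' ~/.near/config.json
```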
Possible Solutions
To address this synchronization issue, two primary solutions have been identified:
- Increasing `gc_num_epochs_to_keep` in the `neard` configuration: This approach involves modifying the `neard` configuration to retain more historical epochs. By increasing the number of epochs the node keeps, delayed nodes have a higher chance of accessing the necessary historical data to synchronize. This method is generally preferred as it ensures a more complete historical record.
- Using a bucket/state-sync approach: If retaining the full history is not feasible due to storage constraints, a state-sync approach can be used. This involves syncing the latest state from a trusted source rather than replaying the entire history. However, this method might not be suitable for all scenarios, especially those requiring complete historical data for auditing or other purposes.
User Story: Ensuring Reliable Node Operation
The perspective of a MPC node operator is crucial in understanding the impact of this issue. Consider the following user story:
As an MPC node operator,
I want the node to be able to fully sync even if it starts after the network has progressed several epochs,
So that the system remains reliable and resilient to restarts or deployment delays.
This user story highlights the need for nodes to be resilient to delays. In a practical setting, nodes might be restarted for maintenance, experience unexpected downtime, or be deployed after the network has already been running for some time. The ability to synchronize despite these delays is paramount for maintaining network stability and reliability.
Acceptance Criteria: Defining Success
To ensure the proposed solutions effectively address the synchronization issue, specific acceptance criteria must be met. These criteria provide a clear benchmark for verifying that the problem is resolved.
- A node started after some time (e.g., after multiple epochs) can still sync fully. This is the core requirement. The node should be able to catch up with the network regardless of the delay in startup.
- The node successfully retrieves and processes all missing blocks, not just headers. It's not sufficient for the node to only sync headers; it must also retrieve the block data to have a complete view of the blockchain state.
- Relevant configuration parameters (e.g., `gc_num_epochs_to_keep`, `state_sync_enabled`) are clearly documented and set correctly for MPC localnet setups. Proper documentation and configuration are essential for preventing future issues and ensuring ease of use.
- No “account does not exist at block #0” or similar errors appear in the logs during sync (a quick log check is sketched after this list). The absence of these errors indicates that the node can access the necessary historical data.
- Verified that enabling `enable_debug_rpc` or increasing history retention resolves or isolates the issue. These steps help in diagnosing the root cause and verifying the effectiveness of the solutions.
- Root cause (e.g., localnet history pruning, missing headers, port/config issues) identified and documented. Understanding the root cause is crucial for preventing recurrence and implementing long-term solutions.
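The log-related criterion can be checked mechanically. A minimal sketch, assuming `neard` runs as a systemd service named `neard` (matching the restart commands used later in this article); adapt the command to however your setup collects logs:

```bash
# Scan the last hour of neard logs for the telltale sync error.
# Assumes a systemd unit named "neard"; adjust for your log setup.
journalctl -u neard --since "1 hour ago" \
  | grep "does not exist while viewing at block #0"
```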
Detailed Solutions and Implementation
To effectively address the synchronization issue, let’s delve deeper into the proposed solutions and how they can be implemented.
1. Increasing `gc_num_epochs_to_keep`
The `gc_num_epochs_to_keep` parameter in the `neard` configuration determines how many historical epochs the node will retain. By default, this value is set to a low number, typically around 3-5 epochs. This means that after a few epochs, the node will garbage collect older blocks to save storage space. For a node starting after a delay, this can be problematic as the required historical data might have been pruned.
Implementation Steps:
1. Locate the `config.json` file: The `neard` configuration file is usually located in the `~/.near/` directory. Navigate to this directory and open the `config.json` file using a text editor.
2. Modify `gc_num_epochs_to_keep`: Find the `gc_num_epochs_to_keep` parameter in the configuration file. It might look something like this: `"gc_num_epochs_to_keep": 5,`. Increase this value to a larger number; a value of 100 or more should be sufficient for most localnet setups. For example: `"gc_num_epochs_to_keep": 100,`.
3. Save the changes: Save the modified `config.json` file.
4. Restart `neard`: Restart the `neard` service for the changes to take effect. You can typically do this using systemd: `sudo systemctl restart neard`. A one-shot scripted variant of these steps is sketched below.
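For scripted or repeated localnet setups, the four steps above can be collapsed into a short shell sketch. This assumes the default config path and a systemd unit named `neard`; the value 100 is illustrative, not prescriptive:

```bash
# Back up the config, raise history retention to 100 epochs, restart.
# Assumes ~/.near/config.json and a systemd service named "neard".
cp ~/.near/config.json ~/.near/config.json.bak
jq '.gc_num_epochs_to_keep = 100' ~/.near/config.json.bak > ~/.near/config.json
sudo systemctl restart neard
```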
Benefits:
- Ensures that the node retains a longer history, making it more resilient to delayed starts.
- Simple to implement and configure.
- Reduces the likelihood of synchronization failures due to missing historical data.
Considerations:
- Increasing `gc_num_epochs_to_keep` will increase storage requirements as the node retains more data.
- May not be feasible for resource-constrained environments.
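To gauge whether the extra retention is affordable, check how much disk the node already consumes. A quick sketch, assuming the default data directory under `~/.near`:

```bash
# Show total disk usage of the node's data directory; storage grows
# roughly in proportion to gc_num_epochs_to_keep.
du -sh ~/.near/data
```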
2. Using a Bucket/State-Sync Approach
If retaining the full history is not feasible, a state-sync approach can be used. This involves syncing the latest state from a trusted source rather than replaying the entire history. This method is particularly useful for nodes that need to catch up quickly without downloading a large amount of historical data.
Implementation Steps:
1. Enable State Sync: Modify the `config.json` file to enable state sync. Look for the `state_sync_enabled` parameter and set it to `true`: `"state_sync_enabled": true,`.
2. Configure State Sync Source: Specify the source from which to sync the state. This might involve configuring a trusted peer or using a pre-existing state snapshot (an illustrative configuration is sketched after these steps).
3. Restart `neard`: Restart the `neard` service for the changes to take effect: `sudo systemctl restart neard`.
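For step 2, nearcore supports fetching state parts from external storage such as a cloud bucket. The sketch below shows the general shape of that configuration; the exact schema varies across nearcore versions, and `my-state-parts-bucket` is a hypothetical bucket name, so treat this as illustrative rather than a drop-in config:

```bash
# Illustrative only: enable state sync and point it at an external
# GCS bucket of state parts. Verify the schema against your nearcore
# version's documentation before relying on it.
jq '.state_sync_enabled = true
    | .state_sync = {"sync": {"ExternalStorage":
        {"location": {"GCS": {"bucket": "my-state-parts-bucket"}}}}}' \
  ~/.near/config.json > /tmp/config.json && mv /tmp/config.json ~/.near/config.json
```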
Benefits:
- Faster synchronization, especially for nodes that are significantly behind.
- Reduces storage requirements as the node does not need to download the entire history.
Considerations:
- Requires a trusted source for state synchronization.
- May not be suitable for all use cases, particularly those requiring full historical data.
- More complex to configure compared to increasing `gc_num_epochs_to_keep`.
Additional Troubleshooting Steps
In addition to the primary solutions, consider the following troubleshooting steps to further diagnose and resolve synchronization issues:
- Enable Debug RPC: Enabling the debug Remote Procedure Call (RPC) endpoints can provide valuable insights into the node’s operation and help identify potential issues. Set `enable_debug_rpc` to `true` in the `config.json` file.
- Check Network Connectivity: Ensure that the node can connect to the NEAR network. Verify that there are no firewall or network configuration issues blocking the connection.
- Review Logs: Examine the `neard` logs for any error messages or warnings. These logs can provide clues about the cause of the synchronization failure.
- Verify Ports: Ensure that the necessary ports are open and accessible. The default RPC port is 3030; the default peer-to-peer port is 24567.
- Check Configuration: Double-check the `config.json` file for any misconfigurations or typos.
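While troubleshooting, it is useful to confirm that block data (not just headers) is actually advancing. The node’s HTTP status endpoint reports this; a minimal sketch, assuming the default RPC port 3030 and `jq`:

```bash
# Poll the status endpoint every 5 seconds and print whether the node
# reports itself as syncing, plus its latest fully processed height.
while true; do
  curl -s http://localhost:3030/status \
    | jq '{syncing: .sync_info.syncing, height: .sync_info.latest_block_height}'
  sleep 5
done
```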
Resources and Additional Notes
Further information and discussions related to this issue can be found in the NEAR community channels and forums. The following resources may be helpful:
- NEAR Protocol Documentation: https://docs.near.org/
- NEAR GitHub Repository: https://github.com/near/nearcore
- NEAR Community Forum: https://gov.near.org/
Conclusion: Ensuring Robust Node Synchronization
In conclusion, addressing synchronization issues in NEAR nodes, especially after delayed starts, is crucial for maintaining network reliability and resilience. By understanding the potential causes, such as limited history retention, and implementing solutions like increasing `gc_num_epochs_to_keep` or using a state-sync approach, node operators can ensure their nodes remain in sync with the network. Additionally, thorough documentation, proper configuration, and proactive troubleshooting are essential for preventing future issues. By following the guidelines and best practices outlined in this article, MPC node operators can confidently manage their nodes and contribute to the overall health of the NEAR ecosystem.
For more information about NEAR Protocol and its technology, you can visit the official NEAR website.