Operations
Validator Restart Procedures
Understanding when and how to restart your validator safely is crucial for maintaining uptime and avoiding jailing.
Graceful Restart
When to use: Routine maintenance, configuration changes, or planned updates when you have time.
What it does: Sends a termination signal (SIGTERM) to the evmd process, allowing it to cleanly shut down by:
Completing any in-progress operations
Flushing data to disk
Closing database connections gracefully
Expected downtime: 10-30 seconds
# Stop the node gracefully
sudo systemctl stop evmd
# Verify process stopped (should return no results)
# pgrep is used instead of "ps aux | grep evmd", which always matches the grep process itself
pgrep -a evmd
# Start the node
sudo systemctl start evmd
# Monitor logs to ensure clean startup
sudo journalctl -u evmd -f --output cat
Emergency Restart
When to use: When the node is unresponsive, hung, or graceful restart fails.
What it does: Forces immediate termination (SIGKILL) without waiting for cleanup. Use it only when a graceful restart fails, because it can leave the database in an inconsistent state.
Risk: May require database repair if process was killed during a write operation.
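A minimal sketch of the forced path, assuming the node runs as a systemd unit named evmd as in the graceful example. The helper name emergency_restart and the two-second pause are illustrative assumptions, not part of evmd.

```shell
#!/bin/sh
# Sketch only: assumes evmd runs as a systemd unit named "evmd".
SERVICE="${SERVICE:-evmd}"

# emergency_restart is a hypothetical helper, not part of evmd.
emergency_restart() {
  # SIGKILL cannot be caught, so the process exits without any cleanup.
  sudo systemctl kill --signal=SIGKILL "$SERVICE"
  # Short pause (arbitrary) so the process is gone before starting again.
  sleep 2
  sudo systemctl start "$SERVICE"
}

# Run by hand, and only after "sudo systemctl stop evmd" has failed:
#   emergency_restart
#   sudo journalctl -u evmd -f --output cat
```

Watch the startup logs closely afterwards; database errors at this point are the sign that a repair or a restore is needed.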
Post-Restart Validation
Purpose: Verify your validator is running correctly and signing blocks after restart.
What to check:
Service is active and running
Node is syncing/synced with the network
Block height is increasing
Validator is participating in consensus (signing blocks)
Monitor for: 5-10 minutes after restart to ensure stable operation.
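The first three checks can be sketched as below. This assumes the CometBFT RPC is reachable at http://localhost:26657 and that curl and jq are installed; the helper names and the 15-second wait are illustrative. Confirming that the validator is actually signing depends on your chain's CLI or block explorer and is not covered here.

```shell
#!/bin/sh
# Sketch only: the RPC address and the 15-second wait are assumptions.
RPC="${RPC:-http://localhost:26657}"

# Succeeds only when both arguments are integers and the second is larger.
height_advanced() {
  case "$1$2" in
    ''|*[!0-9]*) return 1 ;;
  esac
  [ -n "$1" ] && [ -n "$2" ] && [ "$2" -gt "$1" ]
}

# Reads the latest block height from the CometBFT /status endpoint.
get_height() {
  curl -s "$RPC/status" | jq -r '.result.sync_info.latest_block_height'
}

# Hypothetical helper: service active, height increasing, sync status.
check_node() {
  systemctl is-active --quiet evmd || { echo "evmd service is not active"; return 1; }
  before="$(get_height)"
  sleep 15
  after="$(get_height)"
  if height_advanced "$before" "$after"; then
    echo "block height advancing: $before -> $after"
  else
    echo "block height not advancing (before=$before after=$after)"
    return 1
  fi
  # "false" means the node has caught up with the network.
  echo "catching_up: $(curl -s "$RPC/status" | jq -r '.result.sync_info.catching_up')"
}

# Usage: check_node
```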
Backup and Recovery
Protecting your validator keys and state is critical. Lost keys mean permanent loss of validator identity and staked funds.
⚠️ Double Signing Warning: While backups are essential, NEVER restore your priv_validator_key.json to a second node while another node is running with the same key. Two nodes signing with the same key will double sign, which is punished by slashing and typically by permanent removal from the validator set (tombstoning). Always stop the old node before migrating.
Regular Backup Schedule
Daily Backups:
Purpose: Protect critical keys and configuration that define your validator identity.
What's backed up:
priv_validator_key.json - Your validator signing key (most critical)
node_key.json - Your P2P network identity
keyring-file/ - Your account keys and addresses
config.toml and app.toml - Your node configuration
Why daily: Keys rarely change, but having recent backups ensures quick recovery if hardware fails.
Storage: Keep backups in multiple locations (external drive, encrypted cloud storage, offline USB).
Backup Script:
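A minimal sketch of such a backup, assuming the node home is ~/.evmd, that the key and config files sit under its config/ directory (the usual Cosmos SDK layout), and that the file keyring backend is in use. The function name, archive name, and destination directory are hypothetical.

```shell
#!/bin/sh
# Sketch only: the config/ and keyring-file/ paths are assumptions
# about the node home layout.

# backup_keys HOME_DIR DEST_DIR: archives keys and config, prints the archive path.
backup_keys() {
  home_dir="$1"; dest_dir="$2"
  archive="$dest_dir/validator-keys-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$dest_dir" || return 1
  # umask 077 keeps the archive readable by its owner only.
  ( umask 077
    tar -czf "$archive" -C "$home_dir" \
      config/priv_validator_key.json \
      config/node_key.json \
      config/config.toml \
      config/app.toml \
      keyring-file ) || return 1
  echo "$archive"
}

# Usage: backup_keys "$HOME/.evmd" "$HOME/validator-backups"
```

The archive contains unencrypted signing keys, so encrypt it before copying it to cloud storage or removable media.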
State Snapshot (Weekly):
Purpose: Back up blockchain data for faster recovery without re-syncing from genesis.
What's backed up: Complete database state (~/.evmd/data/) containing all blockchain data.
Why weekly: State is large (100s of GB), changes constantly, but you can always re-sync if needed.
Downtime required: Yes (5-30 minutes depending on disk speed).
State Snapshot Script:
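A minimal sketch of a snapshot run, assuming the systemd unit is named evmd and the data directory is ~/.evmd/data/ as stated above. The function name and archive naming are hypothetical. The node is stopped first so the database is not copied mid-write, which is where the downtime comes from.

```shell
#!/bin/sh
# Sketch only: assumes a systemd unit named "evmd".
SERVICE="${SERVICE:-evmd}"

# snapshot_state HOME_DIR DEST_DIR: archives HOME_DIR/data while the node is stopped.
snapshot_state() {
  home_dir="$1"; dest_dir="$2"
  archive="$dest_dir/state-$(date +%Y%m%d).tar.gz"
  mkdir -p "$dest_dir" || return 1
  sudo systemctl stop "$SERVICE" || return 1
  tar -czf "$archive" -C "$home_dir" data
  status=$?
  # Restart the node whether or not tar succeeded, to limit downtime.
  sudo systemctl start "$SERVICE"
  [ "$status" -eq 0 ] && echo "$archive"
}

# Usage: snapshot_state "$HOME/.evmd" /mnt/backups
```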
Note: State snapshots are optional but dramatically reduce recovery time (hours vs days).
Recovery Procedures
Recover from Key Backup:
When to use:
Lost or corrupted keys
Moving validator to new hardware (ONLY after stopping old hardware)
Disaster recovery
What it recovers: Your validator identity and configuration (does not restore blockchain state).
Recovery time: Keys restore instantly, but node may need to sync (hours to days without state snapshot).
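A minimal sketch of a key restore, assuming a tar.gz archive whose paths are relative to the node home (for example config/priv_validator_key.json). The function name is hypothetical. Confirm that no other node is running with the same key before starting the restored one.

```shell
#!/bin/sh
# Sketch only: assumes archive paths are relative to the node home.

# restore_keys ARCHIVE HOME_DIR: extracts keys and config into HOME_DIR.
restore_keys() {
  archive="$1"; home_dir="$2"
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  mkdir -p "$home_dir" || return 1
  tar -xzf "$archive" -C "$home_dir" || return 1
  # The signing key should be readable by its owner only.
  chmod 600 "$home_dir/config/priv_validator_key.json"
}

# Usage (with the old node stopped):
#   restore_keys /path/to/validator-keys.tar.gz "$HOME/.evmd"
#   sudo systemctl start evmd
```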
Recover from State Snapshot:
When to use:
Corrupted database
Faster recovery after hardware migration
Node won't start due to state errors
What it does: Restores blockchain data to a previous point, avoiding full re-sync.
Recovery time: Extraction (10-30 min) + sync from snapshot height to current (minutes to hours).
Important: State snapshots don't include keys. Always restore keys first if moving to new hardware.
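A minimal sketch of a state restore, assuming a tar.gz archive that contains a top-level data/ directory and a node that is already stopped. The function name is hypothetical. One cautious assumption goes beyond the text above: CometBFT normally records the last signed height in data/priv_validator_state.json, so the sketch keeps the current copy of that file instead of the older one from the archive.

```shell
#!/bin/sh
# Sketch only: run with the node stopped.

# restore_state ARCHIVE HOME_DIR: replaces HOME_DIR/data with the archived copy.
restore_state() {
  archive="$1"; home_dir="$2"
  [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }
  old="$home_dir/data.old.$(date +%s)"
  # Keep the previous data directory until the restored node is healthy.
  if [ -d "$home_dir/data" ]; then
    mv "$home_dir/data" "$old" || return 1
  fi
  tar -xzf "$archive" -C "$home_dir" || return 1
  # Assumption: keeping the newer signing state avoids re-signing old heights.
  if [ -f "$old/priv_validator_state.json" ]; then
    cp "$old/priv_validator_state.json" "$home_dir/data/priv_validator_state.json"
  fi
}

# Usage:
#   sudo systemctl stop evmd
#   restore_state /mnt/backups/state.tar.gz "$HOME/.evmd"
#   sudo systemctl start evmd
```

Delete the data.old.* directory once the node has caught up, since it holds a second full copy of the state.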
Monitoring Setup
Prometheus Metrics
Enable Prometheus metrics in config.toml:
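In CometBFT-based nodes the switch normally lives in the [instrumentation] section as prometheus = false, with prometheus_listen_addr defaulting to ":26660". Assuming that layout, the change can be scripted as in the sketch below; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes config.toml contains the line "prometheus = false".

# enable_prometheus CONFIG_FILE: flips the prometheus switch to true.
enable_prometheus() {
  cfg="$1"
  [ -f "$cfg" ] || { echo "config not found: $cfg" >&2; return 1; }
  # Edit via a temp file so a failed sed never truncates the config;
  # "cat >" keeps the original file's permissions.
  sed 's/^prometheus = false/prometheus = true/' "$cfg" > "$cfg.tmp" || return 1
  cat "$cfg.tmp" > "$cfg" && rm -f "$cfg.tmp"
}

# Usage:
#   enable_prometheus "$HOME/.evmd/config/config.toml"
#   sudo systemctl restart evmd
#   curl -s localhost:26660/metrics | head
```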
Prometheus Configuration (prometheus.yml):
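A minimal scrape configuration, written out by a small shell helper. The job name evmd, the 15-second interval, and the target localhost:26660 are assumptions; adjust the target to match prometheus_listen_addr in your config.toml.

```shell
#!/bin/sh
# Sketch only: job name, interval, and target are illustrative.

# write_prometheus_config OUT_FILE [TARGET]: writes a one-job prometheus.yml.
write_prometheus_config() {
  out="$1"; target="${2:-localhost:26660}"
  cat > "$out" <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "evmd"
    static_configs:
      - targets: ["$target"]
EOF
}

# Usage: write_prometheus_config ./prometheus.yml localhost:26660
```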
Key Metrics to Monitor
Alerting Configuration
Alert Script Example
Purpose: Automated monitoring script that sends Slack notifications when critical validator issues are detected.
What it monitors:
Validator jailed status (critical)
Missed block count (warning threshold: 100 blocks)
Node sync status (out of sync detection)
Setup Instructions:
Create Slack webhook:
Go to https://api.slack.com/messaging/webhooks
Create a webhook for your workspace
Copy the webhook URL
Set environment variable:
Create and configure the script:
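A minimal sketch of the decision and notification logic such a script needs. The variable names SLACK_WEBHOOK_URL and MISSED_BLOCKS_THRESHOLD and all function names are assumptions. How the jailed flag and missed-block count are fetched depends on your chain's CLI and is deliberately left out; the sync flag can be read from the catching_up field of the node's RPC /status response.

```shell
#!/bin/sh
# Sketch only: variable and function names are assumptions.
# Set the webhook beforehand, e.g. export SLACK_WEBHOOK_URL="<your webhook URL>"
MISSED_BLOCKS_THRESHOLD="${MISSED_BLOCKS_THRESHOLD:-100}"

# classify JAILED MISSED CATCHING_UP: prints CRITICAL, WARNING, or INFO.
# Fails (prints nothing) when any input is not in the expected form.
classify() {
  jailed="$1"; missed="$2"; catching_up="$3"
  case "$jailed" in true|false) ;; *) return 1 ;; esac
  case "$catching_up" in true|false) ;; *) return 1 ;; esac
  case "$missed" in ''|*[!0-9]*) return 1 ;; esac
  if [ "$jailed" = "true" ]; then
    echo CRITICAL
  elif [ "$missed" -ge "$MISSED_BLOCKS_THRESHOLD" ] || [ "$catching_up" = "true" ]; then
    echo WARNING
  else
    echo INFO
  fi
}

# Posts plain text to a Slack incoming webhook.
# The message must not contain double quotes or backslashes (no JSON escaping here).
send_slack() {
  [ -n "${SLACK_WEBHOOK_URL:-}" ] || { echo "SLACK_WEBHOOK_URL is not set" >&2; return 1; }
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL"
}

# report SEVERITY MESSAGE: always logs with a timestamp, alerts only above INFO.
report() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') [$1] $2"
  [ "$1" = "INFO" ] || send_slack "[$1] $2"
}

# Usage, once the three values have been fetched:
#   severity="$(classify "$jailed" "$missed" "$catching_up")" \
#     && report "$severity" "jailed=$jailed missed=$missed catching_up=$catching_up"
```

Keeping classification separate from data collection makes the thresholds easy to test without a running node.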
Script Features:
Error handling: Validates all retrieved values before use
Configurable: Uses environment variables for flexibility
Logging: Prints timestamped events to stdout
Security: Webhook URL stored in environment, not in script
Thresholds: Configurable missed blocks threshold
Cron Setup:
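A sketch of a crontab entry for the monitoring script. The five-minute interval and both paths are assumptions. Cron jobs do not inherit your login shell's environment, so the webhook variable has to be defined in the crontab itself or inside the script.

```shell
#!/bin/sh
# Sketch only: interval and paths are illustrative.

# cron_entry SCRIPT LOGFILE: prints a crontab line that runs SCRIPT every 5 minutes.
cron_entry() {
  printf '%s\n' "*/5 * * * * $1 >> $2 2>&1"
}

# To install, append the variable and the entry to the current crontab:
#   ( crontab -l 2>/dev/null
#     echo 'SLACK_WEBHOOK_URL=<your webhook URL>'
#     cron_entry /opt/validator/alert.sh /var/log/validator-alert.log
#   ) | crontab -
```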
Testing the Script:
Alert Severity Levels:
🚨 CRITICAL: Validator jailed - requires immediate action
⚠️ WARNING: High missed blocks or out of sync - investigate soon
✅ INFO: Regular monitoring checks (logged, not alerted)
Next Steps
Review setup procedures in Setup & Configuration
Follow step-by-step procedures in Runbooks
Troubleshoot common issues in Troubleshooting
Access quick reference in Additional Resources