Operations

Validator Restart Procedures

Understanding when and how to restart your validator safely is crucial for maintaining uptime and avoiding jailing.

Graceful Restart

  • When to use: Routine maintenance, configuration changes, or planned updates when you have time.

  • What it does: Sends a termination signal (SIGTERM) to the evmd process, allowing it to cleanly shut down by:

    • Completing any in-progress operations

    • Flushing data to disk

    • Closing database connections gracefully

  • Expected downtime: 10-30 seconds

# Stop the node gracefully
sudo systemctl stop evmd

# Verify process stopped (should return no results)
# pgrep avoids the false positive where grep matches its own command line
pgrep -x evmd

# Start the node
sudo systemctl start evmd

# Monitor logs to ensure clean startup
sudo journalctl -u evmd -f --output cat

Emergency Restart

  • When to use: When the node is unresponsive, hung, or graceful restart fails.

  • What it does: Forces immediate termination (SIGKILL) without waiting for cleanup. Use it only when a graceful restart fails, because it can leave the database in an inconsistent state.

  • Risk: May require database repair if process was killed during a write operation.
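The escalation described above can be sketched as a small helper that tries a graceful stop first and sends SIGKILL only if that fails. This is a minimal sketch, assuming the same systemd unit name (evmd) as the graceful restart steps. The function name force_restart and the 30-second grace period are illustrative choices, not part of the original procedure.

```shell
# Sketch: escalate from graceful stop to SIGKILL only when needed.
# Assumes a systemd unit named "evmd"; force_restart is a hypothetical helper.
force_restart() {
    local unit="${1:-evmd}" grace="${2:-30}"

    # Try the graceful path first and give up after $grace seconds.
    if ! timeout "$grace" sudo systemctl stop "$unit"; then
        echo "graceful stop failed or timed out; sending SIGKILL" >&2
        sudo systemctl kill --signal=SIGKILL "$unit"
    fi

    sudo systemctl start "$unit"
}
```

Calling force_restart evmd 30 and then following the logs with journalctl, as in the graceful procedure, shows whether the database needs repair on startup.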

Post-Restart Validation

  • Purpose: Verify your validator is running correctly and signing blocks after restart.

  • What to check:

    • Service is active and running

    • Node is syncing/synced with the network

    • Block height is increasing

    • Validator is participating in consensus (signing blocks)

  • Monitor for: 5-10 minutes after restart to ensure stable operation.
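Some of these checks can be scripted. The sketch below assumes the CometBFT RPC endpoint is listening on its default address (localhost:26657) and that curl and jq are installed. The function names are illustrative. It covers service status, sync status, and block height only; confirming that the validator is signing still needs a separate check (for example, a block explorer).

```shell
# Sketch of post-restart checks. RPC defaults to the standard CometBFT
# RPC address; override it if your node listens elsewhere.
RPC="${RPC:-http://localhost:26657}"

# Print "<latest_block_height> <catching_up>" from the RPC /status endpoint.
node_status() {
    curl -s "$RPC/status" |
        jq -r '.result.sync_info | "\(.latest_block_height) \(.catching_up)"'
}

# Succeed only if the height advanced between two samples and the node
# reports that it is no longer catching up.
is_healthy() {
    local h1="$1" h2="$2" catching_up="$3"
    [ "$h2" -gt "$h1" ] 2>/dev/null && [ "$catching_up" = "false" ]
}

post_restart_check() {
    local wait="${1:-15}" h1 h2 catching_up
    systemctl is-active --quiet evmd || { echo "service not active"; return 1; }

    h1=$(node_status | cut -d' ' -f1)
    sleep "$wait"
    set -- $(node_status)
    h2="$1"; catching_up="$2"

    if is_healthy "$h1" "$h2" "$catching_up"; then
        echo "OK: height $h1 -> $h2, synced"
    else
        echo "NOT READY: height $h1 -> $h2, catching_up=$catching_up"
        return 1
    fi
}
```

Running post_restart_check a few times over the 5-10 minute window gives a quick view of whether the node is stable.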

Backup and Recovery

Protecting your validator keys and state is critical. Losing priv_validator_key.json means losing your validator identity, and losing your account keys means losing access to staked funds.

⚠️ Double Signing Warning: While backups are essential, NEVER restore your priv_validator_key.json to a second node while another node is running with the same key. This causes double signing, which is slashed and typically leaves the validator permanently jailed (tombstoned). Always stop the old node before migrating.

Regular Backup Schedule

Daily Backups:

  • Purpose: Protect critical keys and configuration that define your validator identity.

  • What's backed up:

    • priv_validator_key.json - Your validator signing key (most critical)

    • node_key.json - Your P2P network identity

    • keyring-file/ - Your account keys and addresses

    • config.toml and app.toml - Your node configuration

  • Why daily: Keys rarely change, but having recent backups ensures quick recovery if hardware fails.

  • Storage: Keep backups in multiple locations (external drive, encrypted cloud storage, offline USB).

  • Backup Script:
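The original script is not reproduced here. A minimal sketch of such a backup might look like the following, assuming the node home is ~/.evmd with the standard config/ layout and a file-based keyring. The function name backup_keys and the destination directory are hypothetical.

```shell
# Sketch: archive validator keys and configuration into a timestamped tarball.
# Assumes the standard layout under the node home; adjust paths as needed.
backup_keys() {
    local home="${1:-$HOME/.evmd}" dest="${2:-$HOME/validator-backups}"
    local archive="$dest/validator-keys-$(date +%Y%m%d-%H%M%S).tar.gz"

    mkdir -p "$dest" && chmod 700 "$dest" || return 1

    # Paths are relative to $home so the archive can be restored with -C.
    tar -czf "$archive" -C "$home" \
        config/priv_validator_key.json \
        config/node_key.json \
        config/config.toml \
        config/app.toml \
        keyring-file || return 1

    chmod 600 "$archive"
    echo "$archive"
}
```

The archive is not encrypted by this sketch, so it should be encrypted before it is copied to cloud storage or removable media.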

State Snapshot (Weekly):

  • Purpose: Back up blockchain data for faster recovery without re-syncing from genesis.

  • What's backed up: Complete database state (~/.evmd/data/) containing all blockchain data.

  • Why weekly: State is large (hundreds of GB) and changes constantly, but you can always re-sync if needed.

  • Downtime required: Yes (5-30 minutes depending on disk speed).

  • State Snapshot Script:
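The original script is not reproduced here. As a rough sketch, a snapshot can be taken by stopping the node, archiving the data directory, and starting the node again. The function name snapshot_state and the destination directory are hypothetical, and the unit name evmd is assumed from the restart procedures.

```shell
# Sketch: stop the node, archive ~/.evmd/data, then start the node again.
snapshot_state() {
    local home="${1:-$HOME/.evmd}" dest="${2:-$HOME/validator-backups}"
    local archive="$dest/state-$(date +%Y%m%d).tar.gz" rc=0

    mkdir -p "$dest" || return 1

    # The node must be stopped so the database is not written mid-copy.
    sudo systemctl stop evmd || return 1

    tar -czf "$archive" -C "$home" data || rc=$?

    # Restart even if archiving failed, to keep downtime short.
    sudo systemctl start evmd

    [ "$rc" -eq 0 ] && echo "$archive"
}
```

Because the node is offline for the whole tar step, this is best run during a low-risk window and followed by the post-restart checks.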

Note: State snapshots are optional but dramatically reduce recovery time (hours vs days).

Recovery Procedures

Recover from Key Backup:

  • When to use:

    • Lost or corrupted keys

    • Moving validator to new hardware (ONLY after stopping old hardware)

    • Disaster recovery

  • What it recovers: Your validator identity and configuration (does not restore blockchain state).

  • Recovery time: Keys restore instantly, but the node may need to sync (hours to days without a state snapshot).
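A key restore can be sketched as a single extraction into the node home. This assumes a tar.gz archive whose paths are relative to the node home (config/..., keyring-file/...). The function name restore_keys is hypothetical. When migrating, confirm the old node is stopped before starting the new one.

```shell
# Sketch: restore keys and configuration from an archive into the node home.
# Assumes archive paths are relative to the node home directory.
restore_keys() {
    local archive="$1" home="${2:-$HOME/.evmd}"

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    mkdir -p "$home" || return 1
    tar -xzf "$archive" -C "$home" || return 1

    # Key files should be readable only by the account that runs the node.
    chmod 600 "$home/config/priv_validator_key.json" "$home/config/node_key.json"
}
```

After the restore, start the service and let the node sync before expecting it to sign.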

Recover from State Snapshot:

  • When to use:

    • Corrupted database

    • Faster recovery after hardware migration

    • Node won't start due to state errors

  • What it does: Restores blockchain data to a previous point, avoiding full re-sync.

  • Recovery time: Extraction (10-30 min) + sync from snapshot height to current (minutes to hours).

  • Important: State snapshots don't include keys. Always restore keys first if moving to new hardware.
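A state restore can be sketched as follows, assuming a tar.gz archive containing a top-level data/ directory and the systemd unit name evmd. The function name restore_state is hypothetical. The step that keeps the existing priv_validator_state.json is an added precaution and not part of the procedure described above.

```shell
# Sketch: replace the data directory with the contents of a snapshot archive.
restore_state() {
    local archive="$1" home="${2:-$HOME/.evmd}"
    local old="$home/data.bak.$(date +%s)"

    [ -f "$archive" ] || { echo "archive not found: $archive" >&2; return 1; }

    sudo systemctl stop evmd || return 1

    # Keep the old data directory until the restored node is confirmed healthy.
    if [ -d "$home/data" ]; then
        mv "$home/data" "$old" || return 1
    fi

    tar -xzf "$archive" -C "$home" || return 1

    # Precaution (an assumption of this sketch): carry over the most recent
    # signing state so the validator does not re-sign an earlier height.
    if [ -f "$old/priv_validator_state.json" ]; then
        cp "$old/priv_validator_state.json" "$home/data/priv_validator_state.json"
    fi

    sudo systemctl start evmd
}
```

The data.bak.* directory can be deleted once the node has caught up and is signing again.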

Monitoring Setup

Prometheus Metrics

Enable Prometheus metrics in config.toml:
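The original configuration snippet is not reproduced here. In a CometBFT-style config.toml the setting is the prometheus key under the [instrumentation] section, and the sketch below flips it from false to true. The function name enable_prometheus is hypothetical and the default path assumes the node home is ~/.evmd.

```shell
# Sketch: enable the Prometheus endpoint in config.toml.
# Flips "prometheus = false" to "prometheus = true" and keeps a .bak copy.
enable_prometheus() {
    local cfg="${1:-$HOME/.evmd/config/config.toml}"

    sed -i.bak 's/^prometheus = false/prometheus = true/' "$cfg" || return 1

    # Fail loudly if the key was not found in the expected form.
    grep -q '^prometheus = true' "$cfg"
}
```

The node has to be restarted for the change to take effect. The listen address is controlled by prometheus_listen_addr in the same section (":26660" by default).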

Prometheus Configuration (prometheus.yml):
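The original prometheus.yml is not reproduced here. A minimal scrape configuration can be generated as below, assuming the node exposes metrics on localhost:26660. The job name, scrape interval, and the function name write_prometheus_config are illustrative.

```shell
# Sketch: write a minimal prometheus.yml that scrapes one validator node.
write_prometheus_config() {
    local out="${1:-prometheus.yml}" target="${2:-localhost:26660}"

    cat > "$out" <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: evmd
    static_configs:
      - targets: ['$target']
EOF
}
```

For example, write_prometheus_config /etc/prometheus/prometheus.yml localhost:26660 writes the file, after which Prometheus needs a reload to pick it up.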

Key Metrics to Monitor

Alerting Configuration

Alert Script Example

Purpose: Automated monitoring script that sends Slack notifications when critical validator issues are detected.

What it monitors:

  • Validator jailed status (critical)

  • Missed block count (warning threshold: 100 blocks)

  • Node sync status (out of sync detection)

Setup Instructions:

  1. Create Slack webhook:

    • Go to https://api.slack.com/messaging/webhooks

    • Create a webhook for your workspace

    • Copy the webhook URL

  2. Set environment variable:

  3. Create and configure the script:
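The original script is not reproduced here. The sketch below shows only the decision and notification logic. It assumes the webhook URL is exported as SLACK_WEBHOOK_URL and the threshold as MISSED_BLOCKS_THRESHOLD (both names are assumptions), and it takes the jailed flag, missed block count, and catching_up flag as arguments, because the exact query commands depend on your evmd version.

```shell
# Sketch of the alert logic; variable and function names are illustrative.
MISSED_BLOCKS_THRESHOLD="${MISSED_BLOCKS_THRESHOLD:-100}"

log() { echo "$(date '+%Y-%m-%d %H:%M:%S') $*"; }

# Map the three observations to CRITICAL, WARNING, INFO, or UNKNOWN.
classify() {
    local jailed="$1" missed="$2" catching_up="$3"

    # Validate the count before any numeric comparison.
    case "$missed" in
        ''|*[!0-9]*) echo "UNKNOWN"; return ;;
    esac

    if [ "$jailed" = "true" ]; then
        echo "CRITICAL"
    elif [ "$missed" -ge "$MISSED_BLOCKS_THRESHOLD" ] || [ "$catching_up" = "true" ]; then
        echo "WARNING"
    else
        echo "INFO"
    fi
}

# Post a plain-text message to the Slack incoming webhook.
send_slack() {
    [ -n "$SLACK_WEBHOOK_URL" ] || { log "SLACK_WEBHOOK_URL is not set"; return 1; }
    curl -s -X POST -H 'Content-type: application/json' \
        --data "{\"text\": \"$1\"}" "$SLACK_WEBHOOK_URL" >/dev/null
}

check_validator() {
    local severity
    severity=$(classify "$1" "$2" "$3")
    log "$severity jailed=$1 missed=$2 catching_up=$3"

    # INFO is logged only; everything else is also sent to Slack.
    [ "$severity" = "INFO" ] ||
        send_slack "[$severity] validator: jailed=$1 missed_blocks=$2 catching_up=$3"
}
```

With the webhook exported in the shell (export SLACK_WEBHOOK_URL=<your-webhook-url>), a call such as check_validator false 12 false only logs, while check_validator true 0 false sends a CRITICAL message.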

Script Features:

  • Error handling: Validates all retrieved values before use

  • Configurable: Uses environment variables for flexibility

  • Logging: Prints timestamped events to stdout

  • Security: Webhook URL stored in environment, not in script

  • Thresholds: Configurable missed blocks threshold

Cron Setup:
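The original crontab entry is not reproduced here. One way to schedule the script is sketched below, assuming a five-minute interval. The interval, the log file name, and the function name install_cron are illustrative.

```shell
# Sketch: add (or replace) a crontab entry that runs the alert script.
install_cron() {
    local script="$1" schedule="${2:-*/5 * * * *}"
    local entry="$schedule $script >> \$HOME/validator-alerts.log 2>&1"

    # Drop any previous entry for the same script, then append the new one.
    { crontab -l 2>/dev/null | grep -vF "$script"; echo "$entry"; } | crontab -
}
```

Cron jobs do not inherit variables exported in an interactive shell, so the webhook URL has to be defined in the crontab itself or loaded by the script from a protected file.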

Testing the Script:

Alert Severity Levels:

  • 🚨 CRITICAL: Validator jailed - requires immediate action

  • ⚠️ WARNING: High missed blocks or out of sync - investigate soon

  • INFO: Regular monitoring checks (logged, not alerted)
