Operations

Validator Restart Procedures

Understanding when and how to restart your validator safely is crucial for maintaining uptime and avoiding jailing.

Graceful Restart

When to use: Routine maintenance, configuration changes, or planned updates when you have time.
What it does: Sends a termination signal (SIGTERM) to the evmd process, allowing it to cleanly shut down by:
- Completing any in-progress operations
- Flushing data to disk
- Closing database connections gracefully
Expected downtime: 10-30 seconds

# Stop the node gracefully
sudo systemctl stop evmd

# Verify process stopped (should return no results)
# pgrep is used instead of "ps aux | grep evmd", which always matches its own grep process
pgrep -x evmd

# Start the node
sudo systemctl start evmd

# Monitor logs to ensure clean startup
sudo journalctl -u evmd -f --output cat
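The stop, verify, start sequence above can be scripted so the start only happens once the old process is really gone. The following is a minimal sketch, not part of the original procedure: `wait_until` and `evmd_stopped` are hypothetical helper names, and the 60-second timeout in the usage line is an arbitrary assumption.

```shell
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Polls COMMAND once per second until it succeeds or the timeout elapses.
# Returns 0 on success, 1 on timeout.
wait_until() {
    local timeout
    local elapsed
    timeout="$1"
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}

# Hypothetical check: succeeds once no process named exactly "evmd" remains.
evmd_stopped() {
    ! pgrep -x evmd > /dev/null
}
```

A possible use: `sudo systemctl stop evmd && wait_until 60 evmd_stopped && sudo systemctl start evmd`. If the wait times out, the start is skipped and you can decide whether the emergency procedure is needed.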

Emergency Restart

When to use: When the node is unresponsive, hung, or graceful restart fails.
What it does: Forces immediate termination (SIGKILL) without waiting for cleanup. Use only when a graceful restart doesn't work, because it can cause database inconsistencies.
Risk: May require database repair if process was killed during a write operation.

# Force stop if graceful stop hangs
# (systemctl kill sends SIGTERM unless a signal is specified)
sudo systemctl kill -s SIGKILL evmd

# Clear any leftover processes
sudo pkill -9 -x evmd

# Start node
sudo systemctl start evmd
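The escalation described above (SIGTERM first, SIGKILL only after a grace period) can be sketched for a process addressed by PID. This is an illustration under assumptions, not the documented procedure: `stop_pid` is a hypothetical helper, and it only applies if you manage the process directly rather than through systemd.

```shell
# stop_pid PID GRACE_SECONDS
# Sends SIGTERM, waits up to GRACE_SECONDS for the process to exit,
# then falls back to SIGKILL. Prints what happened:
#   not-running  the PID could not be signalled at all
#   terminated   the process exited after SIGTERM
#   killed       the grace period expired and SIGKILL was sent
stop_pid() {
    local pid
    local grace
    local waited
    pid="$1"
    grace="$2"
    waited=0
    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "not-running"
        return 0
    fi
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null
            echo "killed"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "terminated"
}
```

Under systemd, `systemctl stop` already performs a similar escalation after the unit's stop timeout, so a helper like this is mainly useful outside systemd.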

Post-Restart Validation

Purpose: Verify your validator is running correctly and signing blocks after restart.
What to check:
- Service is active and running
- Node is syncing/synced with the network
- Block height is increasing
- Validator is participating in consensus (signing blocks)
Monitor for: 5-10 minutes after restart to ensure stable operation.

# Check service status (should show "active (running)")
sudo systemctl status evmd

# Check sync status (catching_up should be false)
curl -s http://localhost:26657/status | jq '.result.sync_info'

# Check latest block height (should be increasing)
evmd status --node http://localhost:26657 | jq '.sync_info.latest_block_height'

# Check if validator is signing blocks (missed_blocks_counter should not be rapidly increasing)
# Note: Ensure VALCONS is set (see "Checking Validator Status" section above)
evmd query slashing signing-info $VALCONS \
  --node http://localhost:26657
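The "block height is increasing" check can be automated by sampling the height twice. The sketch below assumes `curl` and `jq` are installed and reuses the RPC URL from the commands above. The helper names and the 15-second default interval are assumptions for illustration.

```shell
# is_uint VALUE: succeeds only for a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# height_advanced FIRST SECOND
# Succeeds only if both values are integers and SECOND is greater than FIRST.
height_advanced() {
    is_uint "$1" && is_uint "$2" && [ "$2" -gt "$1" ]
}

# sample_heights [RPC_URL] [WAIT_SECONDS]
# Reads the latest block height twice and reports whether it moved.
# Defined here but not called; needs a running node, curl and jq.
sample_heights() {
    local url
    local pause
    local h1
    local h2
    url="${1:-http://localhost:26657}"
    pause="${2:-15}"
    h1=$(curl -s "$url/status" | jq -r '.result.sync_info.latest_block_height')
    sleep "$pause"
    h2=$(curl -s "$url/status" | jq -r '.result.sync_info.latest_block_height')
    if height_advanced "$h1" "$h2"; then
        echo "OK: height moved from $h1 to $h2"
    else
        echo "WARN: height did not advance ($h1 -> $h2)" >&2
        return 1
    fi
}
```

Calling `sample_heights` with no arguments checks the local node. A non-numeric reading (for example `null` when the RPC is not yet up) is treated as "not advanced".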

Backup and Recovery

Protecting your validator keys and state is critical. Losing the validator signing key means losing your validator identity, and losing the account keys in the keyring means losing access to the staked funds.

⚠️ Double Signing Warning: While backups are essential, NEVER restore your priv_validator_key.json to a second node while another node is running with the same key. This will cause double signing and permanent slashing. Always stop the old node before migrating.

Regular Backup Schedule

Daily Backups:

Purpose: Protect critical keys and configuration that define your validator identity.
What's backed up:
- priv_validator_key.json - Your validator signing key (most critical)
- node_key.json - Your P2P network identity
- keyring-file/ - Your account keys and addresses
- config.toml and app.toml - Your node configuration
Why daily: Keys rarely change, but having recent backups ensures quick recovery if hardware fails.
Storage: Keep backups in multiple locations (external drive, encrypted cloud storage, offline USB).
Backup Script:

#!/bin/bash
# backup-validator.sh

set -e  # Exit on error
umask 077  # Backups contain private keys: readable by the owner only

BACKUP_DIR="$HOME/evmd-backups/$(date +%Y-%m-%d)"
mkdir -p "$BACKUP_DIR"

# Backup critical keys (MOST IMPORTANT - without these you lose your validator)
cp "$HOME/.evmd/config/priv_validator_key.json" "$BACKUP_DIR/"
cp "$HOME/.evmd/config/node_key.json" "$BACKUP_DIR/"

# Backup keyring (your wallet keys)
cp -r "$HOME/.evmd/keyring-file/" "$BACKUP_DIR/"

# Backup configuration (easier than reconfiguring from scratch)
cp "$HOME/.evmd/config/config.toml" "$BACKUP_DIR/"
cp "$HOME/.evmd/config/app.toml" "$BACKUP_DIR/"

# Create archive for easy storage
tar -czf "$BACKUP_DIR.tar.gz" -C "$HOME/evmd-backups" "$(basename "$BACKUP_DIR")"

# Verify backup was created successfully
if [ ! -f "$BACKUP_DIR.tar.gz" ]; then
    echo "ERROR: Backup failed!" >&2
    exit 1
fi

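# Remove old backups (keep last 30 days): archives and their unpacked directories
find "$HOME/evmd-backups" -name "*.tar.gz" -mtime +30 -delete
find "$HOME/evmd-backups" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +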

echo "Backup completed: $BACKUP_DIR.tar.gz"
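The backup script only checks that the archive file exists. A stricter check lists the archive and confirms that each critical file is inside it. The following is a minimal sketch under that assumption. `verify_backup` is a hypothetical helper, and the required file names are the ones the backup script copies.

```shell
# verify_backup ARCHIVE
# Succeeds only if the .tar.gz archive can be listed and contains every
# file named in "required". Prints the first missing name to stderr.
verify_backup() {
    local archive
    local listing
    local name
    local required
    archive="$1"
    required="priv_validator_key.json node_key.json config.toml app.toml"
    listing=$(tar -tzf "$archive") || return 1
    for name in $required; do
        # Archive entries look like "<date>/<name>", so match on the path suffix.
        if ! printf '%s\n' "$listing" | grep -q "/$name\$"; then
            echo "MISSING: $name" >&2
            return 1
        fi
    done
    echo "OK: $archive"
}
```

For example, `verify_backup "$HOME/evmd-backups/<date>.tar.gz"` could run at the end of the backup script, or periodically against the copies kept in other storage locations.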

State Snapshot (Weekly):

Purpose: Back up blockchain data for faster recovery without re-syncing from genesis.
What's backed up: Complete database state (~/.evmd/data/) containing all blockchain data.
Why weekly: State is large (hundreds of GB) and changes constantly, but you can always re-sync if needed.
Downtime required: Yes (roughly 10-30 minutes, depending on disk speed and state size).
State Snapshot Script:

#!/bin/bash
# state-snapshot.sh

set -e  # Exit on error

SNAPSHOT_FILE="evmd-state-$(date +%Y-%m-%d).tar.gz"

# Stop node (required for consistent snapshot)
sudo systemctl stop evmd

# Create state snapshot (this can take 10-30 minutes).
# tar is tested directly: under "set -e" a failed tar would otherwise
# exit the script before the node is restarted.
if ! tar -czf "$SNAPSHOT_FILE" -C "$HOME/.evmd/data" .; then
    echo "ERROR: Snapshot failed! Restarting node..." >&2
    sudo systemctl start evmd
    exit 1
fi

# Restart node
sudo systemctl start evmd

echo "Snapshot completed: $SNAPSHOT_FILE"

# Upload to secure storage (e.g., S3, cloud storage)
# aws s3 cp "$SNAPSHOT_FILE" s3://your-bucket/backups/

Note: State snapshots are optional but dramatically reduce recovery time (hours vs days).
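Because the state can reach hundreds of GB, it may be worth confirming there is room for the archive before stopping the node. The sketch below compares the size of the data directory with the free space at the destination. The helper names are hypothetical. Using the uncompressed size as the requirement is a deliberately conservative assumption, since the compressed archive is normally smaller.

```shell
# dir_size_kb DIR: disk usage of DIR in KiB.
dir_size_kb() {
    du -sk "$1" | awk '{print $1}'
}

# free_kb DIR: free space on the filesystem that holds DIR, in KiB.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# has_room SRC_DIR DEST_DIR
# Succeeds if DEST_DIR's filesystem has at least as much free space
# as SRC_DIR currently occupies.
has_room() {
    local need
    local avail
    need=$(dir_size_kb "$1") || return 1
    avail=$(free_kb "$2") || return 1
    [ -n "$need" ] && [ -n "$avail" ] && [ "$avail" -ge "$need" ]
}
```

A snapshot script could run `has_room "$HOME/.evmd/data" "$PWD" || exit 1` before `systemctl stop`, so that a full disk is detected without any downtime.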

Recovery Procedures

Recover from Key Backup:

When to use:
- Lost or corrupted keys
- Moving validator to new hardware (ONLY after stopping old hardware)
- Disaster recovery
What it recovers: Your validator identity and configuration (does not restore blockchain state).
Recovery time: Keys restore instantly, but node may need to sync (hours to days without state snapshot).

# Stop node
sudo systemctl stop evmd

# Restore keys (use most recent backup; replace <date> with its YYYY-MM-DD name)
cp "$HOME/evmd-backups/<date>/priv_validator_key.json" "$HOME/.evmd/config/"
cp "$HOME/evmd-backups/<date>/node_key.json" "$HOME/.evmd/config/"

# Restore keyring (your wallet keys)
cp -r "$HOME/evmd-backups/<date>/keyring-file/" "$HOME/.evmd/"

# Set proper permissions (critical for security)
chmod 600 $HOME/.evmd/config/priv_validator_key.json
chmod 600 $HOME/.evmd/config/node_key.json

# Restart node (will sync from network if no state data)
sudo systemctl start evmd
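After restoring, it can help to confirm that the restored key is byte-identical to the backup copy and that its permissions are what the `chmod` step intended. The following is a minimal sketch, assuming GNU coreutils (`stat -c`, `sha256sum`) as found on typical Linux hosts. Both helper names are hypothetical.

```shell
# check_mode FILE EXPECTED_MODE
# Succeeds if FILE's octal permission bits equal EXPECTED_MODE (e.g. 600).
check_mode() {
    local file
    local expected
    local actual
    file="$1"
    expected="$2"
    actual=$(stat -c '%a' "$file") || return 1
    if [ "$actual" != "$expected" ]; then
        echo "BAD MODE: $file is $actual, expected $expected" >&2
        return 1
    fi
}

# same_file A B
# Succeeds if both files exist and have identical SHA-256 digests.
same_file() {
    [ -f "$1" ] && [ -f "$2" ] &&
        [ "$(sha256sum < "$1")" = "$(sha256sum < "$2")" ]
}
```

For example: `same_file "$HOME/evmd-backups/<date>/priv_validator_key.json" "$HOME/.evmd/config/priv_validator_key.json" && check_mode "$HOME/.evmd/config/priv_validator_key.json" 600`.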

Recover from State Snapshot:

When to use:
- Corrupted database
- Faster recovery after hardware migration
- Node won't start due to state errors
What it does: Restores blockchain data to a previous point, avoiding full re-sync.
Recovery time: Extraction (10-30 min) + sync from snapshot height to current (minutes to hours).
Important: State snapshots don't include keys. Always restore keys first if moving to new hardware.

# Stop node
sudo systemctl stop evmd

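# Backup current corrupted state (just in case)
mv "$HOME/.evmd/data" "$HOME/.evmd/data.backup"

# Restore from snapshot (choose most recent snapshot)
mkdir -p "$HOME/.evmd/data"
tar -xzf "evmd-state-<date>.tar.gz" -C "$HOME/.evmd/data"

# Keep the newest signing state: the snapshot carries an older
# priv_validator_state.json, and signing from outdated state risks double signing
cp "$HOME/.evmd/data.backup/priv_validator_state.json" "$HOME/.evmd/data/"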

# Restart node (will sync remaining blocks from snapshot point)
sudo systemctl start evmd

# Monitor sync (should catch up from snapshot height)
sudo journalctl -u evmd -f

Monitoring Setup

Prometheus Metrics

Enable Prometheus metrics in config.toml:

[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
max_open_connections = 3
namespace = "tendermint"

Prometheus Configuration (prometheus.yml):

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'bitplanet-validator'
    static_configs:
      - targets: ['localhost:26660']
        labels:
          instance: 'validator-1'
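Before pointing Prometheus at the node, the metrics endpoint can be checked by hand. The sketch below extracts a single value from the Prometheus text format. `metric_value` is a hypothetical helper, and the metric name in the usage line assumes the `tendermint` namespace configured above. Confirm the exact names against your node's `/metrics` output.

```shell
# metric_value NAME
# Reads Prometheus text exposition format on stdin and prints the value of
# the first sample whose metric name is exactly NAME. Labels are ignored,
# and samples are assumed to carry no trailing timestamp.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                   # skip HELP and TYPE comment lines
        {
            metric = $1
            sub(/\{.*/, "", metric)     # drop the label set, if any
            if (metric == name) { print $NF; exit }
        }
    '
}
```

Possible usage: `curl -s http://localhost:26660/metrics | metric_value tendermint_consensus_height`. An empty result means the metric was not found, which usually indicates that instrumentation is disabled or the namespace differs. Values may be printed in floating-point notation.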

Key Metrics to Monitor

# Block height
curl -s http://localhost:26657/status | jq '.result.sync_info.latest_block_height'

# Validator voting power (simplified)
VALOPER=$(evmd keys show validator --bech val -a --keyring-backend file --home $HOME/.evmd)
evmd query staking validator $VALOPER --node http://localhost:26657 --output json | jq '.tokens'

# Missed blocks
VALCONS=$(evmd tendermint show-address --home $HOME/.evmd)
evmd query slashing signing-info $VALCONS --node http://localhost:26657 --output json | jq '.missed_blocks_counter'

# Peer count
curl -s http://localhost:26657/net_info | jq '.result.n_peers'

# Mempool size
curl -s http://localhost:26657/num_unconfirmed_txs | jq '.result.total'

Alerting Configuration

Alert Script Example

Purpose: Automated monitoring script that sends Slack notifications when critical validator issues are detected.

What it monitors:

- Validator jailed status (critical)
- Missed block count (warning threshold: 100 blocks)
- Node sync status (out of sync detection)

Setup Instructions:

Create Slack webhook:
- Go to https://api.slack.com/messaging/webhooks
- Create a webhook for your workspace
- Copy the webhook URL

Set environment variable:

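# Add to ~/.bashrc (interactive shells) or /etc/environment.
# Cron does not read ~/.bashrc, so for the cron job also define the variable in the crontab.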
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"

Create and configure the script:

#!/bin/bash
# alert-validator.sh
# Automated validator monitoring and alerting script

set -e  # Exit on error

# Configuration
HOME_DIR="${HOME_DIR:-$HOME/.evmd}"
NODE_URL="${NODE_URL:-http://localhost:26657}"
KEYRING_BACKEND="${KEYRING_BACKEND:-file}"
MISSED_BLOCKS_THRESHOLD=100

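# Get validator addresses. "|| true" keeps "set -e" from aborting before the check below.
# The file keyring may prompt for a passphrase, which fails under cron;
# export VALOPER beforehand to skip the keyring lookup.
if [ -z "${VALOPER:-}" ]; then
    VALOPER=$(evmd keys show validator --bech val -a --keyring-backend "$KEYRING_BACKEND" --home "$HOME_DIR" 2>/dev/null) || true
fi
VALCONS=$(evmd tendermint show-address --home "$HOME_DIR" 2>/dev/null) || true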

# Validate addresses were retrieved
if [ -z "$VALOPER" ] || [ -z "$VALCONS" ]; then
    echo "Error: Failed to retrieve validator addresses" >&2
    exit 1
fi

# SECURITY: Store webhook URL in environment variable, not in script
SLACK_WEBHOOK="${SLACK_WEBHOOK_URL}"

if [ -z "$SLACK_WEBHOOK" ] || [ "$SLACK_WEBHOOK" == "https://hooks.slack.com/services/YOUR/WEBHOOK/URL" ]; then
    echo "Error: SLACK_WEBHOOK_URL environment variable not set" >&2
    exit 1
fi

# Function to send alert
send_alert() {
    local message="$1"
    curl -s -X POST "$SLACK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"text\":\"$message\"}" \
        > /dev/null 2>&1
}

# Check if validator is jailed
JAILED=$(evmd query staking validator "$VALOPER" --node "$NODE_URL" --output json 2>/dev/null | jq -r '.jailed')

if [ "$JAILED" == "true" ]; then
    send_alert "🚨 CRITICAL: Validator is JAILED! Immediate action required."
    echo "$(date): ALERT - Validator jailed"
fi

# Check missed blocks
MISSED=$(evmd query slashing signing-info "$VALCONS" --node "$NODE_URL" --output json 2>/dev/null | jq -r '.missed_blocks_counter')

# Compare only when the value is numeric (jq prints "null" if the field is absent)
if [[ "$MISSED" =~ ^[0-9]+$ ]] && [ "$MISSED" -gt "$MISSED_BLOCKS_THRESHOLD" ]; then
    send_alert "⚠️  WARNING: Validator missed $MISSED blocks (threshold: $MISSED_BLOCKS_THRESHOLD)"
    echo "$(date): WARNING - Missed $MISSED blocks"
fi

# Check sync status
CATCHING_UP=$(curl -s "$NODE_URL/status" 2>/dev/null | jq -r '.result.sync_info.catching_up')

if [ "$CATCHING_UP" == "true" ]; then
    send_alert "⚠️  WARNING: Node is catching up (not fully synced)"
    echo "$(date): WARNING - Node out of sync"
fi

echo "$(date): Monitoring check completed"
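The `send_alert` function places the message directly inside a JSON string, which works for the fixed messages above but would produce invalid JSON if a message ever contained a double quote or backslash. The following is a minimal sketch of an escaping helper. `json_escape` is a hypothetical name, and it handles only those two characters, so it assumes single-line messages.

```shell
# json_escape STRING
# Prints STRING with backslashes and double quotes escaped so that it can
# be embedded between the quotes of a JSON string value.
# Backslashes are doubled first so the escapes added for quotes stay intact.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}
```

Inside `send_alert`, the payload could then be built as `-d "{\"text\":\"$(json_escape "$message")\"}"`.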

Script Features:

- Error handling: Validates all retrieved values before use
- Configurable: Uses environment variables for flexibility
- Logging: Prints timestamped events to stdout
- Security: Webhook URL stored in environment, not in script
- Thresholds: Configurable missed blocks threshold

Cron Setup:

# Make script executable
chmod +x /home/validator/alert-validator.sh

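# Create the log file and make it writable by the user that runs the job
sudo touch /var/log/validator-alerts.log
sudo chown validator /var/log/validator-alerts.log

# Add to crontab (run every 5 minutes)
crontab -e

# Add these lines (cron does not read ~/.bashrc, so the webhook is set here):
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
*/5 * * * * /home/validator/alert-validator.sh >> /var/log/validator-alerts.log 2>&1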

# Or for system-wide (add to /etc/cron.d/validator-alerts):
*/5 * * * * validator /home/validator/alert-validator.sh >> /var/log/validator-alerts.log 2>&1

Testing the Script:

# Test with environment variable
export SLACK_WEBHOOK_URL="your-webhook-url"
./alert-validator.sh

# Check logs
tail -f /var/log/validator-alerts.log

Alert Severity Levels:

- 🚨 CRITICAL: Validator jailed - requires immediate action
- ⚠️ WARNING: High missed blocks or out of sync - investigate soon
- ✅ INFO: Regular monitoring checks (logged, not alerted)

Next Steps

- Review setup procedures in Setup & Configuration
- Follow step-by-step procedures in Runbooks
- Troubleshoot common issues in Troubleshooting
- Access quick reference in Additional Resources


Last updated 1 month ago
