Runbooks

Overview

This section provides step-by-step operational procedures for common validator tasks and emergency scenarios.

These runbooks are designed for production validator operations and assume you have basic familiarity with Linux system administration, the Cosmos SDK, and the evmd command-line interface.

Important: Always test procedures in a non-production environment first. For critical operations affecting validator uptime, ensure you have recent backups and a rollback plan.


Runbook 1: Validator Restart

Purpose: Gracefully restart validator without getting jailed

Prerequisites:

  • SSH access to validator

  • Backup of critical files

Steps:

  1. Pre-restart checks:

    # Check current status
    curl -s http://localhost:26657/status | jq '.result.sync_info'
    
    # Check validator signing
    # Note: Set $VALCONS if not already set (see "Checking Validator Status" section)
    evmd query slashing signing-info $VALCONS --node http://localhost:26657
  2. Stop validator:

    sudo systemctl stop evmd
    
    # Verify stopped (no output means no evmd process is running)
    pgrep -a evmd
  3. Perform maintenance:

    # Update binary if needed
    make install
    
    # Verify new version
    evmd version

    Note: The make install command must be run from the Bitplanet repository directory. It builds the binary with the correct version information and installs it to your $GOPATH/bin (typically ~/go/bin/evmd).

  4. Start validator:

    sudo systemctl start evmd
  5. Post-restart validation:

    # Check service status
    sudo systemctl status evmd
    
    # Monitor logs for about 2 minutes (press Ctrl+C to stop following)
    sudo journalctl -u evmd -f --output cat
    
    # Verify the node has caught up (expect "false" once synced)
    curl -s http://localhost:26657/status | jq '.result.sync_info.catching_up'
    
    # Check signing (VALCONS should already be set from "Checking Validator Status" section)
    sleep 60
    evmd query slashing signing-info $VALCONS --node http://localhost:26657

Expected Duration: 2-5 minutes

Rollback Plan: If the new binary fails to start or sign, stop the service, restore the previous binary from backup, and start the service again.
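
The rollback plan above can be sketched as a pair of small helpers. This is a minimal sketch, not part of the original procedure: it assumes the binary lives at ~/go/bin/evmd (the typical location mentioned in step 3) and uses a hypothetical ~/evmd-backups directory; the function names are illustrative.

```shell
# Hypothetical helpers for backing up and restoring the evmd binary.
# EVMD_BIN and BACKUP_DIR are assumptions; adjust them to your setup.
EVMD_BIN="${EVMD_BIN:-$HOME/go/bin/evmd}"
BACKUP_DIR="${BACKUP_DIR:-$HOME/evmd-backups}"

# Copy the current binary aside before running `make install`.
backup_binary() {
  mkdir -p "$BACKUP_DIR" || return 1
  cp -p "$EVMD_BIN" "$BACKUP_DIR/evmd.previous"
}

# Put the previous binary back if the new one fails to start or sign.
restore_binary() {
  if [ ! -f "$BACKUP_DIR/evmd.previous" ]; then
    echo "no backup found in $BACKUP_DIR" >&2
    return 1
  fi
  cp -p "$BACKUP_DIR/evmd.previous" "$EVMD_BIN"
}
```

One way to use these: call backup_binary before step 3. If the new binary misbehaves, stop the service, call restore_binary, and start the service again.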

Runbook 2: Consensus Failure Recovery

Purpose: Recover from consensus failure and rejoin network

Symptoms:

  • Node not producing blocks

  • Persistent errors in logs

  • Validator jailed

Steps:

  1. Immediate actions:

  2. Diagnose issue:

  3. Recovery options:

    Option A: Reset Tendermint state (if app state is intact)

    What this does: Removes all blockchain data (blocks, consensus state, transaction index) but keeps your configuration and keys. The node will resync from the network starting from genesis or a state sync snapshot. This is useful when consensus state is corrupted but your keys and configuration are fine.

    Option B: Restore from snapshot (if corruption is severe)

    Option C: Resync from genesis (last resort)

  4. Monitor recovery:

  5. Re-enable validator (if was jailed):

Expected Duration: 30 minutes to several hours (depending on sync method)
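
The steps above (using Option A) might look roughly like the following. This is a hedged sketch rather than the official procedure: it assumes the systemd unit is named evmd, the node home is ~/.evmd, and your evmd build exposes the comet unsafe-reset-all subcommand (older Cosmos SDK versions name it tendermint unsafe-reset-all). KEY_NAME and CHAIN_ID are placeholders, and the function names and DRY_RUN switch are illustrative.

```shell
# Assumed defaults; override them in your environment as needed.
EVMD_HOME="${EVMD_HOME:-$HOME/.evmd}"
RPC="${RPC:-http://localhost:26657}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Steps 1-2: stop the node and pull the most recent log lines for diagnosis.
stop_and_diagnose() {
  run sudo systemctl stop evmd
  run sudo journalctl -u evmd -n 200 --no-pager
}

# Step 3, Option A: reset chain data but preserve the signing state,
# so the validator does not re-sign heights it has already signed.
reset_state() {
  run cp "$EVMD_HOME/data/priv_validator_state.json" "$EVMD_HOME/priv_validator_state.json.bak"
  run evmd comet unsafe-reset-all --home "$EVMD_HOME" --keep-addr-book
  run cp "$EVMD_HOME/priv_validator_state.json.bak" "$EVMD_HOME/data/priv_validator_state.json"
}

# Step 4: print "true" while the node is still catching up, "false" once synced.
check_sync() {
  curl -s "$RPC/status" | jq -r '.result.sync_info.catching_up'
}

# Step 5: submit the unjail transaction once the node is synced.
unjail() {
  run evmd tx slashing unjail --from "$KEY_NAME" --chain-id "$CHAIN_ID" --node "$RPC"
}
```

Running with DRY_RUN=1 first prints each command without executing it, which fits the advice to rehearse procedures before touching production. Depending on the chain's fee settings, the unjail transaction may also need fee or gas flags that are not shown here.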

Runbook 3: Reward Claim Failure Resolution

Purpose: Troubleshoot and resolve issues with claiming validator rewards

Symptoms:

  • Transaction fails when claiming rewards

  • "insufficient funds" error

  • Rewards not visible

Steps:

  1. Check rewards availability:

  2. Check account balance for fees:

  3. Attempt reward withdrawal:

    Note: With the standard Cosmos SDK withdraw-rewards command, the --commission flag adds the validator commission to the withdrawal. Without it, only delegation rewards (the rewards on your own staked tokens) are withdrawn. You can withdraw the two in separate transactions or together in one.

  4. If transaction fails:

  5. Verify withdrawal:

Expected Duration: 5-10 minutes
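
As an illustration of the steps above, the checks and the withdrawal could be wrapped as follows. This is a sketch under assumptions: it uses the standard Cosmos SDK distribution and bank subcommands, which evmd is assumed to expose unchanged. VALOPER (validator operator address), ACCOUNT (the matching account address), KEY_NAME, and CHAIN_ID are placeholders you must set; the function names and DRY_RUN switch are illustrative.

```shell
RPC="${RPC:-http://localhost:26657}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Step 1: show outstanding commission and delegation rewards.
check_rewards() {
  run evmd query distribution commission "$VALOPER" --node "$RPC"
  run evmd query distribution rewards "$ACCOUNT" --node "$RPC"
}

# Steps 2 and 5: show the account balance (needed for fees, and to
# confirm that the withdrawn amount arrived).
check_balance() {
  run evmd query bank balances "$ACCOUNT" --node "$RPC"
}

# Step 3: withdraw rewards; pass "commission" to add the --commission flag.
withdraw() {
  if [ "${1:-}" = "commission" ]; then
    run evmd tx distribution withdraw-rewards "$VALOPER" --commission \
      --from "$KEY_NAME" --chain-id "$CHAIN_ID" --node "$RPC"
  else
    run evmd tx distribution withdraw-rewards "$VALOPER" \
      --from "$KEY_NAME" --chain-id "$CHAIN_ID" --node "$RPC"
  fi
}
```

If the transaction fails with a gas or fee error, a common adjustment is to add the standard Cosmos SDK flags --gas auto and --gas-adjustment 1.5, plus a --gas-prices value appropriate for the chain (not specified in this document). Comparing check_balance output before and after the withdrawal is one way to verify the result.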

Runbook 4: Emergency Validator Shutdown

Purpose: Emergency procedure for immediate validator shutdown

When to Use:

  • Security breach detected

  • Double signing risk

  • Critical infrastructure failure

Steps:

  1. Immediate shutdown:

  2. Secure validator keys:

    Critical: The priv_validator_key.json is your validator's consensus key. If this is compromised, an attacker could double-sign blocks using your validator identity. Always store backups encrypted and in multiple secure locations (hardware security module, encrypted USB drive, secure cloud storage with strong encryption).

  3. Prevent automatic restart:

  4. Document incident:

  5. Notify stakeholders:

    • Inform delegators via social media

    • Update status page

    • Contact network coordinators if needed

Recovery: Once the incident is resolved, follow Runbook 1 (Validator Restart) or Runbook 2 (Consensus Failure Recovery), depending on the incident cause
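
The shutdown steps above could be scripted along these lines. This is a minimal sketch, assuming the systemd unit is named evmd, the node home is ~/.evmd (so the consensus key is at config/priv_validator_key.json), and gpg is available for encryption. The backup destination, incident log path, function names, and DRY_RUN switch are hypothetical.

```shell
EVMD_HOME="${EVMD_HOME:-$HOME/.evmd}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Steps 1 and 3: stop the service now and keep systemd from restarting it.
emergency_stop() {
  run sudo systemctl stop evmd
  run sudo systemctl disable evmd
  run sudo systemctl mask evmd
}

# Step 2: write a passphrase-encrypted copy of the consensus key.
# gpg prompts for the passphrase; the default destination is an assumption.
backup_key() {
  dest="${1:-$HOME/priv_validator_key.json.gpg}"
  run gpg --symmetric --cipher-algo AES256 --output "$dest" \
    "$EVMD_HOME/config/priv_validator_key.json"
}

# Step 4: append a UTC-timestamped line to an incident log.
log_incident() {
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $*" >> "${INCIDENT_LOG:-$HOME/incident.log}"
}
```

Masking the unit goes one step beyond disabling it: a masked unit cannot be started manually or as a dependency until it is unmasked with systemctl unmask evmd, which is useful while a double-signing risk is still being investigated.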
