# Checkpointing Operations

This document describes how ByteBiota handles checkpointing and state restoration in distributed simulations.

## Overview

ByteBiota uses distributed checkpointing to save simulation state across server and worker components. This enables resuming simulations after server restarts, worker failures, or planned maintenance.

## Checkpoint Discovery {#checkpoint-discovery}

The `list_checkpoints()` method in `CheckpointService` discovers available checkpoints by:

1. In-memory history: First checks the current session's checkpoint history
2. Directory scanning: Scans the `{data_dir}/checkpoints/server/` directory for `*_server.json` files
3. Metadata extraction: Reads checkpoint metadata from each file
4. Deduplication: Avoids duplicate entries between in-memory and disk sources
5. Sorting: Returns checkpoints sorted by timestamp (newest first)

This dual approach ensures checkpoints are found even after server restarts when the in-memory history is empty.
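
A minimal sketch of this dual-source discovery, assuming illustrative names (`data_dir` and an in-memory `history` list) rather than the actual `CheckpointService` attributes:

```python
import json
from pathlib import Path

def list_checkpoints(data_dir, history):
    """Merge in-memory history with on-disk checkpoints, newest first."""
    seen = {cp["checkpoint_id"] for cp in history}
    results = list(history)

    # Directory scan: pick up checkpoints written by earlier sessions.
    for path in Path(data_dir, "checkpoints", "server").glob("*_server.json"):
        try:
            meta = json.loads(path.read_text())["metadata"]
        except (json.JSONDecodeError, KeyError):
            continue  # skip corrupt or partially written files
        if meta["checkpoint_id"] not in seen:  # deduplicate against memory
            seen.add(meta["checkpoint_id"])
            results.append(meta)

    return sorted(results, key=lambda m: m["timestamp"], reverse=True)
```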

## State Restoration {#state-restoration}

The `_restore_server_state()` method restores server state from checkpoint data:

### Restored Components

- Worker assignments: Maps organism IDs to worker IDs
- Seed bank data: Genome storage and usage statistics
- Global statistics: Population metrics, diversity data, energy levels
- Worker manager state: Global step count and worker statistics
- Organism registry: Creates placeholders with correct organism IDs
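
A hypothetical sketch of this restore step, using field names from the checkpoint format documented below; the `server` attributes here are illustrative stand-ins for the real ones:

```python
def restore_server_state(server, checkpoint):
    """Apply the five restored components to a server object (illustrative)."""
    server.worker_assignments = dict(checkpoint.get("worker_assignments", {}))
    server.seed_bank_data = checkpoint.get("seed_bank_data", {})
    server.global_stats = checkpoint.get("global_stats", {})
    server.worker_stats = checkpoint.get("worker_stats", {})
    server.global_step = checkpoint.get("metadata", {}).get("global_step", 0)
    # Organism registry: placeholders with the correct IDs; workers fill in
    # the details on reconnection (see "Organism State Handling" below).
    server.organisms = {oid: {"id": oid} for oid in server.worker_assignments}
```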

### Organism State Handling

The system uses an optimized approach for organism restoration:

1. Checkpoint storage: Saves organism data with taxonomy when available (an optimization that avoids a 5-10 minute wait after restart)
2. Taxonomy preservation: Organisms with valid taxonomy data are saved with their classification
3. Placeholder creation: Creates organism entries with correct IDs from worker assignments
4. Worker synchronization: Workers report back full organism state on reconnection
5. Data population: The server populates organism details from worker execution results

This design balances checkpoint size against restoration accuracy: workers maintain detailed organism state locally while the server tracks assignments and aggregate statistics. The taxonomy optimization eliminates the wait for workers to report organism data back after a server restart.
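
A minimal sketch of steps 1 and 2 above, keeping taxonomy only when it is already populated (function and field names are illustrative):

```python
def snapshot_organisms(organisms):
    """Serialize organism entries, preserving taxonomy when available."""
    data = {}
    for oid, org in organisms.items():
        entry = {"id": oid, "energy": org.get("energy"), "size": org.get("size")}
        # Preserving classification avoids re-deriving it after a restart.
        if org.get("taxonomy"):
            entry["taxonomy"] = org["taxonomy"]
        data[oid] = entry
    return data
```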

## Checkpoint File Format

Server checkpoints are stored as JSON files with the following structure:

```json
{
  "metadata": {
    "checkpoint_id": "checkpoint_<timestamp>",
    "timestamp": <unix_timestamp>,
    "global_step": <step_count>,
    "simulation_id": "sim-<timestamp>-<uuid>",
    "worker_count": <number_of_workers>,
    "organism_count": <number_of_organisms>
  },
  "worker_assignments": {
    "<organism_id>": "<worker_id>",
    ...
  },
  "organism_data": {
    "<organism_id>": {
      "id": <organism_id>,
      "taxonomy": {
        "kingdom": "Digitalis Anomalica",
        "phylum": "Polymorphid",
        "class": "Migrata",
        "order": "Linearis",
        "family": "Standardidae",
        "genus": "Replicatus",
        "species": "Replicatus vulgaris"
      },
      "energy": <energy_value>,
      "age_in_instructions": <age>,
      "error_count": <error_count>,
      "size": <genome_size>,
      "max_energy": <max_energy>,
      "created_at": <timestamp>
    },
    ...
  },
  "seed_bank_data": { ... },
  "global_stats": { ... },
  "worker_stats": { ... }
}
```
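
Since the format is plain JSON, a checkpoint can be loaded and sanity-checked against the metadata block above with a few lines of Python (field names come from the format; the helper itself is illustrative):

```python
import json

REQUIRED_METADATA = ("checkpoint_id", "timestamp", "global_step",
                     "simulation_id", "worker_count", "organism_count")

def load_checkpoint(path):
    """Load a server checkpoint and verify the metadata fields shown above."""
    with open(path) as f:
        checkpoint = json.load(f)
    missing = [k for k in REQUIRED_METADATA if k not in checkpoint["metadata"]]
    if missing:
        raise ValueError(f"{path}: metadata missing {missing}")
    return checkpoint
```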

## Monitoring History Persistence {#monitoring-history-persistence}

Server checkpoints include recent monitoring history (`historical_stats`, capped by `StateAggregator.max_history`) to warm-start the monitoring charts after a restart or migration.

- Save: `historical_stats` is embedded in the server checkpoint JSON.
- Restore: `historical_stats` is loaded if present; older checkpoints without this field remain compatible.

This complements on-startup hydration from the server run log. If logs are rotated or unavailable, the latest checkpoint still restores recent chart history.
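
A sketch of the backward-compatible round trip, where `max_history` mirrors `StateAggregator.max_history` and the helpers are illustrative:

```python
def save_history(checkpoint, history, max_history):
    # Cap the embedded history at the aggregator's retention limit.
    checkpoint["historical_stats"] = list(history)[-max_history:]

def restore_history(checkpoint):
    # Older checkpoints lack this field; fall back to an empty history.
    return checkpoint.get("historical_stats", [])
```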

## Troubleshooting

### Checkpoint Not Found

If the server starts a fresh simulation instead of resuming:

1. Check that the `{data_dir}/checkpoints/server/` directory exists and contains `*_server.json` files
2. Verify that checkpoint files are valid JSON and contain the required metadata
3. Ensure `organism_count > 0` in the checkpoint metadata (empty checkpoints are ignored)
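
These checks can be scripted; the following one-off diagnostic assumes the directory layout and metadata fields documented above:

```python
import json
from pathlib import Path

def diagnose_checkpoints(data_dir):
    ckpt_dir = Path(data_dir, "checkpoints", "server")
    if not ckpt_dir.is_dir():
        print(f"missing directory: {ckpt_dir}")
        return
    for path in sorted(ckpt_dir.glob("*_server.json")):
        try:
            meta = json.loads(path.read_text())["metadata"]
        except (json.JSONDecodeError, KeyError) as exc:
            print(f"{path.name}: invalid ({exc})")
            continue
        count = meta.get("organism_count", 0)
        status = "resumable" if count > 0 else "empty (ignored)"
        print(f"{path.name}: organism_count={count} -> {status}")
```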

### Organism Assignment Warnings

If you see "Organism X not found in state aggregator" warnings:

1. This indicates a mismatch between the worker assignments and the organism registry
2. State restoration resolves this by creating registry placeholders whose IDs match the worker assignments
3. Workers should receive correct assignments once the server has restored state
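
To confirm whether a given checkpoint exhibits the mismatch, the two structures can be compared directly (a hypothetical helper):

```python
def find_unregistered_organisms(checkpoint):
    """Return organism IDs assigned to workers but absent from organism_data."""
    assigned = set(checkpoint.get("worker_assignments", {}))
    registered = set(checkpoint.get("organism_data", {}))
    return assigned - registered
```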

### Worker State Loss

If workers lose organism state after reconnection:

1. Workers clear local state when the `simulation_id` changes
2. The server triggers repopulation from the seed bank if needed
3. Monitor logs for repopulation events and seed bank usage
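
A worker-side sketch of this behavior; `simulation_id` comes from the checkpoint metadata, while the `worker` attribute names are assumptions:

```python
def on_reconnect(worker, server_simulation_id):
    # A changed simulation_id means the server started a different run,
    # so local organism state from the previous run would be stale.
    if worker.simulation_id != server_simulation_id:
        worker.organisms.clear()
        worker.simulation_id = server_simulation_id
        # The server repopulates from the seed bank if needed.
```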
