# Checkpointing Operations

This document describes how ByteBiota handles checkpointing and state restoration in distributed simulations.

## Overview

ByteBiota uses distributed checkpointing to save simulation state across server and worker components. This enables resuming simulations after server restarts, worker failures, or planned maintenance.

## Checkpoint Discovery {#checkpoint-discovery}

The `list_checkpoints()` method in `CheckpointService` discovers available checkpoints by:

1. In-memory history: First checks the current session's checkpoint history
2. Directory scanning: Scans the `{data_dir}/checkpoints/server/` directory for `*_server.json` files
3. Metadata extraction: Reads checkpoint metadata from each file
4. Deduplication: Avoids duplicate entries between in-memory and disk sources
5. Sorting: Returns checkpoints sorted by timestamp (newest first)

This dual approach ensures checkpoints are found even after server restarts when the in-memory history is empty.
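
A minimal sketch of this dual-source discovery, assuming illustrative names (`data_dir` and an in-memory `history` list) rather than the actual `CheckpointService` attributes:

```python
import json
from pathlib import Path

def list_checkpoints(data_dir, history):
    """Merge in-memory history with on-disk checkpoints, newest first."""
    seen = {cp["checkpoint_id"] for cp in history}
    results = list(history)

    # Directory scan: pick up checkpoints written by earlier sessions.
    for path in Path(data_dir, "checkpoints", "server").glob("*_server.json"):
        try:
            meta = json.loads(path.read_text())["metadata"]
        except (json.JSONDecodeError, KeyError):
            continue  # skip corrupt or partially written files
        if meta["checkpoint_id"] not in seen:  # deduplicate against memory
            seen.add(meta["checkpoint_id"])
            results.append(meta)

    return sorted(results, key=lambda m: m["timestamp"], reverse=True)
```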

## State Restoration {#state-restoration}

The `_restore_server_state()` method restores server state from checkpoint data:

### Restored Components

- Worker assignments: Maps organism IDs to worker IDs
- Seed bank data: Genome storage and usage statistics
- Global statistics: Population metrics, diversity data, energy levels
- Worker manager state: Global step count and worker statistics
- Organism registry: Creates placeholders with correct organism IDs
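
A hypothetical sketch of this restore step, using field names from the checkpoint format documented below; the `server` attributes here are illustrative stand-ins for the real ones:

```python
def restore_server_state(server, checkpoint):
    """Apply the five restored components to a server object (illustrative)."""
    server.worker_assignments = dict(checkpoint.get("worker_assignments", {}))
    server.seed_bank_data = checkpoint.get("seed_bank_data", {})
    server.global_stats = checkpoint.get("global_stats", {})
    server.worker_stats = checkpoint.get("worker_stats", {})
    server.global_step = checkpoint.get("metadata", {}).get("global_step", 0)
    # Organism registry: placeholders with the correct IDs; workers fill in
    # the details on reconnection (see "Organism State Handling" below).
    server.organisms = {oid: {"id": oid} for oid in server.worker_assignments}
```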

### Organism State Handling

The system uses an optimized approach for organism restoration:

1. Checkpoint storage: Saves organism data with taxonomy when available (an optimization that avoids a 5-10 minute wait after restart)
2. Taxonomy preservation: Organisms with valid taxonomy data are saved with their classification
3. Placeholder creation: Creates organism entries with correct IDs from worker assignments
4. Worker synchronization: Workers report back full organism state on reconnection
5. Data population: The server populates organism details from worker execution results

This design balances checkpoint size against restoration accuracy: workers maintain detailed organism state locally while the server tracks assignments and aggregate statistics. The taxonomy optimization eliminates the wait for workers to report organism data back after a server restart.
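
A minimal sketch of steps 1 and 2 above, keeping taxonomy only when it is already populated (function and field names are illustrative):

```python
def snapshot_organisms(organisms):
    """Serialize organism entries, preserving taxonomy when available."""
    data = {}
    for oid, org in organisms.items():
        entry = {"id": oid, "energy": org.get("energy"), "size": org.get("size")}
        # Preserving classification avoids re-deriving it after a restart.
        if org.get("taxonomy"):
            entry["taxonomy"] = org["taxonomy"]
        data[oid] = entry
    return data
```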

## Checkpoint File Format

Server checkpoints are stored as JSON files with the following structure:

```json
{
  "metadata": {
    "checkpoint_id": "checkpoint_<timestamp>",
    "timestamp": <unix_timestamp>,
    "global_step": <step_count>,
    "simulation_id": "sim-<timestamp>-<uuid>",
    "worker_count": <number_of_workers>,
    "organism_count": <number_of_organisms>
  },
  "worker_assignments": {
    "<organism_id>": "<worker_id>",
    ...
  },
  "organism_data": {
    "<organism_id>": {
      "id": <organism_id>,
      "taxonomy": {
        "kingdom": "Digitalis Anomalica",
        "phylum": "Polymorphid",
        "class": "Migrata",
        "order": "Linearis",
        "family": "Standardidae",
        "genus": "Replicatus",
        "species": "Replicatus vulgaris"
      },
      "energy": <energy_value>,
      "age_in_instructions": <age>,
      "error_count": <error_count>,
      "size": <genome_size>,
      "max_energy": <max_energy>,
      "created_at": <timestamp>
    },
    ...
  },
  "seed_bank_data": { ... },
  "global_stats": { ... },
  "worker_stats": { ... }
}
```
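
Since the format is plain JSON, a checkpoint can be loaded and sanity-checked against the metadata block above with a few lines of Python (field names come from the format; the helper itself is illustrative):

```python
import json

REQUIRED_METADATA = ("checkpoint_id", "timestamp", "global_step",
                     "simulation_id", "worker_count", "organism_count")

def load_checkpoint(path):
    """Load a server checkpoint and verify the metadata fields shown above."""
    with open(path) as f:
        checkpoint = json.load(f)
    missing = [k for k in REQUIRED_METADATA if k not in checkpoint["metadata"]]
    if missing:
        raise ValueError(f"{path}: metadata missing {missing}")
    return checkpoint
```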

## Monitoring History Persistence {#monitoring-history-persistence}

Server checkpoints include recent monitoring history (`historical_stats`, capped by `StateAggregator.max_history`) to warm-start the monitoring charts after a restart or migration.

- Save: `historical_stats` is embedded in the server checkpoint JSON.
- Restore: `historical_stats` is loaded if present; older checkpoints without this field remain compatible.

This complements on-startup hydration from the server run log. If logs are rotated or unavailable, the latest checkpoint still restores recent chart history.
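
A sketch of the backward-compatible round trip, where `max_history` mirrors `StateAggregator.max_history` and the helpers are illustrative:

```python
def save_history(checkpoint, history, max_history):
    # Cap the embedded history at the aggregator's retention limit.
    checkpoint["historical_stats"] = list(history)[-max_history:]

def restore_history(checkpoint):
    # Older checkpoints lack this field; fall back to an empty history.
    return checkpoint.get("historical_stats", [])
```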

## Troubleshooting

### Checkpoint Not Found

If the server starts a fresh simulation instead of resuming:

1. Check that the `{data_dir}/checkpoints/server/` directory exists and contains `*_server.json` files
2. Verify that checkpoint files are valid JSON and contain the required metadata
3. Ensure `organism_count > 0` in the checkpoint metadata (empty checkpoints are ignored)
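
These checks can be scripted; the following one-off diagnostic assumes the directory layout and metadata fields documented above:

```python
import json
from pathlib import Path

def diagnose_checkpoints(data_dir):
    ckpt_dir = Path(data_dir, "checkpoints", "server")
    if not ckpt_dir.is_dir():
        print(f"missing directory: {ckpt_dir}")
        return
    for path in sorted(ckpt_dir.glob("*_server.json")):
        try:
            meta = json.loads(path.read_text())["metadata"]
        except (json.JSONDecodeError, KeyError) as exc:
            print(f"{path.name}: invalid ({exc})")
            continue
        count = meta.get("organism_count", 0)
        status = "resumable" if count > 0 else "empty (ignored)"
        print(f"{path.name}: organism_count={count} -> {status}")
```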

### Organism Assignment Warnings

If you see "Organism X not found in state aggregator" warnings:

1. This indicates a mismatch between the worker assignments and the organism registry
2. State restoration resolves this by creating registry placeholders whose IDs match the worker assignments
3. Workers should receive correct assignments once the server has restored state
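
To confirm whether a given checkpoint exhibits the mismatch, the two structures can be compared directly (a hypothetical helper):

```python
def find_unregistered_organisms(checkpoint):
    """Return organism IDs assigned to workers but absent from organism_data."""
    assigned = set(checkpoint.get("worker_assignments", {}))
    registered = set(checkpoint.get("organism_data", {}))
    return assigned - registered
```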

### Worker State Loss

If workers lose organism state after reconnection:

1. Workers clear local state when the `simulation_id` changes
2. The server triggers repopulation from the seed bank if needed
3. Monitor logs for repopulation events and seed bank usage
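
A worker-side sketch of this behavior; `simulation_id` comes from the checkpoint metadata, while the `worker` attribute names are assumptions:

```python
def on_reconnect(worker, server_simulation_id):
    # A changed simulation_id means the server started a different run,
    # so local organism state from the previous run would be stale.
    if worker.simulation_id != server_simulation_id:
        worker.organisms.clear()
        worker.simulation_id = server_simulation_id
        # The server repopulates from the seed bank if needed.
```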
