Distributed architecture

This page documents the distributed ByteBiota architecture, including worker → server result payloads and server-side aggregation.

Architecture Overview {#architecture-overview}

The distributed ByteBiota system consists of three main components:

  1. Server - Centralized coordinator managing seed bank, global state aggregation, and worker coordination
  2. Worker - Distributed execution nodes running organism time slices with configurable resource limits
  3. Web UI - Integrated with server for real-time monitoring and control

Communication Pattern {#communication-pattern}

  • Workers establish a persistent WebSocket for assignments, result batches, and config pushes
  • HTTP is reserved for lifecycle endpoints (register/deregister), configuration snapshots, and rare fallbacks (assignment fetch when the queue is empty, organism hydration, bulk-ack)
  • Server still exposes REST APIs for administrative tooling, but steady-state orchestration stays on WebSocket
  • Critical: Server actively pushes work assignments to workers via background task

Work Assignment System {#work-assignment-system}

The server uses a background task (_background_work_assignment) that runs every second to push work assignments to all active workers. This ensures:

  • Responsive work distribution: Workers receive assignments within 1 second
  • Automatic recovery: Dead organisms are cleaned up from assignments
  • Environment synchronization: Server state is included in work assignments
  • Queue management: Only pushes work when worker queues are low

Key Methods:
- _push_work_if_needed(worker_id) - Pushes work assignment to specific worker
- get_work_assignment(worker_id) - Creates work assignment with organism data
- Background task runs continuously when simulation is active
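
A minimal sketch of the push loop, assuming an asyncio background task; the worker_manager interface (simulation_active, active_worker_ids) is illustrative, while _push_work_if_needed and the one-second cadence come from the description above:

import asyncio

async def _background_work_assignment(worker_manager, interval: float = 1.0):
    """Sketch: push work to every active worker roughly once per second."""
    while worker_manager.simulation_active:              # only while simulating
        for worker_id in worker_manager.active_worker_ids():
            # Pushes only when the worker's queue is low (queue management)
            await worker_manager._push_work_if_needed(worker_id)
        await asyncio.sleep(interval)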

Component Details {#component-details}

Server Component {#server-component}

Auto-reload behaviour: the server CLI now starts in no-reload mode by default. Use --reload when you want the development watcher to restart on code changes.

Location: src/bytebiota/server/

Key Files:
- server.py - Main FastAPI server with core orchestration, WebSocket handlers, and lifecycle management
- worker_manager.py - Worker lifecycle and work assignment management
- seed_bank_service.py - Centralized seed bank with diversity management
- state_aggregator.py - Aggregates statistics from all workers
- checkpoint_service.py - Coordinates distributed checkpointing
- websocket_manager.py - WebSocket connection management and message handling
- config.py - Server-specific configuration

Service Modules (refactored from server.py):
- analytics_service.py - Advanced analytics endpoints (phylogenetic analysis, population forecasting, etc.)
- simulation_api_service.py - Simulation control, organism data, taxonomy, and wiki endpoints
- monitoring_api_service.py - System health monitoring and distributed system metrics
- ui_routes_service.py - Web UI page routes and template rendering

Refactoring Benefits:
- Modularity: Each service handles a specific domain (analytics, simulation, monitoring, UI)
- Maintainability: Easier to locate and modify specific functionality
- Testability: Services can be tested independently
- Scalability: Services can be extracted to separate processes if needed
- Code Size: Reduced server.py from 2581 to 1637 lines (a 36.6% reduction)

Worker Component {#worker-component}

Location: src/bytebiota/worker/

Core Files:
- worker.py - Main worker orchestration and lifecycle management
- executor.py - Local simulation execution engine
- resource_limiter.py - CPU/memory throttling and monitoring
- sync_client.py - HTTP client for server communication
- config.py - Worker-specific configuration

Manager Modules (refactored from worker.py):
- batch_manager.py - Adaptive batching logic and result coordination
- assignment_handler.py - Work assignment coordination and execution
- organism_manager.py - Organism lifecycle and factory management
- config_sync.py - Configuration synchronization with server
- websocket_client.py - WebSocket communication with server (fixed race condition in assignment queue)
- connection_manager.py - Network connection and WebSocket management
- statistics_tracker.py - Progress monitoring and metrics collection

Supporting Files:
- offline_cache.py - Offline result caching for resilience
- error_handler.py - Error handling and recovery
- resource_detector.py - System resource detection

Platform Support:

ByteBiota workers are designed for cross-platform deployment:

  • Windows 10/11: Full support with Windows-specific process priority classes
  • macOS 12+: Full support with Unix nice levels (may require elevated permissions for negative values)
  • Linux (Ubuntu 20.04+): Full support with standard Unix nice levels

Cross-Platform Features:
- Process priority management using platform-appropriate APIs (Windows priority classes vs Unix nice levels)
- CPU detection with fallbacks for all platforms (os.cpu_count(), multiprocessing.cpu_count(), platform-specific methods)
- WebSocket event loop compatibility (ProactorEventLoop on Windows, SelectorEventLoop on Unix)
- File path handling using pathlib.Path for cross-platform compatibility

Resource Management:

Workers support configurable resource limits with presets:

  • minimal: 10% CPU, 256MB memory
  • background: 25% CPU, 512MB memory
  • standard: 50% CPU, 1024MB memory
  • full: 100% CPU, 4096MB memory

Console Output:

Workers show essential information only:

  • Always shown: Startup banner, registration status, connection events, periodic progress summaries (once per minute), error messages, and final statistics
  • Verbose mode (--verbose flag): Additionally shows detailed organism execution, genome verification, memory allocation, and per-assignment details

Use verbose mode for debugging. Normal mode provides a clean user experience for production runs.

Worker Architecture:

The worker has been refactored into a modular architecture with specialized manager classes:

  • DistributedWorker: Main orchestration class that coordinates all managers
  • BatchManager: Handles adaptive batching and result submission optimization
  • AssignmentHandler: Manages work assignment execution and local queue coordination
  • OrganismManager: Handles organism creation, caching, and lifecycle management
  • ConfigSyncManager: Manages configuration synchronization with the server
  • ConnectionManager: Handles network parameters, WebSocket, and offline mode
  • StatisticsTracker: Tracks progress, metrics, and performance statistics

Execution safeguards {#worker-execution-safeguards}

Workers apply the same stochastic reaper heuristics as the standalone simulation. Each time slice, LocalExecutor calculates population-average age and error figures, applies the configured multipliers, and reaps only when the reap_chance roll succeeds and the organism exceeds the dynamic thresholds. This keeps long-lived lineages alive under heavy batching while maintaining consistent selection pressure across workers.
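
A minimal sketch of the reap decision, assuming organism records expose age and errors and a config object carries the multipliers and reap_chance (attribute names are illustrative):

import random
from statistics import mean

def should_reap(organism, population, cfg, rng=random) -> bool:
    """Sketch of the stochastic reaper roll described above."""
    if not population:
        return False
    # Dynamic thresholds: population averages scaled by the configured multipliers
    age_threshold = mean(o.age for o in population) * cfg.age_multiplier
    error_threshold = mean(o.errors for o in population) * cfg.error_multiplier
    exceeds = organism.age > age_threshold or organism.errors > error_threshold
    # Reap only when a threshold is exceeded AND the chance roll succeeds
    return exceeds and rng.random() < cfg.reap_chance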

Stagnation seeding {#stagnation-seeding}

StateAggregator honours reaper.stagnation_spawn_interval / _count the same way LocalExecutor does. When births stall for that many global steps, the server injects fresh ancestor genomes (seed-bank first, ancestor fallback) and assigns them to workers immediately. This prevents long-running distributed experiments from freezing at a fixed census.

Simulation resets {#simulation-resets}

Workers now treat the simulation_id returned from HTTP assignments as authoritative. If the server restarts with a new run (for example via --reset), the next assignment response carries the new identifier; the worker clears its local queues, re-registers automatically, and establishes a fresh WebSocket without manual intervention.

Worker Loop:
1. Register with server (one-time)
2. Poll server for work assignment
3. Execute assigned organism time slices locally
4. Submit results (organism state updates, births, deaths)
5. Periodic heartbeat
6. Handle seed bank synchronization
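
A minimal sketch of this loop over a hypothetical client/executor interface (shown in its HTTP-fallback shape; steady-state traffic actually flows over WebSocket, as described below):

import time

def run_worker(client, executor, heartbeat_interval: float = 10.0):
    """Sketch of the six-step worker loop above."""
    worker_id = client.register()                      # 1. one-time registration
    last_heartbeat = time.monotonic()
    while True:
        assignment = client.get_assignment(worker_id)  # 2. obtain work
        if assignment:
            results = executor.execute(assignment)     # 3. run time slices locally
            client.submit_results(worker_id, results)  # 4. updates, births, deaths
        if time.monotonic() - last_heartbeat >= heartbeat_interval:
            client.heartbeat(worker_id)                # 5. periodic heartbeat
            last_heartbeat = time.monotonic()
        client.sync_seed_bank(worker_id)               # 6. seed bank synchronization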

Adaptive batching {#adaptive-batching}

The distributed worker system implements adaptive batching to optimize network communication and reduce server load. The BatchManager class handles intelligent batching of execution results based on work rate and time intervals.

Configuration Parameters

The AdaptiveBatchConfig class supports the following parameters:

  • target_interval_seconds: Target time interval between batch submissions (default: 300)
  • min_batch_size: Minimum number of results per batch (default: 8)
  • max_batch_size: Maximum number of results per batch (default: 200)
  • adjustment_factor: Factor for batch size adjustments (default: 0.15)
  • work_rate_threshold: Threshold for work rate calculations (default: 200)

Adaptive batching is always enabled on distributed workers; the CLI and environment toggle have been removed to prevent accidental regressions.
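
A sketch of the configuration object with the documented defaults:

from dataclasses import dataclass

@dataclass
class AdaptiveBatchConfig:
    """Defaults mirror the parameters listed above."""
    target_interval_seconds: float = 300.0
    min_batch_size: int = 8
    max_batch_size: int = 200
    adjustment_factor: float = 0.15
    work_rate_threshold: float = 200.0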

Batching Logic

The adaptive batching system uses multiple criteria to determine when to submit batches:

  1. Time-based submission: Submit when target interval is reached
  2. Size-based submission: Submit when current adaptive batch size is reached
  3. Partial submission: Submit when minimum batch size is reached and half the target interval has passed

The system continuously adjusts batch size based on submission intervals to optimize performance.
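
A sketch of these criteria, reusing the AdaptiveBatchConfig sketch above; the direction of the adjustment_factor nudge is an assumption:

import time

class BatchSubmissionPolicy:
    """Sketch of the three submission criteria and the interval-based adjustment."""

    def __init__(self, cfg):
        self.cfg = cfg                                   # an AdaptiveBatchConfig
        self.adaptive_batch_size = cfg.min_batch_size    # adjusted over time
        self.last_submit = time.monotonic()

    def should_submit(self, pending: int) -> bool:
        elapsed = time.monotonic() - self.last_submit
        if elapsed >= self.cfg.target_interval_seconds:  # 1. time-based
            return True
        if pending >= self.adaptive_batch_size:          # 2. size-based
            return True
        # 3. partial: minimum size reached and half the target interval passed
        return (pending >= self.cfg.min_batch_size
                and elapsed >= self.cfg.target_interval_seconds / 2)

    def record_submission(self) -> None:
        # Nudge batch size so submissions land near the target interval
        elapsed = time.monotonic() - self.last_submit
        factor = (1 + self.cfg.adjustment_factor
                  if elapsed < self.cfg.target_interval_seconds
                  else 1 - self.cfg.adjustment_factor)
        self.adaptive_batch_size = max(self.cfg.min_batch_size,
                                       min(self.cfg.max_batch_size,
                                           int(self.adaptive_batch_size * factor)))
        self.last_submit = time.monotonic()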

API Endpoints {#api-endpoints}

# Worker Management
POST   /api/workers/register      # Worker registration
POST   /api/workers/{id}/heartbeat # Worker heartbeat
DELETE /api/workers/{id}          # Worker deregistration

# Work Assignment
GET    /api/workers/{id}/assignment # Get work for worker
POST   /api/workers/{id}/results    # Submit execution results

# Seed Bank (Centralized)
GET    /api/seedbank/genomes       # Get genome from seed bank
POST   /api/seedbank/genomes       # Submit genome to seed bank
GET    /api/seedbank/stats         # Seed bank statistics

# Global State
GET    /api/simulation/stats       # Aggregated simulation stats
GET    /api/simulation/organisms   # Combined organism data
POST   /api/simulation/control     # Start/stop/pause

# Checkpointing
POST   /api/checkpoint/create      # Trigger distributed checkpoint
GET    /api/checkpoint/status      # Checkpoint status

# WebSocket (UI only)
WS     /ws/realtime               # Real-time stats for web UI

Work Assignment Strategy {#work-assignment-strategy}

Workers execute time slices for assigned organisms and return results.

Assignment Model:

from dataclasses import dataclass
from typing import List

@dataclass
class WorkAssignment:
    organism_ids: List[int]          # Organisms assigned to this worker
    time_slice_steps: int            # Steps to execute per organism
    seed_bank_genomes: List[bytes]   # New genomes for reproduction
    simulation_config: Config        # Current simulation parameters

Assignment Logic:
- Round-robin distribution of organisms across available workers
- Dynamic load balancing based on worker resource utilization
- Seed bank synchronization to maintain genetic diversity
- Configurable time slice size (default: 1000 steps per organism)
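
A minimal sketch of the round-robin step (dynamic load balancing and seed-bank sync omitted):

from itertools import cycle

def round_robin_assign(organism_ids, worker_ids):
    """Distribute organisms across workers in round-robin order."""
    assignments = {w: [] for w in worker_ids}
    for organism_id, worker_id in zip(organism_ids, cycle(worker_ids)):
        assignments[worker_id].append(organism_id)
    return assignments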

Worker local stats schema {#worker-local-stats-schema}

Workers submit execution results that include local_stats with the following fields. Snapshot batches flush on the adaptive schedule (default target 300 s with a minimum of 8 assignments) to balance server load with timely observability. Emergency flushes fire at 2× the target or when the batch reaches the hard cap.

  • steps_executed: number
  • execution_time: seconds
  • organisms_processed: number
  • memory_occupancy: float (0–1) of local soup occupancy
  • environment_stats:
    • total_resources: number
    • total_signals: number
    • task_attempts: number
    • task_successes: number
  • mutation_stats:
    • global_instructions: number
    • copy_bit_flips: number
    • background_flips: number
    • insertions: number
    • deletions: number
    • indels: number (insertions + deletions; convenience aggregate)
  • resource_usage: throttle stats
  • execution_stats: cumulative execution counters
  • allocation_failures: number of MAL allocation failures observed during the assignment
  • allocation_failure_rate: failures per executed step (1.0 when no progress occurs but failures are logged)

Workers also include organism update records with basic phenotype data and genome bytes where applicable.

Aggregated memory occupancy {#aggregated-memory-occupancy}

The server aggregates a global memory_occupancy value by summing the sizes of all tracked organisms and dividing by the configured soup size from Config.soup.size.

Formula: occupancy = (∑ organism.size) / soup_size.
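
A one-line sketch of this aggregation, assuming organism records expose a size attribute:

def aggregated_memory_occupancy(organisms, soup_size: int) -> float:
    """occupancy = (sum of organism sizes) / configured soup size"""
    return sum(o.size for o in organisms) / soup_size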

WebSocket Communication {#websocket-communication}

The distributed system uses WebSocket for real-time, bidirectional communication between server and workers, eliminating HTTP polling and reducing network overhead.

Architecture Overview {#websocket-architecture}

  1. WebSocket-First Communication: Assignments, snapshot batches, heartbeats, and tuning deltas stay on the worker WebSocket
  2. Server-Push Assignments: Server pushes work assignments to workers when queue is low
  3. Real-Time Result Submission: Workers submit results immediately via WebSocket
  4. Offline Operation: Workers cache results locally when disconnected, resume on reconnection
  5. Adaptive Heartbeats: Heartbeat frequency adapts based on worker queue size (30/60/120s)
  6. Limited HTTP Fallback: HTTP remains for registration, deregistration, config snapshots, and last-resort assignment/organism fetches

Message Protocol {#websocket-message-protocol}

All messages use a JSON envelope with optional gzip compression:

{
  "type": "WORK_ASSIGNMENT|RESULT_SUBMISSION|HEARTBEAT|...",
  "simulation_id": "sim-12345-abc",
  "timestamp": 1234567890.123,
  "compressed": false,
  "payload": { ... }
}

Message Types:
- HANDSHAKE: Initial connection setup with simulation_id and config
- WORK_ASSIGNMENT: Work assignment with organism data included
- RESULT_SUBMISSION: Execution results with submission_id for deduplication
- RESULT_ACK: Server acknowledgment of result submission with submission_id and accepted status
- HEARTBEAT: Worker status (queue size, cache size, offline mode)
- CONFIG_UPDATE: Server-pushed configuration changes
- SIMULATION_CHANGE: Notification of simulation restart/change
- CHUNK: Large message chunk for reliable transmission
- ERROR: Error messages with backpressure handling
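
A minimal sketch of envelope construction; the 1KB threshold matches the compression rule under Request Optimization, while base64 encoding of the compressed body is an assumption:

import base64
import gzip
import json
import time

COMPRESSION_THRESHOLD = 1024  # messages >1KB are gzip compressed

def build_envelope(msg_type: str, simulation_id: str, payload: dict) -> str:
    """Build the JSON envelope, compressing large payloads."""
    body = json.dumps(payload).encode()
    compressed = len(body) > COMPRESSION_THRESHOLD
    return json.dumps({
        "type": msg_type,
        "simulation_id": simulation_id,
        "timestamp": time.time(),
        "compressed": compressed,
        # Wire encoding of the compressed body is an assumption (base64 shown)
        "payload": base64.b64encode(gzip.compress(body)).decode("ascii")
                   if compressed else payload,
    })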

Result Acknowledgment {#result-acknowledgment}

Server sends acknowledgments for result submissions:

{
  "type": "RESULT_ACK",
  "payload": {
    "submission_id": "uuid-string",
    "accepted": true,
    "dedup": false,
    "message": "Success"
  }
}

Fields:
- submission_id: Matches the submission_id from the original RESULT_SUBMISSION
- accepted: Boolean indicating if the result was accepted and processed
- dedup: Boolean indicating if this was a duplicate submission
- message: Human-readable status message

Organism Data Transmission {#organism-data-transmission}

Work assignments include complete organism data to eliminate HTTP fallback:

{
  "type": "WORK_ASSIGNMENT",
  "payload": {
    "assignment_id": "uuid",
    "organism_ids": [1, 2, 3],
    "organism_data": [
      {
        "id": 1,
        "genome": [...],
        "energy": 100,
        "registers": {...},
        "start_addr": 1000
      }
    ],
    "time_slice_steps": 1000,
    "global_step": 42
  }
}

Dead Organism Handling:
- Organisms that die are removed from both state aggregator and worker assignments
- Work assignments filter out dead organisms before transmission
- Workers receive only valid organisms, preventing HTTP fallback errors

Offline Operation {#offline-operation}

Workers operate in offline mode when WebSocket is disconnected:

  1. Result Caching: Cache results locally up to 100MB (configurable)
  2. Work Continuation: Continue processing local work queue
  3. Cache Management: FIFO cleanup when cache is full
  4. Reconnection: Submit cached results on reconnection
  5. Simulation Change: Clear cache if simulation_id changes
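
A minimal sketch of the cache behaviour (byte-size cap, FIFO cleanup, drain on reconnect), kept in memory here for brevity:

from collections import deque

class OfflineCache:
    """Sketch: FIFO result cache with a configurable byte budget."""

    def __init__(self, max_mb: int = 100):
        self.max_bytes = max_mb * 1024 * 1024
        self.entries = deque()   # FIFO of serialized results (bytes)
        self.total = 0

    def add(self, result: bytes) -> None:
        self.entries.append(result)
        self.total += len(result)
        while self.total > self.max_bytes:            # FIFO cleanup when full
            self.total -= len(self.entries.popleft())

    def drain(self):
        """Yield cached results oldest-first for submission on reconnection."""
        while self.entries:
            item = self.entries.popleft()
            self.total -= len(item)
            yield item

    def clear(self) -> None:
        """Drop everything when the simulation_id changes."""
        self.entries.clear()
        self.total = 0
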
Connection Management {#connection-management}

  • Registration: HTTP POST /api/workers/register → receive worker_id
  • WebSocket Connection: ws://host:port/ws/worker/{worker_id}
  • Reconnection: Indefinite attempts with exponential backoff (1→300s, 5 min max)
  • Health Monitoring: Server tracks heartbeat timeouts (3× interval)
  • Deduplication: Result submissions use (worker_id, submission_id) with 1-hour TTL
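
A minimal sketch of the deduplication check with lazy TTL eviction:

import time

class DedupCache:
    """Sketch: duplicate detection keyed by (worker_id, submission_id)."""

    def __init__(self, ttl_seconds: float = 3600.0):   # 1-hour TTL
        self.ttl = ttl_seconds
        self.seen = {}                                 # key -> first-seen time

    def is_duplicate(self, worker_id: str, submission_id: str) -> bool:
        now = time.monotonic()
        # Lazily evict expired entries
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        key = (worker_id, submission_id)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False
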
Worker ID Generation {#worker-id-generation}

Workers use persistent, machine-based IDs by default:

Default Format: worker-{hostname}-{machine-hash}
- Example: worker-macbook-pro-a1b2c3

Custom IDs:

# Environment variable
export WORKER_ID="gpu-worker"
python -m bytebiota worker
# → worker-gpu-worker

# CLI argument
python -m bytebiota worker --worker-id "my-worker"
# → worker-my-worker

Multiple Workers:
The server automatically handles collisions by adding suffixes:
- First worker: worker-macbook-pro-a1b2c3
- Second worker: worker-macbook-pro-a1b2c3-1
- Third worker: worker-macbook-pro-a1b2c3-2

Benefits:
- Persistent IDs across restarts
- Work statistics preserved
- Automatic multi-worker support
- No manual instance management

Reconnection and Recovery {#reconnection-recovery}

The system implements robust reconnection mechanisms to handle server restarts and network interruptions:

Indefinite Reconnection Strategy:
- Workers attempt reconnection indefinitely with exponential backoff
- Backoff intervals: 1, 2, 4, 8, 16, 32, 60, 120, 300 seconds (5 min max)
- No hard limits on reconnection attempts - workers will always try to reconnect
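
A minimal sketch of the reconnection loop; connect and the exception type are placeholders:

import time

BACKOFF_SCHEDULE = [1, 2, 4, 8, 16, 32, 60, 120, 300]  # seconds, 5 min max

def reconnect_forever(connect):
    """Retry indefinitely, walking the documented backoff schedule."""
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            delay = BACKOFF_SCHEDULE[min(attempt, len(BACKOFF_SCHEDULE) - 1)]
            time.sleep(delay)
            attempt += 1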

Simulation Continuity:
- Same Simulation: When reconnecting to the same simulation, workers:
  - Submit cached results generated during the offline period
  - Sync configuration with the server
  - Resume work assignments immediately
  - Maintain simulation state continuity
- New Simulation: When the server starts a new simulation, workers:
  - Clear local state and cached results
  - Re-register with a new worker ID
  - Fetch updated configuration
  - Start fresh with the new simulation

Connection State Management:
- HTTP registration and WebSocket connection are coordinated
- Successful HTTP re-registration resets WebSocket connection state
- Circuit breaker prevents permanent disconnection
- Periodic health checks detect and recover from connection issues

Server-Side Recovery:
- Server assigns existing organisms to reconnected workers
- Ensures running simulations continue with reconnected workers
- Maintains work distribution across all active workers
- Handles worker ID collisions intelligently with force mode support

Assignment Queue Synchronization {#assignment-queue-synchronization}

The WebSocket client uses proper synchronization to handle the race condition between async message handlers and the synchronous main work loop:

  • Async Context: Uses get_nowait() when called from async context
  • Sync Context: Uses run_until_complete() with timeout when called from sync context
  • Fallback: Handles cases where no event loop is running
  • Debugging: Logs queue operations to track assignment flow

This ensures work assignments are reliably retrieved regardless of the calling context.
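
A sketch of the idea; note that the sync branch below uses run_coroutine_threadsafe (assuming the owning event loop runs on a background thread) rather than the run_until_complete call described above, a deliberate substitution to keep the example thread-safe:

import asyncio
import concurrent.futures

class AssignmentQueue:
    """Sketch of context-aware retrieval from the assignment queue."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop                  # event loop that owns the queue
        self.queue = asyncio.Queue()

    def pop(self, timeout: float = 1.0):
        try:
            asyncio.get_running_loop()    # raises RuntimeError in sync context
        except RuntimeError:
            in_async_context = False
        else:
            in_async_context = True

        if in_async_context:
            # Async context: non-blocking read, never stall the event loop
            try:
                return self.queue.get_nowait()
            except asyncio.QueueEmpty:
                return None

        # Sync context: hand the read to the owning loop, bounded by a timeout
        future = asyncio.run_coroutine_threadsafe(self.queue.get(), self.loop)
        try:
            return future.result(timeout)
        except concurrent.futures.TimeoutError:
            future.cancel()
            return None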

Request Optimization {#request-optimization}

The distributed system implements several optimizations to prevent server overload:

  1. WebSocket Push: Eliminates HTTP polling entirely
  2. Result batching: Workers target 5-minute snapshot submissions (configurable) and fall back to an emergency flush at 2× the target or when the batch hits the hard cap
  3. Adaptive Heartbeats: Heartbeats adapt to queue size (30/60/120s)
  4. Rate Limiting: Server implements rate limiting to prevent abuse
  5. Compression: Messages >1KB are gzip compressed
  6. Chunking: Messages >10MB are split into chunks

Result streaming pipeline {#result-streaming}

Workers split result traffic into a fast event lane and a slower snapshot lane to keep the server responsive while suppressing redundant data:

  • Event fast lane {#result-event-fast-lane}: births, deaths, and seed submissions are packaged as lean payloads and flushed immediately so WorkerManager.update_work_results() can react in real time. These envelopes carry only lifecycle events and omit organism snapshots. Event results have step_count=0 and do not trigger tuning system assessments, to prevent excessive tuning activity.
  • Snapshot batching {#result-snapshot-batching}: assignment summaries accumulate inside BatchManager until both the target interval (default 300 s) and the adaptive minimum batch size (default 8) are satisfied. An emergency flush fires at 2× the target interval or when the batch reaches 200 assignments. BATCH_TARGET_INTERVAL_SECONDS, BATCH_MIN_SIZE, and BATCH_MAX_SIZE override these thresholds.
  • Per-organism throttling {#result-snapshot-throttling}: LocalExecutor records the last snapshot per organism and suppresses repeats unless one of the following occurs: age advanced by ≥25 000 instructions, energy shifted by ≥5 %, error count increased, or 600 s elapsed since the last broadcast. Runtime overrides (snapshot_age_threshold, snapshot_energy_delta, snapshot_heartbeat_interval) let tuning policies adjust the heuristics.
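
A minimal sketch of the suppression rules with the documented defaults; measuring the 5 % energy shift against the last broadcast value is an assumption:

import time

def should_broadcast(org, last_snapshot,
                     age_threshold: int = 25_000,
                     energy_delta: float = 0.05,
                     heartbeat_interval: float = 600.0) -> bool:
    """Decide whether an organism snapshot is worth re-broadcasting."""
    if last_snapshot is None:
        return True                                    # first snapshot always goes out
    return (org.age - last_snapshot.age >= age_threshold
            # Energy shift relative to the last broadcast value (assumption)
            or abs(org.energy - last_snapshot.energy)
               >= energy_delta * max(last_snapshot.energy, 1)
            or org.errors > last_snapshot.errors       # any new error forces a snapshot
            or time.monotonic() - last_snapshot.sent_at >= heartbeat_interval)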

Configuration {#worker-config}

Key environment variables for tuning worker behavior:

WebSocket Settings:
- WEBSOCKET_URL: WebSocket server URL (auto-derived from SERVER_URL)
- OFFLINE_CACHE_MAX_MB: Maximum offline cache size (default: 100MB)
- ADAPTIVE_HEARTBEAT: Enable adaptive heartbeat (default: true)
- WEBSOCKET_RECONNECT_MAX_ATTEMPTS: Max reconnection attempts (default: unlimited)
- WEBSOCKET_MAX_MESSAGE_SIZE: Max message size (default: 10MB)

Batching Controls:
- BATCH_SIZE: Number of work cycles per batch (default: 8, higher = more efficient, less responsive)
- BATCH_TARGET_INTERVAL_SECONDS: Target delay between snapshot flushes (default: 300)
- BATCH_MIN_SIZE: Minimum merged assignments before flushing (default: 8)
- BATCH_MAX_SIZE: Hard cap for backlog size before forcing submission (default: 200)

Adaptive batching is always enabled; the legacy ADAPTIVE_BATCHING environment variable is now ignored.

Server Settings:
- WEBSOCKET_ENABLED: Enable WebSocket communication (default: true)
- WEBSOCKET_MAX_MESSAGE_SIZE: Max message size (default: 10MB)
- WEBSOCKET_COMPRESSION_THRESHOLD: Compression threshold (default: 1KB)

Legacy HTTP Settings (fallback only):
- POLL_INTERVAL: Assignment polling interval (default: 2.0s)
- HEARTBEAT_INTERVAL: Heartbeat frequency (default: 10.0s)

With WebSocket enabled, network traffic is reduced by ~95% compared to HTTP polling.

Monitoring Metrics {#monitoring-metrics}

The StateAggregator calculates comprehensive metrics for the monitoring dashboard by aggregating data from all workers and organism states. These metrics are stored in historical_stats and served via the /api/simulation/metrics endpoint.

Historical stats sampling {#historical-stats-sampling}

The server supports two retrieval modes for chart time series via get_historical_stats(limit, time_slice=False):

  • Recent window (default): returns the most recent limit points. This preserves legacy behavior and is used by all existing callers that do not specify time_slice.
  • Time-sliced: when time_slice=true is passed through the API, the server samples evenly across the entire retained history to produce limit representative points, always ensuring the latest point is included. This mirrors the standalone monitor's /data?time_slice=true behavior.

API usage:

GET /api/simulation/metrics?points=1000&time_slice=true   # time-sliced across full timeline
GET /api/simulation/metrics?points=1000&time_slice=false  # most recent 1000 points
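
A minimal sketch of the time-sliced mode:

def sample_time_sliced(history: list, limit: int) -> list:
    """Sample `limit` points evenly across the retained history."""
    if limit < 2 or len(history) <= limit:
        return list(history)              # falls back to everything available
    step = (len(history) - 1) / (limit - 1)
    # i = limit - 1 maps to the last index, so the latest point is always included
    return [history[round(i * step)] for i in range(limit)]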

Organism-Level Metrics {#organism-level-metrics}

Basic Metrics:
- average_size: Mean organism size in bytes
- average_age: Mean age in instructions executed
- average_errors: Mean error count per organism
- average_energy: Mean energy level per organism

Task Metrics:
- total_task_attempts: Sum of task attempts across all organisms
- total_task_rewards: Sum of task rewards earned
- active_priority_boosts: Count of organisms with priority boosts

Environment Metrics {#environment-metrics}

Resource Levels:
- average_resource_level: Mean resource level per organism
- average_signal_level: Mean signal level per organism
- current_task_value: Current environment task reward value

Economic Metrics {#economic-metrics}

Resource Utilization:
- storage_utilization: Fraction of soup memory used by organisms
- energy_efficiency: Energy consumed per instruction ratio
- total_cpu_cost: Total CPU cost (instructions × cost per instruction)
- total_rent_collected: Total rent collected for memory usage
- seed_usage_events: Number of seed bank usage events

Allocation Metrics {#allocation-metrics}

Memory Pressure:
- allocation_failures: Sum of the most recent MAL failures reported by each worker
- allocation_failure_rate: Failures per executed step (falls back to 1.0 when workers cannot advance but keep failing MAL)
- cumulative_failures: Running total of all failures since the simulation started

These figures allow Hybrid Tuning and dashboards to distinguish "no diversity" warnings caused by true evolutionary stagnation from simple soup exhaustion.

Implementation {#monitoring-implementation}

The metrics are calculated in StateAggregator._recalculate_global_stats() using three helper methods:

  1. _calculate_organism_metrics(): Computes organism-level aggregations
  2. _calculate_environment_metrics(): Computes environment and resource metrics
  3. _calculate_economic_metrics(): Computes economic and efficiency metrics

All metrics are stored in historical_stats with top-level fields for easy access by the monitoring API.

Mutation Statistics {#mutation-statistics}

Mutation Metrics:
- copy_bit_flips: Copy-time bit flips during genome reproduction
- background_flips: Background mutations triggered by instruction count intervals
- insertions: Structural mutations that add bytes to genomes
- deletions: Structural mutations that remove bytes from genomes
- global_instructions: Total instruction count across all organisms

Implementation Details:
- Workers track mutation counters in MutationEngine.mutation_counters
- Server aggregates mutation stats from all workers in _calculate_mutation_metrics()
- Separate tracking for insertions and deletions (not just combined structural mutations)
- Background mutations triggered every 2M instructions (configurable via BACKGROUND_FLIP_INTERVAL)
- Copy-time mutations occur during genome reproduction with configurable rates

Configuration:
- INSERTION_DELETION_RATE: Probability of structural mutations (default: 0.14)
- INDEL_INSERTION_BIAS: Bias toward insertions vs deletions (default: 0.6)
- BACKGROUND_FLIP_INTERVAL: Instructions between background mutations (default: 2000000)

Global Step Counting {#global-step-counting}

The distributed system maintains a global step count that accumulates the actual number of steps executed by all workers. This is critical for maintaining simulation consistency and proper metrics reporting.

Step Counting Process:
1. Workers execute organism time slices and report the total number of steps executed
2. Server receives execution results with step_count field containing actual steps executed
3. Server accumulates these step counts into the global step counter: global_step += results.step_count
4. Global step count is used in metrics reporting and simulation state tracking

Implementation: The StateAggregator.process_execution_results() method adds the actual step count from worker results to the global step counter, ensuring accurate simulation progress tracking across all distributed workers.

Worker Result Submission {#worker-result-submission}

The distributed system requires workers to always submit results to the server, even when all organisms die, to ensure continuous work assignment and prevent simulation stalls.

Critical Bug Fix: The AssignmentHandler.execute_assignment() method now returns an empty ExecutionResults object instead of None when no organisms are available for execution. This ensures the worker always submits results to the server, allowing the simulation to continue.

Configuration: Workers use adaptive batching with configurable submission thresholds:
- BATCH_TARGET_INTERVAL_SECONDS: Controls the adaptive batch target interval for time-based submissions
- BATCH_MIN_SIZE: Minimum number of assignments before a flush
- BATCH_MAX_SIZE: Hard cap that triggers an emergency flush

Implementation: The worker's main loop processes results only when result is truthy, so returning empty results instead of None ensures continuous result submission and prevents simulation stalls.

Memory Allocation Issues {#memory-allocation-issues}

The distributed system has a critical issue where organisms cannot reproduce due to memory allocation failures, preventing population growth and causing the simulation to reach a steady state with no progress.

Root Cause: The ancestor program is designed to overallocate memory for reproduction (adding 16 extra bytes to the original 48-byte size, requiring 64 bytes total), but the memory allocation algorithm fails to find contiguous blocks of this size.

Symptoms:
- Repeated "ALLOCATION FAILED: Organism X needs 64 bytes but allocation failed" errors
- All organisms die without reproducing, leading to population decline
- Server continuously creates new organisms to replace dead ones, but they also die
- Simulation reaches steady state with no work progress

Investigation Results:
- Energy parameters are generous (ENERGY_INITIAL=2000, ENERGY_MAX=3000, CPU_COST_PER_INSTRUCTION=0.001)
- Memory sizes are large (SOUP_SIZE=10000000, LOCAL_SOUP_SIZE=5000000)
- Worker result submission is working correctly
- Issue persists even with increased memory sizes and separate worker local memory

Current Status: The memory allocation algorithm or the ancestor program's reproduction strategy needs to be redesigned to allow successful organism reproduction and population growth.

Mutation metrics aggregation {#mutation-metrics-aggregation}

The server aggregates per-worker mutation counters into global_stats.mutation_metrics with keys:

  • copy_time_mutations: sum of worker copy_bit_flips
  • background_mutations: sum of worker background_flips
  • structural_mutations: sum of worker indels if present, else insertions + deletions
  • global_instructions: sum of worker global_instructions
  • total_mutations: copy_time_mutations + background_mutations + structural_mutations
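
A sketch of the roll-up, assuming each worker contributes a mutation_stats dict using the counter names from the worker local stats schema:

def aggregate_mutation_metrics(per_worker_stats: list) -> dict:
    """Combine per-worker mutation counters into mutation_metrics."""
    def total(key):
        return sum(stats.get(key, 0) for stats in per_worker_stats)

    structural = sum(
        stats.get("indels", stats.get("insertions", 0) + stats.get("deletions", 0))
        for stats in per_worker_stats)   # prefer the indels aggregate when present
    metrics = {
        "copy_time_mutations": total("copy_bit_flips"),
        "background_mutations": total("background_flips"),
        "structural_mutations": structural,
        "global_instructions": total("global_instructions"),
    }
    metrics["total_mutations"] = (metrics["copy_time_mutations"]
                                  + metrics["background_mutations"]
                                  + metrics["structural_mutations"])
    return metrics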

Diversity metrics {#diversity-metrics}

The server computes diversity_metrics from organism states:

  • unique_genomes: count of distinct genome hashes observed across live organisms
  • total_genomes: current population size
  • genome_diversity_ratio: unique_genomes / max(1, total_genomes)
  • dominant_genome_ratio: share of the population occupied by the most common genome hash
  • dominant_genome_count: number of organisms sharing that dominant hash
  • size_distribution: histogram of organism sizes
  • taxonomy_distribution: counts by taxonomy.kingdom when present
  • missing_genome_count: organisms that did not report genome bytes (should remain 0)
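
A sketch covering the hash-based fields (size_distribution and taxonomy_distribution omitted); the hash algorithm is an assumption:

import hashlib
from collections import Counter

def diversity_metrics(organisms) -> dict:
    """Compute genome-hash diversity figures from live organism states."""
    hashes = [hashlib.sha256(bytes(o.genome)).hexdigest()
              for o in organisms if o.genome is not None]
    counts = Counter(hashes)
    total = len(organisms)
    dominant_count = counts.most_common(1)[0][1] if counts else 0
    return {
        "unique_genomes": len(counts),
        "total_genomes": total,
        "genome_diversity_ratio": len(counts) / max(1, total),
        "dominant_genome_ratio": dominant_count / max(1, total),
        "dominant_genome_count": dominant_count,
        "missing_genome_count": total - len(hashes),
    }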

Server API {#server-api}

The distributed server provides a comprehensive REST API for web interface integration and external tooling.

Web Interface Integration {#web-interface-integration}

The server integrates the UI system located in src/bytebiota/ui/:

  • Static Assets: Serves CSS, JavaScript, and images from src/bytebiota/ui/static/
  • HTML Templates: Renders Jinja2 templates from src/bytebiota/ui/templates/
  • Icon Generation: Dynamic SVG icon generation for organisms

Icon Generation API {#icon-generation-api}

The server provides dynamic icon generation through the integrated ByteBiotaIconGenerator:

Endpoint: /api/organism/{organism_id}/icon

Method: GET
Description: Generate or retrieve an SVG icon for a specific organism

Response:

{
  "icon_path": "/static/icons/Replicatus_vulgaris_Digitalis_Animata.svg"
}

Process:
1. Retrieves organism data from state aggregator
2. Extracts taxonomic classification and behavioral traits
3. Generates deterministic SVG icon using genome hash as seed
4. Saves icon to src/bytebiota/ui/static/icons/
5. Returns relative path for web serving

Error Handling: Falls back to default icon (/static/icons/organism.svg) if generation fails

Icon Generation Features

  • Deterministic: Same organism always generates identical icon
  • Taxonomic Mapping: Visual elements reflect classification hierarchy
  • Trait Encoding: Behavioral traits add visual overlays
  • Caching: Generated icons are cached by filename
  • Scalable: SVG format supports any display size