Metrics and Observability

Reliable monitoring keeps long-running simulations healthy and debuggable. This page explains how we emit, store, and visualise telemetry from ByteBiota runs.

Logging {#logging}

run_logger (src/bytebiota/run_logger.py) emits structured JSONL streams for the server and each worker, while CheckpointService (src/bytebiota/server/checkpoint_service.py) captures distributed snapshots. The sections below document the log schemas, rotation policy, and dashboard integrations.

Run Logging System {#run-logging}

The run logging system (src/bytebiota/run_logger.py) provides file-based logging for both distributed servers and workers, with a unique log file per run, automatic rotation, and rich debugging context.

Log File Naming Convention {#log-file-naming}
  • Server logs: {data_dir}/logs/server_YYYYMMDD_HHMMSS_PID.log
  • Worker logs: {data_dir}/logs/worker_YYYYMMDD_HHMMSS_PID.log
  • Format: {component}_{timestamp}_{process_id}.log (see the sketch after this list)
  • Example: data/logs/server_20241201_143022_12345.log (with the default data directory)
  • Configurable: The log directory can be customized via the DATA_BASE_DIR environment variable
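
A minimal sketch of how such a path can be assembled (build_log_path is illustrative, not the actual run_logger API):

import os
import time

def build_log_path(component: str, data_dir: str = "data") -> str:
    """Build a path following {component}_{timestamp}_{process_id}.log."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")
    return os.path.join(data_dir, "logs", f"{component}_{timestamp}_{os.getpid()}.log")

# e.g. build_log_path("server") -> "data/logs/server_20241201_143022_12345.log"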
Log Rotation Policy {#log-rotation}
  • Retention: Keeps only the last 5 log files per component
  • Automatic cleanup: Old logs are removed on startup (see the sketch after this list)
  • No manual intervention: Rotation happens transparently
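
A sketch of that policy, assuming the naming convention above (rotate_logs is illustrative, not the exact cleanup code):

import glob
import os

def rotate_logs(log_dir: str, component: str, keep: int = 5) -> None:
    """Delete all but the newest `keep` log files for a component."""
    # Timestamped names sort chronologically, oldest first.
    logs = sorted(glob.glob(os.path.join(log_dir, f"{component}_*.log")))
    for stale in logs[:-keep]:
        os.remove(stale)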
Log Format and Structure {#log-format}

Each log entry is a JSON object with the following structure:

{
  "event_type": "startup|shutdown|metrics|state_change|connection_event|work_event|error",
  "context": {
    "component": "server|worker",
    "process_id": 12345,
    "thread_id": 67890,
    "worker_id": "worker-abc123",
    "simulation_id": "sim-1234567890-abcdef12",
    "timestamp": 1234567890.123
  },
  "data": {
    // Event-specific data
    "command_line": "python -m bytebiota server --host 0.0.0.0 --port 8080"
  },
  "runtime_seconds": 123.45,
  "exception": {
    "type": "ExceptionType",
    "message": "Error message",
    "traceback": "Full stack trace"
  }
}
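
Because each entry is one JSON object per line, the logs can be consumed with nothing but the standard library; a small reader sketch:

import json

def read_events(path: str):
    """Yield parsed entries from a JSONL run log."""
    with open(path) as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

for event in read_events("data/logs/server_20241201_143022_12345.log"):
    if event["event_type"] == "error":
        print(event["context"]["timestamp"], event.get("exception", {}).get("type"))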
Command Line Logging {#command-line-logging}

The initialization event in every log file records the complete command line used to start the process, which helps with debugging and reproducing issues by showing exactly how the server or worker was launched.

The command line is captured automatically and logged in the run_logger_initialized event under the command_line field in the data section.
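
Conceptually this is a snapshot of sys.argv taken at startup; a sketch of the idea (the data layout matches the schema above, the code itself is illustrative):

import sys

# sys.argv holds the arguments; sys.executable identifies the interpreter.
# Together they approximate the full launch command for a "python -m" invocation.
data = {"command_line": " ".join([sys.executable] + sys.argv)}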

Server Logging Events {#server-logging-events}

The distributed server logs comprehensive events including:
- Startup/shutdown: Configuration details, simulation state
- Worker management: Registration, deregistration, heartbeat events
- WebSocket connections: Connection, disconnection, error events
- Simulation state: Start, stop, pause, resume transitions
- Checkpoint events: Creation, loading, error handling
- Result processing: Work assignments, result submissions, duplicates
- Periodic metrics: Population, workers, global state every 30 seconds
- System metrics: CPU, memory, disk, network usage, load average (see the sketch after this list)
- Error tracking: Full stack traces with context
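
A sketch of gathering those system metrics, assuming the widely used psutil package (the server's actual collector may differ):

import os
import psutil

def system_metrics() -> dict:
    """Collect the data section of a periodic system-metrics event."""
    return {
        "cpu_percent": psutil.cpu_percent(),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
        "net_bytes_sent": psutil.net_io_counters().bytes_sent,
        "load_average": os.getloadavg(),  # POSIX only
    }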

Worker Logging Events {#worker-logging-events}

Distributed workers log detailed events including:
- Startup/shutdown: Configuration, resource limits, connection status
- Server communication: Registration, deregistration, config sync
- WebSocket events: Connection, disconnection, message handling
- Work execution: Assignment reception, organism processing, result generation
- Result submission: Batch creation, WebSocket submission, offline caching
- Offline mode: Transitions, cache management, reconnection
- Performance metrics: Work rate, queue size, resource usage every 10 work cycles
- Resource monitoring: CPU, memory, disk usage, efficiency metrics
- Execution performance: Assignment timing, steps per second, throughput
- Error tracking: Execution errors, connection failures, submission errors with full context

Using Logs for Debugging {#debugging-with-logs}

The structured logs provide comprehensive debugging information:

  1. Trace execution flow: Follow worker assignments from reception to completion (see the sketch after this list)
  2. Identify bottlenecks: Monitor work rates, queue sizes, and submission intervals
  3. Debug connection issues: Track WebSocket connections, reconnections, and failures
  4. Analyze performance: Review execution times, batch sizes, and resource usage
  5. Monitor resource utilization: Track CPU, memory, disk usage and efficiency
  6. Investigate errors: Full stack traces with context for all exceptions
  7. Monitor system health: Periodic metrics and state transitions
  8. Performance optimization: Analyze steps per second, throughput, and execution efficiency
  9. Resource management: Monitor resource limits, throttling, and efficiency metrics
  10. Network analysis: Track message rates, connection quality, and communication patterns
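
For example, step 1 can be scripted directly against the schema: filter a log to one worker's events and order them by timestamp (a sketch):

import json

def trace_worker(path: str, worker_id: str) -> None:
    """Print a chronological event trace for a single worker."""
    with open(path) as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    mine = [e for e in events if e["context"].get("worker_id") == worker_id]
    for e in sorted(mine, key=lambda e: e["context"]["timestamp"]):
        print(e["context"]["timestamp"], e["event_type"])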
Log Analysis Tools {#log-analysis-tools}
  • JSON parsing: Every log line is a standalone JSON object, so standard tooling parses it directly
  • Time-based filtering: Timestamps for chronological analysis
  • Event correlation: Worker IDs and simulation IDs for cross-referencing
  • Performance tracking: Runtime seconds and execution metrics
  • Error aggregation: Exception types and frequencies (example below)
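
As an illustration of error aggregation, tallying exception types across a log takes a few lines:

import json
from collections import Counter

errors = Counter()
with open("data/logs/worker_20241201_143022_12345.log") as fh:
    for line in fh:
        exc = json.loads(line).get("exception")
        if exc:
            errors[exc["type"]] += 1

print(errors.most_common(5))  # the five most frequent exception types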

Telemetry Hooks {#telemetry-hooks}

Additional metrics sources, such as executed opcode counters, environment stats, or external monitoring exporters, plug in alongside the run logs. The hooks currently in place:

  • Allocation health: Workers now report per-assignment allocation failure counts (allocation_failures and allocation_failure_rate). The state aggregator surfaces recent and cumulative failures under allocation_metrics, enabling Hybrid Tuning (and dashboards) to flag soup exhaustion before populations flatline (see the sketch below).
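
A sketch of the rate computation; the attempts counter is an assumption here, since only the failure-side field names are documented above:

def allocation_failure_rate(failures: int, attempts: int) -> float:
    """Fraction of allocation attempts that failed (attempts is assumed)."""
    return failures / attempts if attempts else 0.0

# e.g. flag soup exhaustion when the recent rate stays high
if allocation_failure_rate(48, 120) > 0.25:
    print("warning: allocation pressure rising")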

Analytics Dashboard {#analytics-dashboard}

The web interface provides comprehensive analytics through the distributed server's API endpoints. All analytics methods use real simulation data and provide dynamic insights into ecosystem behavior.

Analytics API {#analytics-api}

The distributed server (src/bytebiota/server/server.py) provides analytics endpoints that serve real-time simulation data. The server aggregates data from workers and provides methods for generating insights from simulation metrics and species classification data. Analytics are computed on-demand and cached for performance.

Population Forecasting {#population-forecasting}

Advanced forecasting models predict population trends using ARIMA-like algorithms, exponential smoothing, and linear regression. Models account for volatility, momentum, and uncertainty to provide confidence intervals and trend analysis.
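
To give a flavour of the simpler models, a minimal exponential-smoothing forecast over a population series (the production models add volatility, momentum, and confidence handling):

def exp_smooth_forecast(series, alpha=0.3, horizon=5):
    """Simple exponential smoothing: track a level, then hold it flat."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return [round(level, 1)] * horizon

population = [120, 132, 128, 140, 151, 149]
print(exp_smooth_forecast(population))  # flat forecast at the smoothed level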

Extinction Risk Assessment {#extinction-risk-assessment}

Evaluates species extinction risk based on population sizes and trends. Species with populations ≀3 are classified as high risk, while larger populations are assessed for medium or low risk based on stability metrics.
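
The ≤3 threshold maps directly onto a classification rule; in sketch form (the stability criterion for larger populations is simplified here):

def extinction_risk(population: int, stability: float) -> str:
    """Classify risk; stability in [0, 1], higher means a steadier trend."""
    if population <= 3:
        return "high"
    return "low" if stability >= 0.5 else "medium"

print(extinction_risk(2, 0.9))   # high, regardless of stability
print(extinction_risk(40, 0.3))  # medium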

Resource Scarcity Prediction {#resource-scarcity-prediction}

Monitors memory usage, CPU cycles, and storage requirements based on population growth and resource consumption patterns. Provides early warning indicators for resource constraints.

Mutation Pressure Analysis {#mutation-pressure-analysis}

Tracks mutation rates and beneficial mutation ratios using opcode diversity and task success metrics. Identifies periods of high evolutionary pressure and adaptation success.

Behavioral Clustering {#behavioral-clustering}

Groups organisms by behavioral traits (reproduction rate, survival rate) to identify distinct behavioral strategies and evolutionary niches within the ecosystem.
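
A sketch of trait-based grouping that buckets organisms on the two traits named above (the real analysis may use a proper clustering algorithm such as k-means):

from collections import defaultdict

def cluster_by_traits(organisms, bins=3):
    """Group (reproduction_rate, survival_rate) pairs into a bins x bins grid."""
    clusters = defaultdict(list)
    for repro, survival in organisms:  # both rates assumed in [0, 1]
        key = (min(int(repro * bins), bins - 1), min(int(survival * bins), bins - 1))
        clusters[key].append((repro, survival))
    return dict(clusters)

print(cluster_by_traits([(0.9, 0.2), (0.85, 0.25), (0.1, 0.8)]))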

Temporal Patterns {#temporal-patterns}

Analyzes long-term trends in population stability, energy efficiency, and mutation acceleration. Detects seasonal patterns, growth phases, and ecosystem transitions.

Network Analysis {#network-analysis}

Models species relationships and interactions based on population dynamics and evolutionary proximity. Calculates centrality metrics and connection weights between species.

Correlation Analysis {#correlation-analysis}

Computes correlation matrices between key simulation variables (Population, Memory, Energy, Diversity, Age, Errors) to identify relationships and dependencies.
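
With one row per variable, numpy computes such a matrix in one call (the sample data here is made up):

import numpy as np

variables = ["Population", "Memory", "Energy", "Diversity", "Age", "Errors"]
samples = np.random.rand(len(variables), 50)  # hypothetical: one column per time step
corr = np.corrcoef(samples)                   # 6x6 Pearson correlation matrix

# Correlation of Population against every variable, including itself (1.0)
print(dict(zip(variables, corr[0].round(2))))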

Fitness Landscape {#fitness-landscape}

Visualizes species fitness in 2D/3D space based on population size and evolutionary success. Interactive visualizations show fitness peaks and valleys in the evolutionary landscape.

Phylogenetic Analysis {#phylogenetic-analysis}

Builds hierarchical trees showing evolutionary relationships between species based on kingdom, phylum, and population distributions.

Speciation Events {#speciation-events}

Tracks major speciation events and population changes over time, providing a timeline of evolutionary milestones and ecosystem development.

Automated Insights {#automated-insights}

Generates dynamic insights and recommendations based on ecosystem stability, population trends, energy efficiency, and genetic diversity. Provides actionable recommendations for simulation management.

Adaptation Rates {#adaptation-rates}

Tracks evolutionary adaptation rates and success metrics for different species and populations over time.

Model Performance {#model-performance}

Monitors the performance of machine learning models used for forecasting and analysis, including accuracy metrics and prediction reliability.

Feature Importance {#feature-importance}

Analyzes which simulation variables and organism traits are most predictive of evolutionary success and ecosystem behavior.