Production Deployment
A checklist and deep-dive for running Lelu reliably in production — covering HTTPS, secrets management, Engine scaling, and observability.
Pre-Launch Checklist
Security & Infrastructure
Observability & Monitoring
Policies & Compliance
In Docker Compose healthchecks, prefer 127.0.0.1 over localhost to avoid container-local hostname resolution edge cases.
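A container healthcheck along these lines can be sketched as a small probe that targets the IPv4 loopback address directly. This is a minimal sketch: the `/healthz` path and the port shown in the usage note are placeholders, not documented Lelu endpoints.

```python
import http.client
import urllib.request


def health_url(port: int, path: str = "/healthz") -> str:
    """Build a probe URL pinned to 127.0.0.1 rather than localhost.

    The default path is a placeholder (assumption), not a known Lelu route.
    """
    return f"http://127.0.0.1:{port}{path}"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False
```

A healthcheck command could call `is_healthy(health_url(8080))` and exit non-zero when it returns `False`; port 8080 here is only an example value.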
Scaling the Engine
The Engine is stateless — scale horizontally by running multiple replicas behind a load balancer. All state lives in Redis.
services:
  engine:
    deploy:
      replicas: 3
      resources:
        limits:
          cpus: "1"
          memory: 512M
      restart_policy:
        condition: on-failure
        delay: 5s
Secrets Management
Never store secrets in environment files committed to source control. Use one of these patterns in production:
AWS Secrets Manager
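Store secrets in AWS Systems Manager Parameter Store or AWS Secrets Manager, and grant the Engine's IAM role read access so values are fetched at runtime instead of being baked into images or env files.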
Kubernetes Secrets
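Expose secrets to the Engine as environment variables sourced from a Secret object. Kubernetes Secrets are only base64-encoded by default, so use Sealed Secrets or External Secrets Operator to keep plaintext values out of source control.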
HashiCorp Vault
Use Vault Agent Injector to automatically inject secrets into pods at startup.
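Whichever pattern supplies the value, the consuming process usually sees either an environment variable or a file rendered into the container. The following is a minimal sketch of a loader that accepts both and fails fast when a required secret is missing. The function name, the exception type, and the secret names in the usage note are hypothetical and not part of Lelu.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent or empty."""


def load_secret(
    name: str,
    env: Optional[Mapping[str, str]] = None,
    secrets_dir: Optional[Path] = None,
) -> str:
    """Read a secret from the environment, falling back to a mounted file.

    `secrets_dir` is an assumed mount point where an injector or volume
    has written one file per secret, named after the secret.
    """
    env = os.environ if env is None else env
    value = env.get(name, "")
    if not value and secrets_dir is not None:
        candidate = Path(secrets_dir) / name
        if candidate.is_file():
            value = candidate.read_text(encoding="utf-8")
    value = value.strip()  # mounted files often end with a newline
    if not value:
        # Fail at startup rather than running with an empty credential.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value
```

For example, `load_secret("DB_PASSWORD", secrets_dir=Path("/vault/secrets"))` would check the environment first and then the mounted file. Failing at startup keeps a misconfigured replica from serving traffic with an empty credential.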
Observability
Lelu provides comprehensive metrics for monitoring authorization decisions, agent behavior, and system performance. Configure Prometheus scraping and alerting for production deployments.
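To illustrate what a deny-rate check works from, here is a minimal sketch that parses Prometheus text-format samples and computes the share of denied decisions from `lelu_auth_decisions_total`. It assumes the Engine exposes the standard text exposition format and that the `allowed` label takes the values `"true"` and `"false"`. The parser is deliberately simplified and is not a replacement for a Prometheus client library.

```python
import re
from typing import Dict, Iterator, Tuple

# name{labels} value [timestamp]; the label block is optional.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric name, labels, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        yield name, labels, float(raw_value)


def deny_ratio(text: str, decision_type: str = "agent") -> float:
    """Fraction of decisions of the given type that were denied."""
    allowed = denied = 0.0
    for name, labels, value in parse_samples(text):
        if name != "lelu_auth_decisions_total":
            continue
        if labels.get("type") != decision_type:
            continue
        if labels.get("allowed") == "false":
            denied += value
        else:
            allowed += value
    total = allowed + denied
    return denied / total if total else 0.0
```

This computes a ratio over lifetime counter values. An alert rule would normally compare the increase of the same counters over a time window, so that old traffic does not mask a recent spike.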
Core Metrics
lelu_http_requests_total{method="POST",path="/v1/agent/authorize",status="200"}
# Request volume and status-code anomalies
lelu_http_request_duration_seconds{method="POST",path="/v1/agent/authorize"}
# Latency SLO / p95 / p99
lelu_auth_decisions_total{type="agent",allowed="false"}
# Deny-rate spikes and confidence policy pressure
lelu_agent_requests_total{agent_id,action,outcome}
# Per-agent authorization outcomes
lelu_agent_confidence_score{agent_id,action}
# Confidence score distribution
lelu_agent_risk_score{agent_id,action}
# Risk score distribution
Behavioral Analytics Metrics
lelu_agent_reputation_score{agent_id}
# Current reputation score (0-1)
lelu_agent_anomaly_score{agent_id}
# Anomaly detection score (0-1, higher = more anomalous)
lelu_agent_human_review_total{agent_id,reason}
# Human review requirements by reason
lelu_policy_effectiveness_rate{policy_name,policy_version}
# Policy success rate
Predictive Analytics Metrics
lelu_agent_prediction_accuracy{model_type,agent_id}
# Model accuracy (0-1)
lelu_agent_prediction_latency_seconds{model_type}
# Prediction latency
lelu_agent_predictions_total{model_type,outcome}
# Prediction counts
lelu_agent_model_sample_count{model_type}
# Training sample count
Multi-Agent Coordination Metrics
lelu_agent_delegation_total{delegator,delegatee,outcome}
# Agent delegation counts
lelu_swarm_operations_total{swarm_id,operation_type,outcome}
# Swarm orchestration operations
lelu_swarm_agent_count{swarm_id}
# Active agents per swarm
Recommended Alerts
Advanced Features Configuration
Enable and configure advanced features for production deployments.
OpenTelemetry Tracing
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4318
OTEL_SERVICE_NAME=lelu-engine
OTEL_TRACES_EXPORTER=otlp
OTEL_TRACES_SAMPLER=always_on
Behavioral Analytics
# Reputation thresholds
REPUTATION_LOW_THRESHOLD=0.5
REPUTATION_MIN_DECISIONS=10
# Anomaly detection
ANOMALY_DETECTION_ENABLED=true
ANOMALY_SEVERITY_THRESHOLD=0.7
ANOMALY_WINDOW_SIZE=100
# Baseline management
BASELINE_SAMPLE_SIZE=100
BASELINE_REFRESH_INTERVAL=24h
Predictive Analytics
# Model training
MIN_SAMPLES_FOR_MODEL=100
MODEL_UPDATE_INTERVAL=6h
CONFIDENCE_MODEL_WINDOW=30d
REVIEW_MODEL_WINDOW=14d
# Prediction thresholds
CONFIDENCE_PREDICTION_THRESHOLD=0.7
REVIEW_PREDICTION_THRESHOLD=0.6
POLICY_OPTIMIZATION_THRESHOLD=0.5
Prompt Injection Detection
# Enabled by default
PROMPT_INJECTION_DETECTION_ENABLED=true
PROMPT_INJECTION_SEVERITY_THRESHOLD=0.8
# Alert on high-severity detections
PROMPT_INJECTION_ALERT_ENABLED=true
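Because these settings are plain environment variables, a typo such as a threshold of `7` instead of `0.7` may go unnoticed until runtime. The sketch below shows one way to lint them before a deploy. It rests on two assumptions that the Engine itself may not share: that every `*_THRESHOLD` value listed here is meant to lie in the range 0 to 1, and that durations use a plain `<number><s|m|h|d>` form.

```python
import re
from datetime import timedelta
from typing import List, Mapping

_DURATION = re.compile(r"^(\d+)([smhd])$")
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}

# Variable names are taken from the configuration shown above.
THRESHOLD_KEYS = (
    "REPUTATION_LOW_THRESHOLD",
    "ANOMALY_SEVERITY_THRESHOLD",
    "CONFIDENCE_PREDICTION_THRESHOLD",
    "REVIEW_PREDICTION_THRESHOLD",
    "POLICY_OPTIMIZATION_THRESHOLD",
    "PROMPT_INJECTION_SEVERITY_THRESHOLD",
)
DURATION_KEYS = (
    "BASELINE_REFRESH_INTERVAL",
    "MODEL_UPDATE_INTERVAL",
    "CONFIDENCE_MODEL_WINDOW",
    "REVIEW_MODEL_WINDOW",
)


def parse_duration(raw: str) -> timedelta:
    """Parse values such as '24h' or '30d' (assumed syntax)."""
    match = _DURATION.match(raw.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {raw!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def validate(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means no findings."""
    problems: List[str] = []
    for key in THRESHOLD_KEYS:
        raw = env.get(key)
        if raw is None:
            continue  # unset keys are left to the Engine's defaults
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{key}: not a number ({raw!r})")
            continue
        if not 0.0 <= value <= 1.0:
            problems.append(f"{key}: {value} is outside the assumed 0-1 range")
    for key in DURATION_KEYS:
        raw = env.get(key)
        if raw is None:
            continue
        try:
            parse_duration(raw)
        except ValueError as exc:
            problems.append(f"{key}: {exc}")
    return problems
```

Running `validate(os.environ)` in a CI step and failing the build on a non-empty result is one way to catch these mistakes before they reach production.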
Multi-Agent Deployment Considerations
When deploying systems with multiple coordinating agents, consider these additional factors.
Delegation Chain Limits
Set maximum delegation depth to prevent infinite loops and excessive latency.
Swarm Coordination
Configure swarm size limits and timeout values for coordinated operations.
Trace Context Propagation
Ensure OpenTelemetry context is propagated across agent boundaries for complete trace visibility.
