Failover Automation
Automated failover with safety checks and rollback capabilities.
Reroute completion SLO
Reroute operations complete within 90 seconds for approved routes.
- Target: 95% of reroutes complete within 90s
- Measurement: Time from reroute request to traffic shift confirmation
- Safety: Policy checks must pass before traffic shift
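As a rough illustration of how the target above could be checked, the sketch below computes the fraction of reroutes that finished within the limit and compares it to the 95% target. The function names and the treatment of an empty measurement window are assumptions made for this example, not part of the product.

```python
REROUTE_LIMIT_S = 90.0   # limit from the SLO above
REROUTE_TARGET = 0.95    # 95% of reroutes

def slo_compliance(durations_s, limit_s):
    """Return the fraction of operations that finished within limit_s."""
    if not durations_s:
        return 1.0  # assumption: an empty window counts as compliant
    within = sum(1 for d in durations_s if d <= limit_s)
    return within / len(durations_s)

def meets_slo(durations_s, limit_s=REROUTE_LIMIT_S, target=REROUTE_TARGET):
    """True when the observed compliance reaches the target."""
    return slo_compliance(durations_s, limit_s) >= target

# Durations are measured from reroute request to traffic shift confirmation.
sample = [30.0] * 19 + [120.0]   # 19 of 20 within 90s
print(meets_slo(sample))
```

The same helper can be reused for other SLOs by passing a different limit and target.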
Rollback completion SLO
Rollback operations complete within 60 seconds when triggered.
- Target: 99% of rollbacks complete within 60s
- Measurement: Time from rollback request to original route restoration
- Priority: Rollback takes precedence over new reroute requests
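One way to picture the priority rule above is a queue that always serves rollbacks before reroutes. This is a minimal sketch; the `OperationQueue` class and its method names are hypothetical and only illustrate the ordering.

```python
import heapq
import itertools

ROLLBACK, REROUTE = 0, 1  # lower value is served first

class OperationQueue:
    """Pending operations, with rollbacks ahead of reroutes."""

    def __init__(self):
        self._heap = []
        # Tie-breaker: keeps first-in, first-out order within the same kind.
        self._counter = itertools.count()

    def submit(self, kind, route_id):
        heapq.heappush(self._heap, (kind, next(self._counter), route_id))

    def next_operation(self):
        kind, _, route_id = heapq.heappop(self._heap)
        return ("rollback" if kind == ROLLBACK else "reroute", route_id)

queue = OperationQueue()
queue.submit(REROUTE, "route-a")
queue.submit(ROLLBACK, "route-b")
print(queue.next_operation())  # the rollback is served first
```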
Circuit breaker behavior
Automatic circuit breaking prevents cascading failures during system stress.
- Threshold: 5 consecutive failures, or a 20% error rate over a 5-minute window
- Cooldown: 30 minutes before automatic reset attempt
- Manual override: Available via API and dashboard for emergency situations
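The thresholds above map onto a conventional closed / open / half-open circuit breaker. The sketch below is one possible reading, not the actual implementation: the half-open trial state, the minimum sample count for the error-rate rule, and all names are assumptions. Timestamps are passed in as seconds so the logic can be exercised without waiting.

```python
from collections import deque

class CircuitBreaker:
    CONSECUTIVE_FAILURES = 5
    ERROR_RATE = 0.20
    WINDOW_S = 5 * 60
    COOLDOWN_S = 30 * 60
    MIN_SAMPLES = 20  # assumption: avoids tripping on a tiny sample

    def __init__(self):
        self._events = deque()  # (timestamp, ok) pairs inside the window
        self._close()

    def _close(self):
        self.state, self.opened_at, self._consecutive = "closed", None, 0
        self._events.clear()

    def _open(self, now):
        self.state, self.opened_at = "open", now

    def record(self, ok, now):
        """Record one request outcome observed at time `now`."""
        if self.state == "half_open":
            # Assumption: a single trial decides the reset attempt.
            if ok:
                self._close()
            else:
                self._open(now)
            return
        self._events.append((now, ok))
        while self._events and self._events[0][0] <= now - self.WINDOW_S:
            self._events.popleft()
        self._consecutive = 0 if ok else self._consecutive + 1
        failures = sum(1 for _, was_ok in self._events if not was_ok)
        rate_tripped = (len(self._events) >= self.MIN_SAMPLES
                        and failures / len(self._events) >= self.ERROR_RATE)
        if self.state == "closed" and (
                self._consecutive >= self.CONSECUTIVE_FAILURES or rate_tripped):
            self._open(now)

    def allow_request(self, now):
        """After the cooldown, let a trial request through."""
        if self.state == "open" and now - self.opened_at >= self.COOLDOWN_S:
            self.state = "half_open"
        return self.state != "open"

    def manual_reset(self):
        """Manual override: close the breaker immediately."""
        self._close()
```

A manual override exposed through an API or dashboard would call something like `manual_reset()` rather than waiting for the cooldown.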
Route health signal requirements
Health signals used to determine route status and failover eligibility.
- Latency: < 2000ms for 95% of requests over 5 minutes
- Availability: > 99.9% successful responses
- Error rate: < 1% of requests result in 5xx errors
- Policy compliance: All active policies pass validation
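The four signals above can be evaluated together to decide whether a route counts as healthy. The following is a minimal sketch: the `RouteStats` fields are hypothetical names, and the latency rule is read here as "95th-percentile latency below 2000ms", which is an interpretation of the requirement.

```python
from dataclasses import dataclass

@dataclass
class RouteStats:
    p95_latency_ms: float   # 95th-percentile latency over the last 5 minutes
    availability: float     # fraction of successful responses, 0.0 to 1.0
    error_rate_5xx: float   # fraction of requests ending in a 5xx error
    policies_pass: bool     # True when all active policies pass validation

def failing_signals(stats):
    """Return the names of the health signals that are out of bounds."""
    checks = {
        "latency": stats.p95_latency_ms < 2000,
        "availability": stats.availability > 0.999,
        "error_rate": stats.error_rate_5xx < 0.01,
        "policy_compliance": stats.policies_pass,
    }
    return [name for name, ok in checks.items() if not ok]

def is_healthy(stats):
    return not failing_signals(stats)

print(failing_signals(RouteStats(2400, 0.9995, 0.02, True)))
# ['latency', 'error_rate']
```

Returning the list of failing signals, rather than a bare boolean, makes it easier to record why a route was marked unhealthy.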
Failover allowlist policy examples
Example policies for allowing failover to specific endpoints.
- Allow failover to backup endpoints with similar capabilities
- Require policy compliance checks before activation
- Example: Allow failover from gpt-4 to claude-3-opus for chat routes
Failover denylist policy examples
Example policies for preventing failover to specific endpoints.
- Block failover to endpoints with known compatibility issues
- Prevent failover to higher-cost endpoints without approval
- Example: Deny failover from stable models to experimental models
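To make the allowlist and denylist examples concrete, here is a small sketch of how such rules might be evaluated. The rule format, the `"*"` wildcard, the rule that a deny match overrides an allow match, the default of denying unlisted pairs, and the endpoint name `experimental-model` are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FailoverRule:
    source: str
    target: str
    route_type: str = "*"   # "*" matches anything (assumed convention)

    def matches(self, source, target, route_type):
        return (self.source in ("*", source)
                and self.target in ("*", target)
                and self.route_type in ("*", route_type))

# Allow failover from gpt-4 to claude-3-opus for chat routes.
ALLOW = [FailoverRule("gpt-4", "claude-3-opus", "chat")]
# Deny failover to an experimental model (hypothetical endpoint name).
DENY = [FailoverRule("*", "experimental-model")]

def failover_permitted(source, target, route_type):
    """Deny rules win; anything not explicitly allowed is refused."""
    if any(r.matches(source, target, route_type) for r in DENY):
        return False
    return any(r.matches(source, target, route_type) for r in ALLOW)

print(failover_permitted("gpt-4", "claude-3-opus", "chat"))       # True
print(failover_permitted("gpt-4", "experimental-model", "chat"))  # False
```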
Fallback priority ordering
Priority ordering for fallback endpoints when multiple options are available.
- Primary: Same provider, different region
- Secondary: Different provider, similar capabilities
- Tertiary: Different provider, reduced capabilities
- Example: gpt-4 → gpt-4-other-region → claude-3-opus → gemini-pro
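The ordering above can be expressed as a sort key. In this sketch the `Endpoint` fields, the region names, and the numeric `capability_tier` used to stand in for "similar" versus "reduced" capabilities are hypothetical; only the three-level ordering comes from the list above.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    provider: str
    region: str
    capability_tier: int  # hypothetical: higher means more capable

def fallback_rank(primary, candidate):
    """0 = primary fallback, 1 = secondary, 2 = tertiary."""
    same_provider = candidate.provider == primary.provider
    if same_provider and candidate.region != primary.region:
        return 0
    if not same_provider:
        similar = candidate.capability_tier >= primary.capability_tier
        return 1 if similar else 2
    return 3  # same provider and region: not covered above, ranked last (assumption)

def order_fallbacks(primary, candidates):
    eligible = [c for c in candidates if c.name != primary.name]
    # sorted() is stable, so candidates with equal rank keep their input order.
    return sorted(eligible, key=lambda c: fallback_rank(primary, c))

primary = Endpoint("gpt-4", "openai", "us-east", 3)
candidates = [
    Endpoint("gemini-pro", "google", "us-east", 2),
    Endpoint("claude-3-opus", "anthropic", "us-east", 3),
    Endpoint("gpt-4-other-region", "openai", "eu-west", 3),
]
print([e.name for e in order_fallbacks(primary, candidates)])
# ['gpt-4-other-region', 'claude-3-opus', 'gemini-pro']
```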
Endpoint health probing
Active health checks to determine endpoint availability and performance.
- Frequency: Every 30 seconds for active routes
- Timeout: 5 seconds per probe
- Threshold: 3 consecutive failures to mark as unhealthy
- Recovery: 1 successful probe to mark as healthy
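The failure and recovery thresholds above amount to a small state machine per endpoint. The sketch below tracks only the probe results; the class name and the choice to start in the healthy state are assumptions, and the probe itself (sent every 30 seconds with a 5-second timeout) is left out.

```python
class ProbeTracker:
    UNHEALTHY_AFTER = 3  # consecutive failed probes
    HEALTHY_AFTER = 1    # successful probes needed to recover

    def __init__(self):
        self.healthy = True  # assumption: endpoints start out healthy
        self._failures = 0

    def record(self, success):
        """Record one probe result and return the resulting health state."""
        if success:
            # HEALTHY_AFTER is 1, so a single success restores the endpoint.
            self._failures = 0
            self.healthy = True
        else:
            self._failures += 1
            if self._failures >= self.UNHEALTHY_AFTER:
                self.healthy = False
        return self.healthy

tracker = ProbeTracker()
results = [False, False, True, False, False, False, True]
print([tracker.record(r) for r in results])
# [True, True, True, True, True, False, True]
```

A probe that exceeds the timeout would be recorded as a failure.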
Cooldown window behavior
Cooldown periods after failover operations to prevent flapping.
- Default: 30 minutes after automatic failover
- Manual override: Can be reduced to 5 minutes by operators
- Health checks: Continue during cooldown but do not trigger additional failovers
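A cooldown check along the lines described above might look like the following sketch. It assumes that 5 minutes is the lowest value an operator can set and that an override cannot lengthen the default; the function names are hypothetical.

```python
DEFAULT_COOLDOWN_S = 30 * 60
MIN_COOLDOWN_S = 5 * 60   # assumption: operators cannot go below 5 minutes

def effective_cooldown(operator_override_s=None):
    """Cooldown in seconds, clamped to the range [5 min, 30 min]."""
    if operator_override_s is None:
        return DEFAULT_COOLDOWN_S
    return max(MIN_COOLDOWN_S, min(operator_override_s, DEFAULT_COOLDOWN_S))

def may_auto_failover(last_failover_at, now, operator_override_s=None):
    """True when no cooldown is active for the route.

    Health checks keep running during cooldown; only the automatic
    failover decision is suppressed.
    """
    if last_failover_at is None:
        return True
    return now - last_failover_at >= effective_cooldown(operator_override_s)

print(may_auto_failover(last_failover_at=0, now=1000))                           # False
print(may_auto_failover(last_failover_at=0, now=1000, operator_override_s=300))  # True
```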
Manual override behavior
Manual intervention capabilities for failover operations.
- Immediate failover: Bypass cooldown and health check requirements
- Force rollback: Immediately return to original endpoint
- Pause automation: Temporarily disable automatic failover for specific routes
Human-review hold behavior
Hold states that require human approval before proceeding.
- High-risk failovers: Require approval before execution
- Policy violations: Require approval to override
- Cost threshold breaches: Require financial approval
- Timeout: A held action that is not approved within 30 minutes is cancelled automatically
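The hold timeout can be sketched as a status function over a hold record. The `Hold` type, the reason strings, and the status names are hypothetical; the 30-minute limit is the only value taken from the list above.

```python
from dataclasses import dataclass
from typing import Optional

HOLD_TIMEOUT_S = 30 * 60

@dataclass
class Hold:
    reason: str                          # e.g. "high_risk", "policy_violation"
    created_at: float                    # seconds
    approved_at: Optional[float] = None  # set when a reviewer approves

def hold_status(hold, now):
    """Return "approved", "cancelled", or "pending" for a hold at time `now`."""
    approved_in_time = (hold.approved_at is not None
                        and hold.approved_at - hold.created_at < HOLD_TIMEOUT_S)
    if approved_in_time:
        return "approved"
    if now - hold.created_at >= HOLD_TIMEOUT_S:
        return "cancelled"  # no approval within 30 minutes
    return "pending"

print(hold_status(Hold("high_risk", created_at=0), now=100))    # pending
print(hold_status(Hold("high_risk", created_at=0), now=1800))   # cancelled
```

An approval that arrives after the timeout is ignored in this sketch, since the action has already been cancelled.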
Automation execution model
How Cyiro executes automation workflows with policy enforcement and safety checks.
- Policy-first evaluation: All automation actions are validated against defined guardrails
- Risk-based approval: High-risk actions require manual approval before execution
- Audit trail generation: Complete logs of all automation decisions and outcomes
- Rollback readiness: Automatic rollback procedures for failed or unsafe automation
- Example: Failover workflow with pre-execution policy checks and post-execution validation
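The execution model above can be read as a fixed sequence: policy check, approval gate, execution, post-execution validation, and rollback on failure, with each outcome logged. The sketch below illustrates that sequence only; the `Workflow` class, its callables, and the status strings are hypothetical and do not describe Cyiro's internals.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Workflow:
    policy_check: Callable[[], bool]   # pre-execution guardrail validation
    execute: Callable[[], None]        # performs the action, e.g. a failover
    validate: Callable[[], bool]       # post-execution validation
    rollback: Callable[[], None]       # restores the previous state
    high_risk: bool = False
    approved: bool = False
    audit: List[str] = field(default_factory=list)

    def run(self):
        if not self.policy_check():
            return self._finish("rejected_by_policy")
        if self.high_risk and not self.approved:
            return self._finish("awaiting_approval")
        self.execute()
        if not self.validate():
            self.rollback()
            return self._finish("rolled_back")
        return self._finish("completed")

    def _finish(self, status):
        self.audit.append(status)  # every outcome leaves an audit record
        return status

calls = []
workflow = Workflow(
    policy_check=lambda: True,
    execute=lambda: calls.append("execute"),
    validate=lambda: False,             # simulate a failed validation
    rollback=lambda: calls.append("rollback"),
)
print(workflow.run(), calls)  # rolled_back ['execute', 'rollback']
```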
Automation audit log schema
Comprehensive audit logging for all automation actions and decisions.
- Fields: timestamp, automation_id, action_type, workspace_id, route_id, status, initiated_by, approval_status, duration_ms
- Retention: 90 days for standard logs, 365 days for high-risk actions
- Export: Available via API and dashboard with filtering capabilities
- Example: {"automation_id": "auto-123", "action_type": "failover", "route_id": "prod-chat", "status": "completed", "initiated_by": "system", "approval_status": "auto-approved"}
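For clients that consume exported logs, the fields above could be modelled as a typed record. This is a sketch under assumptions: the field types and the ISO 8601 timestamp format are guesses, and the `timestamp`, `workspace_id`, and `duration_ms` values below are placeholders, since the example entry does not show them.

```python
import json
from dataclasses import dataclass, asdict

STANDARD_RETENTION_DAYS = 90
HIGH_RISK_RETENTION_DAYS = 365

@dataclass
class AuditLogEntry:
    timestamp: str        # assumed to be an ISO 8601 string
    automation_id: str
    action_type: str
    workspace_id: str
    route_id: str
    status: str
    initiated_by: str
    approval_status: str
    duration_ms: int      # assumed to be an integer count of milliseconds

def retention_days(high_risk):
    """Retention period for a log entry, per the policy above."""
    return HIGH_RISK_RETENTION_DAYS if high_risk else STANDARD_RETENTION_DAYS

entry = AuditLogEntry(
    timestamp="2024-01-01T00:00:00Z",  # placeholder value
    automation_id="auto-123",
    action_type="failover",
    workspace_id="ws-example",         # placeholder value
    route_id="prod-chat",
    status="completed",
    initiated_by="system",
    approval_status="auto-approved",
    duration_ms=42000,                 # placeholder value
)
print(json.dumps(asdict(entry), indent=2))
```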