Self-hosted docs

Failover Automation

Automated failover with safety checks and rollback capabilities.

docs.cyiro.com: production-ready guidance

Reroute completion SLO

Reroute (failover) operations complete within 90 seconds for approved routes.

Target: 95% of reroutes complete within 90s
Measurement: Time from reroute request to traffic shift confirmation
Safety: Policy checks must pass before traffic shift
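
As a rough illustration of how the target above could be checked, the sketch below computes the fraction of reroutes that finished within the threshold. The function names and the sample durations are hypothetical, and treating an empty sample as compliant is an assumption rather than documented behavior.

```python
def slo_compliance(durations_s, threshold_s):
    """Return the fraction of operations that finished within threshold_s."""
    if not durations_s:
        return 1.0  # assumption: no operations in the window counts as compliant
    within = sum(1 for d in durations_s if d <= threshold_s)
    return within / len(durations_s)


def meets_slo(durations_s, threshold_s, target):
    """True when the observed compliance reaches the target fraction."""
    return slo_compliance(durations_s, threshold_s) >= target


# Hypothetical sample: seconds from reroute request to traffic shift confirmation.
reroute_durations = [12.0, 45.5, 88.0, 91.2, 30.1]
reroute_ok = meets_slo(reroute_durations, threshold_s=90.0, target=0.95)
```

The same helper would apply to the rollback SLO by passing `threshold_s=60.0` and `target=0.99`.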

Rollback completion SLO

Rollback operations complete within 60 seconds when triggered.

Target: 99% of rollbacks complete within 60s
Measurement: Time from rollback request to original route restoration
Priority: Rollback takes precedence over new reroute requests
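
One way to picture the precedence rule is a priority queue in which rollbacks always dequeue before pending reroutes. This is a minimal sketch; the `OperationQueue` class and its method names are hypothetical and do not describe Cyiro's internal scheduler.

```python
import heapq
import itertools

ROLLBACK_PRIORITY = 0  # lower value is served first
REROUTE_PRIORITY = 1


class OperationQueue:
    """Queue where rollback requests are served before reroute requests."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a priority

    def submit(self, kind, route_id):
        priority = ROLLBACK_PRIORITY if kind == "rollback" else REROUTE_PRIORITY
        heapq.heappush(self._heap, (priority, next(self._counter), kind, route_id))

    def next_operation(self):
        """Pop the next (kind, route_id) pair, or None when the queue is empty."""
        if not self._heap:
            return None
        _, _, kind, route_id = heapq.heappop(self._heap)
        return kind, route_id
```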

Circuit breaker behavior

Automatic circuit breaking prevents cascading failures during system stress.

Threshold: 5 consecutive failures or 20% error rate over 5 minutes
Cooldown: 30 minutes before automatic reset attempt
Manual override: Available via API and dashboard for emergency situations
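
The thresholds above could be modeled roughly as follows. This is a sketch under stated assumptions, not the product's implementation: the class and method names are hypothetical, `min_requests` is an added guard so that a single failure does not count as a 100% error rate, and the reset attempt is modeled as "a success closes the breaker, a failure restarts the cooldown".

```python
from collections import deque


class CircuitBreaker:
    """Opens on 5 consecutive failures or a 20% error rate over 5 minutes."""

    def __init__(self, max_consecutive=5, max_error_rate=0.20,
                 window_s=300, cooldown_s=1800, min_requests=10):
        self.max_consecutive = max_consecutive
        self.max_error_rate = max_error_rate
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_requests = min_requests  # assumption: not stated in the docs
        self.events = deque()             # (timestamp_s, ok) pairs
        self.consecutive_failures = 0
        self.opened_at = None             # None means the breaker is closed

    def record(self, ok, now):
        """Record one request outcome observed at time `now` (seconds)."""
        self.events.append((now, ok))
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

        if self.opened_at is not None:
            if now - self.opened_at >= self.cooldown_s:
                # Reset attempt after cooldown (assumed semantics).
                if ok:
                    self.manual_reset()
                else:
                    self.opened_at = now
            return

        failures = sum(1 for _, e_ok in self.events if not e_ok)
        rate_tripped = (len(self.events) >= self.min_requests
                        and failures / len(self.events) >= self.max_error_rate)
        if self.consecutive_failures >= self.max_consecutive or rate_tripped:
            self.opened_at = now

    def allows_traffic(self, now):
        """Closed breakers allow traffic; open ones only after the cooldown."""
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_s

    def manual_reset(self):
        """Manual override: close the breaker and clear its history."""
        self.opened_at = None
        self.consecutive_failures = 0
        self.events.clear()
```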

Route health signal requirements

Health signals used to determine route status and failover eligibility.

Latency: < 2000 ms for 95% of requests (p95) over 5 minutes
Availability: > 99.9% successful responses
Error rate: < 1% of requests result in 5xx errors
Policy compliance: All active policies pass validation
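
The four requirements above can be read as a single eligibility check. The sketch below assumes the signals have already been aggregated into the hypothetical `RouteSignals` record; the field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class RouteSignals:
    p95_latency_ms: float   # 95th percentile latency over the window
    availability: float     # fraction of successful responses, 0.0 to 1.0
    error_rate_5xx: float   # fraction of requests returning 5xx, 0.0 to 1.0
    policies_passing: bool  # True when all active policies pass validation


def failing_signals(s):
    """Return the names of the requirements the route currently misses."""
    failures = []
    if not s.p95_latency_ms < 2000:
        failures.append("latency")
    if not s.availability > 0.999:
        failures.append("availability")
    if not s.error_rate_5xx < 0.01:
        failures.append("error_rate")
    if not s.policies_passing:
        failures.append("policy_compliance")
    return failures


def is_healthy(s):
    return not failing_signals(s)
```

Returning the list of failing signals, rather than a bare boolean, makes it easier to record why a route was marked ineligible.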

Failover allowlist policy examples

Example policies for allowing failover to specific endpoints.

Allow failover to backup endpoints with similar capabilities
Require policy compliance checks before activation
Example: Allow failover from gpt-4 to claude-3-opus for chat routes

Failover denylist policy examples

Example policies for preventing failover to specific endpoints.

Block failover to endpoints with known compatibility issues
Prevent failover to higher-cost endpoints without approval
Example: Deny failover from stable models to experimental models
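
Allowlist and denylist rules like the examples above could be evaluated as in the following sketch. The rule shape, the `*` wildcard, the `experimental-model` name, and the choice that deny rules win over allow rules (with deny as the default) are all assumptions made for illustration; the documentation does not specify a rule format or a precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverRule:
    source: str            # model failing over from, or "*" for any
    target: str            # model failing over to, or "*" for any
    route_type: str = "*"  # e.g. "chat", or "*" for any

    def matches(self, source, target, route_type):
        return (self.source in ("*", source)
                and self.target in ("*", target)
                and self.route_type in ("*", route_type))


def failover_allowed(source, target, route_type, allow, deny):
    """Deny rules take precedence; anything not explicitly allowed is denied."""
    if any(r.matches(source, target, route_type) for r in deny):
        return False
    return any(r.matches(source, target, route_type) for r in allow)


# Allow failover from gpt-4 to claude-3-opus for chat routes.
ALLOW = [FailoverRule("gpt-4", "claude-3-opus", "chat")]
# Deny failover from any model to a (hypothetical) experimental model.
DENY = [FailoverRule("*", "experimental-model")]
```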

Fallback priority ordering

Priority ordering for fallback endpoints when multiple options are available.

Primary: Same provider, different region
Secondary: Different provider, similar capabilities
Tertiary: Different provider, reduced capabilities
Example: gpt-4 → gpt-4-other-region → claude-3-opus → gemini-pro
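
The ordering above can be sketched as a tier function plus a stable sort. The `Endpoint` fields, the provider and region labels, and the `"full"`/`"reduced"` capability values are hypothetical placeholders used to reproduce the example chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    provider: str
    region: str
    capability: str  # "full" or "reduced" (hypothetical labels)


def fallback_tier(primary, candidate):
    """Return 1, 2, or 3 per the ordering above, or None if not a fallback."""
    same_provider = candidate.provider == primary.provider
    if same_provider and candidate.region != primary.region:
        return 1  # same provider, different region
    if not same_provider and candidate.capability == primary.capability:
        return 2  # different provider, similar capabilities
    if not same_provider:
        return 3  # different provider, reduced capabilities
    return None   # same provider and region: not covered by the ordering


def order_fallbacks(primary, candidates):
    ranked = [(fallback_tier(primary, c), c) for c in candidates]
    eligible = [(tier, c) for tier, c in ranked if tier is not None]
    # sorted() is stable, so candidates within a tier keep their input order.
    return [c for _, c in sorted(eligible, key=lambda pair: pair[0])]


primary = Endpoint("gpt-4", "provider-a", "region-1", "full")
candidates = [
    Endpoint("gemini-pro", "provider-c", "region-1", "reduced"),
    Endpoint("claude-3-opus", "provider-b", "region-1", "full"),
    Endpoint("gpt-4-other-region", "provider-a", "region-2", "full"),
]
ordered_names = [e.name for e in order_fallbacks(primary, candidates)]
```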

Endpoint health probing

Active health checks to determine endpoint availability and performance.

Frequency: Every 30 seconds for active routes
Timeout: 5 seconds per probe
Threshold: 3 consecutive failures to mark as unhealthy
Recovery: 1 successful probe to mark as healthy
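
The failure and recovery thresholds above amount to a small state machine. The sketch below tracks only probe outcomes; scheduling probes every 30 seconds and enforcing the 5-second timeout are left to the caller, and the `ProbeTracker` name is hypothetical.

```python
PROBE_INTERVAL_S = 30  # probe frequency for active routes
PROBE_TIMEOUT_S = 5    # a probe exceeding this is assumed to count as a failure


class ProbeTracker:
    """Marks an endpoint unhealthy after 3 consecutive failed probes
    and healthy again after 1 successful probe."""

    def __init__(self, failure_threshold=3, recovery_threshold=1):
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self._failures = 0
        self._successes = 0

    def observe(self, ok):
        """Record one probe result and return the current health state."""
        if ok:
            self._failures = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.recovery_threshold:
                self.healthy = True
        else:
            self._successes = 0
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy
```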

Cooldown window behavior

Cooldown periods after failover operations to prevent flapping.

Default: 30 minutes after automatic failover
Manual override: Can be reduced to 5 minutes by operators
Health checks: Continue during cooldown but do not trigger additional failovers
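
A cooldown gate consistent with the behavior above might look like the sketch below. The function names are hypothetical, and clamping an operator override to the range between 5 and 30 minutes is an assumption based on the wording "can be reduced to 5 minutes".

```python
DEFAULT_COOLDOWN_S = 30 * 60      # default after an automatic failover
MIN_OVERRIDE_COOLDOWN_S = 5 * 60  # lowest value an operator can set


def effective_cooldown(override_s=None):
    """Return the cooldown in seconds, honoring an optional operator override."""
    if override_s is None:
        return DEFAULT_COOLDOWN_S
    # Assumption: overrides only shorten the cooldown, and never below 5 minutes.
    return max(MIN_OVERRIDE_COOLDOWN_S, min(override_s, DEFAULT_COOLDOWN_S))


def may_auto_failover(last_failover_at, now, route_unhealthy, override_s=None):
    """Health results are still evaluated during cooldown, but an unhealthy
    route only triggers another automatic failover once the cooldown ends."""
    in_cooldown = (last_failover_at is not None
                   and now - last_failover_at < effective_cooldown(override_s))
    return route_unhealthy and not in_cooldown
```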

Manual override behavior

Manual intervention capabilities for failover operations.

Immediate failover: Bypass cooldown and health check requirements
Force rollback: Immediately return to original endpoint
Pause automation: Temporarily disable automatic failover for specific routes

Human-review hold behavior

Hold states that require human approval before proceeding.

High-risk failovers: Require approval before execution
Policy violations: Require approval to override
Cost threshold breaches: Require financial approval
Timeout: 30 minutes without approval results in automatic cancellation
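
The timeout rule above could be expressed as a small resolver. This is a sketch only: the state names (`"pending"`, `"cancelled"`) and the `resolve_hold` signature are hypothetical, and it assumes a decision counts only if it was recorded before the 30-minute deadline.

```python
HOLD_TIMEOUT_S = 30 * 60  # holds without approval are cancelled after 30 minutes


def resolve_hold(requested_at, now, decision=None, decided_at=None):
    """Return the state of a hold at time `now` (all times in seconds).

    decision is "approved", "rejected", or None when no reviewer has acted.
    """
    deadline = requested_at + HOLD_TIMEOUT_S
    if decision is not None and decided_at is not None and decided_at < deadline:
        return decision
    if now >= deadline:
        return "cancelled"
    return "pending"
```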

Automation execution model

How Cyiro executes automation workflows with policy enforcement and safety checks.

Policy-first evaluation: All automation actions are validated against defined guardrails
Risk-based approval: High-risk actions require manual approval before execution
Audit trail generation: Complete logs of all automation decisions and outcomes
Rollback readiness: Automatic rollback procedures for failed or unsafe automation
Example: Failover workflow with pre-execution policy checks and post-execution validation
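
The workflow in the example above can be outlined as a sequence of gates. The sketch below is illustrative only: every callable passed in (`policy_check`, `approve`, `execute`, `validate`, `rollback`, `audit`) is a hypothetical hook, and the status strings are not taken from the product.

```python
def run_failover(action, policy_check, approve, execute, validate, rollback, audit):
    """Run one failover with pre-execution checks and post-execution validation.

    `action` is a dict; its "high_risk" key is an assumed field.
    Returns the final status string, which is also passed to `audit`.
    """
    # 1. Policy-first evaluation: nothing runs unless guardrails pass.
    if not policy_check(action):
        status = "rejected_by_policy"
    # 2. Risk-based approval: high-risk actions wait for a human decision.
    elif action.get("high_risk") and not approve(action):
        status = "not_approved"
    else:
        # 3. Execute, then validate the result.
        execute(action)
        if validate(action):
            status = "completed"
        else:
            # 4. Rollback readiness: undo a failed or unsafe change.
            rollback(action)
            status = "rolled_back"
    # 5. Audit trail: every outcome is logged, including rejections.
    audit(action, status)
    return status
```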

Automation audit log schema

Comprehensive audit logging for all automation actions and decisions.

Fields: timestamp, automation_id, action_type, workspace_id, route_id, status, initiated_by, approval_status, duration_ms
Retention: 90 days for standard logs, 365 days for high-risk actions
Export: Available via API and dashboard with filtering capabilities
Example (abridged): {"automation_id": "auto-123", "action_type": "failover", "route_id": "prod-chat", "status": "completed", "initiated_by": "system", "approval_status": "auto-approved"}
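
For consumers of exported logs, the field list above could be mirrored in a typed record. The field types are assumptions (the schema lists names only), and the `timestamp`, `workspace_id`, and `duration_ms` values below are hypothetical placeholders added to complete the documented example.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class AuditRecord:
    timestamp: str        # assumed ISO 8601 string
    automation_id: str
    action_type: str
    workspace_id: str
    route_id: str
    status: str
    initiated_by: str
    approval_status: str
    duration_ms: int      # assumed integer milliseconds


def retention_days(high_risk):
    """90 days for standard logs, 365 days for high-risk actions."""
    return 365 if high_risk else 90


record = AuditRecord(
    timestamp="2024-01-01T00:00:00Z",  # placeholder value
    automation_id="auto-123",
    action_type="failover",
    workspace_id="ws-example",         # placeholder value
    route_id="prod-chat",
    status="completed",
    initiated_by="system",
    approval_status="auto-approved",
    duration_ms=42000,                 # placeholder value
)
line = json.dumps(asdict(record))  # one JSON object per log line
```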

Self-hosted docs for Guardrails for AI.