Self-hosted docs for Guardrails for AI.

Technical guidance for observability, guardrails, permissioning, and automation in one first-party documentation surface.

Failover Automation

Automated failover with safety checks and rollback capabilities.

docs.cyiro.com · production-ready guidance

Reroute completion SLO

Failover reroutes on approved routes are expected to complete within 90 seconds of the request.

  • Target: 95% of reroutes complete within 90s
  • Measurement: Time from reroute request to traffic shift confirmation
  • Safety: Policy checks must pass before traffic shift
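
As a rough illustration of how the target and measurement above could be checked, the sketch below computes the share of reroutes that finished within the limit. The function names and the input shape (a list of durations in seconds, measured from reroute request to traffic shift confirmation) are assumptions for illustration and are not part of any documented Cyiro API.

```python
def slo_attainment(durations_s, limit_s=90.0):
    """Return the fraction of operations that completed within limit_s seconds."""
    if not durations_s:
        raise ValueError("no operations to evaluate")
    within = sum(1 for d in durations_s if d <= limit_s)
    return within / len(durations_s)


def meets_slo(durations_s, limit_s=90.0, target=0.95):
    """True when at least `target` of the operations finished within limit_s."""
    return slo_attainment(durations_s, limit_s) >= target
```

The same helpers could be reused for the rollback SLO by passing limit_s=60.0 and target=0.99.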

Rollback completion SLO

Once triggered, rollback operations are expected to complete within 60 seconds.

  • Target: 99% of rollbacks complete within 60s
  • Measurement: Time from rollback request to original route restoration
  • Priority: Rollback takes precedence over new reroute requests
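
One way to picture the priority rule above is a queue in which pending rollbacks are always served before pending reroutes. This is a minimal sketch under that assumption; the class and constant names are hypothetical, and the real scheduler may work differently.

```python
import heapq
import itertools

ROLLBACK, REROUTE = 0, 1  # lower number is served first


class OperationQueue:
    """Pending operations, with rollbacks taking precedence over reroutes."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a kind

    def submit(self, kind, route_id):
        heapq.heappush(self._heap, (kind, next(self._seq), route_id))

    def next_operation(self):
        kind, _, route_id = heapq.heappop(self._heap)
        return ("rollback" if kind == ROLLBACK else "reroute", route_id)
```

Submitting a reroute, then a rollback, then another reroute yields the rollback first, followed by the two reroutes in arrival order.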

Circuit breaker behavior

Automatic circuit breaking prevents cascading failures during system stress.

  • Threshold: 5 consecutive failures, or an error rate of 20% over a 5-minute window
  • Cooldown: 30 minutes before automatic reset attempt
  • Manual override: Available via API and dashboard for emergency situations
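
The thresholds above can be sketched as a small state machine. This is an illustrative sketch, not the product's implementation: the class, the state names, and the min_samples guard (which stops a single early failure from reading as a 100% error rate) are all assumptions, and the 20% threshold is treated as inclusive.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, max_consecutive=5, max_error_rate=0.20,
                 window_s=300.0, cooldown_s=1800.0, min_samples=20):
        self.max_consecutive = max_consecutive
        self.max_error_rate = max_error_rate
        self.window_s = window_s        # 5-minute rolling window
        self.cooldown_s = cooldown_s    # 30 minutes before a reset attempt
        self.min_samples = min_samples  # assumption: not stated in the docs
        self._events = deque()          # (timestamp, ok) pairs inside the window
        self._consecutive = 0
        self._opened_at = None

    def record(self, ok, now):
        """Record one request outcome observed at time `now` (seconds)."""
        self._events.append((now, ok))
        while self._events and self._events[0][0] <= now - self.window_s:
            self._events.popleft()
        self._consecutive = 0 if ok else self._consecutive + 1
        failures = sum(1 for _, was_ok in self._events if not was_ok)
        rate_tripped = (len(self._events) >= self.min_samples
                        and failures / len(self._events) >= self.max_error_rate)
        if self._opened_at is None and (
                self._consecutive >= self.max_consecutive or rate_tripped):
            self._opened_at = now

    def state(self, now):
        if self._opened_at is None:
            return "closed"
        if now - self._opened_at >= self.cooldown_s:
            return "half_open"  # cooldown elapsed: a reset attempt is allowed
        return "open"

    def manual_reset(self):
        """Operator override: close the breaker immediately."""
        self._opened_at = None
        self._consecutive = 0
        self._events.clear()
```

Timestamps are passed in explicitly so the behavior is easy to test without waiting for real time to pass.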

Route health signal requirements

Health signals used to determine route status and failover eligibility.

  • Latency: < 2000ms for 95% of requests over 5 minutes
  • Availability: > 99.9% successful responses
  • Error rate: < 1% of requests result in 5xx errors
  • Policy compliance: All active policies pass validation
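
The four signals above can be combined into a single eligibility check. The sketch below assumes the metrics have already been aggregated over the measurement window; the RouteWindow type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouteWindow:
    p95_latency_ms: float   # 95th percentile latency over the window
    availability: float     # fraction of successful responses, 0..1
    error_rate_5xx: float   # fraction of requests answered with a 5xx, 0..1
    policies_pass: bool     # all active policies passed validation


def failing_signals(w):
    """Return the names of the health signals the route currently violates."""
    failures = []
    if not w.p95_latency_ms < 2000:
        failures.append("latency")
    if not w.availability > 0.999:
        failures.append("availability")
    if not w.error_rate_5xx < 0.01:
        failures.append("error_rate")
    if not w.policies_pass:
        failures.append("policy_compliance")
    return failures


def is_healthy(w):
    return not failing_signals(w)
```

Returning the list of failing signals, rather than a bare boolean, makes it easier to explain why a route was marked ineligible.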

Failover allowlist policy examples

Example policies for allowing failover to specific endpoints.

  • Allow failover to backup endpoints with similar capabilities
  • Require policy compliance checks before activation
  • Example: Allow failover from gpt-4 to claude-3-opus for chat routes

Failover denylist policy examples

Example policies for preventing failover to specific endpoints.

  • Block failover to endpoints with known compatibility issues
  • Prevent failover to higher-cost endpoints without approval
  • Example: Deny failover from stable models to experimental models
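
To make the allowlist and denylist examples concrete, here is a minimal sketch of how the two could be evaluated together. It assumes that deny rules are checked first and that anything not explicitly allowed is denied; neither rule is stated in the docs. The data layout and the experimental model name are hypothetical.

```python
# Allowlist entries are (source model, target model, route type).
ALLOWLIST = {("gpt-4", "claude-3-opus", "chat")}

# Hypothetical placeholder for a model flagged as experimental.
EXPERIMENTAL_MODELS = {"experimental-model-x"}


def failover_decision(source, target, route_type):
    """Return "allow" or "deny" for a proposed failover."""
    # Denylist first: never move from a stable model to an experimental one.
    if target in EXPERIMENTAL_MODELS and source not in EXPERIMENTAL_MODELS:
        return "deny"
    if (source, target, route_type) in ALLOWLIST:
        return "allow"
    return "deny"  # assumption: default-deny when no rule matches
```

With these rules, gpt-4 to claude-3-opus is allowed for chat routes only, and any stable-to-experimental failover is denied regardless of the allowlist.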

Fallback priority ordering

Priority ordering for fallback endpoints when multiple options are available.

  • Primary: Same provider, different region
  • Secondary: Different provider, similar capabilities
  • Tertiary: Different provider, reduced capabilities
  • Example: gpt-4 → gpt-4-other-region → claude-3-opus → gemini-pro
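
The ordering above amounts to sorting candidates by tier. In the sketch below, the Endpoint type, the region labels, and the reduced_capabilities flag are assumptions made for illustration; marking gemini-pro as reduced simply mirrors its tertiary position in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    provider: str
    region: str
    reduced_capabilities: bool = False


def fallback_tier(primary, candidate):
    """1 = primary tier, 2 = secondary, 3 = tertiary, None = not a listed tier."""
    same_provider = candidate.provider == primary.provider
    if same_provider and candidate.region != primary.region:
        return 1
    if not same_provider and not candidate.reduced_capabilities:
        return 2
    if not same_provider:
        return 3
    return None  # same provider and same region


def order_fallbacks(primary, candidates):
    """Return candidate names sorted by tier, dropping ineligible ones."""
    tiered = [(fallback_tier(primary, c), c) for c in candidates]
    eligible = [(t, c) for t, c in tiered if t is not None]
    return [c.name for t, c in sorted(eligible, key=lambda pair: pair[0])]
```

Because Python's sort is stable, candidates within the same tier keep the order in which they were supplied.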

Endpoint health probing

Active health checks to determine endpoint availability and performance.

  • Frequency: Every 30 seconds for active routes
  • Timeout: 5 seconds per probe
  • Threshold: 3 consecutive failures to mark as unhealthy
  • Recovery: 1 successful probe to mark as healthy
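
The failure and recovery thresholds above can be sketched as a small counter. The class name is hypothetical; a probe that exceeds the 5-second timeout would be reported to it as a failure.

```python
class ProbeTracker:
    """Tracks endpoint health from a stream of probe results."""

    def __init__(self, unhealthy_after=3):
        self.unhealthy_after = unhealthy_after
        self.healthy = True
        self._failures = 0  # consecutive failed probes

    def observe(self, success):
        """Record one probe result and return the current health status."""
        if success:
            # A single successful probe restores the endpoint to healthy.
            self._failures = 0
            self.healthy = True
        else:
            self._failures += 1
            if self._failures >= self.unhealthy_after:
                self.healthy = False
        return self.healthy
```

Note the asymmetry: three consecutive failures are needed to mark an endpoint unhealthy, but one success is enough to mark it healthy again.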

Cooldown window behavior

Cooldown periods after failover operations to prevent flapping.

  • Default: 30 minutes after automatic failover
  • Manual override: Can be reduced to 5 minutes by operators
  • Health checks: Continue during cooldown but do not trigger additional failovers
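
A minimal sketch of the cooldown rules above, assuming timestamps in seconds and assuming an operator override is clamped between the 5-minute floor and the 30-minute default. The function and constant names are hypothetical.

```python
DEFAULT_COOLDOWN_S = 30 * 60  # default after an automatic failover
MIN_COOLDOWN_S = 5 * 60       # lowest value an operator can set


def effective_cooldown(operator_override_s=None):
    """Return the cooldown to apply, honoring an optional operator override."""
    if operator_override_s is None:
        return DEFAULT_COOLDOWN_S
    return max(MIN_COOLDOWN_S, min(operator_override_s, DEFAULT_COOLDOWN_S))


def may_auto_failover(last_failover_at, now, cooldown_s=DEFAULT_COOLDOWN_S):
    """True when automation may trigger another failover at time `now`."""
    if last_failover_at is None:
        return True  # no previous failover, so no cooldown applies
    return now - last_failover_at >= cooldown_s
```

Health checks would keep running during the window; only the decision to trigger another failover is gated by may_auto_failover.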

Manual override behavior

Manual intervention capabilities for failover operations.

  • Immediate failover: Bypass cooldown and health check requirements
  • Force rollback: Immediately return to original endpoint
  • Pause automation: Temporarily disable automatic failover for specific routes

Human-review hold behavior

Hold states that require human approval before proceeding.

  • High-risk failovers: Require approval before execution
  • Policy violations: Require approval to override
  • Cost threshold breaches: Require financial approval
  • Timeout: 30 minutes without approval results in automatic cancellation
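
The hold conditions and the approval timeout above could be modeled as follows. This is a sketch under stated assumptions: the function names, the reason labels, and the status strings are hypothetical, and a reviewer decision passed to resolve_hold is assumed to have arrived before the timeout.

```python
HOLD_TIMEOUT_S = 30 * 60  # holds are cancelled after 30 minutes without approval


def hold_reasons(high_risk, policy_violation, cost_threshold_breached):
    """Return the reasons an action must wait for human review (empty if none)."""
    reasons = []
    if high_risk:
        reasons.append("high_risk_failover")
    if policy_violation:
        reasons.append("policy_violation")
    if cost_threshold_breached:
        reasons.append("cost_threshold")  # needs financial approval
    return reasons


def resolve_hold(requested_at, now, decision=None):
    """Return the hold status; decision is "approved", "rejected", or None."""
    if decision is not None:
        return decision
    if now - requested_at >= HOLD_TIMEOUT_S:
        return "cancelled"
    return "pending"
```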

Automation execution model

How Cyiro executes automation workflows with policy enforcement and safety checks.

  • Policy-first evaluation: All automation actions are validated against defined guardrails
  • Risk-based approval: High-risk actions require manual approval before execution
  • Audit trail generation: Complete logs of all automation decisions and outcomes
  • Rollback readiness: Automatic rollback procedures for failed or unsafe automation
  • Example: Failover workflow with pre-execution policy checks and post-execution validation
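
The workflow in the last bullet can be sketched as a single function that runs the stages in order and records each outcome. The function signature, the callback names, and the status strings are hypothetical; this illustrates the sequence of checks, not how Cyiro implements it.

```python
def run_automation(action, *, policy_ok, high_risk, approved,
                   execute, validate, rollback):
    """Run one automation action and return (final_status, audit_trail)."""
    trail = []

    def finish(status):
        trail.append(status)
        return status, trail

    # Policy-first evaluation: nothing runs unless guardrails pass.
    if not policy_ok(action):
        return finish("blocked_by_policy")
    trail.append("policy_passed")

    # Risk-based approval: high-risk actions wait for a human.
    if high_risk(action) and not approved(action):
        return finish("awaiting_approval")

    execute(action)
    trail.append("executed")

    # Post-execution validation, with rollback on failure.
    if not validate(action):
        rollback(action)
        return finish("rolled_back")
    return finish("completed")
```

Passing the checks in as callables keeps the sketch independent of any particular policy engine or approval system.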

Automation audit log schema

Comprehensive audit logging for all automation actions and decisions.

  • Fields: timestamp, automation_id, action_type, workspace_id, route_id, status, initiated_by, approval_status, duration_ms
  • Retention: 90 days for standard logs, 365 days for high-risk actions
  • Export: Available via API and dashboard with filtering capabilities
  • Example: {"automation_id": "auto-123", "action_type": "failover", "route_id": "prod-chat", "status": "completed", "initiated_by": "system", "approval_status": "auto-approved"}
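
For readers who consume exported logs, the field list above could be mirrored in a typed record. The sketch below is an assumption-laden illustration: the field types, the ISO 8601 timestamp format, and the helper names are not specified in the docs.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class AuditRecord:
    timestamp: str        # assumed to be an ISO 8601 string
    automation_id: str
    action_type: str
    workspace_id: str
    route_id: str
    status: str
    initiated_by: str
    approval_status: str
    duration_ms: int


def to_json(record):
    """Serialize a record to a JSON string with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


def retention_days(high_risk):
    """Retention period: 365 days for high-risk actions, 90 days otherwise."""
    return 365 if high_risk else 90
```

A record built from the example values, with placeholder values supplied for timestamp, workspace_id, and duration_ms, round-trips through to_json and json.loads unchanged.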