Monitoring, Logging, and Resilience

Audit Logging & Monitoring

Comprehensive Logging Strategy
Avoca maintains detailed audit logs across multiple infrastructure layers, with retention periods set to meet operational and compliance requirements:
  • AWS Infrastructure: 1-month retention for infrastructure-level events
  • Datadog Application Monitoring: 15-day retention for application-specific logging and event management
  • Vercel Deployment Platform: 90-day retention for deployment and application operations
  • Vanta Compliance Platform: Unlimited retention for compliance-relevant events and control evidence
Security Information & Event Management
  • Datadog and Vercel manage application-specific logging and event management
  • Vanta and AWS GuardDuty provide security monitoring of in-scope systems
  • Automated alerting detects suspicious activities and potential security incidents
  • Centralized collection and retention of security-relevant events provides audit evidence

Business Continuity & Backup Strategy

Backup Infrastructure
  • Frequency: Nightly automated database backups
  • Retention: Two-week rolling retention for recovery purposes
  • Provider: Managed by Supabase database infrastructure
  • Future Enhancement: Manual multi-region backup capabilities under evaluation (ETA: Q2 or Q3 2026)
Infrastructure Resilience
  • Serverless Architecture: Cloud-native design enables automatic scaling and high availability
  • Provider SLAs: Enterprise-grade uptime guarantees from all critical infrastructure providers
  • Geographic Distribution: Multi-region capabilities leveraged where available from infrastructure providers

Reliability & Performance Expectations

Uptime & Availability
  • Target SLA: 99.9% uptime (≤8.76 hours downtime/year)
  • Scheduled maintenance: Performed outside business hours, with at least 48 hours' advance notice
  • Status page: Public or partner-accessible
  • Incident communication: Email/SMS for significant outages
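The downtime figure attached to the 99.9% target follows from simple arithmetic. A minimal sketch, assuming a 365-day year; the function name and the 30-day window are illustrative and not part of the source:

```python
def downtime_budget_hours(sla_percent: float, period_hours: float = 365 * 24) -> float:
    """Return the maximum downtime (in hours) an uptime target allows over a period."""
    return (1 - sla_percent / 100) * period_hours


# Yearly budget for the 99.9% target, in hours.
yearly_hours = round(downtime_budget_hours(99.9), 2)

# Budget for a 30-day window (an assumed reporting period), in minutes.
monthly_minutes = round(downtime_budget_hours(99.9, 30 * 24) * 60, 1)

print(yearly_hours, monthly_minutes)
```

Expressing the budget per month makes it easier to judge whether a single incident has consumed most of the allowance.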
Response Time Targets
  • Availability queries: 200 ms p95 goal
  • Booking operations: 500 ms p95 goal
  • Customer lookups: 150 ms p95 goal
  • Content/script retrieval: 100 ms p95 goal (cacheable)
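One way to check measured latencies against these goals is a nearest-rank percentile. The sketch below is illustrative only: the operation keys and helper names are assumptions, and a production system would more likely read percentiles from its monitoring platform than compute them by hand.

```python
import math

# p95 goals from the list above, in milliseconds. The key names are hypothetical.
P95_TARGETS_MS = {
    "availability_query": 200,
    "booking_operation": 500,
    "customer_lookup": 150,
    "content_retrieval": 100,
}


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(operation, samples_ms):
    """Return True when the observed p95 is within the goal for the operation."""
    return percentile(samples_ms, 95) <= P95_TARGETS_MS[operation]
```

With 20 samples, the p95 is the 19th-smallest value, so a single slow outlier does not by itself break the goal, while two do.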
Scalability Planning
  • Average requests/minute: 50–100 (launch)
  • Peak requests/minute: 200–300
  • Growth projection: 3–5× within first year
  • Scaling events include marketing campaigns (10× spikes), seasonal peaks (2–3×), and new customer onboarding
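These figures can be combined into a rough capacity ceiling. The sketch below takes the upper end of each range and assumes that the multipliers compound and that campaign spikes apply to the average rate; the source does not state either point, so treat the result as an order-of-magnitude estimate.

```python
def requests_per_second(per_minute: float) -> float:
    """Convert a requests-per-minute figure to requests per second."""
    return per_minute / 60


launch_avg_high = 100    # upper end of average requests/minute at launch
launch_peak_high = 300   # upper end of peak requests/minute at launch
growth_high = 5          # upper end of the first-year growth projection
campaign_spike = 10      # marketing campaign multiplier

# Assumption: growth scales the launch peak proportionally.
year_one_peak = launch_peak_high * growth_high

# Assumption: a campaign spike multiplies the grown average rate.
year_one_campaign = launch_avg_high * growth_high * campaign_spike

print(year_one_peak, year_one_campaign, round(requests_per_second(year_one_campaign), 1))
```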
Performance Under Load
  • Gradual degradation preferred over hard failures
  • Rate limiting with clear 429 responses
  • Queue-based processing for non-real-time operations
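A token bucket is one common way to produce rate limiting with clear 429 responses. The source does not say which algorithm is used, so the following is a sketch under that assumption; the class, the `handle` function, and the capacity and refill values are all hypothetical. `Retry-After` is the standard HTTP header for telling clients when to retry.

```python
import math
import time


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def try_acquire(self):
        """Return 0.0 if a token was taken, else the seconds until one is available."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        return (1 - self.tokens) / self.refill_per_second


def handle(bucket):
    """Return a (status, headers) pair: 200 when allowed, 429 with Retry-After otherwise."""
    wait = bucket.try_acquire()
    if wait == 0.0:
        return 200, {}
    return 429, {"Retry-After": str(math.ceil(wait))}
```

Because the bucket refills continuously, clients that back off for the advertised interval recover without a hard cutoff, which fits the preference for gradual degradation.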

Error Handling & Health Monitoring

Logging Requirements
  • Detailed error messages (without sensitive data)
  • Request IDs for all responses
  • Correlation IDs for distributed tracing
  • Millisecond timestamp precision
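A structured log record covering these requirements might look like the sketch below. The field names, the redaction list, and the choice of UUIDs for identifiers are assumptions made for illustration, not a documented schema.

```python
import json
import uuid
from datetime import datetime, timezone

# Hypothetical set of context keys that must never be logged in clear text.
SENSITIVE_KEYS = {"password", "token", "card_number"}


def build_log_record(message, context=None, correlation_id=None, now=None):
    """Build an error log record with IDs, a millisecond timestamp, and redacted context."""
    now = now or datetime.now(timezone.utc)
    safe_context = {
        key: "[REDACTED]" if key in SENSITIVE_KEYS else value
        for key, value in (context or {}).items()
    }
    return {
        # UTC timestamp with millisecond precision, e.g. 2025-10-31T14:30:00.123Z
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "request_id": str(uuid.uuid4()),
        # Reuse the caller's correlation ID so one trace spans several services.
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "level": "error",
        "message": message,
        "context": safe_context,
    }


print(json.dumps(build_log_record("Booking failed: slot unavailable",
                                  {"token": "abc", "slot": "10:00"})))
```

The request ID is unique per response, while the correlation ID is passed through unchanged when an upstream caller supplies one.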
Health Check Endpoints
GET /api/v1/health
Response:
{
  "status": "healthy",
  "version": "1.2.3",
  "timestamp": "2025-10-31T14:30:00Z",
  "dependencies": {
    "database": "healthy",
    "cache": "healthy",
    "external_api": "degraded"
  }
}
Additional endpoints:
  • GET /api/v1/health/ready – Load balancer readiness checks
  • GET /api/v1/health/live – Liveness probes
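The response body shown above could be assembled along these lines. This is a sketch, not the actual implementation: the function name is hypothetical, and the aggregation rule is an assumption inferred from the sample, where a degraded dependency leaves the top-level status healthy. The "unhealthy" value does not appear in the source and is introduced only for illustration.

```python
from datetime import datetime, timezone


def build_health_response(version, dependencies, now=None):
    """Assemble a health payload in the shape of the sample response."""
    now = now or datetime.now(timezone.utc)
    # Assumed rule: only an "unhealthy" dependency changes the overall status.
    overall = "unhealthy" if "unhealthy" in dependencies.values() else "healthy"
    return {
        "status": overall,
        "version": version,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "dependencies": dict(dependencies),
    }
```

A readiness endpoint could apply the same rule and return a non-2xx status when the overall value is not healthy, while a liveness endpoint would typically skip dependency checks altogether.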
Monitoring Metrics Exposure
  • Booking success/failure rates by error type
  • API response time percentiles (p50, p95, p99)
  • Rate limit hit rates
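Rates by error type reduce to counting outcomes and dividing by the total. A minimal sketch, assuming each booking attempt is recorded as either "success" or an error-type label; the labels used here are hypothetical:

```python
from collections import Counter


def rates_by_outcome(outcomes):
    """Return the share of each outcome label among all recorded booking attempts."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: count / total for outcome, count in counts.items()}


print(rates_by_outcome(["success", "success", "success", "slot_taken"]))
```

The same calculation applies to rate limit hit rates by recording each response as either allowed or limited.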