Monitoring, Logging, and Resilience
Audit Logging & Monitoring
Comprehensive Logging Strategy
Avoca maintains detailed audit logs across multiple infrastructure layers with retention periods optimized for operational and compliance requirements:
- AWS Infrastructure: 1 month retention for infrastructure-level events
- Datadog Application Monitoring: 15 days retention for application-specific logging and event management
- Vercel Deployment Platform: 90 days retention for deployment and application operations
- Vanta Compliance Platform: Unlimited retention for compliance-relevant events and control evidence
- Datadog and Vercel handle application-specific logging and event management
- Vanta and AWS GuardDuty provide security monitoring of in-scope systems
- Automated alerting detects suspicious activities and potential security incidents
- Centralized collection and retention of security-relevant events provides audit evidence
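The retention periods listed above can be expressed as a small lookup table, which makes it easy to check whether a given event should still be available from a given source. The following is a minimal sketch, not Avoca's actual tooling: the source keys, the `is_retained` helper, and the approximation of "1 month" as 30 days are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention windows taken from the list above. "1 month" is approximated
# as 30 days (an assumption); None stands for unlimited retention.
RETENTION_DAYS: dict[str, Optional[int]] = {
    "aws": 30,
    "datadog": 15,
    "vercel": 90,
    "vanta": None,
}


def is_retained(source: str, event_time: datetime, now: datetime) -> bool:
    """Return True if an event from `source` should still be available at `now`."""
    days = RETENTION_DAYS[source]
    if days is None:
        return True  # unlimited retention
    return now - event_time <= timedelta(days=days)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    event_time = now - timedelta(days=20)
    for source in RETENTION_DAYS:
        print(source, is_retained(source, event_time, now))
```

Under these assumptions, an event that is 20 days old would still be expected in AWS, Vercel, and Vanta, but no longer in Datadog.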
Business Continuity & Backup Strategy
Backup Infrastructure
- Frequency: Nightly automated database backups
- Retention: Two-week rolling retention for recovery purposes
- Provider: Managed by Supabase database infrastructure
- Future Enhancement: Manual multi-region backup capabilities under evaluation (ETA: Q2 or Q3 2026)
- Serverless Architecture: Cloud-native design enables automatic scaling and high availability
- Provider SLAs: Enterprise-grade uptime guarantees from all critical infrastructure providers
- Geographic Distribution: Multi-region capabilities leveraged where available from infrastructure providers
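To illustrate what a nightly schedule with two-week rolling retention implies for recovery, the sketch below partitions a set of backup dates into those still inside the window and those that have aged out. Backup pruning is handled by the database provider, so this is only a model of the window, not the provider's mechanism. The function names and the choice to treat the boundary day as expired are assumptions.

```python
from datetime import date, timedelta
from typing import Optional

RETENTION = timedelta(days=14)  # two-week rolling window from the text


def split_backups(backup_dates: list[date], today: date) -> tuple[list[date], list[date]]:
    """Partition backup dates into (kept, expired) as of `today`."""
    cutoff = today - RETENTION
    # Assumption: a backup exactly 14 days old is treated as expired.
    kept = sorted(d for d in backup_dates if d > cutoff)
    expired = sorted(d for d in backup_dates if d <= cutoff)
    return kept, expired


def oldest_restore_point(backup_dates: list[date], today: date) -> Optional[date]:
    """Return the earliest date that can still be restored, or None."""
    kept, _ = split_backups(backup_dates, today)
    return kept[0] if kept else None


if __name__ == "__main__":
    today = date(2026, 3, 20)
    nightly = [today - timedelta(days=n) for n in range(20)]
    kept, expired = split_backups(nightly, today)
    print(len(kept), "kept;", len(expired), "expired")
    print("oldest restore point:", oldest_restore_point(nightly, today))
```

With one backup per night, the window holds 14 restore points at any time, so the furthest a restore can reach back is 13 days before the most recent backup.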
Reliability & Performance Expectations
Uptime & Availability
- Target SLA: 99.9% uptime (≤8.76 hours downtime/year)
- Scheduled maintenance: ≥48-hour advance notice outside business hours
- Status page: Public or partner-accessible
- Incident communication: Email/SMS for significant outages
- Availability queries: 200 ms p95 goal
- Booking operations: 500 ms p95 goal
- Customer lookups: 150 ms p95 goal
- Content/script retrieval: 100 ms p95 goal (cacheable)
- Average requests/minute: 50–100 (launch)
- Peak requests/minute: 200–300
- Growth projection: 3–5× within first year
- Scaling events include marketing campaigns (10× spikes), seasonal peaks (2–3×), and new customer onboarding
- Gradual degradation preferred over hard failures
- Rate limiting with clear 429 responses
- Queue-based processing for non-real-time operations
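Rate limiting with clear 429 responses is commonly built on a token bucket, which allows short bursts while capping the sustained rate. The sketch below is one possible shape under stated assumptions: the `TokenBucket` class, the `handle_request` helper, and the capacity and refill values are hypothetical, and the document does not say which algorithm is actually used. The clock is passed in explicitly so the behavior is deterministic.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size, in requests
    refill_per_sec: float  # sustained rate, in requests per second
    tokens: float          # tokens currently available
    updated_at: float      # time of the last refill, in seconds

    def try_acquire(self, now: float) -> tuple[bool, float]:
        """Take one token if available; return (allowed, seconds_until_next_token)."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        return False, (1.0 - self.tokens) / self.refill_per_sec


def handle_request(bucket: TokenBucket, now: float) -> tuple[int, dict[str, str]]:
    """Return (status_code, headers) for one incoming request."""
    allowed, wait = bucket.try_acquire(now)
    if allowed:
        return 200, {}
    # Retry-After is expressed in whole seconds, so round up.
    return 429, {"Retry-After": str(math.ceil(wait))}


if __name__ == "__main__":
    # Hypothetical limit: bursts of 2, refilling at 1 request per second.
    bucket = TokenBucket(capacity=2, refill_per_sec=1.0, tokens=2, updated_at=0.0)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, handle_request(bucket, t))
```

Returning a `Retry-After` header alongside the 429 status tells well-behaved clients when to retry, which supports the preference for gradual degradation over hard failures.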
Error Handling & Health Monitoring
Logging Requirements
- Detailed error messages (without sensitive data)
- Request IDs for all responses
- Correlation IDs for distributed tracing
- Millisecond timestamp precision
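The logging requirements above can be combined into a single structured record. The following sketch shows one way to do so using only the Python standard library. The field names, the `SENSITIVE_KEYS` list, and the `build_log_record` helper are assumptions for illustration, not a description of the actual log schema.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any, Optional

# Hypothetical list of context keys whose values must never be logged.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def build_log_record(
    message: str,
    *,
    correlation_id: Optional[str] = None,
    now: Optional[datetime] = None,
    **context: Any,
) -> dict:
    """Build a log record with IDs, a millisecond timestamp, and redacted context."""
    now = now or datetime.now(timezone.utc)
    return {
        # ISO 8601 timestamp truncated to millisecond precision.
        "timestamp": now.isoformat(timespec="milliseconds"),
        # A fresh ID for this request/response pair.
        "request_id": str(uuid.uuid4()),
        # Reuse the caller's correlation ID so related calls can be traced together.
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "message": message,
        "context": {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
            for key, value in context.items()
        },
    }


if __name__ == "__main__":
    record = build_log_record(
        "booking failed",
        correlation_id="corr-1",
        error_type="slot_taken",
        token="abc123",
    )
    print(json.dumps(record, indent=2))
```

Generating a new request ID per response while propagating an existing correlation ID lets a single user action be followed across services even when it triggers several requests.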
- GET /api/v1/health/ready – Load balancer readiness checks
- GET /api/v1/health/live – Liveness probes
- Booking success/failure rates by error type
- API response time percentiles (p50, p95, p99)
- Rate limit hit rates
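As a rough illustration of how the response-time percentiles might be computed and compared against the p95 goals stated in this document, the sketch below uses the nearest-rank method on a list of latency samples. The operation keys, the `percentile` and `summarize` helpers, and the choice of nearest-rank are assumptions. A monitoring platform would normally compute these values itself, possibly with a different interpolation method.

```python
import math

# p95 goals in milliseconds, taken from the latency targets in this document.
P95_GOALS_MS = {
    "availability": 200,
    "booking": 500,
    "customer_lookup": 150,
    "content": 100,
}


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(operation: str, samples_ms: list[float]) -> dict:
    """Return p50/p95/p99 for an operation and whether it meets its p95 goal."""
    p95 = percentile(samples_ms, 95)
    return {
        "p50": percentile(samples_ms, 50),
        "p95": p95,
        "p99": percentile(samples_ms, 99),
        "meets_p95_goal": p95 <= P95_GOALS_MS[operation],
    }


if __name__ == "__main__":
    samples = [float(ms) for ms in range(10, 210, 10)]  # 10, 20, ..., 200 ms
    print(summarize("content", samples))
    print(summarize("booking", samples))
```

Tracking p50 alongside p95 and p99 helps separate typical latency from tail latency, since the goals in this document are defined at the 95th percentile.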