Chapter 0.2 — System Architecture
chapter: 00-introduction/02-system-architecture
version: 1.0.0
status: stable
last_reviewed: 2026-05-26
owners: [platform-engineering]
1. Purpose
This chapter is the architectural reference for the platform. It explains how the moving parts fit together physically and logically, where the trust boundaries are, and how a request flows from a browser or API client all the way to a persisted journal entry.
2. Architectural style
travoBooks is a modular monolith with strictly bounded modules, deployed as a small fleet of processes. We chose this over microservices for three reasons:
- Transactional integrity. The accounting promise — "an operational change and its journal entry are inseparable" — is enforced inside a single ACID database transaction. Distributing that across services creates a class of consistency bugs we refuse to ship.
- Domain coupling is high. Booking, ticketing, invoicing, ledger, and commission are all variations of the same event. Splitting them into services creates more inter-service chatter than they would have as in-process calls.
- Operational simplicity for the partner. A travel agency in Dhaka cannot operate a 30-service Kubernetes mesh. The deployment surface must remain understandable.
Inside the monolith, modules respect strict import rules (Layer 5 may import Layer 4, but never the reverse) which keep the codebase service-extractable when scale demands it.
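A minimal sketch of how such an import rule could be checked in CI. The module names and layer numbers are hypothetical; the only rule taken from the text is that a higher layer may import a lower one and never the reverse. Allowing imports that skip layers or stay within one layer is an assumption.

```python
# Hypothetical module-to-layer map; real module names and layer numbers will differ.
LAYERS = {
    "ledger": 3,
    "invoicing": 4,
    "booking": 5,
}

def violations(imports):
    """Return the (importer, imported) pairs where a lower layer imports a higher one."""
    return [
        (src, dst)
        for src, dst in imports
        if LAYERS[src] < LAYERS[dst]
    ]

# Each pair reads "importer imports imported".
observed = [("booking", "invoicing"), ("invoicing", "ledger"), ("ledger", "booking")]
bad = violations(observed)
```

In this sketch only the ledger-to-booking import is reported, because it points from Layer 3 up to Layer 5.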
3. High-level deployment topology
3.1 Tier responsibilities
| Tier | Responsibility | Scaling |
|---|---|---|
| Edge | TLS termination, WAF, DDoS protection, static caching | CDN provider handles |
| Load balancer | Round-robin + sticky sessions for UI workers | Horizontal |
| Web workers | Render UI; thin controllers | Horizontal, stateless |
| API workers | JSON API, authenticated, rate-limited | Horizontal, stateless |
| Job worker | Cron, async jobs, BSP imports, FX rate fetch, dunning | Vertical first, then sharded by job class |
| Webhook dispatcher | Sign + deliver webhooks with retry | Horizontal |
| Primary DB | All writes, strongly-consistent reads | Vertical + planned partner sharding |
| Read replica | Reports, large reads, search-page queries | Add replicas as needed |
| Redis cache | Hot lookups: FX rates, permission cache, idempotency keys | Vertical, then cluster |
| Redis queue | Job queue, separate Redis instance from cache | Vertical |
| Search index | Customer / supplier / booking text search | OpenSearch or MeiliSearch |
| Object storage | Invoice PDFs, ticket PDFs, ID documents, imports | S3-compatible |
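The webhook dispatcher row says deliveries are signed but does not specify the scheme. The sketch below assumes HMAC-SHA256 over a timestamp and the raw body, which is a common approach; the function names, the message layout, and the secret handling are all illustrative assumptions, not the platform's documented contract.

```python
import hashlib
import hmac

def sign_webhook(secret: bytes, timestamp: str, body: bytes) -> str:
    """Hex HMAC-SHA256 over 'timestamp.body' (assumed message layout)."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_webhook(secret: bytes, timestamp: str, body: bytes, signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_webhook(secret, timestamp, body)
    return hmac.compare_digest(expected, signature)

secret = b"per-partner-secret"  # hypothetical; would come from the secrets manager
body = b'{"event": "booking.created"}'
signature = sign_webhook(secret, "1700000000", body)
```

Including the timestamp in the signed message lets a receiver reject old deliveries that are replayed later.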
4. Request lifecycle (UI write)
The canonical lifecycle for an authenticated UI write — for example, "agent creates a booking":
Two architectural commitments are visible here:
- Steps 7–13 are a single database transaction. The booking, its segments, the draft invoice, the journal entry, and the audit log are written together or not at all. This is the structural enforcement of the "double-entry by default" pillar.
- Side effects are deferred (steps 14–15). Sending an email or notifying a supplier is not allowed inside the transaction, because rolling back the transaction cannot rewind an email.
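The two commitments above can be sketched as follows. This is an illustration only: it uses SQLite from the Python standard library, and the table names, column names, and the `outbox` list are hypothetical stand-ins for the real schema and job queue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id INTEGER PRIMARY KEY, customer TEXT, amount INTEGER)")
conn.execute(
    "CREATE TABLE journal_entries (id INTEGER PRIMARY KEY, booking_id INTEGER, debit INTEGER, credit INTEGER)"
)

def create_booking(conn, outbox, customer, amount, fail=False):
    """Write booking and journal entry atomically; queue side effects only after commit."""
    pending = []  # side effects are collected here, never executed inside the transaction
    with conn:  # commits on success, rolls back if an exception escapes the block
        cur = conn.execute(
            "INSERT INTO bookings (customer, amount) VALUES (?, ?)", (customer, amount)
        )
        booking_id = cur.lastrowid
        conn.execute(
            "INSERT INTO journal_entries (booking_id, debit, credit) VALUES (?, ?, ?)",
            (booking_id, amount, amount),
        )
        pending.append(("send_confirmation_email", booking_id))
        if fail:
            raise RuntimeError("simulated failure before commit")
    outbox.extend(pending)  # reached only after a successful commit
    return booking_id

outbox = []
first_id = create_booking(conn, outbox, "Rahim", 12000)
try:
    create_booking(conn, outbox, "Karim", 8000, fail=True)
except RuntimeError:
    pass  # rolled back: no rows written, nothing queued
```

Because the failed call never reaches `outbox.extend`, the rollback leaves neither rows nor a queued email behind.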
5. Request lifecycle (API write)
API requests use the same pipeline with three differences:
- Auth is by Bearer token (PAT or OAuth) instead of session cookie.
- CSRF middleware is bypassed; the token is the proof.
- Responses are JSON; errors follow the Error Code Catalog.
6. Trust boundaries
Rules at each boundary:
- Untrusted → Semi-trusted: TLS, WAF, rate limit, geo-blocking optional per route.
- Semi-trusted → Trusted: authentication required; CSRF on cookie-auth routes; idempotency required on POST/PUT for API; per-partner quotas.
- Trusted → Restricted: the database is reachable only from app/job networks; credentials are scoped least-privilege; secrets are short-lived and rotated.
- Trusted → External (supplier): outbound traffic flows through a network egress proxy with allowlist; supplier credentials live in the secrets manager and are loaded per-partner.
7. Data flow taxonomy
The platform handles four distinct shapes of data movement:
| Shape | Example | Latency profile | Failure semantics |
|---|---|---|---|
| Synchronous in-transaction | Booking + journal entry write | <300 ms | All-or-nothing |
| Asynchronous out-of-band | Send confirmation email | Seconds–minutes | At-least-once with retry |
| Batch / scheduled | BSP file import, FX rate refresh | Minutes–hours | Idempotent reruns |
| Streaming / real-time | Supplier inventory updates | Sub-second | Lossy with replay endpoint |
Module chapters specify which shape each operation uses.
8. Concurrency control
The platform uses three concurrency techniques in deliberate combination:
| Technique | Where used | Why |
|---|---|---|
| Optimistic locking (row_version) | Mutable operational rows: bookings, invoices.draft, customers | High read/low conflict; minimal contention. |
| Pessimistic row-level locks (SELECT ... FOR UPDATE) | Posting a journal entry, advancing an invoice from draft to issued | Strong serialisation for invariant-critical paths. |
| Distributed lock (Redis) | Cron jobs, BSP import, FX refresh | Prevents duplicate execution across job workers. |
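A minimal sketch of the optimistic-locking row in the table, assuming a `row_version` integer column that is incremented on every write. SQLite stands in for the primary database; the table layout and the `StaleRowError` name are hypothetical.

```python
import sqlite3

class StaleRowError(Exception):
    """Raised when no row matched the id and the version the caller read."""

def update_booking_status(conn, booking_id, new_status, expected_version):
    # The UPDATE matches only if row_version still equals the value the caller read.
    with conn:
        cur = conn.execute(
            "UPDATE bookings SET status = ?, row_version = row_version + 1 "
            "WHERE id = ? AND row_version = ?",
            (new_status, booking_id, expected_version),
        )
    if cur.rowcount == 0:
        # Either someone else updated the row first, or the row does not exist.
        raise StaleRowError(f"booking {booking_id} changed since version {expected_version}")
    return expected_version + 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, status TEXT, row_version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO bookings (id, status, row_version) VALUES (1, 'draft', 1)")

new_version = update_booking_status(conn, 1, "confirmed", expected_version=1)
try:
    update_booking_status(conn, 1, "cancelled", expected_version=1)  # stale read
    conflict = False
except StaleRowError:
    conflict = True
```

The second writer loses without blocking anyone, which suits rows that are read often and rarely contended. The invariant-critical paths in the table use `SELECT ... FOR UPDATE` instead, which SQLite does not support and so is not shown here.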
9. Idempotency
Every state-changing API endpoint requires an Idempotency-Key header. The server records (partner_id, route, key) → response_body for 24 hours. A replay returns the cached response without re-executing.
For internal jobs, idempotency is achieved by deterministic external keys (e.g. a BSP file is keyed by BSP-{period}-{partner_id}; reprocessing the same file is a no-op).
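The replay behaviour for API requests can be sketched as below. This is an in-memory stand-in for illustration; the class name, the `handler` callback, and the injectable clock are assumptions, and a production store would also have to handle two concurrent requests arriving with the same key.

```python
import time

class IdempotencyStore:
    TTL_SECONDS = 24 * 60 * 60  # the 24-hour retention window from the text

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # (partner_id, route, key) -> (expires_at, response_body)

    def execute(self, partner_id, route, key, handler):
        """Run handler once per (partner_id, route, key); replay the stored response afterwards."""
        now = self._clock()
        cache_key = (partner_id, route, key)
        hit = self._entries.get(cache_key)
        if hit is not None and hit[0] > now:
            return hit[1]  # replay: return the cached response without re-executing
        response = handler()
        self._entries[cache_key] = (now + self.TTL_SECONDS, response)
        return response

calls = []

def create_booking():
    calls.append("executed")
    return {"booking_id": 42}

now = [0.0]  # fake clock so expiry can be demonstrated
store = IdempotencyStore(clock=lambda: now[0])
first = store.execute("partner-1", "POST /bookings", "key-abc", create_booking)
replay = store.execute("partner-1", "POST /bookings", "key-abc", create_booking)
calls_after_replay = len(calls)
now[0] = IdempotencyStore.TTL_SECONDS + 1
store.execute("partner-1", "POST /bookings", "key-abc", create_booking)
```

After the window expires the same key executes again, so the 24-hour retention bounds how long a client can safely retry.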
10. Caching strategy
| Cache | TTL | Invalidation |
|---|---|---|
| FX rates | 1 hour (intraday); historical rates cached forever | TTL + explicit refresh job |
| Permission set per user | 5 minutes | Invalidated on role change |
| Chart-of-accounts tree | 15 minutes | Invalidated on CoA edit |
| Tax profile per partner | 15 minutes | Invalidated on profile edit |
| Customer / supplier list (paginated page 1) | 30 seconds | TTL only |
| API rate limit counters | Rolling window | TTL only |
We deliberately do not cache anything in the ledger path. Ledger reads go directly to the primary DB.
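The "TTL plus explicit invalidation" pattern used for the permission cache could look roughly like this. The class, the loader callback, and the in-memory dictionary are illustrative assumptions; the table above places the real cache in Redis.

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, or call loader(key) and cache the result for the TTL."""
        entry = self._data.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]
        value = loader(key)
        self._data[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop the entry immediately, for example on a role change."""
        self._data.pop(key, None)

roles = {"user-1": {"booking.read"}}  # hypothetical source of truth

def load_permissions(user_id):
    return set(roles[user_id])

permissions = TTLCache(ttl_seconds=5 * 60)  # 5-minute TTL from the table
before = permissions.get("user-1", load_permissions)
roles["user-1"] = {"booking.read", "booking.write"}  # role change
stale = permissions.get("user-1", load_permissions)  # still the cached set
permissions.invalidate("user-1")
after = permissions.get("user-1", load_permissions)
```

Without the explicit `invalidate` call, a revoked permission would stay effective for up to the full TTL.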
11. Background jobs and schedules
The platform runs the following scheduled jobs. The full listing and SLAs are in 08-system-features/03-automation.md.
| Job | Frequency | Purpose |
|---|---|---|
| fx_rates.refresh | Hourly | Pull rates from configured provider |
| bsp.import | Daily 04:00 partner-local | Pull BSP settlement file, post supplier payables |
| invoices.send_due_reminders | Daily 09:00 partner-local | Dunning emails |
| subscriptions.renew | Daily 02:00 UTC | Subscription renewal billing |
| gl.close_period_preview | Daily 23:30 partner-local | Pre-compute period close artefacts |
| audit.archive | Daily 01:00 UTC | Ship audit logs older than 90 days to cold storage |
| webhooks.retry | Every 5 minutes | Retry failed webhook deliveries with exponential backoff |
| notifications.flush | Every 1 minute | Send queued notifications |
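The table does not give the backoff parameters for webhooks.retry. As an illustration only, the sketch below assumes a 60-second base delay that doubles per attempt and is capped at one hour; both numbers and the function name are hypothetical.

```python
def retry_delay_seconds(attempt, base=60, cap=3600):
    """Delay before retry number `attempt` (0-based), doubling each time up to `cap`."""
    return min(cap, base * 2 ** attempt)

delays = [retry_delay_seconds(n) for n in range(8)]
# delays == [60, 120, 240, 480, 960, 1920, 3600, 3600]
```

Since the job itself runs every 5 minutes, a delivery becomes eligible once its delay has elapsed and is picked up on the next run.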
12. Observability
Three signal classes:
- Metrics (Prometheus-style): request rate, error rate, p50/p95/p99 latency per route; job durations; queue depth; DB connection pool usage; FX cache hit ratio.
- Logs: structured JSON, one event per significant action, correlated by request_id and actor_id. Logs are not the audit log; audit lives in the database.
- Traces: optional OpenTelemetry traces for slow-path debugging.
A redaction filter strips PII (passport numbers, card PANs, full names of passengers) from logs before shipping.
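One way such a filter could be wired into Python's standard `logging` module is sketched below. The regular expressions are assumptions: a run of 13 to 19 digits for card PANs, and one or two capital letters followed by 7 or 8 digits for passport numbers. Passenger names cannot be caught by pattern matching and would need field-level handling, which is not shown.

```python
import logging
import re

# Assumed shapes; real PAN and passport formats vary by issuer and country.
PAN_RE = re.compile(r"\b\d{13,19}\b")
PASSPORT_RE = re.compile(r"\b[A-Z]{1,2}\d{7,8}\b")

def redact(text):
    """Replace PAN-like and passport-like tokens with fixed placeholders."""
    text = PAN_RE.sub("[REDACTED_PAN]", text)
    return PASSPORT_RE.sub("[REDACTED_PASSPORT]", text)

class RedactionFilter(logging.Filter):
    def filter(self, record):
        # Format the message first so values passed as args are redacted too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, now redacted

sample = redact("card 4111111111111111 passport AB1234567 ok")

record = logging.LogRecord("app", logging.INFO, "app.py", 1, "pan=%s", ("4111111111111111",), None)
RedactionFilter().filter(record)
```

Attaching the filter to the handler that ships logs, rather than to individual loggers, means every record passes through it before leaving the process.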
13. Disaster recovery
| Metric | Target |
|---|---|
| RPO (data loss tolerated) | ≤ 5 minutes |
| RTO (time to restore) | ≤ 60 minutes |
| Backup frequency | Continuous binlog + hourly logical dump |
| Backup retention | 35 days hot, 1 year cold |
| Geo-redundancy | Primary + warm standby in second region |
| Quarterly restore drill | Mandatory; documented in 12-compliance/03-audit-readiness.md |
14. Security controls (cross-reference)
This section is a summary; full detail is in Volume XII.
- At rest: AES-256 for DB and object storage; column-level encryption for PAN-like fields.
- In transit: TLS 1.3 only; HSTS; certificate pinning for supplier connectors where supported.
- Auth: session cookies (httpOnly, SameSite=Strict, Secure) for UI; Bearer tokens for API; optional MFA per partner policy.
- Secrets: managed vault; no secrets in environment variables in production.
- Dependencies: locked, scanned weekly; CVE budget enforced in CI.
- AppSec: parameterised queries everywhere; output encoding by default; CSRF tokens on all cookie-auth writes; strict CSP.
15. Scaling beyond Phase 1
The architecture is deliberately conservative. Known evolution paths:
| Pressure | Response |
|---|---|
| Single DB write hot-spot | Partner-sharded DB cluster; shard key = partner_id |
| Long-tail GDS latencies | Pull supplier search into a separate "shopping" service |
| Report generation contention | Materialised views + warehouse offload (Snowflake/BigQuery) |
| Webhook fan-out | Dedicated dispatch service with per-partner rate isolation |
| Real-time AI agent traffic | gRPC bidirectional surface backed by the same domain services |
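For the partner-sharded response in the first row, a routing function keyed on partner_id might look like the sketch below. Hash-modulo placement and the function name are assumptions made for illustration; the text only fixes the shard key.

```python
import hashlib

def shard_for(partner_id: str, shard_count: int) -> int:
    """Map a partner to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(partner_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

shard_a = shard_for("partner-1001", 4)
shard_b = shard_for("partner-1001", 4)
```

A cryptographic digest is used instead of Python's built-in `hash()` because the latter is randomised per process for strings. Hash-modulo placement moves most partners when the shard count changes, so a lookup table from partner to shard is a common alternative.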
Next: 03-glossary.md — terms and abbreviations used throughout the documentation.