diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..97b4a61 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,330 @@ +# Architecture Choices + +This document explains the key architectural decisions made in the Hash of Wisdom project and the reasoning behind them. + +## Overall Architecture + +### Clean Architecture +We follow Clean Architecture principles with clear layer separation: + +``` +┌─────────────────────────────────────┐ +│ Infrastructure Layer │ ← cmd/, internal/server, internal/protocol +├─────────────────────────────────────┤ +│ Application Layer │ ← internal/application (message handling) +├─────────────────────────────────────┤ +│ Domain Layer │ ← internal/service, internal/pow (business logic) +├─────────────────────────────────────┤ +│ External Layer │ ← internal/quotes (external APIs) +└─────────────────────────────────────┘ +``` + +**Benefits**: +- **Testability**: Each layer can be unit tested independently +- **Maintainability**: Changes in one layer don't cascade +- **Flexibility**: Easy to swap implementations (e.g., different quote sources) +- **Domain Focus**: Core business rules are isolated and protected + +## Protocol Design + +### Binary Protocol with JSON Payloads +Choice: Custom binary protocol with JSON-encoded message bodies + +**Why Binary Protocol**: +- **Performance**: Efficient framing and length prefixes +- **Reliability**: Clear message boundaries prevent parsing issues +- **Extensibility**: Easy to add message types and versions + +**Why JSON Payloads**: +- **Simplicity**: Standard library support, easy debugging +- **Flexibility**: Schema evolution without breaking compatibility +- **Tooling**: Excellent tooling and human readability + +**Alternative Considered**: Pure binary (Protocol Buffers) +- **Rejected Because**: Added complexity without significant benefit for our use case +- **Trade-off**: Slightly larger payload size for much simpler implementation + +### Stateless 
Challenge Design +Choice: HMAC-signed challenges with all state embedded + +```go +type Challenge struct { + Target string `json:"target"` // "quotes" + Timestamp int64 `json:"timestamp"` // Unix timestamp + Difficulty int `json:"difficulty"` // Leading zero bits + Random string `json:"random"` // Entropy + Signature string `json:"signature"` // HMAC-SHA256 +} +``` + +**Benefits**: +- **Scalability**: No server-side session storage required +- **Reliability**: Challenges survive server restarts +- **Security**: HMAC prevents tampering; the signed timestamp bounds the replay window +- **Simplicity**: No cache management or cleanup needed + +**Alternative Considered**: Session-based challenges +- **Rejected Because**: Requires distributed session management for horizontal scaling + +## Proof-of-Work Algorithm + +### SHA-256 with Leading Zero Bits +Choice: SHA-256 hashing with difficulty measured as leading zero bits + +**Why SHA-256**: +- **Security**: Cryptographically secure, extensively tested +- **Performance**: Hardware-optimized on most platforms +- **Standardization**: Well-known algorithm with predictable properties + +**Why Leading Zero Bits**: +- **Predictable Scaling**: Each additional bit doubles the expected work (about 2^n attempts for n bits) +- **Simplicity**: Easy to verify and understand +- **Flexibility**: Fine-grained difficulty adjustment + +**Alternative Considered**: Scrypt/Argon2 (memory-hard functions) +- **Rejected Because**: Excessive complexity for DDoS protection use case +- **Trade-off**: ASIC resistance not needed for temporary challenges + +### Difficulty Range: 4-30 Bits +Choice: Configurable difficulty with reasonable bounds + +- **Minimum (4 bits)**: ~16 attempts average, sub-second solve time +- **Maximum (30 bits)**: ~1 billion attempts on average (2^30), likely tens of seconds to minutes on a single modern CPU core +- **Default (4 bits)**: Balance between protection and user experience + +## Server Architecture + +### TCP Server with Per-Connection Goroutines +Choice: Custom TCP server with one goroutine per connection + +```go +func (s 
*TCPServer) Start(ctx context.Context) error { + // Start listener and store it so acceptLoop can use it + listener, err := net.Listen("tcp", s.config.Address) + if err != nil { + return err + } + s.listener = listener + + // Start accept loop in goroutine + go s.acceptLoop(ctx) + return nil // Returns immediately +} + +func (s *TCPServer) acceptLoop(ctx context.Context) { + for { + conn, err := s.listener.Accept() + if err != nil { + return // Listener closed or failed + } + // ctx.Done() is never nil for a cancellable context, so check ctx.Err() instead + if ctx.Err() != nil { + conn.Close() + return + } + + // Launch handler in goroutine with WaitGroup tracking + s.wg.Add(1) + go func() { + defer s.wg.Done() + s.handleConnection(ctx, conn) + }() + } +} +``` + +**Benefits**: +- **Concurrency**: Each connection handled in separate goroutine +- **Non-blocking Start**: Server starts in background, returns immediately +- **Graceful Shutdown**: WaitGroup ensures all connections finish before stop +- **Context Cancellation**: Proper cleanup when context is cancelled +- **Resource Control**: Connection timeouts prevent resource exhaustion + +**Alternative Considered**: HTTP/REST API +- **Rejected Because**: Test task requirements + +### Connection Security: Multi-Level Timeouts +Choice: Layered timeout protection against various attacks + +1. **Connection Timeout (15s)**: Maximum total connection lifetime +2. **Read Timeout (5s)**: Maximum time between incoming bytes +3. 
**Write Timeout (5s)**: Maximum time to send response + +**Protects Against**: +- **Slowloris**: Read timeout drops clients that trickle bytes slowly to hold connections open +- **Slow POST**: Connection timeout limits total request time +- **Resource Exhaustion**: Automatic cleanup of stale connections + +## Configuration Management + +### cleanenv with YAML + Environment Variables +Choice: File-based configuration with environment variable overrides + +```yaml +# config.yaml +server: + address: ":8080" + +pow: + difficulty: 4 +``` + +```bash +# Environment override +export POW_DIFFICULTY=8 +``` + +**Benefits**: +- **Development**: Easy configuration files for local development +- **Production**: Environment variables for containerized deployments +- **Validation**: Built-in validation and type conversion +- **Documentation**: Self-documenting with struct tags + +**Alternative Considered**: Pure environment variables +- **Rejected Because**: Harder to manage complex configurations + +## Observability Architecture + +### Prometheus Metrics +Choice: Prometheus format metrics with essential measurements + +**Application Metrics**: +- `wisdom_requests_total` - All incoming requests +- `wisdom_request_errors_total{error_type}` - Errors by type +- `wisdom_request_duration_seconds` - Request processing time +- `wisdom_quotes_served_total` - Successfully served quotes + +**Go Runtime Metrics** (automatically exported): +- `go_memstats_*` - Memory allocation and GC statistics +- `go_goroutines` - Current number of goroutines +- `go_gc_duration_seconds` - Garbage collection duration +- `process_*` - Process-level CPU, memory, and file descriptor stats + +**Design Principle**: Simple metrics that provide actionable insights +- **Avoided**: Complex multi-dimensional metrics +- **Focus**: Essential health and performance indicators +- **Runtime Visibility**: Go collector provides deep runtime observability + +### Metrics at Infrastructure Layer +Choice: Collect metrics in TCP server, not business logic + 
+```go +// In TCP server (infrastructure) +metrics.RequestsTotal.Inc() +start := time.Now() +response, err := s.wisdomApplication.HandleMessage(ctx, msg) +metrics.RequestDuration.Observe(time.Since(start).Seconds()) +``` + +**Benefits**: +- **Separation of Concerns**: Business logic stays pure +- **Consistency**: All requests measured the same way +- **Performance**: Minimal overhead in critical path + +## Design Patterns + +### Dependency Injection +All major components use constructor injection: +```go +server := server.NewTCPServer(wisdomApplication, config, options...) +service := service.NewWisdomService(generator, verifier, quoteService) +``` + +**Benefits**: +- **Testing**: Easy to inject mocks and stubs +- **Configuration**: Runtime assembly of components +- **Decoupling**: Components don't know about concrete implementations + +### Interface Segregation +Small, focused interfaces for easy testing: +```go +type ChallengeGenerator interface { + GenerateChallenge(ctx context.Context) (*Challenge, error) +} + +type QuoteService interface { + GetQuote(ctx context.Context) (string, error) +} +``` + +### Functional Options +Flexible configuration with sensible defaults: +```go +server := NewTCPServer(application, config, + WithLogger(logger), +) +``` + +### Clean Architecture Implementation +See the layer diagram in the Overall Architecture section above for package organization. + +## Testing Architecture + +### Layered Testing Strategy +1. **Unit Tests**: Each package tested independently with mocks +2. **Integration Tests**: End-to-end tests with real TCP connections +3. 
**Benchmark Tests**: Performance validation for PoW algorithms + +```go +// Unit test with mocks +func TestWisdomService_HandleMessage(t *testing.T) { + mockGenerator := &MockGenerator{} + mockVerifier := &MockVerifier{} + mockQuotes := &MockQuoteService{} + + service := NewWisdomService(mockGenerator, mockVerifier, mockQuotes) + // Test business logic in isolation +} + +// Integration test with real components +func TestTCPServer_SlowlorisProtection(t *testing.T) { + // Start real server, make slow connection + // Verify server doesn't hang +} +``` + +## Security Architecture + +### Defense in Depth +Multiple security layers working together: + +1. **HMAC Authentication**: Prevents challenge tampering +2. **Timestamp Validation**: Prevents replay attacks (5-minute TTL) +3. **Connection Timeouts**: Prevents resource exhaustion +4. **Proof-of-Work**: Rate limiting through computational cost +5. **Input Validation**: All protocol messages validated + +### Threat Model +**Primary Threats Addressed**: +- **DDoS Attacks**: PoW makes attacks expensive +- **Resource Exhaustion**: Connection timeouts and limits +- **Protocol Attacks**: Binary framing prevents confusion +- **Replay Attacks**: Timestamp validation in challenges + +**Threats NOT Addressed** (by design): +- **Authentication**: Public service, no user accounts +- **Authorization**: All valid solutions get quotes +- **Data Confidentiality**: Quotes are public information + +## Trade-offs Made + +### Simplicity vs Performance +- **Chose**: Simple JSON payloads over binary serialization +- **Trade-off**: ~30% larger messages for easier debugging and maintenance + +### Memory vs CPU +- **Chose**: Stateless challenges requiring CPU verification +- **Trade-off**: More CPU per request for better scalability + +### Flexibility vs Optimization +- **Chose**: Interface-based design with dependency injection +- **Trade-off**: Small runtime overhead for much better testability + +### Features vs Complexity +- **Chose**: 
Essential features only (no per-IP rate limiting, user accounts, etc.) +- **Benefit**: Clean, focused implementation that does one thing well + +## Future Architecture Considerations + +For production scaling, consider: +1. **Quote Service Enhancement**: Caching, fallback quotes, multiple API sources +2. **Load Balancing**: Multiple server instances behind load balancer +3. **Rate Limiting**: Per-IP request limiting for additional protection +4. **Monitoring**: Full observability stack (Prometheus, Grafana, alerting) +5. **Security**: TLS encryption for sensitive deployments + +The current architecture provides a solid foundation for these enhancements while maintaining simplicity and focus.