Architecture Choices

This document explains the key architectural decisions made in the Hash of Wisdom project and the reasoning behind them.

Overall Architecture

Clean Architecture

We follow Clean Architecture principles with clear layer separation:

┌─────────────────────────────────────┐
│         Infrastructure Layer        │  ← cmd/, internal/server, internal/protocol
├─────────────────────────────────────┤
│         Application Layer           │  ← internal/application (message handling)
├─────────────────────────────────────┤
│           Domain Layer              │  ← internal/service, internal/pow (business logic)
├─────────────────────────────────────┤
│          External Layer             │  ← internal/quotes (external APIs)
└─────────────────────────────────────┘

Benefits:

  • Testability: Each layer can be unit tested independently
  • Maintainability: Changes in one layer don't cascade
  • Flexibility: Easy to swap implementations (e.g., different quote sources)
  • Domain Focus: Core business rules are isolated and protected

Protocol Design

Binary Protocol with JSON Payloads

Choice: Custom binary protocol with JSON-encoded message bodies

Why Binary Protocol:

  • Performance: Efficient framing and length prefixes
  • Reliability: Clear message boundaries prevent parsing issues
  • Extensibility: Easy to add message types and versions

Why JSON Payloads:

  • Simplicity: Standard library support, easy debugging
  • Flexibility: Schema evolution without breaking compatibility
  • Tooling: Excellent tooling and human readability

Alternative Considered: Pure binary (Protocol Buffers)

  • Rejected Because: Added complexity without significant benefit for our use case
  • Trade-off: Slightly larger payload size for much simpler implementation
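
For illustration only, a frame could be written roughly as in the sketch below: a length prefix, a one-byte message type, and the JSON body. The field widths and the writeFrame helper are assumptions, not the project's actual wire layout.

// Hypothetical frame layout: [4-byte big-endian length][1-byte type][JSON body].
// Field widths here are illustrative only.
func writeFrame(w io.Writer, msgType byte, payload any) error {
    body, err := json.Marshal(payload)
    if err != nil {
        return err
    }

    frame := make([]byte, 4+1+len(body))
    binary.BigEndian.PutUint32(frame[:4], uint32(1+len(body))) // length covers type + body
    frame[4] = msgType
    copy(frame[5:], body)

    _, err = w.Write(frame)
    return err
}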

Stateless Challenge Design

Choice: HMAC-signed challenges with all state embedded

type Challenge struct {
    Target     string `json:"target"`      // "quotes"
    Timestamp  int64  `json:"timestamp"`   // Unix timestamp
    Difficulty int    `json:"difficulty"`  // Leading zero bits
    Random     string `json:"random"`      // Entropy
    Signature  string `json:"signature"`   // HMAC-SHA256
}

Benefits:

  • Scalability: No server-side session storage required
  • Reliability: Challenges survive server restarts
  • Security: HMAC prevents tampering; the embedded timestamp limits replay
  • Simplicity: No cache management or cleanup needed

Alternative Considered: Session-based challenges

  • Rejected Because: Requires distributed session management for horizontal scaling
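
A minimal sketch of how such a challenge could be signed, assuming the server holds a single secret key; the field serialization and the signChallenge name are illustrative.

// signChallenge binds all challenge fields to an HMAC-SHA256 tag so the
// server can later verify a returned challenge without storing any state.
// The "target|timestamp|difficulty|random" serialization is illustrative.
func signChallenge(secret []byte, c *Challenge) string {
    mac := hmac.New(sha256.New, secret)
    fmt.Fprintf(mac, "%s|%d|%d|%s", c.Target, c.Timestamp, c.Difficulty, c.Random)
    return hex.EncodeToString(mac.Sum(nil))
}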

Proof-of-Work Algorithm

SHA-256 with Leading Zero Bits

Choice: SHA-256 hashing with difficulty measured as leading zero bits

Why SHA-256:

  • Security: Cryptographically secure, extensively tested
  • Performance: Hardware-optimized on most platforms
  • Standardization: Well-known algorithm with predictable properties

Why Leading Zero Bits:

  • Exponential Scaling: Each additional bit doubles the expected work (~2^n attempts on average)
  • Simplicity: Easy to verify and understand
  • Flexibility: Fine-grained difficulty adjustment

Alternative Considered: Scrypt/Argon2 (memory-hard functions)

  • Rejected Because: Excessive complexity for DDoS protection use case
  • Trade-off: ASIC resistance not needed for temporary challenges
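
A sketch of leading-zero-bit checking and a brute-force solve loop; the challenge-plus-nonce input format is an assumption for illustration.

// hasLeadingZeroBits reports whether SHA-256(data) starts with at least
// `difficulty` zero bits.
func hasLeadingZeroBits(data []byte, difficulty int) bool {
    sum := sha256.Sum256(data)
    zeros := 0
    for _, b := range sum {
        if b == 0 {
            zeros += 8
            continue
        }
        zeros += bits.LeadingZeros8(b) // math/bits
        break
    }
    return zeros >= difficulty
}

// The client brute-forces a nonce until the digest meets the difficulty.
// The "challenge:nonce" concatenation is illustrative.
func solve(challenge string, difficulty int) uint64 {
    for nonce := uint64(0); ; nonce++ {
        if hasLeadingZeroBits([]byte(fmt.Sprintf("%s:%d", challenge, nonce)), difficulty) {
            return nonce
        }
    }
}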

Difficulty Range: 4-30 Bits

Choice: Configurable difficulty with reasonable bounds

  • Minimum (4 bits): ~16 attempts average, sub-second solve time
  • Maximum (30 bits): ~1 billion attempts, several seconds on modern CPU
  • Default (4 bits): Balance between protection and user experience

Server Architecture

TCP Server with Per-Connection Goroutines

Choice: Custom TCP server with one goroutine per connection

func (s *TCPServer) Start(ctx context.Context) error {
    // Start listener
    listener, err := net.Listen("tcp", s.config.Address)
    if err != nil {
        return err
    }
    s.listener = listener

    // Start accept loop in a background goroutine
    go s.acceptLoop(ctx)
    return nil // Returns immediately
}

func (s *TCPServer) acceptLoop(ctx context.Context) {
    for {
        conn, err := s.listener.Accept()
        if err != nil {
            return // listener closed or fatal accept error
        }

        // Stop handling new connections once the context is cancelled
        select {
        case <-ctx.Done():
            conn.Close()
            return
        default:
        }

        // Launch handler in goroutine with WaitGroup tracking
        s.wg.Add(1)
        go func() {
            defer s.wg.Done()
            s.handleConnection(ctx, conn)
        }()
    }
}

Benefits:

  • Concurrency: Each connection handled in separate goroutine
  • Non-blocking Start: Server starts in background, returns immediately
  • Graceful Shutdown: WaitGroup ensures all connections finish before stop
  • Context Cancellation: Proper cleanup when context is cancelled
  • Resource Control: Connection timeouts prevent resource exhaustion
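
A graceful Stop can then be as small as the sketch below, reusing the listener and WaitGroup from the sketch above; the actual implementation may differ.

// Stop closes the listener so acceptLoop's Accept returns an error and the
// loop exits, then waits for all tracked connection handlers to finish.
func (s *TCPServer) Stop() error {
    err := s.listener.Close()
    s.wg.Wait()
    return err
}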

Alternative Considered: HTTP/REST API

  • Rejected Because: The test task requirements specify a custom TCP protocol

Connection Security: Multi-Level Timeouts

Choice: Layered timeout protection against various attacks

  1. Connection Timeout (15s): Maximum total connection lifetime
  2. Read Timeout (5s): Maximum time between incoming bytes
  3. Write Timeout (5s): Maximum time to send response

Protects Against:

  • Slowloris: Read timeout cuts off clients that trickle bytes to hold connections open
  • Slow POST-style attacks: Connection timeout caps total request lifetime
  • Resource Exhaustion: Automatic cleanup of stale connections
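
These layers map onto net.Conn deadlines roughly as in the sketch below; the config field names and the readMessage/writeMessage helpers are assumptions.

// Sketch of how a connection handler can apply the three timeout levels.
func (s *TCPServer) handleConnection(ctx context.Context, conn net.Conn) {
    defer conn.Close()

    // 1. Connection timeout: absolute lifetime of the whole connection.
    conn.SetDeadline(time.Now().Add(s.config.ConnectionTimeout))

    for {
        // 2. Read timeout: bound the wait for the next incoming bytes.
        conn.SetReadDeadline(time.Now().Add(s.config.ReadTimeout))
        msg, err := readMessage(conn) // hypothetical framing helper
        if err != nil {
            return // timeout, EOF, or malformed frame
        }

        resp, err := s.wisdomApplication.HandleMessage(ctx, msg)
        if err != nil {
            return
        }

        // 3. Write timeout: bound the time spent sending the response.
        conn.SetWriteDeadline(time.Now().Add(s.config.WriteTimeout))
        if err := writeMessage(conn, resp); err != nil { // hypothetical helper
            return
        }
    }
}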

Configuration Management

cleanenv with YAML + Environment Variables

Choice: File-based configuration with environment variable overrides

# config.yaml
server:
  address: ":8080"

pow:
  difficulty: 4

# Environment override
export POW_DIFFICULTY=8
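
The matching Go structs could look roughly like this with cleanenv struct tags; apart from POW_DIFFICULTY, the field, tag, and environment variable names are assumptions.

// Illustrative config structs; field and tag names are assumptions.
type Config struct {
    Server ServerConfig `yaml:"server"`
    PoW    PoWConfig    `yaml:"pow"`
}

type ServerConfig struct {
    Address string `yaml:"address" env:"SERVER_ADDRESS" env-default:":8080"`
}

type PoWConfig struct {
    Difficulty int `yaml:"difficulty" env:"POW_DIFFICULTY" env-default:"4"`
}

func LoadConfig(path string) (*Config, error) {
    var cfg Config
    // cleanenv reads the YAML file, then applies environment overrides.
    if err := cleanenv.ReadConfig(path, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}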

Benefits:

  • Development: Easy configuration files for local development
  • Production: Environment variables for containerized deployments
  • Validation: Built-in validation and type conversion
  • Documentation: Self-documenting with struct tags

Alternative Considered: Pure environment variables

  • Rejected Because: Harder to manage complex configurations

Observability Architecture

Prometheus Metrics

Choice: Prometheus format metrics with essential measurements

Application Metrics:

  • wisdom_requests_total - All incoming requests
  • wisdom_request_errors_total{error_type} - Errors by type
  • wisdom_request_duration_seconds - Request processing time
  • wisdom_quotes_served_total - Successfully served quotes
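
These can be declared with the standard client_golang promauto helpers, roughly as below; the metric names match the list above, while the Go identifiers and help strings are illustrative.

// Metric definitions (sketch); names mirror the list above.
var (
    RequestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "wisdom_requests_total",
        Help: "All incoming requests.",
    })

    RequestErrorsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "wisdom_request_errors_total",
        Help: "Request errors by type.",
    }, []string{"error_type"})

    RequestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "wisdom_request_duration_seconds",
        Help:    "Request processing time.",
        Buckets: prometheus.DefBuckets,
    })

    QuotesServedTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "wisdom_quotes_served_total",
        Help: "Successfully served quotes.",
    })
)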

Go Runtime Metrics (automatically exported):

  • go_memstats_* - Memory allocation and GC statistics
  • go_goroutines - Current number of goroutines
  • go_gc_duration_seconds - Garbage collection duration
  • process_* - Process-level CPU, memory, and file descriptor stats

Design Principle: Simple metrics that provide actionable insights

  • Avoided: Complex multi-dimensional metrics
  • Focus: Essential health and performance indicators
  • Runtime Visibility: Go collector provides deep runtime observability

Metrics at Infrastructure Layer

Choice: Collect metrics in TCP server, not business logic

// In TCP server (infrastructure)
metrics.RequestsTotal.Inc()
start := time.Now()
response, err := s.wisdomApplication.HandleMessage(ctx, msg)
metrics.RequestDuration.Observe(time.Since(start).Seconds())

Benefits:

  • Separation of Concerns: Business logic stays pure
  • Consistency: All requests measured the same way
  • Performance: Minimal overhead in critical path

Design Patterns

Dependency Injection

All major components use constructor injection:

server := server.NewTCPServer(wisdomApplication, config, options...)
service := service.NewWisdomService(generator, verifier, quoteService)

Benefits:

  • Testing: Easy to inject mocks and stubs
  • Configuration: Runtime assembly of components
  • Decoupling: Components don't know about concrete implementations

Interface Segregation

Small, focused interfaces for easy testing:

type ChallengeGenerator interface {
    GenerateChallenge(ctx context.Context) (*Challenge, error)
}

type QuoteService interface {
    GetQuote(ctx context.Context) (string, error)
}

Functional Options

Flexible configuration with sensible defaults:

server := NewTCPServer(application, config,
    WithLogger(logger),
)
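
Under the hood this is the usual functional options pattern; a minimal sketch, assuming the server keeps a logger field, uses log/slog, and a WisdomApplication interface (names are assumptions):

// Option mutates the server at construction time.
type Option func(*TCPServer)

// WithLogger overrides the default logger (field name is illustrative).
func WithLogger(logger *slog.Logger) Option {
    return func(s *TCPServer) {
        s.logger = logger
    }
}

func NewTCPServer(app WisdomApplication, config Config, opts ...Option) *TCPServer {
    s := &TCPServer{
        wisdomApplication: app,
        config:            config,
        logger:            slog.Default(), // sensible default
    }
    for _, opt := range opts {
        opt(s)
    }
    return s
}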

Clean Architecture Implementation

See the layer diagram in the Overall Architecture section above for package organization.

Testing Architecture

Layered Testing Strategy

  1. Unit Tests: Each package tested independently with mocks
  2. Integration Tests: End-to-end tests with real TCP connections
  3. Benchmark Tests: Performance validation for PoW algorithms

// Unit test with mocks
func TestWisdomService_HandleMessage(t *testing.T) {
    mockGenerator := &MockGenerator{}
    mockVerifier := &MockVerifier{}
    mockQuotes := &MockQuoteService{}

    service := NewWisdomService(mockGenerator, mockVerifier, mockQuotes)
    // Test business logic in isolation
}

// Integration test with real components
func TestTCPServer_SlowlorisProtection(t *testing.T) {
    // Start real server, make slow connection
    // Verify server doesn't hang
}

Security Architecture

Defense in Depth

Multiple security layers working together:

  1. HMAC Authentication: Prevents challenge tampering
  2. Timestamp Validation: Prevents replay attacks (5-minute TTL)
  3. Connection Timeouts: Prevents resource exhaustion
  4. Proof-of-Work: Rate limiting through computational cost
  5. Input Validation: All protocol messages validated
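
Layers 1 and 2 combine at verification time, roughly as in the sketch below; signChallenge refers to the earlier HMAC sketch, the verifyChallenge name is illustrative, and the 5-minute TTL matches the list above.

// verifyChallenge checks the HMAC signature (layer 1) and the timestamp
// TTL (layer 2) before any proof-of-work solution is evaluated.
func verifyChallenge(secret []byte, c *Challenge) error {
    expected := signChallenge(secret, c)
    if !hmac.Equal([]byte(expected), []byte(c.Signature)) {
        return errors.New("invalid challenge signature")
    }
    if time.Since(time.Unix(c.Timestamp, 0)) > 5*time.Minute {
        return errors.New("challenge expired")
    }
    return nil
}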

Threat Model

Primary Threats Addressed:

  • DDoS Attacks: PoW makes attacks expensive
  • Resource Exhaustion: Connection timeouts and limits
  • Protocol Attacks: Length-prefixed framing removes ambiguity about message boundaries
  • Replay Attacks: Timestamp validation in challenges

Threats NOT Addressed (by design):

  • Authentication: Public service, no user accounts
  • Authorization: All valid solutions get quotes
  • Data Confidentiality: Quotes are public information

Trade-offs Made

Simplicity vs Performance

  • Chose: Simple JSON payloads over binary serialization
  • Trade-off: ~30% larger messages for easier debugging and maintenance

Memory vs CPU

  • Chose: Stateless challenges requiring CPU verification
  • Trade-off: More CPU per request for better scalability

Flexibility vs Optimization

  • Chose: Interface-based design with dependency injection
  • Trade-off: Small runtime overhead for much better testability

Features vs Complexity

  • Chose: Essential features only (no rate limiting, user accounts, etc.)
  • Benefit: Clean, focused implementation that does one thing well

Future Architecture Considerations

For production scaling, consider:

  1. Quote Service Enhancement: Caching, fallback quotes, multiple API sources
  2. Load Balancing: Multiple server instances behind load balancer
  3. Rate Limiting: Per-IP request limiting for additional protection
  4. Monitoring: Full observability stack (Prometheus, Grafana, alerting)
  5. Security: TLS encryption for sensitive deployments

The current architecture provides a solid foundation for these enhancements while maintaining simplicity and focus.