Perché Amazon Dynamo continua a modellare il moderno storage distribuito 17 anni dopo

A senior engineer’s perspective on building highly available distributed systems Tavola dei contenuti Introduzione: Perché Dynamo ha cambiato tutto Il teorema del trade-off Core Architecture Components Consistent Hashing for Partitioning Replication Strategy (N, R, W) Vector Clocks for Versioning Sloppy Quorum and Hinted Handoff Risoluzione dei conflitti: il problema del carrello degli acquisti Leggere e scrivere il flusso Gli alberi di Merkle per l'antropia Appartenenza e rilevamento fallimento Caratteristiche: Numeri reali Evoluzione della strategia di divisione Confronto di Dynamo con i sistemi moderni Quello che Dynamo non ti offre Esempio di implementazione pratica Le lezioni chiave per la progettazione di sistemi Quando non utilizzare i sistemi Dynamo-Style Conclusione Appendice: Problemi di progettazione e approcci Questo è un riferimento a forma lunga - ogni sezione è indipendente, quindi non esitate a saltare direttamente a ciò che è più rilevante per voi. Questo è un riferimento a forma lunga - ogni sezione è indipendente, quindi non esitate a saltare direttamente a ciò che è più rilevante per voi. Introduction: Why Dynamo Changed Everything Quando Amazon ha pubblicato il documento Dynamo nel 2007, non è stato solo un altro esercizio accademico.Era una soluzione testata in battaglia per problemi reali su larga scala.Mi ricordo quando ho letto per la prima volta questo articolo: ha cambiato fondamentalmente il mio modo di pensare ai sistemi distribuiti. È stato progettato per supportare i servizi ad alto traffico di Amazon come il carrello degli acquisti e i sistemi di gestione delle sessioni. Non ci sono indici secondari, nessun joint, nessuna semantica relazionale - solo chiavi e valori, con estremo focus sulla disponibilità e la scalabilità. Non fornisce linearizzazione o garanzie di ordine globale, anche nelle impostazioni di quorum più elevate. Dynamo is a distributed key-value storage system. Il problema principale che ha affrontato Amazon è stato semplice da dichiarare ma brutale da risolvere: Quando qualcuno tenta di aggiungere un elemento al carrello durante una partizione di rete o un guasto del server, rifiutare che scrivere non è accettabile. How do you build a storage system that never says “no” to customers? The CAP Theorem Trade-off: perché Dynamo sceglie la disponibilità Prima di immergersi nel modo in cui funziona Dynamo, è necessario capire il vincolo fondamentale intorno al quale è progettato. Che cos’è il CAP Theorem? Il teorema CAP descrive un compromesso fondamentale nei sistemi distribuiti: quando si verifica una partizione di rete, è necessario scegliere tra coerenza e disponibilità. Consistenza (C): tutti i nodi vedono gli stessi dati allo stesso tempo Disponibilità (A): ogni richiesta riceve una risposta (successo o fallimento) : System continues working despite network failures Partition Tolerance (P) Un’abbreviazione comune è “Pick 2 of 3”, ma questa è un’eccessiva semplificazione.In pratica, le partizioni di rete sono inevitabili su scala, quindi la vera decisione è: Questa è la vera scelta del design. when partitions occur (and they will), do you sacrifice consistency or availability? Le partizioni di rete si verificano. I cavi vengono tagliati, gli interruttori falliscono, i data center perdono la connettività. Non puoi evitarli, quindi devi scegliere: coerenza o disponibilità? The harsh reality I database tradizionali scelgono la coerenza : Traditional approach Database: "I can't guarantee all replicas are consistent, so I'll reject your write to be safe." Result: Customer sees error, cart is empty Impact: Lost revenue, poor experience Dynamo seleziona la disponibilità : Dynamo’s approach Dynamo: "I'll accept your write with the replicas I can reach. The unreachable replica will catch up later." Result: Customer sees success, item in cart Impact: Sale continues, happy customer Il trade-off visualizzato When a partition occurs: Traditional Database: Choose C over A → Sacrifice Availability - ✓ All replicas always have same data - ✓ No conflicts to resolve - ❌ Rejects writes during failures - ❌ Poor customer experience - ❌ Lost revenue Dynamo: Choose A over C → Sacrifice Strong Consistency - ✓ Accepts writes even during failures - ✓ Excellent customer experience - ✓ No lost revenue - ❌ Replicas might temporarily disagree - ❌ Application must handle conflicts Esempio Amazon reale: Black Friday Shopping Cart Immaginate che sia il Black Friday. Milioni di clienti stanno facendo acquisti. Un cavo di rete viene tagliato tra i data center. : With traditional database Time: 10:00 AM - Network partition occurs Result: - All shopping cart writes fail - "Service Unavailable" errors - Customers can't checkout - Twitter explodes with complaints - Estimated lost revenue: $100,000+ per minute : With Dynamo Time: 10:00 AM - Network partition occurs Result: - Shopping cart writes continue - Customers see success - Some carts might have conflicts (rare) - Application merges conflicting versions - Estimated lost revenue: $0 - A few edge cases need conflict resolution (acceptable) Perché questa scelta ha senso per l'e-commerce Amazon ha fatto la matematica: Costo del rifiuto di una scrittura: vendita immediata persa ($ 50-200) Costo di accettare una scrittura conflittuale: Occasionalmente bisogno di unire carrelli da acquisto (raramente accade, facilmente riparabile) Decisione aziendale: accettare scrittura, affrontare conflitti rari : Types of data where Availability > Consistency Carrozze d'acquisto (fusione di aggiunte conflittuali) Dati di sessione (last-write-wins è fine) Preferenze dell'utente (eventuale coerenza accettabile) Best seller lists (approximate is fine) : Types of data where Consistency > Availability Saldi del conto bancario (non possono avere saldi conflittuali) Conteggio di inventario (non può essere superato) Log di transazione (deve essere ordinato) Questo è il motivo per cui Dynamo non è per tutto, ma per i casi di utilizzo del commercio elettronico di Amazon, scegliere la disponibilità rispetto alla forte coerenza è stato il giusto compromesso. Nuanza importante: Mentre Dynamo è spesso descritto come un sistema AP, è più accurato chiamarlo un sistema di coerenza regolabile. a seconda della configurazione del quorum R e W, può comportarsi più vicino a CP. L'etichetta AP si applica alla sua configurazione predefinita / raccomandata ottimizzata per i carichi di lavoro del commercio elettronico. Mentre Dynamo è spesso descritto come un sistema AP, è più accurato chiamarlo un A seconda della configurazione del quorum R e W, può comportarsi più vicino a CP. L'etichetta AP si applica alla sua configurazione predefinita/recomandata ottimizzata per i carichi di lavoro del commercio elettronico. Important nuance tunable consistency system Componenti architettonici di base Hashing coerente per le partizioni Lasciatemi spiegare questo con un esempio concreto, perché il hashing coerente è uno di quei concetti che sembra magico fino a quando non lo vedi in azione. Il problema: lo sharding tradizionale basato su hash Immaginate di avere 3 server e di voler distribuire i dati su di essi. # Traditional approach - DON'T DO THIS def get_server(key, num_servers): hash_value = hash(key) return hash_value % num_servers # Modulo operation # With 3 servers: get_server("user_123", 3) # Returns server 0 get_server("user_456", 3) # Returns server 1 get_server("user_789", 3) # Returns server 2 Questo funziona... finché non aggiungi o rimuovi un server. Vediamo cosa succede quando passiamo da 3 a 4 server: # Before (3 servers): "user_123" → hash % 3 = 0 → Server 0 "user_456" → hash % 3 = 1 → Server 1 "user_789" → hash % 3 = 2 → Server 2 # After (4 servers): "user_123" → hash % 4 = 0 → Server 0 ✓ (stayed) "user_456" → hash % 4 = 1 → Server 1 ✓ (stayed) "user_789" → hash % 4 = 2 → Server 2 ✓ (stayed) # But wait - this is lucky! In reality, most keys MOVE: "product_ABC" → hash % 3 = 2 → Server 2 "product_ABC" → hash % 4 = 3 → Server 3 ✗ (MOVED!) Quando si cambia il numero di server, quasi tutti i dati devono essere ridistribuiti. Immaginate di spostare terabyte di dati solo per aggiungere un server! The disaster La soluzione: hashing coerente Il hashing coerente risolve questo problema trattando lo spazio hash come un cerchio (0 a 2^32 – 1, avvolto intorno). Step 1: Place servers on the ring A ciascun server viene assegnata una posizione casuale sull’anello (chiamata “token”). Step 2: Place data on the ring Quando si desidera archiviare i dati, si: Hash la chiave per ottenere una posizione sul ring Camminare in orologio da questa posizione Salva i dati sul primo server che incontri Esempio visivo: Anello completo Ecco l'anello disposto in ordine. Le chiavi vanno in orologio al server successivo: Una chiave funziona in direzione dell'orologio fino a quando non colpisce un server. Simple rule : Examples user_123 a 30° → cammina a 45° → Server A lo possiede user_456 a 150° → cammina a 200° → Server C lo possiede cart_789 a 250° → cammina a 280° → Server D lo possiede product_ABC a 300° → passa oltre 360°, avvolge a 0°, continua a 45° → Server A lo possiede Who owns what range? Server A (45°): possiede tutto da 281° a 45° (involge intorno) Server B (120°): possiede tutto da 46° a 120° Server C (200°): possiede tutto da 121° a 200° Server D (280°): possiede tutto da 201° a 280° La magia: aggiungere un server Ora vediamo perché questo è brillante. aggiungiamo Server E alla posizione 160°: BEFORE: Server A (45°) → owns 281°-45° Server B (120°) → owns 46°-120° Server C (200°) → owns 121°-200° ← THIS RANGE WILL SPLIT Server D (280°) → owns 201°-280° AFTER: Server A (45°) → owns 281°-45° ← NO CHANGE Server B (120°) → owns 46°-120° ← NO CHANGE Server E (160°) → owns 121°-160° ← NEW! Takes part of C's range Server C (200°) → owns 161°-200° ← SMALLER range Server D (280°) → owns 201°-280° ← NO CHANGE : Solo le chiavi nell'intervallo di 121°-160° devono essere spostate (da C a E). i server A, B e D sono completamente non interessati! Result Ottimizzazione dei nodi virtuali C'è un problema critico con l'approccio di hashing coerente di base: . random distribution can be extremely uneven The Problem in Detail: Quando assegni casualmente una posizione per server, stai essenzialmente lanciando darts su una tavola circolare. a volte le darts si agglomerano insieme, a volte si diffondono. Vi mostro un esempio concreto: Scenario: 4 servers with single random tokens Server A: 10° } Server B: 25° } ← Only 75° apart! Tiny ranges Server C: 100° } Server D: 280° ← 180° away from C! Huge range Range sizes: - Server A owns: 281° to 10° = 89° (25% of ring) - Server B owns: 11° to 25° = 14° (4% of ring) ← Underutilized! - Server C owns: 26° to 100° = 74° (21% of ring) - Server D owns: 101° to 280° = 179° (50% of ring) ← Overloaded! Real-world consequences: Carica diseguale: Server D gestisce il 50% di tutti i dati, mentre Server B gestisce solo il 4%. Server D’s CPU, disk, and network are maxed out Server B è per lo più idilliaco (capacità sprecata) Your 99.9th percentile latency is dominated by Server D being overloaded Hotspot Cascading: Quando Server D diventa lento o fallisce: Tutti i suoi 50% di carico si spostano a Server A (il prossimo orologio) Il server A sta diventando sovraccaricato Le prestazioni del sistema si deteriorano catastroficamente Scalazione inefficiente: l'aggiunta di server non aiuta uniformemente perché i nuovi server potrebbero atterrare in aree già piccole Visualizing the problem: Ogni server fisico riceve più posizioni virtuali (token). Dynamo’s solution Invece di un lancio di dart per server, lancia molti dart. Più lanci, più diventa la distribuzione (legge dei grandi numeri). How Virtual Nodes Fix the Problem: Let’s take the same 4 servers, but now each server gets 3 virtual nodes (tokens) instead of 1: Physical Server A gets 3 tokens: 10°, 95°, 270° Physical Server B gets 3 tokens: 25°, 180°, 310° Physical Server C gets 3 tokens: 55°, 150°, 320° Physical Server D gets 3 tokens: 75°, 200°, 340° Now the ring looks like: 10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B, 200° D, 270° A, 310° B, 320° C, 340° D Range sizes (approximately): - Server A total: 15° + 55° + 40° = 110° (31% of ring) - Server B total: 30° + 20° + 30° = 80° (22% of ring) - Server C total: 20° + 30° + 20° = 70° (19% of ring) - Server D total: 20° + 70° + 20° = 110° (31% of ring) Il carico varia dal 19% al 31% invece del 4% al 50%. Much better! Why this works: : With more samples (tokens), the random distribution averages out. This is the law of large numbers in action. Statistics : When a server fails, its load is distributed across many servers, not just one neighbor: Granular load distribution Server A fails: - Its token at 10° → load shifts to Server B's token at 25° - Its token at 95° → load shifts to Server C's token at 150° - Its token at 270° → load shifts to Server B's token at 310° Result: The load is spread across multiple servers! : When adding a new server with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server. Smooth scaling Real Dynamo configurations: Il documento menziona diverse strategie che si sono evolute nel tempo. Versioni precedenti: 100-200 nodi virtuali per server fisico Più tardi ottimizzato a: token Q/S per nodo (dove Q = partizioni totali, S = numero di server) Configurazione tipica: ogni server fisico potrebbe avere 128-256 nodi virtuali The Trade-off: Balance vs Overhead Più nodi virtuali significa una migliore distribuzione del carico, ma c'è un costo. What you gain with more virtual nodes: With 1 token per server (4 servers): Load variance: 4% to 50% (±46% difference) ❌ With 3 tokens per server (12 virtual nodes): Load variance: 19% to 31% (±12% difference) ✓ With 128 tokens per server (512 virtual nodes): Load variance: 24% to 26% (±2% difference) ✓✓ What it costs: Dimensione dei metadati: ogni nodo mantiene le informazioni di routing 1 token per server: traccia 4 voci 128 token per server: traccia 512 voci : Nodes exchange membership info periodically Gossip overhead Più token = più dati per la sincronizzazione tra i nodi Every second, nodes gossip their view of the ring Complexità di ri-equilibrio: quando i nodi si uniscono / lasciano Più nodi virtuali = più trasferimenti di partizioni per coordinare But each transfer is smaller (which is actually good for bootstrapping) Dynamo’s evolution: The paper describes how Amazon optimized this over time: Strategy 1 (Initial): - 100-200 random tokens per server - Problem: Huge metadata (multiple MB per node) - Problem: Slow bootstrapping (had to scan for specific key ranges) Strategy 3 (Current): - Q/S tokens per server (Q=total partitions, S=number of servers) - Equal-sized partitions - Example: 1024 partitions / 8 servers = 128 tokens per server - Benefit: Metadata reduced to KB - Benefit: Fast bootstrapping (transfer whole partition files) Real production sweet spot: Most Dynamo deployments use 128-256 virtual nodes per physical server. This achieves: Load distribution within 10-15% variance (good enough) Metadati superiori a 100KB per nodo (negligibile) Fast failure recovery (load spreads across many nodes) Il passaggio da 128 a 512 token migliora solo il saldo di carico del 2-3%, ma raddoppia le dimensioni dei metadati e il traffico di chat. Why not more? : Physical servers (top) map to multiple virtual positions (bottom) on the ring. This distributes each server’s load across different parts of the hash space. Key concept : Benefits More even load distribution Quando un server fallisce, il suo carico è distribuito su molti server (non solo un vicino) Quando un server si unisce, ruba una piccola quantità da molti server Real-World Impact Comparison Vediamo la differenza con i numeri reali: Traditional Hashing (3 servers → 4 servers): - Keys that need to move: ~75% (3 out of 4) - Example: 1 million keys → 750,000 keys must migrate Consistent Hashing (3 servers → 4 servers): - Keys that need to move: ~25% (1 out of 4) - Example: 1 million keys → 250,000 keys must migrate With Virtual Nodes (150 vnodes total → 200 vnodes): - Keys that need to move: ~12.5% (spread evenly) - Example: 1 million keys → 125,000 keys must migrate - Load is balanced across all servers Il momento “Aha!” La consapevolezza chiave è questa: Consistent hashing decouples the hash space from the number of servers. Tradizionale: server = hash(key) % num_servers ← num_servers è nella formula! Consistent: ← num_servers isn’t in the formula! server = ring.findNextClockwise(hash(key)) This is why adding/removing servers only affects a small portion of the data. The hash values don’t change—only which server “owns” which range changes, and only locally. Think of it like a circular running track with water stations (servers). If you add a new water station, runners only change stations if they’re between the old nearest station and the new one. Everyone else keeps going to their same station. 2. Replication Strategy (N, R, W) The Problem: Availability vs Consistency Trade-off Immaginate di costruire il carrello degli acquisti di Amazon. Un cliente aggiunge un articolo al loro carrello, ma in quel momento esatto: Un server viene riavviato per la manutenzione Un altro server ha un hiccup di rete Un terzo server è perfetto (con una forte coerenza) Traditional database approach Client: "Add this item to my cart" Database: "I need ALL replicas to confirm before I say yes" Server 1: ✗ (rebooting) Server 2: ✗ (network issue) Server 3: ✓ (healthy) Result: "Sorry, service unavailable. Try again later." : 😡 “I can’t add items to my cart during Black Friday?!” Customer experience This is unacceptable for e-commerce. Every rejected write is lost revenue. Dynamo’s Solution: Tunable Quorums Dynamo gives you three knobs to tune the exact trade-off you want: : Number of replicas (how many copies of the data) N R: Quorum di lettura (quante replicas devono rispondere per una lettura di successo) : Write quorum (how many replicas must acknowledge for a successful write) W • Quando , you guarantee quorum overlap—meaning at least one node that received the write will be queried during any read. This overlap enables detection of the latest version, provided the reconciliation logic correctly identifies the highest vector clock. It does not automatically guarantee read-your-writes unless the coordinator properly resolves versions. The magic formula R + W > N Lasciate che vi mostriamo perché questo conta con scenari reali: Scenario 1: Shopping Cart (Prioritize Availability) N = 3 # Three replicas for durability R = 1 # Read from any single healthy node W = 1 # Write to any single healthy node # Trade-off analysis: # ✓ Writes succeed even if 2 out of 3 nodes are down # ✓ Reads succeed even if 2 out of 3 nodes are down # ✓ Maximum availability - never reject customer actions # ✗ Might read stale data # ✗ Higher chance of conflicts (but we can merge shopping carts) What happens during failure: Client: "Add item to cart" Coordinator tries N=3 nodes: - Node 1: ✗ Down - Node 2: ✓ ACK (W=1 satisfied!) - Node 3: Still waiting... Result: SUCCESS returned to client immediately Node 3 eventually gets the update (eventual consistency) Scenario 2: Session State (Balanced Approach) N = 3 R = 2 # Must read from 2 nodes W = 2 # Must write to 2 nodes # Trade-off analysis: # ✓ R + W = 4 > N = 3 → Read-your-writes guaranteed # ✓ Tolerates 1 node failure # ✓ Good balance of consistency and availability # ✗ Write fails if 2 nodes are down # ✗ Read fails if 2 nodes are down Why R + W > N enables read-your-writes: Write to W=2 nodes: [A, B] Later, read from R=2 nodes: [B, C] Because W + R = 4 > N = 3, there's guaranteed overlap! At least one node (B in this case) will have the latest data. The coordinator detects the newest version by comparing vector clocks. This guarantees seeing the latest write as long as reconciliation picks the causally most-recent version correctly. Scenario 3: Financial Data (Prioritize Consistency) N = 3 R = 3 # Must read from ALL nodes W = 3 # Must write to ALL nodes # Trade-off analysis: # ✓ Full replica quorum — reduces likelihood of divergent versions # ✓ Any read will overlap every write quorum # ✗ Write fails if ANY node is down # ✗ Read fails if ANY node is down # ✗ Poor availability during failures Systems requiring strict transactional guarantees typically choose CP systems instead. This configuration is technically supported by Dynamo but sacrifices the availability properties that motivate using it in the first place. Configuration Comparison Table Config N R W Availability Consistency Use Case High Availability 3 1 1 ⭐⭐⭐⭐⭐ ⭐⭐ Shopping cart, wish list Balanced 3 2 2 ⭐⭐⭐⭐ ⭐⭐⭐⭐ Session state, user preferences Full Quorum 3 3 3 ⭐⭐ ⭐⭐⭐⭐⭐ High-stakes reads (not linearizable) Read-Heavy 3 1 3 ⭐⭐⭐ (reads) ⭐⭐⭐⭐ Product catalog, CDN metadata Write-Heavy 3 3 1 ⭐⭐⭐ (writes) ⭐⭐⭐ Click tracking, metrics High Availability 3 1 1 ⭐⭐⭐⭐⭐ ⭐⭐ Shopping cart, wish list Balanced 3 2 2 ⭐⭐⭐⭐ ⭐⭐⭐⭐ Session state, user preferences Full Quorum 3 3 3 ⭐⭐ ⭐⭐⭐⭐⭐ Lettura delle scommesse (non linearizzabile) Read-Heavy 3 1 3 ⭐⭐⭐ (reads) ⭐⭐⭐⭐ Product catalog, CDN metadata Write-Heavy 3 3 1 ⭐⭐⭐ (writes) ⭐⭐⭐ Click tracking, metrics : Systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn’t use Dynamo. That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business logic layer. Note on financial systems : Systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn’t use Dynamo. That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business logic layer. Note on financial systems The Key Insight La maggior parte dei sistemi utilizzati because: N=3, R=2, W=2 Durabilità: può tollerare fino a 2 fallimenti di replica prima della perdita permanente dei dati (assumendo fallimenti indipendenti e nessuna interruzione correlata). : Tolerates 1 node failure for both reads and writes Availability : R + W > N guarantees that read and write quorums overlap, enabling read-your-writes behavior in the absence of concurrent writes. Consistency : Don’t wait for the slowest node (only need 2 out of 3) Performance Real production numbers from the paper: Amazon’s shopping cart service during peak (holiday season): Configurazione: N = 3, R = 2, W = 2 Decine di milioni di richieste Più di 3 milioni di check-out in un solo giorno No downtime, even with server failures This tunable approach is what made Dynamo revolutionary. You’re not stuck with one-size-fits-all—you tune it based on your actual business requirements. 3. Vector Clocks for Versioning Il problema: rilevare la causalità nei sistemi distribuiti When multiple nodes can accept writes independently, you need to answer a critical question: Are these two versions of the same data related, or were they created concurrently? Why timestamps don’t work: Scenario: Two users edit the same shopping cart simultaneously User 1 at 10:00:01.500 AM: Adds item A → Writes to Node X User 2 at 10:00:01.501 AM: Adds item B → Writes to Node Y Physical timestamp says: User 2's version is "newer" Reality: These are concurrent! Both should be kept! Problem: - Clocks on different servers are NEVER perfectly synchronized - Clock skew can be seconds or even minutes - Network delays are unpredictable - Physical time doesn't capture causality What we really need to know: Version A happened before Version B? → B can overwrite A Version A and B are concurrent? → Keep both, merge later Version A came from reading Version B? → We can track this! The Solution: Vector Clocks A vector clock is a simple data structure: a list of pairs that tracks which nodes have seen which versions. (node_id, counter) The rules: When a node writes data, it increments its own counter When a node reads data, it gets the vector clock When comparing two vector clocks: If all counters in A ≤ counters in B → A is an ancestor of B (B is newer) If some counters in A > B and some B > A → A and B are concurrent (conflict!) Step-by-Step Example Diamo traccia a un carrello d'acquisto attraverso vari aggiornamenti: Breaking down the conflict: D3: [Sx:2, Sy:1] vs D4: [Sx:2, Sz:1] Comparing: - Sx: 2 == 2 ✓ (equal) - Sy: 1 vs missing in D4 → D3 has something D4 doesn't - Sz: missing in D3 vs 1 → D4 has something D3 doesn't Conclusion: CONCURRENT! Neither is an ancestor of the other. Both versions must be kept and merged. Caratteristiche del mondo reale The Dynamo paper reports the following conflict distribution measured over 24 hours of Amazon’s production shopping cart traffic. These numbers reflect Amazon’s specific workload — high read/write ratio, mostly single-user sessions — and should not be assumed to generalize to all Dynamo deployments: 99.94% - Single version (no conflict) 0.00057% - 2 versions 0.00047% - 3 versions 0.00009% - 4 versions : Conflicts are RARE in practice! Key insight Why conflicts happen: Non di solito a causa di fallimenti di rete Mostly from concurrent writers (often automated processes/bots) Human users rarely create conflicts because they’re slow compared to network speed The Size Problem Vector clocks can grow unbounded if many nodes coordinate writes. Dynamo’s solution: quando l'orologio supera una soglia di dimensioni. truncate the oldest entries // When vector clock exceeds threshold (e.g., 10 entries) // Remove the oldest entry based on wall-clock timestamp vectorClock = { 'Sx': {counter: 5, timestamp: 1609459200}, 'Sy': {counter: 3, timestamp: 1609459800}, 'Sz': {counter: 2, timestamp: 1609460400}, // ... 10 more entries } // If size > 10, remove entry with oldest timestamp // ⚠ Risk: Dropping an entry collapses causality information. // Two versions that were causally related may now appear // concurrent, forcing the application to resolve a conflict // that didn't actually exist. In practice, Amazon reports // this has not been a significant problem — but it is a // real theoretical risk in high-churn write environments // with many distinct coordinators. 4. Sloppy Quorum and Hinted Handoff Il problema: i quorum rigorosi uccidono la disponibilità Traditional quorum systems are rigid and unforgiving. Traditional strict quorum: Your data is stored on nodes: A, B, C (preference list) Write requirement: W = 2 Scenario: Node B is down for maintenance Coordinator: "I need to write to 2 nodes from {A, B, C}" Tries: A ✓, B ✗ (down), C ✓ Result: SUCCESS (got 2 out of 3) Scenario: Nodes B AND C are down Coordinator: "I need to write to 2 nodes from {A, B, C}" Tries: A ✓, B ✗ (down), C ✗ (down) Result: FAILURE (only got 1 out of 3) Customer: "Why can't I add items to my cart?!" 😡 Il problema : . If those specific nodes are down, the system becomes unavailable. Strict quorums require specific nodes Real scenario at Amazon: Black Friday, 2:00 PM - Datacenter 1: 20% of nodes being rebooted (rolling deployment) - Datacenter 2: Network hiccup (1-2% packet loss) - Traffic: 10x normal load With strict quorum: - 15% of write requests fail - Customer support phones explode - Revenue impact: Millions per hour The Solution: Sloppy Quorum Dynamo relaxes the quorum requirement: “Write to the first N healthy nodes in the preference list, walking further down the ring if needed.” Preference list for key K: A, B, C But B is down... Sloppy Quorum says: "Don't give up! Walk further down the ring: A, B, C, D, E, F, ..." Coordinator walks until N=3 healthy nodes are found: A, C, D (D is a temporary substitute for B) How Hinted Handoff Works Quando un nodo sostituisce temporaneamente un nodo fallito, memorizza un "indizio" con i dati. Processo Handoff dettagliato Step 1: Detect failure and substitute def write_with_hinted_handoff(key, value, N, W): preference_list = get_preference_list(key) # [A, B, C] healthy_nodes = [] for node in preference_list: if is_healthy(node): healthy_nodes.append((node, is_hint=False)) # If we don't have N healthy nodes, expand the list if len(healthy_nodes) = N: break # Write to first N healthy nodes acks = 0 for node, is_hint in healthy_nodes[:N]: if is_hint: # Store with hint metadata intended_node = find_intended_node(preference_list, node) success = node.write_hinted(key, value, hint=intended_node) else: success = node.write(key, value) if success: acks += 1 if acks >= W: return SUCCESS return FAILURE Step 2: Background hint transfer # Runs periodically on each node (e.g., every 10 seconds) def transfer_hints(): hints_db = get_hinted_replicas() for hint in hints_db: intended_node = hint.intended_for if is_healthy(intended_node): try: intended_node.write(hint.key, hint.value) hints_db.delete(hint) log(f"Successfully transferred hint to {intended_node}") except: log(f"Will retry later for {intended_node}") Why This Is Brilliant Durability maintained: Even though B is down: - We still have N=3 copies: A, C, D - Data won't be lost even if another node fails - System maintains durability guarantee Availability maximized: Client perspective: - Write succeeds immediately - No error message - No retry needed - Customer happy Traditional quorum would have failed: - Only 2 nodes available (A, C) - Need 3 for N=3 - Write rejected - Customer sees error Eventual consistency: Timeline: T=0: Write succeeds (A, C, D with hint) T=0-5min: B is down, but system works fine T=5min: B recovers T=5min+10sec: D detects B is back, transfers hint T=5min+11sec: B has the data, D deletes hint Result: Eventually, all correct replicas have the data Configuration Example // High availability configuration const config = { N: 3, // Want 3 replicas W: 2, // Only need 2 ACKs to succeed R: 2, // Read from 2 nodes // Sloppy quorum allows expanding preference list sloppy_quorum: true, // How far to expand when looking for healthy nodes max_extended_preference_list: 10, // How often to check for hint transfers hint_transfer_interval: 10_seconds, // How long to keep trying to transfer hints hint_retention: 7_days }; Real-World Impact From Amazon’s production experience: During normal operation: Hinted handoff rarely triggered Most writes go to preferred nodes Hints database is mostly empty During failures: Scenario: 5% of nodes failing at any time (normal at Amazon's scale) Without hinted handoff: - Write success rate: 85% - Customer impact: 15% of cart additions fail With hinted handoff: - Write success rate: 99.9%+ - Customer impact: Nearly zero During datacenter failure: Scenario: Entire datacenter unreachable (33% of nodes) Without hinted handoff: - Many keys would lose entire preference list - Massive write failures - System effectively down With hinted handoff: - Writes redirect to other datacenters - Hints accumulate temporarily - When datacenter recovers, hints transfer - Zero customer-visible failures Il trade-off Benefits: ✓ Maximum write availability ✓ Durability maintained during failures ✓ Automatic recovery when nodes come back Nessun intervento manuale richiesto Costs: ✗ Temporary inconsistency (data not on “correct” nodes) ✗ Extra storage for hints database ✗ Background bandwidth for hint transfers Un codice leggermente più complesso ✗ If a substitute node (like D) fails before it can transfer its hint back to B, the number of true replicas drops below N until the situation resolves. This is an important edge case to understand in failure planning. Hinted handoff provides temporary durability, not permanent replication. The availability benefits far outweigh the costs for e-commerce workloads. Amazon’s verdict: Conflict Resolution: The Shopping Cart Problem Let’s talk about the most famous example from the paper: the shopping cart. This is where rubber meets road. What Is a Conflict (and Why Does It Happen)? A Si verifica quando due scritti accadono alla stessa chiave su nodi diversi, senza scrivere "conoscendo" l'altro. Questo è possibile solo perché Dynamo accetta scritti anche quando i nodi non possono comunicare - che è tutto il punto! conflict Ecco una sequenza concreta di eventi che crea un conflitto: Timeline: T=0: Customer logs in. Cart has {shoes} on all 3 nodes. T=1: Network partition: Node1 can't talk to Node2. T=2: Customer adds {jacket} on their laptop → goes to Node1. Cart on Node1: {shoes, jacket} ← Vector clock: [N1:2] T=3: Customer adds {hat} on their phone → goes to Node2. Cart on Node2: {shoes, hat} ← Vector clock: [N2:2] T=4: Network heals. Node1 and Node2 compare notes. Node1 says: "I have version [N1:2]" Node2 says: "I have version [N2:2]" Neither clock dominates the other → CONFLICT! Neither version is “wrong”—both represent real actions the customer took. Dynamo’s job is to detect this situation (via vector clocks) and surface to the application so the application can decide what to do. both versions What Does the Application Do With a Conflict? Questa è la parte cruciale che il documento vi delega: . Dynamo gives you all the concurrent versions; your code decides how to merge them. the application must resolve conflicts using business logic Per il carrello degli acquisti, Amazon ha scelto un : keep all items from all concurrent versions. The rationale is simple—losing an item from a customer’s cart (missing a sale) is worse than occasionally showing a stale item they already deleted. union merge Conflict versions: Version A (from Node1): {shoes, jacket} Version B (from Node2): {shoes, hat} Merge strategy: union Merged cart: {shoes, jacket, hat} ← All items preserved Here’s the actual reconciliation code: from __future__ import annotations from dataclasses import dataclass, field class VectorClock: def __init__(self, clock: dict[str, int] | None = None): self.clock: dict[str, int] = clock.copy() if clock else {} def merge(self, other: "VectorClock") -> "VectorClock": """Merged clock = max of each node's counter across both versions.""" all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" @dataclass class ShoppingCart: items: list[str] = field(default_factory=list) vector_clock: VectorClock = field(default_factory=VectorClock) @staticmethod def reconcile(carts: list["ShoppingCart"]) -> "ShoppingCart": if len(carts) == 1: return carts[0] # No conflict, nothing to do # Merge strategy: union of all items (never lose additions). # This is Amazon's choice for shopping carts. # A different application might choose last-write-wins or something else. all_items: set[str] = set() merged_clock = VectorClock() for cart in carts: all_items.update(cart.items) # Union: keep everything merged_clock = merged_clock.merge(cart.vector_clock) return ShoppingCart(items=sorted(all_items), vector_clock=merged_clock) # Example conflict scenario cart1 = ShoppingCart(items=["shoes", "jacket"], vector_clock=VectorClock({"N1": 2})) cart2 = ShoppingCart(items=["shoes", "hat"], vector_clock=VectorClock({"N2": 2})) # Dynamo detected a conflict and passes both versions to our reconcile() reconciled = ShoppingCart.reconcile([cart1, cart2]) print(reconciled.items) # ['hat', 'jacket', 'shoes'] — union! Il problema della cancellazione (perché questo diventa difficile) The union strategy has a nasty edge case: . deleted items can come back from the dead T=0: Cart: {shoes, hat} T=1: Customer removes hat → Cart: {shoes} Clock: [N1:3] T=2: Network partition — Node2 still has old state T=3: Some concurrent write to Node2 Clock: [N2:3] T=4: Network heals → conflict detected T=5: Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat} Result: Hat is BACK! Customer removed it, but it reappeared. Amazon explicitly accepts this trade-off. A “ghost” item in a cart is a minor annoyance. Losing a cart addition during a Black Friday sale is lost revenue. Nota di ingegneria: La logica di fusione deve essere specifica per il dominio e accuratamente progettata. L'aggiunta di elementi è commutativa (ordine non conta) e facile da fondere. La rimozione di elementi non è - una cancellazione in un ramo concorrente può essere ignorata silenziosamente durante una fusione basata su un'unione. Questo è un compromesso intenzionale nella progettazione di Dynamo, ma significa che l'applicazione deve ragionare attentamente sulla semantica di aggiungere contro rimuovere. Se i tuoi dati non supportano naturalmente le fusioni dell'unione (ad esempio, un contatore, un indirizzo utente), hai bisogno di una strategia diversa - come CRDT, vincite di ultima scrittura con timestamp, o semplicemente rifiutando le sc : Merge logic must be domain-specific and carefully designed. Adding items is commutative (order doesn’t matter) and easy to merge. Removing items is not—a deletion in one concurrent branch may be silently ignored during a union-based merge. This is an intentional trade-off in Dynamo’s design, but it means the application must reason carefully about add vs. remove semantics. If your data doesn’t naturally support union merges (e.g., a counter, a user’s address), you need a different strategy—such as CRDTs, last-write-wins with timestamps, or simply rejecting concurrent writes for that data type. Engineering depth note Read and Write Flow The diagrams above show the high-level flow, but let’s walk through what actually happens step-by-step during a read and a write. Understanding this concretely will make the earlier concepts click. Write Path Step-by-step narration of a PUT request: to any node (via a load balancer) or directly to the coordinator. Client sends the request Il coordinatore è determinato: questo è il primo nodo nell'elenco delle preferenze per la posizione di hash della chiave nell'anello. — the coordinator increments its own counter in the vector clock, creating a new version. Vector clock is updated , then fans out the write to the other N-1 nodes in the preference list simultaneously. The coordinator writes locally It does NOT wait for all N — just the first W to respond. The remaining nodes that haven’t responded yet will get the write eventually (or via hinted handoff if they’re down). The coordinator waits for W acknowledgments. Una volta arrivati i W ACK, il coordinatore restituisce al cliente 200 OK. Dal punto di vista del cliente, la scrittura è fatta. : The client gets a success response as soon as W nodes confirm. The other (N – W) nodes will receive the write asynchronously. This is why the system is “eventually consistent”—all nodes have the data, just not necessarily at the same moment. Key insight about the write path will Read Path Step-by-step narration of a GET request: to the coordinator for that key. Client sends the request Il coordinatore invia richieste di lettura a tutti i nodi N nell'elenco delle preferenze contemporaneamente (non solo R). Il coordinatore ritorna non appena i nodi R hanno risposto, senza attendere quelli più lenti. The coordinator checks all the vector clocks: Compare the versions returned. If all versions are identical → return the single version immediately. If one version’s clock dominates the others (it’s causally “newer”) → return that version. If versions are concurrent (neither clock dominates) → return to the client, which must merge them. all versions happens in the background: if the coordinator noticed any node returned a stale version, it sends the latest version to that node to bring it up to date. Read repair Perché Dynamo è un motore di archiviazione a scopo generale. non sa se stai memorizzando un carrello, un profilo utente o un token di sessione. sa come fondere due versioni conflittuali in modo che abbia senso commerciale.Il coordinatore ti consegna le versioni concorrenti crude insieme al contesto dell'orologio vettore, e fai la cosa giusta per il tuo caso di utilizzo. Why does the client receive the conflict instead of the coordinator resolving it? your application : when the client writes the merged version back, it must include the context (the merged vector clock). This tells Dynamo that the new write has “seen” all the concurrent versions, so the conflict is resolved. Without this context, Dynamo might think it’s Il concorrente scrive sopra il conflitto ancora non risolto. The vector clock context is the key to closing the loop another Gli alberi di Merkle per l'antropia The Problem: How Do You Know When Replicas Are Out of Sync? After a node recovers from a failure, it may have missed some writes. After a network partition heals, two replicas might diverge. How does Dynamo detect and fix these differences? L'approccio di forza bruta sarebbe: "Ogni ora, confrontare ogni chiave su Node A contro Node B, e sincronizzare tutto ciò che è diverso."Ma su scala Amazon, un singolo nodo potrebbe memorizzare centinaia di milioni di chiavi. L’idea principale: invece di confrontare le chiavi individuali, confrontare . If the hash matches, that whole group is identical—skip it. Only drill down into groups where hashes differ. Dynamo uses Merkle trees to solve this efficiently. hashes of groups of keys : Merkle tree sync is a mechanism. It’s not on the hot read/write path. Normal reads and writes use vector clocks and quorums for versioning. Merkle trees are for the repair process that runs periodically in the background to catch any inconsistencies that slipped through. Important background anti-entropy : Merkle tree sync is a meccanismo. Non è sul percorso di lettura / scrittura calda. Le letture e le scritture normali usano orologi vettoriali e quorum per la versione. Gli alberi di Merkle sono per il processo di riparazione che corre periodicamente in background per catturare eventuali inconsistenze che sono passate. Important background anti-entropy Come si costruisce un albero di Merkle Ogni nodo costruisce un albero di Merkle sui suoi dati, organizzati per i ranghi chiave: I nodi di foglio contengono il hash di una piccola gamma di chiavi dati reali (ad esempio, il hash di tutti i valori per le chiavi k1, k2, k3). I nodi interni contengono il hash dei hash dei loro figli. is a single hash representing the data on the node. The root all Come due nodi si sincronizzano utilizzando gli alberi di Merkle Quando Node A e Node B vogliono verificare se sono sincronizzati: : Confronta le hash root. Se sono le stesse, tutto è identico. Fatto! (Nessun traffico di rete per i dati stessi.) Step 1 : Se le radici differiscono, confrontare i loro figli a sinistra. lo stesso? Step 2 Continua a scendere solo in sottoalbero dove le hash differiscono, finché non raggiungi i nodi delle foglie. Step 3 : Sync only the specific keys in the differing leaf nodes. Step 4 Example: Comparing two nodes Node A root: abc789 ← differs from Node B! Node B root: abc788 Compare left subtrees: Node A left: xyz123 Node B left: xyz123 ← same! Skip entire left half. Compare right subtrees: Node A right: def456 Node B right: def457 ← differs! Go deeper. Compare right-left subtree: Node A right-left: ghi111 Node B right-left: ghi111 ← same! Skip. Compare right-right subtree: Node A right-right: jkl222 Node B right-right: jkl333 ← differs! These are leaves. → Sync only the keys in the right-right leaf range (e.g., k10, k11, k12) Instead of comparing all 1 million keys, we compared 6 hashes and synced only 3 keys! : Synchronization process in code def sync_replicas(node_a, node_b, key_range): """ Efficiently sync two nodes using Merkle trees. Instead of comparing all keys, we compare hashes top-down. Only the ranges where hashes differ need actual key-level sync. """ tree_a = node_a.get_merkle_tree(key_range) tree_b = node_b.get_merkle_tree(key_range) # Step 1: Compare root hashes first. # If they match, every key in this range is identical — nothing to do! if tree_a.root_hash == tree_b.root_hash: return # Zero data transferred — full match! # Step 2: Recursively find differences by traversing top-down. # Only descend into subtrees where hashes differ. differences = [] stack = [(tree_a.root, tree_b.root)] while stack: node_a_subtree, node_b_subtree = stack.pop() if node_a_subtree.hash == node_b_subtree.hash: continue # This whole subtree matches — skip it! if node_a_subtree.is_leaf: # Found a differing leaf — these keys need syncing differences.extend(node_a_subtree.keys) else: # Not a leaf yet — recurse into children for child_a, child_b in zip(node_a_subtree.children, node_b_subtree.children): stack.append((child_a, child_b)) # Step 3: Sync only the specific keys that differed at leaf level. # This might be a handful of keys, not millions. for key in differences: sync_key(node_a, node_b, key) Perché è efficace Il potere degli alberi di Merkle è che il numero di confronti di hash di cui hai bisogno scala con (logaritmica nel numero di chiavi), non il numero di chiavi stesse. La profondità dell’albero Node with 1,000,000 keys: Naive approach: Compare 1,000,000 keys individually Cost: 1,000,000 comparisons Merkle tree: Compare O(log N) hashes top-down Tree depth ≈ 20 levels Cost: 20 comparisons to find differences Then sync only the differing leaves (~few keys) Speedup: ~50,000x fewer comparisons! E criticamente, se due nodi sono (che è quasi sempre vero in un cluster sano), le hash di radice spesso corrispondono interamente e i dati zero devono essere trasferiti. mostly in sync Appartenenza e rilevamento fallimento Dynamo uses a gossip protocol for membership management. Each node periodically exchanges membership information with random peers. There is no master node—all coordination is fully decentralized. L'appartenenza basata su gossip Punti chiave di design Ogni nodo mantiene la propria visualizzazione dell'appartenenza al cluster. Non c'è un registro centrale, quindi non c'è un unico punto di guasto per i dati di appartenenza. No single coordinator : Dynamo utilizza un rilevatore di guasti basato sull'accumulazione (simile a Phi Accrual). Piuttosto che un giudizio binario "vivo / morto", i nodi mantengono un che aumenta il più a lungo un peer non risponde. Questo evita i falsi positivi da hiccups di rete transitorie. Failure suspicion vs. detection Livello di sospetto Node A's view of Node B: - Last heartbeat: 3 seconds ago → Suspicion low → Healthy - Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded - Last heartbeat: 60 seconds ago → Suspicion high → Treat as failed I nuovi nodi contattano un nodo di sementi per unirsi, quindi il rumor diffonde la loro presenza al resto del cluster. L'appartenenza all'anello è alla fine coerente - i diversi nodi possono avere visioni leggermente diverse dell'anello momentaneamente, il che è accettabile. Decentralized bootstrapping Caratteristiche: Numeri reali Il documento fornisce dati di prestazioni affascinanti. Distribuzione della latenza Metric | Average | 99.9th Percentile --------------------|---------|------------------ Read latency | ~10ms | ~200ms Write latency | ~15ms | ~200ms Key insight: 99.9th percentile is ~20x the average! Il 99.9° percentile è influenzato da: Why the huge gap? Collezione spazzatura pausa Variazioni di disco I/O Network jitter Load imbalance Questo è il motivo per cui gli Amazon SLA sono specificati al percentile 99.9, non alla media. Versioni di conflitto Da 24 ore di traffico del carrello di acquisto di produzione di Amazon (per carta Dynamo). Nota questi riflettono le caratteristiche specifiche del carico di lavoro di Amazon, non una linea di base universale: 99.94% - Saw exactly one version (no conflict) 0.00057% - Saw 2 versions 0.00047% - Saw 3 versions 0.00009% - Saw 4 versions I conflitti sono rari nella pratica! Più spesso causati da scrittori concorrenti (robot), non fallimenti. Takeaway Evoluzione della strategia di divisione Dynamo si è evoluta attraverso tre strategie di divisione.Questa evoluzione ci insegna importanti lezioni: Strategia 1: Random Tokens (Iniziale) Problem: Random token assignment → uneven load Problem: Adding nodes → expensive data scans Problem: Can't easily snapshot the system L'assegnazione di token casuale suona elegante, ma è un incubo nella pratica.Ogni nodo ottiene una posizione casuale sull'anello, il che significa ranghi di proprietà dei dati molto diversi e distribuzione di carico ineguale. Operational lesson Strategia 2: partizioni uguali + token casuali Improvement: Decouples partitioning from placement Problem: Still has load balancing issues Strategia 3: Q/S token per nodo – partizioni di dimensioni uguali + posizionamento deterministico (attuale) What Q and S mean: Q = il numero totale di partizioni fisse in cui l'anello è diviso (ad esempio 1024). Pensa a queste come a frammenti uguali, pre-cortati dello spazio hash che non cambiano mai la forma. S = il numero di server fisici attualmente nel cluster (es. 8). Q/S = quante di quelle partizioni fisse ciascun server è responsabile (ad esempio 1024 / 8 = 128 partizioni per server). Il cambiamento chiave dalle strategie precedenti: l'anello è ora diviso in partizioni fisse di dimensioni uguali Q I server non ottengono più posizioni casuali – ognuno di essi possiede esattamente le partizioni Q/S, distribuite uniformemente intorno all’anello. prima Example: Q=12 partitions, S=3 servers Ring divided into 12 equal slices (each covers 30° of the 360° ring): Partition 1: 0°– 30° → Server A Partition 2: 30°– 60° → Server B Partition 3: 60°– 90° → Server C Partition 4: 90°–120° → Server A Partition 5: 120°–150° → Server B Partition 6: 150°–180° → Server C ...and so on, round-robin Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced. When a 4th server joins (S becomes 4): New Q/S = 12/4 = 3 partitions per server. Each existing server hands off 1 partition to the new server. Only 3 out of 12 partitions move — the rest are untouched. This evolution — from random tokens to fixed, equal-sized partitions with balanced ownership — is one of the most instructive operational learnings from Dynamo. The early approach prioritized simplicity of implementation; the later approach prioritized operational simplicity and predictability. Confronto di Dynamo con i sistemi moderni System Consistency Model Use Case Dynamo Influence Cassandra Tunable (N, R, W) Time-series, analytics Direct descendant — heavily inspired by Dynamo, uses same consistent hashing and quorum concepts Riak Tunable, vector clocks Key-value store Closest faithful Dynamo implementation Amazon DynamoDB Eventually consistent by default Managed NoSQL DynamoDB is a completely different system internally, with no vector clocks and much simpler conflict resolution. Shares the name and high-level inspiration only. ⚠️ Not the same as Dynamo! Voldemort Tunable LinkedIn's data store Open-source Dynamo implementation Google Spanner Linearizable Global SQL Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization Redis Cluster Eventually consistent Caching, sessions Uses consistent hashing; much simpler conflict resolution Cassandra Modalità di regolazione (N, R, W) Analisi del tempo, analisi Diretto discendente – fortemente ispirato a Dynamo, utilizza gli stessi concetti di hashing e quorum coerenti Riak Orologi, orologi vettoriali Negozio Valore chiave Implementazione più fiabile di Dynamo Amazon DynamoDB Consistenti per default Gestione NoSQL DynamoDB è un sistema completamente diverso internamente, senza orologi vettoriali e risoluzione dei conflitti molto più semplice. ⚠️ Not the same as Dynamo! Voldemort TUNABILE Il data store di LinkedIn Implementazione Open Source di Dynamo Google Spanner lineare Scuola globale Opzione opposta a Dynamo – priorizza CP tramite sincronizzazione orologio TrueTime Redis Cluster Alla fine coerente Caching, sessioni Uses consistent hashing; much simpler conflict resolution : Many engineers conflate Amazon DynamoDB with the Dynamo paper. They are very different. DynamoDB is a managed service optimized for operational simplicity. It does not expose vector clocks, does not use the same partitioning scheme, and uses a proprietary consistency model. The paper is about the internal Dynamo storage engine that predates DynamoDB. The DynamoDB confusion Molti ingegneri confondono Amazon DynamoDB con la carta Dynamo. Sono molto diversi. DynamoDB è un servizio gestito ottimizzato per la semplicità operativa. Non espone orologi vettoriali, non utilizza lo stesso schema di partizione e utilizza un modello di coerenza proprietario. The DynamoDB confusion Quello che Dynamo non ti offre Ogni blog di ingegneri senior dovrebbe essere onesto riguardo alle limitazioni. ecco cosa Dynamo negozia esplicitamente: Nessuna transazione: le operazioni sono solo a chiave singola. Non è possibile aggiornare atomicamente più chiavi. Nessun indice secondario: è possibile cercare i dati solo con la chiave primaria (almeno nella progettazione originale). No joins: è un negozio di valore chiave. Non esiste un linguaggio di query. Nessun ordine globale: gli eventi tra chiavi diverse non hanno un ordine garantito. Nessuna linearizzazione: anche a R=W=N, Dynamo non fornisce letture linearizzabili. Nessuna risoluzione automatica dei conflitti: il sistema rileva i conflitti e li presenta all'applicazione.L'applicazione deve risolverli.Se i tuoi ingegneri non capiscono questo, avrai dei sottili bug di dati. Costi di riparazione su scala: Il processo di antientropia (conciliazione dell'albero di Merkle) non è gratuito. Crescita dell'orologio vettoriale: in ambienti di scrittura ad alto contenuto con molti coordinatori, gli orologi vettoriali possono crescere abbastanza grandi da richiedere la truncazione, che introduce una potenziale perdita di causalità. Comprendere queste limitazioni è fondamentale per il funzionamento di sistemi in stile Dynamo nella produzione. Esempio di implementazione pratica Di seguito è riportato un'implementazione Python autonoma dei concetti di base di Dynamo. È intenzionalmente semplificato - nessuna rete effettiva, nessuna persistenza - ma modella fedelmente come gli orologi vettoriali, l'anello hash coerente, i quorum leggono / scrivono e la rilevazione dei conflitti interagiscono. Parte 1: Orologio vettoriale Il class è la base del tracciamento delle versioni. è solo una mappatura del dizionario Due operazioni chiave: VectorClock node_id → counter — bump our own counter when we write increment(node) — check if one clock is causally “after” another; if neither dominates, the writes were concurrent (conflict) dominates(other) from __future__ import annotations from dataclasses import dataclass, field from typing import Optional class VectorClock: """ Tracks causality across distributed writes. A clock like {"nodeA": 2, "nodeB": 1} means: - nodeA has coordinated 2 writes - nodeB has coordinated 1 write - Any version with these counters has "seen" those writes """ def __init__(self, clock: dict[str, int] | None = None): self.clock: dict[str, int] = clock.copy() if clock else {} def increment(self, node_id: str) -> "VectorClock": """Return a new clock with node_id's counter bumped by 1.""" new_clock = self.clock.copy() new_clock[node_id] = new_clock.get(node_id, 0) + 1 return VectorClock(new_clock) def dominates(self, other: "VectorClock") -> bool: """ Returns True if self is causally AFTER other. self dominates other when: - Every counter in self is >= the same counter in other, AND - At least one counter in self is strictly greater. Meaning: self has seen everything other has seen, plus more. """ all_keys = set(self.clock) | set(other.clock) at_least_one_greater = False for key in all_keys: self_val = self.clock.get(key, 0) other_val = other.clock.get(key, 0) if self_val other_val: at_least_one_greater = True return at_least_one_greater def merge(self, other: "VectorClock") -> "VectorClock": """ Merge two clocks by taking the max of each counter. Used after resolving a conflict to produce a new clock that has "seen" everything both conflicting versions saw. """ all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" Parte 2: Valore della versione Ogni valore memorizzato in Dynamo è avvolto con il suo orologio vettoriale. Questo accoppiamento è ciò che consente al coordinatore di confrontare le versioni durante le letture e rilevare i conflitti. @dataclass class VersionedValue: """ A value paired with its causal history (vector clock). When a client reads, they get back a VersionedValue. When they write an update, they must include the context (the vector clock they read) so Dynamo knows what version they're building on top of. """ value: object vector_clock: VectorClock def __repr__(self): return f"VersionedValue(value={self.value!r}, clock={self.vector_clock})" Parte 3: Nodo simulato In Dynamo reale ogni nodo è un processo separato. Qui li simuliamo come oggetti in memoria. Il dettaglio chiave: ogni nodo ha il suo locale I nodi possono essere contrassegnati come per simulare i fallimenti. storage down class DynamoNode: """ Simulates a single Dynamo storage node. In production this would be a separate server with disk storage. Here it's an in-memory dict so we can demo the logic without networking. """ def __init__(self, node_id: str, token: int): self.node_id = node_id self.token = token # Position on the consistent hash ring self.storage: dict[str, list[VersionedValue]] = {} self.down = False # Toggle to simulate node failures def write(self, key: str, versioned_value: VersionedValue) -> bool: """ Store a versioned value. Returns False if the node is down. We store a LIST of versions per key, because a node might hold multiple concurrent (conflicting) versions until they are resolved by the application. """ if self.down: return False # In a real node this would be written to disk (e.g. BerkeleyDB) self.storage[key] = [versioned_value] return True def read(self, key: str) -> list[VersionedValue] | None: """ Return all versions of a key. Returns None if the node is down. A healthy node with no data for the key returns an empty list. """ if self.down: return None return self.storage.get(key, []) def __repr__(self): status = "DOWN" if self.down else f"token={self.token}" return f"DynamoNode({self.node_id}, {status})" Parte 4: Anello hash coerente L'anello mappa le chiavi per i nodi. Ordiniamo i nodi per il loro token (posizione) e utilizziamo un cammino orologio per trovare il coordinatore e la lista di preferenze per qualsiasi chiave. import hashlib class ConsistentHashRing: """ Maps any key to an ordered list of N nodes (the preference list). Nodes are placed at fixed positions (tokens) on a conceptual ring from 0 to 2^32. A key hashes to a position, then walks clockwise to find its nodes. This means adding/removing one node only rebalances ~1/N of keys, rather than reshuffling everything like modulo hashing would. """ def __init__(self, nodes: list[DynamoNode]): # Sort nodes by token so we can do clockwise lookup efficiently self.nodes = sorted(nodes, key=lambda n: n.token) def _hash(self, key: str) -> int: """Consistent hash of a key into the ring's token space.""" # Use MD5 for a simple, evenly distributed hash. # Real Dynamo uses a more sophisticated hash (e.g., SHA-1). digest = hashlib.md5(key.encode()).hexdigest() return int(digest, 16) % (2**32) def get_preference_list(self, key: str, n: int) -> list[DynamoNode]: """ Return the first N nodes clockwise from key's hash position. These are the nodes responsible for storing this key. The first node in the list is the coordinator — it receives the client request and fans out to the others. """ if not self.nodes: return [] key_hash = self._hash(key) # Find the first node whose token is >= key's hash (clockwise) start_idx = 0 for i, node in enumerate(self.nodes): if node.token >= key_hash: start_idx = i break # If key_hash is greater than all tokens, wrap around to node 0 else: start_idx = 0 # Walk clockwise, collecting N unique nodes preference_list = [] for i in range(len(self.nodes)): idx = (start_idx + i) % len(self.nodes) preference_list.append(self.nodes[idx]) if len(preference_list) == n: break return preference_list Parte 5: Il coordinatore Dynamo Questo è il cuore del sistema – la logica che gestisce le richieste dei clienti, i fan in replicas, attende il quorum e rileva i conflitti. class SimplifiedDynamo: """ Coordinates reads and writes across a cluster of DynamoNodes. Any node can act as coordinator for any request — there's no dedicated master. The coordinator is simply whichever node receives the client request (or the first node in the preference list, if using partition-aware routing). Configuration: N = total replicas per key R = minimum nodes that must respond to a read (read quorum) W = minimum nodes that must acknowledge a write (write quorum) """ def __init__(self, nodes: list[DynamoNode], N: int = 3, R: int = 2, W: int = 2): self.N = N self.R = R self.W = W self.ring = ConsistentHashRing(nodes) # ------------------------------------------------------------------ # # WRITE # # ------------------------------------------------------------------ # def put(self, key: str, value: object, context: VectorClock | None = None) -> VectorClock: """ Write a key-value pair to N replicas, wait for W ACKs. The 'context' is the vector clock from a previous read. Always pass context when updating an existing key — it tells Dynamo which version you're building on top of, so it can detect whether your write is concurrent with anything else. Returns the new vector clock, which the caller should store and pass back on future writes to this key. Raises: RuntimeError if fewer than W nodes acknowledged. """ preference_list = self.ring.get_preference_list(key, self.N) if not preference_list: raise RuntimeError("No nodes available") # The coordinator is always the first node in the preference list. coordinator = preference_list[0] # Increment the coordinator's counter in the vector clock. # If no context was provided (brand new key), start a fresh clock. base_clock = context if context is not None else VectorClock() new_clock = base_clock.increment(coordinator.node_id) versioned = VersionedValue(value=value, vector_clock=new_clock) # Fan out to all N replicas. # In a real system these would be concurrent RPC calls. # Here we call them sequentially for simplicity. ack_count = 0 for node in preference_list: success = node.write(key, versioned) if success: ack_count += 1 # Only need W ACKs to declare success. # The remaining replicas are updated asynchronously (or via # hinted handoff if they were down). if ack_count < self.W: raise RuntimeError( f"Write quorum not met: got {ack_count} ACKs, needed {self.W}" ) print(f"[PUT] key={key!r} value={value!r} clock={new_clock} " f"({ack_count}/{self.N} nodes wrote)") return new_clock # ------------------------------------------------------------------ # # READ # # ------------------------------------------------------------------ # def get(self, key: str) -> list[VersionedValue]: """ Read a key from N replicas, wait for R responses, reconcile. Returns a LIST of VersionedValues: - Length 1 → clean read, no conflict - Length >1 → concurrent versions detected; application must merge After reading, the caller should: 1. If no conflict: use the single value normally. 2. If conflict: merge the values using application logic, then call put() with the merged value and the merged vector clock as context. This "closes" the conflict. Read repair happens in the background: any replica that returned a stale version is silently updated with the latest version. """ preference_list = self.ring.get_preference_list(key, self.N) # Collect responses from all N nodes all_versions: list[VersionedValue] = [] responding_nodes: list[tuple[DynamoNode, list[VersionedValue]]] = [] for node in preference_list: result = node.read(key) if result is not None: # None means the node is down all_versions.extend(result) responding_nodes.append((node, result)) if len(responding_nodes) < self.R: raise RuntimeError( f"Read quorum not met: got {len(responding_nodes)} responses, needed {self.R}" ) # Reconcile: discard any version that is strictly dominated # (i.e., is a causal ancestor of) another version. # What remains is the set of concurrent versions. reconciled = self._reconcile(all_versions) # Background read repair: if any node returned something older # than the reconciled result, send it the latest version. # (Simplified: only meaningful when there's a single winner.) if len(reconciled) == 1: latest = reconciled[0] for node, versions in responding_nodes: if not versions or versions[0].vector_clock != latest.vector_clock: node.write(key, latest) # Repair silently in background status = "clean" if len(reconciled) == 1 else f"CONFLICT ({len(reconciled)} versions)" print(f"[GET] key={key!r} status={status} " f"({len(responding_nodes)}/{self.N} nodes responded)") return reconciled # ------------------------------------------------------------------ # # INTERNAL: VERSION RECONCILIATION # # ------------------------------------------------------------------ # def _reconcile(self, versions: list[VersionedValue]) -> list[VersionedValue]: """ Remove any version that is a causal ancestor of another version. If version A's clock is dominated by version B's clock, then B is strictly newer — A adds no new information and can be dropped. Whatever remains after pruning are CONCURRENT versions: writes that happened without either "knowing about" the other. The application must merge these using domain-specific logic. Example: versions = [clock={A:1}, clock={A:2}, clock={B:1}] {A:2} dominates {A:1} → drop {A:1} {A:2} and {B:1} are concurrent → both survive result = [{A:2}, {B:1}] ← conflict! application must merge """ dominated = set() for i, v1 in enumerate(versions): for j, v2 in enumerate(versions): if i != j and v2.vector_clock.dominates(v1.vector_clock): dominated.add(i) # v1 is an ancestor of v2, discard v1 survivors = [v for i, v in enumerate(versions) if i not in dominated] # De-duplicate: identical clocks from different replicas are the same version seen_clocks: list[VectorClock] = [] unique: list[VersionedValue] = [] for v in survivors: if not any(v.vector_clock.clock == s.clock for s in seen_clocks): unique.append(v) seen_clocks.append(v.vector_clock) return unique if unique else versions Parte 6: Putting It All Together - Una demo Eseguiamo uno scenario completo: normale scrittura / lettura, poi un conflitto simulato in cui due nodi si differenziano e l'applicazione deve unirli. def demo(): # ── Setup ────────────────────────────────────────────────────────── # # Five nodes placed at evenly spaced positions on the hash ring. # In a real cluster these would span multiple datacenters. nodes = [ DynamoNode("node-A", token=100), DynamoNode("node-B", token=300), DynamoNode("node-C", token=500), DynamoNode("node-D", token=700), DynamoNode("node-E", token=900), ] dynamo = SimplifiedDynamo(nodes, N=3, R=2, W=2) print("=" * 55) print("SCENARIO 1: Normal write and read (no conflict)") print("=" * 55) # Write the initial shopping cart ctx = dynamo.put("cart:user-42", {"items": ["shoes"]}) # Read it back — should be a clean single version versions = dynamo.get("cart:user-42") print(f"Read result: {versions[0].value}\n") # Update the cart, passing the context from our earlier read. # The context tells Dynamo "this write builds on top of clock ctx". ctx = dynamo.put("cart:user-42", {"items": ["shoes", "jacket"]}, context=ctx) versions = dynamo.get("cart:user-42") print(f"After update: {versions[0].value}\n") print("=" * 55) print("SCENARIO 2: Simulated conflict — two concurrent writes") print("=" * 55) # Write the base version base_ctx = dynamo.put("cart:user-99", {"items": ["hat"]}) # Now simulate a network partition: # node-A and node-B can't talk to each other. # We model this by writing directly to individual nodes. pref_list = dynamo.ring.get_preference_list("cart:user-99", 3) node_1, node_2, node_3 = pref_list[0], pref_list[1], pref_list[2] # Write 1: customer adds "scarf" via node_1 (e.g., their laptop) clock_1 = base_ctx.increment(node_1.node_id) node_1.write("cart:user-99", VersionedValue({"items": ["hat", "scarf"]}, clock_1)) # Write 2: customer adds "gloves" via node_2 (e.g., their phone) # This write also descends from base_ctx, not from clock_1. # Neither write knows about the other → they are concurrent. clock_2 = base_ctx.increment(node_2.node_id) node_2.write("cart:user-99", VersionedValue({"items": ["hat", "gloves"]}, clock_2)) # Read — coordinator sees two concurrent versions and surfaces the conflict versions = dynamo.get("cart:user-99") if len(versions) > 1: print(f"\nConflict detected! {len(versions)} concurrent versions:") for i, v in enumerate(versions): print(f" Version {i+1}: {v.value} clock={v.vector_clock}") # Application-level resolution: union merge (Amazon's shopping cart strategy) # Merge items: take the union so no addition is lost all_items = set() merged_clock = versions[0].vector_clock for v in versions: all_items.update(v.value["items"]) merged_clock = merged_clock.merge(v.vector_clock) merged_value = {"items": sorted(all_items)} print(f"\nMerged result: {merged_value}") # Write the resolved version back with the merged clock as context. # This "closes" the conflict — future reads will see a single version. final_ctx = dynamo.put("cart:user-99", merged_value, context=merged_clock) versions = dynamo.get("cart:user-99") print(f"\nAfter resolution: {versions[0].value}") assert len(versions) == 1, "Should be a single version after merge" if __name__ == "__main__": demo() Expected output: ======================================================= SCENARIO 1: Normal write and read (no conflict) ======================================================= [PUT] key='cart:user-42' value={'items': ['shoes']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote) [GET] key='cart:user-42' status=clean (3/3 nodes responded) Read result: {'items': ['shoes']} [PUT] key='cart:user-42' value={'items': ['shoes', 'jacket']} clock=VectorClock({'node-A': 2}) (3/3 nodes wrote) [GET] key='cart:user-42' status=clean (3/3 nodes responded) After update: {'items': ['shoes', 'jacket']} ======================================================= SCENARIO 2: Simulated conflict — two concurrent writes ======================================================= [PUT] key='cart:user-99' value={'items': ['hat']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote) [GET] key='cart:user-99' status=CONFLICT (2 versions) (3/3 nodes responded) Conflict detected! 2 concurrent versions: Version 1: {'items': ['hat', 'scarf']} clock=VectorClock({'node-A': 2}) Version 2: {'items': ['hat', 'gloves']} clock=VectorClock({'node-A': 1, 'node-B': 1}) Merged result: {'items': ['gloves', 'hat', 'scarf']} [PUT] key='cart:user-99' value={'items': ['gloves', 'hat', 'scarf']} ... (3/3 nodes wrote) [GET] key='cart:user-99' status=clean (3/3 nodes responded) After resolution: {'items': ['gloves', 'hat', 'scarf']} Nella sceneggiatura 2, il coordinatore identifica correttamente che e Non sono né uguali né in una relazione di dominazione - né è un antenato dell'altro - quindi entrambi si presentano come simultanei.L'applicazione prende poi la responsabilità di unirli e scrivere una versione risoluta con l'orologio unito. What to notice: {'node-A': 2} {'node-A': 1, 'node-B': 1} Le lezioni chiave per la progettazione di sistemi Dopo aver lavorato con i sistemi ispirati a Dynamo per anni, ecco i miei consigli chiave: 1. Always-On Beats Strongly-Consistent Per le applicazioni orientate all'utente, la disponibilità vince quasi sempre. Gli utenti tollereranno la visione di dati leggermente stagnanti. Non tollereranno il "Servizio non disponibile". 2. Application-Level Reconciliation is Powerful Non abbiate paura di spingere la risoluzione dei conflitti all'applicazione.L'applicazione comprende la logica aziendale e può prendere decisioni più intelligenti di quanto il database possa mai fare. 3. Tunable Consistency is Essential Le aggiunte al carrello di acquisto richiedono una disponibilità elevata (W = 1).Le transazioni finanziarie richiedono garanzie più forti (W = N).La capacità di allineare questa per operazione è incredibilmente preziosa. 4. The 99.9th Percentile Matters More Than Average Concentrate i vostri sforzi di ottimizzazione sulle latenze di coda. Questo è ciò che gli utenti sperimentano effettivamente durante i tempi di picco. 5. Gossip Protocols Scale Beautifully La coordinazione decentralizzata tramite gossip elimina singoli punti di fallimento e scala a migliaia di nodi. Quando non utilizzare i sistemi Dynamo-Style Essere onesti sui compromessi. Non usare questo approccio quando: È necessaria una forte coerenza (transazioni finanziarie, gestione degli inventari) Sono necessarie query complesse (rapporti, analisi, connessioni) Le transazioni coprono più elementi (Dynamo è solo operazioni a chiave singola) Il tuo team non può gestire l'eventuale coerenza (se gli sviluppatori non comprendono gli orologi vettoriali e la risoluzione dei conflitti, avrai problemi) Conclusione Dynamo represents a fundamental shift in how we think about distributed systems. By embracing eventual consistency and providing tunable trade-offs, it enables building systems that scale to massive sizes while maintaining high availability. Le lezioni del documento hanno influenzato un'intera generazione di database distribuiti.Che tu stia usando Cassandra, Riak o DynamoDB, stai beneficiando delle intuizioni pubblicate per la prima volta in questo articolo. Come ingegneri, il nostro compito è comprendere questi compromessi profondamente e applicarli in modo appropriato. Dynamo ci dà uno strumento potente, ma come qualsiasi strumento, è solo buono quanto la nostra comprensione di quando e come usarlo. Continua la lettura Titolo originale: SOSP 2007 Il blog di Werner Vogels: All Things Distributed Cassandra Documentation: Comprendere come questi concetti vengono implementati “Designing Data-Intensive Applications” di Martin Kleppmann – Capitolo 5 sulla replicazione Appendice: Problemi di progettazione e approcci Tre problemi aperti che sorgono nelle interviste di progettazione del sistema e nel lavoro di ingegneria reale. Problema 1: Risoluzione dei conflitti per un editor di documenti collaborativo : Stai costruendo qualcosa di simile a Google Docs supportato da un negozio in stile Dynamo. Due utenti modificano lo stesso paragrafo contemporaneamente. The problem La strategia del carrello d'acquisto (unione di tutti gli elementi) è sicura solo perché l'aggiunta di elementi è commutativa. Se l'Utente A elimina una frase e l'Utente B la modifica al centro, l'unione delle loro modifiche è senza senso o contraddittoria. Why shopping cart union doesn’t work here {A} ∪ {B} = {B} ∪ {A} The right approach: Operational Transformation (OT) or CRDTs La soluzione del settore consiste nel rappresentare il documento non come un blocco di testo, ma come una sequenza di operazioni, e trasformare le operazioni simultanee in modo che entrambe possano essere applicate senza conflitti: User A's operation: delete(position=50, length=20) User B's operation: insert(position=60, text="new sentence") Without OT: B's insert position (60) is now wrong because A deleted 20 chars. With OT: Transform B's operation against A's: B's insert position shifts to 40 (60 - 20). Both operations now apply cleanly. La strategia di risoluzione dei conflitti per lo strato Dynamo sarebbe: Conservare le operazioni (non snapshots del documento completo) come valore per ogni chiave. In conflitto, raccogliere tutti gli elenchi di operazioni concorrenti da ciascuna versione. Applicare OT per unirli in un unico registro operativo coerente. Scrivi il log di fusione con l'orologio vettoriale di fusione come contesto. : The operation log per document segment, not the rendered text. This makes merges deterministic and lossless. What to store in Dynamo I loro strati di archiviazione utilizzano OT o una variante di CRDT (Conflict-free Replicated Data Types), che sono strutture di dati che sono matematicamente garantite per fondersi senza conflitti indipendentemente dall'ordine operativo. Real-world reference Problema 2: scegliere N, R, W per diversi casi di utilizzo Quale configurazione scegliere per (a) un negozio di sessione, (b) un catalogo di prodotti, (c) profili utente? The problem Il modo corretto per pensare a questo: identificare la modalità di guasto che costa di più - una scrittura mancata (perdita di dati) o una scrittura rifiutata (indisponibilità). Session store — prioritize availability Le sessioni sono temporanee e specifiche per l'utente. Se la sessione di un utente è brevemente bloccata o persa, viene esclusa e riconnetta. Questo è fastidioso ma non catastrofico. N=3, R=1, W=1 Rationale: - W=1: Accept session writes even during heavy failures. A user can't log in if their session write is rejected. - R=1: Read from any single node. Stale session data is harmless. - N=3: Still replicate to 3 nodes for basic durability. Trade-off accepted: Stale session reads are possible but inconsequential. Product catalog — prioritize read performance and consistency I dati del prodotto sono scritti raramente (dai team dell'opps) ma leggono milioni di volte al giorno.I prezzi o le descrizioni sono problematiche.Vuoi leggere in modo rapido e coerente. N=3, R=2, W=3 Rationale: - W=3: All replicas must confirm a catalog update before it's live. A price change half-published is worse than a brief write delay. - R=2: Read quorum overlap with W=3 guarantees fresh data. Acceptable: catalog writes are rare, so write latency doesn't matter. - N=3: Standard replication for durability. Trade-off accepted: Writes are slow and fail if any node is down. Acceptable because catalog updates are infrequent. User profiles — balanced I dati del profilo (nome, e-mail, preferenze) sono moderatamente importanti. Un profilo stagnante è fastidioso ma non pericoloso. Un aggiornamento rifiutato (ad esempio, l'utente non può aggiornare la propria e-mail) è un vero problema. N=3, R=2, W=2 Rationale: - The classic balanced configuration. - R + W = 4 > N = 3, so quorums overlap: reads will see the latest write. - Tolerates 1 node failure for both reads and writes. - Appropriate for data that matters but doesn't require strict consistency. Trade-off accepted: A second simultaneous node failure will cause errors. Acceptable for non-critical user data. Decision framework summary: Priority R W When to use Max availability 1 1 Sessions, ephemeral state, click tracking Balanced 2 2 User profiles, preferences, soft state Consistent reads 2 3 Catalogs, config, rarely-written reference data Highest consistency 3 3 Anywhere you need R+W > N with zero tolerance for stale reads (still not linearizable) Disponibilità max 1 1 Sessioni, stato effimero, click tracking equilibrato 2 2 Profili utente, preferenze, stato soft La lettura coerente 2 3 Cataloghi, config, dati di riferimento raramente scritti La massima coerenza 3 3 Ovunque sia necessario R+W > N con tolleranza zero per le letture stagnanti (non ancora linearizzabili) Problema 3: Testare un sistema Dynamo-Style sotto scenari di partizione Come verificare che il sistema si comporti correttamente quando i nodi falliscono e si verificano le partizioni? The problem Questo è uno dei problemi più difficili nel test dei sistemi distribuiti perché i bug appaiono solo in specifiche interleavings di eventi concorrenti che sono difficili da riprodurre deterministicamente. Layer 1: Unit tests for the logic in isolation Before testing distributed behavior, verify the building blocks independently. Vector clock comparison logic, conflict detection, and reconciliation functions can all be tested with pure unit tests — no networking needed. def test_concurrent_clocks_detected_as_conflict(): clock_a = VectorClock({"node-A": 2}) clock_b = VectorClock({"node-B": 2}) assert not clock_a.dominates(clock_b) assert not clock_b.dominates(clock_a) # Both survive reconciliation → conflict correctly detected def test_ancestor_clock_is_discarded(): old_clock = VectorClock({"node-A": 1}) new_clock = VectorClock({"node-A": 3}) assert new_clock.dominates(old_clock) # old_clock should be pruned during reconciliation Layer 2: Deterministic fault injection Invece di sperare che i fallimenti si verifichino nell'ordine giusto durante il test di carico, iniettali deliberatamente e ripetutamente. è una versione semplice di questo. Nei sistemi di produzione, le biblioteche come o E questo a livello infrastrutturale. node.down = True Giuseppe Le scimmie del caos Scenari principali da testare: Scenario A: Write succeeds with W=2, third replica is down. → Verify: the data is readable after the down node recovers. → Verify: no data loss occurred. Scenario B: Two nodes accept concurrent writes to the same key. → Verify: the next read surfaces exactly 2 conflicting versions. → Verify: after the application writes a merged version, the next read is clean. Scenario C: Node goes down mid-write (wrote to W-1 nodes). → Verify: the write is correctly rejected (RuntimeError). → Verify: no partial writes are visible to readers. Scenario D: All N nodes recover after a full partition. → Verify: no data was lost across the cluster. → Verify: vector clocks are still meaningful (no spurious conflicts). Layer 3: Property-based testing Invece di scrivere casi di prova individuali, definire che deve sempre tenere e generare migliaia di sequenze di operazioni casuali per cercare di violarle: invarianti # Invariant: after any sequence of writes and merges, a final get() # should always return exactly one version (no unresolved conflicts). # Invariant: a value written with a context derived from a previous read # should never produce a conflict with that read's version # (it should dominate it). # Invariant: if R + W > N, a value written successfully should always # be visible in the next read (read-your-writes, absent concurrent writes). Strumenti come (Python) consente di esprimere queste invarianti e di trovare automaticamente controesempi. ipotesi Layer 4: Linearizability checkers Per la massima fiducia, registrare l'ora di inizio di ogni operazione, l'ora di fine e il risultato durante un test di iniezione di guasti, quindi alimentare la storia a un controllore di linearizzazione come Ti dirà se qualsiasi storia osservata è coerente con un'esecuzione sequenziale corretta - anche per un sistema eventualmente coerente che funziona entro le sue garanzie dichiarate. Knossos Scritto dalle trinche dei sistemi distribuiti. intuizioni testate in battaglia, zero ondate a mano. Leggi il link Leggi il link