A senior engineer's perspective on building highly available distributed systems

Table of Contents

- Introduction: Why Dynamo Changed Everything
- The CAP Theorem Trade-off
- Core Architecture Components
  - Consistent Hashing for Partitioning
  - Replication Strategy (N, R, W)
  - Vector Clocks for Versioning
  - Sloppy Quorum and Hinted Handoff
- Conflict Resolution: The Shopping Cart Problem
- Read and Write Flow
- Merkle Trees for Anti-Entropy
- Membership and Failure Detection
- The Key Takeaway: Real-World Numbers
- Partitioning Strategy Evolution
- Comparing Dynamo to Modern Systems
- What Dynamo Doesn't Give You
- A Practical Implementation Example
- Key Lessons for System Design
- When NOT to Use Dynamo-Style Systems
- Conclusion
- Appendix: Design Problems and Approaches

This is a long-form reference; each section stands on its own, so feel free to skip directly to whatever is most relevant to you.

Introduction: Why Dynamo Changed Everything

When Amazon published the Dynamo paper in 2007, it was not just another academic exercise. It was a battle-tested solution to real problems at massive scale. Dynamo is a distributed key-value storage system, designed to power Amazon's high-traffic services such as the shopping cart and session management systems. There are no secondary indexes, no joins, no relational semantics: just keys and values, with an extreme focus on availability and scalability. It does not provide linearizability or global ordering guarantees, even at the highest quorum settings.

Amazon's core problem was simple to state but brutal to solve: when someone tries to add an item to their shopping cart during a network partition or server failure, rejecting the write is not acceptable. How do you build a storage system that never says "no" to customers?
The CAP Theorem Trade-off: Why Dynamo Chooses Availability

Before diving into how Dynamo works, you need to understand the fundamental constraints it was designed around.

What is the CAP theorem?

The CAP theorem describes a fundamental trade-off in distributed systems: when a network partition occurs, you must choose between consistency and availability.

- Consistency (C): all nodes see the same data at the same time
- Availability (A): every request receives a response (success or failure)
- Partition Tolerance (P): the system continues working despite network failures

The common shorthand is "pick 2 of 3", but that is an oversimplification. In practice, network partitions are unavoidable at scale, so the real design choice is: when partitions occur (and they will), do you sacrifice consistency or availability?

The harsh reality: network partitions will happen. Cables will get cut, switches will fail, datacenters will lose connectivity. You can't avoid them, so you have to choose: consistency or availability?

The traditional approach: databases choose consistency.

Database: "I can't guarantee all replicas are consistent, so I'll reject your write to be safe."
Result: the customer sees an error, the cart is empty.
Impact: lost revenue, poor experience.

Dynamo's approach: choose availability.

Dynamo: "I'll accept your write with the replicas I can reach. The unreachable replica will catch up later."
Result: the customer sees success, the item is in the cart.
Impact: the sale continues, a happy customer.

The trade-off, visualized

When a partition occurs:

Traditional database (choose C over A, sacrificing availability):
- ✓ All replicas always have the same data
- ✓ No conflicts to resolve
- ❌ Rejects writes during failures
- ❌ Poor customer experience
- ❌ Lost revenue

Dynamo (choose A over C, sacrificing strong consistency):
- ✓ Accepts writes even during failures
- ✓ Excellent customer experience
- ✓ No lost revenue
- ❌ Replicas might temporarily disagree
- ❌ The application must handle conflicts

A real Amazon example: the Black Friday shopping cart

Imagine it is Black Friday. Millions of customers are shopping. A network cable gets cut between datacenters.

With a traditional database:

```
Time: 10:00 AM - network partition occurs
Result:
- All shopping cart writes fail
- "Service Unavailable" errors
- Customers can't check out
- Twitter explodes with complaints
- Estimated lost revenue: $100,000+ per minute
```

With Dynamo:

```
Time: 10:00 AM - network partition occurs
Result:
- Shopping cart writes continue
- Customers see success
- Some carts might have conflicts (rare)
- The application merges conflicting versions
- Estimated lost revenue: $0
- A few edge cases need conflict resolution (acceptable)
```

Why this choice makes sense for e-commerce

Amazon did the math:
- Cost of rejecting a write: an immediately lost sale ($50-200)
- Cost of accepting a conflicting write: occasionally merging shopping carts (happens rarely, easy to fix)
- Business decision: accept the writes, deal with the rare conflicts

Types of data where availability > consistency:
- Shopping carts (merge conflicting additions)
- Session data (last-write-wins is fine)
- User preferences (eventual consistency is acceptable)
- Best-seller lists (approximate is good enough)

Types of data where consistency > availability:
- Bank account balances (can't have conflicting balances)
- Inventory counts (can't oversell)
- Transaction logs (need ordering)

This is why Dynamo isn't for everything, but for Amazon's e-commerce use cases, choosing availability over strong consistency was the right trade-off.

Important nuance: although Dynamo is often described as an AP system, it is more accurate to call it a tunable consistency system. Depending on your R and W quorum configuration, it can behave closer to CP. The AP label applies to its default/recommended configuration, which is optimized for e-commerce workloads.

Core Architecture Components

Consistent Hashing for Partitioning

Let me explain this with a concrete example, because consistent hashing is one of those concepts that looks like magic until you see it in action.

The naive approach: traditional hash-based sharding

Imagine you have 3 servers and want to distribute data across them.

```python
# Traditional approach - DON'T DO THIS
def get_server(key, num_servers):
    hash_value = hash(key)
    return hash_value % num_servers  # Modulo operation

# With 3 servers:
get_server("user_123", 3)  # Returns server 0
get_server("user_456", 3)  # Returns server 1
get_server("user_789", 3)  # Returns server 2
```

This works... until you add or remove a server. Let's see what happens when we go from 3 to 4 servers:

```
# Before (3 servers):
"user_123" → hash % 3 = 0 → Server 0
"user_456" → hash % 3 = 1 → Server 1
"user_789" → hash % 3 = 2 → Server 2

# After (4 servers):
"user_123" → hash % 4 = 0 → Server 0 ✓ (stayed)
"user_456" → hash % 4 = 1 → Server 1 ✓ (stayed)
"user_789" → hash % 4 = 2 → Server 2 ✓ (stayed)

# But wait - this is lucky! In reality, most keys MOVE:
"product_ABC" → hash % 3 = 2 → Server 2
"product_ABC" → hash % 4 = 3 → Server 3 ✗ (MOVED!)
```
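The "most keys move" claim is easy to verify empirically. A minimal sketch, assuming a stable hash function (the key names are illustrative):

```python
import hashlib

def stable_hash(key: str) -> int:
    # Python's built-in hash() is salted per process, so use a stable digest.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

keys = [f"user_{i}" for i in range(100_000)]
moved = sum(stable_hash(k) % 3 != stable_hash(k) % 4 for k in keys)
print(f"{moved / len(keys):.1%} of keys change servers")  # ~75%
```

For uniformly distributed hashes, a key stays put only when `hash % 3 == hash % 4`, which happens for 3 of every 12 residues modulo 12, so about 75% of keys move.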
The disaster: when you change the number of servers, almost all of your data has to be redistributed. Imagine terabytes of data moving just to add one server!

The solution: consistent hashing

Consistent hashing solves this by treating the hash space as a circle (0 to 2^32 - 1, wrapping around).

Step 1: Place servers on the ring. Each server is assigned a random position on the ring (called a "token").

Step 2: Place data on the ring. When you want to store data, you:
1. Hash the key to get a position on the ring
2. Walk clockwise from that position
3. Store the data on the first server you encounter

A visual example: the complete ring

Here is the ring laid out in order. The simple rule: a key walks clockwise until it hits a server.

Examples:
- user_123 at 30° → walks to 45° → Server A owns it
- user_456 at 150° → walks to 200° → Server C owns it
- cart_789 at 250° → walks to 280° → Server D owns it
- product_ABC at 300° → passes 360°, wraps to 0°, continues to 45° → Server A owns it

Who owns what range?
- Server A (45°): owns everything from 281° to 45° (wraps around)
- Server B (120°): owns everything from 46° to 120°
- Server C (200°): owns everything from 121° to 200°
- Server D (280°): owns everything from 201° to 280°

The magic: adding a server

Now let's see why this is brilliant. We add Server E at position 160°:

```
BEFORE:
Server A (45°)  → owns 281°-45°
Server B (120°) → owns 46°-120°
Server C (200°) → owns 121°-200°  ← THIS RANGE WILL SPLIT
Server D (280°) → owns 201°-280°

AFTER:
Server A (45°)  → owns 281°-45°   ← NO CHANGE
Server B (120°) → owns 46°-120°   ← NO CHANGE
Server E (160°) → owns 121°-160°  ← NEW! Takes part of C's range
Server C (200°) → owns 161°-200°  ← SMALLER range
Server D (280°) → owns 201°-280°  ← NO CHANGE
```

Only the keys in the 121°-160° range need to move (from C to E). Servers A, B, and D are completely unaffected!
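The ring and the clockwise-walk rule can be sketched as a minimal implementation. This is a sketch under assumptions: the hash choice is arbitrary, and the server positions mirror the degree example above rather than being random:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: servers sit at fixed token positions,
    and a key belongs to the first server clockwise from its hash position."""

    def __init__(self):
        self._tokens = []  # sorted list of (position, server_name)

    def _position(self, key: str) -> int:
        # Stable hash onto [0, 2**32); md5 here is for placement, not security.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def add_server(self, name: str, position: int) -> None:
        bisect.insort(self._tokens, (position, name))

    def get_server(self, key: str) -> str:
        # Walk clockwise: first token at or past the key's position, wrapping to 0.
        idx = bisect.bisect_left(self._tokens, (self._position(key), ""))
        if idx == len(self._tokens):
            idx = 0
        return self._tokens[idx][1]

DEG = 2**32 // 360  # scale the degree positions from the example onto the ring
ring = HashRing()
for name, deg in [("A", 45), ("B", 120), ("C", 200), ("D", 280)]:
    ring.add_server(name, deg * DEG)
```

Calling `ring.add_server("E", 160 * DEG)` afterwards reassigns only keys whose hash positions fall in the 121°-160° arc (they move from C to E); every other key keeps its owner.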
The virtual nodes optimization

There is a critical problem with the basic consistent hashing approach: random distribution can be extremely uneven.

The problem in detail: when you randomly assign one position per server, you are essentially throwing darts at a circular board. Sometimes the darts cluster together, sometimes they spread out. Let me show you a concrete example:

```
Scenario: 4 servers with single random tokens

Server A: 10°  }
Server B: 25°  } ← Only 15° apart! Tiny ranges
Server C: 100° }
Server D: 280° ← 180° away from C! Huge range

Range sizes:
- Server A owns: 281° to 10°  = 89°  (25% of ring)
- Server B owns: 11° to 25°   = 14°  (4% of ring)  ← Underutilized!
- Server C owns: 26° to 100°  = 74°  (21% of ring)
- Server D owns: 101° to 280° = 179° (50% of ring) ← Overloaded!
```

Real-world consequences:

1. Uneven load: Server D handles 50% of all data while Server B handles only 4%.
- Server D's CPU, disk, and network are maxed out
- Server B sits mostly idle (wasted capacity)
- Your 99.9th percentile latency is dominated by the overloaded Server D

2. Hotspot cascading: when Server D becomes slow or fails:
- All of its 50% load shifts to Server A (the next one clockwise)
- Server A now becomes overloaded
- System performance degrades catastrophically

3. Inefficient scaling: adding servers doesn't help evenly, because new servers can land inside already-small ranges.

Dynamo's solution: each physical server gets multiple virtual positions (tokens). Instead of one dart throw per server, you throw many darts. The more throws, the more even the distribution (the law of large numbers).
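The dart-throwing intuition can be simulated directly. A sketch, assuming uniformly random tokens (the server names and the seed are illustrative):

```python
import random

def ownership_shares(tokens_per_server, servers=("A", "B", "C", "D"),
                     ring_size=2**32, seed=1):
    """Throw `tokens_per_server` random tokens per server onto the ring and
    return each server's fraction of the total key space."""
    rng = random.Random(seed)
    tokens = sorted((rng.randrange(ring_size), s)
                    for s in servers for _ in range(tokens_per_server))
    share = dict.fromkeys(servers, 0)
    for i, (pos, server) in enumerate(tokens):
        prev_pos = tokens[i - 1][0]      # i == 0 wraps around the ring
        share[server] += (pos - prev_pos) % ring_size
    return {s: share[s] / ring_size for s in servers}

print(ownership_shares(1))    # one token each: typically very uneven
print(ownership_shares(128))  # 128 tokens each: every share close to 0.25
```

With a single token per server the shares swing wildly between runs; with 128 tokens per server they concentrate tightly around 25% each, which is the law of large numbers at work.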
How virtual nodes fix the problem

Let's take the same 4 servers, but now each server gets 3 virtual nodes (tokens) instead of 1:

```
Physical Server A gets 3 tokens: 10°, 95°, 270°
Physical Server B gets 3 tokens: 25°, 180°, 310°
Physical Server C gets 3 tokens: 55°, 150°, 320°
Physical Server D gets 3 tokens: 75°, 200°, 340°

Now the ring looks like:
10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B,
200° D, 270° A, 310° B, 320° C, 340° D

Range sizes:
- Server A total: 30° + 20° + 70° = 120° (33% of ring)
- Server B total: 15° + 30° + 40° = 85°  (24% of ring)
- Server C total: 30° + 55° + 10° = 95°  (26% of ring)
- Server D total: 20° + 20° + 20° = 60°  (17% of ring)
```

The load now ranges from 17% to 33% instead of 4% to 50%. Much better!

Why this works:

- Statistics: with more samples (tokens), the random distribution averages out. This is the law of large numbers in action.
- Granular load distribution: when a server fails, its load is spread across many servers, not just one neighbor:

```
Server A fails:
- Its token at 10°  → load shifts to Server B's token at 25°
- Its token at 95°  → load shifts to Server C's token at 150°
- Its token at 270° → load shifts to Server B's token at 310°

Result: the load is spread across multiple servers!
```

- Smooth scaling: when adding a new server with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server.

Real Dynamo configurations: the paper mentions different strategies that evolved over time. In production:
- Earlier versions: 100-200 virtual nodes per physical server
- Later optimized to: Q/S tokens per node (where Q = total partitions, S = number of servers)
- Typical settings: each physical server can have 128-256 virtual nodes

The trade-off: balance vs overhead

More virtual nodes means better load distribution, but there's a cost.
What you gain with more virtual nodes:

```
With 1 token per server (4 servers):
  Load variance: 4% to 50% ❌

With 3 tokens per server (12 virtual nodes):
  Load variance: roughly 17% to 33% ✓

With 128 tokens per server (512 virtual nodes):
  Load variance: 24% to 26% ✓✓
```

What it costs:

1. Metadata size: each node maintains routing information.
- 1 token per server: track 4 entries
- 128 tokens per server: track 512 entries

2. Gossip overhead: nodes periodically exchange membership information.
- More tokens = more data to synchronize between nodes
- Every second, nodes gossip their view of the ring

3. Rebalancing complexity: when nodes join or leave:
- More virtual nodes = more partition transfers to coordinate
- But each transfer is smaller (which is actually good for bootstrapping)

Dynamo's evolution: the paper describes how Amazon optimized this over time:

```
Strategy 1 (Initial):
- 100-200 random tokens per server
- Problem: huge metadata (multiple MB per node)
- Problem: slow bootstrapping (had to scan for specific key ranges)

Strategy 3 (Current):
- Q/S tokens per server (Q = total partitions, S = number of servers)
- Equal-sized partitions
- Example: 1024 partitions / 8 servers = 128 tokens per server
- Benefit: metadata reduced to KB
- Benefit: fast bootstrapping (transfer whole partition files)
```

The real production sweet spot: most Dynamo deployments use 128-256 virtual nodes per physical server.
- Load distribution within 10-15% (good enough)
- Metadata overhead under 100KB per node (negligible)
- Fast failure recovery (load spreads across many nodes)

Why not more? Going from 128 to 512 tokens only improves load balance by another 2-3%, but it doubles the metadata size and gossip traffic.

Key concept: physical servers (top) map to multiple virtual positions (bottom) on the ring.
Benefits:
- More even load distribution
- When a server fails, its load is spread across many servers (not just one neighbor)
- When a server joins, it steals a small amount from many servers

A real-world impact comparison

Let's look at the difference with real numbers:

```
Traditional hashing (3 servers → 4 servers):
- Keys that need to move: ~75% (3 out of 4)
- Example: 1 million keys → 750,000 keys must migrate

Consistent hashing (3 servers → 4 servers):
- Keys that need to move: ~25% (1 out of 4)
- Example: 1 million keys → 250,000 keys must migrate

With virtual nodes (150 vnodes total → 200 vnodes):
- Keys that need to move: ~12.5% (spread evenly)
- Example: 1 million keys → 125,000 keys must migrate
- Load is balanced across all servers
```

The "aha!" moment

The key insight is this: consistent hashing decouples the hash space from the number of servers.

```
Traditional: server = hash(key) % num_servers            ← num_servers is in the formula!
Consistent:  server = ring.findNextClockwise(hash(key))  ← num_servers isn't in the formula!
```

This is why adding or removing servers only affects a small portion of the data. The hash values don't change; only which server "owns" which range changes, and only locally.

Think of it as a circular race with water stations (the servers). If you add a new water station, runners change stations only if they are between the old nearest station and the new one.

Replication Strategy (N, R, W)

The problem: the availability vs consistency trade-off

Imagine you are building Amazon's shopping cart. A customer adds an item to their cart, but at that moment:
- One server is being rebooted for maintenance
- Another server has a network hiccup
- A third server is perfectly fine

The traditional database approach (everything for consistency):

```
Client: "Add this item to my cart"
Database: "I need ALL replicas to confirm before I say yes"

Server 1: ✗ (rebooting)
Server 2: ✗ (network issue)
Server 3: ✓ (healthy)

Result: "Sorry, service unavailable. Try again later."
```
The customer experience: "I can't add items to my cart during Black Friday?!"

This is unacceptable for e-commerce. Every rejected write is lost revenue.

Dynamo's solution: tunable quorums

Dynamo gives you three knobs to tune the exact trade-off you want:
- N: the number of replicas (how many copies of the data)
- R: the read quorum (how many replicas must respond for a successful read)
- W: the write quorum (how many replicas must acknowledge for a successful write)

The magic formula: when R + W > N, you guarantee quorum overlap, meaning at least one node that received the write will be queried during every read. This overlap makes it possible to detect the newest version, provided the reconciliation logic correctly identifies the highest vector clock.

Let me show you why this matters with real scenarios:

Scenario 1: shopping cart (prioritize availability)

```python
N = 3  # Three replicas for durability
R = 1  # Read from any single healthy node
W = 1  # Write to any single healthy node

# Trade-off analysis:
# ✓ Writes succeed even if 2 out of 3 nodes are down
# ✓ Reads succeed even if 2 out of 3 nodes are down
# ✓ Maximum availability - never reject customer actions
# ✗ Might read stale data
# ✗ Higher chance of conflicts (but we can merge shopping carts)
```

What happens during failure:

```
Client: "Add item to cart"
Coordinator tries N=3 nodes:
- Node 1: ✗ Down
- Node 2: ✓ ACK (W=1 satisfied!)
- Node 3: Still waiting...

Result: SUCCESS returned to client immediately
Node 3 eventually gets the update (eventual consistency)
```

Scenario 2: session state (balanced approach)

```python
N = 3
R = 2  # Must read from 2 nodes
W = 2  # Must write to 2 nodes

# Trade-off analysis:
# ✓ R + W = 4 > N = 3 → read-your-writes guaranteed
# ✓ Tolerates 1 node failure
# ✓ Good balance of consistency and availability
# ✗ Write fails if 2 nodes are down
# ✗ Read fails if 2 nodes are down
```

Why R + W > N enables read-your-writes:

```
Write to W=2 nodes:         [A, B]
Later, read from R=2 nodes: [B, C]

Because W + R = 4 > N = 3, there's guaranteed overlap!
```
At least one node (B in this case) will have the latest data. The coordinator detects the newest version by comparing vector clocks. This guarantees seeing the latest write as long as reconciliation picks the causally most-recent version correctly.

Scenario 3: financial data (prioritize consistency)

```python
N = 3
R = 3  # Must read from ALL nodes
W = 3  # Must write to ALL nodes

# Trade-off analysis:
# ✓ Full replica quorum - reduces likelihood of divergent versions
# ✓ Any read will overlap every write quorum
# ✗ Write fails if ANY node is down
# ✗ Read fails if ANY node is down
# ✗ Poor availability during failures
```

Systems that require strict transactional guarantees usually choose CP systems instead. This configuration is technically supported by Dynamo, but it sacrifices the availability properties that motivate using Dynamo in the first place.

Configuration comparison table

| Config | N | R | W | Availability | Consistency | Use Case |
|---|---|---|---|---|---|---|
| High Availability | 3 | 1 | 1 | ⭐⭐⭐⭐⭐ | ⭐⭐ | Shopping cart, wish list |
| Balanced | 3 | 2 | 2 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Session state, user preferences |
| Full Quorum | 3 | 3 | 3 | ⭐⭐ | ⭐⭐⭐⭐⭐ | High-stakes reads (not linearizable) |
| Read-Heavy | 3 | 1 | 3 | ⭐⭐⭐ (reads) | ⭐⭐⭐⭐ | Product catalog, CDN metadata |
| Write-Heavy | 3 | 3 | 1 | ⭐⭐⭐ (writes) | ⭐⭐⭐ | Click tracking, metrics |

Note on financial systems: systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn't use Dynamo.
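The overlap guarantee behind the R + W > N configurations can be checked exhaustively for small N. A minimal sketch (the function name is illustrative):

```python
from itertools import combinations

def quorum_overlap_always(n: int, r: int, w: int) -> bool:
    """True if every possible write quorum of size w shares at least one
    node with every possible read quorum of size r (out of n replicas)."""
    nodes = range(n)
    return all(set(wq) & set(rq)
               for wq in combinations(nodes, w)
               for rq in combinations(nodes, r))

assert quorum_overlap_always(3, 2, 2)      # R + W = 4 > 3: overlap guaranteed
assert not quorum_overlap_always(3, 1, 1)  # R + W = 2 <= 3: a read can miss the write
```

This is just the pigeonhole principle: if the read and write sets together contain more than N members, they must share at least one node.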
That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business logic layer.

The key insight

Most systems use N=3, R=2, W=2, because:
- Durability: it can survive up to 2 replica failures before permanent data loss (assuming independent failures and no correlated outages)
- Availability: it tolerates 1 node failure for both reads and writes
- Consistency: R + W > N guarantees that read and write quorums overlap, enabling read-your-writes behavior in the absence of concurrent writes
- Performance: you don't wait for the slowest node (only 2 of the 3 nodes are needed)

Real production numbers from the paper, for Amazon's shopping cart service during peak (holiday season):
- Configuration: N=3, R=2, W=2
- Handled tens of millions of requests
- Over 3 million checkouts in a single day
- No outages, even with server failures

This tunable approach is what made Dynamo revolutionary. You are not stuck with one-size-fits-all; you tune it based on your actual business requirements.

Vector Clocks for Versioning

The problem: detecting causality in distributed systems

When multiple nodes can accept writes independently, you have to answer a critical question: are these two versions of the same data related, or were they created concurrently?

Why timestamps don't work:

```
Scenario: two users edit the same shopping cart simultaneously

User 1 at 10:00:01.500 AM: Adds item A → writes to Node X
User 2 at 10:00:01.501 AM: Adds item B → writes to Node Y

Physical timestamp says: User 2's version is "newer"
Reality: these are concurrent! Both should be kept!

Problem:
- Clocks on different servers are NEVER perfectly synchronized
- Clock skew can be seconds or even minutes
- Network delays are unpredictable
- Physical time doesn't capture causality
```

What we really need to know:
- Version A happened before Version B? → B can overwrite A
- Version A and B are concurrent?
  → Keep both, merge later
- Version A came from reading Version B? → We can track this!

The solution: vector clocks

A vector clock is a simple data structure: a list of (node_id, counter) pairs that track which nodes have seen which versions.

The rules:
1. When a node writes data, it increments its own counter
2. When a node reads data, it receives the vector clock
3. When comparing two vector clocks:
   - If all counters in A ≤ the counters in B → A is an ancestor of B (B is newer)
   - If some counters in A > B and some in B > A → A and B are concurrent (conflict!)

A step-by-step example

Let's trace a shopping cart through multiple updates and break down the conflict:

```
D3: [Sx:2, Sy:1]  vs  D4: [Sx:2, Sz:1]

Comparing:
- Sx: 2 == 2 ✓ (equal)
- Sy: 1 vs missing in D4 → D3 has something D4 doesn't
- Sz: missing in D3 vs 1 → D4 has something D3 doesn't

Conclusion: CONCURRENT! Neither is an ancestor of the other.
Both versions must be kept and merged.
```

Real-world characteristics

The Dynamo paper reports the following conflict distribution measured over 24 hours of Amazon's production shopping cart traffic. These numbers reflect Amazon's specific workload (high read/write ratio, mostly single-user sessions) and should not be assumed to generalize to all Dynamo deployments:

```
99.94%   - Single version (no conflict)
0.00057% - 2 versions
0.00047% - 3 versions
0.00009% - 4 versions
```

The key insight: conflicts are RARE in practice!

Why conflicts happen:
- Not usually because of network failures
- Mostly from concurrent writers (often automated processes / bots)
- Human users rarely create conflicts because they are slow relative to network speed

The size problem

Vector clocks can grow unbounded if many nodes coordinate writes. Dynamo's solution: once the clock exceeds a size threshold, truncate the oldest entries.
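The ancestor/concurrent comparison rules above can be sketched as a small function; representing a clock as a plain `{node_id: counter}` dict is an assumption for illustration:

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks held as {node_id: counter} dicts.
    Missing entries count as 0."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"    # a is older; b may overwrite it
    if b_le_a:
        return "descendant"  # a is newer
    return "concurrent"      # conflict: keep both versions and merge later

# The D3 vs D4 example:
assert compare({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1}) == "concurrent"
assert compare({"Sx": 1}, {"Sx": 2, "Sy": 1}) == "ancestor"
```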
```javascript
// When the vector clock exceeds a threshold (e.g., 10 entries),
// remove the oldest entry based on wall-clock timestamp.
vectorClock = {
  'Sx': {counter: 5, timestamp: 1609459200},
  'Sy': {counter: 3, timestamp: 1609459800},
  'Sz': {counter: 2, timestamp: 1609460400},
  // ... 10 more entries
}
// If size > 10, remove the entry with the oldest timestamp.

// ⚠ Risk: dropping an entry collapses causality information.
// Two versions that were causally related may now appear
// concurrent, forcing the application to resolve a conflict
// that didn't actually exist. In practice, Amazon reports
// this has not been a significant problem, but it is a
// real theoretical risk in high-churn write environments
// with many distinct coordinators.
```

Sloppy Quorum and Hinted Handoff

The problem: strict quorums kill availability

Traditional quorum systems are rigid and unforgiving.

A traditional strict quorum:

```
Your data is stored on nodes: A, B, C (the preference list)
Write requirement: W = 2

Scenario: Node B is down for maintenance
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✓
Result: SUCCESS (got 2 out of 3)

Scenario: Nodes B AND C are down
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✗ (down)
Result: FAILURE (only got 1 out of 3)
Customer: "Why can't I add items to my cart?!" 😡
```

Our problem:
Strict quorums require specific nodes. If those specific nodes are down, the system becomes unavailable.

A real scenario at Amazon:

```
Black Friday, 2:00 PM
- Datacenter 1: 20% of nodes being rebooted (rolling deployment)
- Datacenter 2: network hiccup (1-2% packet loss)
- Traffic: 10x normal load

With a strict quorum:
- 15% of write requests fail
- Customer support phones explode
- Revenue impact: millions per hour
```

The fix: the sloppy quorum

Dynamo relaxes the quorum requirement: "Write to the first N healthy nodes in the preference list, walking further down the ring if needed."

```
Preference list for key K: A, B, C
But B is down...

Sloppy quorum says: "Don't give up! Walk further down the ring: A, B, C, D, E, F, ..."

The coordinator walks until N=3 healthy nodes are found: A, C, D
(D is a temporary substitute for B)
```

How hinted handoff works

When a node temporarily substitutes for a failed node, it stores a "hint" with the data.

The hinted handoff process in detail

Step 1: detect the failure and substitute

```python
def write_with_hinted_handoff(key, value, N, W):
    preference_list = get_preference_list(key)  # e.g. [A, B, C]
    healthy_nodes = []
    for node in preference_list:
        if is_healthy(node):
            healthy_nodes.append((node, False))  # not a hint

    # If we don't have N healthy nodes, expand the list
    if len(healthy_nodes) < N:
        extended_list = get_extended_preference_list(key)
        for node in extended_list:
            if node not in preference_list and is_healthy(node):
                healthy_nodes.append((node, True))  # substitute, store a hint
            if len(healthy_nodes) >= N:
                break

    # Write to the first N healthy nodes
    acks = 0
    for node, is_hint in healthy_nodes[:N]:
        if is_hint:
            # Store with hint metadata naming the intended node
            intended_node = find_intended_node(preference_list, node)
            success = node.write_hinted(key, value, hint=intended_node)
        else:
            success = node.write(key, value)
        if success:
            acks += 1

    return SUCCESS if acks >= W else FAILURE
```

Step 2: transfer the hints in the background

```python
# Runs periodically on each node (e.g., every 10 seconds)
def transfer_hints():
    hints_db = get_hinted_replicas()
    for hint in hints_db:
        intended_node = hint.intended_for
        if is_healthy(intended_node):
            try:
                intended_node.write(hint.key, hint.value)
                hints_db.delete(hint)
                log(f"Successfully transferred hint to {intended_node}")
            except Exception:
                log(f"Will retry later for {intended_node}")
```

Why this is brilliant

Durability maintained:

```
Even though B is down:
- We still have N=3 copies: A, C, D
- Data won't be lost even if another node fails
- The system maintains its durability guarantee
```

Availability maximized:

```
Client perspective:
- Write succeeds immediately
- No error message
- No retry needed
- Customer happy

A traditional quorum would have failed:
- Only 2 preferred nodes available (A, C)
- Need 3 for N=3
- Write rejected
- Customer sees error
```

Eventual consistency:

```
Timeline:
T=0:          Write succeeds (A, C, D with hint)
T=0-5min:     B is down, but the system works fine
T=5min:       B recovers
T=5min+10sec: D detects B is back, transfers the hint
T=5min+11sec: B has the data, D deletes the hint

Result: eventually, all correct replicas have the data
```

An example configuration

```javascript
// High availability configuration
const config = {
  N: 3,  // Want 3 replicas
  W: 2,  // Only need 2 ACKs to succeed
  R: 2,  // Read from 2 nodes

  // Sloppy quorum allows expanding the preference list
  sloppy_quorum: true,

  // How far to expand when looking for healthy nodes
  max_extended_preference_list: 10,

  // How often to check for hint transfers
  hint_transfer_interval_seconds: 10,

  // How long to keep trying to transfer hints
  hint_retention_days: 7
};
```

Real-world impact

From Amazon's production experience:

During normal operation:
- Hinted handoff is rarely triggered
- Most writes go to the preferred nodes
- The hints database is mostly empty

During failures:
- Scenario: 5% of nodes failing at any time (normal at Amazon's scale)
- Without hinted handoff: write success rate around 85%; 15% of cart additions fail
- With hinted handoff: write success rate 99.9%+; customer impact nearly zero

During a datacenter failure:
- Scenario: an entire datacenter unreachable (33% of nodes)
- Without hinted handoff: many keys would lose their entire preference list
- Massive write failures follow; the system is effectively down
- With hinted handoff: writes redirect to other datacenters, hints accumulate temporarily, and when the datacenter recovers the hints transfer; zero customer-visible failures

The trade-off

Benefits:
- ✓ Maximum write availability
- ✓ Durability maintained during failures
- ✓ Automatic recovery when nodes come back
- ✓ No manual intervention required

Costs:
- ✗ Temporary inconsistency (data is not on the "correct" nodes)
- ✗ Extra storage for the hints database
- ✗ Background bandwidth for hint transfers
- ✗ Slightly more complex code
- ✗ Hinted handoff provides temporary durability, not permanent replication. If a substitute node (like D) fails before it can transfer its hint back to B, the number of true replicas drops below N until the situation resolves. This is an important edge case to understand in failure planning.

Amazon's verdict: the availability benefits far outweigh the costs for e-commerce workloads.

Conflict Resolution: The Shopping Cart Problem

Let's talk about the most famous example from the paper: the shopping cart. This is where the rubber meets the road.

What is a conflict (and why does it happen)?

A conflict occurs when two writes happen to the same key on different nodes, without either write "knowing about" the other. This is only possible because Dynamo accepts writes even when nodes can't communicate, which is the whole point!

Here is a concrete sequence of events that creates a conflict:

```
Timeline:

T=0: Customer logs in. Cart has {shoes} on all 3 nodes.
T=1: Network partition: Node1 can't talk to Node2.
T=2: Customer adds {jacket} on their laptop → goes to Node1.
     Cart on Node1: {shoes, jacket}  ← Vector clock: [N1:2]
T=3: Customer adds {hat} on their phone → goes to Node2.
     Cart on Node2: {shoes, hat}     ← Vector clock: [N2:2]
T=4: Network heals. Node1 and Node2 compare notes.
     Node1 says: "I have version [N1:2]"
     Node2 says: "I have version [N2:2]"
     Neither clock dominates the other → CONFLICT!
```
Neither version is “wrong”—both represent real actions the customer took. Dynamo’s job is to detect this situation (via vector clocks) and surface na aplikaciju tako da aplikacija može odlučiti šta da radi. both versions What Does the Application Do With a Conflict? This is the crucial part that the paper delegates to you: Dynamo vam daje sve istovremeno verzije; vaš kod odlučuje kako ih spajati. the application must resolve conflicts using business logic Za košaricu za kupovinu, Amazon je izabrao : keep all items from all concurrent versions. The rationale is simple—losing an item from a customer’s cart (missing a sale) is worse than occasionally showing a stale item they already deleted. union merge Conflict versions: Version A (from Node1): {shoes, jacket} Version B (from Node2): {shoes, hat} Merge strategy: union Merged cart: {shoes, jacket, hat} ← All items preserved Evo pravog kodeksa pomirenja: from __future__ import annotations from dataclasses import dataclass, field class VectorClock: def __init__(self, clock: dict[str, int] | None = None): self.clock: dict[str, int] = clock.copy() if clock else {} def merge(self, other: "VectorClock") -> "VectorClock": """Merged clock = max of each node's counter across both versions.""" all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" @dataclass class ShoppingCart: items: list[str] = field(default_factory=list) vector_clock: VectorClock = field(default_factory=VectorClock) @staticmethod def reconcile(carts: list["ShoppingCart"]) -> "ShoppingCart": if len(carts) == 1: return carts[0] # No conflict, nothing to do # Merge strategy: union of all items (never lose additions). # This is Amazon's choice for shopping carts. # A different application might choose last-write-wins or something else. 
        all_items: set[str] = set()
        merged_clock = VectorClock()
        for cart in carts:
            all_items.update(cart.items)  # Union: keep everything
            merged_clock = merged_clock.merge(cart.vector_clock)
        return ShoppingCart(items=sorted(all_items), vector_clock=merged_clock)

# Example conflict scenario
cart1 = ShoppingCart(items=["shoes", "jacket"], vector_clock=VectorClock({"N1": 2}))
cart2 = ShoppingCart(items=["shoes", "hat"], vector_clock=VectorClock({"N2": 2}))

# Dynamo detected a conflict and passes both versions to our reconcile()
reconciled = ShoppingCart.reconcile([cart1, cart2])
print(reconciled.items)  # ['hat', 'jacket', 'shoes'] — union!

The Deletion Problem (Why This Gets Tricky)

The union strategy has a nasty edge case: deleted items can come back from the dead.

T=0: Cart: {shoes, hat}
T=1: Customer removes hat → Cart: {shoes}  Clock: [N1:3]
T=2: Network partition — Node2 still has old state
T=3: Some concurrent write to Node2  Clock: [N2:3]
T=4: Network heals → conflict detected
T=5: Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat}
Result: Hat is BACK! Customer removed it, but it reappeared.

Amazon explicitly accepts this trade-off. A "ghost" item in a cart is a minor annoyance. Losing a cart addition during a Black Friday sale is lost revenue.

Engineering depth note: merge logic must be domain-specific and carefully designed. Adding items is commutative (order doesn't matter) and merges easily. Removing items is not: a deletion on one concurrent branch can be silently undone during a union-based merge. This is a deliberate trade-off in Dynamo's design, but it means the application must think carefully about add and remove semantics. If your data doesn't naturally support union merges (e.g., a counter, a user's address), you need a different strategy, such as CRDTs, last-write-wins with timestamps, or simply rejecting concurrent writes for that type.
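If resurrection like this is unacceptable for your data, one common alternative (in the CRDT direction mentioned in the note) is to merge deletions explicitly instead of discarding them. A minimal sketch, assuming each version carries an invented added/removed pair of sets; this illustrates the tombstone idea and is not code from the paper:

```python
def merge_with_tombstones(versions: list[dict[str, set[str]]]) -> set[str]:
    """Each version is {'added': {...}, 'removed': {...}}.
    Merge = union of all additions minus union of all removals,
    so a delete on one branch survives the merge.
    (A real CRDT such as an OR-set tags items with unique IDs so that
    re-adding after a removal still works; this sketch skips that.)"""
    added: set[str] = set()
    removed: set[str] = set()
    for v in versions:
        added |= v["added"]
        removed |= v["removed"]
    return added - removed

# The ghost-hat timeline above, with removals tracked as tombstones:
v1 = {"added": {"shoes", "hat"}, "removed": {"hat"}}  # customer removed hat
v2 = {"added": {"shoes", "hat"}, "removed": set()}    # stale replica
print(merge_with_tombstones([v1, v2]))  # {'shoes'}: the hat stays deleted
```

The cost: tombstones must be stored (and eventually garbage-collected), and this naive remove-wins rule means a genuinely re-added item would vanish too. That is exactly the add/remove semantics question the note above warns about.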
Read and Write Flow

The diagrams above show the flow at a high level, but let's walk through what actually happens, step by step, during reads and writes.

Write Path

Step-by-step narration of a PUT request:

1. The client sends the request to any node (via a load balancer) or directly to the coordinator. The coordinator is determined by the ring: it is the first node in the preference list for the key's hash position.
2. The vector clock is updated: the coordinator increments its own counter in the vector clock, creating a new version.
3. The coordinator writes locally, then fans the write out concurrently to the other N-1 nodes in the preference list.
4. The coordinator waits for W acknowledgments. It does NOT wait for all N — just the first W to respond. The remaining nodes that haven't responded yet will get the write eventually (or via hinted handoff if they're down).
5. Once W ACKs arrive, the coordinator returns 200 OK to the client.

Key insight about the write path: the client gets a success response as soon as W nodes confirm. The remaining (N - W) nodes will receive the write asynchronously. All N nodes will have the data, just not necessarily at the same moment.

Read Path

Step-by-step narration of a GET request:

The client sends the request to the coordinator for that key. The coordinator contacts all N nodes in the preference list simultaneously (not just R). This is important — it contacts all N, but only needs R to respond.
The coordinator returns as soon as R nodes have replied, without waiting for the slower ones.

The coordinator then checks all the vector clocks and compares the versions returned:
- If all versions are identical → return the single version immediately.
- If one version's clock dominates the others (it's causally "newer") → return that version.
- If versions are concurrent (neither clock dominates) → return all versions to the client, which must merge them.

Read repair happens in the background: if the coordinator notices that any node returned a stale version, it sends the latest version to that node to bring it up to date.

Why does the client receive the conflict instead of the coordinator resolving it? Because Dynamo is a general-purpose storage engine. It doesn't know whether you're storing a shopping cart, a user profile, or a session token. Only your application knows how to merge two conflicting versions in a way that makes business sense. The coordinator gives you the raw concurrent versions along with the vector clock context, and you do the right thing for your use case.

The vector clock context is the key to closing the loop: when the client writes the merged version back, it must include the context (the merged vector clock). This tells Dynamo that the new write has "seen" all the concurrent versions, so the conflict is resolved. Without it, Dynamo would treat the write as another concurrent write on top of the still-unresolved conflict.

Merkle Trees for Anti-Entropy

The Problem: How Do You Know When Replicas Are Out of Sync?

After a node recovers from a failure, it may have missed some writes. After a network partition heals, two replicas might diverge. How does Dynamo detect and fix these differences?

The brute-force approach would be: "Every hour, compare every key on Node A against Node B, and sync anything that's different." But at Amazon's scale, a single node might store hundreds of millions of keys.
Comparing them all one by one would be so slow and bandwidth-intensive that it would interfere with normal traffic.

Dynamo uses Merkle trees to solve this efficiently. The core idea: instead of comparing individual keys, compare hashes of groups of keys. If the hash matches, that whole group is identical—skip it. Only drill down into groups where hashes differ.

Important: Merkle tree sync is a background anti-entropy mechanism. It's not on the hot read/write path. Normal reads and writes use vector clocks and quorums for versioning. Merkle trees are for the repair process that runs periodically in the background to catch any inconsistencies that slipped through.

How a Merkle Tree Is Built

Each node builds a Merkle tree over its data, organized by key ranges:
- Leaf nodes contain the hash of a small range of actual data keys (e.g., the hash of all values for keys k1, k2, k3).
- Interior nodes contain hashes of their children's hashes.
- The root is a single hash representing all the data on the node.

How to Sync Two Nodes Using Merkle Trees

When Node A and Node B want to check if they're in sync:

Step 1: Compare the root hashes. If they're the same, everything is identical. Done! (No network traffic for the data itself.)
Step 2: If the roots differ, compare their left children. Same? Skip that entire half of the key space.
Step 3: Keep descending only into subtrees where hashes differ, until you reach the leaf nodes.
Step 4: Sync only the specific keys in the differing leaf nodes.

Example: Comparing two nodes

Node A root: abc789 ← differs from Node B!
Node B root: abc788

Compare left subtrees:
  Node A left: xyz123
  Node B left: xyz123 ← same! Skip entire left half.

Compare right subtrees:
  Node A right: def456
  Node B right: def457 ← differs! Go deeper.

Compare right-left subtree:
  Node A right-left: ghi111
  Node B right-left: ghi111 ← same! Skip.

Compare right-right subtree:
  Node A right-right: jkl222
  Node B right-right: jkl333 ← differs! These are leaves.

→ Sync only the keys in the right-right leaf range (e.g., k10, k11, k12)

Instead of comparing all 1 million keys, we compared 6 hashes and synced only 3 keys!

Synchronization process in code:

def sync_replicas(node_a, node_b, key_range):
    """
    Efficiently sync two nodes using Merkle trees.

    Instead of comparing all keys, we compare hashes top-down.
    Only the ranges where hashes differ need actual key-level sync.
    """
    tree_a = node_a.get_merkle_tree(key_range)
    tree_b = node_b.get_merkle_tree(key_range)

    # Step 1: Compare root hashes first.
    # If they match, every key in this range is identical — nothing to do!
    if tree_a.root_hash == tree_b.root_hash:
        return  # Zero data transferred — full match!

    # Step 2: Recursively find differences by traversing top-down.
    # Only descend into subtrees where hashes differ.
    differences = []
    stack = [(tree_a.root, tree_b.root)]
    while stack:
        node_a_subtree, node_b_subtree = stack.pop()

        if node_a_subtree.hash == node_b_subtree.hash:
            continue  # This whole subtree matches — skip it!

        if node_a_subtree.is_leaf:
            # Found a differing leaf — these keys need syncing
            differences.extend(node_a_subtree.keys)
        else:
            # Not a leaf yet — recurse into children
            for child_a, child_b in zip(node_a_subtree.children,
                                        node_b_subtree.children):
                stack.append((child_a, child_b))

    # Step 3: Sync only the specific keys that differed at leaf level.
    # This might be a handful of keys, not millions.
    for key in differences:
        sync_key(node_a, node_b, key)

Why This Is Efficient

The power of a Merkle tree is that the number of hash comparisons you need scales with the depth of the tree (logarithmic in the number of keys), not the number of keys themselves.

Node with 1,000,000 keys:
Naive approach:
  Compare 1,000,000 keys individually
  Cost: 1,000,000 comparisons
Merkle tree:
  Compare O(log N) hashes top-down
  Tree depth ≈ 20 levels
  Cost: 20 comparisons to find differences
  Then sync only the differing leaves (~few keys)
Speedup: ~50,000x fewer comparisons!

And critically, if two nodes are mostly in sync (which is almost always true in a healthy cluster), the root hashes often match exactly and zero data needs to be transferred.

Membership and Failure Detection

Gossip-Based Membership

Dynamo uses a gossip protocol to manage membership. Each node periodically exchanges membership information with random peers. There is no master node: all coordination is fully decentralized.

Key design points:

No single coordinator: each node maintains its own view of cluster membership. There is no central registry, so there is no single point of failure for membership data.

Failure suspicion vs. detection: Dynamo uses an accrual-based failure detector (similar to Phi Accrual). Instead of a binary "alive/dead" verdict, nodes maintain a suspicion level that increases the longer a peer is unresponsive. This avoids false positives from transient network blips.

Node A's view of Node B:
- Last heartbeat: 3 seconds ago → Suspicion low → Healthy
- Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded
- Last heartbeat: 60 seconds ago → Suspicion high → Treat as failed

Decentralized bootstrapping: new nodes contact a seed node to join, and gossip then spreads their presence to the rest of the cluster. Ring membership is eventually consistent: different nodes may have slightly different views of the ring at any given moment, which is acceptable.
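The heartbeat table above can be sketched as a tiny accrual-style classifier. The thresholds and the linear suspicion formula below are invented for illustration (a real Phi Accrual detector fits the heartbeat inter-arrival distribution), so treat this as a sketch of the idea only:

```python
def suspicion_level(seconds_since_heartbeat: float) -> float:
    """Suspicion grows continuously with silence instead of flipping
    a binary alive/dead switch. Units are arbitrary for this sketch."""
    return seconds_since_heartbeat / 10.0

def classify(seconds_since_heartbeat: float) -> str:
    s = suspicion_level(seconds_since_heartbeat)
    if s < 1.0:
        return "healthy"
    if s < 5.0:
        return "degraded"  # likely slow; deprioritize, don't evict yet
    return "failed"        # high suspicion; trigger hinted handoff

for age in (3, 15, 60):
    print(f"{age:>2}s since last heartbeat -> {classify(age)}")
```

With the thresholds chosen here, 3 s maps to healthy, 15 s to degraded, and 60 s to failed, matching the table above: a transient blip raises suspicion briefly without immediately declaring the node dead.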
Performance Characteristics: Real Numbers

The paper provides fascinating performance data.

Latency distribution

Metric | Average | 99.9th Percentile
--------------------|---------|------------------
Read latency | ~10ms | ~200ms
Write latency | ~15ms | ~200ms

Key insight: the 99.9th percentile is ~20x the average!

Why the huge gap? The 99.9th percentile is affected by:
- Garbage collection pauses
- Disk I/O variance
- Network jitter
- Load imbalance

That's why Amazon's SLAs are stated at the 99.9th percentile, not the average.

Conflicting versions

From 24 hours of traffic on Amazon's production shopping cart (per the Dynamo paper). Note: this reflects the specific characteristics of Amazon's workload, not a universal baseline:

99.94% - Saw exactly one version (no conflict)
0.00057% - Saw 2 versions
0.00047% - Saw 3 versions
0.00009% - Saw 4 versions

Takeaway: conflicts are rare in practice! They are most often caused by concurrent writers (bots), not failures.

Partitioning Strategy Evolution

Dynamo evolved through three partitioning strategies. This evolution teaches important lessons:

Strategy 1: Random Tokens (Initial)
Problem: Random token assignment → uneven load
Problem: Adding nodes → expensive data scans
Problem: Can't easily snapshot the system

Operational lesson: random token assignment sounds elegant but is an operational nightmare in practice. Each node gets a random position on the ring, which means wildly different data ownership ranges and uneven load distribution.

Strategy 2: Equal-Sized Partitions + Random Tokens
Improvement: Decouples partitioning from placement
Problem: Still has load balancing issues

Strategy 3: Q/S Tokens per Node (Equal-Sized Partitions + Deterministic Placement, current)

What Q and S mean:

Q = the total number of fixed partitions the ring is divided into (e.g., 1024). Think of them as equal-sized, pre-cut slices of the hash space that never change shape.
S = the number of physical servers currently in the cluster (e.g., 8).

Q/S = how many of those fixed slices each server is responsible for (e.g., 1024 / 8 = 128 partitions per server).

The key change from earlier strategies: the ring is now divided into Q fixed, equal-sized partitions first. Servers no longer get random positions; each one owns exactly Q/S partitions, evenly distributed around the ring.

Example: Q=12 partitions, S=3 servers

Ring divided into 12 equal slices (each covers 30° of the 360° ring):
Partition 1: 0°– 30° → Server A
Partition 2: 30°– 60° → Server B
Partition 3: 60°– 90° → Server C
Partition 4: 90°–120° → Server A
Partition 5: 120°–150° → Server B
Partition 6: 150°–180° → Server C
...and so on, round-robin

Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced.

When a 4th server joins (S becomes 4):
New Q/S = 12/4 = 3 partitions per server.
Each existing server hands off 1 partition to the new server.
Only 3 out of 12 partitions move — the rest are untouched.

This evolution, from random tokens to fixed, equal-sized partitions with balanced ownership, is one of the most instructive operational lessons from Dynamo.

Comparing Dynamo With Modern Systems

System | Consistency Model | Use Case | Dynamo Influence
-------|-------------------|----------|------------------
Cassandra | Tunable (N, R, W) | Time-series, analytics | Direct descendant — heavily inspired by Dynamo; uses the same consistent hashing and quorum concepts
Riak | Tunable, vector clocks | Key-value store | Closest faithful Dynamo implementation
Amazon DynamoDB | Eventually consistent by default | Managed NoSQL | ⚠️ Not the same as Dynamo! A completely different system internally, with no vector clocks and much simpler conflict resolution; shares the name and high-level inspiration only
Voldemort | Tunable | LinkedIn's data store | Open-source Dynamo implementation
Google Spanner | Linearizable | Global SQL | Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization
Redis Cluster | Eventually consistent | Caching, sessions | Uses consistent hashing; much simpler conflict resolution

The DynamoDB confusion: many engineers confuse Amazon DynamoDB with the Dynamo paper. They are very different. DynamoDB is a managed service optimized for operational simplicity. It doesn't expose vector clocks, doesn't use the same partitioning scheme, and uses its own consistency model. The paper is about the internal Dynamo storage engine that predates DynamoDB.

What Dynamo Does NOT Give You

Every senior engineering blog should be honest about limitations. Here's what Dynamo explicitly trades away:

No transactions: operations are single-key only. You can't atomically update multiple keys.
No secondary indexes: you can only look up data by primary key (at least in the original design).

No joins: it's a key-value store. There is no query language.

No global ordering: events on different keys have no guaranteed order.

No linearizability: even with R=W=N, Dynamo does not provide linearizable reads.

No automatic conflict resolution: the system detects conflicts and hands them to the application. The application must resolve them. If your engineers don't understand this, you will have subtle data bugs.

Repair costs at scale: the anti-entropy process (Merkle tree reconciliation) isn't free.

Vector clock growth: in write-heavy environments with many coordinators, vector clocks can grow large enough to require truncation, which introduces potential loss of causality.

Understanding these limitations is key to operating a Dynamo-style system successfully in production.

A Practical Example Implementation

Below is a self-contained Python implementation of the core Dynamo concepts. It's intentionally simplified—no actual networking, no persistence—but it faithfully models how vector clocks, the consistent hash ring, quorum reads/writes, and conflict detection interact. Each component is explained before its code.

Part 1: The Vector Clock

The VectorClock class is the foundation for version tracking. It's just a dictionary mapping node_id → counter. Two key operations:
- increment(node) — bump our own counter when we write
- dominates(other) — check if one clock is causally "after" another; if neither dominates, the writes were concurrent (conflict)

from __future__ import annotations

from dataclasses import dataclass, field

class VectorClock:
    """
    Tracks causality across distributed writes.
    A clock like {"nodeA": 2, "nodeB": 1} means:
    - nodeA has coordinated 2 writes
    - nodeB has coordinated 1 write
    - Any version with these counters has "seen" those writes
    """

    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def increment(self, node_id: str) -> "VectorClock":
        """Return a new clock with node_id's counter bumped by 1."""
        new_clock = self.clock.copy()
        new_clock[node_id] = new_clock.get(node_id, 0) + 1
        return VectorClock(new_clock)

    def dominates(self, other: "VectorClock") -> bool:
        """
        Returns True if self is causally AFTER other.

        self dominates other when:
        - Every counter in self is >= the same counter in other, AND
        - At least one counter in self is strictly greater.

        Meaning: self has seen everything other has seen, plus more.
        """
        all_keys = set(self.clock) | set(other.clock)
        at_least_one_greater = False
        for key in all_keys:
            self_val = self.clock.get(key, 0)
            other_val = other.clock.get(key, 0)
            if self_val < other_val:
                return False  # self is missing something other has seen
            if self_val > other_val:
                at_least_one_greater = True
        return at_least_one_greater

    def merge(self, other: "VectorClock") -> "VectorClock":
        """
        Merge two clocks by taking the max of each counter.

        Used after resolving a conflict to produce a new clock that
        has "seen" everything both conflicting versions saw.
        """
        all_keys = set(self.clock) | set(other.clock)
        merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0))
                  for k in all_keys}
        return VectorClock(merged)

    def __repr__(self):
        return f"VectorClock({self.clock})"

Part 2: The Versioned Value

Every value stored in Dynamo is wrapped with its vector clock. This pairing is what lets the coordinator compare versions during a read and detect conflicts.

@dataclass
class VersionedValue:
    """
    A value paired with its causal history (vector clock).

    When a client reads, they get back a VersionedValue.
    When they write an update, they must include the context
    (the vector clock they read) so Dynamo knows what version
    they're building on top of.
    """
    value: object
    vector_clock: VectorClock

    def __repr__(self):
        return f"VersionedValue(value={self.value!r}, clock={self.vector_clock})"

Part 3: Simulated Nodes

In real Dynamo, each node is a separate process. Here we simulate them as in-memory objects. Key detail: each node has its own local storage, and nodes can be marked down to simulate failures.

class DynamoNode:
    """
    Simulates a single Dynamo storage node.

    In production this would be a separate server with disk storage.
    Here it's an in-memory dict so we can demo the logic without networking.
    """

    def __init__(self, node_id: str, token: int):
        self.node_id = node_id
        self.token = token  # Position on the consistent hash ring
        self.storage: dict[str, list[VersionedValue]] = {}
        self.down = False  # Toggle to simulate node failures

    def write(self, key: str, versioned_value: VersionedValue) -> bool:
        """
        Store a versioned value. Returns False if the node is down.

        We store a LIST of versions per key, because a node might hold
        multiple concurrent (conflicting) versions until they are
        resolved by the application.
        """
        if self.down:
            return False
        # In a real node this would be written to disk (e.g. BerkeleyDB)
        self.storage[key] = [versioned_value]
        return True

    def read(self, key: str) -> list[VersionedValue] | None:
        """
        Return all versions of a key. Returns None if the node is down.
        A healthy node with no data for the key returns an empty list.
        """
        if self.down:
            return None
        return self.storage.get(key, [])

    def __repr__(self):
        status = "DOWN" if self.down else f"token={self.token}"
        return f"DynamoNode({self.node_id}, {status})"

Part 4: The Consistent Hash Ring

We sort nodes by their token (position) and walk clockwise around the ring to find the coordinator and preference list for any key.
import hashlib

class ConsistentHashRing:
    """
    Maps any key to an ordered list of N nodes (the preference list).

    Nodes are placed at fixed positions (tokens) on a conceptual ring
    from 0 to 2^32. A key hashes to a position, then walks clockwise
    to find its nodes.

    This means adding/removing one node only rebalances ~1/N of keys,
    rather than reshuffling everything like modulo hashing would.
    """

    def __init__(self, nodes: list[DynamoNode]):
        # Sort nodes by token so we can do clockwise lookup efficiently
        self.nodes = sorted(nodes, key=lambda n: n.token)

    def _hash(self, key: str) -> int:
        """Consistent hash of a key into the ring's token space."""
        # Use MD5 for a simple, evenly distributed hash.
        # Real Dynamo uses a more sophisticated hash (e.g., SHA-1).
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % (2**32)

    def get_preference_list(self, key: str, n: int) -> list[DynamoNode]:
        """
        Return the first N nodes clockwise from key's hash position.
        These are the nodes responsible for storing this key.

        The first node in the list is the coordinator — it receives
        the client request and fans out to the others.
        """
        if not self.nodes:
            return []

        key_hash = self._hash(key)

        # Find the first node whose token is >= key's hash (clockwise)
        start_idx = 0
        for i, node in enumerate(self.nodes):
            if node.token >= key_hash:
                start_idx = i
                break
        # If key_hash is greater than all tokens, wrap around to node 0
        else:
            start_idx = 0

        # Walk clockwise, collecting N unique nodes
        preference_list = []
        for i in range(len(self.nodes)):
            idx = (start_idx + i) % len(self.nodes)
            preference_list.append(self.nodes[idx])
            if len(preference_list) == n:
                break
        return preference_list

Part 5: The Dynamo Coordinator

This is the heart of the system: the logic that handles client requests, fans out to replicas, waits for quorum, and detects conflicts.

class SimplifiedDynamo:
    """
    Coordinates reads and writes across a cluster of DynamoNodes.

    Any node can act as coordinator for any request — there's no
    dedicated master.
    The coordinator is simply whichever node receives the client request
    (or the first node in the preference list, if using partition-aware
    routing).

    Configuration:
        N = total replicas per key
        R = minimum nodes that must respond to a read (read quorum)
        W = minimum nodes that must acknowledge a write (write quorum)
    """

    def __init__(self, nodes: list[DynamoNode], N: int = 3, R: int = 2, W: int = 2):
        self.N = N
        self.R = R
        self.W = W
        self.ring = ConsistentHashRing(nodes)

    # ------------------------------------------------------------------ #
    # WRITE                                                              #
    # ------------------------------------------------------------------ #
    def put(self, key: str, value: object,
            context: VectorClock | None = None) -> VectorClock:
        """
        Write a key-value pair to N replicas, wait for W ACKs.

        The 'context' is the vector clock from a previous read.
        Always pass context when updating an existing key — it tells
        Dynamo which version you're building on top of, so it can detect
        whether your write is concurrent with anything else.

        Returns the new vector clock, which the caller should store and
        pass back on future writes to this key.

        Raises:
            RuntimeError if fewer than W nodes acknowledged.
        """
        preference_list = self.ring.get_preference_list(key, self.N)
        if not preference_list:
            raise RuntimeError("No nodes available")

        # The coordinator is always the first node in the preference list.
        coordinator = preference_list[0]

        # Increment the coordinator's counter in the vector clock.
        # If no context was provided (brand new key), start a fresh clock.
        base_clock = context if context is not None else VectorClock()
        new_clock = base_clock.increment(coordinator.node_id)
        versioned = VersionedValue(value=value, vector_clock=new_clock)

        # Fan out to all N replicas.
        # In a real system these would be concurrent RPC calls.
        # Here we call them sequentially for simplicity.
        ack_count = 0
        for node in preference_list:
            success = node.write(key, versioned)
            if success:
                ack_count += 1

        # Only need W ACKs to declare success.
        # The remaining replicas are updated asynchronously (or via
        # hinted handoff if they were down).
        if ack_count < self.W:
            raise RuntimeError(
                f"Write quorum not met: got {ack_count} ACKs, needed {self.W}"
            )

        print(f"[PUT] key={key!r} value={value!r} clock={new_clock} "
              f"({ack_count}/{self.N} nodes wrote)")
        return new_clock

    # ------------------------------------------------------------------ #
    # READ                                                               #
    # ------------------------------------------------------------------ #
    def get(self, key: str) -> list[VersionedValue]:
        """
        Read a key from N replicas, wait for R responses, reconcile.

        Returns a LIST of VersionedValues:
        - Length 1  → clean read, no conflict
        - Length >1 → concurrent versions detected; application must merge

        After reading, the caller should:
        1. If no conflict: use the single value normally.
        2. If conflict: merge the values using application logic, then
           call put() with the merged value and the merged vector clock
           as context. This "closes" the conflict.

        Read repair happens in the background: any replica that returned
        a stale version is silently updated with the latest version.
        """
        preference_list = self.ring.get_preference_list(key, self.N)

        # Collect responses from all N nodes
        all_versions: list[VersionedValue] = []
        responding_nodes: list[tuple[DynamoNode, list[VersionedValue]]] = []
        for node in preference_list:
            result = node.read(key)
            if result is not None:  # None means the node is down
                all_versions.extend(result)
                responding_nodes.append((node, result))

        if len(responding_nodes) < self.R:
            raise RuntimeError(
                f"Read quorum not met: got {len(responding_nodes)} responses, "
                f"needed {self.R}"
            )

        # Reconcile: discard any version that is strictly dominated
        # (i.e., is a causal ancestor of) another version.
        # What remains is the set of concurrent versions.
        reconciled = self._reconcile(all_versions)

        # Background read repair: if any node returned something older
        # than the reconciled result, send it the latest version.
        # (Simplified: only meaningful when there's a single winner.)
        if len(reconciled) == 1:
            latest = reconciled[0]
            for node, versions in responding_nodes:
                if not versions or versions[0].vector_clock != latest.vector_clock:
                    node.write(key, latest)  # Repair silently in background

        status = ("clean" if len(reconciled) == 1
                  else f"CONFLICT ({len(reconciled)} versions)")
        print(f"[GET] key={key!r} status={status} "
              f"({len(responding_nodes)}/{self.N} nodes responded)")
        return reconciled

    # ------------------------------------------------------------------ #
    # INTERNAL: VERSION RECONCILIATION                                   #
    # ------------------------------------------------------------------ #
    def _reconcile(self, versions: list[VersionedValue]) -> list[VersionedValue]:
        """
        Remove any version that is a causal ancestor of another version.

        If version A's clock is dominated by version B's clock, then B is
        strictly newer — A adds no new information and can be dropped.

        Whatever remains after pruning are CONCURRENT versions: writes
        that happened without either "knowing about" the other.
        The application must merge these using domain-specific logic.

        Example:
            versions = [clock={A:1}, clock={A:2}, clock={B:1}]
            {A:2} dominates {A:1} → drop {A:1}
            {A:2} and {B:1} are concurrent → both survive
            result = [{A:2}, {B:1}]  ← conflict!
            The application must merge.
        """
        dominated = set()
        for i, v1 in enumerate(versions):
            for j, v2 in enumerate(versions):
                if i != j and v2.vector_clock.dominates(v1.vector_clock):
                    dominated.add(i)  # v1 is an ancestor of v2, discard v1

        survivors = [v for i, v in enumerate(versions) if i not in dominated]

        # De-duplicate: identical clocks from different replicas are the same version
        seen_clocks: list[VectorClock] = []
        unique: list[VersionedValue] = []
        for v in survivors:
            if not any(v.vector_clock.clock == s.clock for s in seen_clocks):
                unique.append(v)
                seen_clocks.append(v.vector_clock)
        return unique if unique else versions

Part 6: Putting It All Together - Demo

Let's run through a complete scenario: a normal write/read, then a simulated conflict where two nodes diverge and the application must merge them.

def demo():
    # ── Setup ────────────────────────────────────────────────────────── #
    # Five nodes placed at evenly spaced positions on the hash ring.
    # In a real cluster these would span multiple datacenters.
    nodes = [
        DynamoNode("node-A", token=100),
        DynamoNode("node-B", token=300),
        DynamoNode("node-C", token=500),
        DynamoNode("node-D", token=700),
        DynamoNode("node-E", token=900),
    ]
    dynamo = SimplifiedDynamo(nodes, N=3, R=2, W=2)

    print("=" * 55)
    print("SCENARIO 1: Normal write and read (no conflict)")
    print("=" * 55)

    # Write the initial shopping cart
    ctx = dynamo.put("cart:user-42", {"items": ["shoes"]})

    # Read it back — should be a clean single version
    versions = dynamo.get("cart:user-42")
    print(f"Read result: {versions[0].value}\n")

    # Update the cart, passing the context from our earlier read.
    # The context tells Dynamo "this write builds on top of clock ctx".
    ctx = dynamo.put("cart:user-42", {"items": ["shoes", "jacket"]}, context=ctx)
    versions = dynamo.get("cart:user-42")
    print(f"After update: {versions[0].value}\n")

    print("=" * 55)
    print("SCENARIO 2: Simulated conflict — two concurrent writes")
    print("=" * 55)

    # Write the base version
    base_ctx = dynamo.put("cart:user-99", {"items": ["hat"]})

    # Now simulate a network partition:
    # node-A and node-B can't talk to each other.
    # We model this by writing directly to individual nodes.
    pref_list = dynamo.ring.get_preference_list("cart:user-99", 3)
    node_1, node_2, node_3 = pref_list[0], pref_list[1], pref_list[2]

    # Write 1: customer adds "scarf" via node_1 (e.g., their laptop)
    clock_1 = base_ctx.increment(node_1.node_id)
    node_1.write("cart:user-99",
                 VersionedValue({"items": ["hat", "scarf"]}, clock_1))

    # Write 2: customer adds "gloves" via node_2 (e.g., their phone)
    # This write also descends from base_ctx, not from clock_1.
    # Neither write knows about the other → they are concurrent.
    clock_2 = base_ctx.increment(node_2.node_id)
    node_2.write("cart:user-99",
                 VersionedValue({"items": ["hat", "gloves"]}, clock_2))

    # Read — coordinator sees two concurrent versions and surfaces the conflict
    versions = dynamo.get("cart:user-99")

    if len(versions) > 1:
        print(f"\nConflict detected! {len(versions)} concurrent versions:")
        for i, v in enumerate(versions):
            print(f"  Version {i+1}: {v.value}  clock={v.vector_clock}")

        # Application-level resolution: union merge (Amazon's shopping cart strategy)
        # Merge items: take the union so no addition is lost
        all_items = set()
        merged_clock = versions[0].vector_clock
        for v in versions:
            all_items.update(v.value["items"])
            merged_clock = merged_clock.merge(v.vector_clock)
        merged_value = {"items": sorted(all_items)}
        print(f"\nMerged result: {merged_value}")

        # Write the resolved version back with the merged clock as context.
        # This "closes" the conflict — future reads will see a single version.
            final_ctx = dynamo.put("cart:user-99", merged_value, context=merged_clock)
            versions = dynamo.get("cart:user-99")
            print(f"\nAfter resolution: {versions[0].value}")
            assert len(versions) == 1, "Should be a single version after merge"

    if __name__ == "__main__":
        demo()

Expected output:

    =======================================================
    SCENARIO 1: Normal write and read (no conflict)
    =======================================================
    [PUT] key='cart:user-42' value={'items': ['shoes']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote)
    [GET] key='cart:user-42' status=clean (3/3 nodes responded)
    Read result: {'items': ['shoes']}

    [PUT] key='cart:user-42' value={'items': ['shoes', 'jacket']} clock=VectorClock({'node-A': 2}) (3/3 nodes wrote)
    [GET] key='cart:user-42' status=clean (3/3 nodes responded)
    After update: {'items': ['shoes', 'jacket']}

    =======================================================
    SCENARIO 2: Simulated conflict — two concurrent writes
    =======================================================
    [PUT] key='cart:user-99' value={'items': ['hat']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote)
    [GET] key='cart:user-99' status=CONFLICT (2 versions) (3/3 nodes responded)

    Conflict detected! 2 concurrent versions:
      Version 1: {'items': ['hat', 'scarf']}  clock=VectorClock({'node-A': 2})
      Version 2: {'items': ['hat', 'gloves']}  clock=VectorClock({'node-A': 1, 'node-B': 1})

    Merged result: {'items': ['gloves', 'hat', 'scarf']}

    [PUT] key='cart:user-99' value={'items': ['gloves', 'hat', 'scarf']} ... (3/3 nodes wrote)
    [GET] key='cart:user-99' status=clean (3/3 nodes responded)

    After resolution: {'items': ['gloves', 'hat', 'scarf']}

In Scenario 2, the coordinator correctly determines that the two clocks are neither equal nor in a dominance relationship – neither is an ancestor of the other – so both surface as concurrent versions. The application then takes responsibility for merging them and writing the resolution back with the merged clock.
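That dominance check can be exercised in isolation. The sketch below is a minimal, illustrative stand-in for the article's VectorClock class (only the `dominates` test is reproduced), applied to the two clocks from Scenario 2:

```python
# Minimal stand-in for the article's VectorClock; illustration only.
class VectorClock:
    def __init__(self, clock=None):
        self.clock = dict(clock or {})

    def dominates(self, other):
        # self dominates other iff self >= other on every entry
        # and strictly greater on at least one.
        at_least_equal = all(self.clock.get(n, 0) >= c
                             for n, c in other.clock.items())
        strictly_greater = any(c > other.clock.get(n, 0)
                               for n, c in self.clock.items())
        return at_least_equal and strictly_greater


a = VectorClock({"node-A": 2})                # the "scarf" write
b = VectorClock({"node-A": 1, "node-B": 1})   # the "gloves" write

# Each clock has an entry the other has not seen, so neither dominates:
# the versions are concurrent and both must be surfaced to the reader.
print(a.dominates(b), b.dominates(a))  # False False
```

Because neither call returns True, reconciliation keeps both versions, which is exactly why the GET in Scenario 2 reports a conflict.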
What to notice: the two clocks, {'node-A': 2} and {'node-A': 1, 'node-B': 1}, are incomparable – each contains an entry the other has not seen.

Key Lessons for System Design

After years of working with Dynamo-inspired systems, here are my key takeaways:

1. Always-On Beats Strongly-Consistent

For user-facing applications, availability almost always wins. Users will tolerate seeing slightly stale data. They will not tolerate "Service Unavailable".

2. Application-Level Reconciliation is Powerful

Don't be afraid to push conflict resolution to the application. The application understands the business logic and can make smarter decisions than the database ever could.

3. Tunable Consistency is Essential

One size doesn't fit all. Shopping cart additions need high availability (W=1). Financial transactions need stronger guarantees (W=N). The ability to tune this per-operation is incredibly valuable.

4. The 99.9th Percentile Matters More Than Average

Focus your optimization effort on tail latencies. That is what users actually experience during peak hours.

5. Gossip Protocols Scale Beautifully

Decentralized coordination via gossip eliminates single points of failure and scales to thousands of nodes.

When NOT to Use Dynamo-Style Systems

Be honest about the trade-offs. Don't use this approach when:

- You need strong consistency (financial transactions, inventory management)
- You need complex queries (reporting, analytics, joins)
- Transactions span multiple items (Dynamo is single-key operations only)
- Your team can't manage eventual consistency (if developers don't understand vector clocks and conflict resolution, you will have problems)

Conclusion

Dynamo represents a fundamental shift in how we think about distributed systems. By embracing eventual consistency and exposing tunable trade-offs, it makes it possible to build systems that scale to massive sizes while maintaining high availability.
Whether you use Cassandra, Riak, or DynamoDB, you are using insights first published in this paper. As engineers, our job is to understand these trade-offs deeply and apply them appropriately. Dynamo gives us a powerful tool, but like any tool, it is only as good as our understanding of when and how to use it.

Further Reading

- The original Dynamo paper: SOSP 2007
- Werner Vogels' blog: All Things Distributed
- Cassandra Documentation: understanding how these concepts are implemented
- "Designing Data-Intensive Applications" by Martin Kleppmann – Chapter 5 on replication

Appendix: Design Problems and Approaches

Three open-ended problems that come up in system design interviews and in real engineering work. Think about each one before reading the discussion.

Problem 1: Conflict Resolution for a Collaborative Document Editor

The problem: You are building something like Google Docs backed by a Dynamo-style store. Two users edit the same paragraph at the same time.

Why shopping cart union doesn't work here: The shopping cart strategy (union of all items) is safe only because adding items is commutative – {A} ∪ {B} = {B} ∪ {A}. If User A deletes a sentence while User B edits the middle of that sentence, the union of their changes is meaningless or contradictory.

The right approach: Operational Transformation (OT) or CRDTs

The industry solution is to represent the document not as a blob of text but as a sequence of operations, and to transform concurrent operations so that both can be applied without conflict:

    User A's operation: delete(position=50, length=20)
    User B's operation: insert(position=60, text="new sentence")

    Without OT: B's insert position (60) is now wrong because A deleted 20 chars.
    With OT: Transform B's operation against A's:
             B's insert position shifts to 40 (60 - 20).
             Both operations now apply cleanly.

The conflict resolution strategy for the Dynamo layer would be:

1. Store operations (not full document snapshots) as the value for each key.
2. On conflict, collect the concurrent operation lists from every version.
3. Apply OT to merge them into a single consistent operation log.
4. Write the merged log back with the merged vector clock as context.

What to store in Dynamo: the operation log per document segment, not the rendered text. That makes merges deterministic and lossless.

Real-world reference: the storage layers behind collaborative editors use either OT or a variant of CRDTs (Conflict-free Replicated Data Types) – data structures mathematically guaranteed to merge without conflict regardless of operation order.

Problem 2: Choosing N, R, W for Different Use Cases

The problem: Which configuration would you choose for (a) a session store, (b) a product catalog, (c) user profiles?

The right way to think about this: identify which failure mode costs more – a lost write (lost data) or a rejected write (unavailability).

Session store — prioritize availability

Sessions are ephemeral and user-specific. If a user's session is briefly stale or lost, they get logged out and sign in again. That is annoying, but not catastrophic.

    N=3, R=1, W=1

    Rationale:
    - W=1: Accept session writes even during heavy failures.
           A user can't log in if their session write is rejected.
    - R=1: Read from any single node. Stale session data is harmless.
    - N=3: Still replicate to 3 nodes for basic durability.

    Trade-off accepted: Stale session reads are possible but inconsequential.

Product catalog — prioritize read performance and consistency

Product data is written rarely (by ops teams) but read millions of times a day. Stale prices or descriptions are a real problem.

    N=3, R=2, W=3

    Rationale:
    - W=3: All replicas must confirm a catalog update before it's live.
           A price change half-published is worse than a brief write delay.
    - R=2: Read quorum overlap with W=3 guarantees fresh data.
           Acceptable: catalog writes are rare, so write latency doesn't matter.
    - N=3: Standard replication for durability.

    Trade-off accepted: Writes are slow and fail if any node is down.
    Acceptable because catalog updates are infrequent.

User profiles — balanced

Profile data (name, email, preferences) is moderately important. A stale profile is annoying but not dangerous. A rejected update (e.g., a user unable to change their email) is a real problem.

    N=3, R=2, W=2

    Rationale:
    - The classic balanced configuration.
    - R + W = 4 > N = 3, so quorums overlap: reads will see the latest write.
    - Tolerates 1 node failure for both reads and writes.
    - Appropriate for data that matters but doesn't require strict consistency.

    Trade-off accepted: A second simultaneous node failure will cause errors.
    Acceptable for non-critical user data.

Decision framework summary:

    Priority              R   W   When to use
    Max availability      1   1   Sessions, ephemeral state, click tracking
    Balanced              2   2   User profiles, preferences, soft state
    Consistent reads      2   3   Catalogs, config, rarely-written reference data
    Highest consistency   3   3   Anywhere you need R+W > N with zero tolerance
                                  for stale reads (still not linearizable)

Problem 3: Testing a Dynamo-Style System Under Partition Scenarios

The problem: How do you verify that your system actually behaves correctly when nodes fail and partitions occur?

This is one of the hardest problems in distributed systems testing, because bugs appear only under specific interleavings of concurrent events that are hard to reproduce deterministically.

Layer 1: Unit tests for the logic in isolation

Before testing distributed behavior, verify the building blocks independently.
Vector clock comparison logic, conflict detection, and reconciliation functions can all be tested with pure unit tests – no network required.

    def test_concurrent_clocks_detected_as_conflict():
        clock_a = VectorClock({"node-A": 2})
        clock_b = VectorClock({"node-B": 2})
        assert not clock_a.dominates(clock_b)
        assert not clock_b.dominates(clock_a)
        # Both survive reconciliation → conflict correctly detected

    def test_ancestor_clock_is_discarded():
        old_clock = VectorClock({"node-A": 1})
        new_clock = VectorClock({"node-A": 3})
        assert new_clock.dominates(old_clock)
        # old_clock should be pruned during reconciliation

Layer 2: Deterministic fault injection

Instead of hoping failures happen in the right order during load testing, inject them deliberately and repeatably. Setting node.down = True is a simple version of this. In production systems, tools such as Jepsen or Chaos Monkey do the same thing at the infrastructure level.

Key scenarios to test:

    Scenario A: Write succeeds with W=2, third replica is down.
    → Verify: the data is readable after the down node recovers.
    → Verify: no data loss occurred.

    Scenario B: Two nodes accept concurrent writes to the same key.
    → Verify: the next read surfaces exactly 2 conflicting versions.
    → Verify: after the application writes a merged version, the next read is clean.

    Scenario C: Node goes down mid-write (wrote to W-1 nodes).
    → Verify: the write is correctly rejected (RuntimeError).
    → Verify: no partial writes are visible to readers.

    Scenario D: All N nodes recover after a full partition.
    → Verify: no data was lost across the cluster.
    → Verify: vector clocks are still meaningful (no spurious conflicts).
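Scenario A can be scripted end to end. The sketch below is self-contained: `Node` and `quorum_write` are minimal stand-ins written for this example (not the article's full DynamoNode/SimplifiedDynamo classes), showing a deterministic `down` flag driving a sloppy W=2 write:

```python
# Self-contained sketch of Scenario A: a W=2 write succeeds while one
# replica is down, and the surviving replicas still hold the data.
# Node and quorum_write are minimal stand-ins for illustration.
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.down = False       # deterministic fault-injection flag
        self.store = {}

    def write(self, key, value):
        if self.down:
            raise ConnectionError(f"{self.node_id} is unreachable")
        self.store[key] = value


def quorum_write(replicas, key, value, W):
    """Write to all reachable replicas; succeed iff at least W acknowledged."""
    acks = 0
    for node in replicas:
        try:
            node.write(key, value)
            acks += 1
        except ConnectionError:
            pass                # skip unreachable replicas
    if acks < W:
        raise RuntimeError(f"only {acks}/{W} replicas acknowledged")
    return acks


replicas = [Node("A"), Node("B"), Node("C")]
replicas[2].down = True         # inject the fault deterministically

acks = quorum_write(replicas, "cart:user-42", {"items": ["shoes"]}, W=2)
print(f"write acked by {acks} replicas")  # write acked by 2 replicas

# Verify: no data loss — both live replicas hold the value; the downed
# node simply missed it (hinted handoff would repair this on recovery).
assert replicas[0].store["cart:user-42"] == {"items": ["shoes"]}
assert "cart:user-42" not in replicas[2].store
```

Taking a second node down and asserting that the same call raises RuntimeError covers Scenario C with the same harness.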
Layer 3: Property-based testing

Instead of writing individual test cases, define invariants that must always hold, and generate thousands of random operation sequences trying to break them:

    # Invariant: after any sequence of writes and merges, a final get()
    # should always return exactly one version (no unresolved conflicts).

    # Invariant: a value written with a context derived from a previous read
    # should never produce a conflict with that read's version
    # (it should dominate it).

    # Invariant: if R + W > N, a value written successfully should always
    # be visible in the next read (read-your-writes, absent concurrent writes).

A tool such as Hypothesis (Python) lets you express these invariants and automatically find counterexamples.

Layer 4: Linearizability checkers

For the highest confidence, record the start time, end time, and result of every operation during a fault-injection test, then feed the history to a linearizability checker. It will tell you whether each observed history is consistent with some correct sequential execution – even for an eventually consistent system operating within its stated guarantees.