A senior engineer’s perspective on building highly available distributed systems

Table of Contents
- Introduction: Why Dynamo changed everything
- The CAP Theorem Trade-off
- Core Architecture Components
  - Consistent Hashing for Partitioning
  - Replication Strategy (N, R, W)
  - Vector Clocks for Versioning
  - Sloppy Quorum and Hinted Handoff
- Conflict Resolution in Practice: The Shopping Cart Problem
- Read and Write Flow
- Merkle Trees for Anti-entropy
- Membership and Failure Detection
- Performance: Real Numbers
- Partitioning Strategy Evolution
- Dynamo's Influence on Modern Systems
- What Dynamo Didn't Solve
- A Working Example
- Key System Design Takeaways
- When to Use Dynamo-Style Systems
- Conclusion
- Appendix: Design Problems and Approaches

This is a long-form piece - the sections are largely self-contained, so feel free to jump straight to whichever ones you need.

Introduction: Why Dynamo changed everything

When Amazon published the Dynamo paper in 2007, it wasn't just an academic contribution. It was a solution battle-tested inside an enormous production business. I remember when I first read this paper - it permanently changed how I approach distributed systems. Dynamo is a distributed key-value storage system. It was built to power Amazon's highest-traffic services, such as the shopping cart and session management systems. No secondary indexes, no joins, no relational semantics - just keys and values, with a relentless focus on availability and scalability. It cannot provide linearizability or global ordering guarantees, even at the strongest quorum settings. If your application needs those properties, Dynamo is the wrong tool.
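The paper boils Dynamo's client interface down to just two operations: get(key), which returns one or more versions of an object plus an opaque context, and put(key, context, object). As a hedged illustration of that shape (the class and method names here are mine, and this toy keeps a single version rather than conflict sets):

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """Opaque version metadata; in real Dynamo this carries a vector clock."""
    clock: dict = field(default_factory=dict)

class DynamoClient:
    """Hypothetical sketch of Dynamo's two-operation interface."""

    def __init__(self):
        self._store = {}  # key -> (value, Context)

    def get(self, key):
        """Return (value, context). Real Dynamo may return several
        conflicting versions; this sketch stores only one."""
        return self._store.get(key, (None, Context()))

    def put(self, key, context, value):
        """Write a new version; the context ties it to what was read."""
        self._store[key] = (value, context)

client = DynamoClient()
_, ctx = client.get("cart:alice")         # read the (empty) cart and its context
client.put("cart:alice", ctx, ["shoes"])  # write back using that context
value, _ = client.get("cart:alice")
print(value)  # ['shoes']
```

Note how put requires the context from a prior get; that read-modify-write loop is what lets the system establish causality between versions later on.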
The core problem Amazon faced was deceptively simple to state, but brutal to solve: if a customer wants to add an item to their shopping cart during a network partition or a server failure, that write must not be rejected. Every rejected write is lost revenue and a frustrated customer. How do you build a storage system that never says “no” to customers?

The CAP Theorem Trade-off: What Dynamo Chose to Sacrifice

Before digging into how Dynamo works, you need to understand the constraint it was designed around.

What is the CAP Theorem?

The CAP theorem describes a fundamental trade-off in distributed systems: when a network partition exists, you must choose between consistency and availability. The three properties are:

- Consistency (C): All nodes see the same data at the same time
- Availability (A): Every request receives a response (success or failure)
- Partition Tolerance (P): The system keeps operating despite network failures

The theorem is often summarized as "pick 2 of 3", but that's an oversimplification. In practice, network partitions are unavoidable at scale, so the real design question is: when partitions occur (and they will), do you sacrifice consistency or availability?

The harsh reality: NETWORK PARTITIONS WILL HAPPEN. Cables get cut, switches fail, datacenters lose connectivity. You can't opt out, so you must choose: consistency or availability?

Traditional databases choose Consistency:

Traditional approach
Database: "I can't guarantee all replicas are consistent, so I'll reject your write to be safe."
Result: Customer sees error, cart is empty
Impact: Lost revenue, poor experience

Dynamo chooses Availability:

Dynamo’s approach
Dynamo: "I'll accept your write with the replicas I can reach. The unreachable replica will catch up later."
Result: Customer sees success, item in cart
Impact: Sale continues, happy customer

The Trade-off, Visualized

When a partition occurs:

Traditional Database: Choose C over A → Sacrifice Availability
- ✓ All replicas always have same data
- ✓ No conflicts to resolve
- ❌ Rejects writes during failures
- ❌ Poor customer experience
- ❌ Lost revenue

Dynamo: Choose A over C → Sacrifice Strong Consistency
- ✓ Accepts writes even during failures
- ✓ Excellent customer experience
- ✓ No lost revenue
- ❌ Replicas might temporarily disagree
- ❌ Application must handle conflicts

A Real Amazon Example: The Black Friday Shopping Cart

It's Black Friday. Millions of customers are shopping. A network cable fails between two datacenters.

With a traditional database:
Time: 10:00 AM - Network partition occurs
Result:
- All shopping cart writes fail
- "Service Unavailable" errors
- Customers can't checkout
- Twitter explodes with complaints
- Estimated lost revenue: $100,000+ per minute

With Dynamo:
Time: 10:00 AM - Network partition occurs
Result:
- Shopping cart writes continue
- Customers see success
- Some carts might have conflicts (rare)
- Application merges conflicting versions
- Estimated lost revenue: $0
- A few edge cases need conflict resolution (acceptable)

Why This Choice Makes Sense for E-Commerce

Amazon did the math:
- Cost of rejecting a write: an immediately lost sale ($50-200)
- Cost of accepting a conflicting write: occasionally merging shopping carts (rare, and cheap to handle)
- Business decision: accept the writes, handle the rare conflict

Types of data where Availability > Consistency:
- Shopping carts (merge conflicting additions)
- Session data (last-write-wins is acceptable)
- User preferences (minor staleness is fine)
- Best-seller lists (approximate is fine)

Types of data where Consistency > Availability:
- Bank account balances (no conflicting balances allowed)
- Inventory counts
(can't oversell)
- Transaction logs (must be durable and ordered)

So no, Dynamo isn't for everything - but for Amazon's e-commerce workloads, choosing availability over strong consistency was unambiguously the right call.

Important nuance: although Dynamo is usually labeled an AP system, it is more accurately a tunable consistency system. Depending on your R and W quorum settings, it can be pushed much closer to CP. The AP label describes its default, recommended configuration for e-commerce workloads.

Core Architecture Components

1. Consistent Hashing for Partitioning

I'll walk through this with a concrete example, because consistent hashing is one of those concepts that looks like magic until you see why it works.

The Problem: Naive Hash-Based Sharding

Suppose you have 3 servers and want to distribute data across them. The naive solution:

# Traditional approach - DON'T DO THIS
def get_server(key, num_servers):
    hash_value = hash(key)
    return hash_value % num_servers  # Modulo operation

# With 3 servers:
get_server("user_123", 3)  # Returns server 0
get_server("user_456", 3)  # Returns server 1
get_server("user_789", 3)  # Returns server 2

This works ... until you add or remove a server. Watch what happens when we go from 3 to 4 servers:

# Before (3 servers):
"user_123" → hash % 3 = 0 → Server 0
"user_456" → hash % 3 = 1 → Server 1
"user_789" → hash % 3 = 2 → Server 2

# After (4 servers):
"user_123" → hash % 4 = 0 → Server 0 ✓ (stayed)
"user_456" → hash % 4 = 1 → Server 1 ✓ (stayed)
"user_789" → hash % 4 = 2 → Server 2 ✓ (stayed)

# But wait - this is lucky!
In reality, most keys MOVE:
"product_ABC" → hash % 3 = 2 → Server 2
"product_ABC" → hash % 4 = 3 → Server 3 ✗ (MOVED!)

The disaster: when you change the number of servers, nearly ALL of your data has to move. Imagine migrating terabytes of data just because you added a single server!

The Solution: Consistent Hashing

Consistent hashing treats the hash space as a circle (0 to 2^32 - 1, wrapping around).

Step 1: Place servers on the ring

Each server is assigned a position on the ring (known as a "token"). Think of it as placing markers around a circular race track.

Step 2: Place data on the ring

To store a piece of data, you:
1. Hash the key to find its position on the ring
2. Walk clockwise from that position
3. Store the data on the first server you encounter

The Full Ring

Picture the complete ring. Keys travel clockwise to the next server.

Simple rule: a key walks clockwise until it hits a server. That server owns the key.

Examples:
- user_123 at 30° → walks to 45° → Server A owns it
- user_456 at 150° → walks to 200° → Server C owns it
- cart_789 at 250° → walks to 280° → Server D owns it
- product_ABC at 300° → walks to 360°, wraps to 0°, continues to 45° → Server A owns it

Who owns what range?
- Server A (45°): owns everything from 281° to 45° (wraps)
- Server B (120°): owns everything from 46° to 120°
- Server C (200°): owns everything from 121° to 200°
- Server D (280°): owns everything from 201° to 280°

The Magic: Adding a Server

Here's where it gets good. Let's add Server E at position 160°:

BEFORE:
Server A (45°)  → owns 281°-45°
Server B (120°) → owns 46°-120°
Server C (200°) → owns 121°-200° ← THIS RANGE WILL SPLIT
Server D (280°) → owns 201°-280°

AFTER:
Server A (45°)  → owns 281°-45°  ← NO CHANGE
Server B (120°) → owns 46°-120°  ← NO CHANGE
Server E (160°) → owns 121°-160° ← NEW!
Takes part of C's range
Server C (200°) → owns 161°-200° ← SMALLER range
Server D (280°) → owns 201°-280° ← NO CHANGE

Result: only the keys in the 121°-160° range move (from C to E). Servers A, B, and D are completely untouched!

The Virtual Nodes Optimization

There's a serious problem with the basic consistent hashing approach: random distribution can be extremely uneven.

The Problem in Detail:

If you randomly pick a single position for each server, it's like throwing darts at a circular board. Sometimes the darts cluster; sometimes they spread out. Clustering creates hotspots. Here's a plausible bad draw:

Scenario: 4 servers with single random tokens
Server A: 10°  }
Server B: 25°  } ← Only 15° apart! Tiny ranges
Server C: 100° }
Server D: 280° ← 180° away from C! Huge range

Range sizes:
- Server A owns: 281° to 10° = 89° (25% of ring)
- Server B owns: 11° to 25° = 14° (4% of ring) ← Underutilized!
- Server C owns: 26° to 100° = 74° (21% of ring)
- Server D owns: 101° to 280° = 179° (50% of ring) ← Overloaded!

Real-world consequences:

Uneven load: Server D handles 50% of the traffic while Server B handles only 4%. That means:
- Server D's CPU, disk, and network are saturated
- Server B sits mostly idle (wasted capacity)
- Your 99.9th percentile latency is dominated by the overloaded Server D

Cascading hotspots: if Server D slows down or fails:
- All 50% of its load shifts to the next server clockwise
- That server is overwhelmed in turn
- The failure cascades through the ring

Inefficient scaling: adding servers doesn't reliably help, because a new server's random token may land in an already-tiny range.

Dynamo's solution: each physical server is assigned many virtual positions (tokens) on the ring. Instead of throwing one dart per server, we throw many. The more darts, the smoother the distribution (the law of large numbers at work).
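The walk-clockwise rule is only a few lines of code. Here's a minimal, illustrative ring - my own sketch, using md5 as a stable hash function, a sorted token list, and a binary search for the clockwise walk - that also demonstrates how few keys move when a server is added:

```python
import hashlib
from bisect import bisect_right

def ring_hash(key: str) -> int:
    """Stable hash mapped onto the 0 .. 2^32 - 1 ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, servers, vnodes=1):
        # Each server gets `vnodes` tokens placed pseudo-randomly on the ring.
        self.tokens = sorted(
            (ring_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.tokens]

    def owner(self, key: str) -> str:
        """Walk clockwise from hash(key) to the first token; wrap at the top."""
        idx = bisect_right(self.positions, ring_hash(key)) % len(self.tokens)
        return self.tokens[idx][1]

ring3 = HashRing(["A", "B", "C"], vnodes=3)
before = {k: ring3.owner(k) for k in (f"user_{i}" for i in range(1000))}

ring4 = HashRing(["A", "B", "C", "D"], vnodes=3)  # add Server D
moved = sum(1 for k, old in before.items() if ring4.owner(k) != old)
print(f"{moved / len(before):.0%} of keys moved, all of them to D")
```

By construction, adding D's tokens can only reassign keys into D's new arcs: every key that moved is now owned by D, and everything else stays exactly where it was.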
How Virtual Nodes Fix the Problem:

Take the same 4 servers, but now each server gets 3 tokens (virtual nodes) instead of 1:

Physical Server A gets 3 tokens: 10°, 95°, 270°
Physical Server B gets 3 tokens: 25°, 180°, 310°
Physical Server C gets 3 tokens: 55°, 150°, 320°
Physical Server D gets 3 tokens: 75°, 200°, 340°

Now the ring looks like:
10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B, 200° D, 270° A, 310° B, 320° C, 340° D

Range sizes (each token owns the arc from the previous token up to itself):
- Server A total: 30° + 20° + 70° = 120° (33% of ring)
- Server B total: 15° + 30° + 40° = 85° (24% of ring)
- Server C total: 30° + 55° + 10° = 95° (26% of ring)
- Server D total: 20° + 20° + 20° = 60° (17% of ring)

Load now ranges from 17% to 33% of the ring, instead of 4% to 50%. Much better!

Why this works:

Statistics: with more samples (tokens), the random distribution averages out. This is the law of large numbers in action.

Granular load distribution: when a server fails, its load is distributed across many servers, not just one neighbor:

Server A fails:
- Its token at 10° → load shifts to Server B's token at 25°
- Its token at 95° → load shifts to Server C's token at 150°
- Its token at 270° → load shifts to Server B's token at 310°

Result: The load is spread across multiple servers!

Smooth scaling: when a new server joins with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server.

Real Dynamo configurations:

The paper describes partitioning strategies that evolved over time. In production:
- Early versions: 100-200 virtual nodes per physical server
- Later optimized to: Q/S tokens per node (where Q = total partitions, S = number of servers)
- Typical setup: each physical server might have 128-256 virtual nodes

The Trade-off: Balance vs Overhead

Virtual nodes improve load balance, but they come at a cost.
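Before looking at the costs, it's worth convincing yourself the balancing effect is real. This throwaway simulation is my own (random token placement; the exact numbers depend on the seed) and measures each server's share of the ring as the token count grows:

```python
import random

def load_shares(num_servers=4, tokens_per_server=1, ring=2**32, seed=1):
    """Return each server's fraction of the ring, with random token placement."""
    rng = random.Random(seed)
    tokens = sorted(
        (rng.randrange(ring), server)
        for server in range(num_servers)
        for _ in range(tokens_per_server)
    )
    shares = [0] * num_servers
    for i, (pos, server) in enumerate(tokens):
        prev = tokens[i - 1][0]                # i = 0 wraps around to the last token
        shares[server] += (pos - prev) % ring  # arc this token owns
    return [s / ring for s in shares]

for t in (1, 3, 128):
    shares = load_shares(tokens_per_server=t)
    print(f"{t:>3} tokens/server: min share {min(shares):.0%}, max share {max(shares):.0%}")
```

With a single token per server the min/max spread is typically dramatic; by 128 tokens per server the shares cluster tightly around 25% each.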
What you gain with more virtual nodes:

With 1 token per server (4 servers):
Load spread: 4% to 50% ❌

With 3 tokens per server (12 virtual nodes):
Load spread: 17% to 33% ✓

With 128 tokens per server (512 virtual nodes):
Load spread: roughly 24% to 26% ✓✓

What it costs:

Metadata size: every node stores routing information for every token.
- 1 token per server: track 4 entries
- 128 tokens per server: track 512 entries

Gossip overhead: nodes continuously exchange membership data. Every second, nodes gossip their view of the ring, and more tokens means more data to sync between nodes.

Rebalancing complexity: when nodes join or leave, more virtual nodes means more partition transfers to coordinate - but each individual transfer is smaller (much better for bootstrapping).

Dynamo's evolution:

The paper shows how Amazon iterated toward a better design:

Strategy 1 (Initial):
- 100-200 random tokens per server
- Problem: Huge metadata (multiple MB per node)
- Problem: Slow bootstrapping (had to scan for specific key ranges)

Strategy 3 (Current):
- Q/S tokens per server (Q=total partitions, S=number of servers)
- Equal-sized partitions
- Example: 1024 partitions / 8 servers = 128 tokens per server
- Benefit: Metadata reduced to KB
- Benefit: Fast bootstrapping (transfer whole partition files)

Real production sweet spot:

Most Dynamo deployments use 128-256 virtual nodes per physical server. This achieves:
- Load distribution within 10-15% variance (good enough)
- Metadata overhead under 100KB per node (negligible)
- Fast failure recovery (load spreads across many nodes)

Why not more? Diminishing returns. Going from 128 to 512 tokens only improves load balance by 2-3%, but doubles metadata size and gossip traffic.

Key concept: physical servers (top) map to multiple virtual positions (bottom) on the ring.
This distributes each server's load across different parts of the hash space.

Benefits:
- More even load distribution
- When a server fails, its load is absorbed by many servers (not just one neighbor)
- When a server joins, it takes a small slice of load from many servers

Real-World Impact Comparison

Let's make the difference concrete with real numbers:

Traditional Hashing (3 servers → 4 servers):
- Keys that need to move: ~75% (3 out of 4)
- Example: 1 million keys → 750,000 keys must migrate

Consistent Hashing (3 servers → 4 servers):
- Keys that need to move: ~25% (1 out of 4)
- Example: 1 million keys → 250,000 keys must migrate

With Virtual Nodes (150 vnodes total → 200 vnodes):
- Keys that need to move: ~12.5% (spread evenly)
- Example: 1 million keys → 125,000 keys must migrate
- Load is balanced across all servers

The “Aha!” Moment

The key insight is this: consistent hashing decouples the hash space from the number of servers.

Traditional:
server = hash(key) % num_servers   ← num_servers is in the formula!

Consistent:
server = ring.findNextClockwise(hash(key))   ← num_servers is NOT in the formula!

That's why adding or removing a server only touches a small fraction of the data. Hash values never change - only which server "owns" which range changes, and only locally.

Think of it as a circular running track with water stations (the servers). When you add a new station, only the runners between the previous station and the new one are affected. Everyone else keeps going to the same station they always did.

2. Replication Strategy (N, R, W)

The Problem: Availability vs Consistency Trade-off

Imagine we're building Amazon's shopping cart.
A customer adds an item to their cart, but at that exact moment:
- One server is being rebooted for maintenance
- Another server is suffering a network hiccup
- A third server is perfectly fine

Traditional database approach (strong consistency):

Client: "Add this item to my cart"
Database: "I need ALL replicas to confirm before I say yes"
Server 1: ✗ (rebooting)
Server 2: ✗ (network issue)
Server 3: ✓ (healthy)
Result: "Sorry, service unavailable. Try again later."

Customer experience: "Why can't I add items to my cart on Black Friday?!"

That's unacceptable in e-commerce. Every rejected write is lost revenue.

Dynamo's Solution: Tunable Quorums

Dynamo gives you three knobs to tune the exact trade-off you want:

N: Number of replicas (how many copies of the data exist)
R: Read quorum (how many replicas must respond for a successful read)
W: Write quorum (how many replicas must acknowledge for a successful write)

The magic formula: R + W > N

Let's see what this means with concrete, real-world configurations:

Scenario 1: Shopping Cart (Prioritize Availability)

N = 3  # Three replicas for durability
R = 1  # Read from any single healthy node
W = 1  # Write to any single healthy node

# Trade-off analysis:
# ✓ Writes succeed even if 2 out of 3 nodes are down
# ✓ Reads succeed even if 2 out of 3 nodes are down
# ✓ Maximum availability - never reject customer actions
# ✗ Might read stale data
# ✗ Higher chance of conflicts (but we can merge shopping
carts) What happens during failure: Client: "Add item to cart" Coordinator tries N=3 nodes: - Node 1: ✗ Down - Node 2: ✓ ACK (W=1 satisfied!) - Node 3: Still waiting... Result: SUCCESS returned to client immediately Node 3 eventually gets the update (eventual consistency) Scenario 2: Session State (Balanced Approach) N = 3 R = 2 # Must read from 2 nodes W = 2 # Must write to 2 nodes # Trade-off analysis: # ✓ R + W = 4 > N = 3 → Read-your-writes guaranteed # ✓ Tolerates 1 node failure # ✓ Good balance of consistency and availability # ✗ Write fails if 2 nodes are down # ✗ Read fails if 2 nodes are down Why R + W > N enables read-your-writes: Write to W=2 nodes: [A, B] Later, read from R=2 nodes: [B, C] Because W + R = 4 > N = 3, there's guaranteed overlap! At least one node (B in this case) will have the latest data. The coordinator detects the newest version by comparing vector clocks. This guarantees seeing the latest write as long as reconciliation picks the causally most-recent version correctly. Scenario 3: Financial Data (Prioritize Consistency) N = 3 R = 3 # Must read from ALL nodes W = 3 # Must write to ALL nodes # Trade-off analysis: # ✓ Full replica quorum — reduces likelihood of divergent versions # ✓ Any read will overlap every write quorum # ✗ Write fails if ANY node is down # ✗ Read fails if ANY node is down # ✗ Poor availability during failures Systems requiring strict transactional guarantees typically choose CP systems instead. This configuration is technically supported by Dynamo but sacrifices the availability properties that motivate using it in the first place. 
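The overlap guarantee behind R + W > N is easy to verify by brute force for small clusters. Here's a quick sketch of my own (not from the paper) that enumerates every possible write set and read set and checks whether they always intersect:

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every write quorum of size w intersects every read quorum of size r."""
    nodes = range(n)
    return all(
        set(ws) & set(rs)                  # empty intersection -> falsy -> fails
        for ws in combinations(nodes, w)
        for rs in combinations(nodes, r)
    )

for n, r, w in [(3, 1, 1), (3, 2, 2), (3, 3, 3), (3, 1, 3), (3, 3, 1)]:
    status = "overlap guaranteed" if quorums_always_overlap(n, r, w) else "may miss latest write"
    print(f"N={n} R={r} W={w}: {status}")
```

Configurations like N=3, R=2, W=2 pass this check; N=3, R=1, W=1 fails it, which is exactly why that configuration can return stale reads.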
Configuration Comparison Table

Config             N  R  W  Availability     Consistency  Use Case
High Availability  3  1  1  ⭐⭐⭐⭐⭐       ⭐⭐          Shopping cart, wish list
Balanced           3  2  2  ⭐⭐⭐⭐         ⭐⭐⭐⭐      Session state, user preferences
Full Quorum        3  3  3  ⭐⭐             ⭐⭐⭐⭐⭐    High-stakes reads (not linearizable)
Read-Heavy         3  1  3  ⭐⭐⭐ (reads)   ⭐⭐⭐⭐      Product catalog, CDN metadata
Write-Heavy        3  3  1  ⭐⭐⭐ (writes)  ⭐⭐⭐        Click tracking, metrics

Note on financial systems: Systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn’t use Dynamo. That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business logic layer.

The Key Insight

With the recommended production configuration N=3, R=2, W=2:

Durability: Can tolerate up to 2 replica failures before permanent data loss (assuming independent failures and no correlated outages).
Availability: Tolerates 1 node failure for both reads and writes.
Consistency: R + W > N guarantees that read and write quorums overlap, enabling read-your-writes behavior in the absence of concurrent writes.
Performance: Don't wait for the slowest node (only need 2 out of 3).

Real production numbers from the paper:

Amazon's shopping cart service during peak load (holiday season):
- Configuration: N=3, R=2, W=2
- Served millions of requests
- Over 3 million checkouts in a single day
- No downtime, even through server failures

This tunability is what makes Dynamo remarkable. You aren't stuck with one-size-fits-all - you tune the system to match your business requirements.

3. Vector Clocks for Versioning

The Problem: Detecting Causality in Distributed Systems

When multiple nodes can accept writes independently, you need to answer a critical question: are these two versions of the same data related, or were they created concurrently?

Why timestamps don’t work:

Scenario: Two users edit the same shopping cart simultaneously
User 1 at 10:00:01.500 AM: Adds item A → Writes to Node X
User 2 at 10:00:01.501 AM: Adds item B → Writes to Node Y

Physical timestamp says: User 2's version is "newer"
Reality: These are concurrent! Both should be kept!

Problem:
- Clocks on different servers are NEVER perfectly synchronized
- Clock skew can be seconds or even minutes
- Network delays are unpredictable
- Physical time doesn't capture causality

What we really need to know:
- Version A happened before Version B? → B can overwrite A
- Version A and B are concurrent? → Keep both, merge later
- Version A came from reading Version B? → We can track this!

The Solution: Vector Clocks

A vector clock is a simple data structure: a list of (node_id, counter) pairs that tracks which nodes have seen which versions.

The rules:
1. When a node writes data, it increments its own counter
2. When a node reads data, it receives the current vector clock and passes it back with its next write
3. When comparing two vector clocks:
   - If all counters in A ≤ the corresponding counters in B → A is an ancestor of B (B is newer)
   - If some counters in A > B and some in B > A → A and B are concurrent (conflict!)
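Those comparison rules map almost one-to-one onto code. A minimal sketch (my own helper; a clock is just a dict of node id to counter, with missing nodes treated as 0):

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks using the rules above.
    Clocks are dicts of node_id -> counter; a missing node counts as 0."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"    # a happened before b; b may safely overwrite a
    if b_le_a:
        return "descendant"  # b happened before a
    return "concurrent"      # true conflict: keep both versions

print(compare({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1}))  # concurrent
print(compare({"Sx": 1}, {"Sx": 2, "Sy": 1}))           # ancestor
```

The first call is exactly the D3/D4 situation worked through in the step-by-step example that follows: neither clock dominates, so both versions must survive.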
Step-by-Step Example

Let's trace a shopping cart through multiple updates:

Breaking down the conflict:

D3: [Sx:2, Sy:1]  vs  D4: [Sx:2, Sz:1]

Comparing:
- Sx: 2 == 2 ✓ (equal)
- Sy: 1 vs missing in D4 → D3 has something D4 doesn't
- Sz: missing in D3 vs 1 → D4 has something D3 doesn't

Conclusion: CONCURRENT! Neither is an ancestor of the other.
Both versions must be kept and merged.

Real-World Characteristics

The Dynamo paper reports the following conflict distribution, measured over 24 hours of Amazon's production shopping cart traffic. These numbers reflect Amazon's specific workload (high read/write ratio, mostly single-user sessions) and should not be assumed to generalize to all Dynamo deployments:

99.94%   - Single version (no conflict)
0.00057% - 2 versions
0.00047% - 3 versions
0.00009% - 4 versions

Key insight: conflicts are RARE in practice!

Why conflicts happen:
- Not usually from network failures
- Usually from many concurrent writers (often automated processes or bots hammering the same key)
- Human users rarely create conflicts because they're slow compared to network speed

The Growth Problem

Vector clocks can grow without bound when many different nodes coordinate writes to the same key. Dynamo's solution: when a clock exceeds a size threshold, truncate the oldest entries.

// When vector clock exceeds threshold (e.g., 10 entries)
// Remove the oldest entry based on wall-clock timestamp
vectorClock = {
  'Sx': {counter: 5, timestamp: 1609459200},
  'Sy': {counter: 3, timestamp: 1609459800},
  'Sz': {counter: 2, timestamp: 1609460400},
  // ... 10 more entries
}
// If size > 10, remove entry with oldest timestamp
// ⚠ Risk: Dropping an entry collapses causality information.
// Two versions that were causally related may now appear
// concurrent, forcing the application to resolve a conflict
// that didn't actually exist.
// In practice, Amazon reports this has not been a significant
// problem - but it is a real theoretical risk in high-churn
// write environments with many distinct coordinators.

4. Sloppy Quorum and Hinted Handoff

The Problem: Strict Quorums Kill Availability

Traditional quorum systems are rigid and unforgiving.

Traditional strict quorum:

Your data is stored on nodes: A, B, C (preference list)
Write requirement: W = 2

Scenario: Node B is down for maintenance
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✓
Result: SUCCESS (got 2 out of 3)

Scenario: Nodes B AND C are down
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✗ (down)
Result: FAILURE (only got 1 out of 3)
Customer: "Why can't I add items to my cart?!" 😡

The problem: strict quorums require specific nodes. If those specific nodes are down, the system becomes unavailable.

Real scenario at Amazon:

Black Friday, 2:00 PM
- Datacenter 1: 20% of nodes being rebooted (rolling deployment)
- Datacenter 2: Network hiccup (1-2% packet loss)
- Traffic: 10x normal load

With strict quorum:
- 15% of write requests fail
- Customer support phones explode
- Revenue impact: Millions per hour

The Solution: Sloppy Quorum

Dynamo relaxes the quorum requirement: “Write to the first N healthy nodes in the preference list, walking further down the ring if needed.”

Preference list for key K: A, B, C
But B is down...

Sloppy Quorum says: "Don't give up! Walk further down the ring: A, B, C, D, E, F, ..."

Coordinator walks until N=3 healthy nodes are found: A, C, D
(D is a temporary substitute for B)

How Hinted Handoff Works

When a node accepts a write on behalf of a failed node, it stores a hint along with the data.
Detailed Hinted Handoff Process

Step 1: Detect failure and substitute

def write_with_hinted_handoff(key, value, N, W):
    preference_list = get_preference_list(key)  # [A, B, C]
    healthy_nodes = []  # list of (node, is_hint) pairs
    for node in preference_list:
        if is_healthy(node):
            healthy_nodes.append((node, False))

    # If we don't have N healthy nodes, expand the list
    if len(healthy_nodes) < N:
        extended_list = get_extended_preference_list(key)
        for node in extended_list:
            if node not in preference_list and is_healthy(node):
                healthy_nodes.append((node, True))  # substitute node
            if len(healthy_nodes) >= N:
                break

    # Write to first N healthy nodes
    acks = 0
    for node, is_hint in healthy_nodes[:N]:
        if is_hint:
            # Store with hint metadata naming the intended node
            intended_node = find_intended_node(preference_list, node)
            success = node.write_hinted(key, value, hint=intended_node)
        else:
            success = node.write(key, value)
        if success:
            acks += 1

    if acks >= W:
        return SUCCESS
    return FAILURE

Step 2: Background hint transfer

# Runs periodically on each node (e.g., every 10 seconds)
def transfer_hints():
    hints_db = get_hinted_replicas()
    for hint in hints_db:
        intended_node = hint.intended_for
        if is_healthy(intended_node):
            try:
                intended_node.write(hint.key, hint.value)
                hints_db.delete(hint)
                log(f"Successfully transferred hint to {intended_node}")
            except Exception:
                log(f"Will retry later for {intended_node}")

Why This Is Brilliant

Durability maintained:

Even though B is down:
- We still have N=3 copies: A, C, D
- Data won't be lost even if another node fails
- System maintains durability guarantee

Availability maximized:

Client perspective:
- Write succeeds immediately
- No error message
- No retry needed
- Customer happy

Traditional quorum would have failed:
- Only 2 preferred nodes available (A, C)
- Need 3 for N=3
- Write rejected
- Customer sees error

Eventual consistency:

Timeline:
T=0:          Write succeeds (A, C, D with hint)
T=0-5min:     B is down, but system works fine
T=5min:       B recovers
T=5min+10sec: D detects B is back, transfers hint
T=5min+11sec: B has the data, D deletes
the hint

Result: Eventually, all correct replicas have the data

Example Configuration

// High availability configuration
const config = {
  N: 3,  // Want 3 replicas
  W: 2,  // Only need 2 ACKs to succeed
  R: 2,  // Read from 2 nodes

  // Sloppy quorum allows expanding the preference list
  sloppy_quorum: true,

  // How far to expand when looking for healthy nodes
  max_extended_preference_list: 10,

  // How often to check for hint transfers
  hint_transfer_interval: 10_seconds,

  // How long to keep trying to transfer hints
  hint_retention: 7_days
};

Real-World Impact

From Amazon’s production experience:

During normal operation:
- Hinted handoff rarely triggered
- Most writes go to preferred nodes
- The hints database stays nearly empty

During failures:

Scenario: 5% of nodes failing at any time (normal at Amazon's scale)

Without hinted handoff:
- Write success rate: 85%
- Customer impact: 15% of cart additions fail

With hinted handoff:
- Write success rate: 99.9%+
- Customer impact: Nearly zero

During datacenter failure:

Scenario: Entire datacenter unreachable (33% of nodes)

Without hinted handoff:
- Many keys would lose their entire preference list
- Massive write failures
- System effectively down

With hinted handoff:
- Writes redirect to other datacenters
- Hints accumulate temporarily
- When the datacenter recovers, hints transfer back
- Zero customer-visible failures

The Trade-off

Benefits:
✓ Maximum write availability
✓ Durability maintained during failures
✓ Automatic recovery when nodes come back
✓ No manual intervention required

Costs:
✗ Temporary inconsistency (data sits on the "wrong" nodes)
✗ Extra storage for the hints database
✗ Background bandwidth for hint transfers
✗ Slightly more complex code
✗ Hinted handoff provides temporary durability, not permanent replication: if a substitute node (like D) fails before it can transfer its hint back to B, the number of true replicas drops below N until the situation resolves. This is an important edge case to understand in failure planning.
Amazon's verdict: the availability benefits far outweigh the costs for e-commerce workloads.

Conflict Resolution in Practice: The Shopping Cart Problem

Let's talk about the most famous example from the paper: the shopping cart. This is where rubber meets road.

What Is a Conflict (and Why Does It Happen)?

A conflict occurs when two writes happen to the same key on different nodes, without either write "knowing about" the other. This is only possible because Dynamo accepts writes even when nodes can't communicate—which is the whole point!

Here's a concrete sequence of events that creates a conflict:

Timeline:
T=0: Customer logs in. Cart has {shoes} on all 3 nodes.
T=1: Network partition: Node1 can't talk to Node2.
T=2: Customer adds {jacket} on their laptop → goes to Node1.
     Cart on Node1: {shoes, jacket}  ← Vector clock: [N1:2]
T=3: Customer adds {hat} on their phone → goes to Node2.
     Cart on Node2: {shoes, hat}     ← Vector clock: [N2:2]
T=4: Network heals. Node1 and Node2 compare notes.
     Node1 says: "I have version [N1:2]"
     Node2 says: "I have version [N2:2]"
     Neither clock dominates the other → CONFLICT!

Neither version is "wrong"—both capture legitimate changes the customer actually made. Dynamo's job is to detect this (using vector clocks) and return both versions to the application so the application can decide what to do.

What Does the Application Do With a Conflict?

This is the crucial part: the application must resolve conflicts using business logic. Dynamo hands back every surviving version; your code decides how to merge them.

For the shopping cart, Amazon chose a union merge: keep all items from all concurrent versions. The rationale is simple—losing an item from a customer's cart (missing a sale) is worse than occasionally showing a stale item they already deleted.
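To make the "neither clock dominates" test in the timeline above concrete, here is a minimal sketch of the comparison — a stripped-down version of the vector-clock logic using plain dicts, not any class from the full implementation later in the post:

```python
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if clock `a` has seen everything `b` has seen, plus at least one more write."""
    keys = set(a) | set(b)
    return (all(a.get(k, 0) >= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) > b.get(k, 0) for k in keys))

def concurrent(a: dict[str, int], b: dict[str, int]) -> bool:
    """Concurrent = neither clock dominates -> conflict; surface both versions."""
    return not dominates(a, b) and not dominates(b, a)

# The partition scenario above: Node1 and Node2 each coordinated one write
v1 = {"N1": 2}   # {shoes, jacket} written via Node1
v2 = {"N2": 2}   # {shoes, hat} written via Node2

print(concurrent(v1, v2))        # True — conflict, the app must merge
print(dominates({"N1": 3}, v1))  # True — a later write via N1 supersedes v1
```

When `concurrent()` returns True, Dynamo cannot pick a winner on its own — which is exactly why both versions are handed to the application.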
Conflict versions:
Version A (from Node1): {shoes, jacket}
Version B (from Node2): {shoes, hat}

Merge strategy: union
Merged cart: {shoes, jacket, hat}  ← All items preserved

Here's the actual reconciliation code:

from __future__ import annotations
from dataclasses import dataclass, field

class VectorClock:
    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def merge(self, other: "VectorClock") -> "VectorClock":
        """Merged clock = max of each node's counter across both versions."""
        all_keys = set(self.clock) | set(other.clock)
        merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys}
        return VectorClock(merged)

    def __repr__(self):
        return f"VectorClock({self.clock})"

@dataclass
class ShoppingCart:
    items: list[str] = field(default_factory=list)
    vector_clock: VectorClock = field(default_factory=VectorClock)

    @staticmethod
    def reconcile(carts: list["ShoppingCart"]) -> "ShoppingCart":
        if len(carts) == 1:
            return carts[0]  # No conflict, nothing to do

        # Merge strategy: union of all items (never lose additions).
        # This is Amazon's choice for shopping carts.
        # A different application might choose last-write-wins or something else.
        all_items: set[str] = set()
        merged_clock = VectorClock()
        for cart in carts:
            all_items.update(cart.items)  # Union: keep everything
            merged_clock = merged_clock.merge(cart.vector_clock)

        return ShoppingCart(items=sorted(all_items), vector_clock=merged_clock)

# Example conflict scenario
cart1 = ShoppingCart(items=["shoes", "jacket"], vector_clock=VectorClock({"N1": 2}))
cart2 = ShoppingCart(items=["shoes", "hat"], vector_clock=VectorClock({"N2": 2}))

# Dynamo detected a conflict and passes both versions to our reconcile()
reconciled = ShoppingCart.reconcile([cart1, cart2])
print(reconciled.items)  # ['hat', 'jacket', 'shoes'] — union!

The Deletion Problem (Why This Gets Tricky)

The union strategy has a nasty edge case: deleted items can come back from the dead.

T=0: Cart: {shoes, hat}
T=1: Customer removes hat → Cart: {shoes}    Clock: [N1:3]
T=2: Network partition — Node2 still has old state
T=3: Some concurrent write to Node2           Clock: [N2:3]
T=4: Network heals → conflict detected
T=5: Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat}

Result: Hat is BACK! Customer removed it, but it reappeared.

Amazon consciously accepted this trade-off. A "ghost" item reappearing in the cart is a minor annoyance. A cart addition disappearing during a Black Friday sale is lost revenue.

Engineering depth note: Merge logic must be domain-specific and carefully designed. Adding items is commutative (order doesn't matter) and easy to merge. Removing items is not—a deletion in one concurrent branch may be silently ignored during a union-based merge. This is an intentional trade-off in Dynamo's design, but it means the application must reason carefully about add vs. remove semantics. If your data doesn't naturally support union merges (e.g., a counter, a user's address), you need a different strategy—such as CRDTs, last-write-wins with timestamps, or simply rejecting concurrent writes for that data type.

Read and Write Flow

The architecture above shows the high-level picture, but let's walk step by step through what actually happens during reads and writes. This is usually where the remaining questions get answered.

The Write Path

Step-by-step narration of a PUT request:

1. The client sends the request to the coordinator node — the coordinator is the first node in the preference list for the key's hash position on the ring.
2. The vector clock is updated — the coordinator increments its own counter in the vector clock, creating a new version.
3. The coordinator writes locally, then forwards the write to the remaining N−1 nodes in the preference list in parallel.
4. The coordinator waits for W acknowledgments. It does NOT wait for all N — just the first W to respond. The remaining nodes that haven't responded yet will get the write eventually (or via hinted handoff if they're down).
5. Once W ACKs arrive, the coordinator returns 200 OK to the client. From the client's perspective, the write is done.

Key insight about the write path: the client gets a success response as soon as W nodes confirm. The other (N – W) nodes will receive the write asynchronously. This is why the system is "eventually consistent"—all nodes get the data, just not necessarily at the same moment.

The Read Path

Step-by-step narration of a GET request:

1. The client sends the request to the coordinator for that key.
2. The coordinator sends read requests to all N nodes in the preference list simultaneously (not just R). This is important — it contacts all N, but only needs R to respond.
3. The coordinator waits for R responses, then proceeds without waiting for the slower nodes.
4. The coordinator checks all the vector clocks and compares the versions returned:
If all versions are identical → return the single version immediately.
If one version's clock dominates the others (it's causally "newer") → return that version.
If versions are concurrent (neither clock dominates) → return all versions to the client, which must merge them.

Read repair happens in the background: if the coordinator noticed any node returned a stale version, it sends the latest version to that node to bring it up to date.

Why does the client receive the conflict instead of the coordinator resolving it? Because Dynamo is a general-purpose storage engine. It doesn't know whether you're storing a shopping cart, a user profile, or a session token. Only your application knows how to merge two conflicting versions in a way that makes business sense. The coordinator hands you the raw concurrent versions along with the vector clock context, and you do the right thing for your use case.

The vector clock context is the key to closing the loop: when the client writes the merged version back, it must include the context (the merged vector clock). This tells Dynamo that the new write has "seen" all the concurrent versions, so the conflict is resolved. Without this context, Dynamo would treat the merged write as yet another concurrent version, and the conflict would persist.

Merkle Trees for Anti-Entropy

The Problem: How Do You Know When Replicas Are Out of Sync?

After a node recovers from a failure, it may have missed some writes. After a network partition heals, two replicas might diverge. How does Dynamo detect and fix these differences?

The brute-force approach would be: "Every hour, compare every key on Node A against Node B, and sync anything that's different." But at Amazon's scale, a single node might store hundreds of millions of keys. Comparing them all one by one would be so slow and bandwidth-intensive that it would interfere with normal traffic.

The key idea: instead of comparing every key individually, compare hashes of groups of keys. If a group's hash matches, the whole group matches—done.
Only descend into groups where the hashes differ. Dynamo uses Merkle trees to do this efficiently.

Important background: Merkle tree sync is an anti-entropy mechanism. It's not on the hot read/write path. Normal reads and writes use vector clocks and quorums for versioning. Merkle trees are for the repair process that runs periodically in the background to catch any inconsistencies that slipped through.

How a Merkle Tree Is Built

Each node builds a Merkle tree over the data it stores, organized by key range:

Leaf nodes contain the hash of a small range of actual data keys (e.g., the hash of all values for keys k1, k2, k3).
Internal nodes contain the hash of their children's hashes.
The root is a single hash that covers all the data on the node.

How Two Nodes Sync Using Merkle Trees

When Node A and Node B want to verify they're in sync:

Step 1: Compare root hashes. If they're the same, everything is identical. Done! (No network traffic for the data itself.)
Step 2: If roots differ, compare their left children. Same? Skip that entire half of the key space.
Step 3: Recurse only into the subtrees whose hashes differ, all the way down to the leaf nodes.
Step 4: Sync only the specific keys in the differing leaf nodes.

Example: Comparing two nodes

Node A root: abc789  ← differs from Node B!
Node B root: abc788

Compare left subtrees:
Node A left: xyz123
Node B left: xyz123  ← same! Skip entire left half.

Compare right subtrees:
Node A right: def456
Node B right: def457  ← differs! Go deeper.

Compare right-left subtree:
Node A right-left: ghi111
Node B right-left: ghi111  ← same! Skip.
Compare right-right subtree:
Node A right-right: jkl222
Node B right-right: jkl333  ← differs! These are leaves.

→ Sync only the keys in the right-right leaf range (e.g., k10, k11, k12)

Instead of comparing all 1 million keys, we compared 6 hashes and synced only 3 keys!

Synchronization process in code:

def sync_replicas(node_a, node_b, key_range):
    """
    Efficiently sync two nodes using Merkle trees.

    Instead of comparing all keys, we compare hashes top-down.
    Only the ranges where hashes differ need actual key-level sync.
    """
    tree_a = node_a.get_merkle_tree(key_range)
    tree_b = node_b.get_merkle_tree(key_range)

    # Step 1: Compare root hashes first.
    # If they match, every key in this range is identical — nothing to do!
    if tree_a.root_hash == tree_b.root_hash:
        return  # Zero data transferred — full match!

    # Step 2: Recursively find differences by traversing top-down.
    # Only descend into subtrees where hashes differ.
    differences = []
    stack = [(tree_a.root, tree_b.root)]
    while stack:
        node_a_subtree, node_b_subtree = stack.pop()
        if node_a_subtree.hash == node_b_subtree.hash:
            continue  # This whole subtree matches — skip it!
        if node_a_subtree.is_leaf:
            # Found a differing leaf — these keys need syncing
            differences.extend(node_a_subtree.keys)
        else:
            # Not a leaf yet — recurse into children
            for child_a, child_b in zip(node_a_subtree.children, node_b_subtree.children):
                stack.append((child_a, child_b))

    # Step 3: Sync only the specific keys that differed at leaf level.
    # This might be a handful of keys, not millions.
    for key in differences:
        sync_key(node_a, node_b, key)

Why This Is Efficient

The power of Merkle trees is that the number of hash comparisons scales with the depth of the tree (logarithmic in the number of keys), not with the number of keys itself.

Node with 1,000,000 keys:

Naive approach:
- Compare 1,000,000 keys individually
- Cost: 1,000,000 comparisons

Merkle tree:
- Compare O(log N) hashes top-down
- Tree depth ≈ 20 levels
- Cost: ~20 comparisons to find differences
- Then sync only the differing leaves (~a few keys)

Speedup: ~50,000x fewer comparisons!

Better yet, when two nodes are mostly in sync (the common case in a healthy cluster), the root hashes match immediately and zero data needs to be transferred. Anti-entropy is nearly free in a healthy system.

Membership and Failure Detection

Gossip-Based Membership

Dynamo uses a gossip protocol to manage membership. Each node periodically exchanges membership information with random peers. There is no master node—membership is fully decentralized.

Key Design Points

No single coordinator: every node maintains its own (eventually consistent) view of cluster membership.

Failure suspicion vs. detection: Dynamo uses an accrual-style failure detector (in the spirit of Phi Accrual). Instead of a binary "alive/dead" judgment, nodes track a suspicion level that rises the longer a peer fails to respond. This avoids expensive false positives.

Suspicion levels:

Node A's view of Node B:
- Last heartbeat: 3 seconds ago  → Suspicion low    → Healthy
- Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded
- Last heartbeat: 60 seconds ago → Suspicion high   → Treat as failed

Decentralized bootstrapping: new nodes contact a seed node to join, and gossip propagates the change to the rest of the cluster. Ring membership changes are explicit and deliberate—a temporarily unreachable node is not automatically removed, which avoids needless rebalancing.

Performance: Real Numbers

The paper provides fascinating performance data.
Let me break it down:

Latency

Metric              | Average | 99.9th Percentile
--------------------|---------|------------------
Read latency        | ~10ms   | ~200ms
Write latency       | ~15ms   | ~200ms

Key insight: the 99.9th percentile is ~20x the average!

Why the huge gap? The 99.9th percentile is affected by:

Garbage collection pauses
Disk I/O variations
Network jitter
Load imbalance

This is why Amazon SLAs are specified at the 99.9th percentile, not the average.

Version Conflicts

From 24 hours of Amazon's production shopping-cart traffic (served by Dynamo). Note that these numbers reflect Amazon's specific workload characteristics, not a universal baseline:

99.94%   - Saw exactly one version (no conflict)
0.00057% - Saw 2 versions
0.00047% - Saw 3 versions
0.00009% - Saw 4 versions

Takeaway: conflicts are vanishingly rare in practice! They are mostly caused by concurrent automated writers (robots), not by humans.

Partitioning Strategy Evolution

Dynamo evolved through three partitioning strategies. The evolution teaches key operational lessons:

Strategy 1: Random Tokens (Initial)

Problem: Random token assignment → uneven load
Problem: Adding nodes → expensive data scans
Problem: Can't easily snapshot the system

Operational lesson: random token assignment sounds elegant but is a nightmare in practice. Each node gets a random position on the ring, which means wildly different data ownership ranges and uneven load distribution.

Strategy 2: Equal-Size Partitions + Random Tokens

Improvement: Decouples partitioning from placement
Problem: Still has load balancing issues

Strategy 3: Q/S Tokens Per Node — Equal-Sized Partitions + Deterministic Placement (Current)

What Q and S mean:

Q = the total number of fixed, equal-size partitions the ring is divided into (e.g., 1024). Think of these as fixed slices of the hash space that never change.
S = the number of physical servers in the cluster (e.g., 8).
Q/S = how many of those fixed slices each server is responsible for (e.g., 1024 / 8 = 128 partitions per server).

The key changes from the earlier strategies: the ring is divided into Q fixed, equal-size partitions, and servers no longer claim random positions—each server owns exactly Q/S partitions, assigned deterministically and evenly.

Example: Q=12 partitions, S=3 servers

Ring divided into 12 equal slices (each covers 30° of the 360° ring):

Partition 1:  0°– 30°   → Server A
Partition 2:  30°– 60°  → Server B
Partition 3:  60°– 90°  → Server C
Partition 4:  90°–120°  → Server A
Partition 5:  120°–150° → Server B
Partition 6:  150°–180° → Server C
...and so on, round-robin

Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced.

When a 4th server joins (S becomes 4):
New Q/S = 12/4 = 3 partitions per server.
Each existing server hands off 1 partition to the new server.
Only 3 out of 12 partitions move — the rest are untouched.

This evolution — from random tokens to fixed, equal-sized partitions with balanced ownership — is one of the most instructive operational learnings from Dynamo. The early approach prioritized simplicity of implementation; the later approach prioritized operational simplicity and predictability.

Comparing Dynamo to Modern Systems

System          | Consistency Model                | Use Case               | Dynamo Influence
----------------|----------------------------------|------------------------|-----------------
Cassandra       | Tunable (N, R, W)                | Time-series, analytics | Direct descendant — heavily inspired by Dynamo; uses the same consistent hashing and quorum concepts
Riak            | Tunable, vector clocks           | Key-value store        | Closest faithful Dynamo implementation
Amazon DynamoDB | Eventually consistent by default | Managed NoSQL          | ⚠️ Not the same as Dynamo! A completely different system internally, with no vector clocks and much simpler conflict resolution. Shares the name and high-level inspiration only.
Voldemort        | Tunable               | LinkedIn's data store | Open-source Dynamo implementation
Google Spanner   | Linearizable          | Global SQL            | Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization
Redis Cluster    | Eventually consistent | Caching, sessions     | Uses consistent hashing; much simpler conflict resolution

Many engineers conflate Amazon DynamoDB with the Dynamo paper. They are very different. DynamoDB is a managed service optimized for operational simplicity. It does not expose vector clocks, does not use the same partitioning scheme, and uses a proprietary consistency model. The paper is about the internal Dynamo storage engine that predates DynamoDB.
The DynamoDB confusion is worth calling out explicitly.

What Dynamo Can't Do

Every senior engineer should be candid about limitations. Here is where Dynamo deliberately compromises:

No secondary indexes: you can only look up data by its primary key (at least in the original design). You can't query by arbitrary attributes.

No joins: it's a key-value store. There is no query language.

No global ordering: writes to different keys happen with no guaranteed ordering.

No linearizability: even at the strictest quorum settings, Dynamo cannot guarantee linearizable reads—there is no global clock and no serializability.

No automatic conflict resolution: the system detects conflicts and surfaces them to the application; the application must resolve them. If your engineers don't expect this, you will get subtle data bugs.

Repair costs at scale: the anti-entropy process (Merkle tree reconciliation) is not free. At large scale, background repair traffic can be significant.

Vector clock growth: in high-churn write environments with many coordinators, vector clocks can grow large enough to require truncation, which risks discarding causality information.

Understanding these limitations is critical to successfully operating Dynamo-style systems in production.

A Working Implementation

Below is a self-contained Python implementation of Dynamo's core ideas: vector clocks, consistent hashing, quorum reads and writes, and application-level conflict resolution.

Part 1: The Vector Clock

The VectorClock class is the foundation of version tracking. It is just a dictionary mapping node_id → counter, with two key operations: increment(node) — bump our own counter when we write; and dominates(other) — check whether one clock is causally "after" another. If neither dominates, the writes were concurrent (a conflict).

from __future__ import annotations
from dataclasses import dataclass, field

class VectorClock:
    """
    Tracks causality across distributed writes.

    A clock like {"nodeA": 2, "nodeB": 1} means:
    - nodeA has coordinated 2 writes
    - nodeB has coordinated 1 write
    - Any version with these counters has "seen" those writes
    """
    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def increment(self, node_id: str) -> "VectorClock":
        """Return a new clock with node_id's counter bumped by 1."""
        new_clock = self.clock.copy()
        new_clock[node_id] = new_clock.get(node_id, 0) + 1
        return VectorClock(new_clock)

    def dominates(self, other: "VectorClock") -> bool:
        """
        Returns True if self is causally AFTER other.

        self dominates other when:
        - Every counter in self is >= the same counter in other, AND
        - At least one counter in self is strictly greater.

        Meaning: self has seen everything other has seen, plus more.
        """
        all_keys = set(self.clock) | set(other.clock)
        at_least_one_greater = False
        for key in all_keys:
            self_val = self.clock.get(key, 0)
            other_val = other.clock.get(key, 0)
            if self_val < other_val:
                return False  # self is missing something other has seen
            if self_val > other_val:
                at_least_one_greater = True
        return at_least_one_greater

    def merge(self, other: "VectorClock") -> "VectorClock":
        """
        Merge two clocks by taking the max of each counter.

        Used after resolving a conflict to produce a new clock that
        has "seen" everything both conflicting versions saw.
""" all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" Part 2: Versioned Value Konke impahla elawulwa ku-Dynamo iyahlukaniswa ne-vector clock yayo. Lokhu ukuhanjiswa kuyinto okuvumela i-coordinator ukuhanjiswa kwezinguquko ngesikhathi sokufunda nokufunda imibuzo. @dataclass class VersionedValue: """ A value paired with its causal history (vector clock). When a client reads, they get back a VersionedValue. When they write an update, they must include the context (the vector clock they read) so Dynamo knows what version they're building on top of. """ value: object vector_clock: VectorClock def __repr__(self): return f"VersionedValue(value={self.value!r}, clock={self.vector_clock})" Part 3: Simulated Node In real Dynamo each node is a separate process. Here we simulate them as in-memory objects. The key detail: each node has its own local dict. Nodes can be marked as Ukubonisa imiphumela. storage down class DynamoNode: """ Simulates a single Dynamo storage node. In production this would be a separate server with disk storage. Here it's an in-memory dict so we can demo the logic without networking. """ def __init__(self, node_id: str, token: int): self.node_id = node_id self.token = token # Position on the consistent hash ring self.storage: dict[str, list[VersionedValue]] = {} self.down = False # Toggle to simulate node failures def write(self, key: str, versioned_value: VersionedValue) -> bool: """ Store a versioned value. Returns False if the node is down. We store a LIST of versions per key, because a node might hold multiple concurrent (conflicting) versions until they are resolved by the application. """ if self.down: return False # In a real node this would be written to disk (e.g. 
BerkeleyDB) self.storage[key] = [versioned_value] return True def read(self, key: str) -> list[VersionedValue] | None: """ Return all versions of a key. Returns None if the node is down. A healthy node with no data for the key returns an empty list. """ if self.down: return None return self.storage.get(key, []) def __repr__(self): status = "DOWN" if self.down else f"token={self.token}" return f"DynamoNode({self.node_id}, {status})" Isigaba 4: I-Consistent Hash Ring I-ring ibheka i-keys kuya kuma-nodes. Sihlanganisa ama-nodes ngokusebenzisa i-token (i-position) yayo futhi usebenzisa isitimela se-clockwise ukufumana i-coordinator ne-preference list ye-key eyodwa. import hashlib class ConsistentHashRing: """ Maps any key to an ordered list of N nodes (the preference list). Nodes are placed at fixed positions (tokens) on a conceptual ring from 0 to 2^32. A key hashes to a position, then walks clockwise to find its nodes. This means adding/removing one node only rebalances ~1/N of keys, rather than reshuffling everything like modulo hashing would. """ def __init__(self, nodes: list[DynamoNode]): # Sort nodes by token so we can do clockwise lookup efficiently self.nodes = sorted(nodes, key=lambda n: n.token) def _hash(self, key: str) -> int: """Consistent hash of a key into the ring's token space.""" # Use MD5 for a simple, evenly distributed hash. # Real Dynamo uses a more sophisticated hash (e.g., SHA-1). digest = hashlib.md5(key.encode()).hexdigest() return int(digest, 16) % (2**32) def get_preference_list(self, key: str, n: int) -> list[DynamoNode]: """ Return the first N nodes clockwise from key's hash position. These are the nodes responsible for storing this key. The first node in the list is the coordinator — it receives the client request and fans out to the others. 
""" if not self.nodes: return [] key_hash = self._hash(key) # Find the first node whose token is >= key's hash (clockwise) start_idx = 0 for i, node in enumerate(self.nodes): if node.token >= key_hash: start_idx = i break # If key_hash is greater than all tokens, wrap around to node 0 else: start_idx = 0 # Walk clockwise, collecting N unique nodes preference_list = [] for i in range(len(self.nodes)): idx = (start_idx + i) % len(self.nodes) preference_list.append(self.nodes[idx]) if len(preference_list) == n: break return preference_list Part 5: The Dynamo Coordinator Kuyinto isisindo se-system - i-logic eyenza imibuzo ye-client, ama-fans e-replicas, ama-quorum, ne-detecting conflicts. Cindezela okuhlobene; lokhu kuyinto lapho zonke izimo ezidlulileyo zihlanganisa. class SimplifiedDynamo: """ Coordinates reads and writes across a cluster of DynamoNodes. Any node can act as coordinator for any request — there's no dedicated master. The coordinator is simply whichever node receives the client request (or the first node in the preference list, if using partition-aware routing). Configuration: N = total replicas per key R = minimum nodes that must respond to a read (read quorum) W = minimum nodes that must acknowledge a write (write quorum) """ def __init__(self, nodes: list[DynamoNode], N: int = 3, R: int = 2, W: int = 2): self.N = N self.R = R self.W = W self.ring = ConsistentHashRing(nodes) # ------------------------------------------------------------------ # # WRITE # # ------------------------------------------------------------------ # def put(self, key: str, value: object, context: VectorClock | None = None) -> VectorClock: """ Write a key-value pair to N replicas, wait for W ACKs. The 'context' is the vector clock from a previous read. Always pass context when updating an existing key — it tells Dynamo which version you're building on top of, so it can detect whether your write is concurrent with anything else. 
        Returns the new vector clock, which the caller should store and
        pass back on future writes to this key.

        Raises: RuntimeError if fewer than W nodes acknowledged.
        """
        preference_list = self.ring.get_preference_list(key, self.N)
        if not preference_list:
            raise RuntimeError("No nodes available")

        # The coordinator is always the first node in the preference list.
        coordinator = preference_list[0]

        # Increment the coordinator's counter in the vector clock.
        # If no context was provided (brand new key), start a fresh clock.
        base_clock = context if context is not None else VectorClock()
        new_clock = base_clock.increment(coordinator.node_id)
        versioned = VersionedValue(value=value, vector_clock=new_clock)

        # Fan out to all N replicas.
        # In a real system these would be concurrent RPC calls.
        # Here we call them sequentially for simplicity.
        ack_count = 0
        for node in preference_list:
            success = node.write(key, versioned)
            if success:
                ack_count += 1

        # Only need W ACKs to declare success.
        # The remaining replicas are updated asynchronously (or via
        # hinted handoff if they were down).
        if ack_count < self.W:
            raise RuntimeError(
                f"Write quorum not met: got {ack_count} ACKs, needed {self.W}"
            )

        print(f"[PUT] key={key!r} value={value!r} clock={new_clock} "
              f"({ack_count}/{self.N} nodes wrote)")
        return new_clock

    # ------------------------------------------------------------------ #
    # READ                                                               #
    # ------------------------------------------------------------------ #
    def get(self, key: str) -> list[VersionedValue]:
        """
        Read a key from N replicas, wait for R responses, reconcile.

        Returns a LIST of VersionedValues:
        - Length 1  → clean read, no conflict
        - Length >1 → concurrent versions detected; application must merge

        After reading, the caller should:
        1. If no conflict: use the single value normally.
        2. If conflict: merge the values using application logic, then
           call put() with the merged value and the merged vector clock
           as context. This "closes" the conflict.

        Read repair happens in the background: any replica that returned
        a stale version is silently updated with the latest version.
        """
        preference_list = self.ring.get_preference_list(key, self.N)

        # Collect responses from all N nodes
        all_versions: list[VersionedValue] = []
        responding_nodes: list[tuple[DynamoNode, list[VersionedValue]]] = []
        for node in preference_list:
            result = node.read(key)
            if result is not None:  # None means the node is down
                all_versions.extend(result)
                responding_nodes.append((node, result))

        if len(responding_nodes) < self.R:
            raise RuntimeError(
                f"Read quorum not met: got {len(responding_nodes)} responses, needed {self.R}"
            )

        # Reconcile: discard any version that is strictly dominated
        # (i.e., is a causal ancestor of) another version.
        # What remains is the set of concurrent versions.
        reconciled = self._reconcile(all_versions)

        # Background read repair: if any node returned something older
        # than the reconciled result, send it the latest version.
        # (Simplified: only meaningful when there's a single winner.)
        if len(reconciled) == 1:
            latest = reconciled[0]
            for node, versions in responding_nodes:
                if not versions or versions[0].vector_clock != latest.vector_clock:
                    node.write(key, latest)  # Repair silently in background

        status = "clean" if len(reconciled) == 1 else f"CONFLICT ({len(reconciled)} versions)"
        print(f"[GET] key={key!r} status={status} "
              f"({len(responding_nodes)}/{self.N} nodes responded)")
        return reconciled

    # ------------------------------------------------------------------ #
    # INTERNAL: VERSION RECONCILIATION                                   #
    # ------------------------------------------------------------------ #
    def _reconcile(self, versions: list[VersionedValue]) -> list[VersionedValue]:
        """
        Remove any version that is a causal ancestor of another version.

        If version A's clock is dominated by version B's clock, then B is
        strictly newer — A adds no new information and can be dropped.
        Whatever remains after pruning are CONCURRENT versions: writes
        that happened without either "knowing about" the other. The
        application must merge these using domain-specific logic.

        Example:
            versions = [clock={A:1}, clock={A:2}, clock={B:1}]
            {A:2} dominates {A:1}          → drop {A:1}
            {A:2} and {B:1} are concurrent → both survive
            result = [{A:2}, {B:1}]  ← conflict! application must merge
        """
        dominated = set()
        for i, v1 in enumerate(versions):
            for j, v2 in enumerate(versions):
                if i != j and v2.vector_clock.dominates(v1.vector_clock):
                    dominated.add(i)  # v1 is an ancestor of v2, discard v1

        survivors = [v for i, v in enumerate(versions) if i not in dominated]

        # De-duplicate: identical clocks from different replicas are the same version
        seen_clocks: list[VectorClock] = []
        unique: list[VersionedValue] = []
        for v in survivors:
            if not any(v.vector_clock.clock == s.clock for s in seen_clocks):
                unique.append(v)
                seen_clocks.append(v.vector_clock)

        return unique if unique else versions

Part 6: Putting It All Together — A Demo

Let's walk through a complete scenario: a normal write/read cycle, then a simulated conflict in which two concurrent writes diverge and the application has to merge them.

def demo():
    # ── Setup ────────────────────────────────────────────────────────── #
    # Five nodes placed at evenly spaced positions on the hash ring.
    # In a real cluster these would span multiple datacenters.
    nodes = [
        DynamoNode("node-A", token=100),
        DynamoNode("node-B", token=300),
        DynamoNode("node-C", token=500),
        DynamoNode("node-D", token=700),
        DynamoNode("node-E", token=900),
    ]
    dynamo = SimplifiedDynamo(nodes, N=3, R=2, W=2)

    print("=" * 55)
    print("SCENARIO 1: Normal write and read (no conflict)")
    print("=" * 55)

    # Write the initial shopping cart
    ctx = dynamo.put("cart:user-42", {"items": ["shoes"]})

    # Read it back — should be a clean single version
    versions = dynamo.get("cart:user-42")
    print(f"Read result: {versions[0].value}\n")

    # Update the cart, passing the context from our earlier read.
    # The context tells Dynamo "this write builds on top of clock ctx".
    ctx = dynamo.put("cart:user-42", {"items": ["shoes", "jacket"]}, context=ctx)
    versions = dynamo.get("cart:user-42")
    print(f"After update: {versions[0].value}\n")

    print("=" * 55)
    print("SCENARIO 2: Simulated conflict — two concurrent writes")
    print("=" * 55)

    # Write the base version
    base_ctx = dynamo.put("cart:user-99", {"items": ["hat"]})

    # Now simulate a network partition:
    # node-A and node-B can't talk to each other.
    # We model this by writing directly to individual nodes.
    pref_list = dynamo.ring.get_preference_list("cart:user-99", 3)
    node_1, node_2, node_3 = pref_list[0], pref_list[1], pref_list[2]

    # Write 1: customer adds "scarf" via node_1 (e.g., their laptop)
    clock_1 = base_ctx.increment(node_1.node_id)
    node_1.write("cart:user-99", VersionedValue({"items": ["hat", "scarf"]}, clock_1))

    # Write 2: customer adds "gloves" via node_2 (e.g., their phone)
    # This write also descends from base_ctx, not from clock_1.
    # Neither write knows about the other → they are concurrent.
    clock_2 = base_ctx.increment(node_2.node_id)
    node_2.write("cart:user-99", VersionedValue({"items": ["hat", "gloves"]}, clock_2))

    # Read — coordinator sees two concurrent versions and surfaces the conflict
    versions = dynamo.get("cart:user-99")
    if len(versions) > 1:
        print(f"\nConflict detected! {len(versions)} concurrent versions:")
        for i, v in enumerate(versions):
            print(f"  Version {i+1}: {v.value} clock={v.vector_clock}")

    # Application-level resolution: union merge (Amazon's shopping cart strategy)
    # Merge items: take the union so no addition is lost
    all_items = set()
    merged_clock = versions[0].vector_clock
    for v in versions:
        all_items.update(v.value["items"])
        merged_clock = merged_clock.merge(v.vector_clock)
    merged_value = {"items": sorted(all_items)}
    print(f"\nMerged result: {merged_value}")

    # Write the resolved version back with the merged clock as context.
    # This "closes" the conflict — future reads will see a single version.
    final_ctx = dynamo.put("cart:user-99", merged_value, context=merged_clock)
    versions = dynamo.get("cart:user-99")
    print(f"\nAfter resolution: {versions[0].value}")
    assert len(versions) == 1, "Should be a single version after merge"

if __name__ == "__main__":
    demo()

Expected output:

=======================================================
SCENARIO 1: Normal write and read (no conflict)
=======================================================
[PUT] key='cart:user-42' value={'items': ['shoes']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote)
[GET] key='cart:user-42' status=clean (3/3 nodes responded)
Read result: {'items': ['shoes']}

[PUT] key='cart:user-42' value={'items': ['shoes', 'jacket']} clock=VectorClock({'node-A': 2}) (3/3 nodes wrote)
[GET] key='cart:user-42' status=clean (3/3 nodes responded)
After update: {'items': ['shoes', 'jacket']}

=======================================================
SCENARIO 2: Simulated conflict — two concurrent writes
=======================================================
[PUT] key='cart:user-99' value={'items': ['hat']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote)
[GET] key='cart:user-99' status=CONFLICT (2 versions) (3/3 nodes responded)

Conflict detected! 2 concurrent versions:
  Version 1: {'items': ['hat', 'scarf']} clock=VectorClock({'node-A': 2})
  Version 2: {'items': ['hat', 'gloves']} clock=VectorClock({'node-A': 1, 'node-B': 1})

Merged result: {'items': ['gloves', 'hat', 'scarf']}

[PUT] key='cart:user-99' value={'items': ['gloves', 'hat', 'scarf']} ... (3/3 nodes wrote)
[GET] key='cart:user-99' status=clean (3/3 nodes responded)

After resolution: {'items': ['gloves', 'hat', 'scarf']}

In Scenario 2, the coordinator saw that neither version dominated the other, so instead of silently picking a winner it surfaced both to the application, which merged them and wrote the result back to all replicas.
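The conflict detection in Scenario 2 hinges entirely on the vector clock comparison. As a self-contained sketch (using plain dicts, separate from the article's VectorClock class), the dominance rule looks like this:

```python
# Self-contained sketch of the vector-clock dominance rule (plain dicts,
# not the article's VectorClock class): clock `a` dominates clock `b`
# when a is >= b on every node and strictly greater on at least one.
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) > b.get(n, 0) for n in nodes))

v_scarf = {"node-A": 2}                # the "hat, scarf" version
v_gloves = {"node-A": 1, "node-B": 1}  # the "hat, gloves" version

# Neither clock dominates the other → the writes are concurrent,
# so reconciliation keeps both and surfaces a conflict.
print(dominates(v_scarf, v_gloves))  # False
print(dominates(v_gloves, v_scarf))  # False
```

A dominated version is safe to drop because every event it records is already reflected in the dominating clock; only incomparable clocks survive as a conflict.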
What to notice: the two surviving clocks, {'node-A': 2} and {'node-A': 1, 'node-B': 1}, are incomparable — neither dominates the other. That is precisely how the system detects that the writes were concurrent.

Key Lessons for System Design

After years of working with Dynamo-inspired systems, these are the takeaways I keep coming back to:

1. Always-On Beats Strongly-Consistent
For user-facing systems, availability usually wins. Users forgive slightly stale data; they do not forgive "Service Unavailable."

2. Application-Level Reconciliation is Powerful
Don't be afraid to push conflict resolution up into the application. The application understands the business logic and can make far better merge decisions than the storage layer can.

3. Tunable Consistency is Essential
A shopping cart wants high availability (W=1). A financial transaction wants strong guarantees (W=N). Being able to tune this per operation is enormously valuable.

4. The 99.9th Percentile Matters More Than Average
Aim your optimization effort at tail latencies. That is where real users are suffering, even when the averages look healthy.

5. Gossip Protocols Scale Beautifully
Decentralized membership via gossip removes single points of failure and scales without coordination bottlenecks.

When NOT to Use Dynamo-Style Systems

Be honest about the trade-offs. Avoid this approach when:

Strong consistency is required (financial transfers, inventory management)
Complex queries are needed (reporting, analytics, joins)
Transactions span multiple items (Dynamo offers only single-key operations)
Your team can't absorb the operational complexity (if engineers can't reason about vector clocks and conflict resolution, you will have problems)

Conclusion

Dynamo offers a foundational blueprint for building highly available systems. By embracing eventual consistency and pushing conflict resolution to the application, it makes always-on systems possible at massive scale. Its lessons reshaped the entire landscape of distributed databases.
Whether you reach for Cassandra, Riak, or DynamoDB, you will meet the core ideas described in this article. As engineers, our job is to understand these trade-offs deeply and apply them appropriately. Dynamo gives us a powerful tool, but like any tool, it's only as good as our understanding of when and how to use it.

Further Reading

The original Dynamo paper: SOSP 2007
Werner Vogels' blog: All Things Distributed
The Cassandra documentation: how a production system applies these ideas
"Designing Data-Intensive Applications" by Martin Kleppmann — Chapter 5 on Replication

Appendix: Design Problems and Approaches

Three open-ended problems that come up in system design interviews and real engineering work. Think through each before reading the discussion.

Problem 1: Conflict Resolution for a Collaborative Document Editor

The problem: Suppose you are building something like Google Docs on top of a Dynamo-style store. Two clients edit the same paragraph concurrently. How should the conflict be resolved?

Why shopping cart union doesn't work here: the cart merge (a union of all items) works only because adding elements is commutative — {A} ∪ {B} = {B} ∪ {A}. Text editing is not commutative. If Writer A deletes a sentence while Writer B edits the middle of that same sentence, applying their operations in different orders produces different (or nonsensical) documents.

The right approach: Operational Transformation (OT) or CRDTs

The industry solution is to represent the document not as a blob of text but as a sequence of operations, and to transform concurrent operations so that both can be applied cleanly:

User A's operation: delete(position=50, length=20)
User B's operation: insert(position=60, text="new sentence")

Without OT: B's insert position (60) is now wrong because A deleted 20 chars.
With OT: B's operation is transformed against A's. Position 60 falls inside A's deleted span (50–69), so the insert clamps to position 50, the start of the deletion; an insert past the span (say at position 80) would instead shift left by 20. Both operations now apply cleanly.
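The position arithmetic is small enough to sketch directly. One subtlety: an insert that falls inside a concurrently deleted span clamps to the start of the deletion, while an insert past the span shifts left by the deletion length. The function name below is illustrative, not an API from a real OT library:

```python
# Minimal sketch of one OT rule: transform an insert position against a
# concurrent delete. `transform_insert_against_delete` is an invented
# name for illustration, not a real OT library function.
def transform_insert_against_delete(insert_pos: int,
                                    del_pos: int,
                                    del_len: int) -> int:
    if insert_pos <= del_pos:
        return insert_pos            # insert is before the deletion: unchanged
    if insert_pos >= del_pos + del_len:
        return insert_pos - del_len  # insert is after the deleted span: shift left
    return del_pos                   # insert fell inside the span: clamp to its start

# User B's insert against User A's delete(position=50, length=20):
print(transform_insert_against_delete(60, 50, 20))  # 50 (inside span 50-69 → clamp)
print(transform_insert_against_delete(80, 50, 20))  # 60 (past the span → shift by 20)
```

A full OT system also needs the symmetric transforms (delete-against-insert, delete-against-delete) and a convergence property guaranteeing that both sites end up with the same document regardless of application order.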
The Dynamo-layer conflict resolution strategy is then:

1. Store operations (not full document snapshots) as the value for a key.
2. On conflict, collect all concurrent operation logs descending from a common version.
3. Run OT to merge them into a single operation log.
4. Write the merged log back, with the merged vector clock as context.

What to store in Dynamo: the operation log for a document segment, rather than the rendered text. This makes merging well-defined and cheap.

Real-world reference: Google Docs, Notion, and Figma all work roughly this way. Their storage layers use either OT or some flavor of CRDTs (Conflict-free Replicated Data Types) — data structures designed so that concurrent updates merge deterministically, without coordination.

Problem 2: Choosing N, R, W for Different Applications

The problem: What configuration would you choose for (a) a session store, (b) a product catalog, (c) user profiles?

A good way to approach this: decide which failure mode is least acceptable — a rejected write (lost data) or a stale read (inconsistency) — then pick the quorum settings accordingly.

Session store — prioritize availability

Sessions are short-lived and user-specific. If a session read is stale or a session is briefly lost, the user may have to log in again. Annoying, but not catastrophic. What you cannot afford is rejecting session writes.

N=3, R=1, W=1

Rationale:
- W=1: Accept session writes even during heavy failures. A user can't log in if their session write is rejected.
- R=1: Read from any single node. Stale session data is harmless.
- N=3: Still replicate to 3 nodes for basic durability.

Trade-off accepted: Stale session reads are possible but inconsequential.

Product catalog — prioritize read performance and consistency

Product data is written rarely (by ops teams) but read millions of times a day.
Writes are infrequent; reads dominate. You want fast, consistent reads.

N=3, R=2, W=3

Rationale:
- W=3: All replicas must confirm a catalog update before it's live. A price change half-published is worse than a brief write delay.
- R=2: Read quorum overlap with W=3 guarantees fresh data.
- Acceptable: catalog writes are rare, so write latency doesn't matter.
- N=3: Standard replication for durability.

Trade-off accepted: Writes are slow and fail if any node is down. Acceptable because catalog updates are infrequent.

User profiles — balanced

Profile data (avatar, email, preferences) genuinely matters. A stale profile read is mildly annoying but not disastrous. A lost update (e.g., a user's email change not sticking) is a real support headache.

N=3, R=2, W=2

Rationale:
- The classic balanced configuration.
- R + W = 4 > N = 3, so quorums overlap: reads will see the latest write.
- Tolerates 1 node failure for both reads and writes.
- Appropriate for data that matters but doesn't require strict consistency.

Trade-off accepted: A second simultaneous node failure will cause errors. Acceptable for non-critical user data.
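The three configurations can be compared mechanically. A small helper (hypothetical, not part of the article's code) makes the R + W > N overlap rule and the failure tolerance explicit:

```python
# Hedged sketch: a tiny helper (invented for illustration) that classifies
# an (N, R, W) configuration by the guarantees discussed above.
def describe_quorum(n: int, r: int, w: int) -> str:
    notes = []
    if r + w > n:
        notes.append("read/write quorums overlap → reads see the latest write")
    else:
        notes.append("quorums may not overlap → stale reads possible")
    notes.append(f"survives {n - w} node failure(s) for writes, "
                 f"{n - r} for reads")
    return "; ".join(notes)

for name, (n, r, w) in {
    "session store":   (3, 1, 1),
    "product catalog": (3, 2, 3),
    "user profiles":   (3, 2, 2),
}.items():
    print(f"{name}: {describe_quorum(n, r, w)}")
```

Running this makes the trade-offs concrete: the session store tolerates two failed nodes but may serve stale reads, while the catalog's W=3 buys fresh reads at the cost of zero write-side failure tolerance.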
Decision framework summary:

Priority              R   W   When to use
Max availability      1   1   Sessions, ephemeral state, click tracking
Balanced              2   2   User profiles, preferences, soft state
Consistent reads      2   3   Catalogs, config, rarely-written reference data
Highest consistency   3   3   Anywhere you need R+W > N with zero tolerance
                              for stale reads (still not linearizable)

Problem 3: Testing a Dynamo-Style System Under Partition Scenarios

The problem: How do you verify that your system behaves correctly when nodes fail and partitions occur?

This is one of the hardest problems in distributed systems testing, because the bugs appear only under specific interleavings of concurrent events that are very hard to reproduce on demand.

Layer 1: Unit tests for the logic in isolation

Before testing distributed behavior, test the building blocks on their own. Vector clock comparison, version reconciliation, and quorum accounting are all pure logic that can be covered by plain unit tests — no network required.

def test_concurrent_clocks_detected_as_conflict():
    clock_a = VectorClock({"node-A": 2})
    clock_b = VectorClock({"node-B": 2})
    assert not clock_a.dominates(clock_b)
    assert not clock_b.dominates(clock_a)
    # Both survive reconciliation → conflict correctly detected

def test_ancestor_clock_is_discarded():
    old_clock = VectorClock({"node-A": 1})
    new_clock = VectorClock({"node-A": 3})
    assert new_clock.dominates(old_clock)
    # old_clock should be pruned during reconciliation

Layer 2: Deterministic fault injection

Rather than hoping failures happen in the right order during load testing, inject them deliberately and repeatably.
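Deterministic injection needs nothing more than a failure flag on the node object. A minimal, self-contained stand-in for the article's DynamoNode (the class name FlakyNode is invented for this sketch) shows the pattern and exercises the mid-write failure case:

```python
# Hedged sketch of the fault-injection pattern: a `down` flag on each node
# lets a test fail any replica deterministically. FlakyNode is a minimal
# invented stand-in for the article's DynamoNode, not its real class.
class FlakyNode:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.down = False
        self.store: dict[str, object] = {}

    def write(self, key: str, value: object) -> bool:
        if self.down:
            return False          # simulate an unreachable replica
        self.store[key] = value
        return True

# Scenario C: nodes die mid-write, so fewer than W ACKs arrive.
nodes = [FlakyNode("A"), FlakyNode("B"), FlakyNode("C")]
nodes[1].down = True              # inject the failures deterministically
nodes[2].down = True
acks = sum(node.write("cart:user-1", {"items": ["hat"]}) for node in nodes)
assert acks < 2, "with W=2, this write must be rejected"
print(f"ACKs: {acks} → write rejected (quorum W=2 not met)")
```

Because the failure is a flag flip rather than a race, the same scenario replays identically on every run — which is what makes the verifications in the scenarios below checkable in CI.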
In the demo implementation above, setting node.down = True is the fault-injection lever. In production systems, tools like Chaos Monkey do this at the infrastructure level.

Key scenarios to test:

Scenario A: Write succeeds with W=2, third replica is down.
→ Verify: the data is readable after the down node recovers.
→ Verify: no data loss occurred.

Scenario B: Two nodes accept concurrent writes to the same key.
→ Verify: the next read surfaces exactly 2 conflicting versions.
→ Verify: after the application writes a merged version, the next read is clean.

Scenario C: Node goes down mid-write (wrote to W-1 nodes).
→ Verify: the write is correctly rejected (RuntimeError).
→ Verify: no partial writes are visible to readers.

Scenario D: All N nodes recover after a full partition.
→ Verify: no data was lost across the cluster.
→ Verify: vector clocks are still meaningful (no spurious conflicts).

Layer 3: Property-based testing

Instead of writing one test case at a time, define invariants that must always hold, then generate many random operation sequences that try to violate them:

# Invariant: after any sequence of writes and merges, a final get()
# should always return exactly one version (no unresolved conflicts).

# Invariant: a value written with a context derived from a previous read
# should never produce a conflict with that read's version
# (it should dominate it).

# Invariant: if R + W > N, a value written successfully should always
# be visible in the next read (read-your-writes, absent concurrent writes).

Property-based testing libraries (in Python) can drive these invariants and automatically search for counterexamples.
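The idea needs no framework to get started: a seeded random generator plus an invariant check already is a property test. The sketch below checks one of the invariants above — that a merged clock always dominates (or equals) every clock that went into it, so a write using it as context cannot re-conflict. The helpers merge and dominates_or_equal are local stand-ins written for this sketch, assuming the pointwise-max merge described in the article, not its actual classes:

```python
# Hedged sketch: a framework-free property test. Real projects would use a
# property-based testing library, but the core idea is just this loop:
# generate random operation sequences, then assert the invariant.
import random

def merge(clocks: list[dict[str, int]]) -> dict[str, int]:
    """Pointwise max of vector clocks — the merge used to close conflicts."""
    out: dict[str, int] = {}
    for c in clocks:
        for node, count in c.items():
            out[node] = max(out.get(node, 0), count)
    return out

def dominates_or_equal(a: dict[str, int], b: dict[str, int]) -> bool:
    return all(a.get(n, 0) >= cnt for n, cnt in b.items())

rng = random.Random(42)  # seeded → deterministic, repeatable failures
for _ in range(1000):
    # Generate a random set of concurrent clocks over nodes A..C
    clocks = [{n: rng.randint(0, 5) for n in "ABC" if rng.random() < 0.7}
              for _ in range(rng.randint(1, 4))]
    merged = merge(clocks)
    # Invariant: the merged clock dominates (or equals) every input clock.
    assert all(dominates_or_equal(merged, c) for c in clocks)
print("1000 random cases passed: merged clock always dominates its inputs")
```

Seeding the generator matters: when a case fails, the same seed reproduces it exactly, turning a flaky distributed bug into an ordinary failing test.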
Layer 4: Linearizability checkers

For maximum rigor, record each operation's invocation time, response time, and result during a chaos test, then feed the history to a linearizability checker. The checker determines whether the observed history is consistent with some legal sequential execution — catching subtle consistency violations no amount of eyeballing would find, even for a system that only promises eventual consistency.

Written from the trenches of distributed systems. Battle-tested, zero hand-waving.