A senior engineer’s perspective on building highly available distributed systems

Contents
Introduction: Why Dynamo Was Built
The CAP Theorem Trade-off
Core Architecture Components
Consistent Hashing for Partitioning
Replication Strategy (N, R, W)
Vector Clocks for Versioning
Sloppy Quorum and Hinted Handoff
Conflict Resolution: The Shopping Cart Problem
Read and Write Flow
Merkle Trees for Anti-Entropy
Membership and Failure Detection
Performance: Real Numbers
Partitioning Strategy Evolution
Dynamo's Influence on Modern Systems
What Dynamo Is Not
A Working Example
System Design Lessons
When Not to Use Dynamo-Style Systems
Conclusion
Appendix: Design Questions and Answers

This is a long-form reference - every section stands on its own, so jump straight to whichever section is most relevant to you.

Introduction: Why Dynamo Was Built

When Amazon published the Dynamo paper in 2007, it was not just an academic exercise. It was a battle-tested solution to real problems at enormous scale. I'd recommend reading the original paper at some point - this article is my attempt to explain clearly how its ideas fit together. Dynamo is a distributed key-value store. It was built to serve Amazon's internal services, such as the shopping cart and the session management systems. There are no secondary indexes, no joins, no relational semantics - just keys and values, with a relentless focus on availability and scalability. There are no linearizability or global ordering guarantees, even at the highest quorum settings. If your system needs those properties, Dynamo is not the right tool.
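To make the "just keys and values" point concrete, here is a minimal in-memory sketch of the interface a Dynamo-style store exposes: a get that returns the stored value plus an opaque version context, and a put that takes that context back. The class and method names are illustrative, not from any real client library, and this single-node toy has none of the replication machinery the rest of the article covers.

```python
class TinyKV:
    """Single-node stand-in for a Dynamo-style store: just the API shape,
    no replication and no vector clocks."""

    def __init__(self):
        self._store = {}  # key -> (value, version_counter)

    def get(self, key):
        value, version = self._store.get(key, (None, 0))
        # The context is opaque to callers; real Dynamo packs vector-clock
        # metadata in here.
        return value, {"version": version}

    def put(self, key, context, value):
        # A real Dynamo put uses the context to record causal history;
        # here we just bump a counter.
        self._store[key] = (value, context["version"] + 1)

kv = TinyKV()
_, ctx = kv.get("cart:alice")
kv.put("cart:alice", ctx, ["shoes"])
value, ctx = kv.get("cart:alice")
print(value)  # ['shoes']
```

Everything else in this article - hashing, quorums, vector clocks - exists to make these two operations fast and always available across many machines.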
The core problem Amazon needed to solve is easy to state but hard to get right: when a customer adds an item to their shopping cart during a network partition or a server failure, that write must not be rejected. Every rejected write is lost revenue and eroded customer trust. How do you build a storage system that never says “no” to customers?

The CAP Theorem Trade-off: Why Dynamo Chooses Availability

Before digging into how Dynamo works, you need to understand the constraint that shapes its entire design.

What is the CAP theorem? The CAP theorem describes a fundamental trade-off in distributed systems: when a network partition occurs, you must choose between consistency and availability. The three properties are:

Consistency (C): All nodes see the same data at the same time
Availability (A): Every request receives a response (success or failure)
Partition Tolerance (P): The system keeps operating despite network failures

The popular shorthand is "pick 2 of 3", but that is an oversimplification. In practice, network partitions are unavoidable at scale, so the real design decision is: when partitions occur (and they will), do you sacrifice consistency or availability?

The harsh reality: network partitions WILL happen. Cables get cut, switches fail, datacenters go dark. You cannot prevent them, so you must choose: consistency or availability?

Traditional databases choose consistency:
Traditional approach
Database: "I can't guarantee all replicas are consistent, so I'll reject your write to be safe."
Result: Customer sees error, cart is empty
Impact: Lost revenue, poor experience

Dynamo chooses availability:
Dynamo’s approach
Dynamo: "I'll accept your write with the replicas I can reach. The unreachable replica will catch up later."
Result: Customer sees success, item in cart
Impact: Sale continues, happy customer

The Trade-off, Visualized

When a partition occurs:

Traditional Database: Choose C over A → Sacrifice Availability
- ✓ All replicas always have same data
- ✓ No conflicts to resolve
- ❌ Rejects writes during failures
- ❌ Poor customer experience
- ❌ Lost revenue

Dynamo: Choose A over C → Sacrifice Strong Consistency
- ✓ Accepts writes even during failures
- ✓ Excellent customer experience
- ✓ No lost revenue
- ❌ Replicas might temporarily disagree
- ❌ Application must handle conflicts

A Real Amazon Example: The Black Friday Shopping Cart

Picture Black Friday. Millions of customers are shopping. A network cable gets cut between datacenters.

With a traditional database:
Time: 10:00 AM - Network partition occurs
Result:
- All shopping cart writes fail
- "Service Unavailable" errors
- Customers can't checkout
- Twitter explodes with complaints
- Estimated lost revenue: $100,000+ per minute

With Dynamo:
Time: 10:00 AM - Network partition occurs
Result:
- Shopping cart writes continue
- Customers see success
- Some carts might have conflicts (rare)
- Application merges conflicting versions
- Estimated lost revenue: $0
- A few edge cases need conflict resolution (acceptable)

Why This Choice Makes Sense for E-Commerce

Amazon did the math:
Cost of rejecting a write: a lost sale ($50-200)
Cost of accepting a conflicting write: occasionally needing to merge shopping carts (rare, and easy to resolve)
Business conclusion: accept the writes, merge the occasional conflict

Types of data where Availability > Consistency:
- Shopping carts (merging additions is usually fine)
- Session state (last-write-wins is fine)
- User preferences (a stale read is tolerable)
- Best-seller lists (approximate is fine)

Types of data where Consistency > Availability:
- Bank account balances (can't tolerate divergent values)
- Inventory counts (can't oversell stock)
- Transaction logs (must be complete)

This is why Dynamo isn't for everything - but for Amazon's e-commerce workloads, trading strong consistency for availability was the right call.

Important nuance: although Dynamo is usually labeled an AP system, it is more accurately described as a tunable consistency system. Depending on your R and W quorum settings, it can be pushed much closer to CP behavior. The AP label really applies to its default/recommended configuration for e-commerce workloads.

Core Architecture Components

1. Consistent Hashing for Partitioning

Take your time with this one, because consistent hashing underpins almost everything else in the design.

The Problem: Naive Hash-Based Sharding

Imagine you have 3 servers and want to distribute data across them. The naive approach:

# Traditional approach - DON'T DO THIS
def get_server(key, num_servers):
    hash_value = hash(key)
    return hash_value % num_servers  # Modulo operation

# With 3 servers:
get_server("user_123", 3)  # Returns server 0
get_server("user_456", 3)  # Returns server 1
get_server("user_789", 3)  # Returns server 2

This works... right up until you add or remove a server. Watch what happens when you go from 3 to 4 servers:

# Before (3 servers):
"user_123" → hash % 3 = 0 → Server 0
"user_456" → hash % 3 = 1 → Server 1
"user_789" → hash % 3 = 2 → Server 2

# After (4 servers):
"user_123" → hash % 4 = 0 → Server 0 ✓ (stayed)
"user_456" → hash % 4 = 1 → Server 1 ✓ (stayed)
"user_789" → hash % 4 = 2 → Server 2 ✓ (stayed)

# But wait - this is lucky!
In reality, most keys MOVE:
"product_ABC" → hash % 3 = 2 → Server 2
"product_ABC" → hash % 4 = 3 → Server 3 ✗ (MOVED!)

The disaster: when you change the number of servers, most of your keys have to be relocated. Imagine shuffling terabytes of data just to add one server!

The Solution: Consistent Hashing

Consistent hashing treats the hash space as a circle (0 to 2^32 - 1, wrapping around).

Step 1: Place servers on the ring
Each server is assigned a position on the ring (known as its token). Think of it as placing markers around a circular race track.

Step 2: Place data on the ring
When you want to store data, you:
1. Hash the key to get a position on the ring
2. Walk clockwise from that position
3. Store the data on the first server you reach

A Complete Example

Here is the ring at a given moment. Keys walk clockwise to the next server.

Simple rule: a key belongs to the first server you encounter walking clockwise from the key's position.

Examples:
user_123 hashes to 30° → walk clockwise to 45° → Server A owns it
user_456 hashes to 150° → walk clockwise to 200° → Server C owns it
cart_789 hashes to 250° → walk clockwise to 280° → Server D owns it
product_ABC hashes to 300° → walk past 360°, wrap to 0°, reach 45° → Server A owns it

Who owns what range?
Server A (45°): owns everything from 281° to 45° (wraps around)
Server B (120°): owns everything from 46° to 120°
Server C (200°): owns everything from 121° to 200°
Server D (280°): owns everything from 201° to 280°

The Magic: Adding a Server

Let's add a server and watch what happens.
We add Server E at position 160°:

BEFORE:
Server A (45°)  → owns 281°-45°
Server B (120°) → owns 46°-120°
Server C (200°) → owns 121°-200°  ← THIS RANGE WILL SPLIT
Server D (280°) → owns 201°-280°

AFTER:
Server A (45°)  → owns 281°-45°   ← NO CHANGE
Server B (120°) → owns 46°-120°   ← NO CHANGE
Server E (160°) → owns 121°-160°  ← NEW! Takes part of C's range
Server C (200°) → owns 161°-200°  ← SMALLER range
Server D (280°) → owns 201°-280°  ← NO CHANGE

Result: only the keys in 121°-160° need to move (from C to E). Servers A, B, and D are completely untouched!

The Virtual Nodes Optimization

There is a serious problem with basic consistent hashing: random distribution can be extremely uneven.

The Problem in Detail: When you assign each server a single random position, you are effectively throwing darts at a circular board. Sometimes the darts cluster together; sometimes they spread apart. This creates hotspots.

Here is a realistic example:

Scenario: 4 servers with single random tokens
Server A: 10°  }
Server B: 25°  } ← Only 75° apart! Tiny ranges
Server C: 100° }
Server D: 280° ← 180° away from C! Huge range

Range sizes:
- Server A owns: 281° to 10° = 89° (25% of ring)
- Server B owns: 11° to 25° = 14° (4% of ring) ← Underutilized!
- Server C owns: 26° to 100° = 74° (21% of ring)
- Server D owns: 101° to 280° = 179° (50% of ring) ← Overloaded!

Real-world consequences:

Uneven load: Server D handles 50% of the traffic while Server B handles only 4%.
This matters because:
- Server D's CPU, disk, and network become saturated
- Server B sits mostly idle (wasted capacity)
- Server D's 99.9th percentile latency degrades

Hotspot cascading: When Server D becomes slow or fails:
- All 50% of its load shifts to Server A (the next server clockwise)
- Server A is now overloaded too
- The overload cascades around the ring
Scaling is also inefficient: adding a server may not help much, because the new server can land in a range that was already small.

Visualizing the problem:

Dynamo's solution: each physical server claims multiple positions on the ring (tokens), called virtual nodes.

Instead of throwing one dart per server, each server throws many darts. The more darts you throw, the more even the distribution becomes (the law of large numbers).

How Virtual Nodes Fix the Problem:

Let's take the same 4 servers, but now each server gets 3 virtual nodes (tokens) instead of 1:

Physical Server A gets 3 tokens: 10°, 95°, 270°
Physical Server B gets 3 tokens: 25°, 180°, 310°
Physical Server C gets 3 tokens: 55°, 150°, 320°
Physical Server D gets 3 tokens: 75°, 200°, 340°

Now the ring looks like:
10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B, 200° D, 270° A, 310° B, 320° C, 340° D

Range sizes (each token owns the arc back to the previous token):
- Server A total: 30° + 20° + 70° = 120° (33% of ring)
- Server B total: 15° + 30° + 40° = 85° (24% of ring)
- Server C total: 30° + 55° + 10° = 95° (26% of ring)
- Server D total: 20° + 20° + 20° = 60° (17% of ring)

Load ranges from 17% to 33% instead of 4% to 50%. Much better!

Why this works: with more samples (tokens), the random distribution averages out. This is the law of large numbers in action.
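The dart-throwing intuition is easy to verify in code. Below is a small sketch of a consistent-hash ring with configurable virtual nodes. The hash function, server names, and token counts are arbitrary choices for illustration, not anything from Dynamo itself: with one token per server the key distribution is often lopsided, while 128 tokens per server lands close to an even split.

```python
import hashlib
from bisect import bisect_right

def ring_hash(s: str) -> int:
    # md5 gives a stable, well-spread position on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, tokens_per_server):
        # Each server claims tokens_per_server positions (virtual nodes)
        self._ring = sorted(
            (ring_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(tokens_per_server)
        )
        self._positions = [pos for pos, _ in self._ring]

    def lookup(self, key: str) -> str:
        # Walk clockwise from the key's position to the next token,
        # wrapping around at the end of the ring.
        idx = bisect_right(self._positions, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]

def load_spread(ring: Ring, n_keys: int = 10_000) -> dict:
    counts = {}
    for i in range(n_keys):
        server = ring.lookup(f"key-{i}")
        counts[server] = counts.get(server, 0) + 1
    return counts

servers = ["A", "B", "C", "D"]
print(load_spread(Ring(servers, 1)))    # one token each: often badly skewed
print(load_spread(Ring(servers, 128)))  # 128 vnodes each: roughly 2,500 per server
```

Note that `num_servers` never appears in `lookup` - only token positions do - which is exactly why membership changes stay local.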
Granular load distribution: when a server fails, its load is distributed across many servers, not just one neighbor:

Server A fails:
- Its token at 10° → load shifts to Server B's token at 25°
- Its token at 95° → load shifts to Server C's token at 150°
- Its token at 270° → load shifts to Server B's token at 310°
Result: The load is spread across multiple servers!

Smooth scaling: when adding a new server with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server.

Real Dynamo configurations: The paper mentions different strategies evolved over time. In production:
Early versions: 100-200 virtual nodes per physical server
Later optimized to: Q/S tokens per node (where Q = total partitions, S = number of servers)
Common default: 128-256 virtual nodes per physical server

The Trade-off: Balance vs Overhead

More virtual nodes buy you a more even load, but they are not free.

What you gain with more virtual nodes:

With 1 token per server (4 servers):
Load variance: 4% to 50% (±46% difference) ❌
With 3 tokens per server (12 virtual nodes):
Load variance: 17% to 33% (±16% difference) ✓
With 128 tokens per server (512 virtual nodes):
Load variance: 24% to 26% (±2% difference) ✓✓

What it costs:

Metadata size: every node must track the full token map
1 token per server: track 4 entries
128 tokens per server: track 512 entries

Gossip overhead: nodes exchange membership information continuously
More tokens = more data to gossip between nodes
Every minute, nodes reconcile their view of the ring

Rebalancing complexity: when nodes join/leave
More virtual nodes = more partition ranges to hand off
But each transfer is smaller (which actually smooths bootstrapping)

Dynamo’s evolution: The paper describes how Amazon optimized this over time:

Strategy 1 (Initial):
- 100-200 random tokens per server
- Problem:
Huge metadata (multiple MB per node)
- Problem: Slow bootstrapping (had to scan for specific key ranges)

Strategy 3 (Current):
- Q/S tokens per server (Q=total partitions, S=number of servers)
- Equal-sized partitions
- Example: 1024 partitions / 8 servers = 128 tokens per server
- Benefit: Metadata reduced to KB
- Benefit: Fast bootstrapping (transfer whole partition files)

Real production sweet spot: Most Dynamo-style deployments use 128-256 virtual nodes per physical server. This gets you:
Load distribution within 10-15% variance (more than good enough)
Metadata overhead under 100KB per node (negligible)
Fast failure recovery (load spreads across many nodes)
Modest gossip traffic

Why not more? Going from 128 to 512 tokens improves load balance by only 2-3%, while quadrupling the token count and, with it, the metadata size and gossip traffic.

Key concept: one physical server (top) maps to multiple virtual nodes (bottom) on the ring. This spreads each server's load across different parts of the hash space.

Benefits:
More even load distribution
When a server fails, its load is distributed across many servers (not just one neighbor)
When a server joins, it steals a small amount from many servers

Real-World Impact Comparison

Let’s see the difference with real numbers:

Traditional Hashing (3 servers → 4 servers):
- Keys that need to move: ~75% (3 out of 4)
- Example: 1 million keys → 750,000 keys must migrate

Consistent Hashing (3 servers → 4 servers):
- Keys that need to move: ~25% (1 out of 4)
- Example: 1 million keys → 250,000 keys must migrate

With Virtual Nodes (150 vnodes total → 200 vnodes):
- Keys that need to move: ~12.5% (spread evenly)
- Example: 1 million keys → 125,000 keys must migrate
- Load is balanced across all servers

The Aha! Moment

The key insight is this: Consistent hashing decouples the hash space from the number of servers.

Traditional: ← num_servers is in the formula!
server = hash(key) % num_servers

Consistent: ← num_servers isn’t in the formula!
server = ring.findNextClockwise(hash(key))

This is why adding/removing servers only affects a small portion of the data. The hash values don’t change—only which server “owns” which range changes, and only locally.

Think of it like a circular running track with water stations (servers). If you add a new water station, runners only change stations if they’re between the old nearest station and the new one. Everyone else keeps going to their same station.

2. Replication Strategy (N, R, W)

The Problem: Availability vs Consistency Trade-off

Imagine you run Amazon's shopping cart storage. A customer adds an item to their cart, but at that exact moment:
One server is being rebooted for maintenance
Another server has a network hiccup
A third server is perfectly fine

Traditional database approach (strong consistency):
Client: "Add this item to my cart"
Database: "I need ALL replicas to confirm before I say yes"
Server 1: ✗ (rebooting)
Server 2: ✗ (network issue)
Server 3: ✓ (healthy)
Result: "Sorry, service unavailable. Try again later."

Customer experience: "I can't add items to my cart during Black Friday?!"

This is unacceptable for e-commerce. Every rejected write is lost revenue.

Dynamo’s Solution: Tunable Quorums

Dynamo gives you three knobs to dial in the right trade-off:

N: Number of replicas (how many copies of the data exist)
R: Read quorum (how many replicas must respond for a successful read)
W: Write quorum (how many replicas must acknowledge a write for it to succeed)

If R + W > N, you guarantee quorum overlap—meaning at least one node that received the write will be queried during any read. This overlap enables detection of the latest version, provided the reconciliation logic correctly identifies the highest vector clock.
It does not automatically guarantee read-your-writes unless the coordinator properly resolves versions.

The magic formula: R + W > N

Let's see how this plays out in a few common scenarios:

Scenario 1: Shopping Cart (Availability Priority)

N = 3  # Three replicas for durability
R = 1  # Read from any single healthy node
W = 1  # Write to any single healthy node

# Trade-off analysis:
# ✓ Writes succeed even if 2 out of 3 nodes are down
# ✓ Reads succeed even if 2 out of 3 nodes are down
# ✓ Maximum availability - never reject customer actions
# ✗ Might read stale data
# ✗ Higher chance of conflicts (but we can merge shopping carts)

What happens during failure:
Client: "Add item to cart"
Coordinator tries N=3 nodes:
- Node 1: ✗ Down
- Node 2: ✓ ACK (W=1 satisfied!)
- Node 3: Still waiting...
Result: SUCCESS returned to client immediately
Node 3 eventually gets the update (eventual consistency)

Scenario 2: Session State (Balanced Approach)

N = 3
R = 2  # Must read from 2 nodes
W = 2  # Must write to 2 nodes

# Trade-off analysis:
# ✓ R + W = 4 > N = 3 → Read-your-writes guaranteed
# ✓ Tolerates 1 node failure
# ✓ Good balance of consistency and availability
# ✗ Write fails if 2 nodes are down
# ✗ Read fails if 2 nodes are down

Why R + W > N enables read-your-writes:
Write to W=2 nodes: [A, B]
Later, read from R=2 nodes: [B, C]
Because W + R = 4 > N = 3, there's guaranteed overlap!
At least one node (B in this case) will have the latest data.
The coordinator detects the newest version by comparing vector clocks.

This guarantees seeing the latest write as long as reconciliation picks the causally most-recent version correctly.
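The overlap argument can also be checked exhaustively for small N. This toy sketch (my own illustration, not Dynamo code) enumerates every possible write quorum and read quorum over N replicas and verifies whether every pair shares at least one node:

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """True if every size-w write quorum intersects every size-r read quorum.

    By the pigeonhole principle this holds exactly when r + w > n.
    """
    replicas = range(n)
    return all(
        set(write_quorum) & set(read_quorum)  # empty intersection -> False
        for write_quorum in combinations(replicas, w)
        for read_quorum in combinations(replicas, r)
    )

print(quorums_always_overlap(3, 2, 2))  # R + W = 4 > 3  -> True
print(quorums_always_overlap(3, 1, 1))  # R + W = 2 <= 3 -> False
```

For N=3, R=1, W=1 the check fails because the read can land entirely on a node the write never touched - which is precisely the stale-read risk the shopping-cart configuration accepts.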
Scenario 3: Financial Records (Consistency Priority)

N = 3
R = 3  # Must read from ALL nodes
W = 3  # Must write to ALL nodes

# Trade-off analysis:
# ✓ Full replica quorum — reduces likelihood of divergent versions
# ✓ Any read will overlap every write quorum
# ✗ Write fails if ANY node is down
# ✗ Read fails if ANY node is down
# ✗ Poor availability during failures

Systems requiring strict transactional guarantees typically choose CP systems instead. This configuration is technically supported by Dynamo but sacrifices the availability properties that motivate using it in the first place.

Configuration Comparison Table

Config             N  R  W  Availability      Consistency   Use Case
High Availability  3  1  1  ⭐⭐⭐⭐⭐        ⭐⭐           Shopping cart, wish list
Balanced           3  2  2  ⭐⭐⭐⭐          ⭐⭐⭐⭐       Session state, user preferences
Full Quorum        3  3  3  ⭐⭐              ⭐⭐⭐⭐⭐     High-stakes reads (not linearizable)
Read-Heavy         3  1  3  ⭐⭐⭐ (reads)    ⭐⭐⭐⭐       Product catalog, CDN metadata
Write-Heavy        3  3  1  ⭐⭐⭐ (writes)   ⭐⭐⭐         Click tracking, metrics

Note on financial systems: Systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn’t use Dynamo. That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business logic layer.
Key Takeaways

Most systems use N=3, R=2, W=2 because:

Durability: Can tolerate up to 2 replica failures before permanent data loss (assuming independent failures and no correlated outages).
Availability: Tolerates 1 node failure for both reads and writes.
Consistency: R + W > N guarantees that read and write quorums overlap, enabling read-your-writes behavior in the absence of concurrent writes.
Performance: Don’t wait for the slowest node (only need 2 out of 3).

Real production numbers from the paper: Amazon’s shopping cart service during peak (holiday season):
Configuration: N=3, R=2, W=2
Millions of requests served
Over 3 million checkouts in a single day
No downtime, even with server failures

This tunable approach is what made Dynamo revolutionary. You’re not stuck with one-size-fits-all—you tune it based on your actual business requirements.

3. Vector Clocks for Versioning

The Problem: Detecting Causality in Distributed Systems

When multiple nodes can accept writes independently, you need to answer a critical question: Are these two versions of the same data related, or were they created concurrently?

Why timestamps don’t work:

Scenario: Two users edit the same shopping cart simultaneously
User 1 at 10:00:01.500 AM: Adds item A → Writes to Node X
User 2 at 10:00:01.501 AM: Adds item B → Writes to Node Y

Physical timestamp says: User 2's version is "newer"
Reality: These are concurrent! Both should be kept!

Problem:
- Clocks on different servers are NEVER perfectly synchronized
- Clock skew can be seconds or even minutes
- Network delays are unpredictable
- Physical time doesn't capture causality

What we really need to know:
Version A happened before Version B? → B can overwrite A
Version A and B are concurrent? → Keep both, merge later
Version A came from reading Version B? → We can track this!

How Vector Clocks Work

A vector clock is a simple data structure: a list of (node_id, counter) pairs that tracks which nodes have seen which versions.
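The comparison logic falls straight out of that definition: if every counter in one clock is less than or equal to the matching counter in the other, the first is an ancestor; if each side has some counter the other lacks or exceeds, the versions are concurrent. Here is a minimal sketch, representing a clock as a plain dict of node_id → counter (the function name is mine, not from the paper):

```python
def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two vector clocks."""
    keys = set(a) | set(b)
    # A missing entry counts as 0: that node never touched this version.
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"    # a happened-before b; b may safely overwrite a
    if b_le_a:
        return "descendant"  # b happened-before a
    return "concurrent"      # conflict: both versions must be kept

print(compare({"Sx": 1}, {"Sx": 2}))                    # ancestor
print(compare({"Sx": 2, "Sy": 1}, {"Sx": 2, "Sz": 1}))  # concurrent
```

The second call is exactly the D3-vs-D4 situation worked through below: neither clock dominates, so both versions survive.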
The rules:
1. When a node writes data, it increments its own counter
2. When a node replicates data, it passes the vector clock along
3. When you compare two vector clocks:
   - If all counters in A ≤ counters in B → A is an ancestor of B (B is newer)
   - If some counters in A > B and some B > A → A and B are concurrent (conflict!)

Step-by-Step Example

Let's follow a shopping cart through several updates:

Breaking down the conflict:
D3: [Sx:2, Sy:1]  vs  D4: [Sx:2, Sz:1]
Comparing:
- Sx: 2 == 2 ✓ (equal)
- Sy: 1 vs missing in D4 → D3 has something D4 doesn't
- Sz: missing in D3 vs 1 → D4 has something D3 doesn't
Conclusion: CONCURRENT! Neither is an ancestor of the other.
Both versions must be kept and merged.

Real-World Statistics

The Dynamo paper reports the following conflict distribution, measured over 24 hours of Amazon's production shopping cart traffic. These figures reflect Amazon's workload - a high read/write ratio and mostly single-user carts - and shouldn't be generalized to every Dynamo deployment:

99.94%   - Single version (no conflict)
0.00057% - 2 versions
0.00047% - 3 versions
0.00009% - 4 versions

Key insight: conflicts are extremely rare in practice!

Why conflicts happen:
Not usually from network failures
Mostly from concurrent writers (often automated processes/bots)
Human users rarely create conflicts because they’re slow compared to network speed

The Size Problem

Vector clocks can grow unbounded if many nodes coordinate writes. Dynamo's solution: once the clock exceeds a size threshold, truncate the oldest entries.

// When vector clock exceeds threshold (e.g., 10 entries)
// Remove the oldest entry based on wall-clock timestamp
vectorClock = {
  'Sx': {counter: 5, timestamp: 1609459200},
  'Sy': {counter: 3, timestamp: 1609459800},
  'Sz': {counter: 2, timestamp: 1609460400},
  // ... 10 more entries
}
// If size > 10, remove entry with oldest timestamp
// ⚠ Risk: Dropping an entry collapses causality information.
// Two versions that were causally related may now appear
// concurrent, forcing the application to resolve a conflict
// that didn't actually exist. In practice, Amazon reports
// this has not been a significant problem — but it is a
// real theoretical risk in high-churn write environments
// with many distinct coordinators.

4. Sloppy Quorum and Hinted Handoff

The Problem: Strict Quorums Kill Availability

Traditional quorum systems are rigid and unforgiving.

Traditional strict quorum:
Your data is stored on nodes: A, B, C (preference list)
Write requirement: W = 2

Scenario: Node B is down for maintenance
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✓
Result: SUCCESS (got 2 out of 3)

Scenario: Nodes B AND C are down
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✗ (down)
Result: FAILURE (only got 1 out of 3)
Customer: "Why can't I add items to my cart?!" 😡

The problem: strict quorums require specific nodes. If those particular nodes are down, the system becomes unavailable even when plenty of other healthy nodes exist.

Real scenario at Amazon:
Black Friday, 2:00 PM
- Datacenter 1: 20% of nodes being rebooted (rolling deployment)
- Datacenter 2: Network hiccup (1-2% packet loss)
- Traffic: 10x normal load

With strict quorum:
- 15% of write requests fail
- Customer support phones explode
- Revenue impact: Millions per hour

The Solution: Sloppy Quorum

Dynamo relaxes the quorum requirement: “Write to the first N healthy nodes in the preference list, walking further down the ring if needed.”

Preference list for key K: A, B, C
But B is down...

Sloppy Quorum says: "Don't give up! Walk further down the ring: A, B, C, D, E, F, ..."
Coordinator walks until N=3 healthy nodes are found: A, C, D
(D is a temporary substitute for B)

How Hinted Handoff Works

When a node temporarily substitutes for a failed node, it stores a “hint” with the data.
Detailed Hinted Handoff Process

Step 1: Detect failure and substitute

def write_with_hinted_handoff(key, value, N, W):
    preference_list = get_preference_list(key)  # [A, B, C]
    healthy_nodes = []
    for node in preference_list:
        if is_healthy(node):
            healthy_nodes.append((node, False))  # (node, is_hint)

    # If we don't have N healthy nodes, expand the list
    if len(healthy_nodes) < N:
        extended_list = get_extended_preference_list(key)
        for node in extended_list:
            if node not in preference_list and is_healthy(node):
                healthy_nodes.append((node, True))  # substitute carries a hint
                if len(healthy_nodes) >= N:
                    break

    # Write to first N healthy nodes
    acks = 0
    for node, is_hint in healthy_nodes[:N]:
        if is_hint:
            # Store with hint metadata naming the intended owner
            intended_node = find_intended_node(preference_list, node)
            success = node.write_hinted(key, value, hint=intended_node)
        else:
            success = node.write(key, value)
        if success:
            acks += 1

    if acks >= W:
        return SUCCESS
    return FAILURE

Step 2: Background hint transfer

# Runs periodically on each node (e.g., every 10 seconds)
def transfer_hints():
    hints_db = get_hinted_replicas()
    for hint in hints_db:
        intended_node = hint.intended_for
        if is_healthy(intended_node):
            try:
                intended_node.write(hint.key, hint.value)
                hints_db.delete(hint)
                log(f"Successfully transferred hint to {intended_node}")
            except Exception:
                log(f"Will retry later for {intended_node}")

Why This Is Brilliant

Durability maintained:
Even though B is down:
- We still have N=3 copies: A, C, D
- Data won't be lost even if another node fails
- System maintains durability guarantee

Availability maximized:
Client perspective:
- Write succeeds immediately
- No error message
- No retry needed
- Customer happy

Traditional quorum would have failed:
- Only 2 nodes available (A, C)
- Need 3 for N=3
- Write rejected
- Customer sees error

Eventual consistency:
Timeline:
T=0: Write succeeds (A, C, D with hint)
T=0-5min: B is down, but system works fine
T=5min: B recovers
T=5min+10sec: D detects B is back, transfers hint
T=5min+11sec: B has the data, D deletes hint
Result: Eventually, all correct replicas have the data

Configuration Example

// High availability configuration
const config = {
  N: 3,  // Want 3 replicas
  W: 2,  // Only need 2 ACKs to succeed
  R: 2,  // Read from 2 nodes

  // Sloppy quorum allows expanding preference list
  sloppy_quorum: true,

  // How far to expand when looking for healthy nodes
  max_extended_preference_list: 10,

  // How often to check for hint transfers
  hint_transfer_interval: 10_seconds,

  // How long to keep trying to transfer hints
  hint_retention: 7_days
};

Real-World Impact

From Amazon’s production experience:

During normal operation:
Hinted handoff rarely triggered
Nearly all writes land on their intended nodes
Hints database is mostly empty

During failures:
Scenario: 5% of nodes failing at any time (normal at Amazon's scale)
Without hinted handoff:
- Write success rate: 85%
- Customer impact: 15% of cart additions fail
With hinted handoff:
- Write success rate: 99.9%+
- Customer impact: Nearly zero

During datacenter failure:
Scenario: Entire datacenter unreachable (33% of nodes)
Without hinted handoff:
- Many keys would lose entire preference list
- Massive write failures
- System effectively down
With hinted handoff:
- Writes redirect to other datacenters
- Hints accumulate temporarily
- When datacenter recovers, hints transfer
- Zero customer-visible failures

The Trade-off

Benefits:
✓ Maximum write availability
✓ Durability maintained during failures
✓ Automatic recovery when nodes come back
✓ No manual intervention required

Costs:
✗ Temporary inconsistency (data not on “correct” nodes)
✗ Extra storage for hints database
✗ Background bandwidth for hint transfers
✗ More complex code paths
✗ If a substitute node (like D) fails before it can transfer its hint back to B, the number of true replicas drops below N until the situation resolves. This is an important edge case to understand in failure planning.

Hinted handoff provides temporary durability, not permanent replication.
Amazon's verdict: the availability benefits far outweigh the costs for e-commerce workloads.

Conflict Resolution: The Problem of the Shopping Cart

Let's talk about the most famous example from the paper: the shopping cart. This is where the rubber meets the road.

What Is a Conflict (and Why Does It Happen)?

A conflict occurs when two writes happen to the same key on different nodes, without either write "knowing about" the other. This is only possible because Dynamo accepts writes even when nodes can't communicate—which is the whole point!

Here's a concrete sequence of events that creates a conflict:

Timeline:
T=0: Customer logs in. Cart has {shoes} on all 3 nodes.
T=1: Network partition: Node1 can't talk to Node2.
T=2: Customer adds {jacket} on their laptop → goes to Node1.
     Cart on Node1: {shoes, jacket}  ← Vector clock: [N1:2]
T=3: Customer adds {hat} on their phone → goes to Node2.
     Cart on Node2: {shoes, hat}     ← Vector clock: [N2:2]
T=4: Network heals. Node1 and Node2 compare notes.
     Node1 says: "I have version [N1:2]"
     Node2 says: "I have version [N2:2]"
     Neither clock dominates the other → CONFLICT!

Neither version is "wrong"—both represent real actions the customer took. Dynamo's job is to detect this situation (via vector clocks) and surface both versions to the application, so the application can decide how to merge them.

What Does the Application Do With a Conflict?

This is the crucial part the paper emphasizes: the application must resolve conflicts using business logic. Dynamo gives you all the concurrent versions; your code decides how to merge them.

For the shopping cart, Amazon chose a union merge: keep all items from all concurrent versions. The rationale is simple—losing an item from a customer's cart (missing a sale) is worse than occasionally showing a stale item they already deleted.
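Before any merging happens, Dynamo first has to decide that the two versions really are concurrent. The "neither clock dominates the other" check from the timeline above can be sketched as a small standalone function (the helper name is illustrative, not from the paper):

```python
def dominates(a: dict, b: dict) -> bool:
    """True if clock `a` has seen everything `b` has seen, plus more."""
    keys = set(a) | set(b)
    return (
        all(a.get(k, 0) >= b.get(k, 0) for k in keys)
        and any(a.get(k, 0) > b.get(k, 0) for k in keys)
    )

laptop = {"N1": 2}  # cart {shoes, jacket}, coordinated by Node1
phone = {"N2": 2}   # cart {shoes, hat}, coordinated by Node2

# Neither clock dominates the other, so the writes are concurrent.
assert not dominates(laptop, phone)
assert not dominates(phone, laptop)
print("conflict: concurrent versions, application must merge")
```

Since neither clock dominates, both versions survive and get handed to the application's merge logic, which is exactly what happens next.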
Conflict versions:
Version A (from Node1): {shoes, jacket}
Version B (from Node2): {shoes, hat}

Merge strategy: union
Merged cart: {shoes, jacket, hat}  ← All items preserved

Here's the actual reconciliation code:

from __future__ import annotations
from dataclasses import dataclass, field

class VectorClock:
    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def merge(self, other: "VectorClock") -> "VectorClock":
        """Merged clock = max of each node's counter across both versions."""
        all_keys = set(self.clock) | set(other.clock)
        merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0))
                  for k in all_keys}
        return VectorClock(merged)

    def __repr__(self):
        return f"VectorClock({self.clock})"

@dataclass
class ShoppingCart:
    items: list[str] = field(default_factory=list)
    vector_clock: VectorClock = field(default_factory=VectorClock)

    @staticmethod
    def reconcile(carts: list["ShoppingCart"]) -> "ShoppingCart":
        if len(carts) == 1:
            return carts[0]  # No conflict, nothing to do

        # Merge strategy: union of all items (never lose additions).
        # This is Amazon's choice for shopping carts.
        # A different application might choose last-write-wins or something else.
        all_items: set[str] = set()
        merged_clock = VectorClock()
        for cart in carts:
            all_items.update(cart.items)  # Union: keep everything
            merged_clock = merged_clock.merge(cart.vector_clock)

        return ShoppingCart(items=sorted(all_items), vector_clock=merged_clock)

# Example conflict scenario
cart1 = ShoppingCart(items=["shoes", "jacket"], vector_clock=VectorClock({"N1": 2}))
cart2 = ShoppingCart(items=["shoes", "hat"], vector_clock=VectorClock({"N2": 2}))

# Dynamo detected a conflict and passes both versions to our reconcile()
reconciled = ShoppingCart.reconcile([cart1, cart2])
print(reconciled.items)  # ['hat', 'jacket', 'shoes'] — union!

The Deletion Problem (Why This Gets Tricky)

The union strategy has a nasty edge case: deleted items can come back from the dead.
T=0: Cart: {shoes, hat}
T=1: Customer removes hat → Cart: {shoes}      Clock: [N1:3]
T=2: Network partition — Node2 still has old state
T=3: Some concurrent write to Node2            Clock: [N2:3]
T=4: Network heals → conflict detected
T=5: Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat}

Result: Hat is BACK! Customer removed it, but it reappeared.

Amazon explicitly accepts this trade-off. A "ghost" item in a cart is a minor annoyance. Losing a cart addition during a Black Friday sale is lost revenue.

Engineering depth note: Merge logic must be domain-specific and carefully designed. Adding items is commutative (order doesn't matter) and easy to merge. Removing items is not—a deletion in one concurrent branch may be silently ignored during a union-based merge. This is an intentional trade-off in Dynamo's design, but it means the application must reason carefully about add vs. remove semantics. If your data doesn't naturally support union merges (e.g., a counter, a user's address), you need a different strategy—such as CRDTs, last-write-wins with timestamps, or simply rejecting concurrent writes for that data type.

Read and Write Flow

The diagrams above show the high-level picture, but let's trace exactly what the system does during a read and a write.
Understanding these flows in detail clears up most of the usual confusion.

Write Path

Step-by-step narration of a PUT request:

The client sends a PUT request to any node in the cluster.
The request is routed to the coordinator: the first node in the preference list for the key's hash position on the ring.
Vector clock is updated: the coordinator increments its own counter in the vector clock, creating a new version.
The coordinator writes locally, then sends the write to the remaining N-1 nodes in the preference list in parallel.
The coordinator waits for W acknowledgments. It does NOT wait for all N — just the first W to respond. The remaining nodes that haven't responded yet will get the write eventually (or via hinted handoff if they're down).
Once W ACKs arrive, the coordinator returns 200 OK to the client. From the client's perspective, the write is done.

Key insight about the write path: the client sees success as soon as W nodes acknowledge. The remaining (N - W) nodes receive the write asynchronously. The system is therefore only eventually consistent: all nodes will have the data, just not necessarily at the same moment.

Read Path

Step-by-step narration of a GET request:

The client sends a GET request to the coordinator for the key.
The coordinator sends read requests to all N nodes in the preference list simultaneously (not just R). This is important — it contacts all N, but only needs R to respond.
It waits for R responses. The coordinator proceeds as soon as R nodes reply, without waiting for the slower ones.
The coordinator checks all the vector clocks and compares the versions returned:
If all versions are identical → return the single version immediately.
If one version's clock dominates the others (it's causally "newer") → return that version.
If versions are concurrent (neither clock dominates) → return all versions to the client, which must merge them.

Read repair happens in the background: if the coordinator notices that any node returned a stale version, it sends the latest version to that node so it catches up.

Why does the client receive the conflict instead of the coordinator resolving it? Because Dynamo is a general-purpose storage engine. It doesn't know whether you're storing a shopping cart, a user profile, or a session token. Only your application knows how to merge two conflicting versions in a way that makes business sense. The coordinator hands you the raw concurrent versions along with the vector clock context, and you do the right thing for your use case.

The vector clock context is the key to closing the loop: when the client writes back a merged version, it must include the context (the vector clock it read). This tells Dynamo that the new version has "seen" all the conflicting versions, so the conflict is resolved. Without the context, Dynamo would treat the write as just another concurrent version rather than one that resolves the recorded conflict.

Merkle Trees for Anti-Entropy

The Problem: How Do You Know When Replicas Are Out of Sync?

After a node recovers from a failure, it may have missed many writes. After a network partition, two replicas may have diverged. How does Dynamo detect and repair these differences?

The brute-force approach would be: "Every hour, compare every key on Node A against Node B, and sync anything that's different." But at Amazon's scale, a single node might store hundreds of millions of keys. Comparing them all one by one would be so slow and bandwidth-intensive that it would interfere with normal traffic.

The key idea: instead of comparing individual keys, compare hashes of groups of keys. If the hash matches, that whole group is identical—skip it.
Only drill down into groups where hashes differ. Dynamo uses Merkle trees to solve this efficiently.

Important: Merkle tree sync is a background anti-entropy mechanism. It is not on the hot read/write path. Reads and writes use vector clocks and quorums for versioning. Merkle trees are a repair process that runs periodically in the background to catch any inconsistencies that arise.

How a Merkle Tree Is Built

Each node builds a Merkle tree over its data, organized by key ranges:

Leaf nodes contain the hash of a small range of actual data keys (e.g., hash of all values for keys k1, k2, k3).
Internal nodes contain the hash of their children's hashes.
The root is a single hash representing all the data on the node.

How Two Nodes Sync Using Merkle Trees

When Node A and Node B want to check whether they are in sync:

Step 1: Compare root hashes. If they're the same, everything is identical. Done! (No network traffic for the data itself.)
Step 2: If the roots differ, compare their children. Same hash? Skip that entire region of the key space.
Step 3: Keep descending only into subtrees where hashes differ, until you reach the leaf nodes.
Step 4: Sync only the specific keys in the differing leaf nodes.

Example: Comparing two nodes

Node A root: abc789  ← differs from Node B!
Node B root: abc788

Compare left subtrees:
Node A left: xyz123
Node B left: xyz123  ← same! Skip entire left half.

Compare right subtrees:
Node A right: def456
Node B right: def457  ← differs! Go deeper.
Compare right-left subtree:
Node A right-left: ghi111
Node B right-left: ghi111  ← same! Skip.

Compare right-right subtree:
Node A right-right: jkl222
Node B right-right: jkl333  ← differs! These are leaves.

→ Sync only the keys in the right-right leaf range (e.g., k10, k11, k12)

Instead of comparing all 1 million keys, we compared 6 hashes and synced only 3 keys!

Synchronization process in code:

def sync_replicas(node_a, node_b, key_range):
    """
    Efficiently sync two nodes using Merkle trees.

    Instead of comparing all keys, we compare hashes top-down.
    Only the ranges where hashes differ need actual key-level sync.
    """
    tree_a = node_a.get_merkle_tree(key_range)
    tree_b = node_b.get_merkle_tree(key_range)

    # Step 1: Compare root hashes first.
    # If they match, every key in this range is identical — nothing to do!
    if tree_a.root_hash == tree_b.root_hash:
        return  # Zero data transferred — full match!

    # Step 2: Recursively find differences by traversing top-down.
    # Only descend into subtrees where hashes differ.
    differences = []
    stack = [(tree_a.root, tree_b.root)]
    while stack:
        node_a_subtree, node_b_subtree = stack.pop()
        if node_a_subtree.hash == node_b_subtree.hash:
            continue  # This whole subtree matches — skip it!
        if node_a_subtree.is_leaf:
            # Found a differing leaf — these keys need syncing
            differences.extend(node_a_subtree.keys)
        else:
            # Not a leaf yet — recurse into children
            for child_a, child_b in zip(node_a_subtree.children,
                                        node_b_subtree.children):
                stack.append((child_a, child_b))

    # Step 3: Sync only the specific keys that differed at leaf level.
    # This might be a handful of keys, not millions.
    for key in differences:
        sync_key(node_a, node_b, key)

Why This Is Efficient

The Merkle tree structure means the number of hash comparisons scales with the depth of the tree (logarithmic in the number of keys), not the number of keys themselves.
Node with 1,000,000 keys:

Naive approach:
Compare 1,000,000 keys individually
Cost: 1,000,000 comparisons

Merkle tree:
Compare O(log N) hashes top-down
Tree depth ≈ 20 levels
Cost: 20 comparisons to find differences
Then sync only the differing leaves (~few keys)

Speedup: ~50,000x fewer comparisons!

Moreover, if the two replicas are mostly in sync (which is almost always true in a healthy cluster), the root hashes often match entirely and zero data needs to be transferred. The anti-entropy process is very cheap in the common case.

Membership and Failure Detection

Gossip-based Membership

Dynamo uses a gossip protocol for membership management. Each node periodically exchanges membership information with random peers. There is no master node—all coordination is fully decentralized.

Key Design Points

No single coordinator: every node maintains its own view of the ring and learns about membership changes through gossip.

Failure suspicion vs. detection: Dynamo uses an accrual-based failure detector (similar to Phi Accrual). Rather than a binary "alive/dead" judgment, nodes maintain a suspicion level that grows the longer a peer has been unresponsive. This avoids false positives caused by transient network hiccups.

Node A's view of Node B:
- Last heartbeat: 3 seconds ago  → Suspicion low    → Healthy
- Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded
- Last heartbeat: 60 seconds ago → Suspicion high   → Treat as failed

Decentralized bootstrapping: New nodes contact a seed node to join, then gossip spreads their presence to the rest of the cluster. Ring membership is eventually consistent—different nodes may have slightly different views of the ring momentarily, which is acceptable.

Performance Characteristics: Real Numbers

The paper shares refreshingly concrete performance numbers.
Here are the highlights:

Latency Distribution

Metric              | Average | 99.9th Percentile
--------------------|---------|------------------
Read latency        | ~10ms   | ~200ms
Write latency       | ~15ms   | ~200ms

Key insight: 99.9th percentile is ~20x the average!

Why the huge gap? The 99.9th percentile is dominated by:
Garbage collection pauses
Disk I/O variations
Network jitter
Request queueing and contention

This is why Amazon's SLA is defined at the 99.9th percentile, not the average.

Version Conflicts

From 24 hours of Amazon's production shopping cart traffic (as measured by Dynamo). Note that these numbers reflect Amazon's workload characteristics, not a universal baseline:

99.94%   - Saw exactly one version (no conflict)
0.00057% - Saw 2 versions
0.00047% - Saw 3 versions
0.00009% - Saw 4 versions

Takeaway: conflicts are rare in practice! And most divergent versions came from concurrent automated writers (robots), not from humans.

Partitioning Strategy Evolution

Dynamo evolved through three partitioning strategies. This evolution carries valuable operational lessons:

Strategy 1: Random Tokens (Initial)
Problem: Random token assignment → uneven load
Problem: Adding nodes → expensive data scans
Problem: Can't easily snapshot the system

Operational lesson: random token placement made routine operations, such as adding a node or taking a backup, slow and painful.

Strategy 2: Equal-sized Partitions + Random Tokens
Improvement: Decouples partitioning from placement
Problem: Still has load balancing issues

Strategy 3: Q/S Tokens Per Node - Equal-sized Partitions + Deterministic Placement (Current)

What Q and S mean:

Q = the total number of fixed partitions the ring is divided into (e.g., 1024). Think of these as equal-size slices of the hash space, created up front, that never change.
S = the number of physical servers currently in the cluster (e.g., 8).
Q/S = the number of partitions each server owns (e.g., 1024 / 8 = 128 partitions per server).

The key shift from earlier strategies: the ring is first divided into Q fixed, equal-sized partitions, and then those partitions are assigned evenly to servers. Servers no longer get random positions — they each own exactly Q/S partitions, distributed evenly around the ring.

Example: Q=12 partitions, S=3 servers

Ring divided into 12 equal slices (each covers 30° of the 360° ring):
Partition 1:   0°– 30°  → Server A
Partition 2:  30°– 60°  → Server B
Partition 3:  60°– 90°  → Server C
Partition 4:  90°–120°  → Server A
Partition 5: 120°–150°  → Server B
Partition 6: 150°–180°  → Server C
...and so on, round-robin

Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced.

When a 4th server joins (S becomes 4):
New Q/S = 12/4 = 3 partitions per server.
Each existing server hands off 1 partition to the new server.
Only 3 out of 12 partitions move — the rest are untouched.

The evolution from random tokens to fixed equal-sized partitions follows a classic arc: start with the simplest design, hit operational pain at scale, then redesign around predictability and operability.

Comparing Dynamo to Modern Systems

System | Consistency Model | Use Case | Dynamo Influence
Cassandra | Tunable (N, R, W) | Time-series, analytics | Direct descendant — heavily inspired by Dynamo, uses the same consistent hashing and quorum concepts
Riak | Tunable, vector clocks | Key-value store | Closest faithful Dynamo implementation
Amazon DynamoDB | Eventually consistent by default | Managed NoSQL | ⚠️ Not the same as Dynamo! DynamoDB is a completely different system internally, with no vector clocks and much simpler conflict resolution. Shares the name and high-level inspiration only.
Voldemort | Tunable | LinkedIn's data store | Open-source Dynamo implementation
Google Spanner | Linearizable | Global SQL | Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization
Redis Cluster | Eventually consistent | Caching, sessions | Uses consistent hashing; much simpler conflict resolution

The DynamoDB confusion: Many engineers conflate Amazon DynamoDB with the Dynamo paper. They are different systems. DynamoDB is a managed service built for operational simplicity. It has no vector clocks, does not use the same partitioning scheme, and uses a proprietary consistency model. The paper describes Dynamo, the internal storage system that predates and inspired DynamoDB.
What Dynamo Does NOT Give You

Any honest senior engineer's write-up has to cover the limitations. Here is what Dynamo explicitly does not give you:

No transactions: operations are single-key only. You cannot atomically update multiple keys.
No secondary indexes: you can only retrieve data by its primary key (in the original design).
No joins: It's a key-value store. There is no query language.
No global ordering: different nodes may observe events in different orders.
No linearizability: even with R=W=N, Dynamo does not guarantee linearizable reads. There are no global clocks and no strict serializability.
No automatic conflict resolution: The system detects conflicts and surfaces them to the application. The application must resolve them. If your engineers don't understand this, you will have subtle data bugs.
Repair costs at scale: The anti-entropy process (Merkle tree reconciliation) is not free. At large scale, background repair traffic can be significant.
Vector clock growth: under heavy write churn with many different coordinators, vector clocks can grow large enough to require truncation, which risks discarding legitimate causality information.

Being honest about these limitations is essential to running Dynamo-style systems successfully in production.

A Working Example

Below is a self-contained Python implementation of the core Dynamo concepts. It's intentionally simplified—no actual networking, no persistence—but it faithfully models how vector clocks, the consistent hash ring, quorum reads/writes, and conflict detection interact. Each component is explained before its code.

Part 1: The Vector Clock

The VectorClock class is the foundation of version tracking.
It's just a dictionary mapping node_id → counter. Two key operations:

increment(node) — bump our own counter when we write
dominates(other) — check if one clock is causally "after" another; if neither dominates, the writes were concurrent (a conflict)

from __future__ import annotations
from dataclasses import dataclass, field

class VectorClock:
    """
    Tracks causality across distributed writes.

    A clock like {"nodeA": 2, "nodeB": 1} means:
      - nodeA has coordinated 2 writes
      - nodeB has coordinated 1 write
      - Any version with these counters has "seen" those writes
    """

    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def increment(self, node_id: str) -> "VectorClock":
        """Return a new clock with node_id's counter bumped by 1."""
        new_clock = self.clock.copy()
        new_clock[node_id] = new_clock.get(node_id, 0) + 1
        return VectorClock(new_clock)

    def dominates(self, other: "VectorClock") -> bool:
        """
        Returns True if self is causally AFTER other.

        self dominates other when:
          - Every counter in self is >= the same counter in other, AND
          - At least one counter in self is strictly greater.
        Meaning: self has seen everything other has seen, plus more.
        """
        all_keys = set(self.clock) | set(other.clock)
        at_least_one_greater = False
        for key in all_keys:
            self_val = self.clock.get(key, 0)
            other_val = other.clock.get(key, 0)
            if self_val < other_val:
                return False  # self is missing something other has seen
            if self_val > other_val:
                at_least_one_greater = True
        return at_least_one_greater

    def merge(self, other: "VectorClock") -> "VectorClock":
        """
        Merge two clocks by taking the max of each counter.

        Used after resolving a conflict to produce a new clock that
        has "seen" everything both conflicting versions saw.
""" all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" Isigaba 2: Versioned Value Yonke umgangatho ebhalwe kwi-Dynamo uqhagamshelane kunye ne-vector clock yayo. Le ukuqhagamshelwano ivumela i-coordinator ukuxhaswa i-versions ngexesha lokufunda kunye nokufumana iingxaki. @dataclass class VersionedValue: """ A value paired with its causal history (vector clock). When a client reads, they get back a VersionedValue. When they write an update, they must include the context (the vector clock they read) so Dynamo knows what version they're building on top of. """ value: object vector_clock: VectorClock def __repr__(self): return f"VersionedValue(value={self.value!r}, clock={self.vector_clock})" Part 3: Simulated Node Kwi-Dynamo efanelekileyo, zonke i-nodes ziquka inkqubo eyahlukileyo. Kule nathi sinama-objects e-memory. Izici ebalulekileyo: zonke i-nodes ziquka i-local yayo dict. Nodes ingathintela njenge Ukubonisa iingxaki. storage down class DynamoNode: """ Simulates a single Dynamo storage node. In production this would be a separate server with disk storage. Here it's an in-memory dict so we can demo the logic without networking. """ def __init__(self, node_id: str, token: int): self.node_id = node_id self.token = token # Position on the consistent hash ring self.storage: dict[str, list[VersionedValue]] = {} self.down = False # Toggle to simulate node failures def write(self, key: str, versioned_value: VersionedValue) -> bool: """ Store a versioned value. Returns False if the node is down. We store a LIST of versions per key, because a node might hold multiple concurrent (conflicting) versions until they are resolved by the application. """ if self.down: return False # In a real node this would be written to disk (e.g. 
        self.storage[key] = [versioned_value]
        return True

    def read(self, key: str) -> list[VersionedValue] | None:
        """
        Return all versions of a key. Returns None if the node is down.
        A healthy node with no data for the key returns an empty list.
        """
        if self.down:
            return None
        return self.storage.get(key, [])

    def __repr__(self):
        status = "DOWN" if self.down else f"token={self.token}"
        return f"DynamoNode({self.node_id}, {status})"

Part 4: The Consistent Hash Ring

The ring maps keys to nodes. We hash each key to a position on the ring and walk clockwise to find the coordinator and the rest of the preference list.

import hashlib

class ConsistentHashRing:
    """
    Maps any key to an ordered list of N nodes (the preference list).

    Nodes are placed at fixed positions (tokens) on a conceptual ring
    from 0 to 2^32. A key hashes to a position, then walks clockwise
    to find its nodes.

    This means adding/removing one node only rebalances ~1/N of keys,
    rather than reshuffling everything like modulo hashing would.
    """

    def __init__(self, nodes: list[DynamoNode]):
        # Sort nodes by token so we can do clockwise lookup efficiently
        self.nodes = sorted(nodes, key=lambda n: n.token)

    def _hash(self, key: str) -> int:
        """Consistent hash of a key into the ring's token space."""
        # Use MD5 for a simple, evenly distributed hash.
        # (The Dynamo paper itself also uses MD5, producing a 128-bit ID.)
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % (2**32)

    def get_preference_list(self, key: str, n: int) -> list[DynamoNode]:
        """
        Return the first N nodes clockwise from key's hash position.

        These are the nodes responsible for storing this key.
        The first node in the list is the coordinator — it receives
        the client request and fans out to the others.
""" if not self.nodes: return [] key_hash = self._hash(key) # Find the first node whose token is >= key's hash (clockwise) start_idx = 0 for i, node in enumerate(self.nodes): if node.token >= key_hash: start_idx = i break # If key_hash is greater than all tokens, wrap around to node 0 else: start_idx = 0 # Walk clockwise, collecting N unique nodes preference_list = [] for i in range(len(self.nodes)): idx = (start_idx + i) % len(self.nodes) preference_list.append(self.nodes[idx]) if len(preference_list) == n: break return preference_list Uhlobo lwe-5: I-Dynamo Coordinator Yinto yendalo yekhwalithi - i-logic elandelayo iimfuno ze-client, i-fans kwi-replicas, i-quorum, kunye ne-detecting iingxaki. Ukufundisa oku ngokufanelekileyo; apho zonke iingcebiso ezidlulileyo zihlanganisa. class SimplifiedDynamo: """ Coordinates reads and writes across a cluster of DynamoNodes. Any node can act as coordinator for any request — there's no dedicated master. The coordinator is simply whichever node receives the client request (or the first node in the preference list, if using partition-aware routing). Configuration: N = total replicas per key R = minimum nodes that must respond to a read (read quorum) W = minimum nodes that must acknowledge a write (write quorum) """ def __init__(self, nodes: list[DynamoNode], N: int = 3, R: int = 2, W: int = 2): self.N = N self.R = R self.W = W self.ring = ConsistentHashRing(nodes) # ------------------------------------------------------------------ # # WRITE # # ------------------------------------------------------------------ # def put(self, key: str, value: object, context: VectorClock | None = None) -> VectorClock: """ Write a key-value pair to N replicas, wait for W ACKs. The 'context' is the vector clock from a previous read. Always pass context when updating an existing key — it tells Dynamo which version you're building on top of, so it can detect whether your write is concurrent with anything else. 
        Returns the new vector clock, which the caller should store
        and pass back on future writes to this key.

        Raises: RuntimeError if fewer than W nodes acknowledged.
        """
        preference_list = self.ring.get_preference_list(key, self.N)
        if not preference_list:
            raise RuntimeError("No nodes available")

        # The coordinator is always the first node in the preference list.
        coordinator = preference_list[0]

        # Increment the coordinator's counter in the vector clock.
        # If no context was provided (brand new key), start a fresh clock.
        base_clock = context if context is not None else VectorClock()
        new_clock = base_clock.increment(coordinator.node_id)
        versioned = VersionedValue(value=value, vector_clock=new_clock)

        # Fan out to all N replicas.
        # In a real system these would be concurrent RPC calls.
        # Here we call them sequentially for simplicity.
        ack_count = 0
        for node in preference_list:
            success = node.write(key, versioned)
            if success:
                ack_count += 1

        # Only need W ACKs to declare success.
        # The remaining replicas are updated asynchronously (or via
        # hinted handoff if they were down).
        if ack_count < self.W:
            raise RuntimeError(
                f"Write quorum not met: got {ack_count} ACKs, needed {self.W}"
            )

        print(f"[PUT] key={key!r} value={value!r} clock={new_clock} "
              f"({ack_count}/{self.N} nodes wrote)")
        return new_clock

    # ------------------------------------------------------------------ #
    # READ                                                               #
    # ------------------------------------------------------------------ #
    def get(self, key: str) -> list[VersionedValue]:
        """
        Read a key from N replicas, wait for R responses, reconcile.

        Returns a LIST of VersionedValues:
          - Length 1  → clean read, no conflict
          - Length >1 → concurrent versions detected; application must merge

        After reading, the caller should:
          1. If no conflict: use the single value normally.
          2. If conflict: merge the values using application logic, then
             call put() with the merged value and the merged vector clock
             as context. This "closes" the conflict.
        Read repair happens in the background: any replica that returned a
        stale version is silently updated with the latest version.
        """
        preference_list = self.ring.get_preference_list(key, self.N)

        # Collect responses from all N nodes
        all_versions: list[VersionedValue] = []
        responding_nodes: list[tuple[DynamoNode, list[VersionedValue]]] = []
        for node in preference_list:
            result = node.read(key)
            if result is not None:  # None means the node is down
                all_versions.extend(result)
                responding_nodes.append((node, result))

        if len(responding_nodes) < self.R:
            raise RuntimeError(
                f"Read quorum not met: got {len(responding_nodes)} responses, needed {self.R}"
            )

        # Reconcile: discard any version that is strictly dominated
        # (i.e., is a causal ancestor of) another version.
        # What remains is the set of concurrent versions.
        reconciled = self._reconcile(all_versions)

        # Background read repair: if any node returned something older
        # than the reconciled result, send it the latest version.
        # (Simplified: only meaningful when there's a single winner.)
        if len(reconciled) == 1:
            latest = reconciled[0]
            for node, versions in responding_nodes:
                if not versions or versions[0].vector_clock != latest.vector_clock:
                    node.write(key, latest)  # Repair silently in background

        status = "clean" if len(reconciled) == 1 else f"CONFLICT ({len(reconciled)} versions)"
        print(f"[GET] key={key!r} status={status} "
              f"({len(responding_nodes)}/{self.N} nodes responded)")
        return reconciled

    # ------------------------------------------------------------------ #
    # INTERNAL: VERSION RECONCILIATION                                   #
    # ------------------------------------------------------------------ #
    def _reconcile(self, versions: list[VersionedValue]) -> list[VersionedValue]:
        """
        Remove any version that is a causal ancestor of another version.

        If version A's clock is dominated by version B's clock, then B is
        strictly newer — A adds no new information and can be dropped.
        Whatever remains after pruning are CONCURRENT versions: writes that
        happened without either "knowing about" the other. The application
        must merge these using domain-specific logic.

        Example:
            versions = [clock={A:1}, clock={A:2}, clock={B:1}]
            {A:2} dominates {A:1} → drop {A:1}
            {A:2} and {B:1} are concurrent → both survive
            result = [{A:2}, {B:1}]  ← conflict! application must merge
        """
        dominated = set()
        for i, v1 in enumerate(versions):
            for j, v2 in enumerate(versions):
                if i != j and v2.vector_clock.dominates(v1.vector_clock):
                    dominated.add(i)  # v1 is an ancestor of v2, discard v1

        survivors = [v for i, v in enumerate(versions) if i not in dominated]

        # De-duplicate: identical clocks from different replicas are the same version
        seen_clocks: list[VectorClock] = []
        unique: list[VersionedValue] = []
        for v in survivors:
            if not any(v.vector_clock.clock == s.clock for s in seen_clocks):
                unique.append(v)
                seen_clocks.append(v.vector_clock)

        return unique if unique else versions

Part 6: Putting It All Together — A Demo

We'll walk through a complete scenario: a normal write/read cycle, then a simulated conflict in which two nodes accept concurrent writes and the application has to merge them.

def demo():
    # ── Setup ────────────────────────────────────────────────────────── #
    # Five nodes placed at evenly spaced positions on the hash ring.
    # In a real cluster these would span multiple datacenters.
    nodes = [
        DynamoNode("node-A", token=100),
        DynamoNode("node-B", token=300),
        DynamoNode("node-C", token=500),
        DynamoNode("node-D", token=700),
        DynamoNode("node-E", token=900),
    ]
    dynamo = SimplifiedDynamo(nodes, N=3, R=2, W=2)

    print("=" * 55)
    print("SCENARIO 1: Normal write and read (no conflict)")
    print("=" * 55)

    # Write the initial shopping cart
    ctx = dynamo.put("cart:user-42", {"items": ["shoes"]})

    # Read it back — should be a clean single version
    versions = dynamo.get("cart:user-42")
    print(f"Read result: {versions[0].value}\n")

    # Update the cart, passing the context from our earlier read.
    # The context tells Dynamo "this write builds on top of clock ctx".
    ctx = dynamo.put("cart:user-42", {"items": ["shoes", "jacket"]}, context=ctx)
    versions = dynamo.get("cart:user-42")
    print(f"After update: {versions[0].value}\n")

    print("=" * 55)
    print("SCENARIO 2: Simulated conflict — two concurrent writes")
    print("=" * 55)

    # Write the base version
    base_ctx = dynamo.put("cart:user-99", {"items": ["hat"]})

    # Now simulate a network partition:
    # node-A and node-B can't talk to each other.
    # We model this by writing directly to individual nodes.
    pref_list = dynamo.ring.get_preference_list("cart:user-99", 3)
    node_1, node_2, node_3 = pref_list[0], pref_list[1], pref_list[2]

    # Write 1: customer adds "scarf" via node_1 (e.g., their laptop)
    clock_1 = base_ctx.increment(node_1.node_id)
    node_1.write("cart:user-99", VersionedValue({"items": ["hat", "scarf"]}, clock_1))

    # Write 2: customer adds "gloves" via node_2 (e.g., their phone)
    # This write also descends from base_ctx, not from clock_1.
    # Neither write knows about the other → they are concurrent.
    clock_2 = base_ctx.increment(node_2.node_id)
    node_2.write("cart:user-99", VersionedValue({"items": ["hat", "gloves"]}, clock_2))

    # Read — coordinator sees two concurrent versions and surfaces the conflict
    versions = dynamo.get("cart:user-99")
    if len(versions) > 1:
        print(f"\nConflict detected! {len(versions)} concurrent versions:")
        for i, v in enumerate(versions):
            print(f"  Version {i+1}: {v.value}  clock={v.vector_clock}")

        # Application-level resolution: union merge (Amazon's shopping cart strategy)
        # Merge items: take the union so no addition is lost
        all_items = set()
        merged_clock = versions[0].vector_clock
        for v in versions:
            all_items.update(v.value["items"])
            merged_clock = merged_clock.merge(v.vector_clock)
        merged_value = {"items": sorted(all_items)}
        print(f"\nMerged result: {merged_value}")

        # Write the resolved version back with the merged clock as context.
        # This "closes" the conflict — future reads will see a single version.
        final_ctx = dynamo.put("cart:user-99", merged_value, context=merged_clock)
        versions = dynamo.get("cart:user-99")
        print(f"\nAfter resolution: {versions[0].value}")
        assert len(versions) == 1, "Should be a single version after merge"

if __name__ == "__main__":
    demo()

Expected output:

=======================================================
SCENARIO 1: Normal write and read (no conflict)
=======================================================
[PUT] key='cart:user-42' value={'items': ['shoes']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote)
[GET] key='cart:user-42' status=clean (3/3 nodes responded)
Read result: {'items': ['shoes']}

[PUT] key='cart:user-42' value={'items': ['shoes', 'jacket']} clock=VectorClock({'node-A': 2}) (3/3 nodes wrote)
[GET] key='cart:user-42' status=clean (3/3 nodes responded)
After update: {'items': ['shoes', 'jacket']}

=======================================================
SCENARIO 2: Simulated conflict — two concurrent writes
=======================================================
[PUT] key='cart:user-99' value={'items': ['hat']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote)
[GET] key='cart:user-99' status=CONFLICT (2 versions) (3/3 nodes responded)

Conflict detected!
2 concurrent versions:
  Version 1: {'items': ['hat', 'scarf']}  clock=VectorClock({'node-A': 2})
  Version 2: {'items': ['hat', 'gloves']}  clock=VectorClock({'node-A': 1, 'node-B': 1})

Merged result: {'items': ['gloves', 'hat', 'scarf']}
[PUT] key='cart:user-99' value={'items': ['gloves', 'hat', 'scarf']} ... (3/3 nodes wrote)
[GET] key='cart:user-99' status=clean (3/3 nodes responded)

After resolution: {'items': ['gloves', 'hat', 'scarf']}

What to notice: in Scenario 2 the coordinator correctly determines that the two versions are concurrent. Neither clock dominates the other ({'node-A': 2} and {'node-A': 1, 'node-B': 1} are incomparable), so both survive reconciliation and the conflict is surfaced to the application, which merges the carts and writes the resolution back.

System Design Lessons

After years of working with Dynamo-inspired systems, here are the lessons that matter most:

1. Always-On Beats Strongly-Consistent

For user-facing applications, availability almost always wins. Users will tolerate seeing slightly stale data. They won't tolerate "Service Unavailable."

2. Application-Level Reconciliation Is Powerful

Don't be afraid to push conflict resolution into the application. The application understands the business logic and can make smarter merge decisions than the database ever could.

3. Tunable Consistency Is Essential

Shopping cart additions need high availability (W=1). Financial operations need strong guarantees (W=N). Being able to tune this per operation is enormously valuable.

4. The 99.9th Percentile Matters More Than the Average

Aim your optimization efforts at tail latencies. That is what your heaviest users experience at peak load.

5. Gossip Protocols Scale Beautifully

Decentralized membership via gossip removes single points of failure and coordination bottlenecks.

When Not to Use Dynamo-Style Systems

Be honest about trade-offs.
Don't use this approach when:

- Strong consistency is required (financial transactions, inventory management)
- Complex queries are needed (reporting, analytics, joins)
- Transactions span multiple items (Dynamo is single-key operations only)
- Your team can't reason about eventual consistency (if engineers can't work with vector clocks and conflict resolution, you will run into problems)

Conclusion

Dynamo represents a fundamental rethinking of distributed storage. By embracing eventual consistency and handing trade-off decisions to the application, it makes it possible to build systems that scale to enormous size while maintaining very high availability.

If you use Cassandra, Riak, or DynamoDB, you are benefiting directly from the ideas covered in this article. As engineers, our job is to understand these trade-offs and apply them deliberately. Dynamo is a powerful tool, but like any tool, it is only the right tool in the right context.

Further Reading

- The original Dynamo paper: SOSP 2007
- Werner Vogels' blog on eventual consistency
- Cassandra documentation: how it handles tunable consistency
- "Designing Data-Intensive Applications" by Martin Kleppmann, Chapter 5 on Replication

Appendix: Design Problems and Discussions

Three open-ended problems that come up in system design interviews and real engineering work. Think through each before reading the discussion.

Problem 1: Conflict Resolution for a Collaborative Document Editor

The problem: You're building something like Google Docs backed by a Dynamo-style store. Multiple users edit the same paragraph at the same time. How do you handle conflicts?

The shopping cart strategy (union of all items) is only safe because adding items is commutative. If User A deletes a sentence and User B edits the middle of it, the union of their changes is meaningless or contradictory.
Why shopping cart union doesn't work here: set union is commutative ({A} ∪ {B} = {B} ∪ {A}), but document edits are not.

The right approach: Operational Transformation (OT) or CRDTs

The industry solution is to model the document not as a blob of text but as a sequence of operations, and to transform concurrent operations so they can be applied cleanly without conflicting:

    User A's operation: delete(position=50, length=20)
    User B's operation: insert(position=60, text="new sentence")

    Without OT: B's insert position (60) is now wrong because A deleted 20 chars.
    With OT:    Transform B's operation against A's: B's insert position shifts
                to 40 (60 - 20). Both operations now apply cleanly.

The conflict resolution strategy at the Dynamo layer would be:

1. Store operations (not full document snapshots) as the value for a key.
2. On conflict, retrieve all the concurrent sets of operations.
3. Use OT to merge them into a single operation log.
4. Write the merged log back with the merged vector clock as context.

What to store in Dynamo: the operation log per document segment, not the rendered text. This makes merges deterministic and lossless.

Real-world reference: Google Docs, Notion, and Figma all work this way. Their storage layers use either OT or CRDTs (Conflict-free Replicated Data Types), data structures designed to merge without conflict regardless of operation ordering.

Problem 2: Choosing N, R, W for Different Use Cases

The problem: What configuration would you choose for (a) a session store, (b) a product catalog, (c) user profiles?

A good way to approach this: identify which failure mode hurts most, rejected writes (lost data) or stale reads (inconsistency), then choose quorum settings accordingly.

Session store — prioritize availability

Sessions are ephemeral and user-specific.
If a user's session read is briefly stale, or a session is lost outright, the user simply logs in again. Annoying, but not catastrophic. What you cannot afford is rejecting a session write.

    N=3, R=1, W=1

    Rationale:
    - W=1: Accept session writes even during heavy failures. A user can't
      log in if their session write is rejected.
    - R=1: Read from any single node. Stale session data is harmless.
    - N=3: Still replicate to 3 nodes for basic durability.

    Trade-off accepted: Stale session reads are possible but inconsequential.

Product catalog — prioritize read performance and consistency

Product details are written rarely (by ops teams) but read millions of times a day. Stale product data is damaging, so you want reads that are fast and consistent.

    N=3, R=2, W=3

    Rationale:
    - W=3: All replicas must confirm a catalog update before it's live. A
      price change half-published is worse than a brief write delay.
    - R=2: Read quorum overlap with W=3 guarantees fresh data. Acceptable:
      catalog writes are rare, so write latency doesn't matter.
    - N=3: Standard replication for durability.

    Trade-off accepted: Writes are slow and fail if any node is down.
    Acceptable because catalog updates are infrequent.

User profiles — balanced

Profile data (avatar, email, preferences) changes occasionally. A stale profile is annoying but not dangerous; a lost update (say, a user changing their email address) is a real problem.

    N=3, R=2, W=2

    Rationale:
    - The classic balanced configuration.
    - R + W = 4 > N = 3, so quorums overlap: reads will see the latest write.
    - Tolerates 1 node failure for both reads and writes.
    - Appropriate for data that matters but doesn't require strict consistency.

    Trade-off accepted: A second simultaneous node failure will cause errors.
    Acceptable for non-critical user data.
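The overlap rule behind all three choices (reads see the latest write exactly when R + W > N, by the pigeonhole principle) is easy to check mechanically. A minimal sketch, where the `quorum_overlap` helper and the config list are illustrative and not part of any Dynamo API:

```python
def quorum_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum.

    With N replicas, any R-node read set and any W-node write set are
    guaranteed to share at least one node exactly when R + W > N.
    """
    return r + w > n


# The three configurations discussed above:
configs = {
    "session store":   (3, 1, 1),   # availability first
    "product catalog": (3, 2, 3),   # consistent reads, slow writes
    "user profiles":   (3, 2, 2),   # balanced
}

for name, (n, r, w) in configs.items():
    print(f"{name:15s} N={n} R={r} W={w} -> reads see latest write: "
          f"{quorum_overlap(n, r, w)}")
```

Note that the session store prints False: with R=1 and W=1 on N=3 nodes, a read can hit a replica the write never touched, which is exactly the stale-read trade-off accepted above.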
Decision framework summary:

    Priority              R   W   When to use
    Max availability      1   1   Sessions, ephemeral state, click tracking
    Balanced              2   2   User profiles, preferences, soft state
    Consistent reads      2   3   Catalogs, config, rarely-written reference data
    Highest consistency   3   3   Anywhere you need R+W > N with zero tolerance
                                  for stale reads (still not linearizable)

Problem 3: Testing a Dynamo-Style System Under Partition Scenarios

The problem: How do you verify that your system actually behaves correctly when nodes fail and the network partitions?

This is one of the hardest parts of testing distributed systems, because the interesting bugs only appear under specific interleavings of concurrent operations and faults, and those interleavings are hard to reproduce deterministically.
Layer 1: Unit tests for the logic in isolation

Vector clock comparison is pure, deterministic logic, so test it exhaustively first:

def test_concurrent_clocks_detected_as_conflict():
    clock_a = VectorClock({"node-A": 2})
    clock_b = VectorClock({"node-B": 2})
    assert not clock_a.dominates(clock_b)
    assert not clock_b.dominates(clock_a)
    # Both survive reconciliation → conflict correctly detected

def test_ancestor_clock_is_discarded():
    old_clock = VectorClock({"node-A": 1})
    new_clock = VectorClock({"node-A": 3})
    assert new_clock.dominates(old_clock)
    # old_clock should be pruned during reconciliation

Layer 2: Deterministic fault injection

Instead of waiting for failures to line up at exactly the wrong moment, inject them deliberately and deterministically. In the demo implementation above this is as simple as setting node.down = True. In production systems, tools such as Chaos Monkey do this at the infrastructure level.

Key scenarios to exercise:

Scenario A: Write succeeds with W=2, third replica is down.
  → Verify: the data is readable after the down node recovers.
  → Verify: no data loss occurred.

Scenario B: Two nodes accept concurrent writes to the same key.
  → Verify: the next read surfaces exactly 2 conflicting versions.
  → Verify: after the application writes a merged version, the next read is clean.

Scenario C: Node goes down mid-write (wrote to W-1 nodes).
  → Verify: the write is correctly rejected (RuntimeError).
  → Verify: no partial writes are visible to readers.

Scenario D: All N nodes recover after a full partition.
  → Verify: no data was lost across the cluster.
  → Verify: vector clocks are still meaningful (no spurious conflicts).

Layer 3: Property-based testing

Rather than scripting individual scenarios, state invariants that must always hold and let a test framework generate thousands of random operation sequences that try to falsify them:

    # Invariant: after any sequence of writes and merges, a final get()
    # should always return exactly one version (no unresolved conflicts).

    # Invariant: a value written with a context derived from a previous read
    # should never produce a conflict with that read's version
    # (it should dominate it).

    # Invariant: if R + W > N, a value written successfully should always
    # be visible in the next read (read-your-writes, absent concurrent writes).

Property-based testing libraries for Python let you state these invariants once and automatically search for counterexamples.

Layer 4: Linearizability checkers

Record the start time, end time, and result of every operation during a fault-injection run, then feed that history to a linearizability checker. It can tell you whether the observed history could have been produced by a correct system, even an eventually consistent one operating within its stated guarantees.

Written from the trenches of real systems. Battle-tested lessons, zero hand-waving.
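To make the Layer 4 idea concrete, here is a minimal brute-force checker for a single-register history. It is a sketch under stated assumptions: the `Op` record and `is_linearizable` function are illustrative names, and real checkers use much more efficient search than trying every permutation.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Op:
    kind: str      # "write" or "read"
    value: object  # value written, or value the read returned
    start: float   # time the operation was issued
    end: float     # time the response arrived


def is_linearizable(history: list[Op]) -> bool:
    """Brute-force search for a valid linearization of a register history.

    A total order is valid if (a) it respects real time: an op that finished
    before another started must come first, and (b) every read returns the
    value of the most recent preceding write (None if there was no write).
    Exponential in history size — only suitable for tiny test histories.
    """
    for order in permutations(history):
        # (a) real-time constraint: if b finished before a started,
        #     b cannot appear after a in the order.
        if any(b.end < a.start
               for i, a in enumerate(order)
               for b in order[i + 1:]):
            continue
        # (b) register semantics: each read sees the latest prior write.
        current, ok = None, True
        for op in order:
            if op.kind == "write":
                current = op.value
            elif op.value != current:
                ok = False
                break
        if ok:
            return True
    return False


# Two overlapping writes, then a read that returned the "older" value.
# Still linearizable: the writes were concurrent, so write-2-then-write-1
# is a legal order.
h = [Op("write", 1, 0.0, 2.0), Op("write", 2, 1.0, 3.0), Op("read", 1, 4.0, 5.0)]
print(is_linearizable(h))  # prints True
```

A history that reads a value nobody wrote, or reads a value whose write unambiguously happened after the read completed, makes the search fail and returns False, which is precisely the signal a fault-injection harness looks for.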