A senior engineer’s perspective on building highly available distributed systems

Table of Contents
- Introduction: Why Dynamo Changed Everything
- The CAP Theorem Trade-off
- Core Architecture Components
  - Consistent Hashing for Partitioning
  - Replication Strategy (N, R, W)
  - Vector Clocks for Versioning
  - Sloppy Quorum and Hinted Handoff
- Conflict Resolution: The Shopping Cart Problem
- Read and Write Flows
- Merkle Trees for Anti-Entropy
- Membership and Failure Detection
- Performance: The Real Numbers
- The Evolution of the Partitioning Strategy
- Comparing Dynamo to Modern Systems
- What Dynamo Won't Give You
- A Practical Implementation Example
- Key Lessons for System Design
- When to Use a Dynamo-Style System
- Conclusion
- Appendix: Design Problems and Techniques

This is a long read. Each section has its own anchor, so feel free to jump straight to whichever part is relevant to you.

Introduction: Why Dynamo Changed Everything

When Amazon published the Dynamo paper in 2007, it was not just another academic exercise. It was a battle-tested solution to real problems at massive scale. I remember reading the paper for the first time; it fundamentally changed how I think about distributed systems.

Dynamo is a distributed key-value storage system. It was built to power Amazon's high-traffic services such as the shopping cart and session management systems. No secondary indexes, no joins, no relational semantics, just keys and values, with an overriding focus on availability and scalability. There is no guarantee of linearizability or global ordering, even in the strictest configurations. If your system needs those properties, Dynamo is not the right tool.
The core problem Amazon faced was simple to state but brutal to solve: when a customer tries to add an item to their shopping cart during a network partition or server failure, rejecting that write is unacceptable. How do you build a storage system that never says "no" to customers?

The CAP Theorem Trade-off: Why Dynamo Chose Availability

Before studying how Dynamo works, you need to understand the fundamental constraint it was designed around.

What does the CAP theorem say? It describes a fundamental tension in distributed systems: when a network partition occurs, you must choose between consistency and availability.

- Consistency (C): all nodes see the same data at the same time
- Availability (A): every request receives a response (success or failure)
- Partition tolerance (P): the system keeps working despite network failures

The common framing is "pick 2 of 3", but that is an oversimplification. In practice, network partitions are unavoidable at scale, so the real design decision is: when partitions occur (and they will), do you sacrifice consistency or availability?

The harsh reality: partitions will happen. Cables get cut, switches fail, datacenters lose connectivity. You cannot opt out, so you must choose: consistency or availability?

Traditional databases choose consistency:

Traditional approach
Database: "I can't guarantee all replicas are consistent, so I'll reject your write to be safe."
Result: Customer sees error, cart is empty
Impact: Lost revenue, poor experience

Dynamo chooses availability:

Dynamo's approach
Dynamo: "I'll accept your write with the replicas I can reach. The unreachable replica will catch up later."
Result: Customer sees success, item in cart
Impact: Sale continues, happy customer

The trade-off, visualized:

When a partition occurs:

Traditional Database: Choose C over A → Sacrifice Availability
- ✓ All replicas always have same data
- ✓ No conflicts to resolve
- ❌ Rejects writes during failures
- ❌ Poor customer experience
- ❌ Lost revenue

Dynamo: Choose A over C → Sacrifice Strong Consistency
- ✓ Accepts writes even during failures
- ✓ Excellent customer experience
- ✓ No lost revenue
- ❌ Replicas might temporarily disagree
- ❌ Application must handle conflicts

A real Amazon example: the Black Friday shopping cart

Imagine Black Friday. Millions of customers are shopping. A network partition splits the datacenters.

With traditional database:

Time: 10:00 AM - Network partition occurs
Result:
- All shopping cart writes fail
- "Service Unavailable" errors
- Customers can't checkout
- Twitter explodes with complaints
- Estimated lost revenue: $100,000+ per minute

With Dynamo:

Time: 10:00 AM - Network partition occurs
Result:
- Shopping cart writes continue
- Customers see success
- Some carts might have conflicts (rare)
- Application merges conflicting versions
- Estimated lost revenue: $0
- A few edge cases need conflict resolution (acceptable)

Why this choice makes sense for e-commerce

Amazon did the math:
- Cost of rejecting a write: one lost sale ($50-200)
- Cost of accepting conflicting writes: occasionally having to merge a shopping cart (rare, easy to fix)
- Business decision: accept the writes, handle the rare conflicts

Types of data where Availability > Consistency:
- Shopping carts (conflicting additions can be merged)
- Session data (last-write-wins is fine)
- User preferences (eventual consistency is acceptable)
- Best-seller lists (approximate is fine)

Types of data where Consistency > Availability:
- Bank account balances (conflicting balances are unacceptable)
- Inventory counts (you can't oversell)
- Order processing (operations must be ordered)

That is why Dynamo is not for everything, but for Amazon's e-commerce use cases, choosing availability over strong consistency was the right call.

Important nuance: although Dynamo is often described as an AP system, it is more accurately a tunable consistency system. Depending on your R and W quorum configuration, it can operate closer to CP. The AP label applies to its default/recommended configuration, which was designed for e-commerce workloads.

Core Architecture Components

1. Consistent Hashing for Partitioning

Let me explain this with a concrete example, because consistent hashing is one of those concepts that seems magical until you see it in action.

The Problem: Traditional Hash-Based Sharding

Imagine you have 3 servers and want to distribute data across them.

# Traditional approach - DON'T DO THIS
def get_server(key, num_servers):
    hash_value = hash(key)
    return hash_value % num_servers  # Modulo operation

# With 3 servers:
get_server("user_123", 3)  # Returns server 0
get_server("user_456", 3)  # Returns server 1
get_server("user_789", 3)  # Returns server 2

This works... until you add or remove a server. Let's look at what happens when we go from 3 servers to 4:

# Before (3 servers):
"user_123" → hash % 3 = 0 → Server 0
"user_456" → hash % 3 = 1 → Server 1
"user_789" → hash % 3 = 2 → Server 2

# After (4 servers):
"user_123" → hash % 4 = 0 → Server 0 ✓ (stayed)
"user_456" → hash % 4 = 1 → Server 1 ✓ (stayed)
"user_789" → hash % 4 = 2 → Server 2 ✓ (stayed)

# But wait - this is lucky!
In reality, most keys MOVE:
"product_ABC" → hash % 3 = 2 → Server 2
"product_ABC" → hash % 4 = 3 → Server 3 ✗ (MOVED!)

The disaster: when you change the number of servers, almost all of your data has to be redistributed.

The Solution: Consistent Hashing

Consistent hashing fixes this by treating the hash space as a ring (0 to 2^32 - 1, wrapping around).

Step 1: Place servers on the ring

Each server is assigned a random position on the ring (known as its "token"). Think of it as placing marks on a circular track.

Step 2: Place data on the ring

To store a piece of data:
1. Hash the key to get a position on the ring
2. Walk clockwise around the ring
3. Store the data on the first server you encounter

A concrete example: the full picture

Simple rule: the key travels clockwise until it reaches a server.

Examples:
- user_123 hashes to 30° → walks to 45° → Server A owns it
- user_456 hashes to 150° → walks to 200° → Server C owns it
- cart_789 hashes to 250° → walks to 280° → Server D owns it
- product_ABC hashes to 300° → walks past 360°, wraps to 0°, continues to 45° → Server A owns it

Who owns what range?
- Server A (45°): owns everything from 281° to 45° (wrapping around)
- Server B (120°): owns everything from 46° to 120°
- Server C (200°): owns everything from 121° to 200°
- Server D (280°): owns everything from 201° to 280°

The magic: adding a server

Now let's see why this is great. We add Server E at position 160°:

BEFORE:
Server A (45°) → owns 281°-45°
Server B (120°) → owns 46°-120°
Server C (200°) → owns 121°-200° ← THIS RANGE WILL SPLIT
Server D (280°) → owns 201°-280°

AFTER:
Server A (45°) → owns 281°-45° ← NO CHANGE
Server B (120°) → owns 46°-120° ← NO CHANGE
Server E (160°) → owns 121°-160° ← NEW!
Takes part of C's range
Server C (200°) → owns 161°-200° ← SMALLER range
Server D (280°) → owns 201°-280° ← NO CHANGE

Result: only keys in the 121°-160° range need to move (from C to E). Servers A, B, and D are completely unaffected!

The Virtual Nodes Refinement

There is a serious problem with the basic approach: random distribution can be extremely uneven.

The Problem in Detail: when you randomly assign one position per server, you are essentially throwing darts at a circular board. Sometimes the darts cluster together, sometimes they spread out. This creates hotspots. Let me show you a concrete example:

Scenario: 4 servers with single random tokens
Server A: 10°  }
Server B: 25°  } ← Only 15° apart! Tiny ranges
Server C: 100° }
Server D: 280°   ← 180° away from C! Huge range

Range sizes:
- Server A owns: 281° to 10° = 89° (25% of ring)
- Server B owns: 11° to 25° = 14° (4% of ring) ← Underutilized!
- Server C owns: 26° to 100° = 74° (21% of ring)
- Server D owns: 101° to 280° = 179° (50% of ring) ← Overloaded!

Real-world consequences:

Uneven load: Server D handles 50% of all the data while Server B handles only 4%. Server D's CPU, disk, and network are maxed out. Server B sits largely idle (wasted capacity). Your 99.9th-percentile latency is dominated by Server D being overloaded.

Cascading hotspots: when Server D becomes slow or fails, all 50% of its load shifts to its clockwise successor, Server A. Server A now becomes overloaded, and system performance collapses in a cascade.

Inefficient scaling: adding servers does not reliably help, because a new server may land in an already-small range.

Dynamo's solution: each physical server gets multiple virtual positions (tokens). Instead of one dart per server, throw many. The more darts you throw, the more even the distribution becomes (the law of large numbers).
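The clockwise-walk rule and virtual tokens can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: I use MD5 as the ring hash and `bisect` for the clockwise search, and `ConsistentHashRing` and its parameters are names invented for this example.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (0 .. 2**32 - 1)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)


class ConsistentHashRing:
    def __init__(self, servers, vnodes_per_server=8):
        self.tokens = []   # sorted ring positions
        self.owner = {}    # token -> physical server
        for server in servers:
            for i in range(vnodes_per_server):
                token = ring_hash(f"{server}#vnode{i}")
                self.owner[token] = server
                bisect.insort(self.tokens, token)

    def lookup(self, key: str) -> str:
        """First server clockwise from the key's position (wrapping past 0)."""
        pos = ring_hash(key)
        idx = bisect.bisect_right(self.tokens, pos) % len(self.tokens)
        return self.owner[self.tokens[idx]]


# Adding a fourth server only moves the keys that fall into D's new ranges.
ring3 = ConsistentHashRing(["A", "B", "C"])
ring4 = ConsistentHashRing(["A", "B", "C", "D"])
keys = [f"user_{i}" for i in range(10_000)]
moved = sum(ring3.lookup(k) != ring4.lookup(k) for k in keys)
# `moved` is roughly a quarter of the keys: exactly the keys D now owns.
```

Because the old tokens stay in place when D's tokens are added, a key changes owner only if its clockwise successor is now one of D's tokens, which is the locality property the section describes.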
How Virtual Nodes Fix the Problem:

Let's take the same four servers, but now each server gets 3 virtual nodes (tokens) instead of 1:

Physical Server A gets 3 tokens: 10°, 95°, 270°
Physical Server B gets 3 tokens: 25°, 180°, 310°
Physical Server C gets 3 tokens: 55°, 150°, 320°
Physical Server D gets 3 tokens: 75°, 200°, 340°

Now the ring looks like:
10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B, 200° D, 270° A, 310° B, 320° C, 340° D

Range sizes (approximately):
- Server A total: 15° + 55° + 40° = 110° (31% of ring)
- Server B total: 30° + 20° + 30° = 80° (22% of ring)
- Server C total: 20° + 30° + 20° = 70° (19% of ring)
- Server D total: 20° + 70° + 20° = 110° (31% of ring)

Load share now ranges from 19% to 31% instead of 4% to 50%. Much better!

Why this works:

Statistics: with more samples (tokens), the random distribution averages out. This is the law of large numbers in action.

Granular load distribution: when a server fails, its load is distributed across many servers, not just one neighbor:

Server A fails:
- Its token at 10° → load shifts to Server B's token at 25°
- Its token at 95° → load shifts to Server C's token at 150°
- Its token at 270° → load shifts to Server B's token at 310°
Result: The load is spread across multiple servers!

Smooth scaling: when adding a new server with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server.

Real Dynamo configurations: the paper discusses different strategies that evolved over time. In production:
- Initial version: 100-200 virtual nodes per physical server
- Later optimized to: Q/S tokens per node (where Q = total partitions, S = number of servers)
- Typical setup: each physical server might have 128-256 virtual nodes

The Trade-off: Balance vs Overhead

More virtual nodes mean better load distribution, but there is a cost.
What you gain with more virtual nodes:

With 1 token per server (4 servers):
Load variance: 4% to 50% (±46% difference) ❌

With 3 tokens per server (12 virtual nodes):
Load variance: 19% to 31% (±12% difference) ✓

With 128 tokens per server (512 virtual nodes):
Load variance: 24% to 26% (±2% difference) ✓✓

What it costs:

Metadata size: every node stores routing information.
- 1 token per server: track 4 positions
- 128 tokens per server: track 512 positions

Gossip overhead: nodes exchange membership info periodically. Every second, nodes gossip their view of the ring. More tokens means more data to exchange between nodes.

Rebalancing granularity: when nodes join or leave, more virtual nodes mean more partial transfers to coordinate, but each transfer is smaller (which is actually good for bootstrapping).

Dynamo's evolution: the paper describes how Amazon refined this over time:

Strategy 1 (Initial):
- 100-200 random tokens per server
- Problem: Huge metadata (multiple MB per node)
- Problem: Slow bootstrapping (had to scan for specific key ranges)

Strategy 3 (Current):
- Q/S tokens per server (Q=total partitions, S=number of servers)
- Equal-sized partitions
- Example: 1024 partitions / 8 servers = 128 tokens per server
- Benefit: Metadata reduced to KB
- Benefit: Fast bootstrapping (transfer whole partition files)

Real production sweet spot: most Dynamo deployments use 128-256 virtual nodes per physical server. This achieves:
- Load distribution within 10-15% variance (good enough)
- Metadata well under 100KB per node (negligible)
- Fast failure recovery (transfers spread across many nodes)

Why not more? Going from 128 to 512 tokens improves load balance by only 2-3%, but doubles metadata size and gossip traffic.
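The variance numbers above are easy to reproduce with a quick simulation. The sketch below uses MD5-derived "random" token placement, so the exact percentages will differ from the hand-picked degrees in the example, but the trend (more tokens per server, tighter spread) is the point. All names here are my own.

```python
import hashlib

RING = 2 ** 32


def ring_pos(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % RING


def load_shares(servers, tokens_per_server):
    """Fraction of the ring owned by each server.

    Each token is charged for the arc from itself to the next token
    clockwise; the direction convention does not affect the spread.
    """
    tokens = sorted(
        (ring_pos(f"{s}/token{i}"), s)
        for s in servers
        for i in range(tokens_per_server)
    )
    shares = {s: 0.0 for s in servers}
    wrapped = tokens[1:] + [(tokens[0][0] + RING, None)]
    for (pos, s), (next_pos, _) in zip(tokens, wrapped):
        shares[s] += (next_pos - pos) / RING
    return shares


servers = ["A", "B", "C", "D"]
for t in (1, 8, 128):
    sh = load_shares(servers, t)
    spread = max(sh.values()) - min(sh.values())
    print(f"{t:>3} tokens/server: max-min spread = {spread:.1%}")
```

With 128 tokens per server the spread settles within a few percent of the ideal 25% per server, which is exactly the averaging effect the law of large numbers predicts.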
Key concept: each physical server (high level) maps to many virtual positions (low level) on the ring. This spreads every server's load across different parts of the hash space.

Benefits:
- More even distribution
- When a server fails, its load is absorbed by many servers (not just one neighbor)
- When a server joins, it takes a small slice from many servers

A Real-World Migration Comparison

Let's compare the approaches with concrete numbers:

Traditional Hashing (3 servers → 4 servers):
- Keys that need to move: ~75% (3 out of 4)
- Example: 1 million keys → 750,000 keys must migrate

Consistent Hashing (3 servers → 4 servers):
- Keys that need to move: ~25% (1 out of 4)
- Example: 1 million keys → 250,000 keys must migrate

With Virtual Nodes (150 vnodes total → 200 vnodes):
- Keys that need to move: ~12.5% (spread evenly)
- Example: 1 million keys → 125,000 keys must migrate
- Load is balanced across all servers

The "Aha!" Moment

The key insight is this: consistent hashing decouples the hash space from the number of servers.

Traditional:
server = hash(key) % num_servers  ← num_servers is in the formula!

Consistent hashing:
server = ring.findNextClockwise(hash(key))  ← num_servers is NOT in the formula!

This is why adding or removing servers only affects a small fraction of the data. Hash values never change; only which server "owns" which range changes, and only locally.

Think of it like a circular road with water stations (servers). If you add a new station, drivers only switch stations if they fall between the new station and the old one next to it.

2. Replication Strategy (N, R, W)

The Problem: Availability vs Consistency Trade-off

Imagine you are building Amazon's shopping cart.
A customer adds an item to their cart, but at that exact moment:
- One server is rebooting for maintenance
- Another server has a network hiccup
- A third server is perfectly healthy

Traditional database approach (strong consistency):

Client: "Add this item to my cart"
Database: "I need ALL replicas to confirm before I say yes"
Server 1: ✗ (rebooting)
Server 2: ✗ (network issue)
Server 3: ✓ (healthy)
Result: "Sorry, service unavailable. Try again later."

Customer experience: "I can't add items to my cart during Black Friday?!"

This is unacceptable for e-commerce. Every rejected write is lost revenue.

Dynamo's Solution: Tunable Quorums

Dynamo gives you three knobs to dial in exactly the trade-off you want:

N: number of replicas (how many copies of the data exist)
R: read quorum (how many replicas must respond for a successful read)
W: write quorum (how many replicas must acknowledge for a successful write)

The magic formula: R + W > N

When R + W > N, the read and write quorums are guaranteed to overlap: at least one node that received the write will be consulted on any subsequent read. That overlap makes it possible to detect the latest version, provided the reconciliation logic correctly identifies the most recent vector clock.

Let me show you why this matters with real scenarios:

Scenario 1: Shopping Cart (Maximize Availability)

N = 3  # Three replicas for durability
R = 1  # Read from any single healthy node
W = 1  # Write to any single healthy node

# Trade-off analysis:
# ✓ Writes succeed even if 2 out of 3 nodes are down
# ✓ Reads succeed even if 2 out of 3 nodes are down
# ✓ Maximum availability - never reject customer actions
# ✗ Might read stale data
# ✗ Higher chance of conflicts (but we can merge shopping carts)

What happens during failure:

Client: "Add item to cart"
Coordinator tries N=3 nodes:
- Node 1: ✗ Down
- Node 2: ✓ ACK (W=1 satisfied!)
- Node 3: Still waiting...
Result: SUCCESS returned to client immediately
Node 3 eventually gets the update (eventual consistency)

Scenario 2: Session State (Balanced Approach)

N = 3
R = 2  # Must read from 2 nodes
W = 2  # Must write to 2 nodes

# Trade-off analysis:
# ✓ R + W = 4 > N = 3 → Read-your-writes guaranteed
# ✓ Tolerates 1 node failure
# ✓ Good balance of consistency and availability
# ✗ Write fails if 2 nodes are down
# ✗ Read fails if 2 nodes are down

Why R + W > N enables read-your-writes:

Write to W=2 nodes: [A, B]
Later, read from R=2 nodes: [B, C]

Because W + R = 4 > N = 3, there's guaranteed overlap! At least one node (B in this case) will have the latest data. The coordinator detects the newest version by comparing vector clocks. This guarantees seeing the latest write as long as reconciliation picks the causally most-recent version correctly.

Scenario 3: Financial Data (Maximize Consistency)

N = 3
R = 3  # Must read from ALL nodes
W = 3  # Must write to ALL nodes

# Trade-off analysis:
# ✓ Full replica quorum — reduces likelihood of divergent versions
# ✓ Any read will overlap every write quorum
# ✗ Write fails if ANY node is down
# ✗ Read fails if ANY node is down
# ✗ Poor availability during failures

Systems that need strong transactional guarantees usually choose CP systems instead. This configuration is technically supported by Dynamo, but it sacrifices the availability properties that motivate using Dynamo in the first place.
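The overlap argument behind R + W > N can be verified exhaustively for small N. Here is a brute-force check (my own helper, not part of Dynamo) that enumerates every possible read quorum and write quorum and confirms they intersect exactly when the formula holds:

```python
from itertools import combinations


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True iff every possible read quorum of size r shares at least one
    node with every possible write quorum of size w (the R + W > N rule)."""
    nodes = range(n)
    return all(
        set(rq) & set(wq)
        for rq in combinations(nodes, r)
        for wq in combinations(nodes, w)
    )


assert quorums_overlap(3, 2, 2)      # R + W = 4 > 3 → overlap guaranteed
assert not quorums_overlap(3, 1, 1)  # R + W = 2 ≤ 3 → a stale read is possible
```

The pigeonhole argument is the same one the text makes: if r + w > n, two subsets of sizes r and w cannot fit into n nodes without sharing at least one.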
Configuration Comparison Table

Config             N  R  W  Availability      Consistency   Use Case
High Availability  3  1  1  ⭐⭐⭐⭐⭐        ⭐⭐          Shopping cart, wish list
Balanced           3  2  2  ⭐⭐⭐⭐          ⭐⭐⭐⭐      Session state, user preferences
Full Quorum        3  3  3  ⭐⭐              ⭐⭐⭐⭐⭐    High-stakes reads (not linearizable)
Read-Heavy         3  1  3  ⭐⭐⭐ (reads)    ⭐⭐⭐⭐      Product catalog, CDN metadata
Write-Heavy        3  3  1  ⭐⭐⭐ (writes)   ⭐⭐⭐        Click tracking, metrics

Note on financial systems: systems requiring strong transactional guarantees (e.g., bank account balances) typically shouldn't use Dynamo. That said, some financial systems do build on Dynamo-style storage for their persistence layer while enforcing stronger semantics at the application or business-logic layer.

The Key Insight

Most production systems use N=3, R=2, W=2 because:
- Durability: can tolerate up to 2 replica failures without permanent data loss (assuming independent failures and no correlated outages)
- Availability: tolerates 1 node failure for both reads and writes
- Consistency: R + W > N guarantees that read and write quorums overlap, enabling read-your-writes behavior in the absence of concurrent writes
- Performance: you never wait for the slowest node (you only need 2 out of 3)

Real production numbers from the paper, from Amazon's shopping cart service during peak (holiday season):
- Configuration: N=3, R=2, W=2
- Handled tens of millions of requests
- Over 3 million checkouts in a single day
- No data loss, even with server failures

This tunable approach is what made Dynamo revolutionary.
You're not stuck with one-size-fits-all: you tune it based on your actual business requirements.

3. Vector Clocks for Versioning

The Problem: Detecting Causality in Distributed Systems

When multiple nodes can accept writes independently, you need to answer a critical question: are these two versions of the same data related, or were they created concurrently?

Why timestamps don't work:

Scenario: Two users edit the same shopping cart simultaneously
User 1 at 10:00:01.500 AM: Adds item A → Writes to Node X
User 2 at 10:00:01.501 AM: Adds item B → Writes to Node Y

Physical timestamp says: User 2's version is "newer"
Reality: These are concurrent! Both should be kept!

Problem:
- Clocks on different servers are NEVER perfectly synchronized
- Clock skew can be seconds or even minutes
- Network delays are unpredictable
- Physical time doesn't capture causality

What we really need to know:
- Version A happened before Version B? → B can overwrite A
- Version A and B are concurrent? → Keep both, merge later
- Version A came from reading Version B? → We can track this!

The Solution: Vector Clocks

A vector clock is a simple data structure: a list of (node_id, counter) pairs that tracks which nodes have seen which versions.

The rules:
- When a node writes data, it increments its own counter
- When a node reads data, it receives the current vector clock
- When comparing two vector clocks:
  - If all counters in A ≤ the counters in B → A is an ancestor of B (B is newer)
  - If some counters in A > B and some in B > A → A and B are concurrent (conflict!)

Step-by-Step Example

Let's follow a shopping cart through multiple updates:

Breaking down the conflict:

D3: [Sx:2, Sy:1]  vs  D4: [Sx:2, Sz:1]

Comparing:
- Sx: 2 == 2 ✓ (equal)
- Sy: 1 vs missing in D4 → D3 has something D4 doesn't
- Sz: missing in D3 vs 1 → D4 has something D3 doesn't

Conclusion: CONCURRENT! Neither is an ancestor of the other.
Both versions must be kept and merged.
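The comparison rule above is mechanical enough to write down directly. A minimal sketch, with clocks represented as plain dicts of node id to counter (a missing entry counts as 0); `compare` is my own helper name:

```python
def compare(a: dict, b: dict) -> str:
    """Return 'ancestor' if a ≤ b, 'descendant' if a ≥ b,
    'equal' if identical, else 'concurrent' (a conflict)."""
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "ancestor"    # b is newer; b may overwrite a
    if b_le_a:
        return "descendant"  # a is newer
    return "concurrent"      # conflict: keep both versions


# The D3 vs D4 example from above:
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
assert compare(d3, d4) == "concurrent"
```

Treating absent entries as 0 is what makes "Sy: 1 vs missing in D4" count as evidence that D3 is not an ancestor of D4, and symmetrically for Sz.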
Real-world implications

The Dynamo paper reports the following conflict distribution, measured over 24 hours of Amazon's production shopping cart traffic. These numbers reflect Amazon's specific workload (a high read/write ratio, mostly single-user sessions) and should not be assumed for every Dynamo deployment:

99.94%   - Single version (no conflict)
0.00057% - 2 versions
0.00047% - 3 versions
0.00009% - 4 versions

Key insight: conflicts are extremely rare in practice!

Why conflicts happen:
- Usually not from network failures
- Mostly from concurrent writers (often automated processes/bots)
- Human users rarely create conflicts because they are slow compared to network speeds

The Size Problem

A vector clock can grow without bound if many nodes coordinate writes. Dynamo's solution: truncate the oldest entries once the clock exceeds a size threshold.

// When vector clock exceeds threshold (e.g., 10 entries)
// Remove the oldest entry based on wall-clock timestamp
vectorClock = {
  'Sx': {counter: 5, timestamp: 1609459200},
  'Sy': {counter: 3, timestamp: 1609459800},
  'Sz': {counter: 2, timestamp: 1609460400},
  // ... 10 more entries
}
// If size > 10, remove entry with oldest timestamp

// ⚠ Risk: Dropping an entry collapses causality information.
// Two versions that were causally related may now appear
// concurrent, forcing the application to resolve a conflict
// that didn't actually exist. In practice, Amazon reports
// this has not been a significant problem — but it is a
// real theoretical risk in high-churn write environments
// with many distinct coordinators.

4. Sloppy Quorum and Hinted Handoff

The Problem: Strict Quorums Kill Availability

Traditional quorum systems are rigid and unforgiving.
Traditional strict quorum:

Your data is stored on nodes: A, B, C (preference list)
Write requirement: W = 2

Scenario: Node B is down for maintenance
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✓
Result: SUCCESS (got 2 out of 3)

Scenario: Nodes B AND C are down
Coordinator: "I need to write to 2 nodes from {A, B, C}"
Tries: A ✓, B ✗ (down), C ✗ (down)
Result: FAILURE (only got 1 out of 3)
Customer: "Why can't I add items to my cart?!" 😡

The bottom line: strict quorums require specific nodes. If those specific nodes are down, the system becomes unavailable.

Real scenario at Amazon:

Black Friday, 2:00 PM
- Datacenter 1: 20% of nodes being rebooted (rolling deployment)
- Datacenter 2: Network hiccup (1-2% packet loss)
- Traffic: 10x normal load

With strict quorum:
- 15% of write requests fail
- Customer support phones explode
- Revenue impact: Millions per hour

The Solution: Sloppy Quorum

Dynamo relaxes the quorum requirement: "Write to the first N healthy nodes in the preference list, walking further down the ring if needed."

Preference list for key K: A, B, C
But B is down...

Sloppy Quorum says: "Don't give up! Walk further down the ring: A, B, C, D, E, F, ..."
Coordinator walks until N=3 healthy nodes are found: A, C, D
(D is a temporary substitute for B)

Hinted Handoff

When a node temporarily stands in for a failed node, it stores a "hint" along with the data.
Detailed Hinted Handoff Process

Step 1: Detect failure and substitute

def write_with_hinted_handoff(key, value, N, W):
    preference_list = get_preference_list(key)  # [A, B, C]
    healthy_nodes = []
    for node in preference_list:
        if is_healthy(node):
            healthy_nodes.append((node, False))  # regular replica, no hint

    # If we don't have N healthy nodes, expand the list
    if len(healthy_nodes) < N:
        extended_list = get_extended_preference_list(key)
        for node in extended_list:
            if node not in preference_list and is_healthy(node):
                healthy_nodes.append((node, True))  # substitute, carries a hint
            if len(healthy_nodes) >= N:
                break

    # Write to the first N healthy nodes
    acks = 0
    for node, is_hint in healthy_nodes[:N]:
        if is_hint:
            # Store with hint metadata pointing at the intended node
            intended_node = find_intended_node(preference_list, node)
            success = node.write_hinted(key, value, hint=intended_node)
        else:
            success = node.write(key, value)
        if success:
            acks += 1
    if acks >= W:
        return SUCCESS
    return FAILURE

Step 2: Background hint transfer

# Runs periodically on each node (e.g., every 10 seconds)
def transfer_hints():
    hints_db = get_hinted_replicas()
    for hint in hints_db:
        intended_node = hint.intended_for
        if is_healthy(intended_node):
            try:
                intended_node.write(hint.key, hint.value)
                hints_db.delete(hint)
                log(f"Successfully transferred hint to {intended_node}")
            except Exception:
                log(f"Will retry later for {intended_node}")

Why This Is Brilliant

Durability maintained:

Even though B is down:
- We still have N=3 copies: A, C, D
- Data won't be lost even if another node fails
- System maintains durability guarantee

Availability maximized:

Client perspective:
- Write succeeds immediately
- No error message
- No retry needed
- Customer happy

Traditional quorum would have failed:
- Only 2 nodes available (A, C)
- Need 3 for N=3
- Write rejected
- Customer sees error

Eventual consistency:

Timeline:
T=0:           Write succeeds (A, C, D with hint)
T=0-5min:      B is down, but system works fine
T=5min:        B recovers
T=5min+10sec:  D detects B is back, transfers hint
T=5min+11sec:  B has the data, D deletes the hint
Result:        Eventually, all correct replicas have the data

Configuration Example

// High availability configuration
const config = {
  N: 3,  // Want 3 replicas
  W: 2,  // Only need 2 ACKs to succeed
  R: 2,  // Read from 2 nodes

  // Sloppy quorum allows expanding the preference list
  sloppy_quorum: true,

  // How far to expand when looking for healthy nodes
  max_extended_preference_list: 10,

  // How often to check for hint transfers (seconds)
  hint_transfer_interval_seconds: 10,

  // How long to keep trying to transfer hints (days)
  hint_retention_days: 7
};

Real-World Impact

From Amazon's production experience:

During normal operation:
- Hints are rarely needed
- Most writes go to the preferred nodes
- The hints database is typically empty

During failures:

Scenario: 5% of nodes failing at any time (normal at Amazon's scale)

Without hinted handoff:
- Write success rate: 85%
- Customer impact: 15% of cart additions fail

With hinted handoff:
- Write success rate: 99.9%+
- Customer impact: Nearly zero

During datacenter failure:

Scenario: Entire datacenter unreachable (33% of nodes)

Without hinted handoff:
- Many keys would lose their entire preference list
- Massive write failures
- System effectively down

With hinted handoff:
- Writes redirect to other datacenters
- Hints accumulate temporarily
- When the datacenter recovers, hints transfer back
- Zero customer-visible failures

The Trade-off

Benefits:
✓ Maximum write availability
✓ Durability maintained during failures
✓ Automatic recovery when nodes come back
✓ No manual intervention required

Costs:
✗ Temporary inconsistency (data not on "correct" nodes)
✗ Extra storage for the hints database
✗ Background bandwidth for hint transfers
✗ Slightly more complex code

A subtle caveat: hinted handoff provides temporary durability, not permanent replication. If the substitute node (like D) fails before it can transfer its hint back to B, the number of true replicas drops below N until anti-entropy repair restores it.
Amazon's verdict: the availability gains far outweigh the costs for e-commerce workloads.

Conflict Resolution: The Shopping Cart Problem

Let's talk about the most famous example from the paper: the shopping cart.

What Is a Conflict (and Why Does It Happen)?

A conflict occurs when two writes happen to the same key on different nodes, without either write "knowing about" the other. This is only possible because Dynamo accepts writes even when nodes can't communicate, which is the whole point!

Here's a concrete sequence of events that creates a conflict:

Timeline:
T=0: Customer logs in. Cart has {shoes} on all 3 nodes.
T=1: Network partition: Node1 can't talk to Node2.
T=2: Customer adds {jacket} on their laptop → goes to Node1.
     Cart on Node1: {shoes, jacket} ← Vector clock: [N1:2]
T=3: Customer adds {hat} on their phone → goes to Node2.
     Cart on Node2: {shoes, hat} ← Vector clock: [N2:2]
T=4: Network heals. Node1 and Node2 compare notes.
     Node1 says: "I have version [N1:2]"
     Node2 says: "I have version [N2:2]"
     Neither clock dominates the other → CONFLICT!

Neither version is "wrong": both represent real actions the customer took. Dynamo's job is to detect this situation (via vector clocks) and surface both versions to the application, so the application can decide what to do.

What Does the Application Do With a Conflict?

This is the crucial part that the paper delegates to you: the application must resolve conflicts using business logic. Dynamo gives you all the concurrent versions; your code decides how to merge them.

For the shopping cart, Amazon chose a union merge: keep all items from all concurrent versions. The rationale is simple: losing an item from a customer's cart (missing a sale) is worse than occasionally showing a stale item they already deleted.
Union merge in action:

Conflict versions:
Version A (from Node1): {shoes, jacket}
Version B (from Node2): {shoes, hat}

Merge strategy: union
Merged cart: {shoes, jacket, hat}  ← All items preserved

Here is the reconciliation logic in code:

from __future__ import annotations
from dataclasses import dataclass, field

class VectorClock:
    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def merge(self, other: "VectorClock") -> "VectorClock":
        """Merged clock = max of each node's counter across both versions."""
        all_keys = set(self.clock) | set(other.clock)
        merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys}
        return VectorClock(merged)

    def __repr__(self):
        return f"VectorClock({self.clock})"

@dataclass
class ShoppingCart:
    items: list[str] = field(default_factory=list)
    vector_clock: VectorClock = field(default_factory=VectorClock)

    @staticmethod
    def reconcile(carts: list["ShoppingCart"]) -> "ShoppingCart":
        if len(carts) == 1:
            return carts[0]  # No conflict, nothing to do

        # Merge strategy: union of all items (never lose additions).
        # This is Amazon's choice for shopping carts.
        # A different application might choose last-write-wins or something else.
        all_items: set[str] = set()
        merged_clock = VectorClock()
        for cart in carts:
            all_items.update(cart.items)  # Union: keep everything
            merged_clock = merged_clock.merge(cart.vector_clock)

        return ShoppingCart(items=sorted(all_items), vector_clock=merged_clock)

# Example conflict scenario
cart1 = ShoppingCart(items=["shoes", "jacket"], vector_clock=VectorClock({"N1": 2}))
cart2 = ShoppingCart(items=["shoes", "hat"], vector_clock=VectorClock({"N2": 2}))

# Dynamo detected a conflict and passes both versions to our reconcile()
reconciled = ShoppingCart.reconcile([cart1, cart2])
print(reconciled.items)  # ['hat', 'jacket', 'shoes'] — union!

The Deletion Problem (Why This Gets Tricky)

The union strategy has a nasty edge case:
Deleted items can come back from the dead:

T=0: Cart: {shoes, hat}
T=1: Customer removes hat → Cart: {shoes}        Clock: [N1:3]
T=2: Network partition — Node2 still has old state
T=3: Some concurrent write to Node2              Clock: [N2:3]
T=4: Network heals → conflict detected
T=5: Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat}

Result: Hat is BACK! Customer removed it, but it reappeared.

Amazon explicitly accepts this trade-off. A resurrected item in the cart is a minor annoyance. Losing a cart addition during a Black Friday sale is lost revenue.

Engineering depth note: Merge logic must be domain-specific and carefully designed. Adding items is commutative (order doesn’t matter) and easy to merge. Removing items is not—a deletion in one concurrent branch may be silently ignored during a union-based merge. This is an intentional trade-off in Dynamo’s design, but it means the application must reason carefully about add vs. remove semantics. If your data doesn’t naturally support union merges (e.g., a counter, a user’s address), you need a different strategy—such as CRDTs, last-write-wins with timestamps, or simply rejecting concurrent writes for that data type.

Read and Write Paths

The diagrams above show the high-level flow, but let’s walk through what happens step by step during reads and writes.
Write Path

Step-by-step narration of a PUT request:

1. The client sends the request to any node (via a load balancer) or directly to the coordinator.
2. The coordinator is chosen: it is the first node in the preference list for the key's hash position on the ring.
3. The vector clock is updated — the coordinator increments its own counter in the vector clock, creating a new version.
4. The coordinator writes locally, then fans out the write to the other N-1 nodes in the preference list in parallel.
5. The coordinator waits for W acknowledgments. It does not wait for all N, only for the first W to respond. Nodes that have not responded yet will still receive the write eventually (or via hinted handoff if they are down).
6. Once W ACKs arrive, the coordinator returns 200 OK to the client. From the client's perspective, the write is complete.

Key insight about the write path: the client gets a success response as soon as W nodes confirm. The remaining (N - W) nodes receive the write asynchronously. That is why the system is "eventually consistent": all nodes end up with the data, but not necessarily at the same moment.

Read Path

Step-by-step narration of a GET request:

1. The client sends the request to the coordinator for that key.
2. The coordinator sends read requests to all N nodes in the preference list at once (not just R). This is subtle: it contacts all N, but only needs R of them to respond.
3. The coordinator waits for R responses, then proceeds without waiting for stragglers.
4. The coordinator checks all the vector clocks and compares the versions returned:
   - If all versions are identical → return the single version immediately.
   - If one version’s clock dominates the others (it’s causally “newer”) → return that version.
   - If versions are concurrent (neither clock dominates) → return all versions to the client, which must merge them.
Read repair happens in the background: if the coordinator saw any node return a stale version, it sends the latest version to that node to bring it up to date.

Why does the client receive the conflict instead of the coordinator resolving it? Because Dynamo is a general-purpose storage engine. It does not know whether you are storing a shopping cart, a user profile, or a session token. Only your application knows how to merge two conflicting versions in a way that makes business sense. The coordinator returns the concurrent versions along with the vector clock context, and you do the right thing for your use case.

The vector clock context is the key to closing the loop: when the client writes the merged version back, it must include the context (the vector clock it was given). This tells Dynamo that the new write has "seen" all the concurrent versions, so the conflict is resolved. Without that context, the write would be treated as yet another concurrent write on top of the still-unresolved conflict.

Merkle Trees for Anti-Entropy

The problem: how do you know when replicas are out of sync? After a node recovers from a failure, it may have missed some writes. After a network partition heals, two replicas may have diverged. How does Dynamo detect and repair these differences?

The brute-force approach would be: "Every hour, compare every key on Node A and Node B, and re-sync whatever differs." But at Amazon's scale, a single node can store hundreds of millions of keys.

Dynamo uses Merkle trees to solve this efficiently. The core idea: instead of comparing individual keys, compare hashes of entire key ranges. If the hashes match, that whole range is identical, so skip it.

Important background: Merkle tree sync is a background anti-entropy process. It is not on the hot read/write path. Normal reads and writes use vector clocks and quorums for versioning. The Merkle tree is for the repair process that runs periodically in the background to catch any divergence that slipped through.
How to Build a Merkle Tree

Each node builds a Merkle tree over its data, organized by key ranges:

- Leaf nodes contain the hash of a small range of actual data keys (e.g., the hash of all values for keys k1, k2, k3).
- Internal nodes contain the hash of their children's hashes.
- The root is a single hash that represents all the data on the node.

How Two Nodes Sync Using the Merkle Tree

When Node A and Node B want to check whether they are in sync:

Step 1: Compare root hashes. If they're the same, everything is identical. Done! (No network traffic for the data itself.)
Step 2: If roots differ, compare their left children. Same? Skip that entire half of the key space.
Step 3: Keep descending only into subtrees where hashes differ, until you reach the leaf nodes.
Step 4: Sync only the specific keys in the differing leaf nodes.

Example: Comparing two nodes

Node A root: abc789  ← differs from Node B!
Node B root: abc788

Compare left subtrees:
Node A left: xyz123
Node B left: xyz123  ← same! Skip entire left half.

Compare right subtrees:
Node A right: def456
Node B right: def457  ← differs! Go deeper.

Compare right-left subtree:
Node A right-left: ghi111
Node B right-left: ghi111  ← same! Skip.

Compare right-right subtree:
Node A right-right: jkl222
Node B right-right: jkl333  ← differs! These are leaves.

→ Sync only the keys in the right-right leaf range (e.g., k10, k11, k12)

Instead of comparing all 1 million keys, we compared 6 hashes and synced only 3 keys!
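The construction described above can be sketched as a small in-memory tree over sorted key ranges. The hash function (SHA-1) and bucket size here are arbitrary choices for illustration; real Dynamo's tree layout differs:

```python
import hashlib

def _h(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()  # hash choice is illustrative

def build_merkle(store: dict, bucket_size: int = 4) -> list:
    """Build a Merkle tree bottom-up over a key-value dict.
    Returns the tree as a list of levels; levels[-1][0] is the root hash.
    Each leaf covers a contiguous range of the sorted keys."""
    keys = sorted(store)
    leaves = []
    for i in range(0, max(len(keys), 1), bucket_size):
        chunk = keys[i:i + bucket_size]
        payload = b"".join(f"{k}={store[k]}".encode() for k in chunk)
        leaves.append(_h(payload))
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Pair up sibling hashes; an odd node is promoted unchanged
        nxt = [_h((prev[i] + prev[i + 1]).encode()) if i + 1 < len(prev) else prev[i]
               for i in range(0, len(prev), 2)]
        levels.append(nxt)
    return levels
```

Two stores with identical contents produce identical roots; changing a single value changes the root, which is exactly the signal the top-down sync walk above descends on.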
Synchronization process in code:

def sync_replicas(node_a, node_b, key_range):
    """
    Efficiently sync two nodes using Merkle trees.

    Instead of comparing all keys, we compare hashes top-down.
    Only the ranges where hashes differ need actual key-level sync.
    """
    tree_a = node_a.get_merkle_tree(key_range)
    tree_b = node_b.get_merkle_tree(key_range)

    # Step 1: Compare root hashes first.
    # If they match, every key in this range is identical — nothing to do!
    if tree_a.root_hash == tree_b.root_hash:
        return  # Zero data transferred — full match!

    # Step 2: Recursively find differences by traversing top-down.
    # Only descend into subtrees where hashes differ.
    differences = []
    stack = [(tree_a.root, tree_b.root)]
    while stack:
        node_a_subtree, node_b_subtree = stack.pop()
        if node_a_subtree.hash == node_b_subtree.hash:
            continue  # This whole subtree matches — skip it!
        if node_a_subtree.is_leaf:
            # Found a differing leaf — these keys need syncing
            differences.extend(node_a_subtree.keys)
        else:
            # Not a leaf yet — recurse into children
            for child_a, child_b in zip(node_a_subtree.children, node_b_subtree.children):
                stack.append((child_a, child_b))

    # Step 3: Sync only the specific keys that differed at leaf level.
    # This might be a handful of keys, not millions.
    for key in differences:
        sync_key(node_a, node_b, key)

Why This Is Efficient

The power of Merkle trees is that the number of hash comparisons scales with the depth of the tree (logarithmic in the number of keys), not with the number of keys themselves.

Node with 1,000,000 keys:

Naive approach: Compare 1,000,000 keys individually
Cost: 1,000,000 comparisons

Merkle tree: Compare O(log N) hashes top-down
Tree depth ≈ 20 levels
Cost: ~20 comparisons to find differences
Then sync only the differing leaves (~few keys)

Speedup: ~50,000x fewer comparisons!

Better yet, if two replicas are mostly in sync (which is almost always true in a healthy cluster), the root hashes usually match exactly and zero data needs to be transferred.
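The arithmetic behind the 1,000,000-key example above is easy to check, assuming a balanced binary tree with one key per leaf (real trees bucket many keys per leaf, which only makes the depth smaller):

```python
import math

def merkle_depth(num_leaves: int) -> int:
    """Depth of a balanced binary Merkle tree with one key per leaf."""
    return max(1, math.ceil(math.log2(num_leaves)))

# Finding one differing key means walking a single root-to-leaf path,
# comparing hashes level by level instead of scanning every key.
keys = 1_000_000
depth = merkle_depth(keys)   # ceil(log2(1,000,000)) = 20 levels
speedup = keys // depth      # ~50,000x fewer comparisons
```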
Membership and Failure Detection

Dynamo uses a gossip protocol for membership management. Each node periodically exchanges membership information with random peers. There is no master node; membership management is fully decentralized.

Gossip-Based Membership

Key design decisions:

No single coordinator: Each node has its own view of group membership. There is no central registry, so there is no single point of failure for membership data.

Failure suspicion vs. detection: Dynamo uses an accrual-style failure detector (like Phi Accrual). Instead of a binary "alive/dead" judgment, nodes maintain a suspicion level that rises the longer a peer goes without responding. This reduces false positives caused by transient network hiccups.

Suspicion level example:

Node A's view of Node B:
- Last heartbeat: 3 seconds ago  → Suspicion low    → Healthy
- Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded
- Last heartbeat: 60 seconds ago → Suspicion high   → Treat as failed

Decentralized bootstrapping: New nodes contact a seed node to join, then gossip spreads their presence to the rest of the group. Membership convergence is eventual; different nodes may have slightly different views of the membership at any moment, which is acceptable.

Performance at Scale: The Real Numbers

The paper provides fascinating performance data.

Latency Distribution

Metric              | Average | 99.9th Percentile
--------------------|---------|------------------
Read latency        | ~10ms   | ~200ms
Write latency       | ~15ms   | ~200ms

Key insight: the 99.9th percentile is ~20x the average!

Why the huge gap? The 99.9th percentile is dominated by:
- Garbage collection pauses
- Disk I/O contention
- Network jitter
- Load imbalance

This is why Amazon's SLAs are specified at the 99.9th percentile, not the average.

Conflict Rates

From 24 hours of Amazon's shopping cart traffic (per the Dynamo paper).
Note that these reflect Amazon's specific workload characteristics, not a general baseline:

99.94%   - Saw exactly one version (no conflict)
0.00057% - Saw 2 versions
0.00047% - Saw 3 versions
0.00009% - Saw 4 versions

Takeaway: conflicts are rare in practice! They are often caused by concurrent automated writers (bots), not by failures.

Evolution of the Partitioning Strategy

Dynamo evolved through three partitioning strategies. This evolution teaches us important lessons:

Strategy 1: Random Tokens (the starting point)

Problem: Random token assignment → uneven load
Problem: Adding nodes → expensive data scans
Problem: Can't easily snapshot the system

Operational lesson: random token assignment may look elegant but is painful in practice. Each node lands at a random position on the ring, which means wildly different data ownership ranges and an uneven load distribution.

Strategy 2: Equal-size Partitions + Random Tokens

Improvement: Decouples partitioning from placement
Problem: Still has load balancing issues

Strategy 3: Q/S Tokens per Node - Equal-size Partitions + Deterministic Placement (the endgame)

What Q and S mean:
- Q = the total number of fixed partitions the ring is divided into (e.g., 1024). Think of these as equal-size, pre-carved slices of the hash space that never change shape.
- S = the number of physical servers currently in the cluster (e.g., 8).
- Q/S = how many of those fixed slices each server is responsible for (e.g., 1024 / 8 = 128 partitions per server).

The key changes from the earlier strategies: the ring is now divided into Q fixed, equal-size partitions, and servers no longer sit at random positions. Each server owns exactly Q/S partitions, spread evenly around the ring.
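The Q/S bookkeeping can be sketched in a few lines, assuming Q divides evenly among servers; `assign_partitions` and `rebalance` are illustrative names, not Dynamo APIs:

```python
def assign_partitions(Q: int, servers: list) -> dict:
    """Round-robin the Q fixed partitions across servers.
    Returns server -> set of partition ids."""
    owner = {s: set() for s in servers}
    for p in range(Q):
        owner[servers[p % len(servers)]].add(p)
    return owner

def rebalance(assignment: dict, new_server) -> dict:
    """Add a server by stealing partitions from the most-loaded servers.
    Only enough partitions move to rebalance; the rest stay put."""
    new_assign = {s: set(ps) for s, ps in assignment.items()}
    new_assign[new_server] = set()
    total = sum(len(ps) for ps in assignment.values())
    target = total // len(new_assign)  # Q / S_new, assuming even division
    while len(new_assign[new_server]) < target:
        donor = max(assignment, key=lambda s: len(new_assign[s]))
        new_assign[new_server].add(new_assign[donor].pop())
    return new_assign

def moved_partitions(before: dict, after: dict) -> int:
    """Count partitions whose owner changed between two assignments."""
    owner_b = {p: s for s, ps in before.items() for p in ps}
    owner_a = {p: s for s, ps in after.items() for p in ps}
    return sum(1 for p in owner_b if owner_a.get(p) != owner_b[p])

before = assign_partitions(12, ["A", "B", "C"])  # 4 partitions each
after = rebalance(before, "D")                   # now 3 partitions each
# moved_partitions(before, after) -> 3: only 3 of the 12 slices moved
```

With Q=12 and three servers, adding a fourth moves exactly Q/S_new = 3 partitions, matching the Q/S arithmetic above. Random tokens offered no such bound.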
Example: Q=12 partitions, S=3 servers

Ring divided into 12 equal slices (each covers 30° of the 360° ring):

Partition 1:   0°– 30° → Server A
Partition 2:  30°– 60° → Server B
Partition 3:  60°– 90° → Server C
Partition 4:  90°–120° → Server A
Partition 5: 120°–150° → Server B
Partition 6: 150°–180° → Server C
...and so on, round-robin

Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced.

When a 4th server joins (S becomes 4):
New Q/S = 12/4 = 3 partitions per server.
Each existing server hands off 1 partition to the new server.
Only 3 out of 12 partitions move — the rest are untouched.

This evolution, from random tokens to fixed, equal-size partitions with even ownership, is one of the big operational lessons from Dynamo. The early approach favored implementation simplicity; the later approach favored operational simplicity and predictability.

Comparing Dynamo to Modern Systems

System | Consistency Model | Use Case | Dynamo Influence
-------|-------------------|----------|-----------------
Cassandra | Tunable (N, R, W) | Time-series, analytics | Direct descendant — heavily inspired by Dynamo, uses the same consistent hashing and quorum concepts
Riak | Tunable, vector clocks | Key-value store | Closest faithful Dynamo implementation
Amazon DynamoDB | Eventually consistent by default | Managed NoSQL | ⚠️ Not the same as Dynamo! A completely different system internally, with no vector clocks and much simpler conflict resolution. Shares the name and high-level inspiration only.
Voldemort | Tunable | LinkedIn's data store | Open-source Dynamo implementation
Google Spanner | Linearizable | Global SQL | Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization
Redis Cluster | Eventually consistent | Caching, sessions | Uses consistent hashing; much simpler conflict resolution

The DynamoDB confusion: many engineers conflate Amazon DynamoDB with the Dynamo paper. They are very different. DynamoDB is a managed service designed for operational simplicity. It does not expose vector clocks, does not use the same partitioning scheme, and uses a different consistency model. The paper describes the internal Dynamo storage engine that predates DynamoDB.

What Dynamo Can't Give You

Every senior engineer blog should be honest about limitations.
Here’s what Dynamo explicitly trades away:

No transactions: Operations are single-key only. You can't update multiple keys atomically.
No secondary indexes: You can only look up data by its primary key (at least in the original design).
No joins: It's a key-value store. There is no query language.
No global ordering: Events across different keys have no guaranteed ordering.
No linearizability: Even with R = W = N, Dynamo does not guarantee linearizable reads.
No automatic conflict resolution: The system detects conflicts and surfaces them to the application. The application must resolve them. If your engineers don't understand this, you will have subtle data bugs.
Repair costs at scale: The anti-entropy process (Merkle tree reconciliation) is not free. At large scale, background repair traffic can be significant.
Vector clock growth: In high-write environments with many coordinators, vector clocks can grow large enough to require truncation, which risks losing causality information.

Understanding these constraints is essential for operating Dynamo-style systems successfully in production.

A Practical Implementation Example

Below is a self-contained Python implementation of Dynamo's core concepts. It is deliberately simplified (no real networking, no persistence), but it faithfully demonstrates how vector clocks, consistent hashing, read/write quorums, and conflict detection work.

Part 1: The Vector Clock

The VectorClock class is the foundation of version tracking. It is just a dict mapping node_id → counter with 2 core operations:

- increment(node) — bump our own counter when we write
- dominates(other) — check if one clock is causally “after” another; if neither dominates, the writes were concurrent (conflict)

from __future__ import annotations
from dataclasses import dataclass, field

class VectorClock:
    """
    Tracks causality across distributed writes.
    A clock like {"nodeA": 2, "nodeB": 1} means:
      - nodeA has coordinated 2 writes
      - nodeB has coordinated 1 write
      - Any version with these counters has "seen" those writes
    """

    def __init__(self, clock: dict[str, int] | None = None):
        self.clock: dict[str, int] = clock.copy() if clock else {}

    def increment(self, node_id: str) -> "VectorClock":
        """Return a new clock with node_id's counter bumped by 1."""
        new_clock = self.clock.copy()
        new_clock[node_id] = new_clock.get(node_id, 0) + 1
        return VectorClock(new_clock)

    def dominates(self, other: "VectorClock") -> bool:
        """
        Returns True if self is causally AFTER other.

        self dominates other when:
          - Every counter in self is >= the same counter in other, AND
          - At least one counter in self is strictly greater.

        Meaning: self has seen everything other has seen, plus more.
        """
        all_keys = set(self.clock) | set(other.clock)
        at_least_one_greater = False
        for key in all_keys:
            self_val = self.clock.get(key, 0)
            other_val = other.clock.get(key, 0)
            if self_val < other_val:
                return False  # self is missing something other has seen
            if self_val > other_val:
                at_least_one_greater = True
        return at_least_one_greater

    def merge(self, other: "VectorClock") -> "VectorClock":
        """
        Merge two clocks by taking the max of each counter.

        Used after resolving a conflict to produce a new clock that
        has "seen" everything both conflicting versions saw.
        """
        all_keys = set(self.clock) | set(other.clock)
        merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys}
        return VectorClock(merged)

    def __repr__(self):
        return f"VectorClock({self.clock})"

Part 2: The Versioned Value

Every value stored in Dynamo is paired with its vector clock. This pairing is what allows the coordinator to compare versions during reads and detect conflicts.

@dataclass
class VersionedValue:
    """
    A value paired with its causal history (vector clock).

    When a client reads, they get back a VersionedValue.
    When they write an update, they must include the context (the
    vector clock they read) so Dynamo knows what version they're
    building on top of.
    """
    value: object
    vector_clock: VectorClock

    def __repr__(self):
        return f"VersionedValue(value={self.value!r}, clock={self.vector_clock})"

Part 3: The Simulated Node

In the real Dynamo, every node is a separate process. Here we simulate them as in-memory objects. Key details: each node has its own storage, and a node can be marked as down to simulate failures.

class DynamoNode:
    """
    Simulates a single Dynamo storage node.

    In production this would be a separate server with disk storage.
    Here it's an in-memory dict so we can demo the logic without networking.
    """

    def __init__(self, node_id: str, token: int):
        self.node_id = node_id
        self.token = token  # Position on the consistent hash ring
        self.storage: dict[str, list[VersionedValue]] = {}
        self.down = False  # Toggle to simulate node failures

    def write(self, key: str, versioned_value: VersionedValue) -> bool:
        """
        Store a versioned value. Returns False if the node is down.

        We store a LIST of versions per key, because a node might hold
        multiple concurrent (conflicting) versions until they are
        resolved by the application.
        """
        if self.down:
            return False
        # In a real node this would be written to disk (e.g. BerkeleyDB)
        self.storage[key] = [versioned_value]
        return True

    def read(self, key: str) -> list[VersionedValue] | None:
        """
        Return all versions of a key. Returns None if the node is down.
        A healthy node with no data for the key returns an empty list.
        """
        if self.down:
            return None
        return self.storage.get(key, [])

    def __repr__(self):
        status = "DOWN" if self.down else f"token={self.token}"
        return f"DynamoNode({self.node_id}, {status})"

Part 4: The Hash Ring

We sort nodes by their token (ring position) and walk clockwise to find the coordinator and the preference list for any key.
import hashlib

class ConsistentHashRing:
    """
    Maps any key to an ordered list of N nodes (the preference list).

    Nodes are placed at fixed positions (tokens) on a conceptual ring
    from 0 to 2^32. A key hashes to a position, then walks clockwise
    to find its nodes.

    This means adding/removing one node only rebalances ~1/N of keys,
    rather than reshuffling everything like modulo hashing would.
    """

    def __init__(self, nodes: list[DynamoNode]):
        # Sort nodes by token so we can do clockwise lookup efficiently
        self.nodes = sorted(nodes, key=lambda n: n.token)

    def _hash(self, key: str) -> int:
        """Consistent hash of a key into the ring's token space."""
        # Use MD5 for a simple, evenly distributed hash.
        # Real Dynamo uses a more sophisticated hash (e.g., SHA-1).
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % (2**32)

    def get_preference_list(self, key: str, n: int) -> list[DynamoNode]:
        """
        Return the first N nodes clockwise from key's hash position.
        These are the nodes responsible for storing this key.

        The first node in the list is the coordinator — it receives
        the client request and fans out to the others.
        """
        if not self.nodes:
            return []

        key_hash = self._hash(key)

        # Find the first node whose token is >= key's hash (clockwise)
        start_idx = 0
        for i, node in enumerate(self.nodes):
            if node.token >= key_hash:
                start_idx = i
                break
        else:
            # If key_hash is greater than all tokens, wrap around to node 0
            start_idx = 0

        # Walk clockwise, collecting N unique nodes
        preference_list = []
        for i in range(len(self.nodes)):
            idx = (start_idx + i) % len(self.nodes)
            preference_list.append(self.nodes[idx])
            if len(preference_list) == n:
                break
        return preference_list

Part 5: The Dynamo Coordinator

This is the heart of the system: the logic that handles client requests, fans out to replicas, waits for the quorum, and detects conflicts.

class SimplifiedDynamo:
    """
    Coordinates reads and writes across a cluster of DynamoNodes.
    Any node can act as coordinator for any request — there's no
    dedicated master. The coordinator is simply whichever node receives
    the client request (or the first node in the preference list, if
    using partition-aware routing).

    Configuration:
      N = total replicas per key
      R = minimum nodes that must respond to a read (read quorum)
      W = minimum nodes that must acknowledge a write (write quorum)
    """

    def __init__(self, nodes: list[DynamoNode], N: int = 3, R: int = 2, W: int = 2):
        self.N = N
        self.R = R
        self.W = W
        self.ring = ConsistentHashRing(nodes)

    # ------------------------------------------------------------------ #
    # WRITE                                                              #
    # ------------------------------------------------------------------ #

    def put(self, key: str, value: object, context: VectorClock | None = None) -> VectorClock:
        """
        Write a key-value pair to N replicas, wait for W ACKs.

        The 'context' is the vector clock from a previous read.
        Always pass context when updating an existing key — it tells
        Dynamo which version you're building on top of, so it can
        detect whether your write is concurrent with anything else.

        Returns the new vector clock, which the caller should store
        and pass back on future writes to this key.

        Raises:
            RuntimeError if fewer than W nodes acknowledged.
        """
        preference_list = self.ring.get_preference_list(key, self.N)
        if not preference_list:
            raise RuntimeError("No nodes available")

        # The coordinator is always the first node in the preference list.
        coordinator = preference_list[0]

        # Increment the coordinator's counter in the vector clock.
        # If no context was provided (brand new key), start a fresh clock.
        base_clock = context if context is not None else VectorClock()
        new_clock = base_clock.increment(coordinator.node_id)

        versioned = VersionedValue(value=value, vector_clock=new_clock)

        # Fan out to all N replicas.
        # In a real system these would be concurrent RPC calls.
        # Here we call them sequentially for simplicity.
        ack_count = 0
        for node in preference_list:
            success = node.write(key, versioned)
            if success:
                ack_count += 1

        # Only need W ACKs to declare success.
        # The remaining replicas are updated asynchronously (or via
        # hinted handoff if they were down).
        if ack_count < self.W:
            raise RuntimeError(
                f"Write quorum not met: got {ack_count} ACKs, needed {self.W}"
            )

        print(f"[PUT] key={key!r} value={value!r} clock={new_clock} "
              f"({ack_count}/{self.N} nodes wrote)")
        return new_clock

    # ------------------------------------------------------------------ #
    # READ                                                               #
    # ------------------------------------------------------------------ #

    def get(self, key: str) -> list[VersionedValue]:
        """
        Read a key from N replicas, wait for R responses, reconcile.

        Returns a LIST of VersionedValues:
          - Length 1  → clean read, no conflict
          - Length >1 → concurrent versions detected; application must merge

        After reading, the caller should:
          1. If no conflict: use the single value normally.
          2. If conflict: merge the values using application logic,
             then call put() with the merged value and the merged
             vector clock as context. This "closes" the conflict.

        Read repair happens in the background: any replica that
        returned a stale version is silently updated with the latest
        version.
        """
        preference_list = self.ring.get_preference_list(key, self.N)

        # Collect responses from all N nodes
        all_versions: list[VersionedValue] = []
        responding_nodes: list[tuple[DynamoNode, list[VersionedValue]]] = []
        for node in preference_list:
            result = node.read(key)
            if result is not None:  # None means the node is down
                all_versions.extend(result)
                responding_nodes.append((node, result))

        if len(responding_nodes) < self.R:
            raise RuntimeError(
                f"Read quorum not met: got {len(responding_nodes)} responses, needed {self.R}"
            )

        # Reconcile: discard any version that is strictly dominated
        # (i.e., is a causal ancestor of) another version.
        # What remains is the set of concurrent versions.
        reconciled = self._reconcile(all_versions)

        # Background read repair: if any node returned something older
        # than the reconciled result, send it the latest version.
        # (Simplified: only meaningful when there's a single winner.)
        if len(reconciled) == 1:
            latest = reconciled[0]
            for node, versions in responding_nodes:
                if not versions or versions[0].vector_clock.clock != latest.vector_clock.clock:
                    node.write(key, latest)  # Repair silently in background

        status = "clean" if len(reconciled) == 1 else f"CONFLICT ({len(reconciled)} versions)"
        print(f"[GET] key={key!r} status={status} "
              f"({len(responding_nodes)}/{self.N} nodes responded)")
        return reconciled

    # ------------------------------------------------------------------ #
    # INTERNAL: VERSION RECONCILIATION                                   #
    # ------------------------------------------------------------------ #

    def _reconcile(self, versions: list[VersionedValue]) -> list[VersionedValue]:
        """
        Remove any version that is a causal ancestor of another version.

        If version A's clock is dominated by version B's clock, then B
        is strictly newer — A adds no new information and can be dropped.

        Whatever remains after pruning are CONCURRENT versions: writes
        that happened without either "knowing about" the other. The
        application must merge these using domain-specific logic.

        Example:
            versions = [clock={A:1}, clock={A:2}, clock={B:1}]
            {A:2} dominates {A:1} → drop {A:1}
            {A:2} and {B:1} are concurrent → both survive
            result = [{A:2}, {B:1}]  ← conflict!
            application must merge
        """
        dominated = set()
        for i, v1 in enumerate(versions):
            for j, v2 in enumerate(versions):
                if i != j and v2.vector_clock.dominates(v1.vector_clock):
                    dominated.add(i)  # v1 is an ancestor of v2, discard v1

        survivors = [v for i, v in enumerate(versions) if i not in dominated]

        # De-duplicate: identical clocks from different replicas are the same version
        seen_clocks: list[VectorClock] = []
        unique: list[VersionedValue] = []
        for v in survivors:
            if not any(v.vector_clock.clock == s.clock for s in seen_clocks):
                unique.append(v)
                seen_clocks.append(v.vector_clock)

        return unique if unique else versions

Part 6: Putting It All Together - Demo

Let's run through a complete scenario: a normal write/read, then a simulated conflict where two nodes diverge and the application has to merge.

def demo():
    # ── Setup ────────────────────────────────────────────────────────── #
    # Five nodes placed at evenly spaced positions on the hash ring.
    # In a real cluster these would span multiple datacenters.
    # (The tokens here are tiny relative to the 2^32 hash space, so in
    # this toy demo most keys wrap around and node-A coordinates.)
    nodes = [
        DynamoNode("node-A", token=100),
        DynamoNode("node-B", token=300),
        DynamoNode("node-C", token=500),
        DynamoNode("node-D", token=700),
        DynamoNode("node-E", token=900),
    ]
    dynamo = SimplifiedDynamo(nodes, N=3, R=2, W=2)

    print("=" * 55)
    print("SCENARIO 1: Normal write and read (no conflict)")
    print("=" * 55)

    # Write the initial shopping cart
    ctx = dynamo.put("cart:user-42", {"items": ["shoes"]})

    # Read it back — should be a clean single version
    versions = dynamo.get("cart:user-42")
    print(f"Read result: {versions[0].value}\n")

    # Update the cart, passing the context from our earlier read.
    # The context tells Dynamo "this write builds on top of clock ctx".
ctx = dynamo.put("cart:user-42", {"items": ["shoes", "jacket"]}, context=ctx) versions = dynamo.get("cart:user-42") print(f"After update: {versions[0].value}\n") print("=" * 55) print("SCENARIO 2: Simulated conflict — two concurrent writes") print("=" * 55) # Write the base version base_ctx = dynamo.put("cart:user-99", {"items": ["hat"]}) # Now simulate a network partition: # node-A and node-B can't talk to each other. # We model this by writing directly to individual nodes. pref_list = dynamo.ring.get_preference_list("cart:user-99", 3) node_1, node_2, node_3 = pref_list[0], pref_list[1], pref_list[2] # Write 1: customer adds "scarf" via node_1 (e.g., their laptop) clock_1 = base_ctx.increment(node_1.node_id) node_1.write("cart:user-99", VersionedValue({"items": ["hat", "scarf"]}, clock_1)) # Write 2: customer adds "gloves" via node_2 (e.g., their phone) # This write also descends from base_ctx, not from clock_1. # Neither write knows about the other → they are concurrent. clock_2 = base_ctx.increment(node_2.node_id) node_2.write("cart:user-99", VersionedValue({"items": ["hat", "gloves"]}, clock_2)) # Read — coordinator sees two concurrent versions and surfaces the conflict versions = dynamo.get("cart:user-99") if len(versions) > 1: print(f"\nConflict detected! {len(versions)} concurrent versions:") for i, v in enumerate(versions): print(f" Version {i+1}: {v.value} clock={v.vector_clock}") # Application-level resolution: union merge (Amazon's shopping cart strategy) # Merge items: take the union so no addition is lost all_items = set() merged_clock = versions[0].vector_clock for v in versions: all_items.update(v.value["items"]) merged_clock = merged_clock.merge(v.vector_clock) merged_value = {"items": sorted(all_items)} print(f"\nMerged result: {merged_value}") # Write the resolved version back with the merged clock as context. # This "closes" the conflict — future reads will see a single version. 
final_ctx = dynamo.put("cart:user-99", merged_value, context=merged_clock) versions = dynamo.get("cart:user-99") print(f"\nAfter resolution: {versions[0].value}") assert len(versions) == 1, "Should be a single version after merge" if __name__ == "__main__": demo() Expected output: ======================================================= SCENARIO 1: Normal write and read (no conflict) ======================================================= [PUT] key='cart:user-42' value={'items': ['shoes']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote) [GET] key='cart:user-42' status=clean (3/3 nodes responded) Read result: {'items': ['shoes']} [PUT] key='cart:user-42' value={'items': ['shoes', 'jacket']} clock=VectorClock({'node-A': 2}) (3/3 nodes wrote) [GET] key='cart:user-42' status=clean (3/3 nodes responded) After update: {'items': ['shoes', 'jacket']} ======================================================= SCENARIO 2: Simulated conflict — two concurrent writes ======================================================= [PUT] key='cart:user-99' value={'items': ['hat']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote) [GET] key='cart:user-99' status=CONFLICT (2 versions) (3/3 nodes responded) Conflict detected! 2 concurrent versions: Version 1: {'items': ['hat', 'scarf']} clock=VectorClock({'node-A': 2}) Version 2: {'items': ['hat', 'gloves']} clock=VectorClock({'node-A': 1, 'node-B': 1}) Merged result: {'items': ['gloves', 'hat', 'scarf']} [PUT] key='cart:user-99' value={'items': ['gloves', 'hat', 'scarf']} ... (3/3 nodes wrote) [GET] key='cart:user-99' status=clean (3/3 nodes responded) After resolution: {'items': ['gloves', 'hat', 'scarf']} Katika hatua ya pili, mwanachama anaweza kuamua kwa usahihi na ya hawana sawa wala katika uhusiano wa kutawala - wala sio baba wa mwingine - hivyo wote wawili huonekana kama pamoja. programu kisha inachukua jukumu la kuunganisha na kuandika nyuma toleo lililopangwa na saa iliyounganishwa. 
What to notice: the two surviving clocks are `{'node-A': 2}` and `{'node-A': 1, 'node-B': 1}`. Neither dominates the other (the first has a higher counter for node-A, the second has an entry for node-B that the first lacks), so both are returned to the application as concurrent versions.

Key Lessons for System Design

After working with Dynamo-inspired systems for many years, here are my key takeaways:

1. Always-On Beats Strongly-Consistent. For user-facing applications, availability almost always wins. Users will tolerate seeing slightly stale data; they will not tolerate "Service Unavailable."

2. Application-Level Reconciliation Is Powerful. Don't be afraid to push conflict resolution into the application. The application knows the business logic and can make better decisions than the database ever could.

3. Tunable Consistency Is Essential. A shopping-cart add needs maximum availability (W=1). A financial transaction needs stronger guarantees (W=N). The ability to tune this per operation is enormously valuable.

4. The 99.9th Percentile Matters More Than the Average. Focus your optimization effort on tail latency. That is what users actually experience during peak periods.

5. Gossip Protocols Scale Beautifully. Decentralized membership via gossip eliminates single points of failure and scales to thousands of nodes.

When to Use a Dynamo-Style System

Be honest with yourself, and avoid this approach when:

- Strong consistency is required (financial processing, inventory management)
- Complex queries are needed (reporting, analytics, joins)
- Transactions span multiple items (Dynamo offers single-key operations only)
- Your team can't handle eventual consistency (if developers don't understand vector clocks and conflict resolution, you will run into trouble)

Conclusion

Dynamo represents a fundamental shift in how we think about distributed systems. By embracing eventual consistency and exposing conflicts to the application, it enables systems that scale to enormous size while maintaining very high availability. The lessons of this paper have influenced a generation of distributed databases.
If you use Cassandra, Riak, or DynamoDB, you are benefiting from the insights first published in this paper. As engineers, our job is to understand these trade-offs deeply and apply them correctly. Dynamo offers a powerful tool, but like any tool, it is only as good as our understanding of when and how to use it.

Further Reading

- The original Dynamo paper: SOSP 2007
- The Cassandra documentation: see how these concepts are applied in practice
- Martin Kleppmann's "Designing Data-Intensive Applications", Chapter 5 on Replication

Appendix: Design Problems and Approaches

Three open-ended problems that come up in system design interviews and in real engineering work. Think through each one before reading the discussion.

Problem 1: Conflict resolution for a collaborative document editor

The problem: you are building something like Google Docs backed by a Dynamo-style store. Two users edit the same paragraph at the same time. How do you handle conflicts?

Why shopping-cart union doesn't work here: the cart strategy (union of all items) is safe only because adding items is commutative, {A} ∪ {B} = {B} ∪ {A}. If user A deletes a sentence while user B edits the middle of it, the union of their changes is meaningless or contradictory.

The right approach: Operational Transformation (OT) or CRDTs. The industry solution is to represent the document not as a text blob but as a sequence of operations, and to transform concurrent operations so that both can be applied without conflict:

```text
User A's operation: delete(position=50, length=20)
User B's operation: insert(position=60, text="new sentence")

Without OT: B's insert position (60) is now wrong because A deleted 20 chars.
With OT:    Transform B's operation against A's:
            B's insert position shifts to 40 (60 - 20).
            Both operations now apply cleanly.
```

A Dynamo-level conflict resolution strategy would be:

1. Store operations (not full document snapshots) as the value for each key.
2. On conflict, collect the concurrent operation lists from each version.
3. Use OT to merge them into a single operation log.
4. Write the merged document back with the merged vector clock as context.

What to store in Dynamo: an operation log per document section, not rendered text. This makes merges deterministic and lossless.

Real-world reference: this is essentially how Google Docs, Notion, and Figma work. Their storage layers use OT or CRDTs (Conflict-free Replicated Data Types), data structures that are guaranteed to merge without conflict regardless of the order in which operations are applied.

Problem 2: Choosing N, R, W for different use cases

The problem: what configuration would you choose for (a) a session store, (b) a product catalog, (c) user profiles?

The right way to think about this: identify which failure mode carries the higher cost, a lost write (data loss) or a rejected write (unavailability).

Session store — prioritize availability. Sessions are short-lived and user-specific. If a user's session is briefly stale or lost, they log out and log back in.

```text
N=3, R=1, W=1

Rationale:
- W=1: Accept session writes even during heavy failures. A user can't
       log in if their session write is rejected.
- R=1: Read from any single node. Stale session data is harmless.
- N=3: Still replicate to 3 nodes for basic durability.

Trade-off accepted: Stale session reads are possible but inconsequential.
```

Product catalog — prioritize read performance and consistency. Product data is written rarely (by operations teams) but read millions of times a day. Stale prices or descriptions are a problem, so you need fast, consistent reads.

```text
N=3, R=2, W=3

Rationale:
- W=3: All replicas must confirm a catalog update before it's live.
       A price change half-published is worse than a brief write delay.
- R=2: Read quorum overlap with W=3 guarantees fresh data.
- Acceptable: catalog writes are rare, so write latency doesn't matter.
- N=3: Standard replication for durability.

Trade-off accepted: Writes are slow and fail if any node is down.
Acceptable because catalog updates are infrequent.
```

User profiles — balanced. Profile data (name, email, preferences) is moderately important. Stale profile reads are annoying but not dangerous; lost updates (for example, a user who can't update their email address) are a real problem.

```text
N=3, R=2, W=2

Rationale:
- The classic balanced configuration.
- R + W = 4 > N = 3, so quorums overlap: reads will see the latest write.
- Tolerates 1 node failure for both reads and writes.
- Appropriate for data that matters but doesn't require strict consistency.

Trade-off accepted: A second simultaneous node failure will cause errors.
Acceptable for non-critical user data.
```

Decision framework summary:

| Priority | R | W | When to use |
| --- | --- | --- | --- |
| Max availability | 1 | 1 | Sessions, ephemeral state, click tracking |
| Balanced | 2 | 2 | User profiles, preferences, soft state |
| Consistent reads | 2 | 3 | Catalogs, config, rarely-written reference data |
| Highest consistency | 3 | 3 | Anywhere you need R+W > N with zero tolerance for stale reads (still not linearizable) |

Problem 3: Testing a Dynamo-style system under network partitions

The problem: how do you verify that your system actually behaves correctly when nodes fail and partitions occur? This is one of the hardest problems in distributed systems testing, because bugs surface only under specific interleavings of concurrent events that are difficult to reproduce deterministically.
Layer 1: Unit tests for the logic in isolation

Before testing distributed behavior, verify the building blocks independently. Vector-clock comparison, conflict detection, and merge functions can all be exercised with ordinary unit tests — no network required.

```python
def test_concurrent_clocks_detected_as_conflict():
    clock_a = VectorClock({"node-A": 2})
    clock_b = VectorClock({"node-B": 2})
    assert not clock_a.dominates(clock_b)
    assert not clock_b.dominates(clock_a)
    # Both survive reconciliation → conflict correctly detected

def test_ancestor_clock_is_discarded():
    old_clock = VectorClock({"node-A": 1})
    new_clock = VectorClock({"node-A": 3})
    assert new_clock.dominates(old_clock)
    # old_clock should be pruned during reconciliation
```

Layer 2: Deterministic fault injection

Instead of hoping failures happen in the right order during a load test, inject them deliberately and repeatably. Setting `node.down = True` is the simplest version of this; production systems use tools like Chaos Monkey to do the same thing at the infrastructure level.

Key scenarios to test:

```text
Scenario A: Write succeeds with W=2, third replica is down.
  → Verify: the data is readable after the down node recovers.
  → Verify: no data loss occurred.

Scenario B: Two nodes accept concurrent writes to the same key.
  → Verify: the next read surfaces exactly 2 conflicting versions.
  → Verify: after the application writes a merged version, the next
    read is clean.

Scenario C: Node goes down mid-write (wrote to W-1 nodes).
  → Verify: the write is correctly rejected (RuntimeError).
  → Verify: no partial writes are visible to readers.

Scenario D: All N nodes recover after a full partition.
  → Verify: no data was lost across the cluster.
  → Verify: vector clocks are still meaningful (no spurious conflicts).
```
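Scenario A can be reduced to a few dozen lines of plain Python. The sketch below is self-contained and deliberately independent of the classes built earlier in this article; `ToyNode` and `ToyCluster` are hypothetical names introduced only for this illustration, and the model uses last-write-wins values rather than vector clocks to keep the fault-injection mechanics in focus.

```python
class ToyNode:
    """A replica that can be marked down to inject a failure."""
    def __init__(self, name):
        self.name, self.down, self.data = name, False, {}

    def write(self, key, value):
        if self.down:
            raise ConnectionError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if self.down:
            raise ConnectionError(f"{self.name} is down")
        return self.data.get(key)


class ToyCluster:
    """Quorum writes over a fixed replica set, with read repair on get."""
    def __init__(self, nodes, W=2):
        self.nodes, self.W = nodes, W

    def put(self, key, value):
        acks = 0
        for node in self.nodes:
            try:
                node.write(key, value)
                acks += 1
            except ConnectionError:
                pass  # skip unreachable replicas
        if acks < self.W:
            raise RuntimeError("write failed: quorum not reached")

    def get(self, key):
        # Read from every reachable replica, then repair stale ones.
        results = {}
        for node in self.nodes:
            try:
                results[node] = node.read(key)
            except ConnectionError:
                pass
        latest = next(v for v in results.values() if v is not None)
        for node, value in results.items():
            if value != latest:
                node.write(key, latest)  # read repair
        return latest


nodes = [ToyNode(n) for n in "ABC"]
cluster = ToyCluster(nodes, W=2)

nodes[2].down = True                      # inject the fault
cluster.put("cart", ["hat"])              # succeeds: 2/3 acks >= W=2
nodes[2].down = False                     # replica recovers...
assert nodes[2].data.get("cart") is None  # ...but is still stale
assert cluster.get("cart") == ["hat"]     # read triggers repair
assert nodes[2].data["cart"] == ["hat"]   # Scenario A verified: no data loss
```

The same skeleton covers Scenario C: mark two nodes down, attempt a put, and assert that `RuntimeError` is raised.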
Layer 3: Property-based testing

Instead of writing individual test cases, define invariants that must always hold, then generate thousands of random operation sequences that try to break them:

```python
# Invariant: after any sequence of writes and merges, a final get()
# should always return exactly one version (no unresolved conflicts).

# Invariant: a value written with a context derived from a previous read
# should never produce a conflict with that read's version
# (it should dominate it).

# Invariant: if R + W > N, a value written successfully should always
# be visible in the next read (read-your-writes, absent concurrent writes).
```

Tools such as Hypothesis (in Python) let you express these invariants and automatically search for counterexamples.

Layer 4: Linearizability checkers

For the highest confidence, record the start time, end time, and result of every operation during a fault-injection test, then feed the history to a linearizability checker such as Knossos. It will tell you whether the observed history is equivalent to some correct sequential execution — or, for an eventually consistent system, whether it at least operated within its stated guarantees.

Written from the trenches of distributed systems: battle-earned insights, failures experienced first-hand.
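As a concrete illustration of the second Layer 3 invariant above, here is a minimal sketch using only the standard library's `random` module in place of a full property-testing framework. The `VectorClock` class is a stripped-down reimplementation for this example, not the article's full version:

```python
import random

class VectorClock:
    """Minimal vector clock: enough to check dominance properties."""
    def __init__(self, clock=None):
        self.clock = dict(clock or {})

    def increment(self, node_id):
        c = dict(self.clock)
        c[node_id] = c.get(node_id, 0) + 1
        return VectorClock(c)

    def dominates(self, other):
        # True if self >= other on every entry and the clocks differ.
        if any(self.clock.get(k, 0) < v for k, v in other.clock.items()):
            return False
        return self.clock != other.clock


random.seed(42)
for _ in range(1000):
    # Build a random causal history: a chain of increments on random nodes.
    clock = VectorClock()
    for _ in range(random.randint(1, 20)):
        clock = clock.increment(random.choice("ABCDE"))

    # A write that uses this clock as its context...
    new_clock = clock.increment(random.choice("ABCDE"))

    # ...must strictly dominate it: no spurious conflict is possible.
    assert new_clock.dominates(clock)
    assert not clock.dominates(new_clock)

print("invariant held for 1000 random histories")
```

A real property-based tool would also shrink any failing history to a minimal counterexample; this loop only demonstrates the generate-and-assert core of the technique.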