Tại sao Amazon Dynamo vẫn định hình lưu trữ phân tán hiện đại 17 năm sau

A senior engineer’s perspective on building highly available distributed systems Bảng nội dung Giới thiệu: Tại sao Dynamo thay đổi mọi thứ Định lý CAP Trade-off Core Architecture Components Consistent Hashing for Partitioning Replication Strategy (N, R, W) Vector Clocks for Versioning Sloppy Quorum and Hinted Handoff Giải quyết tranh chấp: Vấn đề giỏ hàng Đọc & Viết Flow Cây Merkle cho Anti-Entropy Thành viên và phát hiện thất bại Tính năng chính: Real Numbers Chiến lược phân chia Evolution So sánh Dynamo với các hệ thống hiện đại Những gì Dynamo không cho bạn Ví dụ thực hành thực tế Những bài học quan trọng về thiết kế hệ thống Khi nào không nên sử dụng Dynamo-Style Systems Kết luận Phụ lục: Các vấn đề thiết kế và phương pháp tiếp cận Đây là một tham chiếu dài - mỗi phần đứng riêng, vì vậy hãy nhảy thẳng vào bất cứ điều gì có liên quan nhất với bạn. Đây là một tham chiếu dài - mỗi phần đứng riêng, vì vậy hãy nhảy thẳng vào bất cứ điều gì có liên quan nhất với bạn. Giới thiệu: Tại sao Dynamo thay đổi mọi thứ Khi Amazon xuất bản bài báo Dynamo vào năm 2007, nó không chỉ là một bài tập học tập khác. đó là một giải pháp thử nghiệm cho các vấn đề thực sự ở quy mô lớn.Tôi nhớ khi tôi lần đầu đọc bài báo này - nó đã thay đổi cơ bản cách tôi nghĩ về các hệ thống phân tán. Nó được thiết kế để hỗ trợ các dịch vụ lưu lượng truy cập cao của Amazon như hệ thống quản lý giỏ hàng và phiên. Không có chỉ số thứ cấp, không liên kết, không có ngữ nghĩa quan hệ - chỉ là chìa khóa và giá trị, với sự tập trung cực đoan vào tính sẵn có và khả năng mở rộng. Nó không cung cấp khả năng tuyến tính hoặc đảm bảo đặt hàng toàn cầu, ngay cả ở các thiết lập giới hạn cao nhất. Nếu hệ thống của bạn yêu cầu các thuộc tính đó, Dynamo không phải là công cụ phù hợp. Dynamo is a distributed key-value storage system. Vấn đề cốt lõi mà Amazon phải đối mặt là đơn giản để tuyên bố nhưng tàn bạo để giải quyết: Khi ai đó cố gắng thêm một mục vào giỏ hàng của họ trong một sự cố phân vùng mạng hoặc máy chủ, từ chối viết đó là không thể chấp nhận được. How do you build a storage system that never says “no” to customers? The CAP Theorem Trade-off: Tại sao Dynamo chọn khả năng sẵn có Trước khi đi sâu vào cách Dynamo hoạt động, bạn cần hiểu những hạn chế cơ bản mà nó được thiết kế xung quanh. CAP Theorem là gì? Định lý CAP mô tả một sự thỏa hiệp cơ bản trong các hệ thống phân tán: khi xảy ra phân vùng mạng, bạn phải chọn giữa tính nhất quán và tính sẵn có. Consistency (C): Tất cả các nút nhìn thấy cùng một dữ liệu cùng một lúc Availability (A): Mỗi yêu cầu được trả lời (thành công hoặc thất bại) Partition Tolerance (P): Hệ thống tiếp tục hoạt động bất chấp sự cố mạng Một từ viết tắt phổ biến là “chọn 2 trong 3”, nhưng đây là một sự đơn giản hóa quá mức.Trong thực tế, phân vùng mạng là không thể tránh khỏi ở quy mô, vì vậy quyết định thực sự là: Đó là sự lựa chọn thiết kế thực sự. when partitions occur (and they will), do you sacrifice consistency or availability? : Các phân vùng mạng sẽ xảy ra. Cáp bị cắt, chuyển đổi thất bại, trung tâm dữ liệu mất kết nối. Bạn không thể tránh chúng, vì vậy bạn phải chọn: Sự nhất quán hay sẵn có? The harsh reality Cơ sở dữ liệu truyền thống chọn sự nhất quán : Traditional approach Database: "I can't guarantee all replicas are consistent, so I'll reject your write to be safe." Result: Customer sees error, cart is empty Impact: Lost revenue, poor experience Dynamo chọn khả năng : Dynamo’s approach Dynamo: "I'll accept your write with the replicas I can reach. The unreachable replica will catch up later." Result: Customer sees success, item in cart Impact: Sale continues, happy customer Trade-off được hình dung When a partition occurs: Traditional Database: Choose C over A → Sacrifice Availability - ✓ All replicas always have same data - ✓ No conflicts to resolve - ❌ Rejects writes during failures - ❌ Poor customer experience - ❌ Lost revenue Dynamo: Choose A over C → Sacrifice Strong Consistency - ✓ Accepts writes even during failures - ✓ Excellent customer experience - ✓ No lost revenue - ❌ Replicas might temporarily disagree - ❌ Application must handle conflicts Amazon thực sự ví dụ: Black Friday giỏ hàng Hãy tưởng tượng đó là Thứ Sáu Đen. Hàng triệu khách hàng đang mua sắm. Một cáp mạng bị cắt giữa các trung tâm dữ liệu. : With traditional database Time: 10:00 AM - Network partition occurs Result: - All shopping cart writes fail - "Service Unavailable" errors - Customers can't checkout - Twitter explodes with complaints - Estimated lost revenue: $100,000+ per minute : With Dynamo Time: 10:00 AM - Network partition occurs Result: - Shopping cart writes continue - Customers see success - Some carts might have conflicts (rare) - Application merges conflicting versions - Estimated lost revenue: $0 - A few edge cases need conflict resolution (acceptable) Tại sao sự lựa chọn này có ý nghĩa đối với thương mại điện tử Amazon đã làm toán học: Chi phí từ chối một bài viết: Bán hàng bị mất ngay lập tức ($ 50-200) Chi phí chấp nhận viết mâu thuẫn: Thỉnh thoảng cần phải sáp nhập giỏ hàng (hiếm khi xảy ra, dễ dàng sửa chữa) Quyết định kinh doanh: Chấp nhận viết, đối phó với các xung đột hiếm : Types of data where Availability > Consistency Thùng mua sắm (merge conflicting additions) Dữ liệu phiên (Last-write-wins is fine) Ưu tiên của người dùng (sự nhất quán có thể chấp nhận được) Danh sách bán chạy nhất (khoảng là tốt) : Types of data where Consistency > Availability Tài khoản ngân hàng (không thể có số dư mâu thuẫn) Số lượng hàng tồn kho (không thể bán quá mức) Giao dịch nhật ký (phải được đặt hàng) Đó là lý do tại sao Dynamo không phải là cho tất cả mọi thứ - nhưng đối với các trường hợp sử dụng thương mại điện tử của Amazon, việc chọn khả năng sẵn có hơn là sự nhất quán mạnh là sự thỏa hiệp đúng đắn. Quan trọng: Trong khi Dynamo thường được mô tả là một hệ thống AP, nó chính xác hơn để gọi nó là một hệ thống nhất quán có thể điều chỉnh.Tùy thuộc vào cấu hình R và W quorum của bạn, nó có thể cư xử gần hơn với CP. Nhãn AP áp dụng cho cấu hình mặc định / khuyến nghị của nó được tối ưu hóa cho tải công việc thương mại điện tử. Quan trọng: Trong khi Dynamo thường được mô tả là một hệ thống AP, nó chính xác hơn để gọi nó là một hệ thống nhất quán có thể điều chỉnh.Tùy thuộc vào cấu hình R và W quorum của bạn, nó có thể cư xử gần hơn với CP. Nhãn AP áp dụng cho cấu hình mặc định / khuyến nghị của nó được tối ưu hóa cho tải công việc thương mại điện tử. Các thành phần cơ bản của kiến trúc 1. consistent hashing cho phân vùng Hãy để tôi giải thích điều này bằng một ví dụ cụ thể, bởi vì hashing nhất quán là một trong những khái niệm có vẻ kỳ diệu cho đến khi bạn thấy nó trong hành động. Vấn đề: Hash-Based Sharding truyền thống Hãy tưởng tượng bạn có 3 máy chủ và muốn phân phối dữ liệu trên chúng. # Traditional approach - DON'T DO THIS def get_server(key, num_servers): hash_value = hash(key) return hash_value % num_servers # Modulo operation # With 3 servers: get_server("user_123", 3) # Returns server 0 get_server("user_456", 3) # Returns server 1 get_server("user_789", 3) # Returns server 2 Điều này hoạt động... cho đến khi bạn thêm hoặc loại bỏ một máy chủ. Hãy xem điều gì xảy ra khi chúng ta đi từ 3 đến 4 máy chủ: # Before (3 servers): "user_123" → hash % 3 = 0 → Server 0 "user_456" → hash % 3 = 1 → Server 1 "user_789" → hash % 3 = 2 → Server 2 # After (4 servers): "user_123" → hash % 4 = 0 → Server 0 ✓ (stayed) "user_456" → hash % 4 = 1 → Server 1 ✓ (stayed) "user_789" → hash % 4 = 2 → Server 2 ✓ (stayed) # But wait - this is lucky! In reality, most keys MOVE: "product_ABC" → hash % 3 = 2 → Server 2 "product_ABC" → hash % 4 = 3 → Server 3 ✗ (MOVED!) Khi bạn thay đổi số lượng máy chủ, gần như tất cả dữ liệu của bạn cần phải được phân phối lại. The disaster Giải pháp: Hashing nhất quán Hashing nhất quán giải quyết điều này bằng cách xử lý không gian hash như một vòng tròn (0 đến 2^32 – 1, bao quanh xung quanh). Step 1: Place servers on the ring Mỗi máy chủ được gán một vị trí ngẫu nhiên trên vòng (được gọi là “token”). Step 2: Place data on the ring Khi bạn muốn lưu trữ dữ liệu, bạn: Hash chìa khóa để có được một vị trí trên vòng Đi theo hướng đồng hồ từ vị trí đó Lưu trữ dữ liệu trên máy chủ đầu tiên bạn gặp Lời bài hát: Full Ring Dưới đây là vòng được đặt theo thứ tự. phím đi theo hướng đồng hồ đến máy chủ tiếp theo: Một phím chạy theo hướng đồng hồ cho đến khi nó chạm vào một máy chủ. Simple rule : Examples user_123 ở 30° → đi đến 45° → Server A sở hữu nó user_456 ở 150° → đi đến 200° → Server C sở hữu nó cart_789 ở 250° → đi đến 280° → Server D sở hữu nó product_ABC ở 300° → đi qua 360°, đóng gói đến 0°, tiếp tục đến 45° → Server A sở hữu nó Who owns what range? Server A (45°): sở hữu mọi thứ từ 281° đến 45° (bọc xung quanh) Server B (120°): sở hữu mọi thứ từ 46° đến 120° Server C (200°): sở hữu mọi thứ từ 121° đến 200° Server D (280°): sở hữu mọi thứ từ 201° đến 280° The Magic: Thêm một máy chủ Bây giờ chúng ta hãy xem tại sao điều này là tuyệt vời. chúng ta thêm Server E ở vị trí 160°: BEFORE: Server A (45°) → owns 281°-45° Server B (120°) → owns 46°-120° Server C (200°) → owns 121°-200° ← THIS RANGE WILL SPLIT Server D (280°) → owns 201°-280° AFTER: Server A (45°) → owns 281°-45° ← NO CHANGE Server B (120°) → owns 46°-120° ← NO CHANGE Server E (160°) → owns 121°-160° ← NEW! Takes part of C's range Server C (200°) → owns 161°-200° ← SMALLER range Server D (280°) → owns 201°-280° ← NO CHANGE : Chỉ có các phím trong phạm vi 121°-160° cần di chuyển (từ C đến E). máy chủ A, B, và D hoàn toàn không bị ảnh hưởng! Result Ứng dụng Virtual Nodes Optimization Có một vấn đề quan trọng với cách tiếp cận hashing nhất quán cơ bản: . random distribution can be extremely uneven The Problem in Detail: Khi bạn ngẫu nhiên gán một vị trí cho mỗi máy chủ, bạn về cơ bản đang ném darts vào một bảng tròn.Đôi khi các darts nhóm lại với nhau, đôi khi chúng lan ra. Để tôi cho bạn một ví dụ cụ thể: Scenario: 4 servers with single random tokens Server A: 10° } Server B: 25° } ← Only 75° apart! Tiny ranges Server C: 100° } Server D: 280° ← 180° away from C! Huge range Range sizes: - Server A owns: 281° to 10° = 89° (25% of ring) - Server B owns: 11° to 25° = 14° (4% of ring) ← Underutilized! - Server C owns: 26° to 100° = 74° (21% of ring) - Server D owns: 101° to 280° = 179° (50% of ring) ← Overloaded! Real-world consequences: Không tải bằng nhau: Server D xử lý 50% tất cả dữ liệu trong khi Server B chỉ xử lý 4%. CPU, đĩa và mạng của Server D được tối đa hóa Server B chủ yếu là vô dụng (khả năng bị lãng phí) Độ trễ phần trăm 99,9 của bạn được chi phối bởi Server D bị quá tải Hotspot Cascading: Khi Server D trở nên chậm hoặc thất bại: Tất cả 50% tải của nó chuyển sang Server A (một lần đồng hồ tiếp theo) Server A bây giờ trở nên quá tải Hiệu suất hệ thống suy giảm thảm khốc Không hiệu quả quy mô: Thêm máy chủ không giúp đồng đều bởi vì máy chủ mới có thể hạ cánh trong phạm vi đã nhỏ Visualizing the problem: Mỗi máy chủ vật lý nhận được nhiều vị trí ảo (token). Dynamo’s solution Thay vì ném một dart cho mỗi máy chủ, hãy ném nhiều dart.Càng ném nhiều, phân phối càng trở nên nhiều (luật số lượng lớn). How Virtual Nodes Fix the Problem: Let’s take the same 4 servers, but now each server gets 3 virtual nodes (tokens) instead of 1: Physical Server A gets 3 tokens: 10°, 95°, 270° Physical Server B gets 3 tokens: 25°, 180°, 310° Physical Server C gets 3 tokens: 55°, 150°, 320° Physical Server D gets 3 tokens: 75°, 200°, 340° Now the ring looks like: 10° A, 25° B, 55° C, 75° D, 95° A, 150° C, 180° B, 200° D, 270° A, 310° B, 320° C, 340° D Range sizes (approximately): - Server A total: 15° + 55° + 40° = 110° (31% of ring) - Server B total: 30° + 20° + 30° = 80° (22% of ring) - Server C total: 20° + 30° + 20° = 70° (19% of ring) - Server D total: 20° + 70° + 20° = 110° (31% of ring) Tải trọng dao động từ 19% đến 31% thay vì 4% đến 50%. Much better! Why this works: : With more samples (tokens), the random distribution averages out. This is the law of large numbers in action. Statistics : When a server fails, its load is distributed across many servers, not just one neighbor: Granular load distribution Server A fails: - Its token at 10° → load shifts to Server B's token at 25° - Its token at 95° → load shifts to Server C's token at 150° - Its token at 270° → load shifts to Server B's token at 310° Result: The load is spread across multiple servers! : When adding a new server with 3 tokens, it steals small amounts from many servers instead of a huge chunk from one server. Smooth scaling Real Dynamo configurations: The paper mentions different strategies evolved over time. In production: Early versions: 100-200 virtual nodes per physical server Sau đó được tối ưu hóa thành: token Q/S cho mỗi nút (nơi Q = tổng phân vùng, S = số lượng máy chủ) Cài đặt thông thường: Mỗi máy chủ vật lý có thể có 128-256 nút ảo The Trade-off: Balance vs Overhead Nhiều nút ảo hơn có nghĩa là phân phối tải tốt hơn, nhưng có một chi phí. What you gain with more virtual nodes: With 1 token per server (4 servers): Load variance: 4% to 50% (±46% difference) ❌ With 3 tokens per server (12 virtual nodes): Load variance: 19% to 31% (±12% difference) ✓ With 128 tokens per server (512 virtual nodes): Load variance: 24% to 26% (±2% difference) ✓✓ What it costs: Kích thước siêu dữ liệu: Mỗi nút duy trì thông tin định tuyến 1 token mỗi máy chủ: Track 4 entries 128 token mỗi máy chủ: Theo dõi 512 mục : Nodes exchange membership info periodically Gossip overhead Thêm token = thêm dữ liệu để đồng bộ hóa giữa các nút Every second, nodes gossip their view of the ring Sự phức tạp của tái cân bằng: Khi các nút tham gia / rời Nhiều nút ảo = nhiều phân vùng chuyển để phối hợp Nhưng mỗi chuyển nhượng nhỏ hơn (đó thực sự là tốt cho bootstrapping) Dynamo’s evolution: The paper describes how Amazon optimized this over time: Strategy 1 (Initial): - 100-200 random tokens per server - Problem: Huge metadata (multiple MB per node) - Problem: Slow bootstrapping (had to scan for specific key ranges) Strategy 3 (Current): - Q/S tokens per server (Q=total partitions, S=number of servers) - Equal-sized partitions - Example: 1024 partitions / 8 servers = 128 tokens per server - Benefit: Metadata reduced to KB - Benefit: Fast bootstrapping (transfer whole partition files) Real production sweet spot: Hầu hết các triển khai Dynamo sử dụng 128-256 nút ảo cho mỗi máy chủ vật lý. Phân phối tải trong phạm vi 10-15% biến động (đủ tốt) Metadata overhead under 100KB per node (negligible) Phục hồi thất bại nhanh (sự phân tán tải trên nhiều nút) Diminishing returns. Going from 128 to 512 tokens only improves load balance by 2-3%, but doubles metadata size and gossip traffic. Why not more? : Các máy chủ vật lý (trên) bản đồ đến nhiều vị trí ảo (bên dưới) trên vòng. điều này phân phối tải của mỗi máy chủ trên các phần khác nhau của không gian hash. Key concept : Benefits Thêm tải phân phối When a server fails, its load is distributed across many servers (not just one neighbor) When a server joins, it steals a small amount from many servers So sánh tác động thế giới thực Let’s see the difference with real numbers: Traditional Hashing (3 servers → 4 servers): - Keys that need to move: ~75% (3 out of 4) - Example: 1 million keys → 750,000 keys must migrate Consistent Hashing (3 servers → 4 servers): - Keys that need to move: ~25% (1 out of 4) - Example: 1 million keys → 250,000 keys must migrate With Virtual Nodes (150 vnodes total → 200 vnodes): - Keys that need to move: ~12.5% (spread evenly) - Example: 1 million keys → 125,000 keys must migrate - Load is balanced across all servers Khoảnh khắc “Aha!” The key insight is this: Consistent hashing decouples the hash space from the number of servers. Traditional: server = hash(key) % num_servers ← num_servers là trong công thức! Consistent: server = ring.findNextClockwise(hash(key)) ← num_servers không có trong công thức! Đây là lý do tại sao việc thêm / loại bỏ máy chủ chỉ ảnh hưởng đến một phần nhỏ của dữ liệu. giá trị hash không thay đổi - chỉ có máy chủ nào "người sở hữu" phạm vi nào thay đổi, và chỉ cục bộ. Think of it like a circular running track with water stations (servers). If you add a new water station, runners only change stations if they’re between the old nearest station and the new one. Everyone else keeps going to their same station. Chiến lược sao chép (N, R, W) The Problem: Availability vs Consistency Trade-off Imagine you’re building Amazon’s shopping cart. A customer adds an item to their cart, but at that exact moment: Một máy chủ đang được khởi động lại để bảo trì Một máy chủ khác có một hiccup mạng Một máy chủ thứ ba hoàn toàn ổn (Hợp nhất mạnh mẽ): Traditional database approach Client: "Add this item to my cart" Database: "I need ALL replicas to confirm before I say yes" Server 1: ✗ (rebooting) Server 2: ✗ (network issue) Server 3: ✓ (healthy) Result: "Sorry, service unavailable. Try again later." : 😡 “I can’t add items to my cart during Black Friday?!” Customer experience This is unacceptable for e-commerce. Every rejected write is lost revenue. Dynamo’s Solution: Tunable Quorums Dynamo gives you three knobs to tune the exact trade-off you want: N: Số bản sao (nhiều bản sao của dữ liệu) A: Đọc quorum (nhiều bản sao phải trả lời để đọc thành công) W: Write quorum (nhiều bản sao phải thừa nhận để viết thành công) : Khi nào , bạn đảm bảo quorum overlap - có nghĩa là ít nhất một nút nhận được viết sẽ được truy vấn trong bất kỳ đọc nào. overlap này cho phép phát hiện phiên bản mới nhất, miễn là logic hòa giải xác định chính xác đồng hồ vector cao nhất. nó không tự động đảm bảo read-your-writes trừ khi điều phối viên giải quyết phiên bản đúng. The magic formula R + W > N Hãy để tôi cho bạn thấy tại sao điều này quan trọng với các kịch bản thực tế: Scenario 1: Shopping Cart (Prioritize Availability) N = 3 # Three replicas for durability R = 1 # Read from any single healthy node W = 1 # Write to any single healthy node # Trade-off analysis: # ✓ Writes succeed even if 2 out of 3 nodes are down # ✓ Reads succeed even if 2 out of 3 nodes are down # ✓ Maximum availability - never reject customer actions # ✗ Might read stale data # ✗ Higher chance of conflicts (but we can merge shopping carts) What happens during failure: Client: "Add item to cart" Coordinator tries N=3 nodes: - Node 1: ✗ Down - Node 2: ✓ ACK (W=1 satisfied!) - Node 3: Still waiting... Result: SUCCESS returned to client immediately Node 3 eventually gets the update (eventual consistency) Kịch bản 2: Tình trạng phiên họp (Phương pháp cân bằng) N = 3 R = 2 # Must read from 2 nodes W = 2 # Must write to 2 nodes # Trade-off analysis: # ✓ R + W = 4 > N = 3 → Read-your-writes guaranteed # ✓ Tolerates 1 node failure # ✓ Good balance of consistency and availability # ✗ Write fails if 2 nodes are down # ✗ Read fails if 2 nodes are down Why R + W > N enables read-your-writes: Write to W=2 nodes: [A, B] Later, read from R=2 nodes: [B, C] Because W + R = 4 > N = 3, there's guaranteed overlap! At least one node (B in this case) will have the latest data. The coordinator detects the newest version by comparing vector clocks. This guarantees seeing the latest write as long as reconciliation picks the causally most-recent version correctly. Kịch bản 3: Dữ liệu tài chính (Đặt ưu tiên tính nhất quán) N = 3 R = 3 # Must read from ALL nodes W = 3 # Must write to ALL nodes # Trade-off analysis: # ✓ Full replica quorum — reduces likelihood of divergent versions # ✓ Any read will overlap every write quorum # ✗ Write fails if ANY node is down # ✗ Read fails if ANY node is down # ✗ Poor availability during failures Các hệ thống đòi hỏi đảm bảo giao dịch nghiêm ngặt thường chọn hệ thống CP thay vì. cấu hình này được hỗ trợ kỹ thuật bởi Dynamo nhưng hy sinh các thuộc tính khả dụng thúc đẩy sử dụng nó ở nơi đầu tiên. Configuration Comparison Table Config N R W Availability Consistency Use Case High Availability 3 1 1 ⭐⭐⭐⭐⭐ ⭐⭐ Shopping cart, wish list Balanced 3 2 2 ⭐⭐⭐⭐ ⭐⭐⭐⭐ Session state, user preferences Full Quorum 3 3 3 ⭐⭐ ⭐⭐⭐⭐⭐ High-stakes reads (not linearizable) Read-Heavy 3 1 3 ⭐⭐⭐ (reads) ⭐⭐⭐⭐ Product catalog, CDN metadata Write-Heavy 3 3 1 ⭐⭐⭐ (writes) ⭐⭐⭐ Click tracking, metrics High Availability 3 1 1 ⭐⭐⭐⭐⭐ ⭐⭐ Shopping cart, wish list Balanced 3 2 2 ⭐⭐⭐⭐ ⭐⭐⭐⭐ Session Status, User Preferences (Tình trạng phiên, sở thích người dùng) Full Quorum 3 3 3 ⭐⭐ ⭐⭐⭐⭐⭐ High-stakes read (không phải là linearizable) Read-Heavy 3 1 3 ⭐⭐⭐ (reads) ⭐⭐⭐⭐ Danh mục sản phẩm, CDN metadata Write-Heavy 3 3 1 ⭐⭐⭐ (bài viết) ⭐⭐⭐ Click tracking, metrics Lưu ý về các hệ thống tài chính: Các hệ thống yêu cầu đảm bảo giao dịch mạnh mẽ (ví dụ, số dư tài khoản ngân hàng) thường không nên sử dụng Dynamo. Điều đó nói rằng, một số hệ thống tài chính thực sự xây dựng trên lưu trữ theo phong cách Dynamo cho lớp kiên trì của họ trong khi áp dụng ngữ nghĩa mạnh mẽ hơn trên lớp ứng dụng hoặc logic kinh doanh. Lưu ý về các hệ thống tài chính: Các hệ thống yêu cầu đảm bảo giao dịch mạnh mẽ (ví dụ, số dư tài khoản ngân hàng) thường không nên sử dụng Dynamo. Điều đó nói rằng, một số hệ thống tài chính thực sự xây dựng trên lưu trữ theo phong cách Dynamo cho lớp kiên trì của họ trong khi áp dụng ngữ nghĩa mạnh mẽ hơn trên lớp ứng dụng hoặc logic kinh doanh. The Key Insight Hầu hết hệ thống sử dụng because: N=3, R=2, W=2 Độ bền: Có thể chịu được tối đa 2 sự cố sao chép trước khi mất dữ liệu vĩnh viễn (giả sử các sự cố độc lập và không có sự gián đoạn liên quan). Availability: Tolerates 1 node failure for both reads and writes (Chấp nhận 1 node thất bại cho cả đọc và viết) Sự nhất quán: R + W > N đảm bảo rằng đọc và viết quorums chồng chéo, cho phép hành vi đọc và viết của bạn trong trường hợp không có viết đồng thời. : Don’t wait for the slowest node (only need 2 out of 3) Performance Real production numbers from the paper: Dịch vụ giỏ hàng của Amazon trong đỉnh cao (mùa lễ): Configuration: N=3, R=2, W=2 xử lý hàng chục triệu yêu cầu Over 3 million checkouts in a single day No downtime, even with server failures Cách tiếp cận có thể điều chỉnh này là những gì làm cho Dynamo cách mạng. Bạn không bị mắc kẹt với một kích thước phù hợp với tất cả - bạn điều chỉnh nó dựa trên yêu cầu kinh doanh thực tế của bạn. 3. Vector Clocks for Versioning Vấn đề: Phát hiện nguyên nhân trong các hệ thống phân tán Khi nhiều nút có thể chấp nhận viết độc lập, bạn cần trả lời một câu hỏi quan trọng: Are these two versions of the same data related, or were they created concurrently? Why timestamps don’t work: Scenario: Two users edit the same shopping cart simultaneously User 1 at 10:00:01.500 AM: Adds item A → Writes to Node X User 2 at 10:00:01.501 AM: Adds item B → Writes to Node Y Physical timestamp says: User 2's version is "newer" Reality: These are concurrent! Both should be kept! Problem: - Clocks on different servers are NEVER perfectly synchronized - Clock skew can be seconds or even minutes - Network delays are unpredictable - Physical time doesn't capture causality What we really need to know: Version A happened before Version B? → B can overwrite A Version A and B are concurrent? → Keep both, merge later Version A came from reading Version B? → We can track this! The Solution: Vector Clocks A vector clock is a simple data structure: a list of pairs that tracks which nodes have seen which versions. (node_id, counter) The rules: When a node writes data, it increments its own counter Khi một nút đọc dữ liệu, nó nhận được đồng hồ vector When comparing two vector clocks: If all counters in A ≤ counters in B → A is an ancestor of B (B is newer) Nếu một số đếm trong A > B và một số B > A → A và B đồng thời (mâu thuẫn!) Step-by-step ví dụ Let’s trace a shopping cart through multiple updates: Breaking down the conflict: D3: [Sx:2, Sy:1] vs D4: [Sx:2, Sz:1] Comparing: - Sx: 2 == 2 ✓ (equal) - Sy: 1 vs missing in D4 → D3 has something D4 doesn't - Sz: missing in D3 vs 1 → D4 has something D3 doesn't Conclusion: CONCURRENT! Neither is an ancestor of the other. Both versions must be kept and merged. Đặc điểm thế giới thực Các báo cáo Dynamo báo cáo phân phối xung đột sau đây được đo trên 24 giờ lưu lượng truy cập giỏ hàng sản xuất của Amazon. những con số này phản ánh khối lượng công việc cụ thể của Amazon - tỷ lệ đọc / viết cao, chủ yếu là phiên người dùng duy nhất - và không nên được giả định chung cho tất cả các triển khai Dynamo: 99.94% - Single version (no conflict) 0.00057% - 2 versions 0.00047% - 3 versions 0.00009% - 4 versions : Conflicts are RARE in practice! Key insight Why conflicts happen: Không thường xuyên từ thất bại mạng Mostly from concurrent writers (often automated processes/bots) Human users rarely create conflicts because they’re slow compared to network speed The Size Problem Vector clocks can grow unbounded if many nodes coordinate writes. Dynamo’s solution: đồng hồ vượt quá một ngưỡng kích thước. truncate the oldest entries // When vector clock exceeds threshold (e.g., 10 entries) // Remove the oldest entry based on wall-clock timestamp vectorClock = { 'Sx': {counter: 5, timestamp: 1609459200}, 'Sy': {counter: 3, timestamp: 1609459800}, 'Sz': {counter: 2, timestamp: 1609460400}, // ... 10 more entries } // If size > 10, remove entry with oldest timestamp // ⚠ Risk: Dropping an entry collapses causality information. // Two versions that were causally related may now appear // concurrent, forcing the application to resolve a conflict // that didn't actually exist. In practice, Amazon reports // this has not been a significant problem — but it is a // real theoretical risk in high-churn write environments // with many distinct coordinators. Sloppy Quorum và Hinted Handoff The Problem: Strict Quorums Kill Availability Traditional quorum systems are rigid and unforgiving. Traditional strict quorum: Your data is stored on nodes: A, B, C (preference list) Write requirement: W = 2 Scenario: Node B is down for maintenance Coordinator: "I need to write to 2 nodes from {A, B, C}" Tries: A ✓, B ✗ (down), C ✓ Result: SUCCESS (got 2 out of 3) Scenario: Nodes B AND C are down Coordinator: "I need to write to 2 nodes from {A, B, C}" Tries: A ✓, B ✗ (down), C ✗ (down) Result: FAILURE (only got 1 out of 3) Customer: "Why can't I add items to my cart?!" 😡 Vấn đề : Nếu các nút cụ thể đó xuống, hệ thống trở nên không sẵn dùng. Strict quorums require specific nodes Real scenario at Amazon: Black Friday, 2:00 PM - Datacenter 1: 20% of nodes being rebooted (rolling deployment) - Datacenter 2: Network hiccup (1-2% packet loss) - Traffic: 10x normal load With strict quorum: - 15% of write requests fail - Customer support phones explode - Revenue impact: Millions per hour Lời bài hát: Sloppy Quorum Dynamo relaxes the quorum requirement: “Write to the first N healthy nodes in the preference list, walking further down the ring if needed.” Preference list for key K: A, B, C But B is down... Sloppy Quorum says: "Don't give up! Walk further down the ring: A, B, C, D, E, F, ..." Coordinator walks until N=3 healthy nodes are found: A, C, D (D is a temporary substitute for B) How Hinted Handoff Works When a node temporarily substitutes for a failed node, it stores a “hint” with the data. Detailed Hinted Handoff Process Step 1: Detect failure and substitute def write_with_hinted_handoff(key, value, N, W): preference_list = get_preference_list(key) # [A, B, C] healthy_nodes = [] for node in preference_list: if is_healthy(node): healthy_nodes.append((node, is_hint=False)) # If we don't have N healthy nodes, expand the list if len(healthy_nodes) = N: break # Write to first N healthy nodes acks = 0 for node, is_hint in healthy_nodes[:N]: if is_hint: # Store with hint metadata intended_node = find_intended_node(preference_list, node) success = node.write_hinted(key, value, hint=intended_node) else: success = node.write(key, value) if success: acks += 1 if acks >= W: return SUCCESS return FAILURE Step 2: Background hint transfer # Runs periodically on each node (e.g., every 10 seconds) def transfer_hints(): hints_db = get_hinted_replicas() for hint in hints_db: intended_node = hint.intended_for if is_healthy(intended_node): try: intended_node.write(hint.key, hint.value) hints_db.delete(hint) log(f"Successfully transferred hint to {intended_node}") except: log(f"Will retry later for {intended_node}") Tại sao điều này là tuyệt vời Durability maintained: Even though B is down: - We still have N=3 copies: A, C, D - Data won't be lost even if another node fails - System maintains durability guarantee Availability maximized: Client perspective: - Write succeeds immediately - No error message - No retry needed - Customer happy Traditional quorum would have failed: - Only 2 nodes available (A, C) - Need 3 for N=3 - Write rejected - Customer sees error Eventual consistency: Timeline: T=0: Write succeeds (A, C, D with hint) T=0-5min: B is down, but system works fine T=5min: B recovers T=5min+10sec: D detects B is back, transfers hint T=5min+11sec: B has the data, D deletes hint Result: Eventually, all correct replicas have the data Configuration Ví dụ // High availability configuration const config = { N: 3, // Want 3 replicas W: 2, // Only need 2 ACKs to succeed R: 2, // Read from 2 nodes // Sloppy quorum allows expanding preference list sloppy_quorum: true, // How far to expand when looking for healthy nodes max_extended_preference_list: 10, // How often to check for hint transfers hint_transfer_interval: 10_seconds, // How long to keep trying to transfer hints hint_retention: 7_days }; Ảnh hưởng thế giới thực From Amazon’s production experience: During normal operation: Hinted handoff hiếm khi kích hoạt Hầu hết các văn bản đi đến các nút ưa thích Cơ sở dữ liệu Hints hầu hết trống rỗng During failures: Scenario: 5% of nodes failing at any time (normal at Amazon's scale) Without hinted handoff: - Write success rate: 85% - Customer impact: 15% of cart additions fail With hinted handoff: - Write success rate: 99.9%+ - Customer impact: Nearly zero During datacenter failure: Scenario: Entire datacenter unreachable (33% of nodes) Without hinted handoff: - Many keys would lose entire preference list - Massive write failures - System effectively down With hinted handoff: - Writes redirect to other datacenters - Hints accumulate temporarily - When datacenter recovers, hints transfer - Zero customer-visible failures Các trade-off Benefits: ✓ Tối đa viết sẵn ✓ Độ bền được duy trì trong thời gian thất bại ✓ Automatic recovery when nodes come back Không cần sự can thiệp thủ công Costs: Không nhất quán tạm thời (dữ liệu không có trên các nút “đúng”) ✗ Extra storage for hints database Dải băng thông nền cho Hint Transfer Mã phức tạp hơn một chút Hinted handoff cung cấp độ bền tạm thời, không phải là sự sao chép vĩnh viễn.Nếu một nút thay thế (như D) thất bại trước khi nó có thể chuyển dấu gợi ý của nó trở lại B, số lượng bản sao thực sự giảm xuống dưới N cho đến khi tình huống được giải quyết.Đây là một trường hợp cạnh quan trọng để hiểu trong lập kế hoạch thất bại. Lợi ích về tính sẵn có vượt xa chi phí cho khối lượng công việc thương mại điện tử. Amazon’s verdict: Conflict Resolution: The Shopping Cart Problem Let’s talk about the most famous example from the paper: the shopping cart. This is where rubber meets road. What Is a Conflict (and Why Does It Happen)? a là occurs when two writes happen to the same key on different nodes, without either write “knowing about” the other. This is only possible because Dynamo accepts writes even when nodes can’t communicate—which is the whole point! conflict Dưới đây là một chuỗi các sự kiện cụ thể tạo ra xung đột: Timeline: T=0: Customer logs in. Cart has {shoes} on all 3 nodes. T=1: Network partition: Node1 can't talk to Node2. T=2: Customer adds {jacket} on their laptop → goes to Node1. Cart on Node1: {shoes, jacket} ← Vector clock: [N1:2] T=3: Customer adds {hat} on their phone → goes to Node2. Cart on Node2: {shoes, hat} ← Vector clock: [N2:2] T=4: Network heals. Node1 and Node2 compare notes. Node1 says: "I have version [N1:2]" Node2 says: "I have version [N2:2]" Neither clock dominates the other → CONFLICT! Cả hai phiên bản đều không “sai” - cả hai đều đại diện cho hành động thực tế mà khách hàng đã thực hiện. công việc của Dynamo là phát hiện tình huống này (thông qua đồng hồ vector) và bề mặt to the application so the application can decide what to do. both versions What Does the Application Do With a Conflict? Đây là phần quan trọng mà bài báo giao phó cho bạn: Dynamo cung cấp cho bạn tất cả các phiên bản đồng thời; mã của bạn quyết định cách hợp nhất chúng. the application must resolve conflicts using business logic Đối với giỏ hàng, Amazon đã chọn một Lý do là đơn giản - mất một mục từ giỏ hàng của khách hàng (mất một bán hàng) là tồi tệ hơn so với thỉnh thoảng hiển thị một mục chưa được xóa. union merge Conflict versions: Version A (from Node1): {shoes, jacket} Version B (from Node2): {shoes, hat} Merge strategy: union Merged cart: {shoes, jacket, hat} ← All items preserved Đây là mã hòa giải thực tế: from __future__ import annotations from dataclasses import dataclass, field class VectorClock: def __init__(self, clock: dict[str, int] | None = None): self.clock: dict[str, int] = clock.copy() if clock else {} def merge(self, other: "VectorClock") -> "VectorClock": """Merged clock = max of each node's counter across both versions.""" all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" @dataclass class ShoppingCart: items: list[str] = field(default_factory=list) vector_clock: VectorClock = field(default_factory=VectorClock) @staticmethod def reconcile(carts: list["ShoppingCart"]) -> "ShoppingCart": if len(carts) == 1: return carts[0] # No conflict, nothing to do # Merge strategy: union of all items (never lose additions). # This is Amazon's choice for shopping carts. # A different application might choose last-write-wins or something else. all_items: set[str] = set() merged_clock = VectorClock() for cart in carts: all_items.update(cart.items) # Union: keep everything merged_clock = merged_clock.merge(cart.vector_clock) return ShoppingCart(items=sorted(all_items), vector_clock=merged_clock) # Example conflict scenario cart1 = ShoppingCart(items=["shoes", "jacket"], vector_clock=VectorClock({"N1": 2})) cart2 = ShoppingCart(items=["shoes", "hat"], vector_clock=VectorClock({"N2": 2})) # Dynamo detected a conflict and passes both versions to our reconcile() reconciled = ShoppingCart.reconcile([cart1, cart2]) print(reconciled.items) # ['hat', 'jacket', 'shoes'] — union! The Deletion Problem (Why This Gets Tricky) Chiến lược liên minh có một trường hợp cạnh xấu: . deleted items can come back from the dead T=0: Cart: {shoes, hat} T=1: Customer removes hat → Cart: {shoes} Clock: [N1:3] T=2: Network partition — Node2 still has old state T=3: Some concurrent write to Node2 Clock: [N2:3] T=4: Network heals → conflict detected T=5: Union merge: {shoes} ∪ {shoes, hat} = {shoes, hat} Result: Hat is BACK! Customer removed it, but it reappeared. Amazon rõ ràng chấp nhận sự thỏa hiệp này.Một mặt hàng "hồn ma" trong giỏ hàng là một sự phiền toái nhỏ.Mất thêm một giỏ hàng trong một cuộc bán hàng Thứ Sáu Đen là mất thu nhập. Ghi chú kỹ thuật sâu: Logic hợp nhất phải cụ thể cho lĩnh vực và được thiết kế cẩn thận. Thêm các mục là chuyển đổi (lệnh không quan trọng) và dễ dàng để hợp nhất. Loại bỏ các mục không phải là—một việc xóa trong một chi nhánh đồng thời có thể được lặng lẽ bỏ qua trong một hợp nhất dựa trên hiệp hội. Đây là một sự thỏa hiệp có chủ ý trong thiết kế của Dynamo, nhưng nó có nghĩa là ứng dụng phải suy nghĩ cẩn thận về việc thêm so với loại bỏ ngữ nghĩa. Nếu dữ liệu của bạn không tự nhiên hỗ trợ hợp nhất (ví dụ, một con số, địa chỉ của người dùng), bạn cần một chiến lược khác – chẳng hạn như CRDTs, cuối cùng-wins-wins với timestamps, hoặc đơn giản là từ chối viết đồng thời cho loại dữ liệu đó. Ghi chú kỹ thuật sâu: Logic hợp nhất phải cụ thể cho lĩnh vực và được thiết kế cẩn thận. Thêm các mục là chuyển đổi (lệnh không quan trọng) và dễ dàng để hợp nhất. Loại bỏ các mục không phải là—một việc xóa trong một chi nhánh đồng thời có thể được lặng lẽ bỏ qua trong một hợp nhất dựa trên hiệp hội. Đây là một sự thỏa hiệp có chủ ý trong thiết kế của Dynamo, nhưng nó có nghĩa là ứng dụng phải suy nghĩ cẩn thận về việc thêm so với loại bỏ ngữ nghĩa. Nếu dữ liệu của bạn không tự nhiên hỗ trợ hợp nhất (ví dụ, một con số, địa chỉ của người dùng), bạn cần một chiến lược khác – chẳng hạn như CRDTs, cuối cùng-wins-wins với timestamps, hoặc đơn giản là từ chối viết đồng thời cho loại dữ liệu đó. Đọc & Viết Flow The diagrams above show the high-level flow, but let’s walk through what actually happens step-by-step during a read and a write. Understanding this concretely will make the earlier concepts click. Viết đường Step-by-step narration of a PUT request: Khách hàng gửi yêu cầu đến bất kỳ nút nào (thông qua bộ cân bằng tải) hoặc trực tiếp đến điều phối viên. — this is the first node in the preference list for the key’s hash position on the ring. The coordinator is determined — the coordinator increments its own counter in the vector clock, creating a new version. Vector clock is updated Điều phối viên viết địa phương, sau đó người hâm mộ viết ra các nút N-1 khác trong danh sách ưu tiên cùng một lúc. Điều phối viên chờ đợi sự thừa nhận của W. Nó KHÔNG chờ đợi tất cả N - chỉ là W đầu tiên để trả lời. Các nút còn lại chưa trả lời sẽ nhận được ghi cuối cùng (hoặc thông qua handoff gợi ý nếu chúng xuống). to the client. From the client’s perspective, the write is done. Once W ACKs arrive, the coordinator returns 200 OK : Khách hàng nhận được một phản hồi thành công ngay khi các nút W được xác nhận. Các nút khác (N - W) sẽ nhận được bản ghi không đồng bộ.Đó là lý do tại sao hệ thống là "đến cuối cùng nhất quán" - tất cả các nút có dữ liệu, nhưng không nhất thiết phải cùng một lúc. Key insight about the write path will Read Path Step-by-step narration of a GET request: Khách hàng gửi yêu cầu cho điều phối viên cho khóa đó. in the preference list simultaneously (not just R). This is important — it contacts all N, but only needs R to respond. The coordinator sends read requests to all N nodes Chờ đợi câu trả lời R. Điều phối viên quay trở lại ngay khi các nút R đã trả lời, mà không chờ đợi những nút chậm hơn. The coordinator checks all the vector clocks: Compare the versions returned. If all versions are identical → return the single version immediately. If one version’s clock dominates the others (it’s causally “newer”) → return that version. If versions are concurrent (neither clock dominates) → return to the client, which must merge them. all versions Sửa chữa đọc xảy ra trong nền: nếu điều phối viên nhận thấy bất kỳ nút nào trả về một phiên bản tạm dừng, nó sẽ gửi phiên bản mới nhất cho nút đó để cập nhật nó. Because Dynamo is a general-purpose storage engine. It doesn’t know whether you’re storing a shopping cart, a user profile, or a session token. Only biết làm thế nào để hợp nhất hai phiên bản mâu thuẫn theo cách có ý nghĩa kinh doanh. điều phối viên cung cấp cho bạn các phiên bản đồng thời thô cùng với bối cảnh đồng hồ vector, và bạn làm điều đúng cho trường hợp sử dụng của bạn. Why does the client receive the conflict instead of the coordinator resolving it? ứng dụng của bạn : khi khách hàng viết phiên bản sáp nhập trở lại, nó phải bao gồm ngữ cảnh (giờ vector sáp nhập). Điều này cho Dynamo biết rằng viết mới đã "nhìn thấy" tất cả các phiên bản đồng thời, vì vậy mâu thuẫn được giải quyết. cạnh tranh viết lên trên xung đột chưa được giải quyết. The vector clock context is the key to closing the loop khác Cây Merkle cho Anti-Entropy Vấn đề: Làm thế nào để bạn biết khi các bản sao không được đồng bộ hóa? Sau khi một nút phục hồi từ một thất bại, nó có thể đã bỏ lỡ một số viết. Sau khi một phân vùng mạng chữa lành, hai bản sao có thể khác nhau. Làm thế nào Dynamo phát hiện và sửa chữa những sự khác biệt này? The brute-force approach would be: “Every hour, compare every key on Node A against Node B, and sync anything that’s different.” But at Amazon’s scale, a single node might store hundreds of millions of keys. Comparing them all one by one would be so slow and bandwidth-intensive that it would interfere with normal traffic. Ý tưởng cốt lõi: thay vì so sánh các phím riêng lẻ, so sánh Nếu hash phù hợp, thì toàn bộ nhóm đó là giống hệt nhau – skip nó. Dynamo uses Merkle trees to solve this efficiently. Hashes của nhóm key : Merkle tree sync is a mechanism. It’s not on the hot read/write path. Normal reads and writes use vector clocks and quorums for versioning. Merkle trees are for the repair process that runs periodically in the background to catch any inconsistencies that slipped through. Important background anti-entropy : Merkle tree sync is a Cơ chế. Nó không phải là trên con đường đọc / viết nóng. Đọc và viết bình thường sử dụng đồng hồ vector và giới hạn cho phiên bản. cây Merkle là cho quá trình sửa chữa chạy định kỳ trong nền để nắm bắt bất kỳ sự không nhất quán nào đã trượt qua. Important background anti-entropy How a Merkle Tree Is Built Each node builds a Merkle tree over its data, organized by key ranges: Các nút tờ chứa hash của một phạm vi nhỏ các khóa dữ liệu thực tế (ví dụ, hash của tất cả các giá trị cho các khóa k1, k2, k3). Các nút nội bộ chứa hash của các hash của con họ. Rễ là một hash duy nhất đại diện cho tất cả dữ liệu trên nút. Làm thế nào hai nút đồng bộ bằng cách sử dụng cây Merkle When Node A and Node B want to check if they’re in sync: : So sánh các hash gốc. Nếu chúng giống nhau, mọi thứ đều giống nhau. Đã thực hiện! (Không lưu lượng truy cập mạng cho chính dữ liệu.) Step 1 : Nếu rễ khác nhau, so sánh con cái trái của họ. cùng? bỏ qua toàn bộ nửa không gian khóa đó. Step 2 : Tiếp tục đi xuống chỉ vào các cây phụ nơi hash khác nhau, cho đến khi bạn đạt đến các nút lá. Step 3 : Chỉ đồng bộ các phím cụ thể trong các nút tờ khác nhau. Step 4 Example: Comparing two nodes Node A root: abc789 ← differs from Node B! Node B root: abc788 Compare left subtrees: Node A left: xyz123 Node B left: xyz123 ← same! Skip entire left half. Compare right subtrees: Node A right: def456 Node B right: def457 ← differs! Go deeper. Compare right-left subtree: Node A right-left: ghi111 Node B right-left: ghi111 ← same! Skip. Compare right-right subtree: Node A right-right: jkl222 Node B right-right: jkl333 ← differs! These are leaves. → Sync only the keys in the right-right leaf range (e.g., k10, k11, k12) Instead of comparing all 1 million keys, we compared 6 hashes and synced only 3 keys! : Synchronization process in code def sync_replicas(node_a, node_b, key_range): """ Efficiently sync two nodes using Merkle trees. Instead of comparing all keys, we compare hashes top-down. Only the ranges where hashes differ need actual key-level sync. """ tree_a = node_a.get_merkle_tree(key_range) tree_b = node_b.get_merkle_tree(key_range) # Step 1: Compare root hashes first. # If they match, every key in this range is identical — nothing to do! if tree_a.root_hash == tree_b.root_hash: return # Zero data transferred — full match! # Step 2: Recursively find differences by traversing top-down. # Only descend into subtrees where hashes differ. differences = [] stack = [(tree_a.root, tree_b.root)] while stack: node_a_subtree, node_b_subtree = stack.pop() if node_a_subtree.hash == node_b_subtree.hash: continue # This whole subtree matches — skip it! if node_a_subtree.is_leaf: # Found a differing leaf — these keys need syncing differences.extend(node_a_subtree.keys) else: # Not a leaf yet — recurse into children for child_a, child_b in zip(node_a_subtree.children, node_b_subtree.children): stack.append((child_a, child_b)) # Step 3: Sync only the specific keys that differed at leaf level. # This might be a handful of keys, not millions. for key in differences: sync_key(node_a, node_b, key) Tại sao nó hiệu quả Sức mạnh của cây Merkle là số lượng so sánh hash bạn cần đo lường với (Logarithmic trong số lượng phím), không phải số lượng chính các phím. depth of the tree Node with 1,000,000 keys: Naive approach: Compare 1,000,000 keys individually Cost: 1,000,000 comparisons Merkle tree: Compare O(log N) hashes top-down Tree depth ≈ 20 levels Cost: 20 comparisons to find differences Then sync only the differing leaves (~few keys) Speedup: ~50,000x fewer comparisons! Và quan trọng, nếu hai nút là (mà hầu như luôn luôn đúng trong một cụm khỏe mạnh), các hash gốc thường phù hợp hoàn toàn và không có dữ liệu cần phải được chuyển. mostly in sync Thành viên và phát hiện thất bại Dynamo sử dụng một giao thức trò đùa để quản lý thành viên. Mỗi nút định kỳ trao đổi thông tin thành viên với các đồng nghiệp ngẫu nhiên. Không có nút chính - tất cả sự phối hợp là hoàn toàn phi tập trung. Gossip-Based Thành viên Key Design Points : Mỗi nút duy trì dạng xem riêng của thành viên cụm. Không có sổ đăng ký trung tâm, vì vậy không có điểm thất bại duy nhất cho dữ liệu thành viên. No single coordinator : Dynamo uses an accrual-based failure detector (similar to Phi Accrual). Rather than a binary “alive/dead” judgment, nodes maintain a mà tăng càng lâu một đồng nghiệp là không đáp ứng. Điều này tránh các dương tính giả từ lỗ hổng mạng tạm thời. Failure suspicion vs. detection suspicion level Node A's view of Node B: - Last heartbeat: 3 seconds ago → Suspicion low → Healthy - Last heartbeat: 15 seconds ago → Suspicion rising → Likely slow/degraded - Last heartbeat: 60 seconds ago → Suspicion high → Treat as failed : New nodes contact a seed node to join, then gossip spreads their presence to the rest of the cluster. Ring membership is eventually consistent—different nodes may have slightly different views of the ring momentarily, which is acceptable. Decentralized bootstrapping Tính năng chính: Real Numbers Bài báo cung cấp dữ liệu hiệu suất hấp dẫn. Latency Distribution Metric | Average | 99.9th Percentile --------------------|---------|------------------ Read latency | ~10ms | ~200ms Write latency | ~15ms | ~200ms Key insight: 99.9th percentile is ~20x the average! The 99.9th percentile is affected by: Why the huge gap? Garbage Collection nghỉ ngơi Disk I/O biến thể Mạng Jitter Load mất cân bằng This is why Amazon SLAs are specified at 99.9th percentile, not average. Phiên bản xung đột Từ 24 giờ lưu lượng truy cập giỏ hàng sản xuất của Amazon (theo tờ Dynamo). lưu ý những điều này phản ánh các đặc điểm khối lượng công việc cụ thể của Amazon, không phải là một đường cơ sở phổ quát: 99.94% - Saw exactly one version (no conflict) 0.00057% - Saw 2 versions 0.00047% - Saw 3 versions 0.00009% - Saw 4 versions : Xung đột là hiếm trong thực tế! Thường là do các tác giả đồng thời (robot), không phải thất bại. Takeaway Chiến lược phân chia Evolution Dynamo đã phát triển thông qua ba chiến lược phân vùng. sự phát triển này dạy cho chúng ta những bài học quan trọng: Chiến lược 1: Random Tokens (Bắt đầu) Problem: Random token assignment → uneven load Problem: Adding nodes → expensive data scans Problem: Can't easily snapshot the system Tùy chọn token ngẫu nhiên nghe có vẻ thanh lịch nhưng là một cơn ác mộng trong thực tế.Mỗi nút có một vị trí ngẫu nhiên trên vòng, có nghĩa là phạm vi sở hữu dữ liệu khác nhau và phân phối tải không đồng đều. Operational lesson Chiến lược 2: Các phân vùng có kích thước bằng nhau + Token ngẫu nhiên Improvement: Decouples partitioning from placement Problem: Still has load balancing issues Chiến lược 3: Q/S Tokens Per Node – Partitions Equal-Size + Deterministic Placement (Current) What Q and S mean: Q = tổng số phân vùng cố định mà vòng được chia thành (ví dụ: 1024). Hãy nghĩ về những phân vùng này như là các phần cắt trước của không gian hash không bao giờ thay đổi hình dạng. = the number of physical servers currently in the cluster (e.g. 8). S Q/S = số lượng các phân vùng cố định mà mỗi máy chủ chịu trách nhiệm (ví dụ: 1024 / 8 = 128 phân vùng mỗi máy chủ). Sự thay đổi quan trọng từ các chiến lược trước đây: chiếc nhẫn bây giờ được chia thành các phân vùng Q cố định, kích thước bằng nhau Các máy chủ không còn nhận được các vị trí ngẫu nhiên – mỗi máy chủ đều có chính xác các phân vùng Q/S, phân phối đều quanh vòng. Đầu tiên Example: Q=12 partitions, S=3 servers Ring divided into 12 equal slices (each covers 30° of the 360° ring): Partition 1: 0°– 30° → Server A Partition 2: 30°– 60° → Server B Partition 3: 60°– 90° → Server C Partition 4: 90°–120° → Server A Partition 5: 120°–150° → Server B Partition 6: 150°–180° → Server C ...and so on, round-robin Each server owns exactly Q/S = 12/3 = 4 partitions → perfectly balanced. When a 4th server joins (S becomes 4): New Q/S = 12/4 = 3 partitions per server. Each existing server hands off 1 partition to the new server. Only 3 out of 12 partitions move — the rest are untouched. Sự tiến hóa này – từ token ngẫu nhiên đến phân vùng cố định, kích thước bằng với quyền sở hữu cân bằng – là một trong những bài học hoạt động hữu ích nhất từ Dynamo. Cách tiếp cận đầu tiên ưu tiên sự đơn giản của triển khai; cách tiếp cận sau đó ưu tiên sự đơn giản và dự đoán hoạt động. So sánh Dynamo với các hệ thống hiện đại System Consistency Model Use Case Dynamo Influence Cassandra Tunable (N, R, W) Time-series, analytics Direct descendant — heavily inspired by Dynamo, uses same consistent hashing and quorum concepts Riak Tunable, vector clocks Key-value store Closest faithful Dynamo implementation Amazon DynamoDB Eventually consistent by default Managed NoSQL DynamoDB is a completely different system internally, with no vector clocks and much simpler conflict resolution. Shares the name and high-level inspiration only. ⚠️ Not the same as Dynamo! Voldemort Tunable LinkedIn's data store Open-source Dynamo implementation Google Spanner Linearizable Global SQL Opposite choice to Dynamo — prioritizes CP via TrueTime clock synchronization Redis Cluster Eventually consistent Caching, sessions Uses consistent hashing; much simpler conflict resolution Cassandra Tùy chỉnh (N, R, W) Dòng thời gian, phân tích Tiếp theo trực tiếp - được truyền cảm hứng mạnh mẽ từ Dynamo, sử dụng cùng một khái niệm hashing và quorum nhất quán Riak Đồng hồ, đồng hồ vector Key-value cửa hàng Ứng dụng Dynamo gần nhất Amazon DynamoDB Cuối cùng nhất quán bởi default Quản lý NoSQL DynamoDB là một hệ thống hoàn toàn khác bên trong, không có đồng hồ vector và giải quyết xung đột đơn giản hơn nhiều. ⚠️ Not the same as Dynamo! Voldemort Tunable Cửa hàng dữ liệu LinkedIn Ứng dụng Open-Source Dynamo Google Spanner Linearizable SQL toàn cầu Lựa chọn đối lập với Dynamo — ưu tiên CP thông qua đồng hồ đồng bộ TrueTime Redis Cluster Cuối cùng nhất quán Caching, các phiên Sử dụng hashing nhất quán; giải quyết xung đột đơn giản hơn nhiều Sự nhầm lẫn của DynamoDB: Nhiều kỹ sư nhầm lẫn Amazon DynamoDB với giấy Dynamo. Chúng rất khác nhau. DynamoDB là một dịch vụ được quản lý được tối ưu hóa cho sự đơn giản hoạt động. Nó không phơi bày đồng hồ vector, không sử dụng cùng một kế hoạch phân vùng, và sử dụng một mô hình nhất quán độc quyền. : Nhiều kỹ sư nhầm lẫn Amazon DynamoDB với giấy Dynamo. Chúng rất khác nhau. DynamoDB là một dịch vụ được quản lý được tối ưu hóa cho sự đơn giản hoạt động. Nó không phơi bày đồng hồ vector, không sử dụng cùng một sơ đồ phân vùng, và sử dụng một mô hình nhất quán độc quyền. The DynamoDB confusion Những gì Dynamo không cho bạn Mỗi blog kỹ sư cao cấp nên trung thực về những hạn chế.Đây là những gì Dynamo rõ ràng giao dịch: Không giao dịch: Các hoạt động chỉ có một khóa duy nhất. Bạn không thể cập nhật nhiều khóa một cách nguyên tử. Không có chỉ mục thứ cấp: Bạn chỉ có thể tìm kiếm dữ liệu bằng khóa chính (ít nhất là trong thiết kế ban đầu). No joins: Đây là một cửa hàng giá trị khóa. Không có ngôn ngữ truy vấn. Không có sắp xếp toàn cầu: Sự kiện trên các phím khác nhau không có sắp xếp được đảm bảo. Không có khả năng tuyến tính: Ngay cả ở R = W = N, Dynamo không cung cấp đọc có thể tuyến tính. Không giải quyết xung đột tự động: Hệ thống phát hiện xung đột và hiển thị chúng cho ứng dụng. Ứng dụng phải giải quyết chúng. Nếu kỹ sư của bạn không hiểu điều này, bạn sẽ có lỗi dữ liệu tinh tế. Chi phí sửa chữa ở quy mô lớn: Quá trình chống entropy (sự hòa giải cây Merkle) không phải là miễn phí. : In high-churn write environments with many coordinators, vector clocks can grow large enough to require truncation, which introduces potential causality loss. Vector clock growth Hiểu được những hạn chế này là rất quan trọng để vận hành thành công các hệ thống theo phong cách Dynamo trong sản xuất. Ví dụ thực hành thực tế Dưới đây là một thực hiện Python tự chứa các khái niệm cốt lõi của Dynamo. Nó được đơn giản hóa một cách có chủ ý - không có mạng thực tế, không có sự kiên trì - nhưng nó mô hình trung thành cách đồng hồ vector, vòng hash nhất quán, đọc / viết quorum và phát hiện xung đột tương tác. Phần 1: Đồng hồ vector Các class là nền tảng của phiên bản theo dõi. nó chỉ là một bản đồ từ điển Hai hoạt động chính: VectorClock node_id → counter — bump our own counter when we write increment(node) — check if one clock is causally “after” another; if neither dominates, the writes were concurrent (conflict) dominates(other) from __future__ import annotations from dataclasses import dataclass, field from typing import Optional class VectorClock: """ Tracks causality across distributed writes. A clock like {"nodeA": 2, "nodeB": 1} means: - nodeA has coordinated 2 writes - nodeB has coordinated 1 write - Any version with these counters has "seen" those writes """ def __init__(self, clock: dict[str, int] | None = None): self.clock: dict[str, int] = clock.copy() if clock else {} def increment(self, node_id: str) -> "VectorClock": """Return a new clock with node_id's counter bumped by 1.""" new_clock = self.clock.copy() new_clock[node_id] = new_clock.get(node_id, 0) + 1 return VectorClock(new_clock) def dominates(self, other: "VectorClock") -> bool: """ Returns True if self is causally AFTER other. self dominates other when: - Every counter in self is >= the same counter in other, AND - At least one counter in self is strictly greater. Meaning: self has seen everything other has seen, plus more. """ all_keys = set(self.clock) | set(other.clock) at_least_one_greater = False for key in all_keys: self_val = self.clock.get(key, 0) other_val = other.clock.get(key, 0) if self_val other_val: at_least_one_greater = True return at_least_one_greater def merge(self, other: "VectorClock") -> "VectorClock": """ Merge two clocks by taking the max of each counter. Used after resolving a conflict to produce a new clock that has "seen" everything both conflicting versions saw. """ all_keys = set(self.clock) | set(other.clock) merged = {k: max(self.clock.get(k, 0), other.clock.get(k, 0)) for k in all_keys} return VectorClock(merged) def __repr__(self): return f"VectorClock({self.clock})" Phần 2: Phiên bản giá trị Mỗi giá trị được lưu trữ trong Dynamo được đóng gói với đồng hồ vector của nó. cặp này là những gì cho phép điều phối viên để so sánh các phiên bản trong khi đọc và phát hiện xung đột. @dataclass class VersionedValue: """ A value paired with its causal history (vector clock). When a client reads, they get back a VersionedValue. When they write an update, they must include the context (the vector clock they read) so Dynamo knows what version they're building on top of. """ value: object vector_clock: VectorClock def __repr__(self): return f"VersionedValue(value={self.value!r}, clock={self.vector_clock})" Phần 3: Simulated Node Trong Dynamo thực tế, mỗi nút là một quá trình riêng biệt. Ở đây chúng tôi mô phỏng chúng như các đối tượng trong bộ nhớ. Chi tiết chính: mỗi nút có địa phương riêng của nó node có thể được đánh dấu như để mô phỏng thất bại. storage down class DynamoNode: """ Simulates a single Dynamo storage node. In production this would be a separate server with disk storage. Here it's an in-memory dict so we can demo the logic without networking. """ def __init__(self, node_id: str, token: int): self.node_id = node_id self.token = token # Position on the consistent hash ring self.storage: dict[str, list[VersionedValue]] = {} self.down = False # Toggle to simulate node failures def write(self, key: str, versioned_value: VersionedValue) -> bool: """ Store a versioned value. Returns False if the node is down. We store a LIST of versions per key, because a node might hold multiple concurrent (conflicting) versions until they are resolved by the application. """ if self.down: return False # In a real node this would be written to disk (e.g. BerkeleyDB) self.storage[key] = [versioned_value] return True def read(self, key: str) -> list[VersionedValue] | None: """ Return all versions of a key. Returns None if the node is down. A healthy node with no data for the key returns an empty list. """ if self.down: return None return self.storage.get(key, []) def __repr__(self): status = "DOWN" if self.down else f"token={self.token}" return f"DynamoNode({self.node_id}, {status})" Phần 4: Hash Ring nhất quán Chúng tôi sắp xếp các nút theo token (vị trí) của chúng và sử dụng đường đi theo chiều đồng hồ để tìm danh sách điều phối viên và ưu tiên cho bất kỳ khóa nào. import hashlib class ConsistentHashRing: """ Maps any key to an ordered list of N nodes (the preference list). Nodes are placed at fixed positions (tokens) on a conceptual ring from 0 to 2^32. A key hashes to a position, then walks clockwise to find its nodes. This means adding/removing one node only rebalances ~1/N of keys, rather than reshuffling everything like modulo hashing would. """ def __init__(self, nodes: list[DynamoNode]): # Sort nodes by token so we can do clockwise lookup efficiently self.nodes = sorted(nodes, key=lambda n: n.token) def _hash(self, key: str) -> int: """Consistent hash of a key into the ring's token space.""" # Use MD5 for a simple, evenly distributed hash. # Real Dynamo uses a more sophisticated hash (e.g., SHA-1). digest = hashlib.md5(key.encode()).hexdigest() return int(digest, 16) % (2**32) def get_preference_list(self, key: str, n: int) -> list[DynamoNode]: """ Return the first N nodes clockwise from key's hash position. These are the nodes responsible for storing this key. The first node in the list is the coordinator — it receives the client request and fans out to the others. """ if not self.nodes: return [] key_hash = self._hash(key) # Find the first node whose token is >= key's hash (clockwise) start_idx = 0 for i, node in enumerate(self.nodes): if node.token >= key_hash: start_idx = i break # If key_hash is greater than all tokens, wrap around to node 0 else: start_idx = 0 # Walk clockwise, collecting N unique nodes preference_list = [] for i in range(len(self.nodes)): idx = (start_idx + i) % len(self.nodes) preference_list.append(self.nodes[idx]) if len(preference_list) == n: break return preference_list Phần 5: Dynamo Coordinator Đây là cốt lõi của hệ thống - logic xử lý yêu cầu của khách hàng, người hâm mộ cho các bản sao, chờ đợi thẩm quyền, và phát hiện xung đột. class SimplifiedDynamo: """ Coordinates reads and writes across a cluster of DynamoNodes. Any node can act as coordinator for any request — there's no dedicated master. The coordinator is simply whichever node receives the client request (or the first node in the preference list, if using partition-aware routing). Configuration: N = total replicas per key R = minimum nodes that must respond to a read (read quorum) W = minimum nodes that must acknowledge a write (write quorum) """ def __init__(self, nodes: list[DynamoNode], N: int = 3, R: int = 2, W: int = 2): self.N = N self.R = R self.W = W self.ring = ConsistentHashRing(nodes) # ------------------------------------------------------------------ # # WRITE # # ------------------------------------------------------------------ # def put(self, key: str, value: object, context: VectorClock | None = None) -> VectorClock: """ Write a key-value pair to N replicas, wait for W ACKs. The 'context' is the vector clock from a previous read. Always pass context when updating an existing key — it tells Dynamo which version you're building on top of, so it can detect whether your write is concurrent with anything else. Returns the new vector clock, which the caller should store and pass back on future writes to this key. Raises: RuntimeError if fewer than W nodes acknowledged. """ preference_list = self.ring.get_preference_list(key, self.N) if not preference_list: raise RuntimeError("No nodes available") # The coordinator is always the first node in the preference list. coordinator = preference_list[0] # Increment the coordinator's counter in the vector clock. # If no context was provided (brand new key), start a fresh clock. base_clock = context if context is not None else VectorClock() new_clock = base_clock.increment(coordinator.node_id) versioned = VersionedValue(value=value, vector_clock=new_clock) # Fan out to all N replicas. # In a real system these would be concurrent RPC calls. # Here we call them sequentially for simplicity. ack_count = 0 for node in preference_list: success = node.write(key, versioned) if success: ack_count += 1 # Only need W ACKs to declare success. # The remaining replicas are updated asynchronously (or via # hinted handoff if they were down). if ack_count < self.W: raise RuntimeError( f"Write quorum not met: got {ack_count} ACKs, needed {self.W}" ) print(f"[PUT] key={key!r} value={value!r} clock={new_clock} " f"({ack_count}/{self.N} nodes wrote)") return new_clock # ------------------------------------------------------------------ # # READ # # ------------------------------------------------------------------ # def get(self, key: str) -> list[VersionedValue]: """ Read a key from N replicas, wait for R responses, reconcile. Returns a LIST of VersionedValues: - Length 1 → clean read, no conflict - Length >1 → concurrent versions detected; application must merge After reading, the caller should: 1. If no conflict: use the single value normally. 2. If conflict: merge the values using application logic, then call put() with the merged value and the merged vector clock as context. This "closes" the conflict. Read repair happens in the background: any replica that returned a stale version is silently updated with the latest version. """ preference_list = self.ring.get_preference_list(key, self.N) # Collect responses from all N nodes all_versions: list[VersionedValue] = [] responding_nodes: list[tuple[DynamoNode, list[VersionedValue]]] = [] for node in preference_list: result = node.read(key) if result is not None: # None means the node is down all_versions.extend(result) responding_nodes.append((node, result)) if len(responding_nodes) < self.R: raise RuntimeError( f"Read quorum not met: got {len(responding_nodes)} responses, needed {self.R}" ) # Reconcile: discard any version that is strictly dominated # (i.e., is a causal ancestor of) another version. # What remains is the set of concurrent versions. reconciled = self._reconcile(all_versions) # Background read repair: if any node returned something older # than the reconciled result, send it the latest version. # (Simplified: only meaningful when there's a single winner.) if len(reconciled) == 1: latest = reconciled[0] for node, versions in responding_nodes: if not versions or versions[0].vector_clock != latest.vector_clock: node.write(key, latest) # Repair silently in background status = "clean" if len(reconciled) == 1 else f"CONFLICT ({len(reconciled)} versions)" print(f"[GET] key={key!r} status={status} " f"({len(responding_nodes)}/{self.N} nodes responded)") return reconciled # ------------------------------------------------------------------ # # INTERNAL: VERSION RECONCILIATION # # ------------------------------------------------------------------ # def _reconcile(self, versions: list[VersionedValue]) -> list[VersionedValue]: """ Remove any version that is a causal ancestor of another version. If version A's clock is dominated by version B's clock, then B is strictly newer — A adds no new information and can be dropped. Whatever remains after pruning are CONCURRENT versions: writes that happened without either "knowing about" the other. The application must merge these using domain-specific logic. Example: versions = [clock={A:1}, clock={A:2}, clock={B:1}] {A:2} dominates {A:1} → drop {A:1} {A:2} and {B:1} are concurrent → both survive result = [{A:2}, {B:1}] ← conflict! application must merge """ dominated = set() for i, v1 in enumerate(versions): for j, v2 in enumerate(versions): if i != j and v2.vector_clock.dominates(v1.vector_clock): dominated.add(i) # v1 is an ancestor of v2, discard v1 survivors = [v for i, v in enumerate(versions) if i not in dominated] # De-duplicate: identical clocks from different replicas are the same version seen_clocks: list[VectorClock] = [] unique: list[VersionedValue] = [] for v in survivors: if not any(v.vector_clock.clock == s.clock for s in seen_clocks): unique.append(v) seen_clocks.append(v.vector_clock) return unique if unique else versions Part 6: Putting It All Together — A Demo Chúng ta hãy chạy qua một kịch bản hoàn chỉnh: viết / đọc bình thường, sau đó là một cuộc xung đột mô phỏng nơi hai nút khác nhau và ứng dụng phải hợp nhất chúng. def demo(): # ── Setup ────────────────────────────────────────────────────────── # # Five nodes placed at evenly spaced positions on the hash ring. # In a real cluster these would span multiple datacenters. nodes = [ DynamoNode("node-A", token=100), DynamoNode("node-B", token=300), DynamoNode("node-C", token=500), DynamoNode("node-D", token=700), DynamoNode("node-E", token=900), ] dynamo = SimplifiedDynamo(nodes, N=3, R=2, W=2) print("=" * 55) print("SCENARIO 1: Normal write and read (no conflict)") print("=" * 55) # Write the initial shopping cart ctx = dynamo.put("cart:user-42", {"items": ["shoes"]}) # Read it back — should be a clean single version versions = dynamo.get("cart:user-42") print(f"Read result: {versions[0].value}\n") # Update the cart, passing the context from our earlier read. # The context tells Dynamo "this write builds on top of clock ctx". ctx = dynamo.put("cart:user-42", {"items": ["shoes", "jacket"]}, context=ctx) versions = dynamo.get("cart:user-42") print(f"After update: {versions[0].value}\n") print("=" * 55) print("SCENARIO 2: Simulated conflict — two concurrent writes") print("=" * 55) # Write the base version base_ctx = dynamo.put("cart:user-99", {"items": ["hat"]}) # Now simulate a network partition: # node-A and node-B can't talk to each other. # We model this by writing directly to individual nodes. pref_list = dynamo.ring.get_preference_list("cart:user-99", 3) node_1, node_2, node_3 = pref_list[0], pref_list[1], pref_list[2] # Write 1: customer adds "scarf" via node_1 (e.g., their laptop) clock_1 = base_ctx.increment(node_1.node_id) node_1.write("cart:user-99", VersionedValue({"items": ["hat", "scarf"]}, clock_1)) # Write 2: customer adds "gloves" via node_2 (e.g., their phone) # This write also descends from base_ctx, not from clock_1. # Neither write knows about the other → they are concurrent. clock_2 = base_ctx.increment(node_2.node_id) node_2.write("cart:user-99", VersionedValue({"items": ["hat", "gloves"]}, clock_2)) # Read — coordinator sees two concurrent versions and surfaces the conflict versions = dynamo.get("cart:user-99") if len(versions) > 1: print(f"\nConflict detected! {len(versions)} concurrent versions:") for i, v in enumerate(versions): print(f" Version {i+1}: {v.value} clock={v.vector_clock}") # Application-level resolution: union merge (Amazon's shopping cart strategy) # Merge items: take the union so no addition is lost all_items = set() merged_clock = versions[0].vector_clock for v in versions: all_items.update(v.value["items"]) merged_clock = merged_clock.merge(v.vector_clock) merged_value = {"items": sorted(all_items)} print(f"\nMerged result: {merged_value}") # Write the resolved version back with the merged clock as context. # This "closes" the conflict — future reads will see a single version. final_ctx = dynamo.put("cart:user-99", merged_value, context=merged_clock) versions = dynamo.get("cart:user-99") print(f"\nAfter resolution: {versions[0].value}") assert len(versions) == 1, "Should be a single version after merge" if __name__ == "__main__": demo() Expected output: ======================================================= SCENARIO 1: Normal write and read (no conflict) ======================================================= [PUT] key='cart:user-42' value={'items': ['shoes']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote) [GET] key='cart:user-42' status=clean (3/3 nodes responded) Read result: {'items': ['shoes']} [PUT] key='cart:user-42' value={'items': ['shoes', 'jacket']} clock=VectorClock({'node-A': 2}) (3/3 nodes wrote) [GET] key='cart:user-42' status=clean (3/3 nodes responded) After update: {'items': ['shoes', 'jacket']} ======================================================= SCENARIO 2: Simulated conflict — two concurrent writes ======================================================= [PUT] key='cart:user-99' value={'items': ['hat']} clock=VectorClock({'node-A': 1}) (3/3 nodes wrote) [GET] key='cart:user-99' status=CONFLICT (2 versions) (3/3 nodes responded) Conflict detected! 2 concurrent versions: Version 1: {'items': ['hat', 'scarf']} clock=VectorClock({'node-A': 2}) Version 2: {'items': ['hat', 'gloves']} clock=VectorClock({'node-A': 1, 'node-B': 1}) Merged result: {'items': ['gloves', 'hat', 'scarf']} [PUT] key='cart:user-99' value={'items': ['gloves', 'hat', 'scarf']} ... (3/3 nodes wrote) [GET] key='cart:user-99' status=clean (3/3 nodes responded) After resolution: {'items': ['gloves', 'hat', 'scarf']} Trong Kịch bản 2, điều phối viên xác định chính xác rằng và không bằng hoặc trong mối quan hệ thống trị - cũng không phải là tổ tiên của nhau - vì vậy cả hai đều xuất hiện đồng thời. Ứng dụng sau đó chịu trách nhiệm hợp nhất chúng và viết lại một phiên bản giải quyết với đồng hồ hợp nhất. What to notice: {'node-A': 2} {'node-A': 1, 'node-B': 1} Những bài học quan trọng về thiết kế hệ thống Sau nhiều năm làm việc với các hệ thống lấy cảm hứng từ Dynamo, đây là những ý tưởng chính của tôi: 1. Always-On Beats Strongly-Consistent Đối với các ứng dụng hướng tới người dùng, khả năng sẵn có hầu như luôn luôn chiến thắng. Người dùng sẽ chịu đựng thấy dữ liệu bị trì trệ một chút. Họ sẽ không chịu đựng “Dịch vụ không có sẵn”. 2. Application-Level Reconciliation is Powerful Đừng sợ đẩy giải quyết xung đột vào ứng dụng. Ứng dụng hiểu logic kinh doanh và có thể đưa ra quyết định thông minh hơn cơ sở dữ liệu bao giờ có thể. 3. Tunable Consistency is Essential Thêm giỏ hàng cần có sẵn cao (W = 1). giao dịch tài chính cần đảm bảo mạnh hơn (W = N). khả năng điều chỉnh mỗi giao dịch này là vô cùng có giá trị. 4. The 99.9th Percentile Matters More Than Average Tập trung các nỗ lực tối ưu hóa của bạn vào sự chậm trễ đuôi.Đó là những gì người dùng thực sự trải nghiệm trong thời gian cao điểm. 5. Gossip Protocols Scale Beautifully Sự phối hợp phi tập trung thông qua tin đồn loại bỏ các điểm thất bại duy nhất và quy mô đến hàng ngàn nút. Khi nào không nên sử dụng Dynamo-Style Systems Hãy trung thực về sự thỏa hiệp.Đừng sử dụng phương pháp này khi: Cần có sự nhất quán mạnh mẽ (hoạt động tài chính, quản lý hàng tồn kho) Các truy vấn phức tạp là cần thiết (báo cáo, phân tích, liên kết) Giao dịch bao gồm nhiều mục (Dynamo chỉ hoạt động với một khóa duy nhất) Nhóm của bạn không thể xử lý sự nhất quán cuối cùng (nếu các nhà phát triển không hiểu đồng hồ vector và giải quyết xung đột, bạn sẽ có vấn đề) Kết luận Dynamo đại diện cho một sự thay đổi cơ bản trong cách chúng ta nghĩ về các hệ thống phân tán.Bằng cách chấp nhận sự nhất quán cuối cùng và cung cấp các thỏa hiệp có thể điều chỉnh, nó cho phép xây dựng các hệ thống có quy mô lớn trong khi duy trì độ sẵn có cao. The paper’s lessons have influenced an entire generation of distributed databases. Whether you’re using Cassandra, Riak, or DynamoDB, you’re benefiting from the insights first published in this paper. Là kỹ sư, công việc của chúng tôi là hiểu sâu sắc những thỏa hiệp này và áp dụng chúng một cách thích hợp. Dynamo cung cấp cho chúng tôi một công cụ mạnh mẽ, nhưng giống như bất kỳ công cụ nào, nó chỉ tốt như sự hiểu biết của chúng tôi về khi nào và làm thế nào để sử dụng nó. Đọc thêm Original Dynamo Paper: SOSP 2007 Blog của Werner Vogels: All Things Distributed Cassandra Documentation: Hiểu cách thực hiện các khái niệm này “Thiết kế các ứng dụng dữ liệu” của Martin Kleppmann – Chương 5 về Replication Phụ lục: Các vấn đề thiết kế và phương pháp tiếp cận Ba vấn đề mở xuất hiện trong các cuộc phỏng vấn thiết kế hệ thống và công việc kỹ thuật thực tế. Vấn đề 1: Giải quyết xung đột cho trình soạn thảo tài liệu hợp tác : Bạn đang xây dựng một cái gì đó giống như Google Docs được hỗ trợ bởi một cửa hàng theo phong cách Dynamo. Hai người dùng chỉnh sửa cùng một đoạn văn cùng một lúc. The problem Chiến lược giỏ hàng (hiệp hội của tất cả các mặt hàng) chỉ an toàn vì việc thêm các mặt hàng là chuyển đổi — Nếu Người dùng A xóa một câu và Người dùng B chỉnh sửa giữa nó, sự kết hợp của những thay đổi của họ là vô nghĩa hoặc mâu thuẫn. Why shopping cart union doesn’t work here {A} ∪ {B} = {B} ∪ {A} The right approach: Operational Transformation (OT) or CRDTs Giải pháp công nghiệp là đại diện cho tài liệu không phải là một khối văn bản, mà là một chuỗi các hoạt động, và chuyển đổi các hoạt động đồng thời để cả hai có thể được áp dụng mà không xung đột: User A's operation: delete(position=50, length=20) User B's operation: insert(position=60, text="new sentence") Without OT: B's insert position (60) is now wrong because A deleted 20 chars. With OT: Transform B's operation against A's: B's insert position shifts to 40 (60 - 20). Both operations now apply cleanly. Chiến lược giải quyết xung đột cho lớp Dynamo sẽ là: Lưu các hoạt động (không phải ảnh chụp toàn bộ tài liệu) làm giá trị cho mỗi phím. Trên xung đột, thu thập tất cả các danh sách hoạt động đồng thời từ mỗi phiên bản. Áp dụng OT để hợp nhất chúng thành một nhật ký hoạt động nhất quán duy nhất. Viết lại nhật ký sáp nhập với đồng hồ vector sáp nhập làm ngữ cảnh. : Nhật ký hoạt động cho từng phân đoạn tài liệu, không phải là văn bản được hiển thị. Điều này làm cho việc sáp nhập là xác định và không mất mát. What to store in Dynamo Các lớp lưu trữ của chúng sử dụng OT hoặc một biến thể của CRDTs (Conflict-free Replicated Data Types), đó là các cấu trúc dữ liệu được đảm bảo toán học để hợp nhất mà không có xung đột bất kể thứ tự hoạt động. Real-world reference Vấn đề 2: Chọn N, R, W cho các trường hợp sử dụng khác nhau : Bạn sẽ chọn cấu hình nào cho (a) cửa hàng phiên, (b) danh mục sản phẩm, (c) hồ sơ người dùng? The problem Cách chính xác để suy nghĩ về điều này: xác định chế độ thất bại có chi phí cao hơn - viết bị bỏ lỡ (mất dữ liệu) hoặc viết bị từ chối (không có sẵn). Session store — prioritize availability Các phiên là tạm thời và cụ thể cho người dùng.Nếu phiên của người dùng bị ngừng hoạt động hoặc bị mất một thời gian ngắn, họ sẽ được đăng xuất và đăng nhập lại.Đó là khó chịu nhưng không phải là thảm họa.Bạn không bao giờ muốn từ chối viết phiên. N=3, R=1, W=1 Rationale: - W=1: Accept session writes even during heavy failures. A user can't log in if their session write is rejected. - R=1: Read from any single node. Stale session data is harmless. - N=3: Still replicate to 3 nodes for basic durability. Trade-off accepted: Stale session reads are possible but inconsequential. Product catalog — prioritize read performance and consistency Dữ liệu sản phẩm hiếm khi được viết (bởi các nhóm ops) nhưng đọc hàng triệu lần mỗi ngày. giá cố định hoặc mô tả là vấn đề. N=3, R=2, W=3 Rationale: - W=3: All replicas must confirm a catalog update before it's live. A price change half-published is worse than a brief write delay. - R=2: Read quorum overlap with W=3 guarantees fresh data. Acceptable: catalog writes are rare, so write latency doesn't matter. - N=3: Standard replication for durability. Trade-off accepted: Writes are slow and fail if any node is down. Acceptable because catalog updates are infrequent. User profiles — balanced Dữ liệu hồ sơ (tên, email, sở thích) rất quan trọng. Một hồ sơ tạm thời là khó chịu nhưng không nguy hiểm. Một bản cập nhật bị từ chối (ví dụ, người dùng không thể cập nhật email của họ) là một vấn đề thực sự. N=3, R=2, W=2 Rationale: - The classic balanced configuration. - R + W = 4 > N = 3, so quorums overlap: reads will see the latest write. - Tolerates 1 node failure for both reads and writes. - Appropriate for data that matters but doesn't require strict consistency. Trade-off accepted: A second simultaneous node failure will cause errors. Acceptable for non-critical user data. Decision framework summary: Priority R W When to use Max availability 1 1 Sessions, ephemeral state, click tracking Balanced 2 2 User profiles, preferences, soft state Consistent reads 2 3 Catalogs, config, rarely-written reference data Highest consistency 3 3 Anywhere you need R+W > N with zero tolerance for stale reads (still not linearizable) Max khả dụng 1 1 Sessions, trạng thái ephemeral, click tracking cân bằng 2 2 Hồ sơ người dùng, sở thích, trạng thái mềm Đọc liên tục 2 3 Catalogs, config, dữ liệu tham chiếu hiếm khi được viết cao nhất consistency 3 3 Bất cứ nơi nào bạn cần R+W > N với không dung nạp cho đọc không ổn định (vẫn không được tuyến tính hóa) Vấn đề 3: Thử nghiệm một hệ thống Dynamo-Style theo các kịch bản phân vùng Làm thế nào để bạn xác minh rằng hệ thống của bạn thực sự cư xử đúng khi các nút thất bại và phân vùng xảy ra? The problem Đây là một trong những vấn đề khó khăn nhất trong thử nghiệm hệ thống phân tán bởi vì các lỗi chỉ xuất hiện trong các kết nối cụ thể của các sự kiện đồng thời mà rất khó để tái tạo một cách xác định. Layer 1: Unit tests for the logic in isolation Trước khi kiểm tra hành vi phân tán, hãy kiểm tra các khối xây dựng một cách độc lập.Logic so sánh đồng hồ vector, phát hiện xung đột và các chức năng hòa giải đều có thể được kiểm tra với các thử nghiệm đơn vị thuần túy - không cần kết nối mạng. def test_concurrent_clocks_detected_as_conflict(): clock_a = VectorClock({"node-A": 2}) clock_b = VectorClock({"node-B": 2}) assert not clock_a.dominates(clock_b) assert not clock_b.dominates(clock_a) # Both survive reconciliation → conflict correctly detected def test_ancestor_clock_is_discarded(): old_clock = VectorClock({"node-A": 1}) new_clock = VectorClock({"node-A": 3}) assert new_clock.dominates(old_clock) # old_clock should be pruned during reconciliation Layer 2: Deterministic fault injection Thay vì hy vọng những thất bại xảy ra theo thứ tự đúng trong quá trình kiểm tra tải, hãy tiêm chúng một cách có chủ ý và lặp đi lặp lại. là một phiên bản đơn giản của điều này. Trong các hệ thống sản xuất, thư viện như hoặc Làm điều này ở cấp độ cơ sở hạ tầng. node.down = True Jepsen Khỉ hỗn loạn Các kịch bản chính để kiểm tra: Scenario A: Write succeeds with W=2, third replica is down. → Verify: the data is readable after the down node recovers. → Verify: no data loss occurred. Scenario B: Two nodes accept concurrent writes to the same key. → Verify: the next read surfaces exactly 2 conflicting versions. → Verify: after the application writes a merged version, the next read is clean. Scenario C: Node goes down mid-write (wrote to W-1 nodes). → Verify: the write is correctly rejected (RuntimeError). → Verify: no partial writes are visible to readers. Scenario D: All N nodes recover after a full partition. → Verify: no data was lost across the cluster. → Verify: vector clocks are still meaningful (no spurious conflicts). Layer 3: Property-based testing Thay vì viết các trường hợp thử nghiệm riêng lẻ, xác định phải luôn luôn giữ và tạo ra hàng ngàn chuỗi hoạt động ngẫu nhiên để cố gắng vi phạm chúng: bất biến # Invariant: after any sequence of writes and merges, a final get() # should always return exactly one version (no unresolved conflicts). # Invariant: a value written with a context derived from a previous read # should never produce a conflict with that read's version # (it should dominate it). # Invariant: if R + W > N, a value written successfully should always # be visible in the next read (read-your-writes, absent concurrent writes). Công cụ như (Python) cho phép bạn thể hiện những bất biến này và tự động tìm thấy các đối chứng. Giả thuyết Layer 4: Linearizability checkers Để đảm bảo độ tin cậy cao nhất, ghi lại thời gian bắt đầu, thời gian kết thúc và kết quả của mỗi hoạt động trong quá trình thử nghiệm tiêm lỗi, sau đó cung cấp lịch sử cho một bộ kiểm tra tuyến tính như: Nó sẽ cho bạn biết liệu bất kỳ lịch sử nào được quan sát là phù hợp với việc thực hiện theo thứ tự chính xác - ngay cả đối với một hệ thống cuối cùng nhất quán hoạt động trong phạm vi bảo đảm được tuyên bố của nó. Knossos Viết từ các khe cắm của các hệ thống phân tán. những hiểu biết được thử nghiệm trong trận chiến, không có sóng tay. Notebook liên kết Notebook liên kết