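Elasticsearch is an open-source search and analytics engine based on Apache Lucene. When you build applications on change data capture (CDC) data with Elasticsearch, you need to design the system to handle frequent updates or modifications to documents that already exist in an index.

In this blog post, we walk through the options available for updates in Elasticsearch, including full updates, partial updates, and scripted updates. We also discuss what happens inside Elasticsearch when a document is modified, and how frequent updates affect the system's CPU utilization.

Example application with frequent updates

To better understand the use case for frequent updates, consider a search application for a video streaming service such as Netflix. When a user searches for a show (for example, "political thriller"), the system returns a set of relevant results based on keywords and other metadata.

Here is an example Elasticsearch document for the TV series "House of Cards":

    {
      "name": "House of Cards",
      "description": "Frank Underwood is a Democrat appointed as the Secretary of State. Along with his wife, he sets out on a quest to seek revenge from the people who betrayed him while successfully rising to supremacy.",
      "genres": ["drama", "thriller"],
      "views": 100
    }

Search can be configured in Elasticsearch to use name and description as full-text search fields. The views field stores the number of views per title and can be used to boost content so that more popular shows rank higher. The views field is incremented every time a user watches an episode of a show or a movie.

At the scale of an application like Netflix, this search configuration can easily produce more than a million updates per minute, based on the Netflix Engagement Report. According to the report, users watched about 100 billion hours of content from January through July. Assuming an average watch time of 15 minutes per episode or movie, that is about 400 billion views over roughly 305,000 minutes, or about 1.3 million views per minute on average. With the search configuration above, every view triggers an update, so the update load is on the order of millions of writes per minute.

Many search and analytics applications experience frequent updates, especially those built on CDC data.

Performing updates in Elasticsearch

Let's look at a general example of how to perform an update in Elasticsearch:

    from elasticsearch import Elasticsearch

    # Connect to your Elasticsearch instance
    # (a URL with an explicit scheme works with both 7.x and 8.x clients)
    es = Elasticsearch("http://localhost:9200")

    # Index name and document ID you want to update
    index_name = 'movies'
    document_id = 'your_document_id'

    # Retrieve the current document to get the current 'views' value
    try:
        current_doc = es.get(index=index_name, id=document_id)
        current_views = current_doc['_source']['views']
    except Exception as e:
        print(f"Error retrieving current document: {e}")
        current_views = 0  # Set a default value if there's an error

    # Define the update body to increment 'views' by 1
    # Note: this read-then-write sequence is not atomic; a concurrent
    # update between the get and the update would be overwritten.
    update_body = {
        "doc": {
            "views": current_views + 1  # Increment 'views' by 1
        }
    }

    # Perform the update
    try:
        es.update(index=index_name, id=document_id, body=update_body)
        print("Document updated successfully!")
    except Exception as e:
        print(f"Error updating document: {e}")

Full updates versus partial updates in Elasticsearch

When performing an update in Elasticsearch, you can use the index API to replace an existing document, or the update API to make a partial update to a document.

With the index API, you retrieve the entire document, make changes to it, and then reindex it. With the update API, you send only the fields you want to modify instead of the entire document. The document is still reindexed, but the amount of data sent over the network is minimized. The update API is especially useful when documents are large and sending the whole document over the network would be time-consuming.

Let's see how the index API and the update API work using Python code.

Full updates using the index API in Elasticsearch

    from elasticsearch import Elasticsearch

    # Connect to Elasticsearch
    es = Elasticsearch("http://localhost:9200")

    # Index name and document ID
    index_name = "your_index"
    document_id = "1"

    # Retrieve the existing document
    existing_document = es.get(index=index_name, id=document_id)

    # Make your changes to the document
    existing_document["_source"]["field1"] = "new_value1"
    existing_document["_source"]["field2"] = "new_value2"

    # Call the index API to perform the full update
    es.index(index=index_name, id=document_id, body=existing_document["_source"])

As the code above shows, a full update through the index API requires two separate calls to Elasticsearch, which can reduce performance and increase load on the cluster.

Partial updates using the update API in Elasticsearch

A partial update still retrieves and reindexes the document internally, but it requires only one network call from the client, which improves performance.

    from elasticsearch import Elasticsearch

    # Connect to Elasticsearch
    es = Elasticsearch("http://localhost:9200")

    # Index name and document ID
    index_name = "your_index"
    document_id = "1"

    # Specify the fields to be updated
    update_fields = {
        "field1": "new_value1",
        "field2": "new_value2"
    }

    # Use the update API to perform a partial update
    es.update(index=index_name, id=document_id, body={"doc": update_fields})

You can use the update API to set the view count, but a plain partial update cannot increment the view count based on its previous value, because the new value depends on the old one, which the client does not have.

Let's see how to solve this with Painless, a powerful scripting language.

Partial updates using Painless scripts in Elasticsearch

Painless is a scripting language designed for Elasticsearch. It can be used for query and aggregation calculations, complex conditions, data transformations, and more. Painless also supports scripts in update requests, so documents can be modified based on complex logic.

In the example below, we use a Painless script to perform the update in a single API call and increment the view count based on its previous value.

    from elasticsearch import Elasticsearch

    # Connect to your Elasticsearch instance
    es = Elasticsearch("http://localhost:9200")

    # Index name and document ID you want to update
    index_name = 'movies'
    document_id = 'your_document_id'

    # Define the Painless script for the update
    update_script = {
        "script": {
            "lang": "painless",
            "source": "ctx._source.views += 1"  # Increment 'views' by 1
        }
    }

    # Perform the update using the Painless script
    try:
        es.update(index=index_name, id=document_id, body=update_script)
        print("Document updated successfully!")
    except Exception as e:
        print(f"Error updating document: {e}")

The Painless script is easy to follow: it simply increments the document's view count by 1.

Updating nested objects in Elasticsearch

Nested objects in Elasticsearch are a data structure that allows an array of objects to be indexed as separate documents within a single parent document. They are useful for complex data that naturally forms a nested structure, such as objects within objects. In a typical Elasticsearch document, arrays of objects are flattened, but with the nested data type each object in the array can be indexed and queried independently.

Painless scripts can also be used to update nested objects in Elasticsearch.

    from elasticsearch import Elasticsearch

    # Connect to your Elasticsearch instance
    es = Elasticsearch("http://localhost:9200")

    # Index name and document ID for the example
    index_name = 'your_index'
    document_id = 'your_document_id'

    # Specify the nested field and the updated value
    # (nested_field_name in the script below is a placeholder for this field)
    nested_field = "nested_field_name"
    updated_value = "new_value"

    # Define the Painless script for the update
    update_script = {
        "script": {
            "lang": "painless",
            "source": "ctx._source.nested_field_name = params.updated_value",
            "params": {
                "updated_value": updated_value
            }
        }
    }

    # Perform the update using the Update API and the Painless script
    try:
        es.update(index=index_name, id=document_id, body=update_script)
        print("Nested object updated successfully!")
    except Exception as e:
        print(f"Error updating nested object: {e}")

Adding a new field in Elasticsearch

A new field can be added to a document in Elasticsearch through an index operation.

You can use the update API to partially update an existing document with the new field. When dynamic mapping is enabled on the index, introducing a new field is straightforward: index a document that contains the field, and Elasticsearch infers a suitable mapping and adds the new field to it.

If dynamic mapping is disabled on the index, you need to use the update mapping API. The example below updates the index mapping by adding a "category" field to the movies index.

    PUT /movies/_mapping
    {
      "properties": {
        "category": {
          "type": "keyword"
        }
      }
    }

Updates inside Elasticsearch

Although the code is simple, Elasticsearch does a lot of heavy lifting internally to perform these updates, because data is stored in immutable segments. As a result, Elasticsearch cannot simply update a document in place. The only way to perform an update is to reindex the entire document, regardless of which API is used.

Elasticsearch uses Apache Lucene under the hood. A Lucene index consists of one or more segments. A segment is a self-contained, immutable index structure that represents a subset of the overall index. When documents are added or updated, new Lucene segments are created and the old versions of documents are marked as soft deleted. Over time, as new documents are added and existing ones are updated, multiple segments accumulate. To optimize the index structure, Lucene periodically merges smaller segments into larger ones.

Updates are essentially inserts in Elasticsearch

Because every update operation is a reindex operation, all updates are essentially inserts with soft deletes.

Treating updates as inserts has cost implications. Soft-deleted data is retained for some time, which bloats the index's storage and memory footprint. Soft deletes, reindexing, and garbage collection also take a heavy toll on CPU, and repeating these operations on every replica compounds the impact.

Updates get trickier as your product grows and your data changes over time. To keep Elasticsearch performant, you may need to change the shards, analyzers, and tokenizers in the cluster, which requires reindexing the entire cluster. For a production application, that means setting up a new cluster and migrating all of the data. Migrating clusters is time-consuming and error-prone, so it is not an operation to take lightly.

Updates in Elasticsearch

The simplicity of update operations in Elasticsearch can mask the heavy operational work happening underneath. Elasticsearch treats every update as an insert, which requires the entire document to be recreated and reindexed. For applications with frequent updates this quickly becomes expensive, as in the Netflix example, where millions of updates happen every minute. We recommend either batching updates with the Bulk API, which adds latency to your workload, or looking at alternative solutions when facing frequent updates in Elasticsearch.

Rockset, a search and analytics database built in the cloud, is a mutable alternative to Elasticsearch. Rockset is built on RocksDB, a key-value store popular for its mutability, so it can update documents in place. As a result, only the value of an individual field is updated and reindexed rather than the entire document.

If you would like to compare the performance of Elasticsearch and Rockset for update-heavy workloads, you can start a free trial of Rockset with $300 in credits.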
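The batching recommendation above can be sketched as follows. This is a minimal illustration, not the article's own implementation: the function names build_view_increment_actions and send_actions, the document IDs, and the idea of collapsing buffered view events with a Counter are all assumptions made for the example. It assumes the official elasticsearch Python client, whose helpers.bulk function accepts action dictionaries with an "_op_type" of "update".

```python
from collections import Counter


def build_view_increment_actions(view_events, index_name="movies"):
    """Collapse buffered view events into one scripted update per document."""
    counts = Counter(view_events)  # document ID -> number of views in this batch
    return [
        {
            "_op_type": "update",  # bulk action type
            "_index": index_name,
            "_id": doc_id,
            "script": {
                "lang": "painless",
                # Add the whole batch's count in a single reindex of the document
                "source": "ctx._source.views += params.count",
                "params": {"count": count},
            },
        }
        for doc_id, count in sorted(counts.items())
    ]


def send_actions(es, actions):
    """Send all actions in one Bulk API request (needs a running cluster)."""
    # Imported lazily so the action builder can be used without the client installed
    from elasticsearch import helpers
    return helpers.bulk(es, actions)


# Three buffered view events for two hypothetical document IDs
actions = build_view_increment_actions(
    ["house-of-cards", "the-crown", "house-of-cards"]
)
print(len(actions))  # two documents to update instead of three separate requests
```

Aggregating before sending means a title viewed many times within the buffering window is reindexed once rather than once per view. The cost is the one the article names: view counts lag behind by up to the length of the window.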


This writer has a vested interest be it monetary, business, or otherwise, with 1 or more of the products or companies mentioned within.


A Simple Guide to Updating Documents in Elasticsearch
