= Elasticsearch Search Engine =

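== Introduction ==
'''Elasticsearch''' is an open-source, distributed, RESTful search engine built on [[Apache Lucene]]. It is designed for large data volumes, supports near-real-time search and analysis, and is widely used for log analytics, full-text search, and business intelligence. Elasticsearch is not itself a Hadoop component, but it is often deployed alongside Hadoop ecosystem tools: through connectors such as Elasticsearch for Apache Hadoop (ES-Hadoop) it can exchange data with Hadoop, HDFS, and Spark and provide indexing and query capabilities over that data.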

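The core features of Elasticsearch include:
* '''Distributed architecture''': data is automatically split into shards (sharding) and copied to replicas (replication), which allows horizontal scaling.
* '''Near-real-time search''': newly indexed data becomes searchable after the next index refresh, which by default happens about once per second.
* '''RESTful API''': clients interact over HTTP, with requests and responses in JSON.
* '''Multi-tenancy''': indices can be used to isolate the data of different users or applications.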

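== Core Concepts ==
=== Index ===
An index is the logical container in which Elasticsearch stores data, loosely comparable to a table in a relational database. Each index holds many documents and can define its own mapping and settings.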

=== Document ===
A document is the basic unit of data in Elasticsearch and is stored as JSON. Each document belongs to one index and has an ID that is unique within that index.

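=== Shards and Replicas ===
Elasticsearch uses shards to spread an index's data across multiple nodes, and replicas to provide redundancy and high availability:
* Primary shard: holds the original copy of one portion of the index; every document belongs to exactly one primary shard.
* Replica shard: a copy of a primary shard, allocated to a different node than its primary. It can take over if the primary is lost and can also serve search requests.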

<mermaid>
graph TD
    A[Index] --> B[Primary Shard 1]
    A --> C[Primary Shard 2]
    B --> D[Replica Shard 1]
    C --> E[Replica Shard 2]
</mermaid>
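
As a rough illustration of how a document ends up on a particular primary shard, the sketch below follows the routing rule described in the Elasticsearch documentation, <code>shard = hash(routing) % number_of_primary_shards</code>, where the routing value defaults to the document ID. Elasticsearch uses a Murmur3 hash internally; <code>zlib.crc32</code> is only a stand-in here, so the shard numbers will not match a real cluster. The function name <code>pick_shard</code> is hypothetical.

<syntaxhighlight lang="python">
import zlib


def pick_shard(routing: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document ID) to a primary shard number."""
    # Stand-in hash: Elasticsearch itself uses Murmur3, not CRC32.
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


doc_ids = [str(i) for i in range(1, 11)]
placement = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
for doc_id, shard in placement.items():
    print(f"document {doc_id} -> primary shard {shard}")
</syntaxhighlight>

Because the result depends on the number of primary shards, changing that number would send existing IDs to different shards. This is one reason the primary shard count is chosen when the index is created.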

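=== Mapping ===
A mapping defines the type of each field in a document and its attributes, such as whether the field is analyzed (tokenized) and whether it is stored. For example: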
<syntaxhighlight lang="json">
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "year": { "type": "integer" }
    }
  }
}
</syntaxhighlight>
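
To make "analyzed" more concrete: a <code>text</code> field such as <code>title</code> is broken into terms, and each term is recorded in an inverted index that points back to the documents containing it. The sketch below is a simplified model of that idea, not Elasticsearch's implementation. Its analyzer (lowercase, then split on non-alphanumeric characters) only approximates the standard analyzer, and the names <code>analyze</code> and <code>build_inverted_index</code> are hypothetical.

<syntaxhighlight lang="python">
import re
from collections import defaultdict


def analyze(text: str) -> list[str]:
    """Very rough stand-in for an analyzer: lowercase, then split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document IDs whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


titles = {"1": "The Godfather", "2": "The Godfather Part II"}
index = build_inverted_index(titles)
print(sorted(index["godfather"]))  # ['1', '2']
print(sorted(index["ii"]))         # ['2']
</syntaxhighlight>

A numeric field such as <code>year</code> is not tokenized in this way, which is why the mapping distinguishes field types.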

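== Basic Operations ==
=== Creating an Index ===
Use the REST API to create an index named <code>movies</code> with three primary shards and one replica per primary shard: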
<syntaxhighlight lang="bash">
curl -X PUT "localhost:9200/movies" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'
</syntaxhighlight>
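
The same request can be prepared from a program. The sketch below uses only Python's standard library and assumes, like the curl example, an unsecured node at <code>localhost:9200</code>; recent Elasticsearch versions enable TLS and authentication by default, so a real deployment would likely need credentials as well. The helper name <code>build_create_index_request</code> is hypothetical, and the request is built but not sent.

<syntaxhighlight lang="python">
import json
import urllib.request


def build_create_index_request(
    base_url: str, index: str, shards: int, replicas: int
) -> urllib.request.Request:
    """Build (but do not send) the PUT request that creates an index."""
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_create_index_request("http://localhost:9200", "movies", shards=3, replicas=1)
print(req.get_method(), req.full_url)

# To send it against a running node:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
</syntaxhighlight>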

=== Indexing a Document ===
Index a document with ID 1 into the <code>movies</code> index:
<syntaxhighlight lang="bash">
curl -X POST "localhost:9200/movies/_doc/1" -H 'Content-Type: application/json' -d'
{
  "title": "The Godfather",
  "year": 1972
}'
</syntaxhighlight>

=== Searching Documents ===
Search for movies whose title contains "Godfather":
<syntaxhighlight lang="bash">
curl -X GET "localhost:9200/movies/_search?q=title:Godfather"
</syntaxhighlight>

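Example output (abridged to the matching hits; a real response also carries metadata such as timing, shard, and score information):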
<syntaxhighlight lang="json">
{
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_source": {
          "title": "The Godfather",
          "year": 1972
        }
      }
    ]
  }
}
</syntaxhighlight>
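
A client usually only needs the original documents from this structure. The following sketch parses the abridged response shown above and collects each hit's <code>_source</code> together with its <code>_id</code>. The function name <code>extract_sources</code> is hypothetical, and the code assumes every hit carries a <code>_source</code> field.

<syntaxhighlight lang="python">
import json

response_text = """
{
  "hits": {
    "hits": [
      {"_id": "1", "_source": {"title": "The Godfather", "year": 1972}}
    ]
  }
}
"""


def extract_sources(response: dict) -> list[dict]:
    """Return the original document (_source) of every hit, tagged with its _id."""
    return [{"_id": hit["_id"], **hit["_source"]} for hit in response["hits"]["hits"]]


movies = extract_sources(json.loads(response_text))
print(movies)  # [{'_id': '1', 'title': 'The Godfather', 'year': 1972}]
</syntaxhighlight>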

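== Real-World Use Cases ==
=== Log Analytics ===
Elasticsearch is commonly combined with [[Logstash]] and [[Kibana]] as the ELK Stack for real-time log collection and analysis. For example:
# Logstash collects logs from servers and sends them to Elasticsearch.
# Kibana visualizes the log data, helping operations staff locate problems quickly.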
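
The collection step amounts to turning raw log lines into structured JSON documents before they are indexed. The sketch below imitates that step in plain Python; it is not Logstash configuration. The log format, the regular expression, and the field names other than <code>@timestamp</code> (the timestamp field Logstash conventionally uses) are assumptions made for illustration.

<syntaxhighlight lang="python">
import json
import re

# Hypothetical log format: "<ISO timestamp> <LEVEL> <service> <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\S+) (?P<message>.*)$"
)


def parse_log_line(line: str) -> dict:
    """Turn one raw log line into a JSON-ready document.

    Lines that do not match the pattern keep only the raw message.
    """
    match = LOG_PATTERN.match(line)
    if match is None:
        return {"message": line}
    fields = match.groupdict()
    return {
        "@timestamp": fields["timestamp"],
        "level": fields["level"],
        "service": fields["service"],
        "message": fields["message"],
    }


doc = parse_log_line("2024-01-15T10:23:45Z ERROR payment-service Timeout contacting gateway")
print(json.dumps(doc))
</syntaxhighlight>

Once logs are structured this way, a dashboard can filter on fields such as <code>level</code> or <code>service</code> instead of searching raw text.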

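=== E-commerce Search ===
E-commerce platforms use Elasticsearch to implement product search, including:
* Full-text search (for example, on product names and descriptions).
* Filtering (for example, by price range or brand).
* Sorting (for example, by sales volume or rating).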
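
These three capabilities can be combined in a single search body. The sketch below builds a <code>bool</code> query with a full-text <code>match</code> clause, <code>term</code> and <code>range</code> filters, and a sort order. The field names <code>name</code>, <code>brand</code>, <code>price</code>, and <code>sales</code> are hypothetical, and the <code>term</code> filter assumes <code>brand</code> is mapped as a <code>keyword</code> field.

<syntaxhighlight lang="python">
import json


def build_product_query(text: str, brand: str, min_price: float, max_price: float) -> dict:
    """Combine full-text matching, non-scoring filters, and sorting in one search body."""
    return {
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"name": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [
                    {"term": {"brand": brand}},
                    {"range": {"price": {"gte": min_price, "lte": max_price}}},
                ],
            }
        },
        # Sort by sales first, then by relevance score.
        "sort": [{"sales": {"order": "desc"}}, "_score"],
    }


body = build_product_query("wireless headphones", "acme", 50, 200)
print(json.dumps(body, indent=2))
</syntaxhighlight>

The resulting JSON would be sent as the body of a <code>_search</code> request against the product index.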

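== Advanced Features ==
=== Aggregations ===
Elasticsearch provides aggregations for operations such as computing statistics and grouping. For example, to count movies by year: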
<syntaxhighlight lang="bash">
curl -X GET "localhost:9200/movies/_search" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "movies_by_year": {
      "terms": { "field": "year" }
    }
  }
}'
</syntaxhighlight>
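
Conceptually, a <code>terms</code> aggregation groups documents by field value and returns one bucket per value with a <code>key</code> and a <code>doc_count</code>. The sketch below reproduces that behaviour in memory to show the shape of the result; it is an illustration, not how Elasticsearch computes aggregations across shards. The function name <code>terms_aggregation</code> and the sample documents are made up for the example, and the ordering (largest count first, ties by ascending key) mirrors what is assumed to be the default.

<syntaxhighlight lang="python">
from collections import Counter


def terms_aggregation(docs: list[dict], field: str, size: int = 10) -> list[dict]:
    """Group documents by a field value and count them, largest buckets first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Order by descending count; break ties by ascending key.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


movies = [
    {"title": "The Godfather", "year": 1972},
    {"title": "Cabaret", "year": 1972},
    {"title": "The Godfather Part II", "year": 1974},
]
print(terms_aggregation(movies, "year"))
# [{'key': 1972, 'doc_count': 2}, {'key': 1974, 'doc_count': 1}]
</syntaxhighlight>

In a real response the buckets appear under <code>aggregations.movies_by_year.buckets</code>.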

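=== Performance Tuning ===
* Use the Bulk API to send many operations in a single request, which reduces network overhead.
* Choose the number of shards deliberately; a common guideline is to keep each shard at or below roughly 50 GB.
* Take advantage of caching (such as the query cache and the field data cache), for example by placing non-scoring conditions in filter context so that their results can be cached.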
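
As an illustration of the first point, the Bulk API expects a newline-delimited JSON body in which each action line is followed by its document, and the body ends with a newline. The sketch below builds such a body for the <code>movies</code> index without sending it. The helper name <code>build_bulk_body</code> and the second sample document are assumptions for the example.

<syntaxhighlight lang="python">
import json


def build_bulk_body(index: str, docs: dict[str, dict]) -> str:
    """Serialize documents as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: what to do and where.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("movies", {
    "1": {"title": "The Godfather", "year": 1972},
    "2": {"title": "The Godfather Part II", "year": 1974},
})
print(body, end="")
</syntaxhighlight>

The body would be sent with <code>POST /_bulk</code> and the header <code>Content-Type: application/x-ndjson</code>, replacing two separate indexing requests with one.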

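== Mathematical Foundations ==
Elasticsearch's classic scoring model is based on [[TF-IDF]] and the [[向量空间模型|vector space model]]. Since version 5.0 the default similarity is BM25, but the classic formula remains a useful introduction to how relevance is computed. Leaving aside query-level normalization factors, the relevance score of a document <math>d</math> for a query <math>q</math> is:
<math>
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{tf}(t \in d) \times \mathrm{idf}(t)^2 \times \mathrm{boost}(t) \times \mathrm{norm}(t, d)
</math>
where:
* <math>\mathrm{tf}(t \in d)</math>: term frequency, which grows with the number of times term <math>t</math> occurs in document <math>d</math>.
* <math>\mathrm{idf}(t)</math>: inverse document frequency, which is higher for terms that occur in fewer documents.
* <math>\mathrm{boost}(t)</math>: the boost (weight) applied to term <math>t</math>.
* <math>\mathrm{norm}(t, d)</math>: a normalization factor that, among other things, favors matches in shorter fields.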
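
The formula can be tried out on a tiny corpus. The sketch below is a simplified model rather than Lucene's code. It assumes the definitions commonly documented for Lucene's classic similarity: <math>\mathrm{tf} = \sqrt{\text{frequency}}</math>, <math>\mathrm{idf} = 1 + \ln(N / (\mathrm{df} + 1))</math> with <math>N</math> the number of documents and <math>\mathrm{df}</math> the number of documents containing the term, and <math>\mathrm{norm} = 1 / \sqrt{\text{number of terms in the field}}</math>. The function name <code>classic_score</code> and the sample documents are hypothetical.

<syntaxhighlight lang="python">
import math
from collections import Counter


def classic_score(query_terms, doc_terms, doc_freq, num_docs, boost=1.0):
    """Sum tf * idf^2 * boost * norm over the query terms found in the document."""
    norm = 1.0 / math.sqrt(len(doc_terms))  # length normalization
    score = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue  # the term does not occur in this document
        tf = math.sqrt(freq)
        idf = 1.0 + math.log(num_docs / (doc_freq[term] + 1))
        score += tf * idf ** 2 * boost * norm
    return score


docs = {
    "1": ["the", "godfather"],
    "2": ["the", "godfather", "part", "ii"],
    "3": ["the", "conversation"],
}
# Number of documents that contain each term.
doc_freq = Counter(term for terms in docs.values() for term in set(terms))

scores = {
    doc_id: classic_score(["godfather"], terms, doc_freq, len(docs))
    for doc_id, terms in docs.items()
}
for doc_id, score in scores.items():
    print(doc_id, round(score, 3))
# 1 0.707
# 2 0.5
# 3 0.0
</syntaxhighlight>

Documents 1 and 2 both contain the query term once, but document 1 scores higher because its field is shorter, which is the effect of the normalization factor.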

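== Summary ==
Elasticsearch is a capable search engine suited to a wide range of scenarios. After reading this article, you should be familiar with its core concepts, basic operations, and typical applications. To explore further, try integrating [[Kibana]] for data visualization or combining Elasticsearch with [[Logstash]] to build a log analytics pipeline.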

[[Category:大数据框架]]
[[Category:Apache Hadoop]]
[[Category:Apache Hadoop生态工具]]