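= HBase Secondary Indexes =

== Introduction ==
'''HBase secondary indexes''' are a query-optimization technique for the HBase database. HBase natively supports fast retrieval only by row key, but real workloads often need to query by the values of other columns, and that is where a secondary index comes in. A secondary index builds an additional index table that maps non-row-key column values to row keys, which makes efficient multi-dimensional queries possible.

Key characteristics of secondary indexes:
* They lift native HBase's restriction of querying only by row key
* They trade storage space for query speed
* They require keeping the index consistent with the main table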

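== How It Works ==
Implementing a secondary index in HBase comes down to how the index is maintained (client-side dual writes or a coprocessor) and how the index table is laid out. The three topics are covered in turn:

=== 1. Client-Side Dual Writes ===
The application writes to both the main table and the index table:
<mermaid>
graph LR
    A[Client] --> B[Main table]
    A --> C[Index table]
</mermaid>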
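
The dual-write flow above can be sketched with a small in-memory model. This is not HBase client code: the two <code>TreeMap</code>s merely stand in for the main and index tables, and the class name <code>DualWriteSketch</code>, the <code>|</code> separator, and the write order (index entry first, then main row) are assumptions made for illustration.

<syntaxhighlight lang="java">
import java.util.Map;
import java.util.TreeMap;

// In-memory model of client-side dual writes. The TreeMaps stand in for HBase
// tables (rows sorted by key); this is not HBase client code.
class DualWriteSketch {
    final Map<String, String> mainTable = new TreeMap<>();   // order_id -> customer_id
    final Map<String, String> indexTable = new TreeMap<>();  // "customer_id|order_id" -> ""

    // Hypothetical key layout; assumes IDs never contain '|'.
    static String indexKey(String customerId, String orderId) {
        return customerId + "|" + orderId;
    }

    // Writes the index entry first, then the main row. If the main write fails,
    // the index entry is removed again as a best-effort compensation.
    void putOrder(String orderId, String customerId, boolean simulateMainFailure) {
        String key = indexKey(customerId, orderId);
        indexTable.put(key, "");
        try {
            writeMain(orderId, customerId, simulateMainFailure);
        } catch (RuntimeException e) {
            indexTable.remove(key);
            throw e;
        }
    }

    private void writeMain(String orderId, String customerId, boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated main-table failure");
        }
        mainTable.put(orderId, customerId);
    }

    public static void main(String[] args) {
        DualWriteSketch store = new DualWriteSketch();
        store.putOrder("order123", "cust789", false);
        System.out.println(store.mainTable + " " + store.indexTable);
    }
}
</syntaxhighlight>

Writing the index entry first means that a crash between the two writes leaves at worst a dangling index entry, which a reader can detect by checking the main table. The opposite order could leave a row that the index never finds. Which trade-off is acceptable depends on the application.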

=== 2. Coprocessor ===
An HBase Observer coprocessor maintains the index automatically:
<mermaid>
sequenceDiagram
    participant Client
    participant RegionServer
    participant PrimaryTable
    participant IndexTable
    
    Client->>RegionServer: Write to main table
    RegionServer->>PrimaryTable: Write data
    RegionServer->>IndexTable: Coprocessor updates the index automatically
</mermaid>
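
The idea of a server-side hook that keeps the index in step with every write can be modeled in plain Java. The sketch below is loosely inspired by the <code>postPut</code> hook of HBase's <code>RegionObserver</code>, but the interface, its signature, and all class names here are hypothetical and do not match the real coprocessor API. It also shows a detail that is easy to miss: when the indexed value of a row changes, the old index entry has to be removed.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of observer-maintained indexing; not the HBase coprocessor API.
class ObserverIndexSketch {
    // Hypothetical hook: called after a put with the previous value (or null).
    interface PutObserver {
        void postPut(String rowKey, String oldValue, String newValue);
    }

    static class ObservedTable {
        final Map<String, String> rows = new TreeMap<>();
        final List<PutObserver> observers = new ArrayList<>();

        void put(String rowKey, String value) {
            String old = rows.put(rowKey, value);
            for (PutObserver observer : observers) {
                observer.postPut(rowKey, old, value);
            }
        }
    }

    // Builds a main table (order_id -> customer_id) whose observer keeps
    // indexRows ("customer_id|order_id" -> "") in sync with every put.
    static ObservedTable newIndexedTable(Map<String, String> indexRows) {
        ObservedTable main = new ObservedTable();
        main.observers.add((orderId, oldCustomer, newCustomer) -> {
            if (oldCustomer != null && !oldCustomer.equals(newCustomer)) {
                indexRows.remove(oldCustomer + "|" + orderId); // drop the stale entry
            }
            indexRows.put(newCustomer + "|" + orderId, "");
        });
        return main;
    }

    public static void main(String[] args) {
        Map<String, String> index = new TreeMap<>();
        ObservedTable orders = newIndexedTable(index);
        orders.put("order123", "cust789");
        orders.put("order123", "cust001"); // customer changed: old entry is removed
        System.out.println(index.keySet());
    }
}
</syntaxhighlight>

The caller only writes to the main table; the index update happens inside the hook, which is what makes this approach transparent to applications.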

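=== 3. Index Table Design Patterns ===
Common design patterns include:
* '''Mapping table''': the indexed column value is the row key and the original row key is the value
* '''Covering index''': the index table contains every column the query needs
* '''Composite index''': several columns are combined into the index key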
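
To make the three patterns concrete, the following sketch builds the index row each of them would produce for one order. The <code>IndexRow</code> helper, the column names <code>amount</code> and <code>create_date</code>, and the <code>|</code> separator are illustrative assumptions, not part of any HBase API.

<syntaxhighlight lang="java">
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of index-row layouts; IndexRow is a hypothetical helper.
class IndexKeyPatterns {
    static final class IndexRow {
        final String rowKey;
        final Map<String, String> cells = new LinkedHashMap<>();

        IndexRow(String rowKey) {
            this.rowKey = rowKey;
        }

        IndexRow cell(String qualifier, String value) {
            cells.put(qualifier, value);
            return this;
        }
    }

    // Mapping table: indexed value as row key, original row key as the value.
    static IndexRow mappingRow(String customerId, String orderId) {
        return new IndexRow(customerId).cell("order_id", orderId);
    }

    // Covering index: also stores the columns the query needs, so the
    // main table does not have to be read at all.
    static IndexRow coveringRow(String customerId, String orderId, String amount) {
        return new IndexRow(customerId).cell("order_id", orderId).cell("amount", amount);
    }

    // Composite index: several column values combined into the row key,
    // ordered so that the leading column can be queried on its own.
    static IndexRow compositeRow(String customerId, String createDate, String orderId) {
        return new IndexRow(customerId + "|" + createDate).cell("order_id", orderId);
    }

    public static void main(String[] args) {
        IndexRow row = compositeRow("cust789", "2024-01-15", "order123");
        System.out.println(row.rowKey + " -> " + row.cells);
    }
}
</syntaxhighlight>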

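== Code Examples ==
The following examples use the HBase Java API to build a secondary index:

=== Creating the Index Table ===
<syntaxhighlight lang="java">
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;

// `admin` is assumed to be an org.apache.hadoop.hbase.client.Admin obtained from an open Connection.
// HTableDescriptor and HColumnDescriptor are the HBase 1.x descriptor classes (deprecated in 2.x).

// Create the main table
HTableDescriptor mainTable = new HTableDescriptor(TableName.valueOf("orders"));
mainTable.addFamily(new HColumnDescriptor("cf"));

// Create the index table (indexed by customer_id)
HTableDescriptor indexTable = new HTableDescriptor(TableName.valueOf("orders_by_customer"));
indexTable.addFamily(new HColumnDescriptor("cf"));

admin.createTable(mainTable);
admin.createTable(indexTable);
</syntaxhighlight>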

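=== Writing Data and Maintaining the Index ===
<syntaxhighlight lang="java">
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// `table` and `indexTable` are assumed to be Table handles for "orders" and "orders_by_customer".

// Write to the main table
Put mainPut = new Put(Bytes.toBytes("order123"));
mainPut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("customer_id"), Bytes.toBytes("cust789"));
mainPut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("amount"), Bytes.toBytes("199.99"));
table.put(mainPut);

// Also write to the index table (customer_id is the row key).
// Note: this layout keeps a single order_id cell per customer, so a later order
// for the same customer replaces the value that a default read returns.
Put indexPut = new Put(Bytes.toBytes("cust789"));
indexPut.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("order_id"), Bytes.toBytes("order123"));
indexTable.put(indexPut);
</syntaxhighlight>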

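=== Querying Through the Index ===
<syntaxhighlight lang="java">
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// First look up the order_id in the index table
Get indexGet = new Get(Bytes.toBytes("cust789"));
Result indexResult = indexTable.get(indexGet);
byte[] orderId = indexResult.getValue(Bytes.toBytes("cf"), Bytes.toBytes("order_id"));

// Then fetch the full row from the main table by order_id.
// getValue returns null when the customer has no index entry, so check first.
Result mainResult = null;
if (orderId != null) {
    mainResult = table.get(new Get(orderId));
}
</syntaxhighlight>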
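
The example above stores a single <code>order_id</code> cell per customer row, so it can return only one order per customer. One common variation, which is an assumption here and not part of the original example, puts both IDs into the index row key and reads all matches with a prefix scan over the sorted keys. The sketch below models that with a <code>TreeMap</code>; it is not HBase client code, and it assumes IDs never contain the <code>|</code> separator.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of a composite-key index read with a prefix scan.
class PrefixScanIndexSketch {
    static final char SEP = '|';

    // "customer_id|order_id" -> "" ; the key itself carries all the information.
    final NavigableMap<String, String> indexTable = new TreeMap<>();

    void addIndexEntry(String customerId, String orderId) {
        indexTable.put(customerId + SEP + orderId, "");
    }

    // Returns every order ID indexed under the given customer, in key order.
    List<String> orderIdsFor(String customerId) {
        String start = customerId + SEP;
        // Smallest string greater than every key that begins with `start`.
        String stop = customerId + (char) (SEP + 1);
        List<String> orderIds = new ArrayList<>();
        for (String key : indexTable.subMap(start, true, stop, false).keySet()) {
            orderIds.add(key.substring(start.length()));
        }
        return orderIds;
    }

    public static void main(String[] args) {
        PrefixScanIndexSketch index = new PrefixScanIndexSketch();
        index.addIndexEntry("cust789", "order123");
        index.addIndexEntry("cust789", "order124");
        index.addIndexEntry("cust7890", "order555"); // different customer, similar prefix
        System.out.println(index.orderIdsFor("cust789"));
    }
}
</syntaxhighlight>

Including the separator in the scan prefix is what keeps <code>cust7890</code> out of the results for <code>cust789</code>.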

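== Real-World Use Case ==
A typical scenario in an '''e-commerce order system''':

1. '''Requirement''': the system must support both:
* Lookup by order ID (primary-key query)
* Lookup of all orders by customer ID (requires a secondary index)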

2. '''Solution''':
<mermaid>
erDiagram
    ORDERS ||--o{ ORDERS_BY_CUSTOMER : "secondary index"
    ORDERS {
        string order_id PK
        string customer_id
        decimal amount
        timestamp create_time
    }
    ORDERS_BY_CUSTOMER {
        string customer_id PK
        string order_id
    }
</mermaid>

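3. '''Query flow''':
* A user opens the "My Orders" page
* The system first uses customer_id to find all matching order_id values in the ORDERS_BY_CUSTOMER table
* It then fetches the full order rows from the ORDERS table in a batch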
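
The query flow above can be sketched end to end with an in-memory model. The maps stand in for the ORDERS and ORDERS_BY_CUSTOMER tables, the composite <code>customer_id|order_id</code> index key is an assumption, and the second step plays the role of a batched read (in HBase, a list of <code>Get</code>s can be submitted in one <code>Table.get</code> call). The sketch also re-checks each fetched row, because an index entry may outlive the row it points to.

<syntaxhighlight lang="java">
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of the two-step "My Orders" lookup; not HBase client code.
class MyOrdersLookupSketch {
    final NavigableMap<String, String> ordersByCustomer = new TreeMap<>(); // "cust|order" -> ""
    final Map<String, Map<String, String>> orders = new TreeMap<>();       // order_id -> columns

    void addOrder(String orderId, String customerId, String amount) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("customer_id", customerId);
        row.put("amount", amount);
        orders.put(orderId, row);
        ordersByCustomer.put(customerId + "|" + orderId, "");
    }

    Map<String, Map<String, String>> myOrders(String customerId) {
        // Step 1: collect order IDs from the index (keys with the prefix are contiguous).
        String prefix = customerId + "|";
        List<String> orderIds = new ArrayList<>();
        for (String key : ordersByCustomer.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix)) {
                break;
            }
            orderIds.add(key.substring(prefix.length()));
        }
        // Step 2: fetch the rows as a batch, skipping stale index entries.
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String orderId : orderIds) {
            Map<String, String> row = orders.get(orderId);
            if (row != null && customerId.equals(row.get("customer_id"))) {
                result.put(orderId, row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MyOrdersLookupSketch store = new MyOrdersLookupSketch();
        store.addOrder("order123", "cust789", "199.99");
        store.addOrder("order124", "cust789", "20.00");
        System.out.println(store.myOrders("cust789").keySet());
    }
}
</syntaxhighlight>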

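== Mathematical Analysis ==
The query performance of a secondary index can be analyzed in terms of time complexity. Here <math>n</math> is the number of rows in the main table, and a lookup by row key is treated as roughly constant-time. That is a simplification: in practice a lookup involves locating the Region and searching sorted store files.

Main-table query: <math>O(1)</math> (by row key)

Column query without an index: <math>O(n)</math> (full table scan)

Column query with a secondary index, for a query that matches a single row: <math>O(1)</math> (index lookup) + <math>O(1)</math> (main-table lookup) ≈ <math>O(1)</math>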
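
A slightly more general version of the estimate above counts rows touched rather than single lookups. Assume, purely for illustration, that a query matches <math>k</math> of the <math>n</math> rows in the main table and that reading one row or one index entry costs one unit:

<math>T_{\text{scan}} = n, \qquad T_{\text{index}} = \underbrace{k}_{\text{index entries}} + \underbrace{k}_{\text{main-table rows}} = 2k</math>

Under these assumptions the index reduces the work by a factor of about <math>\frac{n}{2k}</math>. With the made-up values <math>n = 10^8</math> and <math>k = 20</math>, that is <math>\frac{10^8}{40} = 2.5 \times 10^6</math>. The benefit shrinks as <math>k</math> approaches <math>n</math>, so an index on a column where most rows share the same value helps little.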

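== Caveats ==
When using secondary indexes, consider:
* '''Write amplification''': every write has to update both the main table and the index table
* '''Consistency''': HBase guarantees atomicity only within a single row, so updates to the main table and the index table are not atomic by default and the design must handle partial failures
* '''Storage cost''': index tables take up additional storage space
* '''Hotspotting''': a poorly designed index key can create Region hotspots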
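
One widely used way to reduce hotspotting is to prefix the index key with a salt derived from a hash, so that consecutive values spread across the key space. The sketch below is a minimal illustration under assumptions: the bucket count of 16, the two-digit prefix format, and the method names are all hypothetical choices. It salts by customer ID so that all entries for one customer stay under one prefix and can still be read with a single prefix scan.

<syntaxhighlight lang="java">
// Illustrative salting of index row keys; bucket count and format are assumptions.
class SaltedIndexKeySketch {
    static final int BUCKETS = 16;

    // Two-digit bucket prefix derived from the customer ID, e.g. "07|".
    static String saltPrefix(String customerId) {
        int bucket = Math.floorMod(customerId.hashCode(), BUCKETS);
        return (bucket < 10 ? "0" : "") + bucket + "|";
    }

    // Full index row key: salt, then customer ID, then order ID.
    static String saltedIndexKey(String customerId, String orderId) {
        return saltPrefix(customerId) + customerId + "|" + orderId;
    }

    // Prefix to scan when reading all orders of one customer.
    static String scanPrefix(String customerId) {
        return saltPrefix(customerId) + customerId + "|";
    }

    public static void main(String[] args) {
        System.out.println(saltedIndexKey("cust789", "order123"));
        System.out.println(saltedIndexKey("cust790", "order124"));
    }
}
</syntaxhighlight>

Because the salt is computed from the customer ID alone, a reader can recompute it from the query value. Salting by the whole key would spread writes more evenly, but a query by customer would then have to scan every bucket.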

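== Advanced Topics ==
For large-scale production environments, consider:
* '''Phoenix''': a SQL layer on top of HBase with built-in secondary index support
* '''Global vs. local indexes''': different strategies for distributing index data
* '''Asynchronous index building''': an approach that reduces write latency
* '''TTL synchronization''': keeping the TTL of the main table and the index table consistent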
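
The trade-off behind asynchronous index building can be shown with a small model: the write path updates only the main table and queues the index update, and a separate step applies the queue later. Everything in this sketch (class name, queue, explicit <code>drainPending</code> call in place of a background worker) is a simplifying assumption; it does not reflect how Phoenix or any particular system implements the feature.

<syntaxhighlight lang="java">
import java.util.ArrayDeque;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// In-memory model of asynchronous index maintenance; not HBase or Phoenix code.
class AsyncIndexSketch {
    final Map<String, String> mainTable = new TreeMap<>();   // order_id -> customer_id
    final Map<String, String> indexTable = new TreeMap<>();  // "customer_id|order_id" -> ""
    final Queue<String[]> pending = new ArrayDeque<>();      // queued {customer_id, order_id}

    // Fast write path: one table write plus an in-memory enqueue.
    void putOrder(String orderId, String customerId) {
        mainTable.put(orderId, customerId);
        pending.add(new String[] {customerId, orderId});
    }

    // Applies all queued index updates; a real system would run this in the background.
    int drainPending() {
        int applied = 0;
        String[] task;
        while ((task = pending.poll()) != null) {
            indexTable.put(task[0] + "|" + task[1], "");
            applied++;
        }
        return applied;
    }

    public static void main(String[] args) {
        AsyncIndexSketch store = new AsyncIndexSketch();
        store.putOrder("order123", "cust789");
        System.out.println("before drain: " + store.indexTable.keySet());
        store.drainPending();
        System.out.println("after drain: " + store.indexTable.keySet());
    }
}
</syntaxhighlight>

Until the queue is drained, the row exists in the main table but cannot be found through the index, so queries may briefly miss recent writes. That window of staleness is the price paid for the lower write latency.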

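== Summary ==
HBase secondary indexes are a key technique for getting past the row-key-only query restriction. They add write and storage overhead, but for read-heavy, write-light workloads they can improve query performance significantly. Choose the index strategy to fit the actual query patterns, and take care to keep the index consistent with the main table.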

[[Category:大数据框架]]
[[Category:Apache Hadoop]]
[[Category:Hbase数据库]]