编辑“︁HBase数据读写操作”︁

= HBase数据读写操作 =

HBase是一个分布式的、面向列的NoSQL数据库，基于Hadoop生态系统构建。它提供了高性能的随机读写能力，适合处理海量数据。本章节将详细介绍HBase中的数据读写操作机制。

== 数据模型概述 ==
HBase的数据模型由以下几个核心概念组成：
* '''表(Table)'''：数据存储在表中，表由行和列组成
* '''行键(Row Key)'''：唯一标识一行数据，按字典序排序
* '''列族(Column Family)'''：一组列的集合，在创建表时定义
* '''列限定符(Column Qualifier)'''：列族下的具体列
* '''时间戳(Timestamp)'''：数据的版本标识

数据定位格式为：<code>(Row Key, Column Family:Column Qualifier, Timestamp) → Value</code>

<mermaid>
graph LR
    A[Table] --> B[Row1]
    A --> C[Row2]
    B --> D[CF1:col1]
    B --> E[CF1:col2]
    B --> F[CF2:col1]
    C --> G[CF1:col1]
    C --> H[CF2:col1]
</mermaid>

== 写入操作 ==

=== Put操作 ===
Put操作用于向HBase表中插入或更新数据。

<syntaxhighlight lang="java">
// Java示例
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Table table = connection.getTable(TableName.valueOf("my_table"));

Put put = new Put(Bytes.toBytes("row1"));  // 创建行键为row1的Put对象
put.addColumn(
    Bytes.toBytes("cf1"),     // 列族
    Bytes.toBytes("name"),    // 列限定符
    Bytes.toBytes("张三")     // 值
);

table.put(put);  // 执行写入
table.close();
connection.close();
</syntaxhighlight>

'''写入特性'''：
* 原子性：单行Put操作是原子的
* 版本控制：默认保留3个版本数据
* 批量写入：支持批量Put操作提高效率

=== 批量写入 ===
使用<code>Table.put(List<Put>)</code>实现批量写入：

<syntaxhighlight lang="java">
List<Put> puts = new ArrayList<>();
puts.add(new Put(Bytes.toBytes("row1")).addColumn(...));
puts.add(new Put(Bytes.toBytes("row2")).addColumn(...));
table.put(puts);  // 批量提交
</syntaxhighlight>

== 读取操作 ==

=== Get操作 ===
Get操作用于读取单行数据：

<syntaxhighlight lang="java">
Get get = new Get(Bytes.toBytes("row1"));
get.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"));  // 指定要获取的列
Result result = table.get(get);

byte[] value = result.getValue(
    Bytes.toBytes("cf1"), 
    Bytes.toBytes("name")
);
System.out.println(Bytes.toString(value));  // 输出: 张三
</syntaxhighlight>

=== Scan操作 ===
Scan操作用于扫描多行数据：

<syntaxhighlight lang="java">
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("row1"));  // 设置起始行
scan.setStopRow(Bytes.toBytes("row5"));   // 设置结束行(不包含)
scan.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("name"));

ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
    // 处理每一行结果
    byte[] value = result.getValue(...);
    System.out.println(Bytes.toString(value));
}
scanner.close();
</syntaxhighlight>

'''扫描优化技巧'''：
* 设置合理的缓存大小：<code>scan.setCaching(100)</code>
* 指定需要的列，避免全列扫描
* 使用过滤器减少传输数据量

== 过滤器(Filter) ==
HBase提供了多种过滤器来优化查询：

<syntaxhighlight lang="java">
// 单值过滤器示例
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    Bytes.toBytes("cf1"),
    Bytes.toBytes("age"),
    CompareOperator.GREATER,
    Bytes.toBytes("30")
);
scan.setFilter(filter);
</syntaxhighlight>

常用过滤器类型：
* '''RowFilter'''：基于行键过滤
* '''ValueFilter'''：基于值过滤
* '''ColumnPrefixFilter'''：基于列名前缀过滤
* '''PageFilter'''：分页过滤

== 实际案例：用户画像系统 ==

'''场景'''：电商平台需要存储和快速查询用户画像数据

'''表设计'''：
<mermaid>
erDiagram
    USER_PROFILE {
        string user_id PK
        string basic:name
        string basic:gender
        string basic:age
        string behavior:last_view
        string behavior:fav_category
        string purchase:total_amount
        string purchase:last_order_time
    }
</mermaid>

'''读写示例'''：
<syntaxhighlight lang="java">
// 更新用户最近浏览记录
Put put = new Put(Bytes.toBytes("user123"));
put.addColumn(
    Bytes.toBytes("behavior"), 
    Bytes.toBytes("last_view"), 
    System.currentTimeMillis(),  // 使用时间戳作为版本
    Bytes.toBytes("product_789")
);
table.put(put);

// 查询高价值用户
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(
    Bytes.toBytes("purchase"),
    Bytes.toBytes("total_amount"),
    CompareOperator.GREATER,
    Bytes.toBytes("10000")
);
scan.setFilter(filter);
</syntaxhighlight>

== 性能优化 ==

* '''行键设计'''：
  * 避免单调递增行键（导致热点问题）
  * 考虑使用散列前缀：<code>MD5(username).substring(0,4) + username</code>

* '''写入优化'''：
  * 禁用自动刷写：<code>table.setAutoFlush(false)</code>
  * 合理设置Write Buffer大小

* '''读取优化'''：
  * 使用块缓存：<code>scan.setCacheBlocks(true)</code>
  * 合理设置扫描范围

== 数学原理 ==
HBase的LSM树(Log-Structured Merge-Tree)存储引擎的写入复杂度为O(1)，读取复杂度为O(log<sub>B</sub>N)，其中B为树的分支因子，N为数据量。

<math>
T_{write} = O(1) \\
T_{read} = O(\log_B N)
</math>

== 总结 ==
HBase提供了灵活高效的数据读写接口，通过合理设计行键、列族和使用过滤器，可以构建高性能的大数据存储解决方案。理解其底层原理有助于在实际应用中做出最佳设计决策。

[[Category:大数据框架]]
[[Category:Apache Hadoop]]
[[Category:Hbase数据库]]