编辑“︁HBase数据模型”︁（章节）

= HBase数据模型 =

== 介绍 ==
HBase数据模型是HBase数据库的核心设计理念，它基于Google Bigtable论文的思想，是一个面向列的分布式存储系统。与传统关系型数据库（如MySQL）不同，HBase采用松散结构化的数据组织形式，适合存储海量稀疏数据。其核心特点是：
* '''多维映射'''：数据通过行键（Row Key）、列族（Column Family）、列限定符（Column Qualifier）和时间戳（Timestamp）四维定位
* '''稀疏性'''：允许行中某些列不存在
* '''版本控制'''：每个单元格（Cell）可存储多个版本的数据

== 核心组件 ==
=== 行键（Row Key） ===
行键是数据的主键，按字典序排序。设计良好的行键能显著提升查询效率。例如，若存储用户数据，行键可以是用户ID的反转（如<code>com.example.user:12345</code>），避免热点问题。

=== 列族（Column Family） ===
列族是列的逻辑分组，需在表创建时预定义。物理存储上，同一列族的数据会存放在一起。例如用户表可能包含<code>info</code>和<code>contact</code>两个列族。

<mermaid>
graph TD
    A[User Table] --> B[Row Key: user123]
    B --> C[Column Family: info]
    B --> D[Column Family: contact]
    C --> E[Column Qualifier: name]
    C --> F[Column Qualifier: age]
    D --> G[Column Qualifier: email]
    D --> H[Column Qualifier: phone]
</mermaid>

=== 单元格（Cell） ===
单元格是存储数据的最小单元，由<code>(row, column family:qualifier, timestamp)</code>唯一确定。例如：
<syntaxhighlight lang="text">
user123, info:name, 1635724800000, "John Doe"
</syntaxhighlight>

== 数据操作示例 ==
=== 创建表 ===
<syntaxhighlight lang="java">
// Java API示例
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config);
Admin admin = connection.getAdmin();

TableDescriptorBuilder tableDescBuilder = TableDescriptorBuilder.newBuilder(TableName.valueOf("users"));
ColumnFamilyDescriptorBuilder cfDescBuilder = ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("info"));
tableDescBuilder.setColumnFamily(cfDescBuilder.build());

admin.createTable(tableDescBuilder.build());
</syntaxhighlight>

=== 插入数据 ===
<syntaxhighlight lang="java">
Table table = connection.getTable(TableName.valueOf("users"));
Put put = new Put(Bytes.toBytes("user123"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes(28));
table.put(put);
</syntaxhighlight>

=== 查询数据 ===
<syntaxhighlight lang="java">
Get get = new Get(Bytes.toBytes("user123"));
Result result = table.get(get);
byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
System.out.println("Name: " + Bytes.toString(name));
</syntaxhighlight>

== 版本控制机制 ==
HBase支持为每个单元格存储多个版本（默认保留3个），可通过时间戳访问历史数据。版本数由列族定义：
<syntaxhighlight lang="java">
// 设置列族保留5个版本
cfDescBuilder.setMaxVersions(5);
</syntaxhighlight>

时间戳可以是显式指定的long值，或由系统自动生成。查询时可指定时间范围：
<syntaxhighlight lang="java">
Get get = new Get(Bytes.toBytes("user123"));
get.setTimeRange(startTime, endTime);
</syntaxhighlight>

== 实际应用案例 ==
'''电商用户画像存储'''：
* 行键：<code>用户ID</code>
* 列族<code>basic_info</code>：存储姓名、性别等基本信息
* 列族<code>behavior</code>：存储浏览记录（动态列，如<code>product_view:20230101</code>）
* 列族<code>preference</code>：存储标签数据（如<code>category_pref:electronics</code>）

<mermaid>
erDiagram
    USER ||--o{ BEHAVIOR : has
    USER ||--o{ PREFERENCE : has
    USER {
        string user_id PK
        string name
        int age
    }
    BEHAVIOR {
        string event_type
        string product_id
        timestamp event_time
    }
    PREFERENCE {
        string tag_name
        float weight
    }
</mermaid>

== 数学基础 ==
HBase的排序机制基于行键的字典序，其比较算法可表示为：
<math>
compare(a,b) = \begin{cases} 
-1 & \text{if } a < b \\
0 & \text{if } a = b \\
1 & \text{if } a > b 
\end{cases}
</math>

Region分裂策略通常基于大小阈值，当Region达到<math>hbase.hregion.max.filesize</math>（默认10GB）时触发分裂。

== 高级特性 ==
* '''布隆过滤器'''：加速存在性检查
* '''协处理器'''：类似数据库触发器
* '''TTL'''：自动过期数据
* '''ACID语义'''：行级原子性

== 设计建议 ==
1. 行键设计应避免单调递增（如时间戳），采用加盐或哈希技术
2. 列族数量不宜过多（通常2-3个）
3. 动态列适合存储稀疏属性
4. 合理设置数据版本数和TTL

通过理解HBase数据模型，开发者可以高效设计适合大规模数据存储的解决方案，平衡查询性能与存储效率。

[[Category:大数据框架]]
[[Category:Apache Hadoop]]
[[Category:Hbase数据库]]