编辑“︁Airflow数据库后端”︁

= Airflow数据库后端 =

== 介绍 ==
'''Airflow数据库后端'''是Apache Airflow的核心组件之一，负责存储和管理所有工作流元数据、任务状态、变量、连接等信息。它作为Airflow的“大脑”，确保调度器（Scheduler）和执行器（Executor）能够协同工作。Airflow支持多种关系型数据库作为后端，包括SQLite（仅限开发）、PostgreSQL、MySQL和Microsoft SQL Server。

数据库后端的选择直接影响Airflow的性能、可靠性和扩展性。例如：
* '''开发环境'''：SQLite适合快速测试，但无法支持并行任务。
* '''生产环境'''：PostgreSQL或MySQL是推荐选择，支持高并发和分布式部署。

== 支持的数据库 ==
以下是Airflow官方支持的数据库及其特性对比：

{| class="wikitable"
|+
! 数据库类型
! 是否支持并行
! 适用场景
! 限制
|-
| SQLite
| 否
| 开发/测试
| 单线程，无并发支持
|-
| MySQL
| 是
| 生产
| 需启用<code>explicit_defaults_for_timestamp</code>
|-
| PostgreSQL
| 是
| 生产
| 推荐使用，性能最佳
|-
| MS SQL Server
| 是
| 生产
| 需额外配置ODBC驱动
|}

== 配置数据库后端 ==
Airflow通过<code>airflow.cfg</code>文件中的<code>sql_alchemy_conn</code>参数配置数据库连接。以下是一个PostgreSQL配置示例：

<syntaxhighlight lang="ini">
[core]
sql_alchemy_conn = postgresql+psycopg2://user:password@localhost:5432/airflow_db
</syntaxhighlight>

=== 初始化数据库 ===
首次使用前需初始化数据库表结构：
<syntaxhighlight lang="bash">
airflow db init
</syntaxhighlight>

== 数据库架构 ==
Airflow的数据库包含以下关键表：
* '''dag_run'''：存储DAG的执行实例
* '''task_instance'''：记录每个任务的状态（如<code>success</code>、<code>failed</code>）
* '''log'''：存储任务日志
* '''variable'''：保存全局变量
* '''connection'''：存储外部系统连接信息

<mermaid>
erDiagram
    DAG ||--o{ DAG_RUN : has
    DAG_RUN ||--o{ TASK_INSTANCE : contains
    TASK_INSTANCE }|--|| LOG : generates
    AIRFLOW ||--o{ VARIABLE : uses
    AIRFLOW ||--o{ CONNECTION : uses
</mermaid>

== 实际案例 ==
=== 场景：监控长时间运行的任务 ===
通过查询<code>task_instance</code>表，可以监控任务执行状态：

<syntaxhighlight lang="python">
from airflow.settings import Session
from airflow.models import TaskInstance

session = Session()
long_running = session.query(TaskInstance).filter(
    TaskInstance.state == 'running',
    TaskInstance.duration > 3600  # 超过1小时的任务
).all()
</syntaxhighlight>

=== 输出示例 ===
查询结果可能返回如下结构：
<syntaxhighlight lang="text">
[
    TaskInstance(
        dag_id='etl_pipeline',
        task_id='load_data',
        execution_date=datetime(2023, 5, 1),
        state='running',
        duration=4200
    )
]
</syntaxhighlight>

== 性能优化 ==
对于生产环境，建议：
1. '''定期清理'''：使用<code>airflow db clean</code>删除旧数据
2. '''索引优化'''：为<code>dag_id</code>、<code>state</code>等字段添加索引
3. '''连接池'''：配置<code>sql_alchemy_pool_size</code>（建议值：<math>5 \times CPU\_cores</math>）

== 常见问题 ==
=== 数据库迁移 ===
当需要更换数据库时：
<syntaxhighlight lang="bash">
airflow db migrate
</syntaxhighlight>

=== 连接错误处理 ===
若出现<code>sqlalchemy.exc.OperationalError</code>，检查：
* 数据库服务是否运行
* 连接字符串是否正确
* 网络防火墙设置

== 高级主题 ==
=== 自定义元数据 ===
可以通过继承<code>Base</code>类扩展元数据：
<syntaxhighlight lang="python">
from sqlalchemy import Column, Integer, String
from airflow.models.base import Base

class CustomMetadata(Base):
    __tablename__ = 'custom_metadata'
    id = Column(Integer, primary_key=True)
    key = Column(String(50), unique=True)
    value = Column(String(500))
</syntaxhighlight>

=== 分库策略 ===
对于超大规模部署，可考虑：
* 将日志存储到独立数据库
* 使用读写分离架构

== 总结 ==
Airflow数据库后端是工作流管理的基石，正确配置和维护数据库能显著提升系统稳定性。生产环境应始终使用PostgreSQL或MySQL，并定期监控数据库性能指标。

[[Category:大数据框架]]
[[Category:Airflow]]
[[Category:Airflow基础]]