== Optimization methods ==

=== 1. Tuning the number of map and reduce tasks ===
Setting the number of map and reduce tasks appropriately avoids both wasted resources and overloaded tasks.

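* '''Number of map tasks''': normally determined by the number of input splits (InputSplit). The `mapreduce.job.maps` parameter is only a hint to the framework; to change the actual count, adjust the split size, for example with `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.
* '''Number of reduce tasks''': defaults to 1 and is set with `mapreduce.job.reduces`. A rule of thumb is: <math>\text{reduce tasks} = \text{nodes} \times \text{containers per node} \times 0.95</math> With a factor of 0.95, all reduce tasks can start in a single wave as the map tasks finish.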

Example configuration:
<syntaxhighlight lang="xml">
<property>
  <name>mapreduce.job.reduces</name>
  <value>10</value>
</property>
</syntaxhighlight>
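
The rule of thumb above is plain arithmetic, so it can be sketched in a few lines of Java. The class and method names here are hypothetical helpers, not part of the Hadoop API, and the cluster figures in <code>main</code> are made up for illustration.

<syntaxhighlight lang="java">
/** Illustrative helper (not a Hadoop API): estimates a reduce-task count from cluster capacity. */
final class ReducerCountEstimator {

    /**
     * @param nodes             number of worker nodes (assumed known)
     * @param containersPerNode containers each node can run at once (assumed known)
     * @param factor            scaling factor, e.g. 0.95 for a single wave of reduces
     */
    static int estimate(int nodes, int containersPerNode, double factor) {
        if (nodes <= 0 || containersPerNode <= 0 || factor <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        int slots = nodes * containersPerNode;
        // Round down so the job never asks for more reduces than the scaled capacity,
        // but always keep at least one reduce task.
        return Math.max(1, (int) Math.floor(slots * factor));
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 10 nodes, 8 containers per node.
        System.out.println(estimate(10, 8, 0.95));
    }
}
</syntaxhighlight>

For the hypothetical cluster above, the estimate is 76, which could then be used as the value of `mapreduce.job.reduces`.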

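=== 2. Using a Combiner to reduce data transfer ===
A Combiner is a local reduce step that runs on the map side and shrinks the amount of data shuffled from the map phase to the reduce phase. Because the framework may run it zero, one, or several times, the combine operation should be commutative and associative, as summation is.

Example code (using a Combiner in WordCount):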
<syntaxhighlight lang="java">
job.setCombinerClass(IntSumReducer.class);
</syntaxhighlight>
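
To show why this helps, the following is a minimal sketch in plain Java, with no Hadoop dependency, of what a summing combiner does to the output of a single map task. The class name and the sample words are assumptions made for illustration.

<syntaxhighlight lang="java">
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative simulation of local combining; not Hadoop code. */
final class LocalCombineDemo {

    /** Sums the counts of identical words emitted by one map task (each word counts as 1). */
    static Map<String, Integer> combine(List<String> mapOutputWords) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (String word : mapOutputWords) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("to", "be", "or", "not", "to", "be");
        Map<String, Integer> combined = combine(emitted);
        // Without combining, every emitted record is shuffled; with it, one record per distinct word.
        System.out.println(emitted.size() + " records before, " + combined.size() + " after: " + combined);
    }
}
</syntaxhighlight>

The saving grows with the number of repeated keys per map task; when almost every key is unique, a combiner adds work without reducing the shuffle.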

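=== 3. Optimizing data serialization ===
Using an efficient data format, such as Avro for records passed between stages or Parquet for columnar data on disk, reduces I/O and network overhead.

Example (setting the map output key class to Avro's AvroKey; a working job also needs the key schema registered, which is typically done from the driver code):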
<syntaxhighlight lang="xml">
<property>
  <name>mapreduce.map.output.key.class</name>
  <value>org.apache.avro.mapred.AvroKey</value>
</property>
</syntaxhighlight>
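
The size difference between text and binary encodings can be illustrated with the JDK alone. The sketch below compares the bytes needed for one integer written as decimal text and as a fixed-width binary value. It is not Avro's encoding, which is different; the class name is hypothetical and the sketch only shows the general effect.

<syntaxhighlight lang="java">
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative comparison of text and binary encodings; not Avro or Hadoop code. */
final class EncodingSizeDemo {

    /** Bytes needed to store the value as decimal text in UTF-8. */
    static int textSize(int value) {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8).length;
    }

    /** Bytes needed to store the value as a fixed-width 32-bit integer. */
    static int binarySize(int value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeInt(value);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep the demo simple.
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    public static void main(String[] args) {
        int value = 1234567890;
        System.out.println("text: " + textSize(value) + " bytes, binary: " + binarySize(value) + " bytes");
    }
}
</syntaxhighlight>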

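=== 4. Handling data skew ===
Data skew leaves a few reduce tasks with far more data than the rest, so they finish last and delay the whole job. Remedies include:
* A custom partitioner (Partitioner)
* Salting, which spreads a hot key across several partitions

Example (custom partitioner):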
<syntaxhighlight lang="java">
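import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CustomPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Placeholder: this matches the default hash partitioning.
    // Replace it with logic that routes known hot keys to separate partitions.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}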
</syntaxhighlight>
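
Salting, the second remedy listed above, can be sketched in plain Java as follows. The class name, the <code>#</code> separator, and the bucket count are assumptions chosen for illustration; the sketch assumes keys do not otherwise end in <code>#</code> followed by a number. A first job would aggregate on the salted keys, and a second job would strip the salt and merge the partial results.

<syntaxhighlight lang="java">
import java.util.Random;

/** Illustrative key-salting helpers; not part of the Hadoop API. */
final class KeySalting {
    static final char SEPARATOR = '#';

    /** Appends a random bucket number so one hot key becomes up to {@code buckets} distinct keys. */
    static String salt(String key, int buckets, Random random) {
        return key + SEPARATOR + random.nextInt(buckets);
    }

    /** Removes the salt suffix again; keys without a separator are returned unchanged. */
    static String unsalt(String saltedKey) {
        int idx = saltedKey.lastIndexOf(SEPARATOR);
        return idx < 0 ? saltedKey : saltedKey.substring(0, idx);
    }

    /** Same hash partitioning as the custom partitioner example, applied to a plain string. */
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 6; i++) {
            String salted = salt("hot", 4, random);
            System.out.println(salted + " -> partition " + partition(salted, 10));
        }
    }
}
</syntaxhighlight>

The trade-off is an extra aggregation pass, so salting is usually worth it only for the few keys that dominate the data.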

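=== 5. Memory tuning ===
Size the task containers and JVM heaps so that tasks neither trigger frequent garbage collection nor get killed for exceeding their memory limit. `mapreduce.map.memory.mb` and `mapreduce.reduce.memory.mb` set the container size in MB; the JVM heap is set separately through `-Xmx` in `mapreduce.map.java.opts` and `mapreduce.reduce.java.opts`, and it must fit inside the container.

Example configuration (container sizes):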
<syntaxhighlight lang="xml">
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
</syntaxhighlight>
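
The heap setting that goes with each container size can be derived with simple arithmetic. The sketch below assumes the commonly cited rule of thumb of giving the heap about 80% of the container, leaving the rest for non-heap JVM memory. The ratio and the helper name are assumptions, not values taken from this configuration.

<syntaxhighlight lang="java">
/** Illustrative helper (not a Hadoop API): derives a -Xmx option from a container size. */
final class HeapSizing {

    /** Returns a JVM heap option sized at 80% of the container (assumed ratio), in whole MB. */
    static String heapOption(int containerMb) {
        if (containerMb <= 0) {
            throw new IllegalArgumentException("containerMb must be positive");
        }
        int heapMb = containerMb * 8 / 10; // integer arithmetic, rounds down
        return "-Xmx" + heapMb + "m";
    }

    public static void main(String[] args) {
        // Container sizes from the example configuration above.
        System.out.println("mapreduce.map.java.opts    = " + heapOption(2048));
        System.out.println("mapreduce.reduce.java.opts = " + heapOption(4096));
    }
}
</syntaxhighlight>

For the 2048 MB and 4096 MB containers above, this yields `-Xmx1638m` and `-Xmx3276m` respectively.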