
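== Technical implementation approaches ==
=== 1. Machine learning on MapReduce ===
Classic MapReduce suits batch jobs that need few iterations, whereas machine learning algorithms often require many passes over the same data. One way to adapt an algorithm is to express each iteration as a separate map and reduce step, as follows:

==== Example: parallel K-Means ====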
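<syntaxhighlight lang="java">
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The Vector type and the helpers loadCenters, parseVector, findNearestCenter
// and calculateMean are not shown here; they are assumed to be defined elsewhere.
// Each public class belongs in its own source file.

// Mapper phase: compute the distance from each sample to the centroids
// and emit the sample under the index of the nearest one
public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private List<Vector> centers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Read the initial centroids from HDFS
        centers = loadCenters(context.getConfiguration().get("centers.path"));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Vector sample = parseVector(value.toString());
        int nearestCenter = findNearestCenter(sample, centers);
        context.write(new IntWritable(nearestCenter), value);
    }
}

// Reducer phase: recompute each centroid
public class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<Vector> samples = new ArrayList<>();
        for (Text v : values) {
            samples.add(parseVector(v.toString()));
        }
        Vector newCenter = calculateMean(samples);  // the mean becomes the new centroid
        context.write(key, new Text(newCenter.toString()));
    }
}
</syntaxhighlight>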

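{{Warning|MapReduce incurs heavy disk I/O: every iteration runs as a separate job that reads its input from HDFS and writes its output back, so iterative algorithms run slowly. This example is suited to teaching and demonstration rather than production use.}}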
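
The Java classes above cover a single map and reduce round; the iteration itself is driven from outside the job. The following is a minimal single-machine sketch of that driver loop in plain Python, with no Hadoop involved. The function names (<code>map_phase</code>, <code>reduce_phase</code>, <code>kmeans</code>), the stopping rule, and the sample data are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict
from math import dist  # available in Python 3.8+


def map_phase(samples, centers):
    """Emit (index of nearest center, sample) pairs, mirroring the Mapper."""
    for sample in samples:
        nearest = min(range(len(centers)), key=lambda i: dist(sample, centers[i]))
        yield nearest, sample


def reduce_phase(pairs, centers):
    """Average the samples assigned to each center, mirroring the Reducer."""
    groups = defaultdict(list)
    for idx, sample in pairs:
        groups[idx].append(sample)
    new_centers = list(centers)  # a center with no samples keeps its position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(col) / len(members) for col in zip(*members))
    return new_centers


def kmeans(samples, centers, max_iter=20, tol=1e-9):
    """Driver loop: one map/reduce round per iteration until the centers settle."""
    for iteration in range(1, max_iter + 1):
        new_centers = reduce_phase(map_phase(samples, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration


# Made-up 2-D samples and starting centers, for illustration only
samples = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers, rounds = kmeans(samples, [(0.0, 0.0), (5.0, 5.0)])
print(centers, rounds)
```

On a real cluster, each pass through this loop would correspond to submitting one MapReduce job and writing the new centroids to HDFS for the next job to read, which is where the I/O cost noted above comes from.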

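=== 2. Optimized approach with Spark MLlib ===
Spark's in-memory computation is a better fit for iterative machine learning. In this setup Hadoop serves mainly as the data storage layer (HDFS).

==== Example: training logistic regression ====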
<syntaxhighlight lang="python">
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession

# Create the Spark session and read the data from HDFS
spark = SparkSession.builder.appName("HadoopML").getOrCreate()
df = spark.read.format("libsvm").load("hdfs://path/to/data")

# Train the model
lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(df)

# Print the coefficients
print("Coefficients: " + str(model.coefficients))
</syntaxhighlight>
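
To show why keeping the data in memory helps here, the sketch below trains a logistic regression model in plain Python, without Spark. Each of the <code>max_iter</code> iterations rescans the full dataset, which is the access pattern that benefits from in-memory storage. Plain batch gradient descent with an L2 penalty is used for simplicity and is not meant to reproduce MLlib's internal optimizer; the function names, the <code>step</code> parameter, and the toy data are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(rows, max_iter=10, reg_param=0.01, step=0.5):
    """Batch gradient descent with an L2 penalty; rows are (label, features)."""
    n = len(rows)
    n_features = len(rows[0][1])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(max_iter):  # every iteration rescans the full dataset
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for label, features in rows:
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Gradient of the mean log-loss plus (reg_param / 2) * ||weights||^2
        weights = [w - step * (g / n + reg_param * w) for w, g in zip(weights, grad_w)]
        bias -= step * grad_b / n
    return weights, bias


def predict(weights, bias, features):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Made-up, linearly separable toy data: (label, [feature1, feature2])
rows = [(0, [-1.0, -0.8]), (0, [-0.7, -1.1]), (1, [1.0, 0.8]), (1, [0.7, 1.1])]
weights, bias = train_logistic_regression(rows)
predictions = [predict(weights, bias, features) for _, features in rows]
print(weights, bias, predictions)
```

The <code>max_iter</code> and <code>reg_param</code> arguments play roles analogous to <code>maxIter</code> and <code>regParam</code> in the Spark example, although the exact objective and scaling used by MLlib may differ.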

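=== 3. Integration with dedicated tools ===
{| class="wikitable"
|+ Comparison of machine learning tools in the Hadoop ecosystem
! Tool !! Use case !! Characteristics
|-
| Apache Mahout || Classic MapReduce-based algorithms || Gradually being phased out
|-
| Spark MLlib || Iterative algorithms || Benefits from in-memory computation
|-
| TensorFlowOnSpark || Deep learning || Supports GPU scheduling
|}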