机器学习算法详解

一、监督学习算法

1. 线性模型

线性回归
逻辑回归
支持向量机（SVM）

2. 树模型

决策树
随机森林
XGBoost/LightGBM

3. 概率模型

朴素贝叶斯
高斯混合模型
隐马尔可夫模型

二、无监督学习算法

1. 聚类算法

K-means
DBSCAN
层次聚类

2. 降维算法

PCA
t-SNE
UMAP

三、算法实现示例

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 数据准备
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 特征标准化
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 逻辑回归
lr_model = LogisticRegression()
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_test_scaled)
print("逻辑回归结果:")
print(classification_report(y_test, lr_pred))

# 随机森林
rf_model = RandomForestClassifier()
rf_model.fit(X_train_scaled, y_train)
rf_pred = rf_model.predict(X_test_scaled)
print("
随机森林结果:")
print(classification_report(y_test, rf_pred))

四、模型评估

1. 分类问题评估指标

准确率（Accuracy）
精确率（Precision）
召回率（Recall）
F1分数

2. 回归问题评估指标

MSE（均方误差）
MAE（平均绝对误差）
R²分数

五、模型调优

1. 交叉验证

from sklearn.model_selection import cross_val_score

# 5折交叉验证
scores = cross_val_score(model, X, y, cv=5)
print("交叉验证分数:", scores.mean(), "±", scores.std())

2. 网格搜索

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None]
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("最佳参数:", grid_search.best_params_)

六、特征工程

1. 特征选择

方差选择
相关性分析
特征重要性

2. 特征创建

多项式特征
交互特征
时间特征

七、实战技巧

1. 处理不平衡数据

过采样（SMOTE）
欠采样
类别权重调整

2. 处理缺失值

均值/中位数填充
模型预测填充
特殊值填充

八、进阶主题

1. 集成学习

Bagging
Boosting
Stacking

2. 在线学习

增量学习
流式学习
概念漂移处理

PyTorch深度学习基础

深度学习架构详解