Machine Learning Workflow from A to Z

Machine Learning is more than building a model and running it. It is a multi-step process that runs from data collection through model deployment. In this post I walk through the complete process for developing a Machine Learning project.

Overview of the ML Workflow

A Machine Learning project typically goes through the following steps:

Problem Definition - define the problem to solve
Data Collection - gather the data
Data Exploration - explore the data
Data Preprocessing - clean and prepare the data
Feature Engineering - build informative features
Model Selection - choose a model
Model Training - train the model
Model Evaluation - evaluate the model
Model Deployment - deploy the model
Monitoring & Maintenance - monitor and maintain the model

1. Problem Definition

Identify the business problem

# Example: predicting house prices
problem_statement = """
Goal: Predict the price of a house from its characteristics
Input: Area, number of rooms, location, age of the house, etc.
Output: House price (regression problem)
Success Metric: RMSE < $50,000
"""

Translate it into an ML problem

# Identify the problem type
problem_type = "Regression"  # or Classification, Clustering
target_variable = "price"
features = ["area", "bedrooms", "location", "age"]
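The success metric from the problem statement can be written down as an explicit check, so that "good enough" is decided by code and not by eye. This is a minimal sketch, assuming prices are in dollars; the function names `rmse` and `meets_success_metric` and the sample numbers are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def meets_success_metric(y_true, y_pred, max_rmse=50_000):
    """Return True when RMSE is below the target from the problem statement."""
    return rmse(y_true, y_pred) < max_rmse

# Illustrative prices only, not real data
actual = [300_000, 450_000, 250_000]
predicted = [310_000, 420_000, 260_000]
print(rmse(actual, predicted))
print(meets_success_metric(actual, predicted))
```

Writing the target down this early makes it possible to reuse the same check in evaluation and in monitoring later on.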

2. Data Collection

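Collect data from multiple sources

import pandas as pd
import requests
import sqlite3

# From a CSV file
df_csv = pd.read_csv('house_data.csv')

# From an API
def fetch_data_from_api(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

# From a database
def fetch_from_database(query):
    conn = sqlite3.connect('database.db')
    try:
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()  # close the connection even if the query fails

# Placeholder URL and table name; replace them with your own sources.
# Assumes the API returns a list of records.
df_api = pd.DataFrame(fetch_data_from_api('https://example.com/api/houses'))
df_db = fetch_from_database('SELECT * FROM houses')

# Combine the data
df_combined = pd.concat([df_csv, df_api, df_db], ignore_index=True)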
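Concatenating sources that do not share the same columns silently produces NaN-filled columns, so it can help to check the schema first. The sketch below is one possible way to do that; `combine_sources` is a hypothetical helper and the frames are tiny made-up stand-ins for the CSV, API, and database results.

```python
import pandas as pd

def combine_sources(frames, required_columns):
    """Concatenate frames after checking each one has the required columns."""
    for i, frame in enumerate(frames):
        missing = set(required_columns) - set(frame.columns)
        if missing:
            raise ValueError(f"Source {i} is missing columns: {sorted(missing)}")
    # Keep only the shared columns, in one consistent order
    aligned = [frame[list(required_columns)] for frame in frames]
    return pd.concat(aligned, ignore_index=True)

# Made-up frames for illustration only
df_a = pd.DataFrame({"area": [80, 120], "price": [200_000, 310_000]})
df_b = pd.DataFrame({"price": [150_000], "area": [60], "source": ["api"]})
combined = combine_sources([df_a, df_b], ["area", "price"])
print(combined)
```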

3. Data Exploration

Explore the data

import matplotlib.pyplot as plt
import seaborn as sns

df = df_combined  # continue with the combined dataset

# Basic information
print("Dataset shape:", df.shape)
print("\nData types:")
print(df.dtypes)
print("\nMissing values:")
print(df.isnull().sum())

# Descriptive statistics
print("\nDescriptive statistics:")
print(df.describe())

# Distribution of the target variable
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
df['price'].hist(bins=50)
plt.title('Price Distribution')

plt.subplot(1, 2, 2)
df['price'].plot(kind='box')
plt.title('Price Box Plot')
plt.tight_layout()
plt.show()

Correlation analysis

# Correlation matrix (numeric columns only; text columns such as location are skipped)
correlation_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# Scatter plots
sns.pairplot(df[['price', 'area', 'bedrooms', 'age']])
plt.show()
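A heatmap shows every pairwise correlation, but for modelling the column of interest is usually the one for the target. The following sketch ranks features by the strength of their correlation with `price`; the data frame `df_demo` and its values are made up for illustration.

```python
import pandas as pd

# Made-up numbers for illustration only
df_demo = pd.DataFrame({
    "price": [200, 250, 300, 350, 400],
    "area": [50, 65, 80, 90, 110],
    "age": [30, 25, 18, 10, 2],
    "location": ["A", "B", "A", "C", "B"],  # non-numeric, excluded below
})

# Correlation of every numeric feature with the target, strongest first
target_corr = (
    df_demo.corr(numeric_only=True)["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative relationships (here, older houses being cheaper) near the top instead of burying them at the bottom.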

4. Data Preprocessing

Handling missing values

from sklearn.impute import SimpleImputer, KNNImputer

# Check for missing values
missing_data = df.isnull().sum()
print("Missing data:\n", missing_data)

# Three ways to handle missing values
# Option 1: drop the rows that contain missing values
df_dropna = df.dropna()

# Option 2: replace them with the column mean
imputer_mean = SimpleImputer(strategy='mean')
df['area'] = imputer_mean.fit_transform(df[['area']]).ravel()

# Option 3: use the KNN imputer (it only accepts numeric columns)
numeric_cols = df.select_dtypes(include='number').columns
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = df.copy()
df_imputed[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
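One detail the snippets above leave open is which rows the imputer learns from. A common precaution is to fit it on the training rows only and reuse the learned value on the test rows, so that no test information leaks into preprocessing. A minimal sketch with made-up `area` values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 'area' values; np.nan marks a missing entry
area_train = np.array([[50.0], [70.0], [np.nan], [90.0]])
area_test = np.array([[np.nan], [120.0]])

imputer = SimpleImputer(strategy="mean")
# Learn the fill value from the training rows only ...
area_train_filled = imputer.fit_transform(area_train)
# ... and reuse that same value on the test rows
area_test_filled = imputer.transform(area_test)

print(imputer.statistics_)        # mean of 50, 70, 90
print(area_test_filled.ravel())
```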

Handling outliers

from scipy import stats

# IQR method
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Z-score method
def remove_outliers_zscore(df, column, threshold=3):
    z_scores = stats.zscore(df[column])
    return df[abs(z_scores) < threshold]

# Apply
df_clean = remove_outliers_iqr(df, 'price')
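To make the IQR rule concrete, here is the same arithmetic worked through on five made-up prices, one of which is an obvious outlier. The numbers are illustrative only.

```python
import pandas as pd

# Made-up prices with one obvious outlier
prices = pd.Series([100, 110, 120, 130, 1000])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

kept = prices[(prices >= lower_bound) & (prices <= upper_bound)]
print(q1, q3, iqr)                # 110.0 130.0 20.0
print(lower_bound, upper_bound)   # 80.0 160.0
print(kept.tolist())              # [100, 110, 120, 130]
```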

5. Feature Engineering

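Creating new features

# Derive new features from existing ones
# Caution: price_per_sqft is computed from the target, so keep it for exploration only
df['price_per_sqft'] = df['price'] / df['area']
df['age_category'] = pd.cut(df['age'], bins=[0, 5, 10, 20, 100], labels=['New', 'Recent', 'Old', 'Very Old'], include_lowest=True)

# One-hot encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=['location', 'age_category'])

# Separate features from the target; drop the target-derived column to avoid leakage
X = df_encoded.drop(columns=['price', 'price_per_sqft'])
y = df_encoded['price']

# Hold out a test set before fitting any transformer
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training split only, then apply to both splits
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)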
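Encoding and scaling can also be bundled with the model in a single scikit-learn `Pipeline`, which guarantees that the same transformations are applied at training and at prediction time. The sketch below is one way to set that up, not the only one; the rows in `df_demo` are made up, and the column names follow the ones used in this article.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration only
df_demo = pd.DataFrame({
    "area": [50, 80, 120, 65, 100, 90],
    "bedrooms": [1, 2, 4, 2, 3, 3],
    "age": [30, 12, 3, 20, 8, 15],
    "location": ["A", "B", "A", "C", "B", "A"],
    "price": [150, 240, 400, 190, 310, 280],
})
X_demo = df_demo.drop(columns="price")
y_demo = df_demo["price"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["area", "bedrooms", "age"]),
    # handle_unknown="ignore" encodes unseen locations as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["location"]),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])

# fit() learns scaling and encoding only from the rows it is given
pipeline.fit(X_demo.iloc[:4], y_demo.iloc[:4])
predictions = pipeline.predict(X_demo.iloc[4:])
print(predictions)
```

A pipeline like this can be saved as one object, which removes the need to store the scaler and the model separately.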

Feature selection

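from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor

# Keep at most 10 features (k cannot exceed the number of columns)
n_keep = min(10, X_train.shape[1])

# SelectKBest, fitted on the training split only
selector = SelectKBest(score_func=f_regression, k=n_keep)
X_selected = selector.fit_transform(X_train, y_train)

# Recursive Feature Elimination
rf = RandomForestRegressor(random_state=42)
rfe = RFE(rf, n_features_to_select=n_keep)
X_rfe = rfe.fit_transform(X_train, y_train)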
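`fit_transform` returns a bare array, so the names of the surviving features are lost unless you ask for them. The sketch below recovers them with `get_support()`. The data is synthetic: the target is built from `area` and `age` on purpose, so those two columns should be the ones selected.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["area", "bedrooms", "age", "noise"],
)
# Synthetic target: driven by area and age, unrelated to the other columns
y_demo = 5.0 * X_demo["area"] - 3.0 * X_demo["age"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)

# get_support() returns a boolean mask over the input columns
selected = X_demo.columns[selector.get_support()].tolist()
print(selected)
```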

6. Model Selection

Comparing models

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(),
    'Lasso': Lasso(),
    'Random Forest': RandomForestRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(),
    'SVR': SVR()
}

# Cross-validation evaluation (scores are negated MSE, so flip the sign)
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cv_scores[name] = -scores.mean()
    print(f"{name}: MSE = {cv_scores[name]:.4f}")

# Pick the best model (lowest MSE)
best_model_name = min(cv_scores, key=cv_scores.get)
print(f"Best model: {best_model_name}")
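Cross-validation scores are easier to judge next to a trivial baseline. The sketch below compares a linear model with scikit-learn's `DummyRegressor`, which always predicts the training mean, and reports RMSE so the numbers are in the same unit as the target. The data is synthetic and `cv_rmse` is a hypothetical helper name.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([4.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

def cv_rmse(model, X, y):
    """Mean cross-validated RMSE (scikit-learn returns negated MSE)."""
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return float(np.sqrt(-neg_mse).mean())

baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"), X_demo, y_demo)
linear_rmse = cv_rmse(LinearRegression(), X_demo, y_demo)
print(baseline_rmse, linear_rmse)
```

A candidate model that cannot clearly beat the baseline is usually a sign of a problem in the data or the features, not in the choice of algorithm.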

7. Model Training

Hyperparameter tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best parameters:", best_params)
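The grid above has 3 x 3 x 3 = 27 combinations, which means 135 model fits with 5-fold cross-validation. `RandomizedSearchCV`, imported above, samples a fixed number of combinations instead. The following is a sketch on synthetic data with deliberately small settings so that it runs quickly; the ranges are illustrative, not recommendations.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.3, size=80)

# randint(low, high) samples integers from low up to high - 1
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": randint(2, 10),
    "min_samples_split": randint(2, 11),
}

random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,           # number of sampled combinations, not the full grid
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X_demo, y_demo)
print(random_search.best_params_)
```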

Training final model

# Train with the best parameters
final_model = RandomForestRegressor(**best_params)
final_model.fit(X_train, y_train)

8. Model Evaluation

Evaluating on the test set

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predictions
y_pred = final_model.predict(X_test)

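# Metrics
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5  # the success metric from the problem definition
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")

# Actual vs. predicted plot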
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs Predicted')
plt.show()
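Alongside the plot, the residuals (actual minus predicted) can be summarized numerically. A mean far from zero would suggest that the model systematically over- or under-predicts. The arrays below are made-up stand-ins for `y_test` and `y_pred`.

```python
import numpy as np

# Made-up values standing in for y_test and y_pred
y_true = np.array([200.0, 300.0, 400.0, 500.0])
y_hat = np.array([210.0, 290.0, 420.0, 470.0])

residuals = y_true - y_hat
print(residuals)                 # [-10.  10. -20.  30.]
print(residuals.mean())          # 2.5
print(np.abs(residuals).max())   # 30.0
```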

Feature importance

# Feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': final_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(10), x='importance', y='feature')
plt.title('Top 10 Feature Importance')
plt.show()

9. Model Deployment

Saving the model

import joblib
import json

# Save the model
joblib.dump(final_model, 'house_price_model.pkl')
joblib.dump(scaler, 'scaler.pkl')

# Save metadata
model_info = {
    'model_name': 'House Price Predictor',
    'version': '1.0',
    'features': list(X.columns),
    'performance': {
        'mse': mse,
        'mae': mae,
        'r2': r2
    }
}

with open('model_info.json', 'w') as f:
    json.dump(model_info, f, indent=2)
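Before shipping a saved model, it is worth confirming that the file loads back and produces the same predictions. The sketch below does a round trip through a temporary directory with a small stand-in model trained on synthetic data; the file name `model_demo.pkl` is made up.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = X_demo[:, 0] * 2.0 + 1.0

model_demo = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.pkl")
    joblib.dump(model_demo, path)
    reloaded = joblib.load(path)

# The reloaded model should reproduce the original predictions
same_predictions = np.allclose(model_demo.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```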

Creating an API

from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)

# Load model
model = joblib.load('house_price_model.pkl')
scaler = joblib.load('scaler.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    try:
        data = request.get_json()
        df = pd.DataFrame([data])
        df_scaled = scaler.transform(df)
        prediction = model.predict(df_scaled)[0]

        return jsonify({
            'prediction': float(prediction),
            'status': 'success'
        })
    except Exception as e:
        # Report failures with an error status code instead of the default 200
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 400

if __name__ == '__main__':
    app.run(debug=True)  # debug mode is for local development only
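The endpoint above passes whatever JSON it receives straight to the scaler. Checking the request body first gives callers a clearer error message. This is a minimal sketch of such a check in plain Python; `validate_payload` is a hypothetical helper, and the expected field names are taken from the `features` list in the problem definition.

```python
EXPECTED_FEATURES = ["area", "bedrooms", "location", "age"]  # from the problem definition

def validate_payload(data, expected=EXPECTED_FEATURES):
    """Return a list of problems found in the request body (empty if none)."""
    if not isinstance(data, dict):
        return ["request body must be a JSON object"]
    problems = []
    missing = [name for name in expected if name not in data]
    extra = [name for name in data if name not in expected]
    if missing:
        problems.append(f"missing fields: {missing}")
    if extra:
        problems.append(f"unexpected fields: {extra}")
    return problems

print(validate_payload({"area": 80, "bedrooms": 2, "location": "A", "age": 5}))
print(validate_payload({"area": 80}))
```

Inside the route, a non-empty result could be returned to the caller with a 400 status before any prediction is attempted.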

10. Monitoring & Maintenance

Model monitoring

import logging
from datetime import datetime
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Setup logging
logging.basicConfig(filename='model_monitor.log', level=logging.INFO)

# Default alert threshold: the RMSE < $50,000 target expressed as MSE
def monitor_model_performance(y_true, y_pred, threshold=50_000 ** 2):
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)

    # Log performance
    logging.info(f"{datetime.now()}: MSE={mse:.4f}, MAE={mae:.4f}")

    # Alert if performance degrades
    if mse > threshold:
        logging.warning(f"Model performance degraded: MSE={mse:.4f}")
        # Send alert email or notification
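In production, true prices often arrive one at a time and long after the prediction was made, so the error is commonly tracked over a sliding window of recent predictions. The class below is a small sketch of that idea using only the standard library; `RollingMSE` is a hypothetical name and the numbers are made up.

```python
from collections import deque

class RollingMSE:
    """Track mean squared error over the most recent `window` predictions."""

    def __init__(self, window=100):
        # deque with maxlen drops the oldest entry automatically
        self.squared_errors = deque(maxlen=window)

    def update(self, y_true, y_pred):
        """Record one prediction and return the current rolling MSE."""
        self.squared_errors.append((y_true - y_pred) ** 2)
        return self.value()

    def value(self):
        return sum(self.squared_errors) / len(self.squared_errors)

monitor = RollingMSE(window=3)
for actual, predicted in [(100, 110), (200, 190), (300, 330), (400, 400)]:
    current = monitor.update(actual, predicted)
print(current)
```

The rolling value can then be compared against the same threshold used for the alert.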

Model retraining

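from sklearn.base import clone

def retrain_model(new_data, current_mse):
    """Retrain the model on new data and keep it only if the test MSE improves"""
    # Load new data
    df_new = pd.read_csv(new_data)

    # Preprocess (preprocess_data is a placeholder for the steps in sections 4 and 5)
    X_new, y_new = preprocess_data(df_new)

    # Retrain a fresh copy so the serving model stays untouched until it is beaten
    candidate = clone(model)
    candidate.fit(X_new, y_new)

    # Evaluate on the held-out test set
    new_mse = mean_squared_error(y_test, candidate.predict(X_test))

    # Save only if the error went down (lower MSE is better)
    if new_mse < current_mse:
        joblib.dump(candidate, 'house_price_model.pkl')
        logging.info("Model retrained and saved")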

Best Practices

1. Version Control

# Use Git to track changes
git add .
git commit -m "Add ML workflow implementation"
git tag -a v1.0 -m "Initial ML model version"

2. Documentation

def train_model(X, y, model_type='random_forest'):
    """
    Train machine learning model

    Parameters:
    X (array-like): Training features
    y (array-like): Target variable
    model_type (str): Type of model to train

    Returns:
    Trained model object
    """
    # Implementation here
    pass

3. Testing

import unittest

class TestMLPipeline(unittest.TestCase):
    def test_data_preprocessing(self):
        # Test data preprocessing functions
        pass

    def test_model_training(self):
        # Test model training
        pass

    def test_model_prediction(self):
        # Test model predictions
        pass

if __name__ == '__main__':
    unittest.main()
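The test methods above are empty placeholders. As an illustration of what one of them might contain, here is a sketch that tests the IQR outlier filter from the preprocessing section. The function is repeated so the example is self-contained, the test data is made up, and the suite is run programmatically so that the script continues after the tests finish.

```python
import unittest

import pandas as pd

def remove_outliers_iqr(df, column):
    """Same IQR rule as in the preprocessing section."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[(df[column] >= q1 - 1.5 * iqr) & (df[column] <= q3 + 1.5 * iqr)]

class TestOutlierRemoval(unittest.TestCase):
    def test_extreme_value_is_dropped(self):
        df = pd.DataFrame({"price": [100, 110, 120, 130, 1000]})
        cleaned = remove_outliers_iqr(df, "price")
        self.assertEqual(cleaned["price"].tolist(), [100, 110, 120, 130])

    def test_clean_data_is_unchanged(self):
        df = pd.DataFrame({"price": [100, 110, 120, 130]})
        self.assertEqual(len(remove_outliers_iqr(df, "price")), 4)

# Run the suite programmatically so the script keeps going afterwards
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestOutlierRemoval)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```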

Conclusion

The Machine Learning workflow is a complex process that demands a solid understanding of both the technical and the business side. Following a structured workflow helps you:

Raise the project's chances of success
Debug and troubleshoot more easily
Reproduce results
Maintain and update the model more easily

Remember that ML is not a linear process but an iterative loop. You may need to return to earlier steps to improve your results.

Good luck with your Machine Learning projects! 🤖


Tân Đoàn