What is RobustScaler?

A few keywords in this post are kept in their original English form to make them easier to look up.

In practice, classification problems often involve imbalanced data. In such cases, one of the classes may contain far more samples than the others (for example, more than 90%). A model that reaches an accuracy above 90% then tells us very little. To handle this, we need techniques for rebalancing the data as well as evaluation metrics other than accuracy. Credit card fraud detection is a typical example of this problem.
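To make the point about accuracy concrete, here is a minimal sketch using only the standard library. The label counts are made up for illustration (17 fraud cases in 10,000 transactions) and the "model" is a hypothetical one that always predicts the majority class.

```python
# Hypothetical labels: 9,983 legitimate (0) and 17 fraudulent (1) transactions.
y_true = [0] * 9983 + [1] * 17

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Share of fraudulent transactions that were actually flagged.
frauds_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_detection_rate = frauds_caught / sum(y_true)

print(f"Accuracy: {accuracy:.2%}")                    # Accuracy: 99.83%
print(f"Fraud detected: {fraud_detection_rate:.2%}")  # Fraud detected: 0.00%
```

The accuracy looks excellent even though not a single fraud case is caught, which is why accuracy alone is not a useful metric here.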

Data description

This dataset comes from Kaggle. In reality, fraudulent credit card transactions make up only a very small percentage compared to legitimate ones. We will see this clearly once we dig into the analysis.

First, let's look at a brief description of the data.

Class: 0 - legitimate, 1 - fraud
Amount: the transaction amount
V1, V2, ..., V28: For confidentiality reasons the original features are hidden behind these names. They are the result of a PCA (Principal Component Analysis) transformation.
Time: The time of each transaction in seconds, measured from the first transaction in the dataset.

OK, time to gear up:

    # Import libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Jupyter magic: render plots inline (only valid inside a notebook)
    %matplotlib inline

And take a look at the data:

    data = pd.read_csv('creditcard.csv')
    data.head()

Check whether any values are missing:

    data.isnull().values.any()  # Returns False

Visualizing the data

We compute the share of each class, legitimate and fraud, as a percentage to see how skewed the data is:

    # Check the ratio between classes
    class_counts = data['Class'].value_counts()
    percentage_fraud = round((class_counts[1] / len(data)) * 100, 2)
    percentage_no_fraud = round((class_counts[0] / len(data)) * 100, 2)
    print('Percentage Fraud transactions: ', percentage_fraud)
    print('Percentage No-fraud transactions: ', percentage_no_fraud)

Fraudulent transactions account for only 0.17% of the dataset.

And here is the corresponding plot:

    plt.figure(figsize=(7, 7))
    sns.set(style="darkgrid")
    sns.countplot(x="Class", data=data)

Normalization

Most of the features, V1 through V28, are the result of a PCA transformation.
However, two features, Amount and Time, still need to be scaled.

    from sklearn.preprocessing import RobustScaler

    rob_scaler = RobustScaler()

    # fit_transform refits the scaler on every call, so one instance can be reused.
    # ravel() flattens the (n, 1) result back into a single column.
    data['scaled_amount'] = rob_scaler.fit_transform(data['Amount'].values.reshape(-1, 1)).ravel()
    data['scaled_time'] = rob_scaler.fit_transform(data['Time'].values.reshape(-1, 1)).ravel()

    # Get rid of Time and Amount
    data.drop(['Time', 'Amount'], axis=1, inplace=True)

    # Let's look at the data again!
    data.head()
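As a rough illustration of what RobustScaler does with its default settings, the sketch below reproduces the idea by hand: subtract the median and divide by the interquartile range (IQR). The `robust_scale` helper and the sample amounts are hypothetical and only meant to show the arithmetic; the real class has more options, and its quantile computation may differ in edge cases.

```python
import statistics

def robust_scale(values):
    """Scale values as (x - median) / IQR, the idea behind RobustScaler's defaults."""
    median = statistics.median(values)
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr for x in values]

amounts = [1, 2, 3, 4, 100]  # hypothetical amounts with one extreme value
scaled = robust_scale(amounts)
print(scaled)  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

Because the median and the IQR barely react to the single extreme value, the ordinary amounts stay within a small range around zero while the outlier remains clearly visible. A mean-and-standard-deviation scaler would instead let the outlier shift and compress all the other values.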

Resampling Techniques

Among the techniques used to deal with imbalanced data, data resampling is one of the most common. The idea is to "rebalance" the training data before using it to train the model.

Some basic data resampling techniques:

Random undersampling
Random oversampling
SMOTE (Synthetic Minority Over-sampling Technique)
...

The figure above illustrates the first two techniques, undersampling and oversampling. Either we randomly sample from the majority class until it matches the size of the minority class, or we repeat samples of the minority class until it matches the size of the majority class.

However, both techniques have weaknesses. With oversampling, repeating data this way can lead to overfitting, while with undersampling we throw away a large amount of majority-class data.

Within the scope of this post we will apply the two most basic techniques, Random Oversampling and Random Undersampling. If you want to go deeper, have a look at SMOTE.
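For readers curious about SMOTE, the core idea can be sketched in a few lines: instead of copying minority samples, create new points on the line segment between a minority sample and one of its nearest minority neighbours. The function below is a simplified, hypothetical illustration using only the standard library, not the reference implementation; the name `smote_sketch`, its parameters, and the sample points are all assumptions.

```python
import math
import random

def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority points by interpolating towards near neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two features each.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
new_points = smote_sketch(minority_points, n_synthetic=10)
print(len(new_points))  # 10
```

Since every synthetic point is a blend of two real minority samples, the new data stays inside the region the minority class already occupies, while avoiding exact duplicates.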

Important note: the test set must be created before applying resampling. In other words, the test set must come from the original (imbalanced) dataset, so that we can evaluate the model on realistic data.
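A second reason the order matters is leakage: if we oversample first and split afterwards, copies of the same row can end up in both the training and the test set. The sketch below shows this on a tiny, made-up dataset of (row_id, label) pairs. The helper names are hypothetical, and the oversampling here simply cycles through the minority rows so that the example is deterministic.

```python
import random

rng = random.Random(0)

# Hypothetical dataset: (row_id, label) with 95 legitimate and 5 fraudulent rows.
rows = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

def oversample(data):
    """Repeat minority rows until both classes have the same size."""
    majority = [r for r in data if r[1] == 0]
    minority = [r for r in data if r[1] == 1]
    repeated = [minority[i % len(minority)] for i in range(len(majority))]
    return majority + repeated

def stratified_split(data, rng, test_fraction=0.3):
    """Shuffle each class separately and cut it into train and test parts."""
    train, test = [], []
    for label in (0, 1):
        group = [r for r in data if r[1] == label]
        rng.shuffle(group)
        cut = int(len(group) * test_fraction)
        test += group[:cut]
        train += group[cut:]
    return train, test

def shared_ids(train, test):
    """Row ids that appear in both the training and the test set."""
    return {r[0] for r in train} & {r[0] for r in test}

# Wrong order: resample first, then split. Copies of one row land on both sides.
train_a, test_a = stratified_split(oversample(rows), rng)
leaked = shared_ids(train_a, test_a)

# Right order: split first, then resample only the training part.
train_b, test_b = stratified_split(rows, rng)
train_b = oversample(train_b)
clean = shared_ids(train_b, test_b)

print(len(leaked) > 0)  # True
print(len(clean))       # 0
```

With the wrong order, the model is partly tested on rows it has already seen during training, so the measured scores would be too optimistic.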

    # sklearn.cross_validation was removed from scikit-learn;
    # train_test_split now lives in sklearn.model_selection.
    from sklearn.model_selection import train_test_split

    X = data.drop('Class', axis=1)
    y = data['Class']

    # Whole dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

Reducing the dataset size

Because this dataset is quite large, I reduce the training data to 100,000 rows to save computation time, while keeping the original imbalanced class distribution.

    training_data = pd.concat([X_train, y_train], axis=1)
    print(training_data['Class'].value_counts())

    print('Percentage original fraud: ', percentage_fraud)
    print('Percentage original no-fraud: ', percentage_no_fraud)

    number_of_instances = 100000

    # We will obtain at most 100,000 data instances with the same class ratio as the original data.
    # Therefore, the new data will have 0.17% fraud and 99.83% non-fraud out of 100,000,
    # which means 170 fraud transactions and 99,830 non-fraud transactions.
    # round() guards against floating-point results such as 169.99999.
    number_sub_fraud = int(round(percentage_fraud / 100 * number_of_instances))
    number_sub_non_fraud = int(round(percentage_no_fraud / 100 * number_of_instances))

    sub_fraud_data = training_data[training_data['Class'] == 1].head(number_sub_fraud)
    sub_non_fraud_data = training_data[training_data['Class'] == 0].head(number_sub_non_fraud)

    print('Number of newly sub fraud data:', len(sub_fraud_data))
    print('Number of newly sub non-fraud data:', len(sub_non_fraud_data))

    sub_training_data = pd.concat([sub_fraud_data, sub_non_fraud_data], axis=0)
    print(sub_training_data['Class'].value_counts())

Building the training set

    X_train_sub = sub_training_data.drop('Class', axis=1)
    y_train_sub = sub_training_data['Class']
    print(y_train_sub.value_counts())

Creating a training set with undersampling

To keep things simple, we use DataFrame.sample() to randomly sample each class:

    # Fraud/non-fraud data
    fraud_data = training_data[training_data['Class'] == 1]
    non_fraud_data = training_data[training_data['Class'] == 0]

    # Number of fraud, non-fraud transactions
    number_records_fraud = len(fraud_data)
    number_records_non_fraud = len(non_fraud_data)

    # Keep only as many non-fraud rows as there are fraud rows
    under_sample_non_fraud = non_fraud_data.sample(number_records_fraud)
    under_sample_data = pd.concat([under_sample_non_fraud, fraud_data], axis=0)

    # Showing ratio
    print("Percentage of normal transactions: ", len(under_sample_data[under_sample_data.Class == 0]) / len(under_sample_data))
    print("Percentage of fraud transactions: ", len(under_sample_data[under_sample_data.Class == 1]) / len(under_sample_data))
    print("Total number of transactions in resampled data: ", len(under_sample_data))

    # Assigning X, y for the under-sampled data
    X_train_undersample = under_sample_data.drop('Class', axis=1)
    y_train_undersample = under_sample_data['Class']

    # Plot countplot
    plt.figure(figsize=(7, 7))
    sns.set(style="darkgrid")
    sns.countplot(x="Class", data=under_sample_data)

Creating a training set with oversampling

I do the same with the oversampling technique.

    # Fraud/non-fraud data
    fraud_data = sub_training_data[sub_training_data['Class'] == 1]
    non_fraud_data = sub_training_data[sub_training_data['Class'] == 0]

    # Number of fraud, non-fraud transactions
    number_records_fraud = len(fraud_data)
    number_records_non_fraud = len(non_fraud_data)

    # With replacement, since we take a larger sample than the population
    over_sample_fraud = fraud_data.sample(number_records_non_fraud, replace=True)
    over_sample_data = pd.concat([over_sample_fraud, non_fraud_data], axis=0)

    # Showing ratio
    print("Percentage of normal transactions: ", len(over_sample_data[over_sample_data.Class == 0]) / len(over_sample_data))
    print("Percentage of fraud transactions: ", len(over_sample_data[over_sample_data.Class == 1]) / len(over_sample_data))
    print("Total number of transactions in resampled data: ", len(over_sample_data))

    # Assigning X, y for the over-sampled dataset
    X_train_oversample = over_sample_data.drop('Class', axis=1)
    y_train_oversample = over_sample_data['Class']

    # Plot countplot
    plt.figure(figsize=(7, 7))
    sns.set(style="darkgrid")
    sns.countplot(x="Class", data=over_sample_data)

Evaluating a model on imbalanced data

As mentioned above, in this example we cannot use accuracy (accuracy score) to evaluate the model. If the model predicted every transaction as legitimate, its accuracy would still reach 99.83%, because 99.83% of transactions really are legitimate.
So in this case we focus on the recall score, which tells us what percentage of fraudulent transactions we manage to detect.
Recall is computed from the True Positives and False Negatives of the confusion matrix.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
TP: True Positives
FP: False Positives
FN: False Negatives

Where, taking fraud as the positive class:

TP: How many fraudulent transactions are detected
FP: How many normal transactions are flagged as fraud
TN: How many normal transactions are correctly classified
FN: How many fraudulent transactions are treated as normal
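As a small worked example of these formulas, the sketch below counts TP, FP, TN and FN from two hand-written label lists and derives precision and recall from them. Fraud (class 1) is taken as the positive class, so a fraud case predicted as normal counts as a false negative. The labels and the `confusion_counts` helper are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) with class 1 (fraud) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # fraud, flagged as fraud
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # normal, flagged as fraud
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # normal, left alone
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # fraud, missed
    return tp, fp, tn, fn

# Hypothetical labels: 4 fraud cases and 6 normal transactions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, tn, fn)     # 3 1 5 1
print(precision, recall)  # 0.75 0.75
```

Here one fraud case is missed (FN = 1) and one normal transaction raises a false alarm (FP = 1), so both precision and recall come out at 3/4.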

The model can make two kinds of wrong predictions:

Type 1 Error: False Positive
Type 2 Error: False Negative

Both are wrong predictions, but we always prioritize avoiding Type 2 over Type 1. A Type 1 error occurs when we treat a normal transaction as fraud; like a false alarm, it does far less damage than a Type 2 error, which occurs when we miss a fraudulent transaction. That is why we still try to maximize recall, even though raising recall may lower precision.
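One way to see the trade-off between recall and precision is to vary the score threshold above which a transaction is flagged as fraud. This is only an illustration and not something the experiments in this post do: the scores, labels and thresholds below are invented, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(labels, scores, threshold):
    """Flag every score >= threshold as fraud and return (precision, recall)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(labels, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(labels, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(labels, preds))
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical model scores (higher means more suspicious) and true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.35, 0.30, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

strict = precision_recall(labels, scores, threshold=0.5)
lenient = precision_recall(labels, scores, threshold=0.25)

print(f"threshold 0.50: precision={strict[0]:.2f}, recall={strict[1]:.2f}")
# threshold 0.50: precision=0.75, recall=0.75
print(f"threshold 0.25: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
# threshold 0.25: precision=0.57, recall=1.00
```

Lowering the threshold catches every fraud case (recall rises to 1.00) at the price of more false alarms (precision drops to 0.57).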

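Applying machine learning models

In this post we apply SVM and Logistic Regression, and compare the effect of the two resampling techniques against using no resampling at all.

Evaluating models trained on the original data without resampling

    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    # Assumption: the post does not show how the models were created,
    # so both are instantiated here with default hyperparameters.
    svm = SVC()
    lr = LogisticRegression()

    # SVM
    svm.fit(X_train_sub, y_train_sub)

    # Logistic Regression
    lr.fit(X_train_sub, y_train_sub)

    # Note: We should test on the original skewed test set
    predictions_svm = svm.predict(X_test)
    predictions_lr = lr.predict(X_test)

    # Compute confusion matrix (rows: true class, columns: predicted class)
    cnf_matrix_svm = confusion_matrix(y_test, predictions_svm)
    cnf_matrix_lr = confusion_matrix(y_test, predictions_lr)

    # Recall = TP / (FN + TP), both taken from the row of true fraud cases
    recall_svm = cnf_matrix_svm[1, 1] / (cnf_matrix_svm[1, 0] + cnf_matrix_svm[1, 1])
    recall_lr = cnf_matrix_lr[1, 1] / (cnf_matrix_lr[1, 0] + cnf_matrix_lr[1, 1])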

For undersampling and oversampling, we train the models on the two training sets created in the previous sections.
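The post does not show the code for this step, so the following is only a self-contained sketch of what such a comparison could look like. It makes several assumptions: it uses synthetic data from `make_classification` instead of the credit card dataset, it resamples with NumPy index arrays, and it trains only Logistic Regression for brevity. Its numbers will therefore not match the results reported in this post.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit card data (assumption): about 5% positives.
X, y = make_classification(n_samples=4000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

rng = np.random.default_rng(0)
fraud_idx = np.flatnonzero(y_train == 1)
normal_idx = np.flatnonzero(y_train == 0)

# Undersampling: keep as many majority rows as there are minority rows.
under_idx = np.concatenate(
    [fraud_idx, rng.choice(normal_idx, size=len(fraud_idx), replace=False)])
# Oversampling: draw minority rows with replacement up to the majority size.
over_idx = np.concatenate(
    [normal_idx, rng.choice(fraud_idx, size=len(normal_idx), replace=True)])

training_sets = {
    "original": (X_train, y_train),
    "undersampled": (X_train[under_idx], y_train[under_idx]),
    "oversampled": (X_train[over_idx], y_train[over_idx]),
}

recalls = {}
for name, (X_tr, y_tr) in training_sets.items():
    model = LogisticRegression(max_iter=1000)
    model.fit(X_tr, y_tr)
    # Always evaluate on the untouched, imbalanced test set.
    recalls[name] = recall_score(y_test, model.predict(X_test))

for name, value in recalls.items():
    print(f"{name}: recall = {value:.2%}")
```

The important detail is that only the training data is resampled; every model is scored on the same imbalanced test set.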

Recall Scores

Here are the recall scores for the three cases:

                         Original Data (Imbalanced)   Undersampled Data   Oversampled Data
    SVM                  47.97 %                      89.86 %             55.4 %
    Logistic Regression  63.51 %                      89.86 %             89.18 %

When the models are trained directly on the original data, the results are fairly modest. Undersampling gives much more impressive results for both SVM and Logistic Regression. However, only Logistic Regression keeps its good performance when oversampling is applied.

References

Kaggle Kernel: Dealing with Imbalanced Dataset
Kaggle Kernel: In depth skewed data