In today's competitive business environment, companies are increasingly focusing on personalized marketing strategies to engage clients more effectively.
The marketing department in question has conducted an empirical analysis involving the implementation of four diverse campaigns targeted at a specific client base. As a result of this analysis, the clients have been stratified into four distinct segments.
The primary goal is to create a classification algorithm capable of identifying the appropriate client segment.
To meet the marketing team's requirements, the classifier must achieve a minimum accuracy threshold of 50%.
This requirement is crucial in ensuring that the model demonstrates a sufficient level of performance, thereby enabling the marketing team to customize their future campaigns more effectively and allocate resources in a more optimized manner.
The Var_1 column comes with no further description.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.figure import SubplotParams
import seaborn as sns
import random as rnd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
df = pd.read_csv('customer segments.csv')
print(f'rows: {df.shape[0]},\tcolumns: {df.shape[1]}')
pd.set_option('display.max_columns', None)
df
rows: 8068, columns: 11
 | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation
---|---|---|---|---|---|---|---|---|---|---|---
0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | D |
1 | 462643 | Female | Yes | 38 | Yes | Engineer | NaN | Average | 3.0 | Cat_4 | A |
2 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | B |
3 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | B |
4 | 462669 | Female | Yes | 40 | Yes | Entertainment | NaN | High | 6.0 | Cat_6 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8063 | 464018 | Male | No | 22 | No | NaN | 0.0 | Low | 7.0 | Cat_1 | D |
8064 | 464685 | Male | No | 35 | No | Executive | 3.0 | Low | 4.0 | Cat_4 | D |
8065 | 465406 | Female | No | 33 | Yes | Healthcare | 1.0 | Low | 1.0 | Cat_6 | D |
8066 | 467299 | Female | No | 27 | Yes | Healthcare | 1.0 | Low | 4.0 | Cat_6 | B |
8067 | 461879 | Male | Yes | 37 | Yes | Executive | 0.0 | Average | 3.0 | Cat_4 | B |
8068 rows × 11 columns
Upon importing the data, it was discovered that the "ID" column consists of unique values. Therefore, it was decided to set this column as the index of the DataFrame.
# Setting ID as index
print('Is ID unique:', df['ID'].is_unique)
df = df.set_index('ID')
Is ID unique: True
To avoid data leakage during preprocessing and model training, the DataFrame was split into training and test sets up front, with the features and the target variable (Segmentation) separated at the same time.
# Split features and target; the training features stay in `df` for the EDA below
df, X_test, y_train, y_test = train_test_split(df.drop('Segmentation', axis=1), df['Segmentation'], test_size=0.2)
df
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
460776 | Female | Yes | 47 | Yes | NaN | 2.0 | Low | 1.0 | NaN |
460232 | Male | No | 22 | No | Healthcare | 8.0 | Low | 4.0 | Cat_6 |
459351 | Female | Yes | 57 | Yes | Artist | 0.0 | High | NaN | Cat_1 |
461051 | Female | Yes | 71 | Yes | Entertainment | 1.0 | Average | 2.0 | Cat_6 |
466656 | Male | Yes | 89 | Yes | Marketing | 1.0 | High | 2.0 | Cat_6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
467479 | Male | Yes | 35 | Yes | Executive | 1.0 | Average | 4.0 | Cat_6 |
462151 | Female | No | 35 | Yes | Artist | 8.0 | Low | 1.0 | Cat_6 |
467459 | Male | Yes | 40 | Yes | Executive | NaN | High | 5.0 | Cat_6 |
463288 | Female | NaN | 59 | Yes | Artist | 9.0 | Average | 2.0 | Cat_6 |
460651 | Male | No | 32 | No | Homemaker | 8.0 | Low | NaN | Cat_4 |
6454 rows × 9 columns
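Note that the split above is neither seeded nor stratified, so the exact composition of the two sets (and the figures reported below) will vary from run to run. A reproducible variant, applied to the original DataFrame before the split above, is sketched here with an arbitrary seed and stratification on the target; this is an illustration, not what was run in this notebook:
# Hypothetical variant: seed the split and stratify on the target so the four
# segments keep the same proportions in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('Segmentation', axis=1), df['Segmentation'],
    test_size=0.2, stratify=df['Segmentation'], random_state=42)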
# Missing values percentages for each column
(df.isnull().sum()/len(df)).sort_values(ascending=False)
Work_Experience    0.100868
Family_Size        0.039820
Ever_Married       0.016579
Profession         0.014565
Graduated          0.009916
Var_1              0.009452
Gender             0.000000
Age                0.000000
Spending_Score     0.000000
dtype: float64
The columns Work_Experience, Family_Size, Ever_Married, Profession, Var_1, and Graduated exhibit missing values. Considering the dataset's limited size, eliminating the missing values is not a preferable option. Instead, these values must be replaced in order to preserve the dataset's integrity.
# Show 10 sample rows with missing values in each column
for column in df.columns:
    if df[column].isna().sum() > 0:
        print(f"Samples with missing values in column '{column}':")
        display(df[df[column].isna()].sample(10))
Samples with missing values in column 'Ever_Married':
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
460305 | Female | NaN | 49 | Yes | Entertainment | 1.0 | High | 1.0 | Cat_6 |
459062 | Male | NaN | 48 | Yes | Executive | NaN | High | 5.0 | Cat_6 |
461989 | Female | NaN | 57 | Yes | Engineer | 0.0 | Average | 4.0 | Cat_2 |
465987 | Male | NaN | 20 | No | Healthcare | 1.0 | Low | 3.0 | Cat_2 |
460349 | Male | NaN | 35 | Yes | Artist | 3.0 | Average | 2.0 | Cat_6 |
460516 | Female | NaN | 85 | No | Lawyer | 0.0 | High | 1.0 | Cat_3 |
465295 | Female | NaN | 37 | Yes | Doctor | 8.0 | Average | 1.0 | Cat_6 |
463387 | Male | NaN | 20 | No | Marketing | 3.0 | Low | 2.0 | Cat_3 |
466026 | Female | NaN | 49 | No | Entertainment | 0.0 | Low | 1.0 | Cat_3 |
466795 | Male | NaN | 55 | Yes | Entertainment | NaN | Average | 5.0 | Cat_6 |
Samples with missing values in column 'Graduated':
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
464321 | Female | No | 28 | NaN | Engineer | 0.0 | Low | 3.0 | Cat_3 |
461614 | Female | No | 25 | NaN | Healthcare | 1.0 | Low | 1.0 | Cat_4 |
460210 | Male | No | 29 | NaN | Entertainment | 8.0 | Low | 9.0 | Cat_6 |
462132 | Male | Yes | 47 | NaN | Artist | 4.0 | Low | 1.0 | Cat_6 |
466041 | Female | Yes | 52 | NaN | Engineer | 1.0 | Low | 1.0 | Cat_4 |
463351 | Female | No | 22 | NaN | Healthcare | 3.0 | Low | 4.0 | Cat_5 |
464257 | Male | Yes | 35 | NaN | Entertainment | 1.0 | High | 2.0 | Cat_6 |
465803 | Male | Yes | 37 | NaN | NaN | 2.0 | High | 5.0 | Cat_6 |
464830 | Male | NaN | 47 | NaN | Executive | 1.0 | Average | 5.0 | Cat_4 |
459518 | Male | Yes | 60 | NaN | NaN | NaN | Average | 4.0 | Cat_6 |
Samples with missing values in column 'Profession':
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
465820 | Female | NaN | 30 | Yes | NaN | NaN | Average | 4.0 | Cat_4 |
467160 | Male | Yes | 40 | No | NaN | 0.0 | Average | 2.0 | Cat_3 |
459197 | Female | Yes | 39 | Yes | NaN | 0.0 | Average | 2.0 | Cat_6 |
460076 | Female | No | 49 | Yes | NaN | 0.0 | Low | 2.0 | Cat_6 |
460998 | Male | No | 27 | Yes | NaN | 0.0 | Low | 3.0 | Cat_3 |
467384 | Male | No | 42 | Yes | NaN | NaN | Low | 1.0 | Cat_6 |
465528 | Female | Yes | 46 | Yes | NaN | 0.0 | Low | 2.0 | Cat_6 |
464933 | Male | No | 25 | No | NaN | 0.0 | Low | 1.0 | Cat_4 |
459426 | Male | No | 23 | No | NaN | 1.0 | Low | 4.0 | Cat_6 |
465803 | Male | Yes | 37 | NaN | NaN | 2.0 | High | 5.0 | Cat_6 |
Samples with missing values in column 'Work_Experience':
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
467499 | Male | Yes | 40 | Yes | Artist | NaN | Low | 3.0 | Cat_6 |
462909 | Female | No | 36 | No | Engineer | NaN | Low | 1.0 | Cat_6 |
459645 | Male | Yes | 43 | Yes | Artist | NaN | Average | 2.0 | Cat_6 |
462691 | Male | Yes | 25 | No | Doctor | NaN | Average | 2.0 | Cat_4 |
463792 | Female | No | 38 | No | Engineer | NaN | Low | 1.0 | Cat_6 |
462896 | Male | Yes | 58 | Yes | Entertainment | NaN | Low | 2.0 | Cat_6 |
462155 | Female | Yes | 57 | Yes | Artist | NaN | High | 3.0 | Cat_6 |
467480 | Male | Yes | 35 | Yes | Artist | NaN | Average | NaN | Cat_6 |
467843 | Female | No | 23 | No | Healthcare | NaN | Low | 4.0 | Cat_6 |
463570 | Male | Yes | 35 | Yes | Artist | NaN | Average | 4.0 | Cat_6 |
Samples with missing values in column 'Family_Size':
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
462619 | Male | NaN | 61 | No | NaN | 0.0 | High | NaN | Cat_4 |
466703 | Female | Yes | 46 | Yes | Homemaker | 0.0 | Average | NaN | Cat_6 |
466604 | Male | No | 32 | Yes | Healthcare | 1.0 | Low | NaN | Cat_6 |
461411 | Male | No | 29 | No | Healthcare | 1.0 | Low | NaN | Cat_6 |
462273 | Female | No | 41 | Yes | Entertainment | 1.0 | Low | NaN | Cat_6 |
461758 | Male | No | 33 | Yes | Healthcare | 8.0 | Low | NaN | Cat_3 |
462093 | Male | No | 33 | Yes | Artist | 1.0 | Low | NaN | Cat_4 |
461074 | Male | Yes | 25 | No | Entertainment | NaN | Average | NaN | Cat_3 |
459586 | Male | Yes | 73 | No | Executive | 1.0 | Low | NaN | Cat_6 |
459875 | Male | Yes | 71 | No | Lawyer | 0.0 | High | NaN | Cat_6 |
Samples with missing values in column 'Var_1':
ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1
---|---|---|---|---|---|---|---|---|---
462176 | Male | Yes | 29 | No | Executive | 0.0 | Average | 2.0 | NaN |
464055 | Male | Yes | 32 | Yes | Doctor | 0.0 | Low | NaN | NaN |
459083 | Male | Yes | 58 | Yes | Artist | 8.0 | Average | 3.0 | NaN |
467571 | Male | No | 23 | No | Marketing | 0.0 | Low | NaN | NaN |
465293 | Female | Yes | 47 | Yes | Artist | 1.0 | High | 5.0 | NaN |
466143 | Female | Yes | 48 | No | Engineer | 0.0 | Average | 5.0 | NaN |
459762 | Female | Yes | 45 | Yes | Artist | 8.0 | Average | 2.0 | NaN |
461282 | Male | No | 21 | No | Healthcare | 0.0 | Low | 4.0 | NaN |
462262 | Male | No | 27 | No | Healthcare | 1.0 | Low | 3.0 | NaN |
462686 | Male | Yes | 41 | Yes | Entertainment | 9.0 | Average | 4.0 | NaN |
After analyzing the data, it was observed that age is a good indicator of marital status: male and female customers show the same pattern, with the mode of Ever_Married being 'No' for clients aged 35 or younger and 'Yes' for those over 35.
To address the missing values in the Ever_Married column, the following approach was taken:
# Create age groups
age_groups = pd.cut(df["Age"], [18, 19, 25, 35, 50, float("inf")])
# Group the data by "Gender" and age groups, then calculate the mode of "Ever_Married"
marital_mode_by_gender_age = df.groupby(["Gender", age_groups])["Ever_Married"].apply(lambda x: x.mode()[0])
# Print the resulting Series
print(marital_mode_by_gender_age)
Gender | (18.0, 19.0] | (19.0, 25.0] | (25.0, 35.0] | (35.0, 50.0] | (50.0, inf]
---|---|---|---|---|---
Female | No | No | No | Yes | Yes
Male | No | No | No | Yes | Yes
# Replace missing values in "Ever_Married" based on "Age", using mode
def fill_missing_marital(df):
    df.loc[(df['Ever_Married'].isnull()) & (df['Age'] < 36), 'Ever_Married'] = 'No'
    df.loc[(df['Ever_Married'].isnull()) & (df['Age'] >= 36), 'Ever_Married'] = 'Yes'
    return df
Given the lack of domain expertise or information to determine a client's profession, the decision was made to introduce a new category, 'unknown', to replace the missing values in the Profession column. This approach avoids generating false patterns in the dataset.
# Replace missing values in "Profession" with "unknown"
def fill_missing_profession(df):
    df["Profession"] = df["Profession"].fillna("unknown")
    return df
Upon examining the dataset, missing values in the Graduated column were replaced with the mode of each age group and profession combination. For the youngest age group (18-19), some combinations had no clients of the specific profession in that age range; those missing values were replaced with 'No', since the mode of all other professions within that range was 'No'.
age_groups = pd.cut(df["Age"], [18, 19, 25, 35, 50, float("inf")])
# Group the data by "Profession" and age groups, and calculate the mode of "Graduated"
graduated_mode_by_profession_age = df.groupby(["Profession", age_groups])["Graduated"]\
.apply(lambda x: x.mode().iloc[0])
# Print results
print(graduated_mode_by_profession_age)
graduated_mode_by_profession_age.fillna('No', inplace=True)
Profession | (18.0, 19.0] | (19.0, 25.0] | (25.0, 35.0] | (35.0, 50.0] | (50.0, inf]
---|---|---|---|---|---
Artist | NaN | Yes | Yes | Yes | Yes
Doctor | No | No | Yes | Yes | Yes
Engineer | No | No | No | Yes | No
Entertainment | No | No | Yes | Yes | Yes
Executive | No | No | No | Yes | Yes
Healthcare | No | No | Yes | Yes | Yes
Homemaker | NaN | No | No | Yes | Yes
Lawyer | NaN | No | No | Yes | Yes
Marketing | No | No | No | Yes | Yes
# Replace missing values in "Graduated" based on "Age" and "Profession", using mode
def fill_missing_graduated(df):
    df.loc[(df['Graduated'].isnull()) & (df['Age'] < 20), 'Graduated'] = 'No'
    df.loc[(df['Graduated'].isnull()) & (df['Profession'] == 'Artist'), 'Graduated'] = 'Yes'
    df.loc[(df['Graduated'].isnull()) & (df['Age'] < 35) & (df['Profession'] == 'Engineer'), 'Graduated'] = 'No'
    df.loc[(df['Graduated'].isnull()) & (df['Age'] < 35) & (df['Profession'] == 'Executive'), 'Graduated'] = 'No'
    df.loc[(df['Graduated'].isnull()) & (df['Age'] >= 35), 'Graduated'] = 'Yes'
    df['Graduated'] = df['Graduated'].fillna('No')
    return df
Addressing missing values in the Work_Experience column was challenging. After grouping the dataset by profession and age group, it was observed that the median is often 1.
After plotting the value counts for the whole dataset, it is clear that the column is dominated by two values, 0 and 1, each accounting for roughly 29% of the dataset; all other values have much lower representation.
Due to the lack of domain expertise and considering the distribution of values, a time-effective approach was chosen to randomly assign the missing values to either 0 or 1, which compose the majority of the dataset. More sophisticated approaches could be explored, but the chosen method is suitable for addressing the 10% missing values in the dataset.
age_groups = pd.cut(df["Age"], [18, 25, 35, float("inf")])
# Group the data by "Profession" and the age groups, and calculate the median of "Work_Experience"
# The length of the subgroups has been added to better understand the distribution and possible outliers
exp_median_by_profession_age = df.groupby(["Profession", age_groups])["Work_Experience"]\
.apply(lambda x: (x.median(), len(x)))
# Print results
print('\t\t\t\tmedian, subgroup size')
print(exp_median_by_profession_age)
plt.figure(figsize=(11,2))
sns.countplot(x=df['Work_Experience'])
plt.title('Work_Experience')
plt.show()
print('percentages')
print(df['Work_Experience'].value_counts()/len(df))
median, subgroup size

Profession | (18.0, 25.0] | (25.0, 35.0] | (35.0, inf]
---|---|---|---
Artist | (5.0, 31) | (2.0, 320) | (1.0, 1633)
Doctor | (1.0, 64) | (1.0, 215) | (1.0, 264)
Engineer | (1.0, 30) | (1.0, 149) | (1.0, 373)
Entertainment | (1.0, 45) | (1.0, 192) | (1.0, 530)
Executive | (1.0, 16) | (1.0, 55) | (1.0, 426)
Healthcare | (1.0, 439) | (1.0, 445) | (1.0, 105)
Homemaker | (4.0, 10) | (8.0, 94) | (6.0, 87)
Lawyer | (4.5, 2) | (0.5, 2) | (1.0, 507)
Marketing | (1.0, 48) | (1.0, 72) | (1.0, 101)
percentages
0.0     0.289898
1.0     0.289433
9.0     0.058878
8.0     0.058104
2.0     0.035482
3.0     0.033003
4.0     0.030834
7.0     0.025411
5.0     0.024481
6.0     0.024326
11.0    0.006663
10.0    0.006043
13.0    0.005733
14.0    0.005578
12.0    0.005268
Name: Work_Experience, dtype: float64
# Deeper analysis of value 8 and 9
print('df size:', len(df))
print('Experience = 8')
print(df[df['Work_Experience']==8]['Profession'].value_counts()/len(df))
print()
print('Experience = 9')
print(df[df['Work_Experience']==9]['Profession'].value_counts()/len(df))
df size: 6454
Experience = 8
Artist           0.017509
Healthcare       0.011776
Entertainment    0.006663
Homemaker        0.005113
Doctor           0.004958
Engineer         0.004338
Executive        0.002479
Marketing        0.002169
Lawyer           0.002014
Name: Profession, dtype: float64

Experience = 9
Artist           0.016424
Healthcare       0.009142
Entertainment    0.008212
Engineer         0.006508
Doctor           0.005888
Homemaker        0.005578
Executive        0.004183
Marketing        0.002014
Lawyer           0.000775
Name: Profession, dtype: float64
# Replace missing values in "Work_Experience" with random values 0 or 1
def fill_missing_working_exp(df):
    # uniform draws in [0, 1) rounded to the nearest integer yield 0 or 1 with ~equal probability
    df.loc[df['Work_Experience'].isnull(), 'Work_Experience'] = \
        np.random.rand(df['Work_Experience'].isnull().sum()).round()
    return df
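As a slightly more general alternative (an illustrative sketch, not the method used in this notebook), the replacements could instead be sampled from the empirical distribution of the observed values, preserving the full shape of the column:
# Hypothetical alternative: draw replacements from the observed distribution
def fill_missing_working_exp_sampled(df):
    observed = df['Work_Experience'].dropna().values
    n_missing = df['Work_Experience'].isnull().sum()
    df.loc[df['Work_Experience'].isnull(), 'Work_Experience'] = np.random.choice(observed, size=n_missing)
    return df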
To address missing values in the Family_Size column, the dataset was grouped by age groups, and missing values were replaced using the median of each group.
age_groups = pd.cut(df["Age"], [18, 19, 25, 35, 50, float("inf")])
# Group the data by age groups and calculate the median of "Family_Size"
famsize_median_by_age = df.groupby([age_groups])["Family_Size"]\
.apply(lambda x: x.median())
print(famsize_median_by_age)
Age
(18.0, 19.0]    4.0
(19.0, 25.0]    4.0
(25.0, 35.0]    3.0
(35.0, 50.0]    2.0
(50.0, inf]     2.0
Name: Family_Size, dtype: float64
# Replace missing values in "Family_Size" based on "Age", using median
def fill_missing_famsize(df):
    df.loc[(df['Family_Size'].isnull()) & (df['Age'] < 26), 'Family_Size'] = 4
    df.loc[(df['Family_Size'].isnull()) & (df['Age'] < 36), 'Family_Size'] = 3
    df.loc[(df['Family_Size'].isnull()), 'Family_Size'] = 2
    return df
Since the meaning of the values in the Var_1 column was not provided and the missing values account for only about 1% of the dataset, the decision was made to replace them with the mode of the entire dataset.
print(df['Var_1'].mode())
0    Cat_6
Name: Var_1, dtype: object
# Replace missing values in "Var_1" using mode
def fill_missing_var1(df):
    df['Var_1'] = df['Var_1'].fillna('Cat_6')
    return df
def fill_missing_values(df):
    df = fill_missing_marital(df)
    df = fill_missing_profession(df)
    df = fill_missing_graduated(df)
    df = fill_missing_working_exp(df)
    df = fill_missing_famsize(df)
    df = fill_missing_var1(df)
    return df
df = fill_missing_values(df)
# checking missing values percentage for each column
(df.isnull().sum()/len(df)).sort_values(ascending=False)
Gender             0.0
Ever_Married       0.0
Age                0.0
Graduated          0.0
Profession         0.0
Work_Experience    0.0
Spending_Score     0.0
Family_Size        0.0
Var_1              0.0
dtype: float64
Four columns in the dataset were transformed to more suitable data types to facilitate further analysis and model training. The following transformations were applied:
Age: The Age column was converted from an integer data type to a float. This transformation ensures better compatibility with some machine learning algorithms that may require float inputs for numerical features.
Gender: The Gender column was transformed into a boolean variable, where Male is represented by True (1) and Female by False (0). This binary encoding simplifies the data representation and can potentially improve model performance by reducing the complexity of the input features.
Ever_Married: The Ever_Married column was mapped to a boolean data type, with 'Yes' represented as True and 'No' as False. This transformation simplifies the data and helps the model to better understand the relationship between the feature and the target variable.
Graduated: The Graduated column was transformed into a boolean representation, with 'Yes' as True and 'No' as False. Similar to the other binary categorical features, this transformation makes it easier for the model to interpret the relationship between the feature and the target variable.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6454 entries, 460776 to 460651
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Gender           6454 non-null   object 
 1   Ever_Married     6454 non-null   object 
 2   Age              6454 non-null   int64  
 3   Graduated        6454 non-null   object 
 4   Profession       6454 non-null   object 
 5   Work_Experience  6454 non-null   float64
 6   Spending_Score   6454 non-null   object 
 7   Family_Size      6454 non-null   float64
 8   Var_1            6454 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 504.2+ KB
def transforming_columns_type(df):
    df['Age'] = df['Age'].astype(float)
    df['Gender'] = df['Gender'].map({'Male': True, 'Female': False})
    df['Ever_Married'] = df['Ever_Married'].map({'Yes': True, 'No': False})
    df['Graduated'] = df['Graduated'].map({'Yes': True, 'No': False})
    return df
df = transforming_columns_type(df)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6454 entries, 460776 to 460651
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Gender           6454 non-null   bool   
 1   Ever_Married     6454 non-null   bool   
 2   Age              6454 non-null   float64
 3   Graduated        6454 non-null   bool   
 4   Profession       6454 non-null   object 
 5   Work_Experience  6454 non-null   float64
 6   Spending_Score   6454 non-null   object 
 7   Family_Size      6454 non-null   float64
 8   Var_1            6454 non-null   object 
dtypes: bool(3), float64(3), object(3)
memory usage: 371.9+ KB
def preprocessing(df):
    df = fill_missing_values(df)
    df = transforming_columns_type(df)
    return df
The target variable, Segmentation, consists of four categories: A, B, C, and D. The distribution among these categories is relatively balanced, ranging from approximately 22% to 28% for each category.
# Segmentation countplot
sns.countplot(data=y_train.to_frame(), x='Segmentation')
print('Segments distribution')
plt.show()
print('Segments percentages')
print(y_train.value_counts().sort_index()/len(y_train))
Segments distribution
Segments percentages
A    0.242485
B    0.229935
C    0.243725
D    0.283855
Name: Segmentation, dtype: float64
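This near-balance puts the 50% accuracy requirement into perspective: a majority-class baseline that always predicts segment D would score only about 28%, so the threshold demands real signal from the features. A quick sanity check, sketched with scikit-learn's DummyClassifier (not part of the original analysis; the features are ignored by this strategy, so a dummy zero column suffices):
# Majority-class baseline: expected accuracy ~0.28 given the distribution above
from sklearn.dummy import DummyClassifier
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(np.zeros((len(y_train), 1)), y_train)
print('Baseline accuracy:', baseline.score(np.zeros((len(y_train), 1)), y_train))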
Analyzing the categorical features revealed the following insights:
non_float_cols = df.select_dtypes(exclude=['float']).columns
# Plot all categorical and boolean columns, add percentages
for col in non_float_cols:
    print(col.upper())
    print(col, 'percentages')
    print(df[col].value_counts()/len(df))
    plt.figure(figsize=(11,2))
    sns.countplot(data=df, x=col)
    plt.show()
GENDER
Gender percentages
True     0.549737
False    0.450263
Name: Gender, dtype: float64

EVER_MARRIED
Ever_Married percentages
True     0.583359
False    0.416641
Name: Ever_Married, dtype: float64

GRADUATED
Graduated percentages
True     0.62225
False    0.37775
Name: Graduated, dtype: float64

PROFESSION
Profession percentages
Artist           0.307716
Healthcare       0.166408
Entertainment    0.119306
Engineer         0.085528
Doctor           0.085373
Lawyer           0.079176
Executive        0.077316
Marketing        0.035017
Homemaker        0.029594
unknown          0.014565
Name: Profession, dtype: float64

SPENDING_SCORE
Spending_Score percentages
Low        0.606601
Average    0.240781
High       0.152619
Name: Spending_Score, dtype: float64

VAR_1
Var_1 percentages
Cat_6    0.660985
Cat_4    0.134800
Cat_3    0.099938
Cat_2    0.052526
Cat_7    0.024636
Cat_1    0.017199
Cat_5    0.009916
Name: Var_1, dtype: float64
Examination of the numerical features yielded the following insights:
# select, print and describe the numerical columns
columns = df.select_dtypes(include=['float']).columns
print(columns)
display(df[columns].describe().T)
# plot Age
plt.figure(figsize=(11,2))
sns.kdeplot(data=df['Age'], shade=True, clip=(df['Age'].min(),df['Age'].max()), gridsize=len(df))
plt.title('Age')
plt.show()
# plot Work_Experience
plt.figure(figsize=(11,2))
sns.countplot(x=df['Work_Experience'])
plt.title('Work_Experience')
plt.show()
print(df['Work_Experience'].value_counts()/len(df))
# plot Family_Size
plt.figure(figsize=(11,2))
sns.countplot(x=df['Family_Size'])
plt.title('Family_Size')
plt.show()
print(df['Family_Size'].value_counts()/len(df))
Index(['Age', 'Work_Experience', 'Family_Size'], dtype='object')
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
Age | 6454.0 | 43.400992 | 16.771175 | 18.0 | 30.0 | 40.0 | 53.0 | 89.0 |
Work_Experience | 6454.0 | 2.423303 | 3.294369 | 0.0 | 0.0 | 1.0 | 4.0 | 14.0 |
Family_Size | 6454.0 | 2.834986 | 1.511322 | 1.0 | 2.0 | 2.0 | 4.0 | 9.0 |
0.0     0.340719
1.0     0.339479
9.0     0.058878
8.0     0.058104
2.0     0.035482
3.0     0.033003
4.0     0.030834
7.0     0.025411
5.0     0.024481
6.0     0.024326
11.0    0.006663
10.0    0.006043
13.0    0.005733
14.0    0.005578
12.0    0.005268
Name: Work_Experience, dtype: float64

2.0    0.317787
3.0    0.196622
1.0    0.182677
4.0    0.178184
5.0    0.076077
6.0    0.025256
7.0    0.011156
9.0    0.006198
8.0    0.006043
Name: Family_Size, dtype: float64
Upon analyzing the correlation matrix for the numerical features, no strong correlation (over 0.75) was observed between any pair of features. The most correlated features were Age and Ever_Married, with a correlation of approximately 0.57. This is followed by Age and Graduated, with a correlation of 0.24.
Age and Family_Size were negatively correlated, with a correlation of -0.29, as expected. Additionally, Graduated and Family_Size were negatively correlated with a correlation of -0.23.
# display the correlation heatmap
corr_matrix = df.corr()
fig, ax = plt.subplots(1,figsize=(5, 3))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True, mask=(corr_matrix==1), fmt='.3f', ax=ax)
<AxesSubplot:>
When considering the one-hot encoded categorical columns, some notable correlations were observed:
# One-hot encode the categorical variables for correlation matrix
df_encoded = pd.get_dummies(df)
# display heatmap
corr_matrix = df_encoded.corr()
fig, ax = plt.subplots(1,figsize=(8, 6))
sns.heatmap(corr_matrix, cmap='coolwarm', annot=False, mask=(corr_matrix==1), fmt='.3f', ax=ax)
<AxesSubplot:>
Given the insights from the correlation matrix analysis, no pair of features is correlated strongly enough to be considered redundant. Considering the observed correlations and the nature of the dataset, several families of models may be suitable for this classification problem; at the end of the following feature engineering step, PCA will also be explored to evaluate the possible use of clustering models.
Standardisation and logarithmic transformation play crucial roles in preprocessing datasets, as they facilitate the mitigation of varying scales and skewed distributions among features. These techniques promote enhanced model performance and increased interpretability by ensuring that the data adheres to assumptions required for many machine learning algorithms.
Age: apply a log10 transformation to reduce the effect of the skewed distribution, then use StandardScaler to standardize the column so that it has a mean of 0 and a standard deviation of 1. This helps mitigate the impact of outliers and reduces the range of the Age feature.
Work_Experience: scale the column by dividing each value by the column maximum (workexp_max), a simple normalization to the range [0, 1]. A log transformation is not applicable here because the column contains zeros, and although max-scaling is less robust than a standard scaler, it is a reasonable choice given the non-normal distribution of the feature.
Family_Size: as with Age, first apply a log10 transformation to reduce the effect of the skewed distribution, then standardize with StandardScaler.
The following cells show the distributions before and after this processing; the fitted scalers and the relevant variables are stored so they can be reused on real-world data after training.
# show distribution before processing of numerical variables: Age, Work Experience, Family Size
print('PRE PROCESSING DISTRIBUTIONS')
plt.figure(figsize=(7, 2))
sns.histplot(df['Age'], bins=67, kde=True)
plt.show()
plt.figure(figsize=(7, 2))
sns.histplot(df['Work_Experience'], bins=14, kde=True)
plt.show()
plt.figure(figsize=(7, 2))
sns.histplot(df['Family_Size'], bins = 9,kde=True)
plt.show()
PRE PROCESSING DISTRIBUTIONS
# 'Work_Experience' max to be used for work experience processing
workexp_max = df['Work_Experience'].max()
# Cat columns that will be one hot encoded
cat_cols = df.select_dtypes(include=['object']).columns
# class used as a custom transformer for applying the log10 transformation
class Log10Transformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.log10(X)
# Create the column transformer processor for 'Age', 'Family_Size' and 'Work_Experience'
def create_numerical_preprocessor(df):
    preprocessor = ColumnTransformer(
        transformers=[
            ('age', Pipeline([('log10', Log10Transformer()),
                              ('scaler', StandardScaler())]), ['Age']),
            ('family_size', Pipeline([('log10', Log10Transformer()),
                                      ('scaler', StandardScaler())]), ['Family_Size']),
            ('work_experience', FunctionTransformer(lambda x: x / workexp_max), ['Work_Experience']),
        ], remainder='passthrough')
    return preprocessor
# store names of columns in the new order
new_columns_order = ['Age','Family_Size','Work_Experience','Gender','Ever_Married','Graduated','Profession','Spending_Score','Var_1']
# create and fit the numerical_processor
numerical_preprocessor = create_numerical_preprocessor(df)
numerical_preprocessor.fit(df)
# Apply numerical transformations to the dataset
df = pd.DataFrame(numerical_preprocessor.transform(df),
columns=new_columns_order,
index=df.index)
df
ID | Age | Family_Size | Work_Experience | Gender | Ever_Married | Graduated | Profession | Spending_Score | Var_1
---|---|---|---|---|---|---|---|---|---
460776 | 0.396833 | -1.638754 | 0.142857 | False | True | True | unknown | Low | Cat_6 |
460232 | -1.566955 | 0.888593 | 0.571429 | True | False | False | Healthcare | Low | Cat_6 |
459351 | 0.89587 | -0.37508 | 0.0 | False | True | True | Artist | High | Cat_1 |
461051 | 1.464045 | -0.37508 | 0.071429 | False | True | True | Entertainment | Average | Cat_6 |
466656 | 2.048589 | -0.37508 | 0.071429 | True | True | True | Marketing | High | Cat_6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
467479 | -0.365807 | 0.888593 | 0.071429 | True | True | True | Executive | Average | Cat_6 |
462151 | -0.365807 | -1.638754 | 0.571429 | False | False | True | Artist | Low | Cat_6 |
467459 | -0.020364 | 1.295406 | 0.0 | True | True | True | Executive | High | Cat_6 |
463288 | 0.985085 | -0.37508 | 0.642857 | False | True | True | Artist | Average | Cat_6 |
460651 | -0.597631 | 0.364121 | 0.571429 | True | False | False | Homemaker | Low | Cat_4 |
6454 rows × 9 columns
# show distribution after processing of numerical variables: Age, Work Experience, Family Size
print('POST PROCESSING DISTRIBUTIONS')
plt.figure(figsize=(7, 2))
sns.histplot(df['Age'], bins=67, kde=True)
plt.show()
plt.figure(figsize=(7, 2))
sns.histplot(df['Work_Experience'], bins=14, kde=True)
plt.show()
plt.figure(figsize=(7, 2))
sns.histplot(df['Family_Size'], bins = 9,kde=True)
plt.show()
POST PROCESSING DISTRIBUTIONS
To represent the categorical features in a suitable format for machine learning algorithms, one-hot encoding will be applied to the Profession, Spending_Score, and Var_1 columns. This transformation will create binary columns for each unique category within these features, allowing the model to better capture relationships between these features and the target variable.
# Create the column transformer processor for categorical columns using OneHotEncoder
def create_column_transformer(cat_cols=cat_cols, log=False):
    if log:
        print('Columns encoded: ', cat_cols)
    # Create the ColumnTransformer
    ohencoder = ColumnTransformer(transformers=[('', OneHotEncoder(), cat_cols)],
                                  remainder='passthrough')
    return ohencoder
ohencoder = create_column_transformer()
ohencoder.fit(df)
# Transform the DataFrame with the fitted encoder
df = pd.DataFrame(ohencoder.transform(df),
                  columns=ohencoder.get_feature_names_out(),
                  index=df.index)
df = df.astype(float)
df
ID | __Profession_Artist | __Profession_Doctor | __Profession_Engineer | __Profession_Entertainment | __Profession_Executive | __Profession_Healthcare | __Profession_Homemaker | __Profession_Lawyer | __Profession_Marketing | __Profession_unknown | __Spending_Score_Average | __Spending_Score_High | __Spending_Score_Low | __Var_1_Cat_1 | __Var_1_Cat_2 | __Var_1_Cat_3 | __Var_1_Cat_4 | __Var_1_Cat_5 | __Var_1_Cat_6 | __Var_1_Cat_7 | remainder__Age | remainder__Family_Size | remainder__Work_Experience | remainder__Gender | remainder__Ever_Married | remainder__Graduated
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
460776 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.396833 | -1.638754 | 0.142857 | 0.0 | 1.0 | 1.0 |
460232 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -1.566955 | 0.888593 | 0.571429 | 1.0 | 0.0 | 0.0 |
459351 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.895870 | -0.375080 | 0.000000 | 0.0 | 1.0 | 1.0 |
461051 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.464045 | -0.375080 | 0.071429 | 0.0 | 1.0 | 1.0 |
466656 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.048589 | -0.375080 | 0.071429 | 1.0 | 1.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
467479 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.365807 | 0.888593 | 0.071429 | 1.0 | 1.0 | 1.0 |
462151 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.365807 | -1.638754 | 0.571429 | 0.0 | 0.0 | 1.0 |
467459 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | -0.020364 | 1.295406 | 0.000000 | 1.0 | 1.0 | 1.0 |
463288 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.985085 | -0.375080 | 0.642857 | 0.0 | 1.0 | 1.0 |
460651 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -0.597631 | 0.364121 | 0.571429 | 1.0 | 0.0 | 0.0 |
6454 rows × 26 columns
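One caveat worth flagging: with its default settings, OneHotEncoder raises an error when it encounters a category it never saw during fitting, which can happen once the fitted transformer is applied to new real-world data. A defensive variant is sketched below (an assumption about future data, not what was run here):
# Hypothetical variant: encode categories unseen at fit time as all-zero rows
# instead of raising an error when transforming new data
def create_column_transformer_safe(cat_cols=cat_cols):
    return ColumnTransformer(transformers=[('', OneHotEncoder(handle_unknown='ignore'), cat_cols)],
                             remainder='passthrough')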
df.describe()
 | __Profession_Artist | __Profession_Doctor | __Profession_Engineer | __Profession_Entertainment | __Profession_Executive | __Profession_Healthcare | __Profession_Homemaker | __Profession_Lawyer | __Profession_Marketing | __Profession_unknown | __Spending_Score_Average | __Spending_Score_High | __Spending_Score_Low | __Var_1_Cat_1 | __Var_1_Cat_2 | __Var_1_Cat_3 | __Var_1_Cat_4 | __Var_1_Cat_5 | __Var_1_Cat_6 | __Var_1_Cat_7 | remainder__Age | remainder__Family_Size | remainder__Work_Experience | remainder__Gender | remainder__Ever_Married | remainder__Graduated
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 | 6.454000e+03 | 6.454000e+03 | 6454.000000 | 6454.000000 | 6454.000000 | 6454.000000 |
mean | 0.307716 | 0.085373 | 0.085528 | 0.119306 | 0.077316 | 0.166408 | 0.029594 | 0.079176 | 0.035017 | 0.014565 | 0.240781 | 0.152619 | 0.606601 | 0.017199 | 0.052526 | 0.099938 | 0.134800 | 0.009916 | 0.660985 | 0.024636 | 3.674023e-16 | -1.582506e-16 | 0.173093 | 0.549737 | 0.583359 | 0.622250 |
std | 0.461584 | 0.279458 | 0.279688 | 0.324173 | 0.267114 | 0.372476 | 0.169478 | 0.270034 | 0.183837 | 0.119811 | 0.427591 | 0.359647 | 0.488542 | 0.130021 | 0.223102 | 0.299941 | 0.341536 | 0.099093 | 0.473411 | 0.155025 | 1.000077e+00 | 1.000077e+00 | 0.235312 | 0.497559 | 0.493040 | 0.484862 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -2.086086e+00 | -1.638754e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -7.645912e-01 | -3.750804e-01 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | -2.036414e-02 | -3.750804e-01 | 0.071429 | 1.000000 | 1.000000 | 1.000000 |
75% | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 7.076436e-01 | 8.885934e-01 | 0.285714 | 1.000000 | 1.000000 | 1.000000 |
max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2.048589e+00 | 2.366997e+00 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
Principal Component Analysis (PCA) is a valuable technique for dimensionality reduction and data visualization, as it enables the identification of the most significant variables in a dataset.
By transforming the original variables into a new set of variables, PCA can reveal the underlying structure and relationships within the data.
Visualizing the first three principal components in a 3D plot can further elucidate the data structure and highlight any existing patterns or clusters by providing a compact and intuitive representation of the data.
# Print the dimensionality reduction achieved for a threshold variance if thresh is between 0 and 1,
# print the first n principal components if thresh is an integer > 1
def pca_test(df, thresh=0.99, exp_var=False):
    pca = PCA(thresh)
    pca_df = pca.fit_transform(df)
    print(f'Original dimensions: {df.shape[1]}\t can be reduced by {df.shape[1]-pca_df.shape[1]} dimensions')
    print(f'to {pca_df.shape[1]} dimensions, keeping {sum(pca.explained_variance_ratio_)} of the variance')
    print()
    if exp_var:
        print('Explained variance:', pca.explained_variance_ratio_)
pca_test(df, 0.99)
pca_test(df, 3, exp_var=True)
Original dimensions: 26     can be reduced by 6 dimensions
to 20 dimensions, keeping 0.9902720324401137 of the variance

Original dimensions: 26     can be reduced by 23 dimensions
to 3 dimensions, keeping 0.5870061218476903 of the variance

Explained variance: [0.31614928 0.19799414 0.0728627 ]
pca = PCA(n_components=3)
pca_features = pca.fit_transform(df)
# Show the 3D representation of the three principal components, with the target classes in different colors
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pca_features[y_train=='A', 2], pca_features[y_train=='A', 1], pca_features[y_train=='A', 0],
c='green', alpha=0.5, s=2)
ax.scatter(pca_features[y_train=='B', 2], pca_features[y_train=='B', 1], pca_features[y_train=='B', 0],
c='yellow', alpha=0.5, s=1)
ax.scatter(pca_features[y_train=='C', 2], pca_features[y_train=='C', 1], pca_features[y_train=='C', 0],
c='red', alpha=0.5, s=1)
ax.scatter(pca_features[y_train=='D', 2], pca_features[y_train=='D', 1], pca_features[y_train=='D', 0],
c='blue', alpha=0.5, s=1)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.title('3D PCA; A: green, B: yellow, C: red, D: blue')
plt.show()
Results from the PCA analysis:
Considering that the dataset has a relatively low number of dimensions (26), I decided not to perform dimensionality reduction, preserving all the information contained in the original dataset and ensuring that no valuable data is lost during the analysis and modeling process.
Upon visualizing the 3D representation of the top 3 principal components, it can be observed that some clusters and patterns do appear. However, only one cluster is dominated by a single target class, while the others contain a mix of classes.
This suggests that algorithms that exploit this spatial structure, such as Support Vector Machines or K-Nearest Neighbors, as well as clustering methods, could be explored in future projects to better understand the patterns in the data. However, for this project, I decided not to pursue these approaches.
# for simplicity, df is renamed to X_train
X_train = df
# single function to preprocess raw data exactly as the training set
def preprocess_dataset(df, numerical_preprocessor=numerical_preprocessor, new_columns_order=new_columns_order, ohencoder=ohencoder):
    df = preprocessing(df)
    df = pd.DataFrame(numerical_preprocessor.transform(df), columns=new_columns_order, index=df.index)
    df = pd.DataFrame(ohencoder.transform(df), columns=ohencoder.get_feature_names_out(), index=df.index)
    df = df.astype(float)  # mirror the dtype cast applied to the training set
    return df
X_test = preprocess_dataset(X_test)
# Print and plot performances
def get_performances(y_test, y_pred):
    # performances
    cm = confusion_matrix(y_test, y_pred)
    # print performance metrics
    print(classification_report(y_test, y_pred, target_names=['A','B','C','D']))
    print()
    plt.figure(figsize=(2,2))
    sns.heatmap(cm, annot=True, fmt='g', cmap='flare')
    plt.xlabel('Predicted labels')
    plt.ylabel('True labels')
    plt.title('Confusion Matrix')
    plt.xticks([0.5, 1.5, 2.5, 3.5], ['A', 'B', 'C', 'D'])  # set tick labels for x-axis
    plt.yticks([0.5, 1.5, 2.5, 3.5], ['A', 'B', 'C', 'D'])  # set tick labels for y-axis
    plt.show()
# display predictions of random forest compared to the test segmentation
def display_predicted(y_pred, y_test):
    y_pred = pd.DataFrame(y_pred, columns=['Predicted'])
    y_pred.index = y_test.index
    results = pd.concat([y_test, y_pred], axis=1)
    return results
# display predictions of logistic regression compared to the test segmentation, including probability for each class
def get_probabilities_df(probabilities, y_test):
    prob_df = pd.DataFrame(probabilities, columns=['A prob', 'B prob', 'C prob', 'D prob'])
    # pick the class whose predicted probability is highest
    prob_df['Predicted'] = prob_df[['A prob', 'B prob', 'C prob', 'D prob']].idxmax(axis=1).str[0]
    prob_df = prob_df[['Predicted', 'A prob', 'B prob', 'C prob', 'D prob']]
    prob_df.index = y_test.index
    results = pd.concat([y_test, prob_df], axis=1)
    return results
Random Forest algorithm was chosen for the classification problem with four target classes. Random Forest is an ensemble learning method that builds multiple decision trees and combines their outputs to improve accuracy and reduce overfitting. This makes it a suitable choice for the problem, as it can effectively learn complex relationships between features and the target classes.
To ensure a balanced representation of the target classes during model evaluation, StratifiedKFold cross-validation was used. This technique helps in getting a better understanding of the model's performance and its ability to generalize to unseen data.
To find the optimal set of hyperparameters for the Random Forest model, GridSearchCV was employed. This technique systematically searches through a specified range of hyperparameter values and selects the combination that yields the best model performance. By using GridSearchCV in combination with StratifiedKFold, I aim to improve the model's accuracy and generalization performance, ultimately resulting in a more reliable and robust classifier.
# Create the random forest
rf = RandomForestClassifier()
# set parameter grid for gridsearch
param_grid = {
    'max_depth': range(6, 18),
    'criterion': ['entropy', 'gini'],
    'min_samples_split': range(2, 14, 2),
    'min_samples_leaf': range(2, 14, 2),
    'max_features': ['sqrt', 'log2', 'auto']
}
# Create a StratifiedKFold cross-validator with 10 splits
cv = StratifiedKFold(n_splits=10, shuffle=True)
# grid search
gs_forest = GridSearchCV(estimator=rf, param_grid=param_grid, scoring='accuracy', cv=cv,
n_jobs=-1,
# verbose=1
)
# performing grid search
gs_forest.fit(X_train, y_train)
print("Best score:", gs_forest.best_score_)
print()
print('best parameters:', gs_forest.best_params_)
# get best model
forest = gs_forest.best_estimator_
# train the model
forest.fit(X_train, y_train)
# evaluate performances on training data
y_pred = forest.predict(X_train)
get_performances(y_train, y_pred)
Best score: 0.5464835481316149

best parameters: {'criterion': 'gini', 'max_depth': 16, 'max_features': 'sqrt', 'min_samples_leaf': 8, 'min_samples_split': 12}

              precision    recall  f1-score   support

           A       0.55      0.58      0.57      1565
           B       0.54      0.42      0.47      1484
           C       0.64      0.61      0.63      1573
           D       0.68      0.79      0.73      1832

    accuracy                           0.61      6454
   macro avg       0.60      0.60      0.60      6454
weighted avg       0.61      0.61      0.61      6454
After performing GridSearchCV with StratifiedKFold cross-validation, the best combination of hyperparameters for the Random Forest classifier has been found.
The best score achieved during cross-validation was 0.5465, which is a reasonable result considering the complexity of the problem, the number of target classes, the small sample size, and the little information available (9 initial variables).
When evaluating the model on the entire training set, an accuracy of 0.61 was achieved.
These results show that the model performs reasonably well in classifying the target classes, with some variation in performance across the different classes.
It is essential to validate the model on unseen test data to ensure its generalization performance and to detect possible overfitting. Overfitting occurs when a model learns the noise in the training data, leading to poor performance on new, unseen data.
To accept the model, the accuracy on the test set is expected to be similar to the cross-validation score (around 0.53 to 0.56). A significant drop in test accuracy compared to the training or cross-validation scores may indicate overfitting and should prompt further investigation of the model or consideration of alternative approaches.
# evaluate performances on test data for validation
y_pred_rf = forest.predict(X_test)
get_performances(y_test, y_pred_rf)
display_predicted(y_pred_rf, y_test).T
              precision    recall  f1-score   support

           A       0.48      0.47      0.48       407
           B       0.44      0.35      0.39       374
           C       0.59      0.60      0.60       397
           D       0.62      0.73      0.67       436

    accuracy                           0.55      1614
   macro avg       0.53      0.54      0.54      1614
weighted avg       0.54      0.55      0.54      1614
463526 | 460981 | 459748 | 461688 | 462820 | 459483 | 465940 | 464299 | 461748 | 464887 | 467733 | 462208 | 466430 | 466434 | 466970 | 465620 | 463168 | 459325 | 466146 | 467554 | 464770 | 460294 | 462359 | 465417 | 463497 | 464590 | 460377 | 462524 | 463939 | 466803 | 466909 | 465091 | 460848 | 459747 | 467007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Segmentation | D | D | B | B | C | B | C | B | B | B | A | D | C | D | D | D | D | D | C | C | A | C | D | B | D | C | C | C | A | A | C | D | A | C | A | B | C | D | B | C | A | D | D | D | D | A | A | A | A | C | B | A | C | A | A | D | D | D | A | B | D | C | D | D | A | A | B | A | D | A | C | D | C | D | B | A | B | A | A | A | A | D | B | D | D | B | D | C | A | D | B | A | A | C | C | C | A | C | D | A | B | B | D | B | B | B | C | C | B | B | B | C | D | A | A | D | B | B | B | D | C | A | C | C | B | C | D | D | A | D | C | A | B | D | C | D | C | D | B | B | B | C | C | B | C | A | A | D | B | C | A | D | B | A | D | A | D | A | A | D | B | A | B | B | A | A | B | A | C | D | B | C | C | D | C | C | C | B | D | C | D | B | C | D | A | A | B | A | B | B | C | C | D | B | A | A | A | B | B | B | A | A | B | A | C | B | A | A | B | A | A | A | A | C | D | C | B | B | B | A | C | A | C | C | C | A | C | B | B | B | C | B | B | D | A | D | D | A | D | A | B | D | D | C | C | A | A | D | D | B | A | D | C | A | A | C | B | D | B | D | A | D | D | D | B | A | C | A | C | D | B | A | D | B | D | C | D | A | B | C | C | B | D | A | B | D | D | B | D | B | A | D | A | D | C | B | D | B | A | C | D | D | B | A | B | A | D | B | C | D | B | D | B | B | D | D | A | D | C | A | C | B | A | B | D | B | C | B | D | B | C | C | D | A | B | D | D | A | D | D | A | D | A | B | B | D | D | A | B | A | D | C | C | B | D | C | D | C | D | D | C | B | D | C | A | D | C | A | B | C | D | A | A | A | A | A | A | D | A | B | B | B | A | A | D | C | A | B | B | D | A | B | A | C | A | C | D | C | B | C | B | B | D | B | D | C | C | D | D | C | D | B | C | D | B | B | A | D | D | B | B | B | C | D | A | B | C | D | C | B | D | D | A | C | D | A | A | D | C | C | B | A | C | C | C | A | A | B | D | D | B | D | A | C | C | A | C | D | B | D | A | C | D | D | C | D | B | A | C | D | A | C | C | A | D | A | A | D | B | B | A | B | D | D | B | C | B | A | B | B | D | A | B | B | B | C | B | B | B | A | C | A | B | B | A | B | C | C | A | D | C | C | D | C | C | B | D | C | C | A | B | A | D | D | A | D | D | C | C | B | C | B | C | A | D | B | B | D | D | C | B | C | C | A | A | A | A | C | A | A | B | B | C | D | A | A | A | D | A | A | D | D | D | D | D | A | C | D | B | C | C | C | C | A | B | D | A | C | C | A | D | A | C | D | A | B | B | C | D | B | D | C | D | C | D | B | A | D | D | B | D | A | D | B | D | A | C | C | D | C | D | C | B | A | B | A | C | C | C | C | D | A | D | D | B | C | D | A | A | D | C | C | B | C | B | D | D | A | D | D | C | D | C | B | C | C | A | C | B | D | C | B | B | C | D | B | C | A | A | D | A | C | D | D | A | D | D | A | C | B | C | B | D | D | D | B | A | C | A | C | D | C | A | B | C | D | A | D | D | C | D | B | B | C | C | B | A | B | A | C | C | B | A | D | B | C | C | B | B | A | B | A | C | C | A | A | D | A | C | A | C | D | C | D | A | C | B | C | B | D | D | D | D | A | B | C | A | B | A | D | B | C | A | B | A | A | C | B | C | A | D | C | B | B | B | A | A | A | C | A | A | C | A | A | D | C | B | A | A | B | A | D | D | C | A | B | D | D | B | D | A | B | A | A | D | D | A | B | B | A | B | B | B | B | C | B | B | D | D | C | B | A | C | D | C | A | C | A | D | B | D | C | B | C | D | C | C | A | A | C | A | B | B | C | A | D | D | C | B | B | D | C | D | D | B | C | B | C | D | C | C | A | A | D | B | A | B | C | C | A | B | A | D | C | B | D | D | A | C | A | B | A | C | B | C | A | A | C | A | B | A | D | D | A | B | A | C | D | C | D | A | D | C | A | A | 
A | C | C | D | C | D | A | B | C | A | B | D | C | D | D | D | B | D | D | C | A | A | C | D | B | B | A | D | B | D | D | B | C | B | B | A | D | D | C | B | C | D | C | A | B | C | C | B | C | B | B | A | A | C | C | C | B | C | D | C | B | D | A | D | B | B | D | D | C | D | D | B | C | C | B | D | B | D | A | D | C | C | D | A | D | D | C | C | B | D | A | D | A | B | C | D | C | A | C | A | D | A | C | D | A | D | C | A | B | A | D | A | A | D | B | B | D | D | B | D | B | A | B | B | C | C | A | D | B | A | C | D | C | D | D | A | A | B | A | D | C | A | C | A | C | D | D | D | D | D | D | B | A | C | C | C | D | C | B | A | A | B | A | D | B | C | C | B | B | D | A | A | B | B | A | C | D | B | D | D | D | D | C | C | C | A | C | A | A | B | C | A | A | C | B | A | A | C | C | D | A | A | B | D | C | D | A | A | D | A | D | D | C | A | A | D | A | C | A | B | C | D | B | A | A | C | B | B | C | B | A | A | D | A | B | C | D | B | D | A | D | C | D | C | A | B | D | D | A | D | A | B | D | D | A | A | D | A | C | D | D | A | D | B | C | B | A | A | C | D | A | C | A | D | A | B | B | B | A | A | C | B | D | A | A | A | B | D | D | C | D | D | B | C | C | B | D | A | C | D | D | A | C | A | B | B | D | D | A | D | B | B | C | A | A | D | C | B | D | C | A | D | C | D | B | A | B | C | D | A | B | B | C | B | B | A | B | B | D | A | B | A | C | C | B | D | D | C | D | C | D | B | D | C | C | D | A | D | A | D | B | A | C | B | D | D | C | A | D | A | A | C | B | A | B | A | B | C | D | D | A | B | A | C | A | C | A | D | C | C | A | D | B | C | B | D | C | C | D | D | D | A | A | D | D | C | A | A | A | C | B | A | C | A | B | D | A | B | B | B | A | C | B | D | D | D | D | D | D | A | C | D | C | D | A | D | B | D | D | C | C | C | D | C | B | B | C | B | C | D | C | D | D | A | C | B | C | B | B | D | C | C | C | D | B | C | B | D | D | D | C | A | D | C | A | C | C | A | A | C | B | A | A | B | B | D | B | A | B | B | B | A | B | C | B | C | D | D | B | C | C | A | B | A | B | D | C | B | C | D | D | D | A | B | C | B | A | D | D | A | C | B | D | A | D | D | A | D | B | A | B | D | C | D | D | C | B | A | B | D | C | A | B | B | B | D | A | C | D | B | A | D | D | D | C | C | C | C | D | A | D | B | C | C | A | D | B | D | C | A | C | B | D | D | C | A | A | D | C | B | D | B | C | C | B | A | B | C | A | A | C | B | D | B | C | C | B | D | A | B | B | C | B | A | A | D | C | B | A | D | A | C | A | D | C | A | C | A | C | C | B | A | A | C | C | A | D | A | C | B | A | B | C | C | D | B | B | A | C | A | B | B | D | A | B | A | D | A | C | C | C | D | D | C | C | A | B | D | D | C | C | D | B | D | C | C | A | D | A | A | D | B | B | A | B | B | A | D | B | C | D | A | A | D | D | B | D | B | B | A | D | D | A | C | A | A | C | C | C | A | A | A | D | C | B | B | A | B | A | C | C | D | C | D | A | D | D | B | D | A | C | C | D | D | B | C | A | C | A |
Predicted | D | D | C | B | D | A | C | B | D | C | D | C | D | D | D | D | D | D | C | C | B | C | D | B | D | D | C | B | C | B | B | D | A | C | B | B | C | D | A | C | B | D | D | A | C | D | D | A | D | C | D | C | C | A | A | D | D | A | A | C | D | C | D | D | D | D | B | C | D | B | C | D | D | D | A | C | B | A | A | A | D | D | C | D | D | C | D | C | C | D | A | A | A | D | C | A | A | C | D | A | C | B | A | C | C | D | A | C | A | C | B | C | A | D | D | D | B | A | B | C | B | A | D | B | C | B | D | D | A | D | C | A | A | D | C | A | C | D | C | B | D | B | A | B | B | A | C | D | C | C | A | D | A | C | A | A | B | A | A | D | A | D | B | B | D | A | A | A | C | D | D | C | C | D | A | C | C | A | B | A | D | C | C | D | B | D | A | D | D | A | C | C | A | D | D | A | D | B | A | C | A | A | C | A | D | A | A | A | B | A | A | D | A | D | D | C | C | D | B | A | C | B | C | C | D | B | B | A | C | A | C | A | B | D | A | D | D | A | D | A | B | D | D | A | C | B | D | A | B | B | B | D | B | A | B | C | C | D | C | A | B | D | D | A | C | D | B | D | C | A | C | A | D | A | D | D | D | A | B | A | C | C | D | C | C | D | A | B | D | B | A | D | C | D | C | B | D | A | C | C | D | D | D | B | D | A | B | B | B | D | A | A | B | B | D | D | D | D | C | B | C | C | D | A | D | B | D | B | B | B | C | C | D | A | A | D | A | A | D | D | B | D | A | B | C | D | D | A | B | D | D | B | C | B | D | C | A | C | A | D | A | B | D | C | A | A | B | A | A | C | D | A | D | B | C | D | D | A | A | C | C | B | A | A | D | C | A | A | A | B | C | D | D | C | B | C | D | C | D | C | C | B | A | B | D | D | C | D | D | C | A | C | A | D | B | B | B | D | D | B | A | B | D | D | B | C | D | D | C | A | D | D | A | C | A | A | C | D | B | C | C | B | D | C | C | D | A | C | D | D | C | D | A | C | C | A | A | D | C | D | A | C | D | D | C | D | B | A | C | A | B | D | B | A | D | A | A | D | B | C | B | C | D | A | A | C | C | A | C | B | D | A | C | D | D | A | C | A | D | D | A | A | A | C | C | B | C | D | D | D | C | D | D | C | C | B | D | C | A | B | C | B | D | D | A | D | D | C | C | C | C | C | B | D | D | D | A | D | A | C | A | C | C | B | A | D | B | D | A | D | C | A | C | D | D | A | D | D | C | C | D | D | D | D | D | B | C | A | A | A | D | C | C | D | A | D | A | B | B | A | D | B | C | C | B | B | A | C | D | B | D | C | D | C | D | B | B | D | D | C | C | A | D | D | D | B | C | D | D | B | D | C | D | D | D | B | C | B | B | A | A | A | D | D | B | C | D | B | B | D | C | C | A | C | B | D | D | A | D | D | C | A | B | B | B | C | A | A | C | A | C | A | C | C | B | A | C | B | B | D | A | C | D | D | C | A | A | A | C | B | B | B | D | D | D | D | A | D | A | C | A | D | D | A | C | D | D | D | D | B | B | A | A | A | A | C | D | A | D | B | B | C | C | D | D | C | B | B | B | D | A | A | C | C | A | C | D | C | C | C | C | D | A | A | C | C | C | C | C | D | D | A | D | D | D | C | C | C | A | A | C | C | D | C | D | C | B | B | D | C | D | C | B | A | C | A | D | A | C | D | B | C | A | A | A | C | D | D | A | B | D | A | D | C | C | A | D | D | B | D | D | B | A | B | A | D | B | D | C | A | A | D | C | A | C | B | D | D | D | D | B | A | C | D | C | C | A | A | A | A | D | A | C | A | D | B | C | D | A | D | A | A | D | C | D | D | A | C | B | D | D | A | A | D | A | A | B | C | D | D | D | A | D | D | A | B | A | C | B | C | B | A | D | C | B | D | D | B | D | A | B | A | C | C | B | B | D | C | D | D | D | D | A | B | C | D | D | A | C | B | B | D | C | B | A | A 
| C | D | D | D | A | D | B | C | A | B | D | C | A | D | D | D | D | D | B | B | A | B | D | C | B | B | D | C | D | A | C | B | D | A | A | D | D | C | B | C | A | A | A | B | C | C | D | C | D | D | D | A | D | C | B | A | C | D | C | B | D | A | A | C | B | D | D | A | D | A | D | C | C | D | D | B | A | A | D | C | C | A | A | D | D | D | C | A | D | A | D | A | B | C | B | C | A | D | B | D | C | C | D | C | D | C | A | C | D | D | A | A | B | A | C | D | D | C | D | C | A | A | C | B | C | D | D | C | B | C | D | C | D | D | D | B | C | A | A | C | D | C | C | C | D | D | A | D | D | D | B | D | C | C | B | A | B | C | A | A | A | A | D | B | C | A | B | B | D | A | B | A | B | B | C | D | B | C | B | D | D | C | B | C | D | C | A | A | A | C | B | A | A | B | A | A | C | B | D | B | A | C | D | C | D | A | B | D | A | D | A | C | D | A | B | A | C | A | C | B | D | C | D | D | A | D | D | C | C | B | B | D | D | C | C | D | B | D | D | A | C | D | D | A | A | D | B | B | D | A | A | D | D | A | A | D | C | C | D | A | D | D | B | C | B | A | A | B | D | D | C | C | D | B | B | C | B | A | A | B | B | D | A | D | A | C | A | D | C | D | D | B | B | C | B | C | D | D | D | D | B | D | A | C | B | A | B | D | D | B | A | C | A | D | A | C | D | D | B | D | D | C | A | B | A | B | C | D | A | B | C | C | A | C | D | A | A | D | B | D | B | C | C | B | A | D | A | D | D | D | C | D | B | A | D | B | D | C | D | B | A | C | C | D | D | C | A | D | D | A | C | A | D | B | C | C | A | A | D | A | A | A | C | B | D | A | D | B | C | A | A | C | C | B | A | C | B | B | D | D | A | D | D | A | A | A | A | A | A | B | B | D | B | B | D | A | C | B | B | A | A | C | D | D | A | A | D | A | C | C | A | C | A | A | B | B | B | D | C | C | C | D | C | D | C | C | C | C | D | C | A | D | A | C | C | B | D | D | A | C | B | C | D | B | C | B | D | D | D | D | A | D | C | A | A | C | B | A | D | C | C | A | B | D | D | A | B | C | C | B | B | B | B | C | C | A | A | C | B | A | D | C | B | A | D | B | B | A | A | D | A | A | C | C | B | A | D | D | A | D | B | D | A | D | C | A | A | B | B | A | D | D | A | B | C | B | B | B | D | C | A | A | A | C | D | A | C | A | C | D | D | D | D | B | C | B | D | D | C | D | D | A | C | D | A | B | D | B | D | C | B | D | D | A | A | D | D | C | C | D | B | C | A | C | B | B | B | A | D | B | C | D | C | D | C | C | C | A | C | B | C | C | B | A | A | C | B | A | D | D | C | A | D | C | A | C | B | C | C | B | D | B | C | C | D | D | D | C | B | A | C | B | B | D | C | D | D | C | A | A | C | B | B | D | A | B | A | C | C | C | A | D | C | D | D | B | D | D | D | C | D | A | D | C | B | A | D | D | C | D | A | B | A | B | D | A | D | A | C | D | B | B | D | D | B | A | D | B | A | D | D | A | C | A | A | B | C | D | C | B | A | D | C | C | D | A | C | A | A | C | D | B | D | A | B | A | A | A | A | A | B | D | D | B | C | C | C | A |
The performance of the Random Forest classifier on the unseen test data is reported in the above cell.
The overall accuracy on the test set is 0.55, which is consistent with the cross-validation score (0.5465) and within the acceptable range of 0.53 to 0.56. This indicates that the model generalizes reasonably well to new data and does not overfit. Moreover, considering that other projects on Kaggle report an accuracy around 0.5 and only the best models achieve 0.54, the model's performance is competitive.
However, there is still room for improvement in the model's performance, especially for classes A and B, which have precision, recall, and F1-score values lower than 0.5. Some possible avenues for future exploration include trying other classification algorithms, ensembling methods, or refining feature engineering techniques.
In summary, the Random Forest classifier demonstrates relatively good performance in this multi-class classification problem, and its predictions are competitive with the best models available for this dataset. Nevertheless, there is potential for further optimization and improvement in its performance.
Logistic Regression is a linear model for classification problems that predicts the probability of a given class. It is one of the simplest and most widely used classification algorithms, particularly when the number of features is not too large, and the relationship between the features and the target variable is not overly complex.
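To make this concrete: in the multinomial setting, the model assigns each class a linear score and converts the scores into probabilities with a softmax. The minimal sketch below uses made-up weights for four classes and three standardized features, purely for illustration; it is not the fitted model.
import numpy as np
# Purely illustrative weights: 4 classes (A-D) x 3 standardized features
W = np.array([[ 0.4, -0.2,  0.1],   # class A (made up)
              [-0.1,  0.3,  0.2],   # class B (made up)
              [ 0.2,  0.1, -0.3],   # class C (made up)
              [-0.5, -0.2,  0.0]])  # class D (made up)
b = np.array([0.1, 0.0, -0.1, 0.2]) # made-up intercepts
x = np.array([0.5, -1.0, 0.3])      # one hypothetical client's feature vector
scores = W @ x + b                             # one linear score per class
probs = np.exp(scores) / np.exp(scores).sum()  # softmax turns scores into probabilities
print(probs.round(3), probs.sum())             # the four probabilities sum to 1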
In this section, Logistic Regression was applied to the multi-class classification problem with four target classes. As with the Random Forest model, StratifiedKFold cross-validation and GridSearchCV were used to tune the hyperparameters of the Logistic Regression model, ensuring that the best possible performance was achieved.
After identifying the optimal hyperparameters, the model will be evaluated on the unseen test data to assess its generalization capability and compare its performance with the Random Forest classifier.
# Create logistic regression
reg = LogisticRegression()
# set parameter grid for gridsearch
param_grid = {
'C': [0.01, 0.05, 0.1, 0.3, 0.7, 1],
'class_weight': [None, 'balanced'],
'fit_intercept':[True, False],
'multi_class': ['ovr', 'multinomial'],
'solver': ['newton-cg', 'sag', 'saga'],
}
# Create a StratifiedKFold cross-validator with 10 splits
cv = StratifiedKFold(n_splits=10, shuffle=True)
# grid search
gs_reg = GridSearchCV(estimator=reg, param_grid=param_grid, scoring='accuracy', cv=cv,
n_jobs=-1,
# verbose=1
)
# performing gridsearch
gs_reg.fit(X_train, y_train)
print("Best score:", gs_reg.best_score_)
print()
print('best parameters:', gs_reg.best_params_)
# get best model (GridSearchCV with the default refit=True has already refitted it on the full training set)
reg = gs_reg.best_estimator_
# train the model (redundant given the refit above, but kept explicit)
reg.fit(X_train, y_train)
# evaluate performances on training data
y_pred = reg.predict(X_train)
get_performances(y_train, y_pred)
Best score: 0.515031079751362

best parameters: {'C': 0.3, 'class_weight': None, 'fit_intercept': True, 'multi_class': 'multinomial', 'solver': 'sag'}

              precision    recall  f1-score   support

           A       0.44      0.49      0.46      1565
           B       0.42      0.21      0.28      1484
           C       0.51      0.62      0.56      1573
           D       0.66      0.72      0.69      1832

    accuracy                           0.52      6454
   macro avg       0.50      0.51      0.50      6454
weighted avg       0.51      0.52      0.51      6454
After performing GridSearchCV with StratifiedKFold cross-validation, the best combination of hyperparameters for the Logistic Regression classifier was found.
The best score achieved during cross-validation was 0.5150, which is a reasonable result considering the complexity of the problem, the number of target classes, the small sample size, and the limited information carried by the features.
When evaluating the model on the entire training set, an accuracy of 0.52 was observed.
These results show that the model performs reasonably well in classifying the target classes, with some variation in performance across the different classes.
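Beyond the single best combination, the full grid-search log can be inspected to see how sensitive the score is to each hyperparameter. A minimal sketch using the fitted `gs_reg` from above; the columns are standard `cv_results_` keys:
# rank the tried hyperparameter combinations by mean cross-validation accuracy
cv_results = pd.DataFrame(gs_reg.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
display(cv_results[cols].sort_values('rank_test_score').head(10))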
# evaluate performances on test data for validation
y_pred_reg = reg.predict(X_test)
get_performances(y_test, y_pred_reg)
probabilities = reg.predict_proba(X_test)
get_probabilities_df(probabilities, y_test).sample(50).T
              precision    recall  f1-score   support

           A       0.43      0.46      0.45       407
           B       0.39      0.23      0.29       374
           C       0.51      0.62      0.56       397
           D       0.64      0.70      0.67       436

    accuracy                           0.51      1614
   macro avg       0.49      0.50      0.49      1614
weighted avg       0.50      0.51      0.50      1614
ID | Segmentation | Predicted | A prob | B prob | C prob | D prob
---|---|---|---|---|---|---
467728 | B | B | 0.311278 | 0.348326 | 0.182925 | 0.157471
467776 | D | D | 0.258465 | 0.187979 | 0.214548 | 0.339008
463712 | C | C | 0.187824 | 0.315185 | 0.430721 | 0.06627
460733 | D | D | 0.177319 | 0.144468 | 0.283134 | 0.395079
465474 | D | D | 0.120205 | 0.023667 | 0.010191 | 0.845938
464124 | B | B | 0.255972 | 0.365884 | 0.324033 | 0.054111
460570 | C | D | 0.150676 | 0.076171 | 0.097607 | 0.675546
465371 | A | B | 0.328941 | 0.330269 | 0.279051 | 0.061739
... | ... | ... | ... | ... | ... | ...

(transposed in the original output; first 8 of the 50 sampled test records shown)
The performance of the Logistic Regression classifier on the unseen test data is reported in the above cell.
The test set results are close to the cross-validation and training set accuracies, indicating that the model is not overfitting and generalizes reasonably well. The accuracy of 0.51 is slightly above the typical accuracy of models on Kaggle, suggesting that the logistic regression model is about average in competitiveness for this problem. Given the limitations of the dataset and the complexity of the problem, the logistic regression model can be considered an acceptable solution for this classification task.
The Logistic Regression classifier achieved slightly worse performance than the fine-tuned Random Forest. An advantage of Logistic Regression is the option to calculate a probability for each class. Considering these results, either model could be employed for this classification problem, depending on the specific requirements and constraints.
When comparing the predictions made by the Random Forest and Logistic Regression classifiers, it was found that they differ in approximately 24% of the cases. This difference could be attributed to the inherent variations in how these two algorithms make predictions.
One advantage of using Logistic Regression over Random Forest is its ability to provide probability estimates for each class. This can be helpful in situations where it is important to understand not only the predicted class but also the level of confidence the model has in its prediction. This additional information could be useful for decision-making processes or further analysis.
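To illustrate how these probabilities could support such decisions, the sketch below reuses the `probabilities` array computed earlier and applies a hypothetical 0.5 confidence cut-off to flag the test clients whose top class probability is low, i.e. the assignments the marketing team should trust least:
# flag low-confidence predictions via the maximum class probability
confidence = probabilities.max(axis=1)  # highest predicted probability per client
low_confidence = confidence < 0.5       # hypothetical confidence cut-off
print(f'share of clients below the 0.5 cut-off: {low_confidence.mean():.2%}')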
In conclusion, both classifiers have their strengths and weaknesses, and the choice between them ultimately depends on the specific requirements. If probability estimates are required by the marketing team, the Logistic Regression model might be more suitable despite its slightly lower accuracy.
# build a side-by-side comparison of the actual segments and both models' predictions
comparison = pd.DataFrame(y_test).join(pd.DataFrame(y_pred_rf, y_test.index, columns=['rf_predictions']))\
                                 .join(pd.DataFrame(y_pred_reg, y_test.index, columns=['reg_predictions']))
print('Compare results')
display(comparison.T)
print('differences')
display(comparison[comparison['rf_predictions']!=comparison['reg_predictions']].T)
print()
print('share of differing predictions:',
      round(len(comparison[comparison['rf_predictions']!=comparison['reg_predictions']])/len(comparison), 4))
Compare results
ID | Segmentation | rf_predictions | reg_predictions
---|---|---|---
463752 | D | D | D
460117 | D | D | D
467273 | B | C | C
461976 | B | B | A
463482 | C | D | D
465807 | B | A | A
464445 | C | C | C
467001 | B | B | C
... | ... | ... | ...

(transposed in the original output; first 8 of the test records shown)
D | A | C | D | D | D | D | A | B | C | A | A | D | C | A | D | D | A | D | D | B | A | A | C | D | C | C | A | D | C | D | A | C | C | D | A | A | D | D | C | B | C | C | B | A | B | C | B | D | C | D | D | D | A | A | C | B | A | C | D | C | A | D | A | A | A | B | D | A | D | D | A | D | C | C | D | D | B | A | A | D | C | C | A | A | A | D | D | B | A | D | A | C | D | B | C | A | C | A | D | C | D | C | C | D | C | D | C | A | C | D | D | D | A | A | A | B | D | D | B | D | C | A | A | C | A | C | A | D | C | C | C | D | C | D | A | D | B | C | A | A | C | D | B | C | C | D | D | A | D | B | D | C | D | C | C | B | D | C | C | A | A | A | A | D | A | C | C | C | B | D | A | B | B | A | A | C | D | C | C | A | D | D | C | B | C | D | C | B | A | A | B | C | A | A | B | A | A | C | B | D | A | D | C | D | C | D | A | C | D | A | D | D | A | D | A | B | B | C | D | B | C | D | C | D | D | A | D | D | C | B | B | A | D | A | C | C | D | A | A | D | A | C | D | D | A | A | D | B | B | D | A | A | C | D | C | C | D | C | C | D | D | D | A | B | C | B | A | A | C | D | D | C | C | D | A | B | C | C | A | D | B | B | D | A | D | A | C | A | D | C | D | D | C | C | C | C | C | D | D | D | A | B | D | A | C | B | A | A | A | D | A | A | C | A | D | D | C | D | D | B | D | A | B | A | C | A | C | C | A | A | B | C | C | A | C | C | C | C | D | B | D | B | B | B | C | B | D | A | D | D | D | C | D | B | A | D | C | D | B | B | B | B | C | C | D | D | A | A | D | D | A | C | C | D | B | C | C | B | A | D | A | A | A | C | B | D | A | D | A | C | A | A | B | C | C | D | C | A | B | D | D | A | D | D | B | A | A | A | A | A | A | C | D | B | C | D | A | C | A | A | A | A | C | D | D | A | A | D | C | C | C | A | C | D | A | A | B | A | D | C | C | C | A | C | D | C | C | C | C | D | C | A | D | A | C | C | C | A | D | B | C | C | C | D | B | C | B | D | D | D | A | A | D | C | A | A | C | B | A | D | C | C | D | B | D | D | A | C | C | C | B | B | B | C | C | C | D | A | C | C | D | A | C | B | A | D | B | A | A | A | D | A | A | C | C | B | A | D | D | C | D | C | D | A | D | B | A | A | A | B | A | D | D | A | C | C | B | B | B | C | C | A | A | B | C | D | A | B | A | A | D | D | D | D | A | C | B | D | D | C | D | A | A | C | D | A | C | D | C | D | C | B | A | D | A | A | D | D | C | C | D | C | C | A | C | B | A | C | A | D | A | C | D | C | D | B | C | C | A | C | A | C | C | A | A | A | C | B | A | D | D | C | A | D | C | A | C | B | C | C | A | D | B | C | C | D | D | D | C | B | A | C | C | A | D | C | B | D | C | A | A | C | C | C | D | A | B | A | C | C | B | A | D | C | D | D | D | D | D | D | C | D | C | D | C | C | A | D | D | C | D | A | C | D | C | D | A | D | A | C | D | B | B | D | D | A | A | D | A | C | D | D | C | C | A | C | B | B | D | C | C | C | D | C | B | A | A | C | D | C | C | D | B | D | A | B | A | A | C | A | A | A | D | D | C | C | A | C | A |
differences
(transposed view; truncated to the first five of the test rows on which the two models disagree)

ID | 461976 | 467001 | 467216 | 463048 | 466368 | ... |
---|---|---|---|---|---|---|
Segmentation | B | B | D | A | B | ... |
rf_predictions | B | B | D | B | B | ... |
reg_predictions | A | C | A | C | A | ... |
fraction of test predictions on which the two models disagree: 0.2435
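For reference, the comparison above can be reconstructed roughly as follows. This is a minimal sketch: `best_rf` and `best_reg` are assumed names for the fitted estimators (or full pipelines) returned by the grid searches earlier in the notebook.

# Sketch: compare the two models' predictions on the test set.
# `best_rf` and `best_reg` are assumed names for the fitted estimators.
rf_predictions = best_rf.predict(X_test)
reg_predictions = best_reg.predict(X_test)

comparison = pd.DataFrame(
    {'Segmentation': y_test,
     'rf_predictions': rf_predictions,
     'reg_predictions': reg_predictions},
    index=X_test.index)

# Keep only the rows where the two models disagree, transposed to
# mirror the table above, then report the disagreement rate.
differences = comparison[comparison['rf_predictions'] != comparison['reg_predictions']].T
print(f'fraction of differing predictions: {(rf_predictions != reg_predictions).mean():.4f}')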
In conclusion, this project aimed to segment customers based on their demographic and behavioral characteristics, employing two classification algorithms: a Random Forest Classifier and Logistic Regression. The preprocessing and feature-engineering steps included imputing missing values, encoding categorical variables, and applying numerical transformations; the sketch below recaps the approach.
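The recap below is illustrative rather than a copy of the notebook's exact transformer: the column lists come from the dataset shown earlier, but the imputation strategies are assumptions.

from sklearn.impute import SimpleImputer

# Illustrative recap of the preprocessing approach: impute, encode, scale.
# Imputation strategies are assumptions; the notebook's actual transformer may differ.
numeric_cols = ['Age', 'Work_Experience', 'Family_Size']
categorical_cols = ['Gender', 'Ever_Married', 'Graduated',
                    'Profession', 'Spending_Score', 'Var_1']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))]), categorical_cols),
])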
Both models were selected for their ability to handle multi-class classification problems with many features, and both are straightforward to tune to balance model complexity against performance. Evaluation combined Stratified K-Fold cross-validation with GridSearchCV to select hyperparameters and validate performance, along the lines of the sketch below.
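A minimal version of that setup, reusing the `preprocess` transformer from the sketch above; the parameter grid here is illustrative, not the grid actually tuned in this notebook.

# Sketch: grid search over the full pipeline with stratified 5-fold CV.
# `df` holds the training features after the earlier train/test split.
pipe = Pipeline([('prep', preprocess),
                 ('model', RandomForestClassifier(random_state=42))])
param_grid = {'model__n_estimators': [100, 300],
              'model__max_depth': [5, 10, None]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring='accuracy', n_jobs=-1)
search.fit(df, y_train)
print(search.best_params_, search.best_score_)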
The best Random Forest model achieved an accuracy of 0.55 on the unseen test data, while the best Logistic Regression model reached a close but lower 0.51; both therefore clear the required 50% accuracy threshold. The choice between the two depends on specific requirements: Logistic Regression offers interpretable coefficients and well-calibrated probability estimates, while Random Forest can capture non-linear feature interactions.
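Where probability estimates are needed, scikit-learn exposes them directly via predict_proba; a brief sketch, again assuming `best_reg` is the fitted Logistic Regression pipeline:

# Sketch: class-membership probabilities for the first few test rows.
proba = best_reg.predict_proba(X_test.head())
print(pd.DataFrame(proba, columns=best_reg.classes_,
                   index=X_test.head().index).round(3))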
However, this project has limitations. Time constraints and the modest sample size may have affected the quality of the results, and the dataset itself carries limited information about each client, which likely contributed to the relatively low accuracy scores.
Future projects could explore other classification techniques, such as Support Vector Machines or K-Nearest Neighbors, to improve model performance; either drops into the existing scikit-learn workflow, as sketched below. Additionally, more advanced feature engineering might raise accuracy further and yield more actionable insights.
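A starting point for that follow-up work, reusing the `preprocess` transformer and the training split from above; the parameter grids are illustrative assumptions.

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Candidate estimators for follow-up comparison; grids are illustrative.
candidates = {
    'svc': (SVC(), {'model__C': [0.1, 1, 10]}),
    'knn': (KNeighborsClassifier(), {'model__n_neighbors': [5, 15, 25]}),
}
for name, (estimator, grid_params) in candidates.items():
    pipe = Pipeline([('prep', preprocess), ('model', estimator)])
    search = GridSearchCV(pipe, grid_params,
                          cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
                          scoring='accuracy', n_jobs=-1)
    search.fit(df, y_train)
    print(name, search.best_params_, round(search.best_score_, 4))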