EDA for Classification

Exploratory Data Analysis (EDA) is the process of exploring and visualizing data to find patterns and insights that inform the modeling process. With classification data, we often have dozens, if not hundreds, of potential features at our disposal.
So EDA is largely about identifying the most promising features and narrowing them down to a handful for a baseline model. We can always add more features later.


When performing EDA for classification, it's important to explore:

  • The target variable
  • The features
  • Feature-target relationships
  • Feature-feature relationships
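
The assignment below covers the target and the individual features. For the last two items, here is a minimal sketch of how they might be checked with pandas. The DataFrame and its column names (`age`, `hours`, `jobtype`) are made up for illustration and are not taken from income.csv.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from income.csv
df = pd.DataFrame({
    "age":     [25, 32, 47, 51, 38, 29],
    "hours":   [40, 45, 50, 60, 40, 35],
    "jobtype": ["private", "private", "gov", "gov", "private", "gov"],
    "target":  [0, 0, 1, 1, 0, 0],
})

# Feature-target (numeric): mean of each numeric feature within each class
numeric_by_class = df.groupby("target")[["age", "hours"]].mean()

# Feature-target (categorical): share of the positive class per category level
positive_rate = df.groupby("jobtype")["target"].mean()

# Feature-feature: pairwise correlation among the numeric features
feature_corr = df[["age", "hours"]].corr()

print(numeric_by_class)
print(positive_rate)
print(feature_corr)
```

Features whose class means or positive rates differ clearly are candidates for a baseline model, while highly correlated feature pairs may be carrying redundant information.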

Project 1 - income prediction project

We want to use machine learning to validate whether our customers are reporting their income accurately.

Assignment 1: EDA

  1. Read in income.csv
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Read file
income = pd.read_csv("Data/income.csv")
income.head()

  1. Convert the target, SalStat, into a binary numeric variable called target, and build a bar chart that plots the frequency of each value.
# Convert SalStat into a binary target (0 = at most 50,000; 1 = otherwise)
income["target"] = np.where(income["SalStat"] == ' less than or equal to 50,000', 0, 1)

# Plot
income["target"].value_counts(normalize=True).plot.bar()
plt.title('Distribution of Target Variable')
plt.xlabel('Target')
plt.ylabel('Proportion')
plt.show()
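
The comparison above only works if the label matches exactly, including its leading space. A slightly more defensive sketch strips whitespace first and then checks the class balance as numbers. The small DataFrame here is a stand-in for income.csv, and the spelling of the second label is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in for income.csv; the "greater than" label is an assumed spelling
sample = pd.DataFrame({
    "SalStat": [
        " less than or equal to 50,000",
        " greater than 50,000",
        " less than or equal to 50,000",
        " less than or equal to 50,000",
    ]
})

# Strip surrounding whitespace so the match does not depend on stray spaces
labels = sample["SalStat"].str.strip()
sample["target"] = np.where(labels == "less than or equal to 50,000", 0, 1)

# Class proportions as numbers rather than a chart
balance = sample["target"].value_counts(normalize=True)
print(balance)
```

Knowing the exact proportions helps when choosing evaluation metrics later, since accuracy alone can mislead on an imbalanced target.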

  1. Explore the numeric features using histograms or boxplots.
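# Draw a boxplot for every numeric column
def num_box_plotter(data):
    for column in data.select_dtypes("number"):
        # Pass the column as y so the values sit on the labeled axis
        sns.boxplot(y=data[column]).set(ylabel=column)
        plt.show()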

num_box_plotter(income)
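
By default, a boxplot draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied numerically to count them per column. This is a rough sketch: `iqr_outlier_counts` is a hypothetical helper and the sample data is invented.

```python
import pandas as pd

def iqr_outlier_counts(data: pd.DataFrame) -> pd.Series:
    """Count values lying more than 1.5 * IQR outside the quartiles, per numeric column."""
    numeric = data.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column thresholds with the DataFrame columns
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()

# Invented data: one extreme age, no extreme hours
toy = pd.DataFrame({
    "age":   [22, 25, 27, 30, 31, 33, 35, 90],
    "hours": [38, 40, 40, 40, 41, 42, 45, 44],
    "label": list("abcdefgh"),
})
outliers = iqr_outlier_counts(toy)
print(outliers)
```

A column with many flagged values is worth a closer look, for example with a histogram, before deciding whether to transform or cap it.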

  1. Explore the categorical features using bar charts.
# Draw a bar chart of value counts (or proportions) for every categorical column
def cat_bar_plotter(data, normalize=False):
    for column in data.select_dtypes("object"):
        data[column].value_counts(normalize=normalize).plot.bar()
        plt.show()

cat_bar_plotter(income, normalize=True)
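
Bar charts get hard to read when a column has many levels, so it can help to check each column's cardinality and the share of its most common level before plotting. A sketch under assumptions: `cat_summary` is a hypothetical helper, the sample columns are invented, and every non-numeric column is treated as categorical.

```python
import pandas as pd

def cat_summary(data: pd.DataFrame) -> pd.DataFrame:
    """Summarize each non-numeric column: number of levels and most common level."""
    rows = {}
    # Assumption: any column that is not numeric is treated as categorical
    for column in data.select_dtypes(exclude="number"):
        shares = data[column].value_counts(normalize=True)
        rows[column] = {
            "n_levels": shares.size,
            "top_level": shares.index[0],
            "top_share": shares.iloc[0],
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Invented data for illustration
toy = pd.DataFrame({
    "jobtype": ["private", "private", "gov", "private"],
    "edu":     ["hs", "ba", "ba", "ms"],
    "age":     [25, 32, 47, 51],
})
summary = cat_summary(toy)
print(summary)
```

Columns with a very high number of levels, or with one level that dominates, are usually less useful in a first baseline model.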

Code Available Here

Link Data file

This post is licensed under CC BY 4.0 by the author.