Machine Learning Classification in Python – Part 1: Data Profiling and Preprocessing

This is the first part of our series, Automated Classification in Python, in which we demonstrate how to classify a given data set using machine learning classification techniques. In this article, we analyze and preprocess the freely available “Adult” data set for classification. We have also published our script together with the data sets and documentation on GitHub. The data set comes from the Machine Learning Repository of the University of California, Irvine, which currently contains 473 data sets (last accessed: May 10, 2019) available for machine learning applications. The “Adult” data set is based on US census data. The goal is to use the given data to determine whether a person earns more or less than $50,000 a year.

Data Profiling

Before we can begin the actual classification, we first look at the structure of the data set. It consists of approximately 48,800 records of individual persons and is already divided into training and test data. Some of the records (approx. 7.5%) are incomplete because individual data points (features) were not specified for the person in question. Because the number of incomplete records is relatively low, we simply exclude them from the analysis, which leaves roughly 45,000 records. The personal data consist of continuous and categorical features. The continuous features are age, “final weight”, years of education, capital gain, capital loss, and weekly working hours. The categorical features are employment status, degree, marital status, occupation, relationship, race, gender, and country of birth. Our target variable is a person’s income, more precisely, whether a person earns less or more than $50,000 a year. Since the target variable can take only two different values, this is a binary classification problem. Within the data set, the ratio between people earning less than $50,000 and those earning more is approximately 3:1.
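The two key figures from this profiling step, the share of incomplete records and the class ratio, can be computed with a few lines of pandas. The following is only a minimal sketch on a tiny made-up sample: the column name "income" and the helper profile are assumptions for illustration and are not taken from our script.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample standing in for the census data; "?" marks a missing value.
# The column name "income" is an assumption for this illustration.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K"],
})

def profile(data, target):
    """Return the share of incomplete rows and the class counts of the target."""
    # a row is incomplete if at least one of its values is "?"
    incomplete = data.replace("?", np.nan).isna().any(axis=1)
    return incomplete.mean(), data[target].value_counts()

share_incomplete, class_counts = profile(sample, "income")
print(f"incomplete rows: {share_incomplete:.1%}")
print(class_counts)
```

On the real data, the same two numbers should correspond to the share of incomplete records and the 3:1 class ratio mentioned above.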

Analysis of Feature Properties

When analyzing each feature, the “final weight” feature attracted particular attention: it groups similar people based on socioeconomic factors, and this rating depends on the state in which a person lives. Due to the relatively small data set and the imprecise documentation of the underlying calculation, we decided not to include this feature in the first calculation. A later comparison showed that omitting this feature never worsened the classification results and improved them in individual cases.

To solve the problem of predicting a person’s income based on these features, we use a supervised machine learning approach because we have a lot of labeled data. Based on this, the algorithm can estimate how the target depends on the individual features. In the second part of our article, we present some of the methods we have already discussed in our blog. However, all these methods require careful preprocessing of the data, because a model cannot evaluate raw values such as “Monday” or “Tuesday” on its own. Some would say that we are “cleaning” the data.

Preprocessing of the Data

We first have to preprocess our data in order to be able to apply the various machine learning models to it. The models compare the different features of the data to determine their relationship to the target. For this to work, the data must be in a uniform format that allows comparison. This is what we mean when we talk about “cleaning” the data. We clean our data with the following function and explain how it works in the next sections:

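def setup_data(self):
    """ Set up the data for classification. """
    traindata = self.remove_incomplete_data(self.traindata)
    testdata = self.remove_incomplete_data(self.testdata)
    self.y_train = self.set_target(traindata)
    self.y_test = self.set_target(testdata)

    # NOTE: the name of the target column is missing in this listing;
    # self.target_column is an assumed placeholder attribute that holds it.
    # Combine train and test data, then drop the target and "fnlwgt" before creating the dummies.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    fulldata = pd.concat([traindata, testdata], ignore_index=True)
    fulldata = fulldata.drop([self.target_column, "fnlwgt"], axis=1)
    fulldata = self.get_dummies(fulldata, self.categorical_features)
    self.x_train = fulldata[0:len(traindata)]
    self.x_test = fulldata[len(traindata):len(fulldata)]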

Although our data set is already divided into a training set and a test set in a ratio of 2:1, we merge the two before creating the dummy variables and split them again afterwards at the same position. This procedure offers the decisive advantage that the resulting data sets always have the same shape and dimensionality. Otherwise, if a categorical value occurs in only one of the two data sets, the other data set ends up with fewer columns, or columns with the same index stand for different feature values. As a result, the two data sets are no longer comparable.

Furthermore, there are some unknown values in the data set that we have to address specifically. However, the proportion of records with unknown values is relatively small (<10%), so we can simply exclude these incomplete records. We achieve this in the function “setup_data” by calling our function “remove_incomplete_data”:

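def remove_incomplete_data(self, data):
    """ Remove every row of the data that contains at least one "?". """
    # replace "?" with NaN (requires numpy imported as np), then drop all rows with any NaN
    return data.replace("?", np.nan).dropna(axis=0, how="any")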

In this case, all rows containing at least one “?” are removed from the data. We do this to ensure that the algorithm always receives relevant data and does not create relations between unknown values. Otherwise, when the dummy variables are created later, “?” would be treated as a regular category of its own, so all unknown values would be regarded as equal rather than as unknown. After executing the function, our data set consists of 45,222 entries, as opposed to the previous 48,842.
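To see the effect of this step in isolation, here is a small standalone sketch of the same replace-and-drop idea. The toy data and the free function remove_incomplete_rows are made up for illustration; the sketch assumes that missing values are encoded exactly as the string "?".

```python
import numpy as np
import pandas as pd

def remove_incomplete_rows(data):
    """Drop every row that contains at least one "?"."""
    # turn the "?" markers into NaN so that dropna can find them
    return data.replace("?", np.nan).dropna(axis=0, how="any")

# made-up records: the second and third row each contain one unknown value
people = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "?", "Private"],
    "occupation": ["Adm-clerical", "Exec-managerial", "?"],
})

cleaned = remove_incomplete_rows(people)
print(len(people), "->", len(cleaned))
```

Only the first record survives, because each of the other two contains a “?” in at least one column.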

Assigning the Target Variable

In the second part of the “setup_data” function, we call the function “set_target” to map the target variable to 0 or 1, depending on whether someone earns more or less than $50,000 a year.

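def set_target(self, data):
    """ Map the two values of the target variable to 0 and 1. """
    # NOTE: the name of the target column is missing in this listing;
    # self.target_column is an assumed placeholder attribute that holds it.
    target = data[self.target_column]
    # determine the labels once and sort them, so that training and test data get the same mapping
    labels = sorted(target.unique())
    return target.map({label: i for i, label in enumerate(labels)}).astype("int")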
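The mapping itself can be tried out on a plain pandas Series. The following sketch is a standalone variant, not our class method: the function name encode_binary_target and the label strings are assumptions for illustration. It sorts the labels before numbering them, so that the assignment does not depend on which label happens to appear first in the data.

```python
import pandas as pd

def encode_binary_target(target):
    """Map the two labels of a binary target to 0 and 1 in sorted order."""
    labels = sorted(target.unique())
    if len(labels) != 2:
        raise ValueError(f"expected exactly two labels, got {labels}")
    # the first label in sorted order becomes 0, the second becomes 1
    mapping = {label: code for code, label in enumerate(labels)}
    return target.map(mapping).astype("int")

# made-up target values; the label strings are assumed for this illustration
income = pd.Series([">50K", "<=50K", "<=50K", ">50K"])
encoded = encode_binary_target(income)
print(encoded.tolist())
```

Fixing the order matters because the function is called separately for the training and the test data, and both calls have to agree on which class is 0 and which is 1.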

Replacing Categorical Values with Dummy Variables

Before we begin to classify the data, we must first ensure that our model is able to handle categorical values. For this, we generate so-called dummy variables from all categorical variables via the one-hot encoding method. Each possible value of a categorical variable is given its own variable, so that instead of a single variable that can take on different values, there are many variables that can only take the value 0 or 1, each representing one value of the replaced variable.


An example: We have an object of type “date” with the feature weekday = {Monday, Tuesday, Wednesday, …}. After creating the dummy variables, the weekday feature no longer exists. Instead, each possible value becomes its own feature. In our example, these are weekday_tuesday, …, weekday_sunday. The variable that corresponds to the object’s original weekday is set to 1 and the rest to 0.

The attentive reader may wonder why the feature weekday_monday does not exist. The simple reason for the omission is that if all other weekday features are 0, it follows implicitly that the object has the value Monday. A further advantage is that this avoids an overly strong dependency between the individual variables (multicollinearity). Such a dependency could have a negative effect on the result, since it makes it difficult to determine the exact effect of a particular variable in a model.

Generating dummy variables is necessary because, as already mentioned, a model has no notion of a weekday or how to interpret it. After the dummy variables have been created, this no longer plays a role, since the algorithm only distinguishes whether a feature of an object has the value 0 or 1. This makes it possible to compare the individual objects on the respective features.
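The weekday example can be reproduced directly with pandas. This is a small sketch on made-up data, not part of our script. One detail is worth noting: with drop_first=True, pd.get_dummies drops the first category in its category order, which for plain strings is alphabetical. To make sure that it is “Monday” that gets dropped, the sketch fixes the category order explicitly.

```python
import pandas as pd

# made-up objects with a single categorical feature
dates = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Wednesday", "Monday"]})

# fix the category order so that "Monday" is the first category and is the one dropped
dates["weekday"] = pd.Categorical(
    dates["weekday"], categories=["Monday", "Tuesday", "Wednesday"]
)

# one 0/1 column per remaining weekday; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(dates["weekday"], prefix="weekday", drop_first=True, dtype=int)
print(dummies)
```

Rows that were “Monday” now have a 0 in every weekday column, which is exactly the implicit encoding described above.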


In the last part of our function “setup_data”, we create the dummies by calling the function “get_dummies”, which is defined as follows:

def get_dummies(self, data, categorical_features):
    """ Get the dummies of the categorical features for the given data. """
    for feature in categorical_features:
        # create the dummy variables with pd.get_dummies (drop_first=True omits the first category),
        # join them to the data and drop the original categorical column with DataFrame.drop
        data = data.join(pd.get_dummies(data[feature], prefix=feature, drop_first=True)).drop(feature, axis=1)
    return data

We loop over all categorical variables of the data set. In each iteration, we append the dummy variables of the respective categorical variable to the data set using the pandas function “get_dummies” and then remove that categorical variable. After the loop has finished, our data set no longer contains any categorical variables. Instead, it contains the respective dummy variables. So we get from the original features:

          age   workclass
Person1   39    Local-gov
Person2   50    Federal-gov

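The following (excerpt; for readability, the table also shows the column of the first category, which drop_first=True would omit):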

          age   workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked
Person1   39    0                      1                    0
Person2   50    1                      0                    0

The reason for merging the two data sets becomes clear once more: if, for example, the value “Local-gov” is present in only one of the data sets, generating the dummy variables separately yields data sets with different dimensionalities, since the entire column is missing in the other one. If the model then establishes a strong link between Local-gov and an income in excess of $50,000, that relationship shifts, in the other data set, to whichever feature occupies the position of Local-gov. This establishes a wrong connection in any case and probably leads to a wrong result as well. In the last part of the “setup_data” function, we split the data back into a training and a test set:

self.x_train = fulldata[0:len(traindata)]
self.x_test = fulldata[len(traindata):len(fulldata)]
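The difference between encoding the two data sets separately and encoding them together can be made visible with a toy example. The sketch below uses made-up records and calls pd.get_dummies with its columns parameter directly instead of our own loop, so it illustrates the idea rather than reproducing our script.

```python
import pandas as pd

# made-up records: "Private" occurs only in the test data
train = pd.DataFrame({"age": [39, 50], "workclass": ["Local-gov", "Federal-gov"]})
test = pd.DataFrame({"age": [28], "workclass": ["Private"]})

# encoded separately, each data set only gets columns for the values it contains
train_alone = pd.get_dummies(train, columns=["workclass"], dtype=int)
test_alone = pd.get_dummies(test, columns=["workclass"], dtype=int)
print(list(train_alone.columns))
print(list(test_alone.columns))

# encoded together and split afterwards, both parts share the same columns
full = pd.get_dummies(
    pd.concat([train, test], ignore_index=True), columns=["workclass"], dtype=int
)
x_train = full[:len(train)]
x_test = full[len(train):]
print(list(x_train.columns) == list(x_test.columns))
```

In the separate encoding, the second column means Federal-gov in the training data but Private in the test data, which is precisely the shift described above. After the joint encoding, every column has the same meaning in both parts.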

In the second part, we discuss how we can apply the prepared data to different classifiers and then compare and evaluate the results.

