Machine Learning Clustering in Python

This article, together with the code, has also been published in a Jupyter notebook.

In this article, we show different methods for clustering in Python. Clustering is the combination of different objects in groups of similar objects. For example, the segmentation of different groups of buyers in retail.

Clustering compares the individual properties of an object with the properties of other objects in a vector space. The goal of this vector space is to extrapolate relationships. The individual vectors, which represent the different properties of objects, are clustered according to their proximity to the vectors of other objects. In principle, any metric is allowed to determine the proximity of two vectors. Mostly, however, the Euclidean distance, or squared Euclidean distance (L2 norm), is used.

Often, unsupervised learning models are used for clustering. This offers an advantage from the beginning that no, or rather, only a few assumptions about the found clusters have to be made. In this article, we focus only on models that are the most common. The one disadvantage of these models is that finding a basis for assessing the result is necessary in order to be able to compare and evaluate the individual procedures.

Vergleich verschiedener Clusteringverfahren
Results of the application of different clustering methods to different data sets

You can find a summary of the different clustering procedures in our blog.

Validation of Clustering Results

It is difficult to objectively evaluate a clustering result, for the reason that a clustering algorithm can be discriminatory; the resulting groups can emerge by a matter of interest. If there were already a division of the objects of a data set into groups, clustering would be redundant. But as this grouping is unknown, it is difficult to say whether or not a cluster is well chosen. However, there are some methods that can be used to distinguish, at least better from worse, clustering results. By using multiple methods, finding a cluster becomes more significant. Below we have compiled a list with the most common methods.

Elbow Method

The elbow method is suitable for determining the optimal number of clusters in k-means clustering. In this case, the number of clusters is plotted in a diagram on the x-axis and the sum of the squared deviations of the individual points to the respective cluster center is plotted on the y-axis. If an elbow should be visible on the graph, this point is the optimal number of clusters. Because from this point on, the meaningfulness of the individual clusters decreases because the sum of the squared deviations only changes slightly.

Gap Statistic Method

The Gap Statistic Method compares the deviations of the individual objects within a cluster in relation to the respective center. The separation of the objects is compared with a random distribution. The further away from the points are from the random distribution, depending on the number of clusters, the better the respective number of clusters.

Calinski-Harabasz Index

The Calinski-Harabasz Index correlates with the separation and compactness of the clusters. Thus, the variance of the sums of squares of the distances of individual objects to their cluster center is divided by the sum of squares of the distance between the cluster centers and the center of the data of a cluster. A good clustering result has a high Calinski-Harabasz Index value.

Silhouette Method

The silhouette method compares the average silhouette coefficients of different cluster numbers. The silhouette coefficient indicates how well the assignment of an object to its two nearest clusters, A and B, fails. Cluster A is the cluster to which the object was assigned.

The silhouette coefficient is calculated by taking the difference of the distance of the object to the cluster B from the distance of the object to the cluster A. This difference is then weighted with the maximum distance of the object to clusters A and B. The result S (object) can be between -1 and 1. If S (object) <0, the object is closer to cluster B than to A. Therefore, clustering can be improved. If S (object) ≈ 0, then the object lies in the middle of the two clusters. Thus, clustering is not very meaningful. The closer S (object) approaches 1, the better the clustering. The silhouette method searches for the number of clusters for which the average silhouette coefficient is highest.


First, we will generate a dataset to cluster the data.

Data Generation

We generate our dataset so that it consists of a total of 80 two-dimensional points, which are randomly generated by three points within a defined radius. For this, we use the method “make_blobs” of scikit-learn.

Darstellung des Datensatzes
Representation of the data set

Clustering Process

Now that we’ve generated our data, we can start the actual clustering. For this, we use the methods k-means, DBSCAN, HDBSCAN, and OPTICS. These methods all originate from the scikit-learn library, with the exception of HDBSCAN. However, for the HDBSCAN process, there is also a ready-made library that we have used. K-means is one of the partitioning methods and the remaining methods are called density-based methods. Fundamentally, HDBSCAN and OPTICS are just upgraded versions of the DBSCAN. With the method “fit_predict” of scikit-learn, we determine with each model the relationship of the points to a cluster.

At this point, we briefly introduce the methods used and explain how they work.


The k-means method divides the data into k parts. This is done by minimizing the sum of the squared deviations of the individual points from the cluster centers. On the one hand, the problem with this method is that the number of clusters has to be determined in advance, and on the other hand, this method can deal poorly with data sets of varying densities and shapes. As a result, it is often not suitable for real data records. Likewise, the method can not detect noise objects, that is, objects that are far removed from a cluster and can not be assigned to it.


DBSCAN stands for “Density-Based Spatial Clustering of Applications with Noise”. The basic assumption of the DBSCAN algorithm is that the objects of a cluster are close together. There are 3 types of objects:

  1. Core objects are objects that are self-dense.
  2. Density-reachable objects are objects that are not themselves dense, but accessible from a core object.
  3. Noise points are objects that are not reachable from any other point.

Thereby a parameter epsilon and a parameter minPoints are defined. Epsilon determines the distance at which a point can be reached from a second point, namely, when its distance is less than epsilon. MinPoints determines when an object is dense, ie: how many objects within the distance epsilon have to be in the radius around one point.


HDBSCAN stands for “Hierarchical Density-Based Spatial Clustering” and is based on the DBSCAN algorithm. HDBSCAN extends this by transforming the DBSCAN algorithm into a hierarchical clustering algorithm. Therefore, it does not need a distance epsilon to determine the clusters. This makes it possible to find clusters of different densities and thus to remedy a major vulnerability of DBSCAN.

HDBSCAN works by first setting a number k. This determines how many neighborhood objects are considered when determining the density of a point. This is a significant difference compared to the DBSCAN algorithm. DBSCAN looks for neighbors at a given distance and HDBSCAN searches for the distance from which k objects are in the neighborhood.

After this number k, the core distance is determined for each point. This is the smallest distance with which k objects can be reached. In addition, a reachability metric is introduced. This is defined as the largest value of either the core distance of point A, the core distance of point B, or the distance from point A to point B. This reachability metric is then used to form a minimum spanning tree. With the help of this spanning tree, the name-giving hierarchy is now formed.

HSBSCAN minimaler Spannbaum
HSBSCAN Summarized Hierarchy Tree

For this, the longest edge is always removed from the tree. Then the hierarchy is summarized by dropping the individual points from the hierarchy depending on the distance.

HSBSCAN Hierarchiebaum
HSBSCAN Hierarchy Tree

If more than initially defined min_cluster_size points fall out at a distance, they form a new subtree. Then, the method selects the clusters that have the greatest stability. The stability of a cluster is found by calculating the sum of the reciprocals of the distance from which a point falls from a cluster for all points of a cluster. In other words, if there are many points close to the cluster center, the cluster has high stability. In the picture, it is recognizable by the large area.

HDBSCAN zusammengefasster Hierarchiebaum
HDBSCAN Summarized Hierarchy Tree


For our evaluation, we decided to use the Calinski-Harabasz index, the silhouette method, and, for the k-means clustering specifically, the elbow method.


If you look at the individual metrics for k-means clustering, you find that the optimal number of cluster centers is 3. Although the silhouette method suggests that a fourth cluster would be quite useful, the Elbow method speaks against it, as a fourth cluster forms no elbow and thus provides no significant added value. Add to that the Calinski-Harabasz index, and you can see that the fourth cluster scores only slightly higher. From this, we conclude that three is the optimal number of clusters for k-means clustering on our dataset. This result also makes sense, considering that we have generated our data by 3 points.

k-means Clustering
k-means Clustering

For the other two clustering methods, we can not predetermine the number of clusters to find but merely specify the parameters by which the algorithms find the clusters. So we decided to set the distance epsilon for the DBSCAN method and only to vary the number of minPoints.

The silhouette method indicated four minPoints. On the other hand, the Calinski-Harabasz Index argues for only three minPoints. Since the difference between three and four minPoints is lower in the silhouette method than in the Calinski-Harabasz index, we decided to set the value to three. The result is the following clustering:

DBSCAN Clustering
DBSCAN Clustering


We recognize that the algorithm has chosen 4 cluster centers. It becomes clear that the DBSCAN algorithm has difficulties with different dense clusters.

With the HDBSCAN method, we vary the smallest cluster size. This is the number of points required by the hierarchical method to regard one or more separated points as new clusters. Here are the results for the Calinski-Harabasz Index and the Silhouettes method.


We recognize that the optimal size is between four and five points. We decided to use four and get the following result.

HDBSCAN Clustering
HDBSCAN Clustering


It should be clear that clustering results on a dataset can be difficult to classify. There are many other methods that have also produced different solutions. The metrics for the validation of the clustering results are only partially suitable. It is always useful to use a variety of metrics to help compare them with each other in order to find the most suitable algorithm for a particular application. It also helps to fine-tune the parameters in the interest of obtaining the best result. Despite the weaknesses mentioned above, clustering methods make it possible to divide higher-dimensional data into groups.


Photo by Antonio Esteo from Pexels

Machine Learning Classification in Python - Part 2: Model Implementation and Performance Determination

This is the second part of our series Automated Classification in Python where we use several methods to classify the UCI Machine Learning record “Adult”. In this article, after generating data from Part 1, we will discuss in more detail how we implement the different models and then compare their performance.


We use different scikit-learn models for the classification. The scikit-learn library is open-source and provides a variety of tools for data analysis. Our goal is to compare the performance of the following models: Logistic Regression, Decision Trees, Gradient Boosted Trees, k-Nearest Neighbors, Naive Bayes, and a neural net. Below we briefly summarize the methods used.

In general, the models use the training record to create a function that describes the relationship between features and targets. After training, we can apply this function to data outside the training dataset to predict its targets. More about how these feature works are summarized in our blog. Below we show our implementation of a model using the following code example:

def do_decision_tree(self):
    """ Return a decision tree classifier object. """
    clf = tree.DecisionTreeClassifier()
    return, self.y_train)

We create a scikit-learn Decision Tree Classifier object and then train it on our training dataset.

With the scikit-learn method "predict," it is possible to determine a model for single data or an entire data set. Based on the results obtained, we determine the performance of the model.

Classification Models

In this section, we briefly summarize the operation of each procedure. For a more detailed explanation of the model, we have linked the appropriate article from our blog and the scikit-learn documentation.

Logistic Regression

Logistic regression transforms a linear regression into a sigmoid function. Thus, logistic regression always outputs values ​​between 0 and 1. In general, coefficients of the individual independent features are adjusted so that the separate objects can be assigned to their targets as accurately as possible.

Decision Tree

Decision tree models are simple decision trees in which each node examines an attribute, after which the next node retrieves an attribute. This happens until a “leaf” is reached that indicates the respective classification case. For more information on Decision Trees, see our article on classification.

Gradient Boosted Trees

Gradient Boosted Trees are models that consist of several individual decision trees. The goal is to design a complex representation out of many simple models. The loss function is optimized for a differentiable loss function by the coefficients of the individual trees so that the loss function is as small as possible.

k-nearest Neighbors

In the k-nearest Neighbors model, the individual objects are viewed in a vector space and classified in relation to the k closest neighbors. A detailed explanation of this model can be found in our article on classification. 

Naive Bayes

The Naive Bayes model considers objects as vectors whose entries each correspond to a feature. The probability of each entry belonging to the target is determined and summed up. The object is assigned to the target whose summed probability is highest. Further information on the Naive Bayes model can also be found in our article on classification.

Neural network

Neural networks consist of one or more layers of individual neurons. When training a neural network, the weighted connections are adapted in such a way that a loss function is minimized. We have summarized more information on the structure and function of neural networks in our article, "What are artificial neural networks?”

Performance Determination

To determine the performance of our models, we wrote a function, "get_classification_report". This function gives us the metrics; precision, recall, F-Score, and the area under the ROC curve, as well as the required key figures, which we describe in detail in our article, by calling the scikit-learn function, "metrics.classification_report" and "metrics.confusion_matrix".

def get_classification_report(self, clf, x, y, label=""):
        """ Return a ClassificationReport object.

        clf: The classifier to be reported.
        x: The feature values.
        y: The target values to validate the predictions.
        label: (optionally) sets a label."""

        roc_auc = roc_auc_score(y, clf.predict_proba(x)[:,1])
        fpr, tpr, thresholds = roc_curve(y, clf.predict_proba(x)[:,1])
        return ClassificationReport(metrics.confusion_matrix(y, clf.predict(x)),
        metrics.classification_report(y, clf.predict(x)), roc_auc, fpr, tpr, thresholds , label)

Only the receiver operating characteristic is obtained in the form of a graph that is generated with the help of the Python library matplotlib and the scikit-learn functions "roc_auc_score" and "roc_curve".


Finally, we discuss how the models have performed on our data.

KlassifikatorPrecisionRecallF-ScoreArea under CurveTrainingszeit (Sekunden)
Logistic Regression83%84%83%87%0,8
Decision Tree82%82%82%77%0,5
Gradient Boosted Trees88%88%87%93%142,9 (2 Minuten)
k-nearest Neighbors84%84%84%88%2,1
Naive Bayes84%83%83%89%0,1
Neuronal Networks83%83%83%89%1.746 (29 Minuten)

A direct comparison shows that neural networks need very long training times in order to achieve adequate results. With our settings, the training lasted about 29 minutes and a precision of 83% was achieved. In contrast, the Naive Bayes classifier achieved the same accuracy in just one-tenth of a second.

Reciever Operating Characteristic: Area under curve
Receiver Operating Characteristic: Area under the curve

The best result was achieved with the Gradient Boosted Trees. This classifier managed to achieve a precision of 88% in 143 seconds. In addition to the relatively poor result of the neuronal net for the required training duration, it must be said that our model was most likely not chosen optimally.

In addition, the training of the neural net and the gradient boosted trees weren’t done on GPUs, which would have reduced the training duration.


Photo by Christina Morillo from Pexels

Machine Learning Performance Indicators

In this article, we present a set of metrics that can be used to compare and evaluate different methods and trained models for classification problems. In our article, Machine Learning Classification in Python - Part 1: Data Profiling and Preprocessing, we introduce different classification methods whose performance can be compared with the metrics described here.

The Cases in Detail

These cases all relate to the predicted target of an object to be classified. Below we list all cases together with their meaning.

  • False-negative: When a data point is classified as a negative example(say class 0) but it is actually a positive example(belongs to class 1).
  • False-positive: When a data point is classified as a positive example(say class 1) but it is actually a negative example(belongs to class 0).
  • True-negative: When a data point is classified as a negative example(say class 0) and it is actually a negative example(belongs to class 0).
  • True-positive: When a data point is classified as a positive example(say class 1) and it is actually a positive example(belongs to class 1).
Darstellung Klassifizierungsfälle
Presentation of the classification cases

Example: Transfer of Passenger Data to the BKA

To clarify the terms, we will look at an article on the transfer of passenger data to the Federal Criminal Police Office (BKA) of the SZ dated 24 April 2019. Since 29 August 2018, passenger data has been automatically transmitted to the BKA for evaluation. The data consists of the passenger's name, address, payment data, date of travel, time of day, departure airport, and destination airport, and this information is stored in the Passenger Name Record. This data is to be used to identify wanted offenders and suspects. In the future, this data will not only be used to identify criminals but will also be used in predictive policing. Predictive policing is the analysis of past crimes to calculate the probabilities of future crimes. The aim is to identify patterns, such as the burglary rate in a residential area, in order to be able to deploy police forces in a more targeted manner.

False Positives with Large Amounts of Data

Since the beginning of the survey until March 31st, 2019, a total of 1.2 million passenger data had been recorded and transferred. Of these, an algorithm had classified 94,098 air travelers as potentially suspicious. All hits are then checked by officials for accuracy and it turned out that only 277 of these hits were true positives. The remaining 93,821 classified suspects were false positives. The travelers from the remaining numbers, who have not been classified as suspects, can usually not be determined accurately (false negatives). The rest of the data, which was not properly classified as a match, is the true negative. The article criticizes the high rate of false positives. This would often lead to people being unfairly targeted by authorities, in this case, 99.71% of all hits. It should be mentioned, however, that this high error rate reduces the probability of false-negative cases. This means that at the expense of many false alarms, there are significantly fewer suspicious passengers being targeted as non-suspicious.

Frequently used Performance Metrics

Based on the number of cases that have occurred, we can look at and evaluate a classifier under different metrics. In our article we have decided to use the following metrics:


A very simple metric is precision. This is the ratio of how many objects a classifier from the test data set has correctly classified as positive to those that have been falsely classified as positive. The formula is:

(True Positives) / (True Positives + False Positives)

This metric is particularly problematic in a binary classification in which a class is over-represented in the example data record; the classifier can simply assign all objects of the data record to the over-represented class but nevertheless, would achieve relatively high precision.

Using the example of passenger data, the precision is calculated as follows:

277 / (277 + 93.821) = 0,29%


In contrast to Precision, the Recall Score, also known as the hit rate, indicates the proportion of all positive objects that are correctly classified as positive. This means that we get a measure to determine how many positive targets are recognized as such. The formula is:

(True Positives) / (True Positives + False Negatives)


The F-Score combines Precision and Recall to obtain the best possible estimate of how accurately the targets of the objects are determined.

The F-Score is calculated using the harmonic mean of the formula:

F = 2 * (Precision * Recall) / (Precision + Recall)

Since we have no information about the False Negatives in our example, we are unable to determine Recall and F-Score. In our article, however, we have this information, and we can calculate both metrics.

Advanced Metrics - Receiver Operating Characteristic

The last metric we use is the Receiver Operating Characteristic curve. This indicates how reliably the classifier achieves its results. The ratio of the hit rate to the false-positive classified data as a function of a threshold value is displayed in this graph...

Reciever Operating Characteristic: Area under curve
Receiver Operating Characteristic: Area under the curve

This threshold designates when an object is assigned to one or the other class. The threshold value in the lower-left corner of the graph is the largest, which means that the model only assesses the target of an object as positive if an object belongs to the target with absolute certainty. Thus, the false-positive rate approaches 0. The more one approaches the upper right corner, the lower the threshold and the more likely it is that the model falsely classifies an object as positive even though it was actually negative (False Positive). The area under curve thus serves as a metric for the quality of a tested model.


Although the time required for training does not provide any indication of the quality of a classifier, it is often important to consider how quickly a classifier can be trained. The different procedures are of different complexity, which means that the differences in time and effort can be significant. These time differences can grow exponentially with increasing data size, which is why the aspect of the time a classifier takes to reach a certain accuracy cannot be neglected, especially when working with very large amounts of data. To illustrate this aspect, we have trained our models on reduced data sets and measured the temporal differences. We obtained the following result.

Logarithmische Veränderung der Trainingszeiten bei Veränderung der Datenmenge
Logarithmic change of training times with change of data volume


Photo by Maurício Mascaro from Pexels

Machine Learning Classification in Python - Part 1: Data Profiling and Preprocessing

This is the first part of the series, Automated Classification in Python, in which we demonstrate how to classify a given data set using machine learning classification techniques. In the following article, we show the analysis and processing of the freely available "Adult" data set for classification. We have also published our script together with the data sets and documentation on GitHub. The dataset comes from the Machine Learning Repository of the University of California Irvine. This currently contains 473 datasets (last accessed: May 10, 2019) that are available for machine learning applications. The "Adult" data set is based on US census data. The goal is to use the given data to determine whether a person earns more or less than $ 50,000 a year.

Data Profiling

The first step we take before we can begin the actual classification is to look at the structure of the data set. We find that the data set consists of approximately 45,000 personal data sets and is already divided into training and test data. Some of the data (approx. 7.5%) are incomplete because data points (features) were not specified for individual persons. Due to the relatively low number of incorrect data sets, we will simply ignore them in the analysis. The personal data consist of continuous and categorical features of the persons. The continuous data are age, ' final weight', educational years, capital growth, capital loss, and weekly hours. The categorical data are employment status, degree, marital status, occupation, relationship, race, gender and country of birth. Our target variable is a person's income, more precisely, whether a person earns less or more than $50,000 a year. Since our target variable can only take two different values, it is a binary classification. Within the dataset, the ratio between people earning less than $50,000 and those earning more is approximately 3:1.

Analysis of Feature Properties

When analyzing each feature, the 'final weight ' feature attracted particular attention: it groups similar people based on socioeconomic factors and this rating depends on the state in which a person lives. Due to the relatively small data set and the inaccurate documentation of the underlying calculation, we decided not to include this feature in the first calculation. A later comparison showed that omitting this feature in individual cases only led to an improvement in the classification results. To solve the problem of predicting a person's income based on these features, we use a supervised machine learning approach because we have a lot of labeled data. Based on this, the algorithm can estimate the dependence of the individual features on the target. In the second part of our article, we present some of the methods we have already discussed in our blog. However, all these methods require very accurate pre-processing of the data in order for our model to be able to evaluate them and interpret values such as "Monday" or "Tuesday". Some would say that we are “cleaning,” the data.

Preprocessing of the data

We first have to preprocess our data in order to be able to apply the various machine learning models to it. The different models compare the different features of the data to determine the relationship to the target. In order to do so, the data must be in a uniform form to allow comparability. This is what we talk about when we ”clean” the data. We clean our data with the following function. We will explain how it works in the next sections:

def setup_data(self):
        """ set up the data for classification """
        traindata = self.remove_incomplete_data(self.traindata)
        testdata = self.remove_incomplete_data(self.testdata)
        self.y_train = self.set_target(traindata)
        self.y_test = self.set_target(testdata)

        # set dummies of combined train and test data with removed target variable
        fulldata = self.get_dummies(traindata.append(testdata, ignore_index=True).drop(, axis=1).drop("fnlwgt", axis=1), self.categorical_features)
        self.x_train = fulldata[0:len(traindata)]
        self.x_test = fulldata[len(traindata):len(fulldata)]

Although our data set is already divided into a training data set and a test data set in a ratio of 2:1, we still have to merge it for the creation of dummy variables in order to be able to divide it again later in the same ratio. This procedure offers the decisive advantage that the resulting data sets have the same shape and dimensionality under all circumstances. Otherwise, if a value is missing in either the training or test data set, the new data set may have fewer columns, or the columns with the same index may stand for different feature values. As a result, the comparability of the two records is lost. Furthermore, there are some unknown values in the dataset that we have to address specifically. However, the proportion of data with unknown values in the data set is relatively small (<10%). Therefore, it is possible for us to exclude this incomplete data from the data set and remove it. We achieve this in the function "setup_data" by calling our function "remove_incomplete_data":

def remove_incomplete_data(self, data):
    """ Remove every row of the data that contains atleast 1 "?". """
    return data.replace("?", np.nan).dropna(0, "any")

In this case, all rows containing at least one "?" are removed from the data. We do this to ensure that the algorithm always receives relevant data and does not create relations between unknown values. These would be regarded as equal values and not interpreted as unknown when the dummy variables are created later. After we executed the function, our data set now consists of 45,222 entries, as opposed to the previous 48,842.

Assigning the Target Variable

In the second part of the "setup_data" function, we use the set_target function call to map the target variable to 0 or 1, depending on whether someone earns more or less than $ 50,000 a year.

def set_target(self, data):
    """ Set the target values of the target variables (0,1 for either case). """
    for i in range(len(data[].unique())):
        data[] = np.where(data[] == data[].unique()[i], i, data[])
    return data[].astype("int")

Replace Categorical Values ​​with Dummy Variables

Before we begin to classify the data, we must first ensure that our model is able to handle categorical values. For this, we generate so-called dummy variables from all categorical variables via the one-hot encoding method. Each possible assignment of a categorical variable is given its own variable so that instead of a single variable that can take on different values, there are many variables that can only assume the value 0 or 1 and each represents a categorical assignment of the replace variable.


An example: We have an object of type "date" with the feature 'weekday = {'Monday', 'Tuesday', 'Wednesday', ...}'. After creating the dummy variable, the 'weekday' feature no longer exists. Instead, each possible assignment represents its own feature. These are in our example: weekday_tuesday, ..., weekday_sunday. Depending on which weekday the feature had before creation, this variable is set to 1 and the rest to 0. The attentive reader may wonder why the feature “weekday_monday” does not exist. The simple reason for the omission is that from the negative assignment of the other features, it can be implicitly concluded that an object has the value weekday_monday. A further advantage is that a too strong dependency, multicollinearity, of the individual variables are avoided. This could have a negative effect on the result since the strong dependency can make it difficult to determine the exact effect of a particular variable in a model. The generation of dummy variables is therefore necessary because, as already mentioned, a model has no knowledge of a weekday and how to interpret it. After the dummy variables have been created, this no longer plays a role, since the algorithm only differentiates whether the feature of an object has the value 0 or 1. This makes it possible to compare the individual objects with the respective features.


In the last part of our function "setup_data" we created the dummies by calling the function "get_dummies" as follows:

def get_dummies(self, data, categorical_features):
    """ Get the dummies of the categorical features for the given data. """
    for feature in self.categorical_features:
        # create dummyvariable with pd.get_dummies and drop all categorical variables with dataframe.drop
        data = data.join(pd.get_dummies(data[feature], prefix=feature, drop_first=True)).drop(feature, axis=1)
    return data

We create a loop that goes through all categorical variables of the record. For each run, we append the data set to all dummy variables of the respective categorical variable using the function "get_dummies". Then we remove that categorical variable. After completing the loop statement, our record no longer contains any categorical variables. Instead, it owns the respective dummy variables. So we get from the original features:

          age   workclass
Person1   39    Local-gov
Person2   50    Federal-gov

The following:

          age   workclass_Federal-gov  workclass_Local-gov  workclass_Never-worked
Person1   39    0                      1                    0
Person2   50    1                      0                    0

The reason for the merging of the two data records becomes clear again: If, for example, the value "Local-gov" is present in only one of the data records, the resulting data records have different dimensionalities in the generation of the dummy variables since the entire column is missing in the other data. For example, if the model establishes a strong link between local-gov and an income in excess of $ 50,000, that relationship shifts to the feature in the other record that occupies the place of local-gov. This probably results in a wrong result but in any case a wrong connection. In the last part of the "setup_data" function, we divide the records back into a training and test record.

self.x_train = fulldata[0:len(traindata)]
self.x_test = fulldata[len(traindata):len(fulldata)]

In the second part, we discuss how we can apply the prepared data to different classifiers and then compare and evaluate the results.


Photo by Lukas from Pexels

Model Validation and Overfitting

The model validation procedure describes the method of checking the performance of a statistical or data-analytical model.

A common method for validating neural networks is k-fold cross-validation. In doing so one divides training data set into k subsets. One of the subsets represents the test set. The remaining subsets then serve as the training set. The training set is used to teach the model. By the ratio of the correct results on the test set, it is possible to determine the degree of generalization of the model. The test set is then swapped with a training set and the performance is determined again until each set has finally functioned as a test set. At the end of the process, the average degree of generalization is calculated to estimate the performance of the model. The advantage of this method is that you get a relatively variant-free performance estimate. The reason for this is to prevent important structures in the training data from being excluded. This procedure is basically an extension of the holdout method. However, the holdout method simply splits the record into a training and a test set. The danger with this method is in contrast to the k-fold Cross Validation that important data could not be available for training. This can lead to the model not being able to generalize sufficiently.


Beispielhafte Darstellung k-fold Cross Validation
Exemplary representation k-fold Cross Validation



Overfitting is the term used when a model is too specifically adapted to a training set. With neural networks, for example, this would mean that a network is very accurate for inputs from the training data set, but not for a test set. This means that the model can map the trained data very accurately, but it is unable to produce generalized results. 

Typically, overfitting occurs when the training record is relatively small and the model is relatively complex. Because a complex model can reproduce the training data more accurately, conversely, a simple model does not accurately depict the training data (underfitting). So it is generally useful to keep the model as simple as possible and at the same time not too simple, depending on the existing data set. A perfect model, that is, a model that does not come to over or underfill is almost impossible to create.

In order to reduce the problem of overfitting and at the same time to keep the underfitting low, several methods have been introduced. Among other things, the patented Google dropout. It only deactivates a certain number, usually between 20% and 50%, depending on the fixed factor, of the neurons randomly. This method, despite its simplicity, achieves a significant reduction in overfitting.

Beispielhafte Darstellung für Overfitting

Exemplary representation for overfitting

Photo by Juhasz Imre from Pexels


Clustering with Machine Learning

The clustering problem is the grouping of objects or data into clusters. In contrast to classification, they are not predetermined but result from the similarity of the different objects considered. Similarity means the spatial distance of the objects, represented as vectors. There are different methods to determine this distance, e.g. the Euclidean or Minkowski metrics. Already when determining the similarity one notices that there is no clear cluster determination.


Beispiel Clustering
An Example of Clustering


Clustering Methods

There are different methods to classify the individual objects into clusters, each of which has its own advantages and disadvantages and achieves different results. In most cases, in order to obtain the best results for a data set, you have to try different methods with different parameters and then decide which one to use. There is no procedure that works best in general. Common unsupervised learning methods include:


Hierarchical Clustering

In hierarchical clustering, either the total amount of all objects (top-down) or the amount of all individual objects (bottom-up) is assumed as a cluster at the beginning. In the top-down process, the largest cluster is split by selecting the farthest object as the center of the cluster as a new cluster center. The algorithm then assigns all objects to the nearest cluster center. In the bottom-up procedure, the smallest and closest clusters are combined to form a new cluster. The difficulty with both methods is as such: on the one hand, how to determine the distance between the individual objects and, on the other hand, how far to divide or combine the clusters. After all, the process should not be repeated until as many clusters as objects or only one cluster remains. 



In partitioning clustering, a number k is first defined, which specifies the number of clusters. The respective algorithm then determines k cluster percentages. For example, the K-Means algorithm, one of the most common partitioning clustering methods, randomly determines the cluster centers. The advantage of the partitioning method is that the assignment of the individual objects to a cluster can be changed during the finding of a cluster. On the other hand, the fact that the number of cluster centers is already defined at the beginning is a big disadvantage, since one has to execute the algorithm with different values and then select one in order to achieve optimal results.


Density-Based Clustering

In density-based clustering, the different clusters are determined by the density of the individual objects in space. The most popular of these methods is the DBSCAN algorithm. A distance ε together with a number k is determined. Objects that possess k neighboring objects with a distance smaller than ε together represent core objects. If two core objects are at a distance ε to each other, they are combined into a cluster. Objects that are not core objects but are close to a cluster are assigned to the cluster. The objects that belong to no cluster are considered noise objects.



Since clustering methods find abstract relationships in data that a person may not perceive as obvious, they are now used in many areas, such as in market analysis. In market analysis, clustering is widely used to evaluate data from surveys, for example. Here, customers are divided into general groups in order, for example, to enable better product placements. Clustering is also used in Natural Language Processing (NLP). Here they enable the automated evaluation of texts and language. Another interesting field in which clustering is used in crime analysis. Clustering allows the efficient deployment of police and security forces, as crime hot spots can be better identified.


Hurdles of Clustering

Clustering is very time-consuming for large amounts of data, especially if the room has several dimensions. Furthermore, the effectiveness of clustering is determined by the chosen method of distance measurement. As a rule, there is no generally correct result. This is due to the fact that the result is drastically different depending on the different methods with different parameters. This makes it difficult to interpret the result and check its correctness. Possibilities would be to carry out this examination, to compare the result of the clustering with known classifications or the evaluation by a human expert. However, both variants are only conditionally suitable, since clustering would not be necessary if there were already known classifications and human judgments are always very subjective.


Photo by Pixabay from Pexels