Data Splitting
Splitting the data into portions when building models is important for obtaining unbiased models. There are tools available to create such splits and achieve reliable results. Data sets are split into train and test portions, where the train subset is used for building the model and the test subset is used for evaluating it.
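As a sketch, this split can be done with scikit-learn's train_test_split; the DataFrame and column names below are only illustrative.

```python
# A minimal sketch of a train/test split; the data and column names are made up.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age":    [34, 30, 45, 52, 23, 41, 36, 29],
    "income": [40, 38, 90, 110, 25, 70, 55, 33],
    "target": [0, 0, 1, 1, 0, 1, 1, 0],
})

X = df[["age", "income"]]
y = df["target"]

# 80% of the rows are used to train the model, 20% to test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```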
Note: In an Anaconda Jupyter Notebook, to hide the output of a cell, click on the output cell and press the letter 'o'. The output will be hidden.
Data Segmentation
Data segmentation is the process of grouping data into at least two subsets. Segmentation is based on the use cases and types of information, as well as the data's sensitivity and the level of authority required to access that type of information.
K-means clustering
The K-means clustering method is used for splitting a dataset into sets of similar items. The example below and the link provided give more information on how to apply it.
In this example, a dataset of customers is partitioned into groups of individuals with similar characteristics, which is exactly the kind of problem the K-means clustering algorithm is used for.
K-means
K-means can group data in an unsupervised way, based only on the similarity of customers to each other. To define this technique more formally: there are various types of clustering algorithms, such as partitioning, hierarchical, or density-based clustering, and K-means is a type of partitioning clustering.
Partitioning Clustering
K-Means divides the data into non-overlapping subsets (clusters) without any cluster-internal structure
Examples within a cluster are very similar
Examples across different clusters are very different
It divides the data into k non-overlapping subsets, or clusters, without any cluster-internal structure or labels, which means it is an unsupervised algorithm. Objects within a cluster are very similar, and objects across different clusters are very different or dissimilar.
To use K-means, similar samples have to be found, e.g. similar customers. Two questions have to be looked into: first, how the similarity of samples can be measured for clustering, and second, how to measure the similarity of two customers with regard to their demographics.
Similarity or Dissimilarity
The objective of K-means is to form clusters in such a way that similar samples go into the same cluster and dissimilar samples fall into different clusters. Equivalently, instead of a similarity metric, a dissimilarity metric can be used.
In other words, conventionally the distance of samples from each other is used to shape the clusters: K-means tries to minimize the intra-cluster distances and maximize the inter-cluster distances.
The next step is to calculate the dissimilarity, or distance, of two cases, in this case two customers.
Similarity/Distance
Assume there are two customers, customer 1 and customer 2, and only one feature for each of them: age.
A specific type of Minkowski distance can be used to calculate the distance between these two customers.
Indeed, it is the Euclidean distance. If customer 1 is 34 and customer 2 is 30, the distance of x1 from x2 is Dis(x1, x2) = sqrt((34 - 30)^2) = 4.
In the case of more than one feature, for example, age and income, the calculation slightly changes.
As shown in the figures, if there are values for income and age for each customer, the same formula can be used, this time in a two-dimensional space.
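A minimal sketch of this calculation with NumPy, using the ages from the text and made-up income values for the two-feature case:

```python
import numpy as np

# One feature: age only (the values 34 and 30 from the text)
x1, x2 = np.array([34.0]), np.array([30.0])
print(np.sqrt(np.sum((x1 - x2) ** 2)))    # sqrt((34 - 30)^2) = 4.0

# Two features: age and income (the income values are made up for illustration)
c1 = np.array([34.0, 190.0])
c2 = np.array([30.0, 200.0])
print(np.sqrt(np.sum((c1 - c2) ** 2)))    # same formula in a two-dimensional space
```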
The same distance measure can also be used for multidimensional vectors.
The feature set should be normalized to get an accurate dissimilarity measure. There are other dissimilarity measures that can be used for this purpose, but the choice is highly dependent on the data type as well as the domain that clustering is done for. For example, Euclidean distance, cosine similarity, average distance, and so on could be used.
Indeed, the similarity measure strongly controls how the clusters are formed. It is recommended to understand the domain knowledge of the dataset and the data type of its features, and then choose a meaningful distance measure.
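The following sketch illustrates why normalization matters: without scaling, the income feature (large values) dominates the Euclidean distance. The customer values and the use of StandardScaler are illustrative choices, not the only option.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age, income (illustrative values)
X = np.array([[34.0, 40000.0],
              [30.0, 38000.0],
              [45.0, 90000.0],
              [52.0, 110000.0]])

X_scaled = StandardScaler().fit_transform(X)

# Euclidean distance between the first two customers, before and after scaling
print(np.linalg.norm(X[0] - X[1]))            # dominated by the income feature
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
```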
K-means Clustering
To keep it simple, the presented dataset has only two features, the age and income of customers, which makes its space two-dimensional.
The distribution of customers is shown in the scatter plot below: the Y-axis indicates age, and the X-axis shows the income of the customers.
As shown on the scatter plot, the customer dataset is clustered into distinct groups or clusters based on the two dimensions Age and Income.
In the first step, the number of clusters should be determined. The key concept of the K-means algorithm is that it randomly picks a center point for each cluster.
Initializing k centroids randomly, where k represents the number of clusters
Setting the k value to 3 means having three representative points for the clusters. These three data points are called the centroids of the clusters and should have the same feature size as the customer feature set (here, two features). There are two approaches to choosing these centroids.
One, three observations can be chosen randomly out of the dataset and used as the initial means; or two, three random points can be created as centroids of the clusters, which is the choice shown in the plot in red.
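Both initialization approaches can be sketched as follows; the feature array X here is a random stand-in for the scaled customer data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))     # stand-in for the scaled (age, income) features
k = 3

# Approach 1: pick k existing observations as the initial centroids
idx = rng.choice(len(X), size=k, replace=False)
centroids_from_data = X[idx]

# Approach 2: create k random points within the range of each feature
low, high = X.min(axis=0), X.max(axis=0)
centroids_random = rng.uniform(low, high, size=(k, 2))
```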
After the initialization step, which defined the centroid of each cluster, each customer is assigned to the closest centroid. For this purpose, the distance of each data point, in this case each customer, from the centroid points has to be calculated.
Distance Calculation
As mentioned before, depending on the nature of the data and the purpose of the clustering, different measures of distance may be used to place items into clusters. A matrix is then formed in which each row represents the distance of a customer from each centroid. It is called the distance matrix.
The main objective of K-means clustering is to minimize the distance of data points from the centroid of their own cluster and maximize the distance from the other cluster centroids. Following that, the next step is to find the closest centroid to each data point.
Assigning each point to the closest centroid
It can easily be seen that this does not result in good clusters, because the centroids were chosen randomly at first.
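A sketch of the distance matrix and the nearest-centroid assignment; the points and centroids below are illustrative (age, income) values.

```python
import numpy as np

X = np.array([[7.4, 3.6],
              [7.8, 3.8],
              [1.0, 1.2],
              [1.5, 0.9]])               # scaled (age, income) stand-ins
centroids = np.array([[7.0, 3.5],
                      [1.2, 1.0],
                      [4.0, 2.0]])       # k = 3 randomly chosen centroids

# Distance matrix: one row per customer, one column per centroid
dist_matrix = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Assign each customer to the closest centroid
labels = dist_matrix.argmin(axis=1)
print(dist_matrix.round(2))
print(labels)
```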
Indeed, the model would have a high error. Here, error is the total distance of each point from its centroid; it can be expressed as the within-cluster sum of squares error.
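As a sketch, the within-cluster sum of squares error can be computed from the points, centroids, and assignments of the previous sketch:

```python
import numpy as np

def within_cluster_sse(X, centroids, labels):
    diffs = X - centroids[labels]     # each point minus its assigned centroid
    return float(np.sum(diffs ** 2))

# With X, centroids and labels from the previous sketch:
# sse = within_cluster_sse(X, centroids, labels)
```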
Intuitively, the error should be reduced, which means the clusters should be shaped in such a way that the total distance of all members of a cluster from its centroid is minimized. Now, the question is: how can the clusters be improved so that the error is lower? The answer is to move the centroids.
Compute the new centroids for each cluster
In the next step, each cluster center is updated to be the mean of the data points in its cluster.
Indeed, each centroid moves according to its cluster members.
In other words, the centroid of each of the three clusters becomes the new mean.
For example, if point A's coordinates are (7.4, 3.6) and point B's features are (7.8, 3.8), the new centroid of this cluster with two points is their average, which is (7.6, 3.7). Now there are new centroids.
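A small sketch of this centroid update, using the A and B points from the example above:

```python
import numpy as np

cluster_points = np.array([[7.4, 3.6],    # point A
                           [7.8, 3.8]])   # point B
new_centroid = cluster_points.mean(axis=0)
print(new_centroid)                       # [7.6 3.7]

# Updating all centroids at once (X, labels and k as in the earlier sketches):
# new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
```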
Once again, the distance of all points from the new centroids has to be calculated. The points are re-clustered and the centroids move again. This continues until the centroids no longer move.
It is important to note that whenever a centroid moves, each point's distance to the centroids needs to be measured again. K-means is an iterative algorithm, and steps two to four have to be repeated until the algorithm converges.
Repeat until there are no more changes
In each iteration, the algorithm moves the centroids, calculates the distances from the new centroids, and assigns data points to the nearest centroid. This results in the clusters with minimum error, i.e. the most dense clusters.
However, as it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum and the result may depend on the initial clusters.
This means the algorithm is guaranteed to converge to a result, but the result may be a local optimum, i.e. not necessarily the best possible outcome. To solve this problem, it is common to run the whole process multiple times with different starting conditions.
This means that running it with different randomized starting centroids may give a better outcome, and since the algorithm is usually very fast, running it multiple times is not a problem.
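As an illustration, scikit-learn's KMeans runs the algorithm from several random initializations via its n_init parameter and keeps the run with the lowest within-cluster sum of squares; the customer data below is synthetic and only for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic (age, income) customer data, only for illustration
rng = np.random.default_rng(0)
age = rng.integers(20, 70, size=200)
income = age * 1000 + rng.normal(0, 8000, size=200)
X = StandardScaler().fit_transform(np.column_stack([age, income]))

# n_init=10 repeats the whole algorithm from 10 random initializations and
# keeps the run with the lowest within-cluster sum of squares (inertia)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.inertia_)
print(km.cluster_centers_)
```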