K-means clustering and its real use cases in the security domain

Mukesh RMK
5 min read · Jul 18, 2021

To better understand the concept, let's clarify a few terms.

Clustering: It is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that the data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance or correlation-based distance. The decision of which similarity measure to use is application-specific.

K-means is considered one of the most used clustering algorithms due to its simplicity.

A cluster refers to a collection of data points aggregated together because of certain similarities.

Note:

  1. The k in k-means refers to the number of centroids (and therefore clusters) you want in the dataset. A centroid is the imaginary or real location representing the center of a cluster.
  2. The "means" in k-means refers to averaging the data, that is, finding the centroid.

K-means: It is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
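Written as a formula, the quantity k-means tries to minimize looks like this. The symbols are our own shorthand rather than anything fixed above: C_k is the set of data points assigned to cluster k, and μ_k is that cluster's centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller J means the points sit closer to their own centroids, which is the "less variation within clusters" described above.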

The k-means algorithm works as follows:

  1. Specify the number of clusters K.
  2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement.
  3. Keep iterating until there is no change to the centroids, i.e., the assignment of data points to clusters no longer changes:
  • Compute the squared distance between each data point and every centroid.
  • Assign each data point to the closest cluster (centroid).
  • Compute the centroids for the clusters by taking the average of all data points that belong to each cluster.
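The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under our own assumptions: the names `kmeans` and `sq_dist`, the `max_iter` safety cap, and the choice to leave an empty cluster's centroid where it was are ours, not part of any particular library.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of numeric tuples into k groups.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assign every point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        # Step 3 stopping rule: the assignments did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the average of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Two obvious groups of 2-D points (made-up data for illustration).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (10.0, 10.0), (10.1, 9.9), (9.9, 10.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Because the starting centroids are random, which group ends up as cluster 0 or cluster 1 depends on the seed; only the grouping itself is meaningful.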

# Use-Cases in the Security Domain:

1. Malware Detection:-

When malware enters a computer, it typically creates and modifies files and Windows registry entries, in addition to affecting interprocess communication and basic network interaction. Intrusion attacks such as malware breach an organization's network security policy and continually try to undermine the core principles of cybersecurity: Confidentiality, Integrity, and Availability, known as the CIA triad. For this reason, cybersecurity researchers have proposed behavior-based detection of malware intrusions: a framework monitors system activity, analyzes that behavior, and notifies users if there is a sign of intrusion.

How it works:

  1. Collect a dataset of malware samples.
  2. Choose the number of clusters (k).
  3. Initialize the k centroids for the data.
  4. Compute the distance of each malware sample from each centroid, then assign each sample to the cluster whose centroid is closest.
  5. Recompute the centroid of each cluster.
  6. Repeat steps 4 and 5 until the cluster centroids no longer change.
  7. If the resulting clusters do not look reasonable, repeat steps 1–6 with a different number of clusters.
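To make step 7 concrete, here is a rough sketch that runs k-means for several values of k and compares how tight the clusters are. Everything specific in it is an assumption of ours: the three behavior counts used as features, the sample values, and the use of the within-cluster sum of squares (WCSS) as a stand-in for "looks reasonable".

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Steps 3-6: initialize, assign, recompute, repeat until stable."""
    centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels

def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances: lower means tighter clusters."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Hypothetical behavior counts per malware sample:
# (files written, registry keys modified, network connections opened)
samples = [
    (40, 2, 1), (42, 3, 0), (38, 1, 2),   # mostly file activity
    (3, 35, 1), (5, 33, 2), (4, 37, 0),   # mostly registry activity
    (2, 1, 50), (1, 3, 48), (3, 2, 52),   # mostly network activity
]

# Step 7: try several values of k and compare the resulting scores.
scores = {}
for k in range(1, 5):
    centroids, labels = kmeans(samples, k)
    scores[k] = wcss(samples, centroids, labels)
    print(f"k={k}  WCSS={scores[k]:.1f}")
```

WCSS tends to keep shrinking as k grows, so the lowest score alone does not pick the best k. A common rule of thumb is to look for the value of k after which the improvement levels off. Since the result also depends on the random starting centroids, it is worth rerunning with a few different seeds.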

2. Anomaly Detection:-

Anomaly detection refers to methods that give an early warning of unusual behavior that may compromise the security and performance of a communication network, which makes it essential for the networks of enterprises and institutions. With k-means, anomalous behavior can be identified by measuring the distance between each observed data point and the cluster centroids: a point that lies far from every centroid does not fit any known pattern and is a candidate anomaly.
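As a small sketch of that idea, the function below flags any point whose distance to its nearest centroid is above a cutoff. The function name `flag_anomalies`, the two traffic features, the centroid values, and the threshold of 50 are all hypothetical; in practice the centroids would come from clustering normal traffic and the threshold would need tuning.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def flag_anomalies(points, centroids, threshold):
    """Return the points whose distance to the nearest centroid exceeds threshold."""
    anomalies = []
    for p in points:
        nearest = min(sq_dist(p, c) for c in centroids) ** 0.5
        if nearest > threshold:
            anomalies.append(p)
    return anomalies

# Hypothetical centroids learned from normal traffic:
# (packets per second, average packet size in bytes)
centroids = [(100.0, 500.0), (20.0, 1400.0)]

# New observations to check; the last one is far from both centroids.
traffic = [(105.0, 510.0), (18.0, 1390.0), (900.0, 60.0)]

print(flag_anomalies(traffic, centroids, threshold=50.0))
# prints [(900.0, 60.0)]
```

The features here are on different scales, so in a real setup it would usually make sense to normalize them before measuring distances.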

3. Crime Analysis:-

Crime analysis is a law enforcement function that involves the systematic analysis of patterns and trends in crime and disorder. It also plays a role in devising solutions to crime problems and formulating crime prevention strategies, and it is essential for providing safety and security to the civilian population. K-means clustering can be used to extract useful information from high-volume crime datasets and to interpret the data, which helps police identify and analyze crime patterns so that similar incidents are less likely to recur.

When crime data is available for specific localities in a city, the category of each crime, the area where it occurred, and the association between the two can give good insight into the crime-prone areas within that city or locality.

CONCLUSION:-

With better analysis through clustering, we can get a clearer picture of the present situation and take action based on that analysis to prevent losses.

Thank you for reading.
