Algorithms for Clustering data

Introduction

Clustering refers to finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Kmeans algorithm:

| Kmeans is a flat clustering algorithm. It's an iterative algorithm, whose first step is to select k initial centroids (also called seeds), which are randomly selected data points. In the second step, k clusters are formed by assigning all points to their closest centroid, and the centroid of each cluster is recomputed. This is done iteratively until a stopping criterion is met (for example: the centroids don't change).

The closeness can be measured by various distance measurements, or similarity measurements or correlation. Different distance measurements were experimented for our case, and gave similar results. Thus, results were generated with Euclidean as the chosen distance measurement.

Another parameter to the kmeans algorithm is the number of clusters. Since our data is small (a matrix of size 39*49), we selected two as the number of clusters. The output of the kmeans clustering is validated with the domain knowledge and is shown to give meaningful results.

Anomaly Detection

A typical problem definition of anomaly detection is: Given a dataset D, find all the data points x belonging to D having the top-n largest anomaly scores calculated by a predetermined function f(x). It is associated with numerous challenges, such as:

How does one define normal, and define deviations from it?
How many outliers are there?
How can anomalies be validated as either real or noise?
How does one determine when all the anomalies have been caught?

Some of the application areas include credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.

The general steps for any anomaly detection algorithms include:

Building a profile of the “normal” behavior. An example of a profile could be a pattern or a characteristic.
Using the “normal” profile to detect anomalies. Anomalies are generally defined as data points, whose characteristics differ significantly from the rest of the data.

Some popular types of anomaly detection methods are:

Graphical and Statistical-based
Distance-based
Model-based

Distance-based Method

In this method, data is represented as a vector of features. The three major approaches in this method include:

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Algorithms for Clustering data

Contents

Introduction

Anomaly Detection

Distance-based Method

Nearest-Neighbor Based Approach

Density-based: LOF approach

Clustering-Based

Parzen Window Density based

Non-classical Multidimensional Scaling (MDS)

Association Mining

Breadcrumbs

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Algorithms for Clustering data

Contents

Introduction

Anomaly Detection

Distance-based Method

Nearest-Neighbor Based Approach

Density-based: LOF approach

Clustering-Based

Parzen Window Density based

Non-classical Multidimensional Scaling (MDS)

Association Mining