## Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

# Algorithms for Clustering data

## Contents

## Introduction

**Clustering** refers to finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

**Kmeans algorithm:**

| Kmeans is a flat clustering algorithm. It's an iterative algorithm, whose first step is to select k initial centroids (also called seeds), which are randomly selected data points. In the second step, k clusters are formed by assigning all points to their closest centroid, and the centroid of each cluster is recomputed. This is done iteratively until a stopping criterion is met (for example: the centroids don't change).

The closeness can be measured by various distance measurements, or similarity measurements or correlation. Different distance measurements were experimented for our case, and gave similar results. Thus, results were generated with Euclidean as the chosen distance measurement.

Another parameter to the kmeans algorithm is the number of clusters. Since our data is small (a matrix of size 39*49), we selected two as the number of clusters. The output of the kmeans clustering is validated with the domain knowledge and is shown to give meaningful results.

## Anomaly Detection

A typical problem definition of anomaly detection is: Given a dataset D, find all the data points x belonging to D having the top-n largest anomaly scores calculated by a predetermined function f(x). It is associated with numerous challenges, such as:

- How does one define normal, and define deviations from it?
- How many outliers are there?
- How can anomalies be validated as either real or noise?
- How does one determine when all the anomalies have been caught?

Some of the application areas include credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection.

The general steps for any anomaly detection algorithms include:

- Building a profile of the “normal” behavior. An example of a profile could be a pattern or a characteristic.
- Using the “normal” profile to detect anomalies. Anomalies are generally defined as data points, whose characteristics differ significantly from the rest of the data.

Some popular types of anomaly detection methods are:

- Graphical and Statistical-based
- Distance-based
- Model-based

### Distance-based Method

In this method, data is represented as a vector of features. The three major approaches in this method include:

- Nearest-neighbor based
- Density based
- Clustering based

### Nearest-Neighbor Based Approach

In the nearest-neighbor based approach, the distance of each data point with every other point is computed, and the outliers are define based on any chosen metric. Some metrics could be:

- Find points which have less than m neighboring points within a distance D.
- Find points whose distance to the kth nearest neighbor is greatest.
- Find points whose average distance to the k nearest neighbors is greatest.

### Density-based: LOF approach

A research paper on the density-based LOF approach may be found here.

- For each point, compute the density of its local neighborhood.
- Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors.
- Outliers are points with largest LOF value.

### Clustering-Based

To take a cluster-based approach:

- Run any clustering algorithm to cluster the data into groups/clusters.
- Choose a metric for defining outliers, for example points which lie far from the cluster centroids.

### Parzen Window Density based

Parzen Window Density-based is a non-parametric density estimation: These attempt to estimate the density directly from the data without assuming a particular form for the underlying distribution.

- Simple Ex: A histogram (Divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin)
- The Parzen window estimate resembles the histogram, with the exception that the bin locations are determined by the data.
- Assume that the region R that encloses the k examples is a hypercube with sides of length h centered at x.
- Volume: V=h
^{d}, where d is the number of dimensions. - To find the number of examples that fall within this region, we define a kernel function
- This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window or the naïve estimator.

### Non-classical Multidimensional Scaling (MDS)

The non-classical multidimensional scaling method is a way to visualize dissimilarity.

- Non-classical Multidimensional Scaling (MDS) takes a dissimilarity matrix and creates a configuration of points in 1D, 2D, or 3D, where the inter-point distances are close to the original dissimilarity between the points.
- Dissimilarity could be physical or abstract. There are several measure of dissimilarity, distance based ones are prominent.

### Association Mining

Association Mining is a general many-many mapping (association) between two kinds of things. It has traditionally been used in market basket data analysis.

- Briefly, defined as: there is a large set of items e.g., things sold in a supermarket. A large set of baskets, each is a small subset of items e.g., the things one customer buys on one day. We ask about associations like people usually buy x; y; z together, and, then, people who buy x; y; z tend to buy v;w.
- To begin with we have to find frequent items, which is finding a subset of items that appear together frequently in a collection of items.
- An important parameter that needs to be defined for any association mining algorithm is called support. Support for itemset I is the number of baskets containing all items in I. It is often expressed as a fraction of the total number of baskets.
- If the support is set too high, we could miss itemsets involving interesting rare items.
- If the support is set too low, it is computationally expensive and the number of itemsets generated are very large.

- For any association mining algorithm, we are given/assume a support threshold s, then sets of items that appear in at least s baskets are called
**frequent itemsets**. - If an itemset has no immediate superset that is frequent, it is called maximal frequent itemsets.
- For our analysis, we are interested in finding such
**maximal frequent itemsets**. - Given d unique items, a brute force approach to find frequent occurring subsets of items is:
- List all possible subsets (pairs, triplets,..) of items.
- For d unique items, there are 2
^{d}total number of subsets. - Compute the support for each subset by scanning the database.
- Prune subsets that fail the support threshold.

- For our analysis, we are interested in finding such
- As evident, this naive approach is computationally prohibitive (it is exponential in the number of the items).
- Thus, to find maximal frequent itemsets, we employed MAFIA, MAximal Frequent Itemset Algorithm.