Data Mining Methods

Data mining methods are well known to most data scientists. For beginners, however, it is useful to know how each of them is applied in practice. This post provides a short review of the most important and most frequently used data mining methods.

This short review only highlights some of the main features of these data mining methods, techniques or procedures, and how each of them is affected by common data problems.

Note: My intention is not to explain in detail how these techniques operate, but to stay focused on the data preprocessing step.

In data mining, there are two main branches for obtaining knowledge: (A) prediction (A.1 statistical and A.2 symbolic methods) and (B) description/classification. Below you will find a short description of each method, together with the names of some representative algorithms, which are covered at greater length in the knowledge base of ION Data Science.

A) PREDICTION

Within the prediction family of data mining methods, there are two main groups: statistical methods and symbolic methods.

A.1) Statistical methods represent knowledge through mathematical models and computations.
A.2) Symbolic methods represent knowledge by means of symbols and connectives (which usually yields more interpretable models).

Prediction data mining methods

A.1) Statistical methods

Regression Models

Regression models are among the oldest statistical models used for prediction, and they are still widely used in estimation tasks. They always require you to select the equation used for modelling: linear, quadratic and logistic regression are the most common choices. Regression methods also impose some basic requirements on the data, such as numerical attributes with no missing values. Another important consideration is that regression tries to fit outliers into the model, which can distort it. Finally, keep in mind that regression models use all the features, regardless of whether they are useful or dependent on one another.
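
To make the point about outliers concrete, here is a minimal sketch of ordinary least squares for a single numeric attribute, written with the Python standard library only. The function name fit_line and the toy data are my own illustrative assumptions, not part of any specific library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (assumed for illustration): the clean points lie exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_clean = [2.0, 4.0, 6.0, 8.0, 10.0]
ys_outlier = [2.0, 4.0, 6.0, 8.0, 30.0]  # same data, but the last value is an outlier

clean_intercept, clean_slope = fit_line(xs, ys_clean)
outlier_intercept, outlier_slope = fit_line(xs, ys_outlier)

print("clean fit:  ", clean_intercept, clean_slope)
print("outlier fit:", outlier_intercept, outlier_slope)
```

A single extreme value is enough to pull the whole line towards it, which is why outlier handling usually comes before fitting a regression model.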

Artificial Neural Networks (ANNs)

Used especially for prediction, Artificial Neural Networks are mathematical models suitable for almost all data mining tasks. ANNs are built from mathematical neurons, which can be understood as the atomic parts that aggregate their (numeric) inputs into an output according to an activation function. ANNs often outperform other models thanks to their complexity; however, this same complexity (which depends on a suitable configuration of the network) makes them less attractive when compared with the efficiency of other methods. In fact, ANNs are considered the typical example of black-box models. ANNs are fairly robust against noise and outliers, but they require numeric attributes and no missing values (like regression models). There are different formulations of ANNs depending on the interconnection pattern between the layers of neurons, the learning process used to update the weights of these interconnections, and the activation function (which converts a neuron's weighted input into its output activation). Probably the most common ANN algorithms are the Multi-Layer Perceptron (MLP), Radial Basis Function Networks (RBFNs) and Learning Vector Quantization (LVQ).
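
As a rough illustration of a single mathematical neuron, the sketch below aggregates its inputs as a weighted sum and passes the result through an activation function. The sigmoid activation, the function names and the example weights are assumptions chosen for illustration; real networks combine many such units and learn the weights from data.

```python
import math


def sigmoid(z):
    """Logistic activation: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation=sigmoid):
    """One neuron: weighted sum of the inputs plus a bias, then the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# Hypothetical inputs and weights: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
output = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
print(output)
```

Note that the neuron only works with numbers, which is where the requirement for numeric attributes and no missing values comes from.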

Bayesian Learning

These methods use Bayes' theorem as a framework for making rational decisions under uncertainty. The most widely applied Bayesian learning method is Naïve Bayes, which assumes that, given the class, the value of an attribute is independent of the values of the other attributes. A priori, these algorithms only work with categorical attributes, because the probability computation can only be made in discrete domains. The independence assumption among attributes causes a large sensitivity to redundant or useless attributes (or examples), as well as to noise and outliers; furthermore, these Bayesian methods cannot deal with missing values. Besides Naïve Bayes, there are more complex Bayesian models based on dependency structures (such as Bayesian networks).
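
The following is a minimal sketch of Naïve Bayes for categorical attributes, using only frequency counts. The function names and the toy weather data are hypothetical, and the add-one (Laplace) smoothing is my own addition so that unseen values do not produce zero probabilities.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    # Number of distinct values per attribute, needed for add-one smoothing.
    n_values = [len({row[i] for row in rows}) for i in range(len(rows[0]))]
    return class_counts, value_counts, n_values


def predict(model, row):
    """Pick the class maximising P(class) * product of P(value | class)."""
    class_counts, value_counts, n_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Add-one smoothing (an assumption of this sketch).
            score *= (value_counts[(i, label)][value] + 1) / (count + n_values[i])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("rainy", "cool")))
print(predict(model, ("sunny", "hot")))
```

Because each attribute contributes its own factor to the product, a redundant attribute effectively gets counted twice, which is the sensitivity mentioned above.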

Instance-based Learning

Instance-based learning methods (also known as memory-based learning or lazy learners) are learning algorithms that compare new problem instances with instances seen in training (stored in memory) instead of performing explicit generalization. That is, a distance function determines which of the stored examples are closest to a new example, and the prediction is derived from them. The hypotheses are therefore constructed directly from the training instances themselves. Instance-based methods differ from one another in the distance function they use, the number of examples taken to make the prediction, the weights given to each example when voting, and the efficiency of the algorithm used to find the nearest examples (such as hashing schemes or KD-Trees). Although it suffers from several drawbacks (slow prediction response, high storage requirements and low noise tolerance), K-Nearest Neighbor (KNN) is the most widely applied algorithm of this family because it is very useful and well known in data science.
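
Here is a minimal sketch of K-Nearest Neighbor with Euclidean distance and unweighted majority voting. The function name, the toy data and the brute-force search are illustrative assumptions; practical implementations often use structures such as KD-Trees to speed up the neighbour search.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (features, label) pairs with numeric features.
    """
    # Brute force: sort every stored example by its distance to the query.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well separated groups of points.
train = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 7)))
```

There is no training step at all: the whole dataset is kept in memory and every prediction scans it, which explains both the storage cost and the slow response.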

Support Vector Machines

Support Vector Machines (SVMs) are machine learning algorithms based on statistical learning theory; you may also find them described as kernel methods. They are similar to Artificial Neural Networks (ANNs) in the sense that they are used for estimation and perform very well when the data is linearly separable. Like ANNs, SVMs require numeric, non-missing data and are fairly robust against outliers and noise. Usually, SVMs do not require you to generate interactions among variables (as regression methods do), which should save some data preprocessing steps.

A.2) Symbolic methods

Rule Learning

These methods are also known as separate-and-conquer methods or covering rule algorithms. All of them search for a rule that explains some part of the data, separate the examples it covers, and recursively conquer the remaining examples. In practice, there are many ways of doing this separation, as well as many ways of interpreting the rules (and their inference mechanism). From the point of view of data preprocessing, these methods require nominal or discretized data (although this task is frequently implicit in the algorithm), and they have an innate selector of interesting attributes. As with other kinds of techniques, missing values, noisy examples and outliers may harm their performance. Some of the best-known examples of these models are the AQ, RIPPER, CN2, PART and FURIA algorithms.
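
The separate-and-conquer loop can be sketched as follows. This is a deliberately simplified illustration, not an implementation of AQ, RIPPER, CN2, PART or FURIA: each rule here has a single condition (attribute = value), is chosen greedily by precision and then coverage, and there is no pruning. All names and the toy data are assumptions.

```python
def learn_rules(rows, labels, target):
    """Greedy covering: learn single-condition rules for the `target` class."""
    remaining = list(zip(rows, labels))
    rules = []
    while any(label == target for _, label in remaining):
        best = None  # (precision, hits, attribute index, value)
        for i in range(len(rows[0])):
            # Sorted so that ties are broken deterministically.
            for value in sorted({row[i] for row, _ in remaining}):
                covered = [label for row, label in remaining if row[i] == value]
                hits = covered.count(target)
                if hits == 0:
                    continue
                candidate = (hits / len(covered), hits, i, value)
                if best is None or candidate[:2] > best[:2]:
                    best = candidate
        _, _, attr, value = best
        rules.append((attr, value))
        # "Separate" the covered examples, then "conquer" what is left.
        remaining = [(row, label) for row, label in remaining if row[attr] != value]
    return rules


# Hypothetical toy data: (outlook, humidity) -> play?
rows = [
    ("sunny", "high"), ("sunny", "normal"), ("rainy", "high"),
    ("rainy", "normal"), ("overcast", "high"),
]
labels = ["no", "yes", "no", "yes", "yes"]
rules = learn_rules(rows, labels, target="yes")
print(rules)  # each pair reads: IF attribute[index] == value THEN "yes"
```

Notice that the rules compare nominal values for equality, which is why numeric attributes need to be discretized first.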

Decision Trees

Decision trees are predictive models formed by iterations of a divide-and-conquer scheme of hierarchical decisions. They work by attempting to split the data on one of the independent variables so as to separate it into homogeneous subgroups. As you have probably guessed, the final form of the tree can be translated into a set of If-Then-Else rules, one for each path from the root to a leaf node. Decision trees are closely related to "rule learning" methods; consequently, they suffer from the same disadvantages. The best-known decision tree algorithms are CART, C4.5 and PUBLIC.
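
To illustrate what "splitting into homogeneous subgroups" means, here is a minimal sketch that searches for the best threshold on one numeric variable using Gini impurity (the criterion used by CART; C4.5 uses an entropy-based criterion instead). It finds a single split only, whereas a full tree would repeat the search recursively on each subgroup. The function names and the data are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity: 0.0 for a pure group, larger for mixed groups."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, labels):
    """Return the threshold on one variable with the lowest weighted impurity."""
    best_threshold, best_impurity = None, float("inf")
    # The largest value is skipped: it would leave the right-hand group empty.
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity


# Hypothetical toy data: small values belong to class "a", large ones to "b".
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_split(values, labels)
print(threshold, impurity)  # reads as: IF value <= threshold THEN "a" ELSE "b"
```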

B) DESCRIPTION

For descriptive tasks, it is preferable to categorize by the common problems to be solved rather than by the methods used. In fact, both the problems and the methods used to solve these description/classification problems are closely related to the case of predictive learning.

Descriptive data mining methods

Clustering

This problem appears when there is no class information to be predicted, but the data must be divided into natural groups or clusters. These clusters reflect subgroups of examples that share some properties or have some similarities. Clustering methods work by calculating a multivariate distance between observations in order to group those that are most closely related. They belong to three broad categories:

Agglomerative clustering. Agglomerative clustering is a hierarchical type of clustering and the exact opposite of divisive clustering. It starts by considering each example as a cluster and iteratively merges clusters until a criterion is satisfied.
Divisive clustering. Divisive clustering works in the opposite direction to agglomerative clustering: it starts with all the examples in a single cluster and applies recursive divisions.
Partitioning clustering. Partitioning-based clustering starts with a fixed number k of clusters (k-Means is its most representative algorithm) and iteratively adds examples to them or removes examples from them until no further improvement is achieved, based on optimizing an intra-cluster and/or inter-cluster distance measure.

As usual when distance measures are involved, numeric data is preferable, together with no missing values and the absence of noise and outliers. Other well-known examples of clustering algorithms are Self-Organizing Maps and COBWEB.
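
As an example of the partitioning category, here is a minimal sketch of k-Means with Euclidean distance. Taking the first k points as the initial centroids is a simplifying assumption of this sketch (real implementations use smarter or randomized initialization), and the function name and data are illustrative.

```python
import math


def k_means(points, k, iterations=10):
    """Basic k-Means: alternate between assigning points and moving centroids."""
    centroids = [points[i] for i in range(k)]  # naive initialization (assumption)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # no improvement: the assignment is stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data: two natural groups of 2-D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Since every step relies on distances and means, a single outlier or a non-numeric attribute would directly distort the centroids.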

Association Rules

Association rules are a set of techniques that aim to find relationships (associations) in the data. The typical example is the analysis done by a retailer to find out whether a customer who buys product X would also be interested in acquiring product Y. However, association rule algorithms can also be formulated to look for sequential patterns. For these algorithms, the data volume usually needs to be very large, because association analysis works on transaction data. The data must also be discretized, because transactions are expressed by categorical values. Data transformation and reduction are often needed to perform high-quality analysis in these problems. The Apriori algorithm is the most emblematic technique for mining association rules.
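
The retailer example can be made concrete with the two measures usually attached to a rule "X => Y": support (how often X and Y appear together) and confidence (how often Y appears when X does). The sketch below only computes these measures for one rule; it does not implement the Apriori search itself. Function names and the toy transactions are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical toy transaction data: one set of purchased items per customer.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support, rule_confidence)
```

With only four transactions these numbers mean very little, which illustrates why association analysis needs large volumes of transaction data.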

The techniques described above are the data mining methods most commonly used to solve data science problems.

Do you disagree with any of the above? Have I forgotten some methods (if so, please leave a comment)? Do you agree with this classification, or do you use another one to arrange these data mining methods?