What is data mining? Some people treat data mining as a synonym for the process of discovering knowledge in data (the Knowledge Discovery in Databases process). Others see data mining (DM) as the main step of Knowledge Discovery in Databases (KDD). I prefer to answer the question by defining the data mining process, also known as the most common data mining methodology.
What is data mining?
There are several definitions of data mining. I prefer to define it as “the nontrivial process of identifying valuable information in data”. Generally speaking, data mining is the main step of Knowledge Discovery in Databases (KDD), where the KDD process is understood as the exploratory data analysis carried out to discover understandable patterns in large databases.
What characterizes the KDD process is the broad agreement among researchers on its stages. Even though there are several ways to describe this process (each with its own advantages and disadvantages), I prefer to adopt a methodology that has been widely used in recent times. So, what is data mining? To answer that question properly, you need to know the general process for dealing with a data mining problem.
What is the data mining process?
The data mining process has six main phases whose ultimate goal is to identify valid, potentially useful and understandable patterns in data. This data mining methodology groups the KDD stages carried out in data mining and summarizes the process as follows.
- Problem Specification. Every problem addressed by data mining requires not only the knowledge of the data scientist but also the relevant prior knowledge of domain experts, in order to capture the objectives pursued by the end user. During this first phase of the process, both sides must define and organize the application domain so that information flows properly through the rest of the process.
- Problem Understanding. Exchanging information is not enough: real mutual understanding between both sides (the data scientist with their approach and the experts with their knowledge of the specific problem) is needed to achieve a high degree of reliability in the results.
- Data Preprocessing. Preprocessing is probably the most neglected stage of the process despite its great impact on the final results. Besides the selection and extraction of features and examples from the database, data preprocessing includes operations for data cleaning (removing noise, inconsistent data, etc.), data integration (combining multiple data sources), data transformation (transforming and consolidating data into forms appropriate for specific tasks) and data reduction.
- Data Mining (itself). Essentially, this is the process of extracting valid patterns from the data. This stage includes choosing the most suitable data mining task and the algorithm itself (belonging to the regression, classification or clustering families). Finally, the selected algorithm is adapted by tuning its essential parameters and validated (for reliability) before it is used in the real world.
- Evaluation. When a solution is evaluated, the most important tasks are estimating and interpreting the mined patterns based on real interestingness measures. Data scientists become aware of the real-world problem through their results, and the domain experts agree on the results of the data mining process.
- Result Exploitation. The last stage involves transferring the knowledge acquired from the data patterns into the real world. In most cases, this means incorporating the results into a system to be applied in further processes; in other cases, it means reporting the discovered knowledge through powerful visualizations.
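To make the Data Preprocessing stage a little more concrete, here is a minimal sketch in plain Python of its four operations (cleaning, integration, transformation and reduction). The records, field names and helper functions are all hypothetical and invented for illustration; a real project would usually rely on a dedicated data library and on rules agreed with the domain experts.

```python
# Hypothetical toy records from two sources, keyed by customer id.
purchases = [
    {"id": 1, "spend": 120.0},
    {"id": 2, "spend": None},    # missing value
    {"id": 3, "spend": 300.0},
    {"id": 3, "spend": 300.0},   # exact duplicate
    {"id": 4, "spend": 60.0},
]
profiles = {1: {"age": 34}, 2: {"age": 51}, 3: {"age": 27}, 4: {"age": 45}}


def clean(records):
    """Data cleaning: drop records with missing values and exact duplicates."""
    seen, result = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if None in rec.values() or key in seen:
            continue
        seen.add(key)
        result.append(rec)
    return result


def integrate(records, lookup):
    """Data integration: join each record with a second source on its id."""
    return [{**rec, **lookup[rec["id"]]} for rec in records if rec["id"] in lookup]


def transform(records, field):
    """Data transformation: min-max scale one numeric field to [0, 1]."""
    values = [rec[field] for rec in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all values are equal
    return [{**rec, field: (rec[field] - lo) / span} for rec in records]


def reduce_features(records, keep):
    """Data reduction: keep only the selected features."""
    return [{name: rec[name] for name in keep} for rec in records]


prepared = reduce_features(
    transform(integrate(clean(purchases), profiles), "spend"),
    keep=("spend", "age"),
)
print(prepared)
```

Each function maps to one of the operations named in the list, so the order of the calls mirrors the order in which the operations are described. In practice the steps are often revisited several times.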
As you have probably noticed, all the stages of this data mining process are interconnected. In fact, the KDD (Knowledge Discovery in Databases) process is a self-organized scheme in which the results of each stage condition the remaining steps, to the point of going back to earlier stages if that is required to better solve the problem. In addition, this data mining process becomes an iterative path in which repeated refinements are needed to improve the quality of the solution in the real world.
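The feedback between the Data Mining and Evaluation stages can be sketched as a small loop. The example below is only an illustration under stated assumptions: the validation data, the one-parameter threshold classifier and the names `mine`, `evaluate` and `run_process` are all hypothetical, and accuracy stands in for whatever measure the data scientist and the experts agree on.

```python
# Hypothetical labelled validation data: (feature value, class label).
validation = [(2.5, 0), (4.0, 0), (5.5, 1), (7.5, 1)]


def mine(threshold):
    """Data mining step: a one-parameter threshold classifier."""
    return lambda x: 1 if x >= threshold else 0


def evaluate(model, data):
    """Evaluation step: fraction of examples classified correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def run_process(candidates, target):
    """Repeat mining and evaluation until the target accuracy is reached."""
    history = []
    for threshold in candidates:
        score = evaluate(mine(threshold), validation)
        history.append((threshold, score))
        if score >= target:
            break  # good enough: move on to result exploitation
    best = max(history, key=lambda item: item[1])
    return best, history


best, history = run_process(candidates=[2.0, 3.5, 5.0], target=1.0)
print(history)  # [(2.0, 0.5), (3.5, 0.75), (5.0, 1.0)]
print(best)     # (5.0, 1.0)
```

Each pass through the loop is one round of "mine, evaluate, go back and adjust". If no candidate reaches the target, the history still records what was tried, which is the kind of information that would send the team back to preprocessing or even to problem specification.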
The data mining step itself comprises a range of data mining methods that are covered in more depth elsewhere on this website.
What do you think about this widely adopted data mining process? Where do you find the most problems in this methodology? Which is the worst problem you find in this overall data mining process?