These are some of the most important aspects to consider for data preprocessing in data mining before applying the data science methodology. Especially if you are new to data science, you must recognize not only the great relevance of the data preprocessing step but also the following aspects before conducting a data mining study.
PRE – Data Preprocessing in Data Mining
In this article, the most relevant aspects that must be considered before starting a data mining process are described below.
What are the “predictors/features”?
In any kind of predictive model, you need predictors (also called features, especially in classification methods) to predict another variable (let’s say “y”). But which predictors should those be? The more predictors you use to predict a variable “y”, the slower the prediction process (computing time). Moreover, it is not guaranteed that the prediction will be better (smaller error) than with fewer predictors. Therefore, the predictors must be chosen carefully during data preprocessing in data mining. Some investigation into the data helps you understand which predictors have a strong correlation with the outcome variable. We’ll discuss feature selection in more detail in other articles of this knowledge base.
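One simple way to investigate the data, as described above, is to rank candidate predictors by their correlation with the outcome. Below is a minimal sketch using pandas; the dataset and the column names `x1`, `x2`, `x3`, `y` are hypothetical placeholders, not data from this article.

```python
# Sketch: correlation-based screening of predictors against an outcome "y".
# The dataset is synthetic: x1 drives y strongly, x2 weakly, x3 is noise.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),  # strongly related to y (by construction)
    "x2": rng.normal(size=n),  # weakly related to y
    "x3": rng.normal(size=n),  # pure noise
})
df["y"] = 3.0 * df["x1"] + 0.3 * df["x2"] + rng.normal(scale=0.5, size=n)

# Rank predictors by absolute correlation with the outcome variable.
corr = df.drop(columns="y").corrwith(df["y"]).abs().sort_values(ascending=False)
print(corr)
```

Predictors at the bottom of this ranking are candidates for removal, though correlation only captures linear association, so it is a first screen rather than a full feature-selection method.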
How should the data be split?
In any case, data must be properly allocated to the different tasks (e.g., model building, evaluating performance) of data preprocessing in data mining. Especially if the pool of data is small, how the training and test sets are determined affects how the model will perform. The question is: how much data should be allocated to the training and test sets? A small test set has limited utility for judging the performance of the model (in this case, resampling techniques (i.e., no test set) might be more effective). For large data sets, this decision is less critical. If you are extrapolating to a different population, a simple random sample of the data would be more appropriate.
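The split described above can be sketched with scikit-learn; the 80/20 ratio and the synthetic `X`, `y` arrays are illustrative choices, not a recommendation from this article.

```python
# Sketch: a simple random train/test split with a hold-out test set.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)  # placeholder predictors
y = np.arange(100)                 # placeholder outcome

# Hold out 20% of the data for the final performance estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With very small data sets, the article's advice applies: skip the hold-out set and rely on resampling (e.g., cross-validation) instead.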
How should the performance of the model be estimated?
Before using the test set, two techniques are commonly employed to determine the effectiveness of the model.
- First, quantitative assessment of statistics (such as the RMSE) using resampling helps the user understand how each technique would perform on new data.
- Second, qualitative assessment, such as creating simple visualizations of a model (for example, plotting the observed against the predicted values), helps discover areas of the data where the model is particularly good or bad.
In my personal opinion, this type of visual information is critical for improving models, so it should be included in the statistics summary of the model.
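Both assessments above can be sketched with scikit-learn before the test set is ever touched; the linear data below is synthetic, and the choice of 5-fold cross-validation is an assumption for illustration.

```python
# Sketch: estimating RMSE by resampling, plus observed-vs-predicted values.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=150)

model = LinearRegression()

# Quantitative assessment: cross-validated RMSE over resampled folds.
rmse = -cross_val_score(model, X, y, cv=5,
                        scoring="neg_root_mean_squared_error").mean()
print(f"CV RMSE: {rmse:.3f}")

# Qualitative assessment: out-of-fold predictions, ready for a plot of
# observed vs predicted values (e.g. plt.scatter(y, y_pred) in matplotlib).
y_pred = cross_val_predict(model, X, y, cv=5)
```

A scatter of `y` against `y_pred` that bends away from the diagonal in some region is exactly the kind of visual evidence the second bullet refers to.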
How should the different models be evaluated/compared?
I’ve seen that some modeling practitioners have a favorite model that they rely on indiscriminately. However, there is no single model that will always do better than any other without substantive information about the problem (the “No Free Lunch” theorem [Wolpert, 1996]). Therefore, a wide variety of techniques should be tested before determining which model to focus on. For example, you can start with a simple plot of the data to observe the relationship between the outcome and the predictors. Given this knowledge, you might exclude linear models if there is a nonlinear relationship between them. In fact, there is a variety of techniques to quickly evaluate different models.
One might say that “this model is always the best performing model”. However, a simple quadratic model is extremely competitive for this data. – One of my teachers.
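The situation in that quote can be reproduced as a sketch: on synthetic data with a quadratic relationship, a simple quadratic model beats a plain linear one under the same resampling folds. The data and both candidate models are illustrative assumptions.

```python
# Sketch: comparing candidate models on the same cross-validation folds.
# The truth here is quadratic, so the linear model should lose.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.3, size=200)  # quadratic truth

candidates = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2),
                               LinearRegression()),
}

# Mean cross-validated R^2 for each candidate, on identical folds.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="r2").mean()
          for name, m in candidates.items()}
print(scores)
```

Swapping the data-generating process would change which candidate wins, which is precisely the point of the “No Free Lunch” theorem mentioned above.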
Which model should be chosen?
A model should be selected to solve a given real-world problem depending on the application of its solution. As you probably know, a better estimation/solution commonly consumes more computing resources. Rather than high precision, the quickest prediction/classification may be required in some cases. In either case, you must rely on data splitting to produce a quantitative assessment of the model on your test set to make the choice. However, do not forget that you’re looking for the best performing model (computing time vs. reliability) according to the application requirements of the real-world problem.
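The computing-time-vs-reliability trade-off above can be measured directly; a minimal sketch, assuming a synthetic classification task and two placeholder candidates (a logistic regression and a random forest):

```python
# Sketch: weighing test-set accuracy against prediction time per model.
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

results = {}
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(n_estimators=200,
                                                      random_state=0))]:
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    acc = model.score(X_te, y_te)      # accuracy on the held-out test set
    elapsed = time.perf_counter() - start
    results[name] = (acc, elapsed)
    print(f"{name}: accuracy={acc:.3f}, scoring time={elapsed:.4f}s")
```

If the application is latency-sensitive, the faster model may be the right choice even when its accuracy is slightly lower.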
One goal of this knowledge base of “ION Data Science” is to help novice data scientists gain intuition regarding the strengths and weaknesses of different models so they can make data-driven decisions.
What other aspects must be considered before starting a data mining process? Are there other points that must be studied before data preprocessing in data mining? Are these aspects well enough described to be understood, or do they need further specification? Please comment below.