Data mining is a subfield of computer science: the process of discovering patterns in large data sets using methods from artificial intelligence, machine learning, statistics, and database systems. Its overall goal is to extract information from a data set and transform it into an understandable structure that organisations can use.
Data mining also involves aspects of data management, data pre-processing, and visualisation. It is the analysis step of the knowledge discovery in databases process: the extraction of patterns and knowledge from large amounts of data. The term is frequently applied to any form of computer decision support, including machine learning and business intelligence; often the terms data analysis and analytics are more appropriate.
The actual data mining task is the analysis of large quantities of data to extract previously unknown patterns, such as groups of data records. Database techniques such as spatial indices are normally used. The discovered patterns can be seen as a summary of the input data; for example, data mining can identify multiple groups in the data, and a decision support system can use those groups to obtain more accurate predictions.
Data collection, data preparation, and reporting are not part of the data mining step itself but are additional steps in the overall process. Data dredging, data fishing, and data snooping are terms for applying data mining methods to parts of a larger population data set. These methods can be used to create new hypotheses to test against the larger data population.
The term data mining appeared around 1990 in the database community; at first, database mining was used. The term data mining became more popular in the business community from about 2007 onwards. The terms predictive analytics and data science have also been used to describe the field. The journal Data Mining and Knowledge Discovery is the main research journal of the field.
Data mining involves six common classes of tasks. Anomaly detection identifies unusual data records. Association rule learning searches for relationships between variables; using it, for example, a supermarket can determine which products are frequently bought together and use this information for marketing. Clustering is the task of finding groups and structures in the data that are in some way similar, without using known structures in the data.
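The supermarket example can be made concrete with a minimal sketch of the two standard association rule metrics, support and confidence. The transactions and item names below are invented for illustration; real association rule learners (such as Apriori) search over many candidate rules, while this sketch scores just one.

```python
# Score a single candidate rule, bread -> butter, over invented baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)

support = both / n          # fraction of all baskets containing both items
confidence = both / bread   # of baskets with bread, fraction that also have butter

print(f"support={support:.2f} confidence={confidence:.2f}")
# -> support=0.60 confidence=0.75
```

A rule with high support and high confidence (here, three of four bread buyers also bought butter) is the kind of pattern a retailer could act on.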
Classification is the task of generalising known structure to apply to new data. Regression attempts to find a function that models the data with the least error. Summarisation provides a more compact representation of the data set, including visualisation and report generation.
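The regression task can be illustrated with its simplest case: fitting a straight line by ordinary least squares. This is a small sketch with made-up points that lie exactly on a line, so the fitted function recovers it.

```python
# Least-squares fit of y = a*x + b to a few made-up points.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1, so the fit is exact

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope and intercept from the normal equations.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

print(f"y = {a:.1f}x + {b:.1f}")
# -> y = 2.0x + 1.0
```

With noisy real data the line will not pass through every point; least squares simply chooses the line with the smallest total squared error.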
Data mining can produce results that appear to be correct but do not actually predict future behaviour and cannot be usefully applied. This usually happens because too many hypotheses are investigated without proper statistical hypothesis testing. The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms hold in the wider data set; not every pattern found in the mined data will.
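The "too many hypotheses" problem can be demonstrated with a small simulation (the sample sizes and counts below are arbitrary choices for illustration): running many significance tests on pure noise still produces a steady stream of apparent discoveries.

```python
import math
import random

# Draw 1000 samples of pure noise and z-test each one; no real effect
# exists, yet roughly 5% of the tests come out "significant" by chance.
random.seed(0)

n_tests, sample_size, alpha_z = 1000, 20, 1.96
false_positives = 0
for _ in range(n_tests):
    sample = [random.gauss(0.0, 1.0) for _ in range(sample_size)]
    # z-statistic for the sample mean against a true mean of zero.
    z = (sum(sample) / sample_size) / (1.0 / math.sqrt(sample_size))
    if abs(z) > alpha_z:  # two-sided test at the 5% level
        false_positives += 1

print(f"{false_positives} of {n_tests} noise hypotheses look significant")
```

A data mining run that scores thousands of candidate patterns faces exactly this effect, which is why confirmed patterns must be re-tested on data that was not mined.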
When a data mining algorithm finds patterns in the training set that are not present in the general data set, this is called overfitting. To guard against it, evaluation uses a test set of data on which the algorithm was not trained. The learned patterns are applied to this test set and the results are compared to the desired output. For example, a data mining algorithm that tries to tell the difference between spam and normal e-mails would be trained on a training set of sample e-mails. The learned patterns would then be applied to the test set on which it had not been trained; accuracy can be measured from how many e-mails it correctly classifies.
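The spam example can be sketched end to end with a toy classifier. The e-mails, labels, and the word-set "learning" rule below are all invented for illustration; the point is only the train/test discipline: patterns are learned from one set and accuracy is measured on another.

```python
# Toy train/test evaluation of a keyword-based spam filter.
train = [
    ("win a free prize now", True),       # True = spam
    ("free money win big", True),
    ("meeting agenda attached", False),
    ("lunch at noon tomorrow", False),
]
test = [
    ("claim your free prize", True),
    ("project meeting moved", False),
]

# "Training": collect words that appear only in spam messages.
spam_words = set()
ham_words = set()
for text, is_spam in train:
    (spam_words if is_spam else ham_words).update(text.split())
spam_words -= ham_words

# Apply the learned pattern to the unseen test set and measure accuracy.
correct = 0
for text, is_spam in test:
    predicted = any(w in spam_words for w in text.split())
    correct += (predicted == is_spam)

accuracy = correct / len(test)
print(f"test accuracy: {accuracy:.0%}")
```

An overfitted model would score well on the training e-mails but poorly on this held-out set, which is exactly what the separate test set is designed to reveal.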