What is data mining? How to do data mining?

Data mining is the automated process of sorting through large data sets to identify trends and patterns and to establish relationships that help solve business problems. In other words, data mining extracts information and knowledge that is implicit, previously unknown, and potentially useful from large amounts of incomplete, noisy, fuzzy, and random data.

In principle, data mining can be applied to any type of information repository as well as to transient data such as data streams. Examples include databases, data warehouses, data marts, transaction databases, spatial databases (such as maps), engineering design data (such as architectural designs), multimedia data (text, images, video, audio), network data, and time series databases. Because of this variety, the data involved in data mining has the following characteristics:

(1) The data set is large and incomplete

Data mining requires a very large data set. The larger the data set, the closer the discovered patterns come to the real underlying patterns, and the more accurate the results. In addition, the data is often incomplete.
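One way to see why more data tends to give more accurate results is the standard error of a sample mean, which shrinks with the square root of the sample size. The sketch below is only an illustration under the assumption of independent observations with a known spread; the function name `standard_error` and the value 10.0 are made up for the example and do not come from any particular data mining tool.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean for n independent observations."""
    if n <= 0:
        raise ValueError("n must be positive")
    return std_dev / math.sqrt(n)


# Assumed spread of 10.0 for every sample size; only n changes.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(10.0, n))
```

Under these assumptions, a data set 100 times larger cuts the uncertainty of the estimate by a factor of 10, which is why the gain from extra data is real but gradual.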

(2) Inaccuracy

Data mining results can be inaccurate, mainly because of noisy data. For example, in business, users may provide false data; in a factory environment, normal readings are often affected by electromagnetic or radiation interference and fall outside the normal range. Such abnormal values, which could not occur under normal conditions, are called noise, and they lead to inaccuracies in data mining.
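As a rough illustration of how such out-of-range readings might be flagged before mining, the sketch below uses a modified z-score based on the median absolute deviation. This is one common heuristic, not the only way to handle noise; the function name `flag_noise`, the threshold of 3.5, and the sample readings are assumptions chosen for the example.

```python
from statistics import median


def flag_noise(readings, threshold=3.5):
    """Return the indices of readings that look like noise.

    Uses the modified z-score: 0.6745 * |x - median| / MAD,
    where MAD is the median absolute deviation.
    """
    med = median(readings)
    mad = median(abs(x - med) for x in readings)
    if mad == 0:
        # All typical values are identical; anything different stands out.
        return [i for i, x in enumerate(readings) if x != med]
    return [
        i for i, x in enumerate(readings)
        if 0.6745 * abs(x - med) / mad > threshold
    ]


# Hypothetical sensor readings with one spike caused by interference.
readings = [20.1, 19.8, 20.3, 20.0, 95.7, 19.9, 20.2]
noisy = flag_noise(readings)
print(noisy)  # index of the 95.7 reading
```

The median is used instead of the mean because a single extreme reading can pull the mean far enough to hide itself.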

(3) Fuzzy and random

The data is also fuzzy and random. Fuzziness is related to inaccuracy: because the data is inaccurate, it may only be possible to examine it as a whole, and because private information is involved, some specific details may be unavailable. Any analysis performed under these conditions can only be done at a general level, and some questions cannot be judged precisely.

There are two explanations for the randomness. The first is that the data obtained is random: we cannot know in advance what a user will fill in. The second is that the analysis results are random: if the data is handed to a machine for judgment and learning, the operations are gray-box operations whose inner workings are only partly visible.
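The second kind of randomness can be illustrated with a small sketch: an analysis that relies on random sampling may return a different answer on each run unless the random seed is fixed. The helper `sampled_mean` and the data below are hypothetical and stand in for any procedure with a random component.

```python
import random


def sampled_mean(data, sample_size, seed):
    """Estimate the mean of data from a random sample drawn with a given seed."""
    rng = random.Random(seed)  # private generator, so runs do not interfere
    sample = rng.sample(data, sample_size)
    return sum(sample) / sample_size


data = list(range(1, 1001))  # made-up data set: the integers 1..1000

run_a = sampled_mean(data, 50, seed=1)
run_b = sampled_mean(data, 50, seed=2)        # different seed, generally a different estimate
run_a_again = sampled_mean(data, 50, seed=1)  # same seed, same estimate

print(run_a, run_b, run_a_again)
```

Fixing the seed does not remove the randomness of the method, but it makes a given result reproducible, which helps when results need to be compared or audited.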