
Naive Bayes Classification —— The Road to Simplicity

Given m samples, where x is the feature variable and y is the corresponding category.

We need a model function h that can predict the category of new samples as accurately as possible.

Many machine learning algorithms construct the model function h from the perspective of error: first assume an h, define an error between h(x) and y, and then gradually reduce that error to obtain a fitted model h.

Now let's consider it from the perspective of probability.

Suppose y can take one of k categories, that is,

$$y \in \{C_1, C_2, \ldots, C_k\}$$

For a sample x, if we can calculate the conditional probability of each category given x, then the category with the greatest probability can be taken as the category to which x belongs. For conditional probability and Bayes' theorem, please refer to Understanding Bayes Theorem. That is,

$$\hat{y} = \arg\max_{C_i} P(C_i \mid x)$$

Given m samples,

$$(x^{(1)}, y^{(1)}),\ (x^{(2)}, y^{(2)}),\ \ldots,\ (x^{(m)}, y^{(m)})$$

x is an n-dimensional feature variable, that is,

$$x = (x_1, x_2, \ldots, x_n)$$

y is the corresponding category, and there are k categories, that is,

$$y \in \{C_1, C_2, \ldots, C_k\}$$

For any given sample x, we need to calculate the probability that x belongs to each category; the category with the largest probability is the one to which the sample x is assigned.

Now we need to calculate $P(C_k \mid x)$. Applying Bayes' theorem:

$$P(C_k \mid x) = \frac{P(x_1, x_2, \ldots, x_n \mid C_k)\, P(C_k)}{P(x_1, x_2, \ldots, x_n)} \tag{4}$$

The term $P(x_1, x_2, \ldots, x_n \mid C_k)$ is a conditional joint probability: the probability that the features take a specific set of values (namely the feature values of the sample x to be predicted) within category $C_k$. This probability is not easy to calculate directly, so, for convenience, naive Bayes makes its grand entrance. "Naive" here means assuming that the features of x are conditionally independent of each other given the category (see Wikipedia-Conditional Independence). Therefore

$$P(x_1, x_2, \ldots, x_n \mid C_k) = \prod_{j=1}^{n} P(x_j \mid C_k) \tag{5}$$

This transformation is essentially the fact that the joint distribution of independent variables equals the product of each variable's own distribution (refer to Wikipedia-joint distribution). Here we are dealing with conditional probabilities, but because the same condition $C_k$ appears on both sides of the transformation, from the point of view of the sample space it is still the joint distribution being converted into a product of individual distributions. For the understanding of sample space, please refer to Understanding Bayes Theorem.

Substituting (5) back into (4) gives

$$P(C_k \mid x) = \frac{P(C_k) \prod_{j=1}^{n} P(x_j \mid C_k)}{P(x_1, x_2, \ldots, x_n)} \tag{6}$$

For any given sample, the values of x are fixed, and $P(x)$ does not depend on $C_k$, so $P(x)$ can be regarded as a constant and ignored when comparing categories.

So we obtain

$$h(x) = \arg\max_{C_k} P(C_k) \prod_{j=1}^{n} P(x_j \mid C_k) \tag{7}$$

This is the model function of the naive Bayes classifier.
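As a minimal sketch of how decision rule (7) is applied, here is a small Python function, assuming the prior probabilities and the per-feature conditional probabilities have already been estimated in some way (the names `priors` and `cond_prob` are placeholders introduced for this illustration, not from the original article):

```python
import math

def naive_bayes_predict(x, priors, cond_prob):
    """Return the class C_k that maximizes P(C_k) * prod_j P(x_j | C_k).

    priors:    dict mapping class -> P(C_k)
    cond_prob: dict mapping (class, feature_index, value) -> P(x_j = value | C_k)
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        # Sum logs instead of multiplying probabilities to avoid numerical underflow.
        score = math.log(prior)
        for j, value in enumerate(x):
            score += math.log(cond_prob[(c, j, value)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```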

There are two main terms in the above formula, $P(C_k)$ and $P(x_j \mid C_k)$. Let's see how to calculate each of them.

$P(C_k)$ is easy to calculate: we simply estimate the probability by frequency, counting how many of the m samples belong to each category. Assuming that $m_k$ of the m samples belong to category $C_k$, then

$$P(C_k) = \frac{m_k}{m} \tag{8}$$

In order to calculate $P(x_j \mid C_k)$, we need to assume how the feature data are distributed. This assumption about the feature distribution is called the event model, and the following three assumptions are commonly used: multinomial, Bernoulli, and Gaussian.

For discrete features, the multinomial event model also estimates the conditional probability by frequency. Let $m_{k,j,s}$ be the number of samples of category $C_k$ whose j-th feature takes its s-th possible value $a_{j,s}$; then

$$P(x_j = a_{j,s} \mid C_k) = \frac{m_{k,j,s}}{m_k} \tag{9}$$

Sometimes the number of samples in which a feature takes a specific value within a category is 0, which makes the whole product in (7) become 0 and seriously distorts the estimated probability. For example, if the value $a_{j,s}$ never occurs among the samples of $C_k$, the unsmoothed estimate (9) is 0. Laplace smoothing is usually applied to avoid this situation, that is,

$$P(x_j = a_{j,s} \mid C_k) = \frac{m_{k,j,s} + \lambda}{m_k + S_j \lambda} \tag{10}$$

where $S_j$ is the number of distinct values that feature j can take.

Usually we take $\lambda = 1$.

Substituting (8) and (10) into the Bayes classifier (7), we obtain

$$h(x) = \arg\max_{C_k} \frac{m_k}{m} \prod_{j=1}^{n} \frac{m_{k,j,s} + \lambda}{m_k + S_j \lambda} \tag{11}$$

where s denotes the value taken by feature j in the sample x being classified.
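To make formula (11) concrete, here is a hedged, minimal Python sketch of a multinomial naive Bayes classifier over discrete features, written for this article as a plain counting implementation (not the code of any particular library):

```python
from collections import Counter, defaultdict
import math

class SimpleNaiveBayes:
    """Multinomial naive Bayes for discrete features, with Laplace smoothing."""

    def __init__(self, laplace=1.0):
        self.laplace = laplace

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.m = len(y)
        self.m_k = Counter(y)                      # samples per class: m_k
        self.value_counts = defaultdict(Counter)   # (class, feature j) -> value counts m_{k,j,s}
        self.S = defaultdict(set)                  # feature j -> set of observed values (size S_j)
        for features, label in zip(X, y):
            for j, value in enumerate(features):
                self.value_counts[(label, j)][value] += 1
                self.S[j].add(value)
        return self

    def predict(self, x):
        best_class, best_log_score = None, float("-inf")
        for c in self.classes:
            # log of the empirical prior m_k / m
            log_score = math.log(self.m_k[c] / self.m)
            for j, value in enumerate(x):
                num = self.value_counts[(c, j)][value] + self.laplace
                den = self.m_k[c] + len(self.S[j]) * self.laplace
                log_score += math.log(num / den)
            if log_score > best_log_score:
                best_class, best_log_score = c, log_score
        return best_class
```

With the play-tennis data used below, `fit` would receive the 14 (features, label) pairs and `predict` the sample ('sunny', 'cool', 'high', 'strong').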

The following example shows how to estimate the conditional probabilities from the sample set when the features take discrete values.

Decide whether to play tennis according to the weather.

This case comes from the article Naive Bayesian Classifier (see the references at the end).

The data set (the classic 14-sample "play tennis" table) records a classmate's decisions about whether to play tennis under different weather conditions.

Suppose today's weather is: outlook = sunny, temperature = cool, humidity = high, wind = strong. Will the classmate play tennis?

The features here (outlook, temperature, humidity, and wind) are all discrete variables, so the multinomial naive Bayes method described above applies. Formula (11) is repeated here for easy reference:

$$h(x) = \arg\max_{C_k} \frac{m_k}{m} \prod_{j=1}^{n} \frac{m_{k,j,s} + \lambda}{m_k + S_j \lambda} \tag{11}$$

We need to estimate the probability for the two cases, k = yes (play) and k = no (do not play).

Counting the samples in each case from the table gives:

Total number of samples: m = 14

Number of samples that played (k = yes): 9

Number of samples that did not play (k = no): 5

Outlook values (sunny / overcast / rainy), so S_outlook = 3

Number of samples that played when it was sunny (k = yes, j = outlook, s = sunny): 2

Number of samples that did not play when it was sunny (k = no, j = outlook, s = sunny): 3

Temperature values (hot / mild / cool), so S_temperature = 3

Number of samples that played when it was cool (k = yes, j = temperature, s = cool): 3

Number of samples that did not play when it was cool (k = no, j = temperature, s = cool): 1

Humidity values (high / normal), so S_humidity = 2

Number of samples that played when humidity was high (k = yes, j = humidity, s = high): 3

Number of samples that did not play when humidity was high (k = no, j = humidity, s = high): 4

Wind values (strong / weak), so S_wind = 2

Number of samples that played when the wind was strong (k = yes, j = wind, s = strong): 3

Number of samples that did not play when the wind was strong (k = no, j = wind, s = strong): 3

Substituting the above data into formula (11), the score for playing (k = yes) for this sample is

$$\frac{9}{14} \cdot \frac{2+1}{9+3} \cdot \frac{3+1}{9+3} \cdot \frac{3+1}{9+2} \cdot \frac{3+1}{9+2} \approx 0.007084$$

and the score for not playing (k = no) is

$$\frac{5}{14} \cdot \frac{3+1}{5+3} \cdot \frac{1+1}{5+3} \cdot \frac{4+1}{5+2} \cdot \frac{3+1}{5+2} \approx 0.01822$$

Since 0.01822 > 0.007084, the classmate will probably not play. After normalization,

Probability of not playing = 0.01822 / (0.01822 + 0.007084) ≈ 72%.

(Note: the result here differs from the original case because Laplace smoothing is applied here while the original case did not use it. In this example no feature count is actually 0, so Laplace smoothing is not strictly needed, but the calculation above follows formula (11) as written.)
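As a sanity check, the following short Python snippet (written for this article, with the counts above hard-coded) substitutes the tennis counts into formula (11) and reproduces the two scores and the normalized probability:

```python
# Counts from the play-tennis table (m = 14 samples, 9 "yes", 5 "no").
m, m_yes, m_no = 14, 9, 5
lam = 1  # Laplace smoothing parameter

# (count within class, class size, number of possible values S_j) for
# outlook=sunny, temperature=cool, humidity=high, wind=strong.
yes_terms = [(2, m_yes, 3), (3, m_yes, 3), (3, m_yes, 2), (3, m_yes, 2)]
no_terms = [(3, m_no, 3), (1, m_no, 3), (4, m_no, 2), (3, m_no, 2)]

def score(prior_count, terms):
    s = prior_count / m
    for count, m_k, S_j in terms:
        s *= (count + lam) / (m_k + S_j * lam)
    return s

score_yes = score(m_yes, yes_terms)   # ~0.007084
score_no = score(m_no, no_terms)      # ~0.01822
print(round(score_yes, 6), round(score_no, 6))
print("P(no) =", round(score_no / (score_yes + score_no), 2))  # ~0.72
```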

In addition, it should be noted that the Bernoulli distribution is actually a special case of the multinomial distribution, so we can use either the Bernoulli formula (12) above or the earlier multinomial formula (11) for the calculation.

The Bernoulli event model can be used for text-related tasks such as spam classification. For example, a vector of 5000 different words is constructed as the input feature x. For a given piece of text, the positions in x corresponding to words that appear in the text are set to 1 and all other positions are set to 0, so each feature (word) in x takes the value 1 or 0, which matches a Bernoulli distribution.
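A minimal sketch of that feature construction, assuming a small illustrative vocabulary (both the vocabulary and the sample text are made up for this example):

```python
# Build a binary (Bernoulli) word-presence vector for a piece of text.
vocabulary = ["free", "money", "meeting", "tomorrow", "winner"]  # illustrative; a real task might use ~5000 words

def text_to_bernoulli_vector(text, vocabulary):
    # Lowercase, split on whitespace, and strip simple punctuation.
    words = set(word.strip(",.!?") for word in text.lower().split())
    # 1 if the vocabulary word appears in the text, 0 otherwise.
    return [1 if word in words else 0 for word in vocabulary]

print(text_to_bernoulli_vector("You are a WINNER, claim your free money now", vocabulary))
# -> [1, 1, 0, 0, 1]
```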

When a feature takes continuous values, a Gaussian distribution is usually assumed for $P(x_j \mid C_k)$, with its mean and variance estimated from the samples of each category. For a worked case, please refer to Wikipedia-Cases-Gender Classification.
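A minimal sketch of the Gaussian event model, assuming each continuous feature is normally distributed within each category; the density below is the standard Gaussian form, and the numbers in the example are purely hypothetical:

```python
import math

def gaussian_likelihood(x_j, mean, var):
    """P(x_j | C_k) under a Gaussian assumption; mean and var are estimated
    from the samples of class C_k for this particular feature."""
    return math.exp(-(x_j - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical class-conditional statistics for a continuous "height" feature (cm).
print(gaussian_likelihood(180.0, mean=170.0, var=50.0))
```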

Another common technique for dealing with continuous-valued features is to discretize the continuous values. Generally speaking, when the number of training samples is small, or when the exact distribution of the data is known, the probability-distribution approach is the better choice.

When the number of samples is large, discretization performs better, because a large sample set can reveal the actual distribution of the data without requiring a "naive" assumption about it. Many tasks do provide plenty of samples, so discretization is usually preferred over estimating a probability distribution.
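A minimal sketch of simple equal-width discretization (the bin boundaries are arbitrary placeholders); once binned, a continuous feature can be handled with the multinomial formulas above:

```python
import bisect

def discretize(value, boundaries):
    """Map a continuous value to a bin index using sorted bin boundaries."""
    return bisect.bisect_right(boundaries, value)

# Example: turn a continuous "temperature" reading into one of 4 bins.
temperature_boundaries = [10.0, 20.0, 30.0]   # placeholder cut points
print(discretize(25.3, temperature_boundaries))  # -> 2
```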

By the way, every time I see the word "naive", it is as if I see Bayes wearing patched clothes. "Naive" suggests inexperience, simplicity, even ignorance and credulity. From the derivation above, the naive Bayes classifier does adopt several simplifying assumptions, such as assuming that the features of x are conditionally independent, and assuming that the feature data follow a multinomial, Bernoulli, or Gaussian distribution. These assumptions may not fully match reality: out of "ignorance" of the treacherous real world, some naive assumptions are adopted.

However, "naive" has another meaning: simple and pure. In this sense, naive Bayes classification can also be seen as a kind of simplicity, in the spirit of "great wisdom that appears foolish".

The main advantages of naive Bayes are:

1) The algorithm is simple and its classification efficiency is stable.

2) It performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training; in particular, when the data does not fit in memory, it can be trained incrementally in batches.

3) Insensitive to missing data.

The main disadvantages of naive Bayes are:

1) If the "naive" assumptions do not match the actual data, the model's performance suffers.

2) The representation of the input features (continuous, discrete, or binary) affects the probability calculation and therefore the classification performance of the model.

References:

A Summary of the Principles of the Naive Bayes Algorithm

Naive Bayesian Classifier

Wikipedia - Naive Bayes Classifier

Understanding Bayes Theorem