How to understand K-L divergence (relative entropy)
See Appendix 1 for the definition of K-L divergence. Appendix 5 explains why cross entropy, rather than K-L divergence, is used as the loss when training deep learning models.
Let's build up K-L divergence through the following problem. Suppose we are a team of scientists studying space worms on a distant planet, and we have recorded the empirical probability distribution of the number of teeth (0 to 10) in the worms we observed.
These data are valuable, but there is a problem: we are far from Earth, and sending the full probability distribution back is expensive. Fortunately, we are a group of clever scientists: approximating the original data with a simple model that has only one or two parameters will greatly reduce the amount of data to transmit. The simplest approximate model is the uniform distribution. Since a worm has at most 10 teeth, there are 11 possible values, each assigned probability 1/11. The uniform approximation looks like this:
Obviously, our raw data is not uniformly distributed, but it also does not match any common distribution we know. Another simple model that comes to mind is the binomial distribution. A worm's mouth has n = 10 tooth sockets, and whether each socket contains a tooth is an independent event with probability p, so the expected number of teeth is E[x] = n*p. Matching this expectation to the observed mean, 5.7, gives p = 0.57 and the binomial distribution shown in the following figure:
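To make the two candidate models concrete, here is a minimal sketch in Python. The observed probabilities p_obs are hypothetical placeholder values (the real survey data is not reproduced here); numpy and scipy are assumed to be available.

```python
import numpy as np
from scipy.stats import binom

n_teeth = 10
x = np.arange(n_teeth + 1)            # possible tooth counts: 0, 1, ..., 10

# Hypothetical empirical distribution over 0-10 teeth (placeholder values
# summing to 1 -- substitute the real observed probabilities).
p_obs = np.array([0.02, 0.03, 0.05, 0.14, 0.16,
                  0.15, 0.12, 0.08, 0.10, 0.08, 0.07])

# Candidate 1: uniform distribution over the 11 possible values.
q_uniform = np.full(n_teeth + 1, 1.0 / (n_teeth + 1))

# Candidate 2: binomial distribution, with p chosen so that n*p matches the
# observed mean (with the article's data the mean is 5.7, giving p = 0.57).
p_hat = np.sum(x * p_obs) / n_teeth
q_binom = binom.pmf(x, n_teeth, p_hat)
```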
Comparing with the original data, we can see that neither the uniform distribution nor the binomial distribution fully captures the original distribution.
Still, we cannot help asking: which of the two is closer to the original distribution?
There are many ways to measure error, but our priority is minimizing the amount of information sent. Both models above reduce the problem to at most two parameters, the number of teeth and a probability (the uniform distribution needs only the number of teeth). So which distribution preserves more information from the original data distribution? This is where K-L divergence comes in.
K-L divergence comes from information theory, which studies how to quantify the information in data. The most important measure of information is entropy, usually denoted H. The entropy of a distribution is H = -Σ_i p(x_i) * log p(x_i), where the sum runs over all possible values x_i.
The logarithm has no fixed base; it can be 2, e, 10, and so on. If we compute H using base-2 logarithms, the value can be interpreted as the minimum number of bits needed to encode the information. Here, the information is the number of worm teeth drawn from the observed empirical distribution. The entropy of the original data's probability distribution works out to 3.12 bits, which tells us how many bits are needed to encode the number of teeth of a worm.
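As an illustration, a base-2 entropy calculation might look like the following, continuing the sketch above (p_obs is the placeholder empirical distribution, so the printed value will differ from the article's 3.12 bits).

```python
def entropy_bits(p):
    """Entropy H = -sum_i p(x_i) * log2(p(x_i)), in bits."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]                # convention: 0 * log2(0) = 0
    return -np.sum(nonzero * np.log2(nonzero))

print(entropy_bits(p_obs))            # entropy of the empirical distribution
```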
However, the entropy value does not tell us how to achieve this compression, that is, how to encode the data optimally (minimal storage). Optimizing information coding is an interesting topic in its own right, but it is not needed to understand K-L divergence. What entropy gives us is the theoretical lower bound on storage for an optimal coding scheme, and a way to measure the amount of information in data. With entropy in hand, we can now quantify how much information is lost when we approximate the original data distribution with a parameterized distribution. Read on.
The K-L divergence formula is only a slight modification of the entropy formula. Let p be the observed probability distribution and q be another distribution that approximates p; the K-L divergence of p from q is D_KL(p || q) = Σ_i p(x_i) * (log p(x_i) - log q(x_i)).
In other words, K-L divergence is the expectation of the log difference between the original distribution p and the approximating distribution q. If we keep using base-2 logarithms, the K-L divergence value is the number of bits of information lost. Written as an expectation (taken under p): D_KL(p || q) = E[log p(x) - log q(x)].
Most commonly, K-L divergence is written as D_KL(p || q) = Σ_i p(x_i) * log(p(x_i) / q(x_i)).
Note: log a - log b = log(a/b).
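A direct translation of this formula into code, continuing the earlier sketch (numpy already imported) and using base-2 logarithms so the result is in bits:

```python
def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p(x_i) * log2(p(x_i) / q(x_i)), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms where p(x_i) = 0 contribute nothing
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))
```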
Well, now we know how to calculate the information loss when one distribution is used to approximate another. Next, let's go back to the initial problem of the probability distribution of worm teeth.
First, compute the K-L divergence between the original distribution and its uniform approximation; this comes out to 0.338 bits.
Next, compute the K-L divergence between the original distribution and its binomial approximation; this value turns out to be larger.
These calculations show that less information is lost approximating the original distribution with the uniform distribution than with the binomial distribution. So if we had to choose between the two, the uniform distribution is the better approximation.
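Continuing the sketch (p_obs, q_uniform, q_binom, and kl_divergence as defined above), the comparison is just two calls. With the placeholder data the exact numbers differ from the article's, but the real data gives the uniform approximation the smaller loss.

```python
print("D_KL(observed || uniform) :", kl_divergence(p_obs, q_uniform))
print("D_KL(observed || binomial):", kl_divergence(p_obs, q_binom))
```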
It is natural to want to treat K-L divergence as a distance between distributions, but that is not correct: the formula is not symmetric, and a distance metric must be. If we flip the roles and approximate the binomial distribution with the observed data distribution, we get a different value:
So D_KL(observed || binomial) ≠ D_KL(binomial || observed).
That is, the information lost when using q to approximate p is different from the information lost when using p to approximate q.
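Swapping the arguments makes the asymmetry easy to see (continuing the same sketch; reversing the direction only works here because every entry of the placeholder p_obs is positive):

```python
print("D_KL(observed || binomial):", kl_divergence(p_obs, q_binom))
print("D_KL(binomial || observed):", kl_divergence(q_binom, p_obs))  # a different value
```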
The binomial parameter we used earlier was p = 0.57, chosen so that n*p matches the mean of the original data. But p can be any value in [0, 1]; we should choose the p whose binomial distribution minimizes the approximation error, that is, the K-L divergence. Is 0.57 really the best choice?
The following figure shows how the K-L divergence between the original data distribution and the binomial distribution varies with the binomial parameter p:
As the figure shows, the K-L divergence is smallest at p = 0.57, so the binomial model we built earlier is already the best binomial model. Note the qualifier: it is the best binomial model, not the best model overall.
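The curve in the figure can be reproduced with a simple parameter sweep, continuing the sketch above (with the article's real data the minimum lands at p = 0.57; the placeholder data gives a slightly different value):

```python
ps = np.linspace(0.01, 0.99, 99)      # candidate binomial parameters
kls = [kl_divergence(p_obs, binom.pmf(x, n_teeth, p)) for p in ps]
print("p minimizing D_KL:", ps[int(np.argmin(kls))])
```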
So far we have only considered the uniform and binomial models. Let's try yet another model to approximate the original data: split the data into two parts, (1) the probability of 0-5 teeth and (2) the probability of 6-10 teeth, governed by a single parameter p. The probability values are as follows:
At p = 0.47, the K-L divergence reaches its minimum, 0.338. Déjà vu? Yes, this is the same as the K-L divergence of the uniform approximation (the coincidence carries no special meaning). If we plot the probability distribution of this odd model, it really does look like the uniform distribution:
As noted, this is an odd model, and given the same K-L divergence we would rather use the more familiar and simpler uniform distribution.
Looking back, in both cases we used K-L divergence as an objective function: minimizing it gave the binomial parameter p = 0.57 and the split model's parameter p = 0.47. That is the key point of this section: K-L divergence can serve as the objective function for optimizing a model. The models here have only a single parameter, but the same idea extends to high-dimensional models with many parameters.
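To show what "K-L divergence as an objective function" looks like as an optimization loop rather than a grid search, here is a minimal gradient-descent sketch for the binomial parameter, continuing the sketch above. The gradient is worked out by hand from log pmf(i) = log C(n, i) + i*log(p) + (n - i)*log(1 - p); natural logs are used here, which rescales the objective but leaves the minimizer unchanged.

```python
mean_teeth = np.sum(x * p_obs)        # observed mean number of teeth

def kl_gradient(p):
    # d/dp of D_KL(p_obs || Binom(n_teeth, p)); only the log q(x_i) terms depend on p
    return -(mean_teeth / p - (n_teeth - mean_teeth) / (1.0 - p))

p = 0.2                               # arbitrary starting guess in (0, 1)
for _ in range(500):
    p -= 0.001 * kl_gradient(p)       # small fixed learning rate
print(p)                              # converges toward mean_teeth / n_teeth
```

Setting this gradient to zero recovers the closed-form answer p = mean_teeth / n_teeth, which is exactly why matching the observed mean earlier gave the best binomial model.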
If you are familiar with neural networks, you may have guessed where this is going. Setting structural details aside, a neural network is a function with a huge number of parameters (millions or more), which we can write as f(x). Given a suitable objective function, the network can be trained to approximate a very complex true function g(x). The key to training is an objective function that feeds back how well the network is currently doing; training is the process of driving that objective value down.
We already know that K-L divergence measures the information lost when one distribution is used to approximate another, and it is what lets neural networks learn to approximate very complex data distributions. The variational autoencoder (VAE) is a common method for learning the best approximation of the information in a dataset. The 2016 tutorial on variational autoencoders is a very good introduction that explains in detail how to construct VAEs; there are also shorter introductions that explain what a VAE is and show how to build several kinds of autoencoders with the Keras library.
More broadly, variational Bayesian methods are a common approach. Monte Carlo simulation is a powerful way to attack many probability problems, including the intractable integrals that arise in Bayesian inference, although its computational cost is high. Variational Bayesian methods, including VAEs, instead use K-L divergence to construct an optimal approximate distribution, allowing much more efficient inference for these intractable integrals. More on variational inference can be found in the Edward library for Python.
Since I have not studied VAEs and variational inference in depth, I cannot guarantee the quality of this last part. I will ask friends working in this area to help improve it, and suggestions in the comments are welcome.