Neural network algorithm
A perceptron takes several binary inputs and produces a single binary output:

The perceptron shown above has three inputs, though in general it may have more or fewer. Each input is assigned a weight, a real number measuring the importance of that input to the output. The perceptron's output, 0 or 1, is determined by whether the weighted sum Σj wj xj is less than or greater than some threshold value. Like the weights, the threshold is a real number, a parameter of the neuron. A more precise algebraic form is:

output = 0 if Σj wj xj ≤ threshold
output = 1 if Σj wj xj > threshold
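As a concrete illustration, here is a minimal sketch of the perceptron rule in Python; the particular inputs, weights, and threshold are made-up values for demonstration:

```python
def perceptron(x, w, threshold):
    """Binary output: 1 if the weighted sum of inputs exceeds the threshold, else 0."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > threshold else 0

# Illustrative values: three binary inputs, hypothetical weights, threshold 5.
print(perceptron([1, 0, 1], [4, 2, 3], threshold=5))  # 4 + 3 = 7 > 5, so output 1
print(perceptron([0, 1, 0], [4, 2, 3], threshold=5))  # 2 <= 5, so output 0
```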
Suppose you weigh three factors to make a decision:

These three factors can be represented by corresponding binary variables x1, x2, x3. For example, if the weather is good we set x1 = 1, and if not, x1 = 0. Similarly, x2 = 1 if your friend will go with you and x2 = 0 otherwise, and x3 is defined in the same way.
Of the three, "whether the movie is good" matters most to you, while the weather matters least. So you assign the largest weight to the movie's quality, and then define the threshold to be 5.
Now we can use a perceptron to construct a mathematical model of this decision.
For example:
By varying the weights and the threshold, we obtain different decision models. Obviously, the perceptron is not a complete model of human decision-making. But this example illustrates how a perceptron can weigh different kinds of evidence in order to make a decision, and it suggests that a perceptron can indeed make some reasonably good decisions.
Now we change the notation slightly. Let b = -threshold; that is, we move the threshold to the left-hand side of the inequality and call it the bias. The perceptron rule can then be rewritten as:

output = 0 if w·x + b ≤ 0
output = 1 if w·x + b > 0

Introducing the bias is only a small change in how we describe perceptrons, but we will see later that it leads to further notational simplifications. From now on, we will always use the bias rather than the threshold.
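The two formulations are equivalent; a quick sketch (with hypothetical weights) checks that the bias form w·x + b > 0, with b = -threshold, agrees with the threshold form on every input:

```python
def perceptron_threshold(x, w, threshold):
    # Original rule: output 1 when the weighted sum exceeds the threshold.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0

def perceptron_bias(x, w, b):
    # Rewritten rule: output 1 when w.x + b > 0, where b = -threshold.
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, threshold = [4, 2, 3], 5  # illustrative values
for x in ([0, 0, 0], [1, 0, 1], [1, 1, 1]):
    assert perceptron_threshold(x, w, threshold) == perceptron_bias(x, w, -threshold)
print("both forms agree")
```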
The perceptron was the first artificial neural network capable of learning, and its appearance triggered the first wave of enthusiasm for neural networks. It should be pointed out that the perceptron can only perform simple linear classification tasks; Minsky proved in his 1969 book Perceptrons that the perceptron cannot solve problems such as XOR. Nevertheless, the introduction of the perceptron was of great significance for the development of neural networks.
Observing the perceptron above, we find a problem: the output of each perceptron is only 0 or 1. Sometimes a slight modification of the weights w or the bias b of a single perceptron can cause the final output to flip completely. In other words, the output of the perceptron is a step function. As shown in the figure below, the output changes abruptly near 0, but away from 0 we may adjust the parameters for a long time without the output changing at all.
This jumping behavior is not what we want. What we need is that when we slightly adjust the weights w or the bias b, the output also changes only slightly. This means the output should not be restricted to 0 and 1 but should take values in between as well. For this reason, we introduce the sigmoid neuron.
Sigmoid neurons use the S-shaped sigmoid function as their activation function. Its expression is:

σ(z) = 1 / (1 + e^(-z))

Its graph is shown in the following figure:
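The sigmoid is straightforward to compute; a minimal sketch:

```python
import math

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)): squashes any real z into the interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # exactly 0.5 at z = 0
print(sigmoid(5))    # close to 1 for large positive z
print(sigmoid(-5))   # close to 0 for large negative z
```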
It is the smoothness of the σ function that is the crucial fact, not its detailed form. The smoothness of σ means that small changes Δwj in the weights and Δb in the bias produce a small change Δoutput in the neuron's output. In fact, calculus tells us that Δoutput is well approximated by:

Δoutput ≈ Σj (∂output/∂wj) Δwj + (∂output/∂b) Δb

The formula above says that Δoutput is a linear function of the changes Δwj and Δb in the weights and bias. This linearity makes it easy to choose small changes in the weights and bias to achieve any desired small change in the output. So while sigmoid neurons are essentially similar to perceptrons, they make it much easier to work out how changing the weights and bias will change the output.
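We can verify this linear approximation numerically. The sketch below (with made-up weights and inputs) compares the Δoutput predicted by the formula against the actual change in a sigmoid neuron's output; for the sigmoid, ∂output/∂wj = σ′(z)·xj and ∂output/∂b = σ′(z), where σ′(z) = σ(z)(1 − σ(z)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(w, x, b):
    # Sigmoid neuron: output = sigma(w.x + b).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, x, b = [0.5, -0.3], [1.0, 2.0], 0.1   # made-up weights, inputs, bias
z = sum(wi * xi for wi, xi in zip(w, x)) + b
sprime = sigmoid(z) * (1 - sigmoid(z))   # sigma'(z)

dw, db = [1e-4, -2e-4], 5e-5             # small changes to the parameters
predicted = sum(sprime * xj * dwj for xj, dwj in zip(x, dw)) + sprime * db
actual = neuron([wi + dwi for wi, dwi in zip(w, dw)], x, b + db) - neuron(w, x, b)
print(abs(predicted - actual) < 1e-7)  # True: the linear approximation is accurate
```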
Having understood sigmoid neurons, we can now introduce the basic structure of a neural network. Details are as follows:
The leftmost layer of the network is called the input layer, and the neurons within it are called input neurons. The rightmost, or output, layer contains the output neurons; in the figure, the output layer has a single neuron. The middle layer is called the hidden layer, since its neurons are neither inputs nor outputs.
This is the basic structure of a neural network. With later developments, networks have gained more and more layers and become increasingly complex.
Let's review the development of neural networks. Their history has been tortuous, with moments of acclaim and periods of neglect, going through several ups and downs along the way.
There have been three rising phases: from the single-layer neural network (the perceptron), to the two-layer neural network with a hidden layer, and then to the multi-layer deep neural network. See the figure below for details.
We want an algorithm that lets us find weights and biases so that the output of the network approximates the desired output y(x) for all training inputs x. To quantify how well we achieve this goal, we define a cost function:

C(w, b) = (1 / 2n) Σx ‖y(x) − a‖²

Here w denotes the collection of all weights in the network, b all the biases, and n the number of training inputs;
a is the vector of outputs from the network when x is the input, and the sum runs over all training inputs x. Of course, the output a depends on x, w and b, but to keep the notation simple we do not indicate this dependence explicitly. The notation ‖v‖ denotes the length of the vector v. We call C the quadratic cost function; it is also sometimes known as the mean squared error, or MSE. Inspecting the form of the quadratic cost function, we see that C(w, b) is non-negative, since every term in the sum is non-negative. Furthermore, C(w, b) becomes small, i.e., C(w, b) ≈ 0, precisely when the network's output a is close to y(x) for all training inputs x.
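A direct sketch of the quadratic cost, where y_all holds the desired output vectors y(x) and a_all the corresponding network outputs a (both made up here):

```python
def quadratic_cost(y_all, a_all):
    # C = (1 / 2n) * sum over training inputs of ||y(x) - a||^2.
    n = len(y_all)
    total = 0.0
    for y, a in zip(y_all, a_all):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

print(quadratic_cost([[1, 0]], [[1, 0]]))      # 0.0: perfect outputs give zero cost
print(quadratic_cost([[1, 0]], [[0.5, 0.5]]))  # 0.25
```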
If our learning algorithm can find weights and biases such that C(w, b) ≈ 0, it has done a good job. Conversely, it is doing poorly when C(w, b) is large, which means that y(x) is far from the output a for a large number of inputs. So the goal of our training algorithm is to minimize the cost function C(w, b). In other words, we want to find a set of weights and biases that makes the cost as small as possible. We will do this using an algorithm known as gradient descent.
Let's abstract the cost function as C(v), which can be any real-valued function of many variables.
Note that we use v in place of w and b to emphasize that it may be an arbitrary function; for the moment we are not restricting ourselves to the neural network setting.
To make the problem easier, let's first consider the case of two variables. Imagine that C is a function of just two variables v1 and v2, and our goal is to find where C attains its minimum.
As shown in the figure above, our goal is to find the local minimum. One way to solve this problem is to use calculus: we could try to locate the extremum of C by computing derivatives. But for neural networks we typically face an enormous number of weights and biases; that is, the dimension of v is not just two, but may be in the hundreds of millions. Finding the minimum of such a high-dimensional function C(v) analytically is practically impossible.
In this situation, an interesting algorithm was proposed. Imagine a ball rolling down into a valley from the top of a hill; our everyday experience tells us that the ball will eventually roll to the bottom. Let's ignore the relevant physical laws for the moment; the point of imagining the ball is to stimulate our imagination, not to constrain our thinking. So rather than getting bogged down in the messy details of physics, let's simply ask: if we could make up our own laws of physics dictating how the ball rolls, what law of motion would we choose so that the ball always ends up at the bottom of the valley?
To describe this problem more precisely, let's think about what happens when we move the ball a small amount Δv1 in the v1 direction and a small amount Δv2 in the v2 direction. Calculus tells us that C changes as follows:

ΔC ≈ (∂C/∂v1) Δv1 + (∂C/∂v2) Δv2

This can also be expressed in vector form as:

ΔC ≈ ∇C · Δv

Our problem now is to keep finding a Δv that makes ΔC less than 0, so that C + ΔC keeps getting smaller.
Suppose we choose:

Δv = −η∇C

Here η is a small positive number (called the learning rate), so that

ΔC ≈ −η∇C · ∇C = −η‖∇C‖²

Because ‖∇C‖² ≥ 0, this guarantees ΔC ≤ 0; that is, if we change v according to the rule above, C will only ever decrease, never increase.
So we can keep changing v in this way to drive the value of C down, making the ball roll toward the lowest point.
To summarize, the gradient descent algorithm works by repeatedly computing the gradient ∇C and then moving in the opposite direction, "falling down" the slope of the valley. We can picture it like this:
For gradient descent to work correctly, we need to choose a suitable learning rate η to ensure that C keeps decreasing until we find the minimum.
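Here is a minimal sketch of gradient descent on an illustrative two-variable function C(v1, v2) = v1² + v2², whose gradient is known analytically; the ball-rolling rule Δv = −η∇C is applied repeatedly:

```python
def grad_C(v1, v2):
    # Analytic gradient of C(v1, v2) = v1^2 + v2^2.
    return 2 * v1, 2 * v2

v1, v2 = 3.0, -4.0   # arbitrary starting point
eta = 0.1            # learning rate
for _ in range(100):
    g1, g2 = grad_C(v1, v2)
    v1, v2 = v1 - eta * g1, v2 - eta * g2   # v -> v - eta * grad C

print(abs(v1) < 1e-6 and abs(v2) < 1e-6)  # True: we have reached the minimum (0, 0)
```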
Knowing the gradient descent method for a two-variable function C, we can easily extend it to many dimensions. Suppose C is a function of m variables v1, …, vm. Then ΔC becomes:

ΔC ≈ ∇C · Δv

where the gradient ∇C is the vector:

∇C = (∂C/∂v1, …, ∂C/∂vm)ᵀ

and Δv is:

Δv = (Δv1, …, Δvm)ᵀ

The update rule is:

v → v′ = v − η∇C
Returning to the neural network, the update rules for w and b are:

wk → wk′ = wk − η ∂C/∂wk
bl → bl′ = bl − η ∂C/∂bl
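In code, the update is a single vectorized step; the gradient arrays below are stand-in values rather than gradients computed from a real cost:

```python
import numpy as np

eta = 0.5  # learning rate (illustrative)
w = np.array([0.2, -0.4, 0.1])         # hypothetical weights
b = np.array([0.3])                    # hypothetical bias
grad_w = np.array([0.05, -0.02, 0.0])  # stand-in for dC/dw
grad_b = np.array([0.1])               # stand-in for dC/db

# w_k -> w_k - eta * dC/dw_k ;  b_l -> b_l - eta * dC/db_l
w = w - eta * grad_w
b = b - eta * grad_b
print(w)  # updated to the values [0.175, -0.39, 0.1]
print(b)  # updated to the value [0.25]
```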
We have described how a neural network can use the gradient descent algorithm to learn its weights and biases. But one question was left open: we have not discussed how to compute the gradient of the cost function. This requires a very important algorithm: the backpropagation algorithm.
The backpropagation algorithm is based on the chain rule from calculus.
The four fundamental equations:

An equation for the error of the output layer:

δ^L = ∇a C ⊙ σ′(z^L)

An equation for the error of a layer in terms of the error of the next layer:

δ^l = ((w^(l+1))ᵀ δ^(l+1)) ⊙ σ′(z^l)

An equation for the rate of change of the cost with respect to a bias:

∂C/∂b_j^l = δ_j^l

An equation for the rate of change of the cost with respect to a weight:

∂C/∂w_jk^l = a_k^(l−1) δ_j^l
Algorithm description:
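The algorithm can be sketched compactly for a fully connected sigmoid network with the quadratic cost, following the four equations above; the layer sizes and input data below are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(weights, biases, x, y):
    # Forward pass: store each layer's weighted input z and activation a.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # BP1: output error delta^L = (a^L - y) * sigma'(z^L) for the quadratic cost.
    sp = sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    delta = (activations[-1] - y) * sp
    grads_b = [delta]                             # BP3: dC/db^L = delta^L
    grads_w = [np.outer(delta, activations[-2])]  # BP4: dC/dw^L
    # BP2: propagate the error backward through the earlier layers.
    for l in range(2, len(weights) + 1):
        sp = sigmoid(zs[-l]) * (1 - sigmoid(zs[-l]))
        delta = (weights[-l + 1].T @ delta) * sp
        grads_b.insert(0, delta)                              # BP3 per layer
        grads_w.insert(0, np.outer(delta, activations[-l - 1]))  # BP4 per layer
    return grads_w, grads_b

# Illustrative 2-3-1 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
gw, gb = backprop(weights, biases, np.array([0.5, -0.2]), np.array([1.0]))
print([g.shape for g in gw])  # gradient shapes match the weight shapes
```

Note that the errors δ^l are computed last layer first, which is exactly the "backward" motion the next paragraph discusses.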
Look at this algorithm and you will see why it is called backpropagation: we compute the error vectors δ^l backward, starting from the final layer. It may seem strange to work backward, but if you consider the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of the network's output. To understand how the cost varies with the weights and biases of earlier layers, we need to apply the chain rule over and over, working backward through the layers to obtain the required expressions.