I am currently trying to teach myself about neural networks. I bought the book Applied Artificial Intelligence by Wolfgang Beer, and I am now stuck on part of his code. I understand the code itself; what I do not understand is one mathematical step behind it.
The part looks like this:
for i in range(iterations):
    guessed = sig(inputs*weights)
    error = output - guessed
    adjustment = error*sig_d(output)
    # Why is there no learning rate?
    # Why is the adjustment relative to the error
    # multiplied by the derivative of your main function?
    weights += adjustment
I tried to look up how gradient descent works, but I never understood the part about adjusting the weights. How does the math behind it work, and why is the derivative used for it?
Also, when I searched the internet for other solutions, they always used a learning rate. I understand the concept, but why is it not used in this book? It would really help me if someone could answer these questions.
And thanks for all these rapid responses in the past.
2 Answers
Why is there no learning rate?
Why is the adjustment relative to the error
multiplied by the derivative of your main function?
To train a regression model, we start with arbitrary weights and adjust them so that the error is minimized. If we plot the error as a function of the weights, we get a plot like the figure above, where the error J(θ0,θ1) is a function of the weights θ0 and θ1. We succeed when the error reaches the very bottom of the graph, where its value is minimal. The red arrows show the minimum points in the graph. To reach a minimum, we take the derivative of the error function. The derivative is the slope of the tangent line at the current point, and it tells us which direction to move in: we step down the cost function in the direction of steepest descent, which is opposite to the slope. The size of each step is determined by the parameter α, which is called the learning rate.
The gradient descent algorithm is: repeat until convergence,
θj := θj − α · ∂J(θ0,θ1)/∂θj (updating j = 0 and j = 1 simultaneously)
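To connect the general rule θ := θ − α · ∂J/∂θ to the loop in the question, here is a short derivation. It is only a sketch under assumptions that the question does not state: a squared-error cost, a single weight θ (`weights`), input x (`inputs`), target y (`output`), prediction ŷ (`guessed`), and σ standing for `sig`.

```latex
\begin{aligned}
\hat{y} &= \sigma(z), \qquad z = x\,\theta, \qquad J(\theta) = \tfrac{1}{2}\,(y - \hat{y})^2 \\
\frac{\partial J}{\partial \theta} &= -(y - \hat{y})\,\sigma'(z)\,x \\
\theta &:= \theta - \alpha\,\frac{\partial J}{\partial \theta}
        = \theta + \alpha\,(y - \hat{y})\,\sigma'(z)\,x
\end{aligned}
```

Read this way, `error` corresponds to (y − ŷ) and `sig_d(...)` to σ′(z), so the derivative of the activation shows up because of the chain rule. A loop that adds this product to the weights directly behaves like α = 1, which is a common simplification in small teaching examples. Whether that is the book's intent is not confirmed by the quoted snippet, and the factor x does not appear in the snippet as quoted.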
In the figure above, the error J(θ1) is plotted as a function of the weight θ1. We start with an arbitrary value of θ1 and take the derivative (the slope of the tangent) of J(θ1) to adjust θ1 so that we reach the bottom, where the error is minimal. If the slope is positive, we have to go left, that is, decrease θ1. If the slope is negative, we have to go right, that is, increase θ1. We repeat this procedure until convergence, that is, until we reach the minimum.
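The sign rule described above can be tried out on a toy cost. This is a minimal sketch: the bowl-shaped cost `J` with its minimum at θ1 = 3, the function names, and the default α = 0.1 are all made up for illustration and do not come from the course or the book.

```python
def J(theta1):
    # Hypothetical bowl-shaped cost with its minimum at theta1 = 3
    return (theta1 - 3.0) ** 2


def dJ(theta1):
    # Derivative (slope of the tangent) of the cost above
    return 2.0 * (theta1 - 3.0)


def gradient_descent(theta1, alpha=0.1, steps=100):
    # Repeatedly step against the slope:
    # positive slope -> theta1 decreases, negative slope -> theta1 increases
    for _ in range(steps):
        theta1 = theta1 - alpha * dJ(theta1)
    return theta1


# Start left of the minimum (negative slope) and right of it (positive slope)
from_left = gradient_descent(-5.0)
from_right = gradient_descent(10.0)
print(from_left, from_right)  # both end up very close to 3
```

Starting on either side of the minimum, the same update moves θ1 toward the bottom, because subtracting the slope flips the direction automatically.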
If the learning rate α is too small, gradient descent converges too slowly. If α is too large, gradient descent overshoots the minimum and can fail to converge.
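The effect of α can be seen on the same kind of toy problem. This is a hedged sketch, not a recipe: the cost (θ1 − 3)², the three α values, and the step count of 20 are arbitrary choices made for illustration.

```python
def dJ(theta1):
    # Slope of the hypothetical cost J(theta1) = (theta1 - 3)^2
    return 2.0 * (theta1 - 3.0)


def run(alpha, steps=20, theta1=0.0):
    # Plain gradient descent with a fixed learning rate
    for _ in range(steps):
        theta1 -= alpha * dJ(theta1)
    return theta1


too_small = run(alpha=0.001)   # barely moves away from the start at 0
reasonable = run(alpha=0.1)    # gets close to the minimum at 3
too_large = run(alpha=1.1)     # every step overshoots further: it diverges

for name, value in [("too small", too_small),
                    ("reasonable", reasonable),
                    ("too large", too_large)]:
    print(f"{name:>10}: theta1 = {value:.4f}, distance to minimum = {abs(value - 3.0):.4f}")
```

On this particular cost, each step multiplies the distance to the minimum by (1 − 2α), so any α above 1 makes that distance grow instead of shrink. For other cost functions the threshold differs, which is why α usually has to be tuned.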
All the figures have been taken from Andrew Ng’s machine learning course on coursera.org
https://www.coursera.org/learn/machine-learning/home/welcome