Improve convergence of GD with momentum

Question

Askic V 2024-5-20

1
链接

此问题的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/2120531-improve-convergence-of-gd-with-momentum

回答： recent works 2024-5-20

Hello,

I have made a simple implementation of the GD algorithm with momentum and it seems to me that convergence is very slow, it takes about 15k iterations to reach predefined tolerance.

I feel like there should be an improvement to this code, but I don't see where.

Please, suggest me a better implementation:

clear all
clc
% Initial guess
x0 = [1.3, 0.7, 0.8, 1.9, 1.2]';
% Parameters
alpha = 0.001;
beta = 0.4;
max_iter = 10000;
tol = 1e-8;
x = x0;  % find the values of x that minimize␣
i = 0;
step = zeros(size(x));
mse = rosenbrock(x);
fprintf('Initial MSE = %14.10f x = %s\n', mse, mat2str(x'));  % print initial values
while mse > tol
    grad = rosenbrock_gradient(x);
    gnorm = norm(grad);  % gradient norm
    step = -(1-beta)*alpha*grad + beta*step;
    x = x + step;
    mse = rosenbrock(x);  % update the mean squared error
    i = i + 1;
end
fprintf('iterations = %6d\n', i);
fprintf('Final MSE = %14.10f x = %s\n', mse, mat2str(x'));
fprintf('gradient = %s\n', mat2str(grad'));
fprintf('gradient norm = %f\n', gnorm);
% Define the Rosenbrock function
function [mse] = rosenbrock(x)
    mse = sum(100.0 * (x(2:end) - x(1:end-1).^2.0).^2.0 + (1 - x(1:end-1)).^2.0);
end
% Define the gradient of the Rosenbrock function
function [grad] = rosenbrock_gradient(x)
    n = length(x);
    grad = zeros(n, 1);
    grad(1) = -400 * x(1) * (x(2) - x(1)^2) - 2 * (1 - x(1));
    grad(2:n-1) = 200 * (x(2:n-1) - x(1:n-2).^2) - 400 * x(2:n-1) .* (x(3:n) - x(2:n-1).^2) - 2 * (1 - x(2:n-1));
    grad(n) = 200 * (x(n) - x(n-1)^2);
end

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

请先登录，再回答此问题。

Answer 1

recent works 2024-5-20

3
链接

此回答的直接链接

https://ww2.mathworks.cn/matlabcentral/answers/2120531-improve-convergence-of-gd-with-momentum#answer_1460321

To improve the convergence speed of your Gradient Descent (GD) algorithm with momentum, there are several strategies you can consider. Here are some suggestions and modifications to your code:

Adaptive Learning Rate:Implementing an adaptive learning rate, such as using Adam optimizer, can significantly improve convergence speed. Adam adjusts the learning rate based on the first and second moments of the gradients.
Learning Rate Tuning:Sometimes, adjusting the learning rate (𝛼α) and momentum (𝛽β) parameters can lead to faster convergence. You might need to experiment with different values.
Early Stopping:Implementing early stopping criteria can prevent unnecessary iterations once the algorithm has converged sufficiently.
Function Tolerance Check:Check the function value tolerance as well as the gradient norm to ensure you have a robust stopping criterion.

clear all

clc

% Initial guess

x0 = [1.3, 0.7, 0.8, 1.9, 1.2]';

% Parameters for Adam

alpha = 0.001; % initial learning rate

beta1 = 0.9; % decay rate for the first moment estimate

beta2 = 0.999; % decay rate for the second moment estimate

epsilon = 1e-8; % small number to prevent division by zero

max_iter = 10000;

tol = 1e-8;

% Initialize Adam parameters

m = zeros(size(x0)); % first moment vector

v = zeros(size(x0)); % second moment vector

t = 0; % timestep

x = x0; % find the values of x that minimize

i = 0;

mse = rosenbrock(x);

fprintf('Initial MSE = %14.10f x = %s\n', mse, mat2str(x')); % print initial values

while mse > tol && i < max_iter

grad = rosenbrock_gradient(x);

gnorm = norm(grad); % gradient norm

t = t + 1;

m = beta1 * m + (1 - beta1) * grad; % update biased first moment estimate

v = beta2 * v + (1 - beta2) * (grad .^ 2); % update biased second moment estimate

m_hat = m / (1 - beta1^t); % compute bias-corrected first moment estimate

v_hat = v / (1 - beta2^t); % compute bias-corrected second moment estimate

step = -alpha * m_hat ./ (sqrt(v_hat) + epsilon);

x = x + step;

mse = rosenbrock(x); % update the mean squared error

i = i + 1;

end

fprintf('iterations = %6d\n', i);

fprintf('Final MSE = %14.10f x = %s\n', mse, mat2str(x'));

fprintf('gradient = %s\n', mat2str(grad'));

fprintf('gradient norm = %f\n', gnorm);

% Define the Rosenbrock function

function [mse] = rosenbrock(x)

mse = sum(100.0 * (x(2:end) - x(1:end-1).^2.0).^2.0 + (1 - x(1:end-1)).^2.0);

end

% Define the gradient of the Rosenbrock function

function [grad] = rosenbrock_gradient(x)

n = length(x);

grad = zeros(n, 1);

grad(1) = -400 * x(1) * (x(2) - x(1)^2) - 2 * (1 - x(1));

grad(2:n-1) = 200 * (x(2:n-1) - x(1:n-2).^2) - 400 * x(2:n-1) .* (x(3:n) - x(2:n-1).^2) - 2 * (1 - x(2:n-1));

grad(n) = 200 * (x(n) - x(n-1)^2);

end

Adam Optimizer: This method combines the advantages of two other extensions of stochastic gradient descent: AdaGrad and RMSProp. It computes adaptive learning rates for each parameter.

Learning Rates and Moments: The parameters 𝛽1 and 𝛽2 are set to common defaults, but they can be adjusted based on your specific problem.

Stopping Criteria: The loop stops either when the tolerance level is reached or after the maximum number of iterations.By using Adam, you can achieve faster convergence due to its adaptive nature, which helps navigate the optimization landscape more effectively.

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

请先登录，再进行评论。

Improve convergence of GD with momentum

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

采纳的回答

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

Improve convergence of GD with momentum

0 个评论 显示 -2更早的评论隐藏 -2更早的评论

采纳的回答

0 个评论 显示 -2更早的评论隐藏 -2更早的评论

更多回答（0 个）

另请参阅

类别

标签

Community Treasure Hunt

0 个评论
显示 -2更早的评论隐藏 -2更早的评论

0 个评论
显示 -2更早的评论隐藏 -2更早的评论