Optimization problem using Q-learning

Question


I have a problem with an optimization using Q-learning.

The idea of my optimization is to charge the ESS when electricity is cheapest and discharge it when electricity is most expensive, in order to reduce the total cost of electricity.

The reward for each step is how much the ESS reduced the cost of electricity. There are some constraints: the SOC of the ESS cannot fall below a lower limit.

Everything else works, but the action is how much the ESS charges or discharges, and the SOC state should reflect that action.

If the ESS discharges, the SOC gets lower; if the ESS charges, the SOC gets higher. But the state doesn't reflect the action properly. Can anyone help with this problem?

clc; clear;

%% ===== Environment parameters =====
T = 48;
eff_cha = 0.95;
eff_dch = 0.95;
SOC_min = 0.1;
SOC_max = 1;
SOC0 = 0.5;
P_ess_max = 1500; % maximum ESS charge/discharge power (kW)
ESS_cap = 3000; % ESS capacity (kWh)
actions = linspace(-0.5,0.5,41); % fraction of P_ess_max, in [-0.5, 0.5]
numActions = length(actions);

%% ===== Learning parameters =====
alpha = 0.1;
gamma = 0.99; % not used directly here because this is a Monte Carlo method
epsilon = 0.5;
epsilon_min = 0.05;
epsilon_decay = 0.995;
numEpisodes = 60000;

%% ===== State space (discretization) =====
numSOCs = 101;
numPrices = 3;
Q = zeros(numSOCs, numPrices, T, numActions);

%% ===== Price/load data =====
price_real = 140.5*ones(1,24);
price_real(1:7) = 87.3;
price_real(22:24) = 87.3;
price_real(8:10) = 109.8;
price_real(12) = 109.8;
price_real(18:21) = 109.8;
price_real = [price_real, price_real]; % 48 hours
load_real = table2array(readtable('48_consumption_6.1.xlsx'));
pv_real = table2array(readtable("PV_gen.xlsx"));
load_real = load_real - pv_real; % net load (kW)

%% ===== SOC threshold =====
p_crt_val = 0.03;
for p = 1:24
    p_crt(p) = p_crt_val*(24-p);
end
p_crt = [p_crt, p_crt] + 0.03*randn(1,48);

%% ===== Normalization for state discretization =====
price_norm = price_real / max(price_real);
discretizeState = @(x) min(max(floor(x * numPrices) + 1, 1), numPrices);

%% ===== Monte Carlo training loop =====
saving_history = NaN(1,numEpisodes); % savings of completed episodes only
completion_rate = zeros(1,numEpisodes); % completion rate
for ep = 1:numEpisodes
    SOC = SOC0; % initial SOC fraction
    episode_memory = []; % [SOC_idx, price_idx, time, action_idx]
    grid_before_ep = zeros(1,T);
    grid_after_ep = zeros(1,T);
    done_flag = true; % whether the episode ran to completion
    for t = 1:T
        % State
        s_idx = [discretizeState(SOC), discretizeState(price_norm(t)), t];
        % ε-greedy action
        if rand < epsilon
            a_idx = randi(numActions);
        else
            [~, a_idx] = max(Q(s_idx(1), s_idx(2), s_idx(3), :));
        end
        a_kW = actions(a_idx) * P_ess_max;
        % SOC update
        if a_kW >= 0
            SOC_next = SOC + (a_kW / ESS_cap) * eff_cha;
        else
            SOC_next = SOC + (a_kW / ESS_cap) / eff_dch;
        end
        % On a hard-constraint violation, clamp SOC and zero the action
        if SOC_next > SOC_max
            SOC_next = SOC_max; a_kW = 0;
        elseif SOC_next < SOC_min
            SOC_next = SOC_min; a_kW = 0;
        elseif SOC_next < p_crt(t)
            SOC_next = p_crt(t); a_kW = 0;
        end
        % Record power
        grid_before_ep(t) = load_real(t);
        grid_after_ep(t) = load_real(t) + a_kW;
        % Record state/action
        episode_memory(end+1,:) = [s_idx, a_idx];
        SOC = SOC_next;
    end
    % Update Q and record only for completed episodes
    if done_flag && length(episode_memory) == T
        cost_before_ep = sum(grid_before_ep .* price_real);
        cost_after_ep = sum(grid_after_ep .* price_real);
        saving_ep = cost_before_ep - cost_after_ep;
        saving_history(ep) = saving_ep;
        for step = 1:size(episode_memory,1)
            s_idx = episode_memory(step,1:3);
            a_idx = episode_memory(step,4);
            Q(s_idx(1), s_idx(2), s_idx(3), a_idx) = ...
                Q(s_idx(1), s_idx(2), s_idx(3), a_idx) + ...
                alpha * (saving_ep - Q(s_idx(1), s_idx(2), s_idx(3), a_idx));
        end
    end
    % Decay ε
    if epsilon > epsilon_min
        epsilon = epsilon * epsilon_decay;
    end
    % Record completion rate
    completion_rate(ep) = sum(~isnan(saving_history)) / ep;
    if mod(ep,10000) == 0
        fprintf("Episode %d: completed=%d, saving=%.2f KRW, ε=%.3f\n", ...
            ep, done_flag, saving_history(ep), epsilon);
    end
end

Episode 10000: completed=1, saving=589706.25 KRW, ε=0.050
Episode 20000: completed=1, saving=811117.50 KRW, ε=0.050
Episode 30000: completed=1, saving=668235.00 KRW, ε=0.050
Episode 40000: completed=1, saving=623130.00 KRW, ε=0.050
Episode 50000: completed=1, saving=598522.50 KRW, ε=0.050
Episode 60000: completed=1, saving=674752.50 KRW, ε=0.050

%% ===== Visualize training results =====

%% ===== Simulate the learned policy =====
SOC = SOC0;
SOC_traj = zeros(1,T);
act_traj = zeros(1,T);
grid_power_before = zeros(1,T);
grid_power_after = zeros(1,T);
for t = 1:T
    grid_power_before(t) = load_real(t);
    s_idx = [discretizeState(SOC), discretizeState(price_norm(t)), t];
    [~, a_idx] = max(Q(s_idx(1), s_idx(2), s_idx(3), :)); % greedy action
    a_kW = actions(a_idx) * P_ess_max;
    if a_kW >= 0
        SOC_next = SOC + (a_kW / ESS_cap) * eff_cha;
    else
        SOC_next = SOC + (a_kW / ESS_cap) / eff_dch;
    end
    if SOC_next > SOC_max
        SOC_next = SOC_max; a_kW = 0;
    elseif SOC_next < SOC_min
        SOC_next = SOC_min; a_kW = 0;
    elseif SOC_next < p_crt(t)
        SOC_next = p_crt(t); a_kW = 0;
    end
    grid_power_after(t) = load_real(t) + a_kW;
    SOC_traj(t) = SOC_next;
    act_traj(t) = a_kW;
    SOC = SOC_next;
end

%% ===== Final cost calculation =====
cost_before = sum(grid_power_before .* price_real);
cost_after = sum(grid_power_after .* price_real);
saving = cost_before - cost_after;
fprintf('Final electricity cost without ESS: %.3f KRW\n', cost_before);

Final electricity cost without ESS: 6278802.431 KRW

fprintf('Final electricity cost with ESS: %.3f KRW\n', cost_after);

Final electricity cost with ESS: 5594228.681 KRW

fprintf('Final saving: %.3f KRW (saving rate %.2f%%)\n', saving, saving/cost_before*100);

Final saving: 684573.750 KRW (saving rate 10.90%)

%% ===== Visualize simulation results =====
figure;
plot(saving_history);
title('Learning Curve'); xlabel('Episode'); ylabel('Total Reward');
yticks(-4e5:1e5:9e5); grid on;

figure;
plot(100*SOC_traj,'LineWidth',1); hold on;
plot(100*p_crt, 'r','LineWidth',1);
title('SOC Trajectory'); ylabel('SOC(%)'); ylim([-5 105]);
legend('SOC','Critical Load'); grid on;

figure;
stairs(act_traj, '-x');
title('Action Trajectory (kW)'); grid on;

figure;
stairs(price_real);
title('Price'); xlabel('Time'); ylabel('Price');

1 Comment

Goutam 2025-9-3

Hi 찬목 박,

I think your Q-learning agent might be having trouble because it can't clearly link its actions to the outcomes they produce. With only three price bins, it might be too hard for the agent to tell the difference between low, medium, and high electricity costs, especially when those differences can be pretty significant.
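To make the price-binning concern concrete, the short check below runs the three tariff levels from the question (87.3, 109.8, 140.5) through the posted discretizeState with numPrices = 3. It is only a sketch of that one line of the original code; the names levels and bins are introduced here for illustration.

```matlab
% The three tariff levels used in the question's price_real vector.
levels = [87.3 109.8 140.5];
price_norm = levels / max(levels);   % same normalization as in the post

numPrices = 3;
discretizeState = @(x) min(max(floor(x * numPrices) + 1, 1), numPrices);

% Bin index for each tariff level.
bins = arrayfun(discretizeState, price_norm);
disp(bins)   % bins is [2 3 3]
```

With these values the mid and peak prices land in the same bin and bin 1 is never used, so the price part of the state effectively distinguishes only two levels.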

Also, since you're using Monte Carlo learning, the agent only gets feedback after the full 48-hour cycle is over. So it’s like getting a final score, “you saved 600k won”, without knowing which specific decisions (like charging at hour 5 or discharging at hour 20) actually made that happen. That kind of delayed and vague feedback makes it tough for the agent to learn which actions were truly valuable.
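One possible way to give the agent per-decision feedback is to split the episode saving into step rewards and use a one-step Q-learning update instead of the episode total. In the posted code, saving_ep equals the sum over t of -price_real(t) * a_kW(t), so r(t) = -price(t) * a_kW(t) is a natural step reward. The toy two-step table below is a hypothetical illustration, not the original state space, and it assumes one-hour steps.

```matlab
% Two example hours taken from the question's tariff: off-peak, then peak.
price = [87.3 140.5];     % KRW/kWh
a_kW  = [750 -750];       % charge at the cheap hour, discharge at the expensive one
r     = -price .* a_kW;   % step reward = cost before minus cost after at that hour

alpha = 0.1; gamma = 0.99;
Q = zeros(2, 2);          % toy table: rows = time step, cols = [charge, discharge]
a_idx = [1 2];            % action taken at each step

for t = 1:2
    if t < 2
        target = r(t) + gamma * max(Q(t+1, :));  % bootstrap from the next state
    else
        target = r(t);                           % last step has no successor
    end
    Q(t, a_idx(t)) = Q(t, a_idx(t)) + alpha * (target - Q(t, a_idx(t)));
end
% After one pass: Q(1,1) = -6547.5 and Q(2,2) = 10537.5
```

Here the charge step is penalized and the discharge step is rewarded individually, and gamma (unused in the Monte Carlo version) is what lets the value of a later discharge propagate back to the earlier charge over repeated episodes.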

One idea that might help: try increasing 'numPrices' from 3 to 10. It’s a small tweak, but it could give the agent a much clearer picture of the price landscape and help it make more informed decisions without needing to redesign the whole system.
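A quick check of what this change does, as a sketch under stated assumptions: the first part applies the posted discretizeState with numPrices = 10 to the three tariff levels. The second part looks at SOC, because the posted loop passes SOC through the same discretizeState even though Q is sized with numSOCs = 101. The discretizeSOC handle is hypothetical (it is not in the original code) and assumes the intent was to use the full numSOCs grid.

```matlab
% Price bins with numPrices = 10 (tariff levels from the question).
levels = [87.3 109.8 140.5];
numPrices = 10;
discretizeState = @(x) min(max(floor(x * numPrices) + 1, 1), numPrices);
price_bins = discretizeState(levels / max(levels));            % [7 8 10]

% SOC before and after the smallest charging action (0.025 * P_ess_max).
SOC0 = 0.5; P_ess_max = 1500; ESS_cap = 3000; eff_cha = 0.95;
SOC1 = SOC0 + (0.025 * P_ess_max / ESS_cap) * eff_cha;         % 0.511875

% Same discretizer as the posted loop: both SOC values share one bin.
shared_bins = [discretizeState(SOC0), discretizeState(SOC1)];  % [6 6]

% Hypothetical SOC-specific discretizer on the numSOCs = 101 grid.
numSOCs = 101;
discretizeSOC = @(s) min(max(round(s * (numSOCs - 1)) + 1, 1), numSOCs);
soc_bins = [discretizeSOC(SOC0), discretizeSOC(SOC1)];         % [51 52]
```

With numPrices = 10 the three tariff levels get distinct bins. A small charge still leaves the SOC index unchanged under the shared discretizer, whereas a separate SOC discretizer moves it by one bin, which may be closer to the behavior the question is asking for.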
