Generate a reward function from control specifications to train a reinforcement learning agent

Since R2021b

Syntax

generateRewardFunction(mpcobj)

generateRewardFunction(blks)

generateRewardFunction(___,'FunctionName',myFcnName)

Description

generateRewardFunction(mpcobj) generates a MATLAB® reward function based on the cost and constraints defined in the linear or nonlinear MPC object mpcobj. The generated reward function is displayed in a new editor window, and you can use it as a starting point for reward design. You can tune the weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.

This syntax requires Model Predictive Control Toolbox™ software.


generateRewardFunction(blks) generates a MATLAB reward function based on performance constraints defined in the model verification blocks specified in the array of block paths blks.

This syntax requires Simulink® Design Optimization™ software.

generateRewardFunction(___,'FunctionName',myFcnName) specifies the name of the generated reward function and saves it into a file with the same name. It also overwrites any preexisting file with the same name in the current directory. Provide this name after either of the previous input arguments.
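As a minimal sketch of the name-value syntax, the following call saves the generated function under a chosen name. The controller is the one given as the example value for the mpcobj argument on this page; the name myRewardFcn is a hypothetical choice made for illustration, and the snippet assumes that both Model Predictive Control Toolbox and Reinforcement Learning Toolbox are installed.

```matlab
% Requires Model Predictive Control Toolbox and Reinforcement Learning Toolbox.
% The function name 'myRewardFcn' is a hypothetical choice for this sketch.
mpcobj = mpc(tf([1 1],[1 2 0]),0.1);

% Generate the reward function and save it as myRewardFcn.m in the
% current folder, overwriting any existing file with that name.
generateRewardFunction(mpcobj,'FunctionName','myRewardFcn');

% Inspect the saved file.
type myRewardFcn.m
```

Because an existing file with the same name is overwritten, it may be worth choosing a name that does not collide with a reward function you have already tuned by hand.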

Examples

Generate a Reward Function from MPC Object


This example shows how to generate a reinforcement learning reward function from an MPC object.

Define Plant and Create MPC Controller

Create a random plant using the rss function and set the feedthrough matrix to zero.

plant = rss(4,3,2);
plant.D = 0;

Specify which of the plant signals are manipulated variables, measured disturbances, measured outputs, and unmeasured outputs.

plant = setmpcsignals(plant,"MV",1,"MD",2,"MO",[1 2],"UO",3);

Create an MPC controller with a sample time of 0.1 and prediction and control horizons of 10 and 3 steps, respectively.

mpcobj = mpc(plant,0.1,10,3);

-->"Weights.ManipulatedVariables" is empty. Assuming default 0.00000.
-->"Weights.ManipulatedVariablesRate" is empty. Assuming default 0.10000.
-->"Weights.OutputVariables" is empty. Assuming default 1.00000.
   for output(s) y1 and zero weight for output(s) y2 y3

Set limits and a scale factor for the manipulated variable.

mpcobj.ManipulatedVariables.Min = -2;
mpcobj.ManipulatedVariables.Max = 2;
mpcobj.ManipulatedVariables.ScaleFactor = 4;

Set weights for the quadratic cost function.

mpcobj.Weights.OutputVariables = [10 1 0.1];
mpcobj.Weights.ManipulatedVariablesRate = 0.2;

Generate the Reward Function

Generate the reward function code from specifications in the MPC object using generateRewardFunction. The code is displayed in the MATLAB editor.

generateRewardFunction(mpcobj)

For this example, the code is saved in the MATLAB function file myMpcRewardFcn.m. Display the generated reward function.

type myMpcRewardFcn.m

function reward = myMpcRewardFcn(y,refy,mv,refmv,lastmv)
% MYMPCREWARDFCN generates rewards from MPC specifications.
%
% Description of input arguments:
%
% y : Output variable from plant at step k+1
% refy : Reference output variable at step k+1
% mv : Manipulated variable at step k
% refmv : Reference manipulated variable at step k
% lastmv : Manipulated variable at step k-1
%
% Limitations (MPC and NLMPC):
%     - Reward computed based on first step in prediction horizon.
%       Therefore, signal previewing and control horizon settings are ignored.
%     - Online cost and constraint update is not supported.
%     - Custom cost and constraint specifications are not considered.
%     - Time varying cost weights and constraints are not supported.
%     - Mixed constraint specifications are not considered (for the MPC case).
% Reinforcement Learning Toolbox
% 27-May-2021 14:47:29
%#codegen
%% Specifications from MPC object
% Standard linear bounds as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
ymin = [-Inf -Inf -Inf];
ymax = [Inf Inf Inf];
mvmin = -2;
mvmax = 2;
mvratemin = -Inf;
mvratemax = Inf;
% Scale factors as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
Sy  = [1 1 1];
Smv = 4;
% Standard cost weights as specified in 'Weights' property
Qy      = [10 1 0.1];
Qmv     = 0;
Qmvrate = 0.2;
%% Compute cost
dy      = (refy(:)-y(:)) ./ Sy';
dmv     = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Jy      = dy'      * diag(Qy.^2)      * dy;
Jmv     = dmv'     * diag(Qmv.^2)     * dmv;
Jmvrate = dmvrate' * diag(Qmvrate.^2) * dmvrate;
Cost    = Jy + Jmv + Jmvrate;
%% Penalty function weight (specify nonnegative)
Wy = [1 1 1];
Wmv = 10;
Wmvrate = 10;
%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set Pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
Py      = Wy      * exteriorPenalty(y,ymin,ymax,'step');
Pmv     = Wmv     * exteriorPenalty(mv,mvmin,mvmax,'step');
Pmvrate = Wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'step');
Penalty = Py + Pmv + Pmvrate;
%% Compute reward
reward = -(Cost + Penalty);
end

The calculated reward depends only on the current values of the plant input and output signals, their reference values, and the previous value of the manipulated variable (used for the rate term). It is composed of two parts.

The first is a negative cost that depends on the squared difference between desired and current plant inputs and outputs. This part uses the cost function weights specified in the MPC object. The second part is a penalty that acts as a negative reward whenever the current plant signals violate the constraints.
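The two parts can be written out term by term. The symbols below mirror the variables of the generated function (scale factors S, cost weights Q, penalty weights W); the notation p(·) for the chosen penalty function evaluated against the corresponding bounds is introduced here only for illustration.

```latex
\begin{aligned}
J_{y} &= \sum_{i} \left( Q_{y,i}\,\frac{\mathrm{refy}_i - y_i}{S_{y,i}} \right)^{2} \\
J_{mv} &= \sum_{j} \left( Q_{mv,j}\,\frac{\mathrm{refmv}_j - mv_j}{S_{mv,j}} \right)^{2} \\
J_{mvrate} &= \sum_{j} \left( Q_{mvrate,j}\,\frac{mv_j - \mathrm{lastmv}_j}{S_{mv,j}} \right)^{2} \\
P &= \sum_{i} W_{y,i}\,p(y_i) + \sum_{j} W_{mv,j}\,p(mv_j) + \sum_{j} W_{mvrate,j}\,p(mv_j - \mathrm{lastmv}_j) \\
\mathrm{reward} &= -\left( J_{y} + J_{mv} + J_{mvrate} + P \right)
\end{aligned}
```

Note that the rate term is scaled by the manipulated variable scale factor, as in the generated code.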

The generated reward function is a starting point for reward design. You can tune the weights or use a different penalty function to define a more appropriate reward for your reinforcement learning agent.
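Before tuning, it can help to evaluate both parts by hand for one sample. The sketch below repeats the cost and penalty computation with the specifications of this example; the sample signal values are hypothetical, and the anonymous function stepPenalty is a stand-in written for this sketch (assumed to behave like the 'step' method of exteriorPenalty, returning 1 where a bound is violated) so that the snippet runs without any toolbox.

```matlab
% Specifications taken from the MPC object in this example
Sy = [1 1 1];      Smv = 4;
Qy = [10 1 0.1];   Qmv = 0;   Qmvrate = 0.2;
ymin = [-Inf -Inf -Inf];   ymax = [Inf Inf Inf];
mvmin = -2;   mvmax = 2;   mvratemin = -Inf;   mvratemax = Inf;
Wy = [1 1 1];   Wmv = 10;   Wmvrate = 10;

% Hypothetical sample signals (not part of the original example)
y = [1; 0; 0];   refy = [0; 0; 0];
mv = 3;   refmv = 0;   lastmv = 1;

% Stand-in for exteriorPenalty(...,'step'): 1 where a bound is violated
stepPenalty = @(x,xmin,xmax) double(x(:) < xmin(:) | x(:) > xmax(:));

% Quadratic cost on scaled tracking errors and on the scaled MV rate
dy      = (refy(:)-y(:)) ./ Sy';
dmv     = (refmv(:)-mv(:)) ./ Smv';
dmvrate = (mv(:)-lastmv(:)) ./ Smv';
Cost = dy'*diag(Qy.^2)*dy + dmv'*diag(Qmv.^2)*dmv ...
     + dmvrate'*diag(Qmvrate.^2)*dmvrate;          % 100 + 0 + 0.01

% Only the MV bound is violated (mv = 3 exceeds mvmax = 2)
Penalty = Wy*stepPenalty(y,ymin,ymax) ...
        + Wmv*stepPenalty(mv,mvmin,mvmax) ...
        + Wmvrate*stepPenalty(mv-lastmv,mvratemin,mvratemax);   % 0 + 10 + 0

reward = -(Cost + Penalty);                        % approximately -110.01
disp(reward)
```

In this sample the output error dominates the reward, which suggests that the relative size of the cost weights and penalty weights is the first thing to look at when tuning.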

Generate Reward Function from Verification Block


This example shows how to generate a reinforcement learning reward function from a Simulink Design Optimization model verification block.

For this example, open the Simulink model LevelCheckBlock.slx, which contains a Check Step Response Characteristics block named Level Check.

open_system("LevelCheckBlock")

[Figure: Check Step Response Characteristics [1] - Level Check]

Generate the reward function code from specifications in the Level Check block, using generateRewardFunction. The code is displayed in the MATLAB editor.

generateRewardFunction("LevelCheckBlock/Level Check")

For this example, the code is saved in the MATLAB function file myBlockRewardFcn.m.

Display the generated reward function.

type myBlockRewardFcn.m

function reward = myBlockRewardFcn(x,t)
% MYBLOCKREWARDFCN generates rewards from Simulink block specifications.
%
% x : Input of LevelCheckBlock/Level Check
% t : Simulation time (s)
% Reinforcement Learning Toolbox
% 27-May-2021 16:45:27
%#codegen
%% Specifications from LevelCheckBlock/Level Check
Block1_InitialValue = 1;
Block1_FinalValue = 2;
Block1_StepTime = 0;
Block1_StepRange = Block1_FinalValue - Block1_InitialValue;
Block1_MinRise = Block1_InitialValue + Block1_StepRange * 80/100;
Block1_MaxSettling = Block1_InitialValue + Block1_StepRange * (1+2/100);
Block1_MinSettling = Block1_InitialValue + Block1_StepRange * (1-2/100);
Block1_MaxOvershoot = Block1_InitialValue + Block1_StepRange * (1+10/100);
Block1_MinUndershoot = Block1_InitialValue - Block1_StepRange * 5/100;
if t >= Block1_StepTime
    if Block1_InitialValue <= Block1_FinalValue
        Block1_UpperBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; ...
                                       Block1_MaxSettling,Block1_MaxSettling];
        Block1_LowerBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; ...
                                       Block1_MinRise,Block1_MinRise; ...
                                       Block1_MinSettling,Block1_MinSettling];
    else
        Block1_UpperBoundTimes = [0,2; 2,5; 5,max(5+1,t+1)];
        Block1_UpperBoundAmplitudes = [Block1_MinUndershoot,Block1_MinUndershoot; ...
                                       Block1_MinRise,Block1_MinRise; ...
                                       Block1_MinSettling,Block1_MinSettling];
        Block1_LowerBoundTimes = [0,5; 5,max(5+1,t+1)];
        Block1_LowerBoundAmplitudes = [Block1_MaxOvershoot,Block1_MaxOvershoot; ...
                                       Block1_MaxSettling,Block1_MaxSettling];
    end
    Block1_xmax = zeros(1,size(Block1_UpperBoundTimes,1));
    for idx = 1:numel(Block1_xmax)
        tseg = Block1_UpperBoundTimes(idx,:);
        xseg = Block1_UpperBoundAmplitudes(idx,:);
        Block1_xmax(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmax))
        Block1_xmax = Inf;
    else
        Block1_xmax = max(Block1_xmax,[],'omitnan');
    end
    Block1_xmin = zeros(1,size(Block1_LowerBoundTimes,1));
    for idx = 1:numel(Block1_xmin)
        tseg = Block1_LowerBoundTimes(idx,:);
        xseg = Block1_LowerBoundAmplitudes(idx,:);
        Block1_xmin(idx) = interp1(tseg,xseg,t,'linear',NaN);
    end
    if all(isnan(Block1_xmin))
        Block1_xmin = -Inf;
    else
        Block1_xmin = max(Block1_xmin,[],'omitnan');
    end
else
    Block1_xmin = -Inf;
    Block1_xmax = Inf;
end

%% Penalty function weight (specify nonnegative)
Weight = 1;
%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
Penalty = sum(exteriorPenalty(x,Block1_xmin,Block1_xmax,'step'));
%% Compute reward
reward = -Weight * Penalty;
end

The generated reward function takes as input arguments the current value of the verification block input signals and the simulation time. A negative reward is calculated using a weighted penalty that acts whenever the current block input signals violate the linear bound constraints defined in the verification block.
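To make the time dependence of the bounds concrete, the following sketch evaluates the rising-step bounds of this example at a single time instant. The evaluation time and signal values are hypothetical, only the bound segments that matter before t = 5 are reproduced, and stepPenalty is again a stand-in written for this sketch (assumed to match the 'step' method of exteriorPenalty) so that the code runs without any toolbox.

```matlab
% Step response specifications of this example (rising step from 1 to 2)
initialValue  = 1;
finalValue    = 2;
stepRange     = finalValue - initialValue;
minRise       = initialValue + stepRange*80/100;       % 1.8
maxOvershoot  = initialValue + stepRange*(1+10/100);   % 2.1
minUndershoot = initialValue - stepRange*5/100;        % 0.95

t = 3;   % hypothetical evaluation time (s), inside the [2,5] segment

% Lower bound segments before t = 5, one row per segment
lowerTimes = [0,2; 2,5];
lowerAmps  = [minUndershoot,minUndershoot; minRise,minRise];

% Interpolate each segment; segments that do not contain t yield NaN
xminSeg = zeros(1,size(lowerTimes,1));
for idx = 1:numel(xminSeg)
    xminSeg(idx) = interp1(lowerTimes(idx,:),lowerAmps(idx,:),t,'linear',NaN);
end
xmin = max(xminSeg(~isnan(xminSeg)));   % active lower bound at time t
xmax = maxOvershoot;                    % upper bound is constant on [0,5]

% Stand-in for exteriorPenalty(...,'step'): 1 where a bound is violated
stepPenalty = @(x,lo,hi) double(x < lo | x > hi);
weight = 1;

rewardLow  = -weight * stepPenalty(1.5,xmin,xmax);   % below the rise bound
rewardGood = -weight * stepPenalty(2.0,xmin,xmax);   % inside the bounds
fprintf('xmin = %.2f, xmax = %.2f\n',xmin,xmax);
fprintf('reward(1.5) = %d, reward(2.0) = %d\n',rewardLow,rewardGood);
```

With the step penalty, the reward is the same for any violation regardless of its size, so a signal that barely misses the rise bound is treated like one that misses it by a wide margin.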

Input Arguments

`mpcobj` — Linear or nonlinear MPC object
`mpc` object | `nlmpc` object

Linear or nonlinear MPC object, specified as an mpc (Model Predictive Control Toolbox) object or an nlmpc (Model Predictive Control Toolbox) object, respectively.

Note that:

The generated function calculates rewards using signal values at the current time only. Predicted future values, signal previewing, and control horizon settings are not used in the reward calculation.
Using time-varying cost weights and constraints, or updating them online, is not supported.
Only the standard quadratic cost function, as described in the Model Predictive Control Toolbox documentation, is supported. Therefore, for mpc objects, using mixed constraint specifications is not supported. Similarly, for nlmpc objects, custom cost and constraint specifications are not supported.

Example: mpc(tf([1 1],[1 2 0]),0.1)

`blks` — Path to model verification blocks
`char` vector | `cell` array | `string` array

Path to model verification blocks, specified as a character vector, cell array, or string array. The supported Simulink Design Optimization model verification blocks are the following.

(Simulink Design Optimization)
(Simulink Design Optimization)
Check Step Response Characteristics (Simulink Design Optimization)
Any block belonging to the Simulink (Simulink) library

Example: "mySimulinkModel02/Check Against Reference"

`myFcnName` — Function name
`string` | `char` vector

Function name, specified as a string object or character vector.

Example: "reward03epf_step"

Tips

By default, the exterior bound penalty function exteriorPenalty is used to calculate the penalty. Alternatively, to calculate hyperbolic and barrier penalties, you can use the hyperbolicPenalty or barrierPenalty functions.
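As a rough sketch of how the alternatives are called, the snippet below evaluates each penalty function for one bound violation. It assumes Reinforcement Learning Toolbox is installed; the bounds are those of the MPC example on this page, the sample value is hypothetical, and the hyperbolic and barrier functions are called with their default shape parameters.

```matlab
% Requires Reinforcement Learning Toolbox.
mvmin = -2;   mvmax = 2;   % bounds from the MPC example on this page
mv = 2.5;                  % hypothetical sample value above the upper bound

% Exterior penalties: zero inside the bounds, positive outside
pStep = exteriorPenalty(mv,mvmin,mvmax,'step');
pQuad = exteriorPenalty(mv,mvmin,mvmax,'quadratic');

% Smooth alternatives, called here with default shape parameters
pHyp = hyperbolicPenalty(mv,mvmin,mvmax);
pBar = barrierPenalty(mv,mvmin,mvmax);

disp([pStep pQuad pHyp pBar])
```

To switch the generated reward function to one of these, replace the exteriorPenalty calls in the "Compute penalty" section and retune the penalty weights, since the functions return values on different scales.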

Version History

Introduced in R2021b
