Engineering, 07.03.2020 02:46 lukeperry

Show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.

Answers

Answer from: vondah4014

Explanation:

MDPs can be formulated with a reward function R(s) that depends only on the state, R(s, a) that also depends on the action taken, or R(s, a, s') that depends on the action taken and the outcome state.

The task is to show how an MDP with a reward function R(s, a, s') can be transformed into a different MDP with reward function R(s, a), such that optimal policies in the new MDP correspond exactly to optimal policies in the original MDP.
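For reference, the Bellman equations for the three reward formulations can be written as below. This is a sketch in the standard form, assuming the transition model T(s, a, s') and discount γ used in the rest of this answer; the comments are only labels.

```latex
% Reward depends only on the state: R(s)
U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')

% Reward depends on state and action: R(s, a)
U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]

% Reward depends on state, action, and outcome state: R(s, a, s')
U(s) = \max_a \sum_{s'} T(s, a, s') \big[ R(s, a, s') + \gamma\, U(s') \big]
```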

One solution is to define a 'pre-state' pre(s, a, s') for every s, a, s' such that executing a in s leads not to s' but to pre(s, a, s'). From the pre-state there is only one action b, which always leads to s'. Let the new MDP have transition T', reward R', and discount γ':

$$T'(s, a, \mathrm{pre}(s, a, s')) = T(s, a, s')$$

$$T'(\mathrm{pre}(s, a, s'), b, s') = 1$$

$$R'(s, a) = 0$$

$$R'(\mathrm{pre}(s, a, s'), b) = \gamma^{-1/2} R(s, a, s')$$

$$\gamma' = \gamma^{1/2}$$

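Then, using pre as shorthand for pre(s, a, s'), the Bellman equation of the new MDP is:

$$U(s) = \max_a \Big[ R'(s, a) + \gamma^{1/2} \sum_{\mathrm{pre}} T'(s, a, \mathrm{pre}) \Big( \max_b \Big[ R'(\mathrm{pre}, b) + \gamma^{1/2} \sum_{s'} T'(\mathrm{pre}, b, s')\, U(s') \Big] \Big) \Big]$$

The reward R'(s, a) is zero, the max over b is over a single action that leads to s' with probability 1, and the two factors of γ^{1/2} combine, so this reduces to the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ \sum_{s'} T(s, a, s') \big( R(s, a, s') + \gamma\, U(s') \big) \Big]$$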
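The pre-state construction can be checked numerically. The following is a minimal sketch, not part of the original answer: it assumes a small random MDP (the helper `random_mdp`, the function names, and the tuple encoding `("pre", s, a, s2)` for pre-states are all hypothetical choices made for illustration), builds the transformed MDP, and compares the utilities of the original states under value iteration.

```python
import math
import random


def random_mdp(n_states=3, n_actions=2, seed=0):
    """Hypothetical toy MDP with random T(s, a, s') and R(s, a, s')."""
    rng = random.Random(seed)
    states, actions = list(range(n_states)), list(range(n_actions))
    T, R = {}, {}
    for s in states:
        for a in actions:
            weights = [rng.random() + 0.1 for _ in states]
            total = sum(weights)
            for s2 in states:
                T[s, a, s2] = weights[s2] / total
                R[s, a, s2] = rng.uniform(-1.0, 1.0)
    return states, actions, T, R


def solve_original(states, actions, T, R, gamma, iters=500):
    """Value iteration for U(s) = max_a sum_s' T(s,a,s') (R(s,a,s') + gamma U(s'))."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {
            s: max(
                sum(T[s, a, s2] * (R[s, a, s2] + gamma * U[s2]) for s2 in states)
                for a in actions
            )
            for s in states
        }
    return U


def add_pre_states(states, actions, T, R, gamma):
    """Build the transformed MDP whose rewards depend only on (state, action)."""
    avail, T2, R2 = {}, {}, {}
    for s in states:
        avail[s] = list(actions)
        for a in actions:
            R2[s, a] = 0.0                      # R'(s, a) = 0
            T2[s, a] = {}
            for s2 in states:
                pre = ("pre", s, a, s2)
                T2[s, a][pre] = T[s, a, s2]     # T'(s, a, pre) = T(s, a, s')
                avail[pre] = ["b"]              # single action b in a pre-state
                T2[pre, "b"] = {s2: 1.0}        # T'(pre, b, s') = 1
                R2[pre, "b"] = R[s, a, s2] / math.sqrt(gamma)  # gamma^(-1/2) R(s, a, s')
    return avail, T2, R2, math.sqrt(gamma)      # gamma' = gamma^(1/2)


def solve_transformed(avail, T2, R2, gamma2, iters=1000):
    """Value iteration for U(x) = max_a [R'(x, a) + gamma' sum_y T'(x, a, y) U(y)]."""
    U = {x: 0.0 for x in avail}
    for _ in range(iters):
        U = {
            x: max(
                R2[x, a] + gamma2 * sum(p * U[y] for y, p in T2[x, a].items())
                for a in avail[x]
            )
            for x in avail
        }
    return U


if __name__ == "__main__":
    gamma = 0.9
    states, actions, T, R = random_mdp()
    U = solve_original(states, actions, T, R, gamma)
    U2 = solve_transformed(*add_pre_states(states, actions, T, R, gamma))
    # Largest gap between the two utility functions on the original states.
    print(max(abs(U[s] - U2[s]) for s in states))
```

The transformed MDP takes two steps for every step of the original one, which is why the sketch runs twice as many iterations on it and why the discount is split into two factors of γ^{1/2}.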

Now do the same to convert MDPs with R(s, a) into MDPs with R(s).

Similar to the pre-state construction, create a state post(s, a) for every s, a, with a single action b available in it, such that

$$T'(s, a, \mathrm{post}(s, a)) = 1$$

$$T'(\mathrm{post}(s, a), b, s') = T(s, a, s')$$

$$R'(s) = 0$$

$$R'(\mathrm{post}(s, a)) = \gamma^{-1/2} R(s, a)$$

$$\gamma' = \gamma^{1/2}$$

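Then, using post as shorthand for post(s, a), the Bellman equation of the new MDP is:

$$U(s) = R'(s) + \gamma^{1/2} \max_a \Big[ \sum_{\mathrm{post}} T'(s, a, \mathrm{post}) \Big( R'(\mathrm{post}) + \gamma^{1/2} \max_b \Big[ \sum_{s'} T'(\mathrm{post}, b, s')\, U(s') \Big] \Big) \Big]$$

Since R'(s) is zero, action a leads to post(s, a) with probability 1, and b is the only action in the post-state, this reduces to the Bellman equation of the original MDP:

$$U(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s'} T(s, a, s')\, U(s') \Big]$$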
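The post-state construction can be checked the same way. This is again a hedged sketch rather than part of the original answer: the toy MDP, the function names, and the tuple encoding `("post", s, a)` for post-states are assumptions made for illustration.

```python
import math
import random


def random_mdp(n_states=3, n_actions=2, seed=0):
    """Hypothetical toy MDP with random T(s, a, s') and R(s, a)."""
    rng = random.Random(seed)
    states, actions = list(range(n_states)), list(range(n_actions))
    T, R = {}, {}
    for s in states:
        for a in actions:
            R[s, a] = rng.uniform(-1.0, 1.0)
            weights = [rng.random() + 0.1 for _ in states]
            total = sum(weights)
            for s2 in states:
                T[s, a, s2] = weights[s2] / total
    return states, actions, T, R


def solve_sa(states, actions, T, R, gamma, iters=500):
    """Value iteration for U(s) = max_a [R(s,a) + gamma sum_s' T(s,a,s') U(s')]."""
    U = {s: 0.0 for s in states}
    for _ in range(iters):
        U = {
            s: max(
                R[s, a] + gamma * sum(T[s, a, s2] * U[s2] for s2 in states)
                for a in actions
            )
            for s in states
        }
    return U


def add_post_states(states, actions, T, R, gamma):
    """Build the transformed MDP whose rewards depend only on the state."""
    avail, T2, R2 = {}, {}, {}
    for s in states:
        avail[s] = list(actions)
        R2[s] = 0.0                                  # R'(s) = 0
        for a in actions:
            post = ("post", s, a)
            T2[s, a] = {post: 1.0}                   # T'(s, a, post) = 1
            avail[post] = ["b"]                      # single action b in a post-state
            T2[post, "b"] = {s2: T[s, a, s2] for s2 in states}  # T'(post, b, s') = T(s, a, s')
            R2[post] = R[s, a] / math.sqrt(gamma)    # gamma^(-1/2) R(s, a)
    return avail, T2, R2, math.sqrt(gamma)           # gamma' = gamma^(1/2)


def solve_s(avail, T2, R2, gamma2, iters=1000):
    """Value iteration for U(x) = R'(x) + gamma' max_a sum_y T'(x, a, y) U(y)."""
    U = {x: 0.0 for x in avail}
    for _ in range(iters):
        U = {
            x: R2[x] + gamma2 * max(
                sum(p * U[y] for y, p in T2[x, a].items()) for a in avail[x]
            )
            for x in avail
        }
    return U


if __name__ == "__main__":
    gamma = 0.9
    states, actions, T, R = random_mdp()
    U = solve_sa(states, actions, T, R, gamma)
    U2 = solve_s(*add_post_states(states, actions, T, R, gamma))
    # Largest gap between the two utility functions on the original states.
    print(max(abs(U[s] - U2[s]) for s in states))
```

Here only one post-state is needed per (s, a) pair, because the reward R(s, a) does not depend on the outcome state; the stochastic transition to s' is deferred to the post-state.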
