Why is the result not better than MPC?
Hi Hongzi,
I tried to reproduce the results of Pensieve. After several attempts, I failed to get an ideal result (better performance than MPC). Here is what I did. The code was downloaded from GitHub, and the trace files were downloaded from Dropbox:
- Put the training data (train_sim_traces) in sim/cooked_traces and the testing data (test_sim_traces) in sim/cooked_test_traces;
- Run `python multi_agent.py` to train the model;
- Copy the generated model files to test/model, and modify the model name in test/rl_no_training.py;
- Run `python rl_no_training.py` in the test/ folder to test the model; the trace files in test_sim_traces are also used;
- Run `python plot_results.py` to compare the results with the DP and MPC methods.
I put two figures, total_reward and the CDF, here. We can see that the performance of Pensieve is not better than MPC's.
Here is a figure from TensorBoard; the training step is about 160,000.
I found that the result is not very stable after long training (more than 10,000 steps). Thus, models from nearby checkpoints give different performance at test time. For example, the model at step 164,500 got a reward of 35.2, while the model at step 164,600 got a reward of 33.7.
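One possible workaround for this checkpoint-to-checkpoint variance (this is not from the Pensieve repo; the checkpoint names and reward values below are hypothetical) is to evaluate several recent checkpoints on the test traces and keep the one with the best mean reward:

```python
def best_checkpoint(rewards_by_ckpt):
    """Pick the checkpoint whose mean per-trace test reward is highest.

    rewards_by_ckpt maps a checkpoint name to the list of per-trace
    rewards obtained by running rl_no_training.py with that model.
    """
    return max(
        rewards_by_ckpt,
        key=lambda ckpt: sum(rewards_by_ckpt[ckpt]) / len(rewards_by_ckpt[ckpt]),
    )

# Hypothetical per-trace rewards for two nearby checkpoints:
rewards = {
    "nn_model_ep_164500.ckpt": [35.2, 34.9, 35.5],
    "nn_model_ep_164600.ckpt": [33.7, 34.0, 33.5],
}
print(best_checkpoint(rewards))  # prints "nn_model_ep_164500.ckpt"
```

Averaging over all test traces, rather than trusting a single checkpoint's single run, smooths out some of the noise you are seeing.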
Did I do something wrong, such that I couldn't get the same results as you described in the paper? The pretrain_linear_reward model performs well. How did you get it? Can you give me a hand with these questions? Any answer is highly appreciated.
Thanks!
Issue Analytics
- State:
- Created 6 years ago
- Comments: 7 (3 by maintainers)
Top GitHub Comments
Sure. I'll try to use `placeholder` and post my result if it works. The following is the current result:
- `ENTROPY_WEIGHT = 5`, 1~20000 epochs
- `ENTROPY_WEIGHT = 1`, 20001~40000 epochs
- `ENTROPY_WEIGHT = 0.5`, 40001~80000 epochs
- `ENTROPY_WEIGHT = 0.3`, 80001~100000 epochs
- `ENTROPY_WEIGHT = 0.1`, 100001~120000 epochs

I'm glad you got the good performance!
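The manual schedule above can be written as a step function so the weight no longer has to be changed by hand between runs. This is just a sketch that mirrors the epoch boundaries in the list; nothing here comes from the Pensieve code:

```python
def entropy_weight(epoch):
    """Step-function entropy decay matching the manual schedule above."""
    if epoch <= 20000:
        return 5.0
    elif epoch <= 40000:
        return 1.0
    elif epoch <= 80000:
        return 0.5
    elif epoch <= 100000:
        return 0.3
    else:
        return 0.1

print(entropy_weight(1), entropy_weight(50000), entropy_weight(110000))
# prints: 5.0 0.5 0.1
```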
As for automatically decaying the exploration factor, notice that `ENTROPY_WEIGHT` sets a constant in the TensorFlow computation graph (e.g., https://github.com/hongzimao/pensieve/blob/master/sim/a3c.py#L47-L52). To make it tunable during execution, you need to specify a TensorFlow `placeholder` and set its value each time. I think any reasonable decay function should work (e.g., linear, step function, etc.). If you manage to get that to work, could you post your result (maybe open another issue)? Although we have our internal implementation (we didn't post it because (1) it's fairly easy to implement and (2) more importantly we intentionally want others to observe this effect), we would appreciate it a lot if someone could reproduce and improve our implementation. Thanks!
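For a linear decay, a minimal sketch follows. The start/end values and epoch count are illustrative (chosen to roughly span the step schedule discussed above), and the placeholder wiring is shown only as comments, since it depends on how a3c.py is refactored:

```python
def linear_entropy_weight(epoch, start=5.0, end=0.1, total_epochs=120000):
    """Linearly anneal the entropy weight from `start` to `end`."""
    frac = min(epoch / total_epochs, 1.0)
    return start + (end - start) * frac

# In TF1-style training code, the graph would use a placeholder in place of
# the ENTROPY_WEIGHT constant, roughly:
#   entropy_weight_ph = tf.placeholder(tf.float32, shape=[])
#   ... build the entropy term of the loss with entropy_weight_ph ...
#   sess.run(train_op,
#            feed_dict={..., entropy_weight_ph: linear_entropy_weight(epoch)})

print(linear_entropy_weight(0))       # 5.0
print(linear_entropy_weight(120000))  # ~0.1
```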