
TicTacToe Not Learning to Win?

See original GitHub issue

I’ve been playing with the TicTacToe example. The example trains rapidly on my CPU (great A3C implementation @peastman!), but the rewards indicate that the learned model isn’t doing great. In the TicTacToe environment, +10 is a win, +5 a draw, +0.1 an unfinished game, and -3 a loss or illegal move. The average rewards I’m seeing (evaluated on 5K games, at intervals of 50K rollouts) are 3-5, indicating that the system is achieving roughly a draw on average:

    [{50000: 3.5433}, {100000: 5.9225}, {150000: 3.1412}, {200000: 3.2426},
     {250000: 3.3362}, {300000: 3.4727}, {350000: 5.0314}, {400000: 3.5625},
     {450000: 2.9566}, {500000: 3.703}, {550000: 2.6173}, {600000: 3.5273},
     {650000: 4.0795}, {700000: 3.6586}, {750000: 3.1958}, {800000: 5.8153},
     {850000: 3.9902}, {900000: 4.1487}, {950000: 3.3544}, {1000000: 3.3947}]

Naively, I would expect that we should be able to train RL to win nearly always at TicTacToe (average reward close to 10). What might we be missing?
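
For concreteness, the evaluation described above (mean total reward over 5K complete games) might be computed with a loop along the lines of the sketch below. The `env` and `select_action` objects and their `reset`/`step`/`terminated` interface are assumptions standing in for the actual DeepChem environment and trained A3C policy, not a confirmed API.

    # Minimal evaluation sketch: play n_games to completion with the trained
    # policy and report the mean total reward per game.  `env` and
    # `select_action` are hypothetical stand-ins for the real objects.
    def average_reward(env, select_action, n_games=5000):
        total = 0.0
        for _ in range(n_games):
            env.reset()
            game_reward = 0.0
            while not env.terminated:
                action = select_action(env.state)  # pick a move from the trained policy
                game_reward += env.step(action)    # step() is assumed to return the reward
            total += game_reward
        return total / n_games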

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
peastman commented, Aug 7, 2017

I think the main problem is just that your network is much bigger and more complicated than it needs to be. I simplified it down to this:

    # Requires: import tensorflow as tf and
    # from deepchem.models.tensorgraph.layers import Dense, Flatten
    d1 = Flatten(in_layers=state)          # flatten the board state
    d2 = Dense(
        in_layers=[d1],
        activation_fn=tf.nn.relu,
        out_channels=64)                   # single 64-unit hidden layer
    # Policy head: a probability for each of the 9 squares.
    probs = Dense(in_layers=[d2], activation_fn=tf.nn.softmax, out_channels=9)
    # Value head: scalar estimate of the current state's value.
    value = Dense(in_layers=[d2], activation_fn=None, out_channels=1)
    return {'action_prob': probs, 'value': value}

I also reduced value_weight back to 1.0 (the default value). With those changes, I get much better results:

    [{100000: 8.2761}, {200000: 6.9362}, {300000: 7.53}, {400000: 8.1309}, {500000: 7.4716}]
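
For context, the pieces mentioned in this comment might fit together roughly as in the sketch below. The import paths, the `create_layers` policy interface, and the `A3C` constructor arguments reflect the 2017-era DeepChem TensorGraph API as I understand it and may differ in other versions; treat this as a sketch rather than the exact example code.

    # Hedged sketch of wiring the simplified network into A3C with the default
    # value_weight.  Import paths and class interfaces are assumptions based on
    # the 2017-era DeepChem API and may not match other versions exactly.
    import tensorflow as tf
    import deepchem as dc
    from deepchem.models.tensorgraph.layers import Dense, Flatten
    from deepchem.rl.a3c import A3C
    from deepchem.rl.envs.tictactoe import TicTacToeEnvironment

    class SimplePolicy(dc.rl.Policy):
        def create_layers(self, state, **kwargs):
            # The simplified two-head network from the snippet above.
            d1 = Flatten(in_layers=state)
            d2 = Dense(in_layers=[d1], activation_fn=tf.nn.relu, out_channels=64)
            probs = Dense(in_layers=[d2], activation_fn=tf.nn.softmax, out_channels=9)
            value = Dense(in_layers=[d2], activation_fn=None, out_channels=1)
            return {'action_prob': probs, 'value': value}

    env = TicTacToeEnvironment()
    a3c = A3C(env, SimplePolicy(), value_weight=1.0)  # 1.0 is the stated default
    a3c.fit(100000)  # train for 100K rollouts before evaluating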

0 reactions
lilleswing commented, Aug 7, 2017

Might be a good time to try to get https://github.com/deepchem/deepchem/issues/720 to work (at least in contrib).

