
TicTacToe Not Learning to Win?

See original GitHub issue

I’ve been playing with the TicTacToe example. The example trains rapidly on my CPU (great A3C implementation @peastman!), but the rewards indicate that the learned model isn’t doing great. In the TicTacToe environment, +10 is a win, +5 a draw, +0.1 an unfinished game, and -3 a loss or illegal move. The average rewards I’m seeing (evaluated on 5K games, at intervals of 50K rollouts) are 3-5, indicating that the system is achieving roughly a draw on average:

    [{50000: 3.5433}, {100000: 5.9225}, {150000: 3.1412}, {200000: 3.2426},
     {250000: 3.3362}, {300000: 3.4727}, {350000: 5.0314}, {400000: 3.5625},
     {450000: 2.9566}, {500000: 3.703}, {550000: 2.6173}, {600000: 3.5273},
     {650000: 4.0795}, {700000: 3.6586}, {750000: 3.1958}, {800000: 5.8153},
     {850000: 3.9902}, {900000: 4.1487}, {950000: 3.3544}, {1000000: 3.3947}]

Naively, I would expect that we should be able to train RL to win nearly always at TicTacToe (average reward close to 10). What might we be missing?
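
For concreteness, the evaluation described above (mean total reward over 5K complete games) might be computed with a loop along the lines of the sketch below. The `env` and `select_action` objects and their `reset`/`step`/`terminated` interface are assumptions standing in for the actual DeepChem environment and trained A3C policy, not a confirmed API.

    # Minimal evaluation sketch: play n_games to completion with the trained
    # policy and report the mean total reward per game.  `env` and
    # `select_action` are hypothetical stand-ins for the real objects.
    def average_reward(env, select_action, n_games=5000):
        total = 0.0
        for _ in range(n_games):
            env.reset()
            game_reward = 0.0
            while not env.terminated:
                action = select_action(env.state)  # pick a move from the trained policy
                game_reward += env.step(action)    # step() is assumed to return the reward
            total += game_reward
        return total / n_games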

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
peastman commented, Aug 7, 2017

I think the main problem is just that your network is much bigger and more complicated than it needs to be. I simplified it down to this:

    # Requires: import tensorflow as tf and
    # from deepchem.models.tensorgraph.layers import Dense, Flatten
    d1 = Flatten(in_layers=state)          # flatten the board state
    d2 = Dense(
        in_layers=[d1],
        activation_fn=tf.nn.relu,
        out_channels=64)                   # single 64-unit hidden layer
    # Policy head: a probability for each of the 9 squares.
    probs = Dense(in_layers=[d2], activation_fn=tf.nn.softmax, out_channels=9)
    # Value head: scalar estimate of the current state's value.
    value = Dense(in_layers=[d2], activation_fn=None, out_channels=1)
    return {'action_prob': probs, 'value': value}

I also reduced value_weight back to 1.0 (the default value). With those changes, I get much better results:

    [{100000: 8.2761}, {200000: 6.9362}, {300000: 7.53}, {400000: 8.1309}, {500000: 7.4716}]
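
For context, the pieces mentioned in this comment might fit together roughly as in the sketch below. The import paths, the `create_layers` policy interface, and the `A3C` constructor arguments reflect the 2017-era DeepChem TensorGraph API as I understand it and may differ in other versions; treat this as a sketch rather than the exact example code.

    # Hedged sketch of wiring the simplified network into A3C with the default
    # value_weight.  Import paths and class interfaces are assumptions based on
    # the 2017-era DeepChem API and may not match other versions exactly.
    import tensorflow as tf
    import deepchem as dc
    from deepchem.models.tensorgraph.layers import Dense, Flatten
    from deepchem.rl.a3c import A3C
    from deepchem.rl.envs.tictactoe import TicTacToeEnvironment

    class SimplePolicy(dc.rl.Policy):
        def create_layers(self, state, **kwargs):
            # The simplified two-head network from the snippet above.
            d1 = Flatten(in_layers=state)
            d2 = Dense(in_layers=[d1], activation_fn=tf.nn.relu, out_channels=64)
            probs = Dense(in_layers=[d2], activation_fn=tf.nn.softmax, out_channels=9)
            value = Dense(in_layers=[d2], activation_fn=None, out_channels=1)
            return {'action_prob': probs, 'value': value}

    env = TicTacToeEnvironment()
    a3c = A3C(env, SimplePolicy(), value_weight=1.0)  # 1.0 is the stated default
    a3c.fit(100000)  # train for 100K rollouts before evaluating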

0 reactions
lilleswing commented, Aug 7, 2017

Might be a good time to try to get https://github.com/deepchem/deepchem/issues/720 to work (at least in contrib).

