Agent57 vs Meena – Part I: The Death of Gaming Milestones?


A few weeks ago DeepMind released a preprint on arXiv describing a new deep RL agent, Agent57, that surpassed the performance of a single professional human game tester on a sample of Atari games. The sample included all of the games that had previously posed the greatest challenges to other deep RL agents. Since the preprint's release the project has received a lot of attention, including from AI Impacts, who have tentatively concluded that this milestone resolves a forecast from their 2016 survey significantly earlier than experts had predicted. However, judged against Dafoe’s desiderata for AI forecasting targets, this target was significantly flawed.

I will not delve into the forecast’s flaws here, because what matters more is that the achievement itself appears to be no more than incremental progress. Last decade, milestones on games made for impressive headlines and brought a lot of attention to DeepMind. Agent57, however, appears to have received substantially less attention than those earlier milestones. Perhaps the media is rightfully weary of DeepMind’s handwavy progress on the latest milestones. For example, last year’s major milestone on StarCraft II involved outperforming 99.85% of amateur human players, which, to be sure, is impressive. However, this is less impressive than the superhuman performance of AlphaGo in 2015 or the follow-up in 2017 of a vastly superior system which did not even require any human training data. Perhaps the most significant evidence is that the 8-page paper, with 20 pages of appendices, appeared on arXiv rather than in Nature, where all three of the examples from last decade cited here were published. In the past Atari performance has been suitable for Nature, but not anymore.

To be clear, I am by no means trying to diminish the value of this work. The fact that, without any domain knowledge, a single algorithm can perform better than an above-average yet less-than-expert human on games like Pitfall and Montezuma’s Revenge is itself impressive. Doing so whilst achieving the performance demonstrated on so many other Atari games is even more impressive. However, this comes 16 months after Go-Explore demonstrated the first significant progress on these two games, and given the pace of AI progress that we are now accustomed to, the result could even be considered foreseeable. The problem with AI forecasting is that there are so many moving targets to forecast, and it is difficult for those who work on this full-time to determine where resources are best allocated. In the case of games, ironically, the answer may simply be that forecasts are only worth the time for publicity purposes.
