Agent57 vs Meena – Part II: An Era of True Progress in AI?

True Progress in AI

When conducting interviews with dozens of the world’s AI experts over the past few years I frequently asked a question about what they thought would be good indicators of true progress in AI. Oftentimes they would reply that the latest milestones which had received substantial media coverage were not signs of true progress in AI, and offered different suggestions for what would constitute true progress in AI.

The previous post explored the recent progress demonstrated by DeepMind’s Agent57 on Atari, which could be seen as an indicator that gaming milestones no longer carry the weight they used to. Perhaps they were a poor signal of progress all along, or perhaps the real signal was just foreshadowing an era of true progress. AI has struggled to overcome the productivity paradox, and it remains to be seen whether the economic value of AI recently heralded by leading firms like McKinsey and PwC will in fact materialize. However, this raises the question: if not games, then what are the indicators of true progress in AI? Furthermore, what new applications of AI can actually bring the economic impacts that these firms have suggested?

Recent Progress in NLP

The past two years have seen tremendous progress in natural language processing (NLP). This began with progress on contextual word embeddings, first with ELMo. Then OpenAI released GPT, the first major transformer-based language model. This was soon followed by BERT, which is now commonly treated as the industry standard. Yet progress did not stop.

The General Language Understanding Evaluation (GLUE) benchmark, which aggregates a number of different tasks in an effort to assess natural language understanding (NLU), was first released in late spring 2018. By roughly the same time in 2019 a transformer-based language model had been used to exceed human-level performance on it. This had already been anticipated, and in late spring 2019 a new, stickier aggregate benchmark called SuperGLUE was released in an effort to establish a benchmark on which human-level performance would not be so easily surpassed. However, its reign was also short-lived, with T5 closing 95% of the gap by October. Since then, T5 has increased its score on SuperGLUE to 89.3/89.8, a 97.5% improvement over the BERT baseline.

So, is this progress in NLP indicative of true progress in AI? Perhaps, and perhaps not. What does seem clear about this progress in NLP, and more specifically in NLU, is that, much like milestones in games, the traditional means of benchmarking performance are obsolete. However, in contrast to deep RL and games, some new and innovative techniques have been developed for assessing progress in NLU.

One such approach just released this month addresses the “fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today’s benchmarks for language understanding.” This group of leading researchers from the University of Washington and the Allen Institute for AI propose a TuringAdvice challenge which involves assessing language understanding by eliciting advice from systems on complex real world situations. To make this concrete, they released RedditAdvice, an open dataset for evaluating natural language advice in real world scenarios. T5 is only able to give advice equally or more valuable than humans 9% of the time, which makes clear that there is indeed a long, long way to go to realize true NLU in machine learning.

Meena, Google’s “Human-like” Chatbot

Do the poor scores from T5 on RedditAdvice mean that the current systems have not made true progress toward NLU? Machine learning researchers are all aware of an informal notion dubbed “Goodhart’s law” which is understood to suggest that when a measure becomes a target, it ceases to be a good measure. This may have been the case for GLUE and SuperGLUE – that when systems begin trying to optimize for the specific benchmarks the benchmarks become targets rather than measures representative of NLU. The same may be true for the TuringAdvice challenge and the RedditAdvice dataset; only time will tell. Regardless, it should be clear by now that this tendency poses numerous challenges for assessing AI progress.

So, what about Google’s human-like Meena chatbot? Human-like is a strong description for Google to use in reference to their system. To be certain, the hype surrounding chatbots has not always ended well for tech giants. Why should we expect anything different from Meena? Well, for one, Meena uses a transformer-based language model, which involves the same fundamental deep learning architecture as the other Google language models (BERT and T5) which have driven NLU progress in the past two years. Moreover, Meena also utilizes a novel technique for evaluation that required crowd-sourcing humans to evaluate the quality of the system’s responses, in a fashion that resembles that first proposed by Turing in 1950.

While far from passing a Turing test, the performance from Meena is respectable. Perhaps the conversational topics are not as complex as those included in the RedditAdvice dataset, but this doesn’t mean that true progress has not been made. In fact, a large amount of communication that is practical in organizations concerns very narrow topics which do not require the broad knowledge necessary for general conversational agents or for giving real world advice. In call centers, operators typically respond to a very narrow set of concerns from customers, and call center routing has already been automated without even using AI. Because of the potential of models like T5 for transfer learning, and fine-tuning to increase performance on narrow tasks using proprietary datasets, it is likely that the progress demonstrated by Meena, while short of the robustness necessary for real world applications, is well suited for common, rote natural language tasks encountered by organizations.

Agent57 vs Meena – Part I: The Death of Gaming Milestones?


A few weeks ago DeepMind released a preprint on arXiv describing a new deep RL agent that has successfully surpassed the performance of a single human professional game tester on a sample of Atari games which included all of the games that had previously posed the greatest challenges to other deep RL agents. Since the preprint release this project has received a lot of attention, including from AI Impacts who have initially concluded that the achievement of this milestone represents the resolution of a forecast from their 2016 survey significantly earlier than experts had predicted. However, based on Dafoe’s desiderata for AI forecasting targets, this target was significantly flawed.

I will not delve into the forecast’s flaws here because what is more important is that the achievement itself appears to be no more than incremental progress. Last decade, milestones on games made for impressive headlines and brought a lot of attention to DeepMind. However, the attention Agent57 received appears to have been substantially less than DeepMind’s major milestones on games last decade. Perhaps the media is rightfully wary of DeepMind’s handwavy progress on the latest milestones. For example, last year’s major milestone on Starcraft 2 involved outperforming 99.85% of amateur human players, which, to be certain, is impressive. However, this is less impressive than the superhuman performance of AlphaGo in 2015 or the follow-up in 2017 of a vastly superior system which did not even require any human training data. Perhaps the most significant evidence is the fact that the 8 page paper with 20 pages of appendices appeared on arXiv and not in Nature, as did the three examples from last decade cited here. In the past Atari performance has been suitable for Nature, but not anymore.

To be clear, I am by no means trying to diminish the value of this work. The fact that, without any domain knowledge, a single algorithm can perform better than an above average yet less than expert human on games like Pitfall and Montezuma’s Revenge is itself impressive. Doing so whilst achieving the performance demonstrated on so many other Atari games is even more impressive. However, 16 months after Go-Explore demonstrated the first significant progress on these two games, and given the pace of AI progress that we are now accustomed to, this could even be thought to have been foreseeable. The problem with AI forecasting is that there are so many moving targets to forecast, and it is difficult for those who work on this full-time to determine where resources should be best allocated. In the case of games, ironically the answer may simply be that their forecasts are only worth the time for publicity purposes.

DeepMind’s Value to Google

While DeepMind is certainly not declaring a positive net income, it remains unclear how Google values DeepMind’s contributions internally. Considering that the vast majority of Google’s revenue comes from products hosted in data centers, it seems that the fiscal benefits of DeepMind’s work to reduce Google’s data centers’ energy use by as much as 40% may not be adequately represented in DeepMind’s financial statements.

After all, a data center is practically just a collection of thousands of electric heaters. These heaters also require dedicated cooling units to ensure they do not overheat. In 2016, Gartner estimated that Google had over 2.5 million servers. Assuming that each server is a 300W heater operating at a mean 2/3 capacity, and that it takes an equivalent amount of power to cool, we can easily estimate that Google uses 1 gigawatt on average. At roughly $0.12 per kWh, this suggests that powering their data centers costs $120k per hour. This may be a gross underestimate of current costs for numerous reasons that I won’t go into, but for a concrete example consider that in 2016 Google purchased the entire 62 megawatt output of a nearby wind farm for a single data center in The Netherlands (one of their roughly fifteen data centers, and one in which Google has since quadrupled their investment). Regardless, at $120k per hour this would place Google’s annual data center energy costs at $1.05 billion. Given this estimate, based on 2016 numbers, DeepMind’s contributions to reduce Google’s energy costs could be as much as $420 million per year.
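The back-of-envelope estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic: every constant is one of the assumptions stated in the paragraph (server count, 300W per server, 2/3 utilization, cooling equal to server power, $0.12 per kWh), and the variable and function names are my own.

```python
# Back-of-envelope estimate of Google's data center energy cost.
# All constants are the assumptions stated in the text, not measured values.
SERVERS = 2_500_000        # Gartner's 2016 estimate
WATTS_PER_SERVER = 300     # assumed draw of one server at full load
UTILIZATION = 2 / 3        # assumed mean utilization
COOLING_MULTIPLIER = 2.0   # assumes cooling draws as much power as the servers
USD_PER_KWH = 0.12         # assumed electricity price
HOURS_PER_YEAR = 24 * 365
SAVINGS_FRACTION = 0.40    # upper bound on the energy reduction quoted above


def average_power_kw():
    """Mean power draw of servers plus cooling, in kilowatts."""
    server_watts = SERVERS * WATTS_PER_SERVER * UTILIZATION
    return server_watts * COOLING_MULTIPLIER / 1000


def hourly_cost_usd():
    """Electricity cost per hour at the assumed price."""
    return average_power_kw() * USD_PER_KWH


def annual_cost_usd():
    """Electricity cost per year, assuming constant load."""
    return hourly_cost_usd() * HOURS_PER_YEAR


print(f"Average power: {average_power_kw() / 1e6:.2f} GW")
print(f"Hourly cost:   ${hourly_cost_usd():,.0f}")
print(f"Annual cost:   ${annual_cost_usd() / 1e9:.2f}B")
print(f"40% of annual: ${annual_cost_usd() * SAVINGS_FRACTION / 1e6:,.0f}M")
```

Changing any one constant shows how sensitive the result is to it. For example, halving the cooling assumption or the electricity price shifts the annual figure by hundreds of millions of dollars, so the final savings number should be read as an order-of-magnitude estimate.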

Given these numbers and the figures from the previous blog post, it appears that Google’s ROI from its acquisition of DeepMind isn’t reflected clearly in the UK financial documents for the latter.

2018 Trends in DeepMind Operating Costs (updated)

DeepMind’s 2018 financial data was made available this week. If you haven’t seen the previous DeepMind operating costs post, check it out for some context. The operating costs did not increase consistently with the previous extrapolation; however, there was still a 70% increase in operating costs. I’ve updated the extrapolation and plot, and it has not changed greatly from the previous year’s model. It is depicted in Figure 1 below.

Figure 1: Trends in DeepMind operating costs updated with 2018 data.

The new data also indicates that DeepMind has accumulated a large amount of debt, most of which is owed to its parent company Alphabet – I’m looking into this further. However, DeepMind’s contract revenue also continues to increase. In fact, prior to 2016, DeepMind had no revenue. In the three years since, turnover has increased to £102.8M. The data for this is shown in Figure 2 below. If revenue continues to increase at this rate, and assuming negligible interest on the debt from Alphabet, DeepMind would be able to support the current growth if they implemented a 20 month hiring and budget freeze starting on January 1st 2019. This would suggest that they could begin growing at the rate depicted here roughly one year from now, in September of 2020.

Figure 2: Trends in DeepMind operating costs compared to trends in DeepMind revenue growth.

Financial matters aside, another consideration is that realistically there doesn’t seem to be enough talent to continue to fuel the expansion at the current rate. Perhaps, if a hiring freeze effectively adjusted the operating costs growth curve to fall in line with the revenue trendline, it would be more plausible that the talent of the caliber necessary for a large, well-funded brain trust could be adequately trained and experienced in order to sustain the growth.

The upshot from this seems to be that DeepMind’s financial data still appears to be a strong indicator worth monitoring for projecting progress toward transformative AI.

(Note:  Further analysis is still needed, e.g. adding the projected dates for reaching levels of funding for large scale government funded projects given DeepMind’s income. Also, the current Jupyter notebook can be found here. Also see the follow-up post on DeepMind’s value to Google.)

Scenario Network Mapping Development Workshop

As part of my dissertation research I’ve been developing a workshopping technique for mapping the paths to AGI based on scenario network mapping. Scenario network mapping is a scenario planning technique that is intended to overcome challenges in complex strategic planning situations by incorporating large numbers of scenarios that each form a particular pathway of plausible possible futures. While suggested for mapping technology development, it had not previously been used for this task. Thus, I conducted a workshop to develop the method for the purposes of mapping the paths to AGI while at AI Safety Camp 3 in Spain last month.

The process is intended to be conducted through four half-day workshops held on different days, up to a week apart. For the purpose of this development workshop we simply conducted two two-hour workshops a couple of days apart. Despite the severe shortening of the process, the workshop was still a success. While we did not have time to complete the mapping, we were able to demonstrate a proof of concept and learn many lessons regarding workshop design and facilitation that will be valuable in future versions of the workshop. A paper proposing the technique and outlining the workshop process has been accepted for presentation at the 12th annual AGI conference in August.

Scenario network mapping is a visual and tactile technique that relies on a number of office supply products to create a colorful map of scenarios, or, in this case, paths to AGI. The process involves colored ribbon, different sizes of blank paper, different sizes and colors of sticky notes, masking tape, colored thin-tipped markers, etc. It is intended to be conducted in a single room with a large amount of wall space that is also big enough for splitting into breakout groups. The demonstrative workshop only used five people instead of the suggested 15-20, but more participants will be used in the next iteration.

The development workshop was very successful in demonstrating the viability of the adapted scenario network mapping technique. Many valuable lessons were also learned that will be applied to future workshops. Four out of five of the participants found the experience valuable and insightful. I’m looking forward to continuing the development and getting closer to complete maps of the plausible paths to AGI!


AI Forecasting Dinner

I have been working with Katja Grace (AI Impacts, FHI, MIRI) to organize a dinner for people interested in AI forecasting. This dinner will be held in San Francisco during EA Global on June 22nd at 6:30pm. We have rented a private venue and will be having pizza. The venue is a ~12 minute walk from the Westfield Center in San Francisco.

This will be an excellent opportunity for people interested in and working on AI forecasting to meet and chat. I’m looking forward to meeting new people and discussing current work with familiar faces. See you Saturday!

Trends in DeepMind Operating Costs

(Note: an updated version of this post has been created incorporating the 2018 data from DeepMind.)

Google DeepMind is frequently considered to be the world’s leading AI research lab. Alphabet, DeepMind’s parent company, has been investing heavily in DeepMind over recent years. When DeepMind’s operating costs are extrapolated the trendline indicates that the cumulative investment into DeepMind could approach the levels of historic large-scale national and multinational efforts in the coming years. A plot depicting this extrapolation is shown below.

A plot of DeepMind operating costs extrapolated to 2026. Google’s operating income is also extrapolated here to demonstrate the feasibility of such a sustained level of investment.

The construction of the Large Hadron Collider and the discovery of the Higgs Boson, The Manhattan Project, and The Apollo Program represent some of the crowning achievements of science and engineering over the past century. Inflation adjusted estimates for their costs are $13.3B, $23B and $112B in 2018 dollars, respectively. If Alphabet continues to increase investment into DeepMind at the current rate, then the cumulative investment would reach the level of the Apollo program in September 2024.

Of particular interest is that it is economically feasible for this trend to continue long enough to reach the levels of investment of these historic scientific feats (Alphabet will generate enough income over the next six years to sustain the required investment). Given that Alphabet does not pay a dividend and that it ended 2018 with more than $109B in cash, this is not only feasible but also plausible. Even if the rate of investment slows, DeepMind could still be reaching these unprecedented levels of investment into scientific research sometime over the next decade.
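To make the extrapolation logic concrete, here is a minimal sketch of how one might compute when cumulative spending crosses a given threshold, assuming costs grow by a constant percentage each year. It is not the notebook used for the figure. The function name is my own, the 70% growth rate is borrowed from the updated post as an assumption, and the starting cost in the usage example is a placeholder rather than DeepMind’s reported figure.

```python
def years_to_reach(target, first_year_cost, annual_growth):
    """Count the years until cumulative spending reaches `target`.

    Assumes spending starts at `first_year_cost` and grows by the constant
    fraction `annual_growth` each year (0.70 means 70% growth per year).
    All amounts must be in the same units.
    """
    if target <= 0 or first_year_cost <= 0 or annual_growth < 0:
        raise ValueError("target and cost must be positive, growth non-negative")
    cumulative = 0.0
    cost = first_year_cost
    years = 0
    while cumulative < target:
        cumulative += cost          # add this year's spending
        cost *= 1 + annual_growth   # next year's spending
        years += 1
    return years


# Thresholds from the text, in billions of 2018 dollars.
targets = {"LHC": 13.3, "Manhattan Project": 23.0, "Apollo Program": 112.0}

# Placeholder inputs for illustration only (not DeepMind's actual figures).
PLACEHOLDER_FIRST_YEAR_COST = 0.5  # billions; hypothetical
ASSUMED_GROWTH = 0.70              # assumption taken from the updated post

for name, target in targets.items():
    n = years_to_reach(target, PLACEHOLDER_FIRST_YEAR_COST, ASSUMED_GROWTH)
    print(f"{name}: {n} years under these assumptions")
```

Because the series is geometric, the crossing dates are far more sensitive to the growth rate than to the starting cost, which is why a slowdown in investment pushes the dates out by years rather than months.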

NOTE: As a subsidiary of Alphabet, DeepMind’s operating costs would not typically be reported publicly. However, DeepMind is still incorporated in the UK and thus has to file an annual financial report with the government. So, DeepMind’s annual operating costs for the preceding year are made public sometime in October of the following year and made available online.

NOTE: The Jupyter notebook for generating this figure can be found here.

2018 Forecasting Study Results

The initial results from my 2018 forecasting study have been made available. These include results from the survey conducted at ICML, IJCAI and HLAI. The figure shows the aggregate forecasts and 95% confidence bands for each of the transformative AI scenarios.

Interested readers can find the full manuscript here: Forecasting Transformative AI: An Expert Survey.

To date 25 interviews have been conducted with leading AI researchers including Jürgen Schmidhuber, Rich Sutton, Satinder Singh, Peter Stone, Thomas Dietterich and others who chose to remain anonymous. Analysis of the 2018 interviews is forthcoming. Stay tuned for more details.

Affordable GPUs for Deep Learning Training

THE BOTTOM LINE: For personal use do not use AWS, Google Cloud or Azure. It is best to either build your own machine or to rent GPU instances using the 1080 Ti or 2080 Ti from alternate cloud providers.

I’ve written extensively on this blog about building cheap deep learning machines. Initially, last year, I built a $600, basic deep learning machine for practice. Then, earlier this year, I built a small cluster with 8 Nvidia GTX 1080 Tis in order to finish a study (see previous post) on 3D deep learning for cancer detection, which has since been published in the journal JAMIA. My extensive documentation of these builds was intended to offer a roadmap for other students trying to do deep learning research on a budget. More specifically, the intention was to present the argument for building a deep learning machine over using AWS. Jeff Chen has recently done an even better job of presenting this case.

However, the purpose of this post is to present an alternative for cheap deep learning that does not require AWS or building your own machine. The reason for the difference in cost between AWS and building your own machine is simple: AWS uses more expensive GPUs. Technically, the GPUs used by AWS are meant for scientific computing while the cards used to build your own machine are meant for consumers. Practically, the scientific cards don’t have a video port while the consumer cards have multiple video ports.

Nvidia’s scientific cards are sold as part of the Tesla product line. The current generation is the V100 and the previous generation is the P100. The equivalent consumer cards are the RTX 2080 Ti and the GTX 1080 Ti, respectively. The performance of these cards is nearly the same, plus or minus 10%. However, the price of the scientific cards is about 10 times higher. By building our own machine we are able to get a steep discount. Now there are some benefits of the scientific cards, primarily in scalable parallelization. These benefits are useful on very large models and for very large batch sizes. If your research involves these elements, then you’re probably best off using AWS or a similar cloud provider. If not, then you’re probably best off building your own machine, right? No, not necessarily.

Recently I’ve become aware of at least one online cloud provider that offers instances of a variety of Nvidia’s consumer GPUs: Vectordash. Their pricing lists the 1080 Ti at only $0.64/hr. Leader GPU offers weekly rates for 2 cards that approach $0.74/hr. For my research project I built my cluster for $11,000* and I ran it for 6 weeks. GPU instance pricing has changed since the blog post about building that cluster, so below I compare the system to equivalent alternatives using current pricing.

Had I used Google Cloud, which offers the cheapest P100 instances at present, I would have spent approximately $11,773.44. Because I built my own cluster I saved some cash and I still had the hardware when the project was finished. However, had I used Leader GPU, I would have spent roughly $3600.

After the research project completed, I kept the cluster in case I had to run any more cases. Then I kept it for some small jobs until just recently, when I sold it on eBay, recouping $5,000 after fees. Thus, the research project ultimately cost me $6,000. Had I used Leader GPU I could have saved $2,400. Without the option to rent 1080 Ti instances, it was definitely cheaper to build my own cluster. However, renting 1080 Ti instances would have saved me a chunk of cash. If I had to do it again**, I definitely would have rented 1080 Ti instances.

*The cluster cost $11,000 to build in March, but due to diminished demand for GPUs it would cost roughly $1,600 less to build now.

**Even considering today’s cheaper GPUs, I would still save $800 and a substantial amount of my time necessary for building, configuring and maintaining the cluster.
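The comparison above boils down to a few lines of arithmetic, sketched here for anyone who wants to plug in their own numbers. The $1.46 per GPU-hour Google Cloud rate is not quoted in the post; it is an assumption inferred from the $11,773.44 total for 8 GPUs running around the clock for 6 weeks. The Leader GPU total is taken directly from the text rather than recomputed from an hourly rate, and the footnote calculation assumes the same resale value.

```python
# Cost comparison for an 8-GPU, 6-week project, using the figures in the post.
GPUS = 8
WEEKS = 6
GPU_HOURS = GPUS * WEEKS * 7 * 24   # assumes all GPUs run continuously

# Assumption: inferred from the quoted Google Cloud total, not stated in the post.
GCP_P100_USD_PER_GPU_HOUR = 1.46
gcp_cost = GPU_HOURS * GCP_P100_USD_PER_GPU_HOUR

BUILD_COST = 11_000         # cost of building the cluster in March
RESALE_VALUE = 5_000        # recouped on eBay after fees
LEADER_GPU_COST = 3_600     # rough rental total quoted in the post
BUILD_PRICE_DROP = 1_600    # how much cheaper the build would be today

net_build_cost = BUILD_COST - RESALE_VALUE
rental_savings = net_build_cost - LEADER_GPU_COST
# Footnote scenario: cheaper build today, same resale value assumed.
rental_savings_today = (BUILD_COST - BUILD_PRICE_DROP) - RESALE_VALUE - LEADER_GPU_COST

print(f"GPU-hours:                {GPU_HOURS}")
print(f"Google Cloud (P100):      ${gcp_cost:,.2f}")
print(f"Own cluster, net:         ${net_build_cost:,}")
print(f"Saved by renting:         ${rental_savings:,}")
print(f"Saved by renting (today): ${rental_savings_today:,}")
```

The calculation ignores electricity and the time spent building and maintaining the cluster, both of which would tilt the comparison further toward renting.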

1080Ti vs P100

When designing a small deep learning cluster for the university last year I ran into trouble trying to determine whether the P100 or 1080Ti was more powerful (and if so, how much more powerful). Ultimately, I was unable to come to a conclusion, so, I decided to get both so I could find out for myself. This post describes my experience using these cards on a recent project of mine and is a follow-up to the previous post.

Recently I had to revise and resubmit a manuscript I had written for a medical informatics journal on using a 2 stage deep learning system for detecting lung cancer from CT scans. I was using a 3D U-Net and a 3D ResNet, so this required a lot of compute. I was using 2 mini clusters, one at the university and one I had built at home (see previous post). In total there were 15 1080Tis and a single P100. It wound up taking me 6 weeks to train and optimize the 180 models for the 10 separate 9-fold cross-validations I had to conduct.


I compared performance between the two cards for both the 3D U-Net and 3D ResNet. Over the course of running the cases for the revision the 1080Tis consistently outperformed the P100 by roughly 10%. This was not surprising to me and is consistent with another benchmark that was published online after I had designed the cluster. I didn’t really analyze these results, but I will try to add plots for comparison in the future.


I was unable to compare the performance of the P100 and the 1080Tis for half-precision (float16) operations. Based on Nvidia’s literature, I do suspect that the P100 would outperform the 1080Tis by about 30%. This is substantial, but far from what one would expect given the differences in price.

Because I only had 1 P100 I was also unable to compare the two GPUs on parallel performance. The 1080Tis are designed only to be used for task parallelism, i.e. using each GPU independently to train a separate model. This was effective for my project because I had to train 180 separate models, but is limiting for other tasks involving larger networks (note that I was using a batch size of 2 on single GPUs). The P100s are designed for data parallelism (insomuch as they have much more bandwidth with Nvidia’s NVLink interconnect) as well as task parallelism, so this should be easier with them. Of course, we can’t know without testing this, and the results could be surprising as they were in the head-to-head comparison. This is suggested for future work.
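For readers unfamiliar with task parallelism, the idea is simply to hand each GPU its own queue of independent training jobs. The sketch below is illustrative and not the scheduling code I actually used: the function names are made up, and the breakdown of the 180 models into 10 experiments, 9 folds and 2 stages is my assumption about how that total arises. In practice each queue would be run by a separate worker process pinned to one card, for example by setting the CUDA_VISIBLE_DEVICES environment variable before the deep learning framework is loaded.

```python
from itertools import product


def build_jobs(n_experiments=10, n_folds=9, stages=("unet", "resnet")):
    """List every independent training job (assumed: experiments x folds x stages)."""
    return [
        {"experiment": e, "fold": f, "stage": s}
        for e, f, s in product(range(n_experiments), range(n_folds), stages)
    ]


def assign_round_robin(jobs, n_gpus):
    """Distribute jobs across GPUs so queue lengths differ by at most one."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    queues = [[] for _ in range(n_gpus)]
    for i, job in enumerate(jobs):
        queues[i % n_gpus].append(job)
    return queues


jobs = build_jobs()
queues = assign_round_robin(jobs, n_gpus=16)  # 15 1080Tis plus one P100
for gpu_id, queue in enumerate(queues):
    print(f"GPU {gpu_id}: {len(queue)} jobs")
```

Round-robin assignment ignores differences in job length and card speed. A shared work queue that idle GPUs pull from would balance the load better, at the cost of a little more coordination.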

Assuming that the P100 is superior for data parallelism as well as for half-precision operations, one may ask whether the P100 is worth the cost. The answer, of course, depends on the type of research you conduct. For most people, I think the answer would be no. For researchers only interested in half-precision operations I still don’t think the assumed advantage of the P100 would be worth the price tag of 6x the 1080Ti. The only situation in which I envision the P100 being worth the extra cost is if you have datasets that are large enough to merit multiple GPU use, i.e. non-public datasets.


My conclusion is that, without a doubt, 1080Tis perform similarly to P100s for most tasks. Many factors that I didn’t account for contribute to performance, so I’m not confident saying more than that. I was also unable to determine whether this would hold up for data parallelism or for half-precision floats. Regardless, the performance would still be in the same ballpark and my recommendations are still the same: unless you’re working with very large datasets that are typically not publicly available, there’s no need for the features offered by the P100 or any Tesla series card.

Looking to the future, I am curious to see if Nvidia continues to sell consumer grade GPUs that are as fast or faster than the research grade cards. The GTX 1100 series cards are expected toward the end of the summer, and I look forward to testing their performance against the V100.