From Mapping Scientific Software to Automating Science

Last summer, while working on a grant titled Enhancing Scientific Research Productivity with Foundation Models (funded by the Alfred P. Sloan Foundation), I attempted to map the space of scientific software. After spending the summer on this with a research assistant, working with my co-investigator on the grant and his graduate assistant, and talking with numerous others about the topic, it became clear that the task was too ambitious.

For one, scientific software is so broad that it falls into two different classes: specialized scientific software and generalizable scientific software. The former is what we most often associate with scientific software, or at least that seemed to be the case until recently. The latter has traditionally been less commonly associated with scientific software, although it may in fact play a much larger role than is commonly attributed to it.

Think of spreadsheets as a tool for scientific research. Without spreadsheets, a large amount of scientific research in the 80s and 90s might have been slowed. These decision support and data analysis tools let us analyze numeric data of any type, and some have referred to them as ‘the killer app’ given the breadth and significance of their role in advancing progress in so many different disciplines during those decades.

However, as noted, the attempt to map the space of scientific software ultimately proved futile because the scope and ambitions of the objective were too great. Yet this work did lead to some useful insights, which are documented here.


We are now seeing the emergence of different types of generalizable scientific software based on a variety of applications of large language models and foundation models. Examples include applications like Elicit, which utilizes GPT-4 to accelerate literature review; ChatGPT itself could also be thought of as a very powerful piece of generalizable scientific software.

Really, there are few good examples of generalizable scientific software. The electronic spreadsheet, mentioned earlier, can certainly be thought of as one. I know that I still frequently use electronic spreadsheets for quickly viewing new data for exploratory analysis. Electronic spreadsheets are also still very useful for quickly generating visualizations. However, examples of other easy-to-use general purpose scientific software tools do not easily come to mind.

Possibly one could think of MATLAB or Python as generalizable scientific software: both are high-level or scripting programming languages with broad suites of packages or libraries that can be applied to a wide range of scientific applications. Of course, the packages and libraries that are developed for Python and MATLAB users could also be thought of as specialized scientific software, but collectively, they enable much more powerful analysis than the electronic spreadsheet. Similarly, Jupyter notebooks or the data science platform Anaconda might be thought of as generalizable scientific software.

If it is not apparent to readers, the lines between generalizable and specialized scientific software—with a few exceptions—are very blurred. This is why it was unreasonable to try to create a map of scientific software.


What if we were to start thinking about AI scientific software in terms of the dichotomy of specialized and generalizable scientific software? It might be easy to identify examples of specialized AI scientific software like AlphaFold or BioBERT as well as examples of generalizable AI scientific software like Elicit, ChatGPT, or even GPT-4 (i.e., directly, via API access). There are many more examples of each, but we can focus on a subclass of generalizable AI software, that of agentic wrappers for foundation models.

In case it’s not apparent what I mean by agentic wrappers for foundation models, I am talking about software like BabyAGI or AutoGPT. Tools like this are not only generalizable in being able to enhance performance of other generalizable AI software, but they are also able to combine specialized scientific software with generalizable scientific software. 

At a very simplified and fundamental level, science is a process involving hypotheses and experiments. If generalizable AI scientific software like GPT-4 is able to accurately generate hypotheses, and specialized scientific software like AlphaFold is able to conduct experiments, or to program other specialized software that conducts them, then generalizable AI scientific software like BabyAGI or AutoGPT could be used to combine the hypothesis generation with the conducting of experiments.
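The combination described above can be pictured as a simple loop. The sketch below is only an illustration: `research_loop`, `toy_propose` and `toy_experiment` are hypothetical names, and the two toy functions merely stand in for a foundation model and a specialized tool. It does not reflect how BabyAGI or AutoGPT are actually implemented.

```python
from typing import Callable, List, Tuple

# Each history entry pairs a hypothesis with whether the experiment supported it.
History = List[Tuple[str, bool]]


def research_loop(
    propose: Callable[[History], str],
    run_experiment: Callable[[str], bool],
    max_rounds: int = 5,
) -> History:
    """Alternate hypothesis generation and experimentation.

    Stops as soon as one hypothesis is supported or max_rounds is reached.
    """
    history: History = []
    for _ in range(max_rounds):
        hypothesis = propose(history)           # role of the generalizable model
        supported = run_experiment(hypothesis)  # role of the specialized tool
        history.append((hypothesis, supported))
        if supported:
            break
    return history


# Toy stand-ins (assumptions) so the sketch runs without any external service.
def toy_propose(history: History) -> str:
    return f"hypothesis-{len(history) + 1}"


def toy_experiment(hypothesis: str) -> bool:
    # Pretend that only the third hypothesis survives testing.
    return hypothesis.endswith("3")


result = research_loop(toy_propose, toy_experiment)
print(result)
```

In this picture the loop itself plays the part of the agentic wrapper: it only routes each hypothesis to an experiment and feeds the outcome back into the next proposal.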

While AI alone might not be able to generate new major scientific discoveries in the near future, it is plausible to expect that AI will be able to help humans to automate much of the hypothesis and experimentation process. And agentic wrappers for foundation models are not absolutely necessary. Humans can fill the role of these wrappers, so foundation models alone, with an interactive user interface like ChatGPT, are able to act as productivity tools for scientists. 


Based on the reasoning laid out above, I expect the use of GPT-4, and of the even more powerful foundation models to come, to become increasingly common for enhancing scientific productivity, whether through API access or through interactive user interfaces. The degree to which productivity might be enhanced will vary by discipline, sub-discipline, problem class, etc. These powerful new tools might just be useful as tools, or they might be able to entirely automate the scientific process.


This research was funded by the Alfred P. Sloan Foundation as part of the Better Software for Science Program.

Agent57 vs Meena – Part II: An Era of True Progress in AI?

True Progress in AI

When conducting interviews with dozens of the world’s AI experts over the past few years I frequently asked a question about what they thought would be good indicators of true progress in AI. Oftentimes they would reply that the latest milestones which had received substantial media coverage were not signs of true progress in AI, and offered different suggestions for what would constitute true progress in AI.

The previous post explored the recent progress demonstrated by DeepMind’s Agent57 on Atari, which could be seen as an indicator that gaming milestones no longer carry the weight they used to. Perhaps they were a poor signal of progress all along, or perhaps the real signal was just foreshadowing an era of true progress. AI has struggled to overcome the productivity paradox, and it remains to be seen whether the economic value of AI recently heralded by leading firms like McKinsey and PwC will in fact materialize. However, this raises the question: if not games, then what are the indicators of true progress in AI? Furthermore, what new applications of AI can actually bring the economic impacts that these firms have suggested?

Recent Progress in NLP

The past two years have seen tremendous progress in natural language processing (NLP). This began with progress on contextual word embeddings, first with ELMo. Then OpenAI released GPT, the first major transformer-based language model. This was soon followed by BERT, which is now commonly treated as the industry standard. Yet progress did not stop.

The General Language Understanding Evaluation (GLUE) benchmark, which aggregated a number of different benchmarks in an effort to assess natural language understanding (NLU), was first released in late spring 2018. By roughly the same time in 2019 a transformer-based language model had been used to exceed human-level performance on it. However, this had already been anticipated, and in late spring 2019 a new, stickier aggregate benchmark called SuperGLUE was released in an effort to establish a benchmark on which human-level performance would not be so easily surpassed. This did not come to pass: its reign was also short-lived, with T5 closing 95% of the gap by October. Since then, T5 has increased its score on SuperGLUE to 89.3 against a human baseline of 89.8, closing 97.5% of the gap over the BERT baseline.

So, is this progress in NLP indicative of true progress in AI? Perhaps, and perhaps not. What does seem clear about this progress in NLP, and more specifically in NLU, is that, much like milestones in games, the traditional means of benchmarking performance are obsolete. However, in contrast to deep RL and games, some new and innovative techniques have been developed for assessing progress in NLU.

One such approach just released this month addresses the “fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today’s benchmarks for language understanding.” This group of leading researchers from the University of Washington and the Allen Institute for AI proposes a TuringAdvice challenge, which involves assessing language understanding by eliciting advice from systems on complex real-world situations. To make this concrete, they released RedditAdvice, an open dataset for evaluating natural language advice in real-world scenarios. T5 is only able to give advice equally or more valuable than humans 9% of the time, which makes clear that there is indeed a long, long way to go to realize true NLU in machine learning.

Meena, Google’s “Human-like” Chatbot

Do the poor scores from T5 on RedditAdvice mean that the current systems have not made true progress toward NLU? Machine learning researchers are all aware of an informal notion dubbed “Goodhart’s law” which is understood to suggest that when a measure becomes a target, it ceases to be a good measure. This may have been the case for GLUE and SuperGLUE – that when systems begin trying to optimize for the specific benchmarks the benchmarks become targets rather than measures representative of NLU. The same may be true for the TuringAdvice challenge and the RedditAdvice dataset; only time will tell. Regardless, it should be clear by now that this tendency poses numerous challenges for assessing AI progress.

So, what about Google’s human-like Meena chatbot? Human-like is a strong description for Google to use in reference to their system. To be certain, the hype surrounding chatbots has not always ended well for tech giants. Why should we expect different from Meena? Well, for one, Meena uses a transformer-based language model, which involves the same fundamental deep learning architecture as the other Google language models (BERT and T5) which have driven NLU progress in the past two years. Moreover, Meena also utilizes a novel technique for evaluation that required crowd-sourcing humans to evaluate the quality of the system’s responses, in a fashion that resembles that first proposed by Turing in 1950.

While far from passing a Turing test, the performance from Meena is respectable. Perhaps the conversational topics are not as complex as those included in the RedditAdvice dataset, but this doesn’t mean that true progress has not been made. In fact, a large amount of communication that is practical in organizations concerns very narrow topics which do not require the broad knowledge necessary for general conversational agents or for giving real world advice. In call centers, operators typically respond to a very narrow set of concerns from customers, and call center routing has already been automated without even using AI. Because of the potential of models like T5 for transfer learning, and fine-tuning to increase performance on narrow tasks using proprietary datasets, it is likely that the progress demonstrated by Meena, while short of the robustness necessary for real world applications, is well suited for common, rote natural language tasks encountered by organizations.

Agent57 vs Meena – Part I: The Death of Gaming Milestones?


A few weeks ago DeepMind released a preprint on arXiv describing a new deep RL agent that has successfully surpassed the performance of a single human professional game tester on a sample of Atari games which included all of the games that had previously posed the greatest challenges to other deep RL agents. Since the preprint release this project has received a lot of attention, including from AI Impacts, who have initially concluded that the achievement of this milestone resolves a forecast from their 2016 survey significantly earlier than experts had predicted. However, based on Dafoe’s desiderata for AI forecasting targets, this target was significantly flawed.

I will not delve into the forecast’s flaws here because what is more important is that the achievement itself appears to be no more than incremental progress. Last decade, milestones on games made for impressive headlines and brought a lot of attention to DeepMind. However, the attention Agent57 received appears to have been substantially less than DeepMind’s major milestones on games received last decade. Perhaps the media is rightfully weary of DeepMind’s handwavy progress on the latest milestones. For example, last year’s major milestone on StarCraft II involved outperforming 99.85% of amateur human players, which, to be certain, is impressive. However, this is less impressive than the superhuman performance of AlphaGo in 2015 or the follow-up in 2017 of a vastly superior system which did not even require any human training data. Perhaps the most significant evidence is the fact that the 8-page paper with 20 pages of appendices appeared on arXiv and not in Nature, unlike the three examples from last decade cited here, which all appeared in Nature. In the past Atari performance has been suitable for Nature, but not anymore.

To be clear, I am by no means trying to diminish the value of this work. The fact that without any domain knowledge a single algorithm can perform better than an above-average yet less-than-expert human on games like Pitfall and Montezuma’s Revenge is itself impressive. Doing so whilst achieving the performance demonstrated on so many other Atari games is even more impressive. However, 16 months after Go-Explore demonstrated the first significant progress on these two games, and given the pace of AI progress that we are now accustomed to, this could even be thought to have been foreseeable. The problem with AI forecasting is that there are so many moving targets to forecast, and it is difficult for those who work on this full-time to determine where resources should be best allocated. In the case of games, ironically the answer may simply be that their forecasts are only worth the time for publicity purposes.

DeepMind’s Value to Google

While DeepMind is certainly not declaring a positive net income, it remains unclear how Google values DeepMind’s contributions internally. Considering that the vast majority of Google’s revenue comes from products hosted in data centers, it seems that the fiscal benefits of DeepMind’s work to reduce Google’s data centers’ energy use by as much as 40% may not be adequately represented in DeepMind’s financial statements.

After all, a data center is practically just a collection of thousands of electric heaters. These heaters also require dedicated cooling units to ensure they do not overheat. In 2016, Gartner estimated that Google had over 2.5 million servers. Assuming that each server is a 300W heater operating at a mean 2/3 capacity, and that it takes an equivalent amount of power to cool, we can easily estimate that Google uses 1 gigawatt on average. At roughly $0.12 per kWh, this suggests that powering their data centers costs $120k per hour. This may be a gross underestimate of current costs for numerous reasons that I won’t go into, but for a concrete example consider that in 2016 Google purchased the entire 62 megawatt output of a nearby wind farm for a single data center in the Netherlands (Google has since quadrupled their investment in this data center, one of roughly fifteen). Regardless, at $120k per hour this would place Google’s annual data center energy costs at $1.05 billion. Given this estimate, based on 2016 numbers, DeepMind’s contributions to reduce Google’s energy costs could be as much as $400 million per year.
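The back-of-envelope arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the figures and assumptions stated in the paragraph; the constant and function names are illustrative, and applying the 40% reduction to total energy use follows the post's own framing.

```python
# Back-of-envelope estimate of Google's data center energy cost,
# using only the figures and assumptions stated in the post.

SERVERS = 2_500_000        # Gartner's 2016 estimate cited in the post
WATTS_PER_SERVER = 300     # assumed power draw of one server at full capacity
UTILIZATION = 2 / 3        # assumed mean fraction of full capacity
COOLING_MULTIPLIER = 2     # assumes cooling draws as much power as the servers
PRICE_PER_KWH = 0.12       # assumed electricity price in USD
HOURS_PER_YEAR = 24 * 365
SAVINGS_FRACTION = 0.40    # "as much as 40%", applied here to total energy use


def estimate_energy_costs() -> dict:
    """Return average power draw and energy costs under the assumptions above."""
    server_power_kw = SERVERS * WATTS_PER_SERVER * UTILIZATION / 1000
    total_power_kw = server_power_kw * COOLING_MULTIPLIER
    cost_per_hour = total_power_kw * PRICE_PER_KWH
    cost_per_year = cost_per_hour * HOURS_PER_YEAR
    return {
        "total_power_gw": total_power_kw / 1_000_000,
        "cost_per_hour": cost_per_hour,
        "cost_per_year": cost_per_year,
        "max_annual_savings": cost_per_year * SAVINGS_FRACTION,
    }


estimate = estimate_energy_costs()
for name, value in estimate.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the sketch gives about 1 GW of average draw, about $120k per hour, roughly $1.05 billion per year, and 40% of that is about $420 million, in line with the rounded "as much as $400 million" figure.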

Given these numbers and the figures from the previous blog post, it appears that Google’s ROI from its acquisition of DeepMind isn’t reflected clearly in the UK financial documents for the latter.

2018 Trends in DeepMind Operating Costs (updated)

DeepMind’s 2018 financial data was made available this week. If you haven’t seen the previous DeepMind operating costs post, check it out for some context. Operating costs did not increase in line with the previous extrapolation; however, there was still a 70% increase. I’ve updated the extrapolation and plot, and it has not changed greatly from the previous year’s model. It is depicted in Figure 1 below.

Figure 1: Trends in DeepMind operating costs updated with 2018 data.

The new data also indicates that DeepMind has accumulated a large amount of debt, most of which is owed to its parent company Alphabet; I’m looking into this further. However, DeepMind’s contract revenue also continues to increase. In fact, prior to 2016, DeepMind had no revenue. In the three years since, turnover has increased to £102.8M. The data for this is shown in Figure 2 below. If revenue continues to increase at this rate, and assuming negligible interest on the debt from Alphabet, DeepMind would be able to support the current growth if they implemented a 20-month hiring and budget freeze starting on January 1st, 2019. This would suggest that they could resume growing at the rate depicted here roughly one year from now, in September of 2020.

Figure 2: Trends in DeepMind operating costs compared to trends in DeepMind revenue growth.

Financial matters aside, another consideration is that realistically there doesn’t seem to be enough talent to continue to fuel the expansion at the current rate. Perhaps, if a hiring freeze effectively adjusted the operating costs growth curve to fall in line with the revenue trendline, it would be more plausible that the talent of the caliber necessary for a large, well-funded brain trust could be adequately trained and experienced in order to sustain the growth.

The upshot from this seems to be that DeepMind’s financial data still appears to be a strong indicator worth monitoring for projecting progress toward transformative AI.

(Note: Further analysis is still needed, e.g. adding the projected dates for reaching the funding levels of large-scale government-funded projects given DeepMind’s income. Also, the current Jupyter notebook can be found here. Also see the follow-up post on DeepMind’s value to Google.)

Scenario Network Mapping Development Workshop

As part of my dissertation research I’ve been developing a workshopping technique for mapping the paths to AGI based on scenario network mapping. Scenario network mapping is a scenario planning technique that is intended to overcome challenges in complex strategic planning situations by incorporating large numbers of scenarios that each form a particular pathway of plausible possible futures. While suggested for mapping technology development, it had not previously been used for this task. Thus, I conducted a workshop to develop the method for the purposes of mapping the paths to AGI while at AI Safety Camp 3 in Spain last month.

The process is intended to be conducted through four half-day workshops held on different days, up to a week apart. For the purpose of this development workshop we simply conducted two two-hour workshops a couple of days apart. Despite the severe shortening of the process, the workshop was still a success. While we did not have time to complete the mapping, we were able to demonstrate a proof of concept and learn many valuable lessons regarding workshop design and facilitation that will be applied in future versions of the workshop. A paper proposing the technique and outlining the workshop process has been accepted for presentation at the 12th annual AGI conference in August.

Scenario network mapping is a visual and tactile technique that relies on a number of office supply products to create a colorful map of scenarios, or, in this case, paths to AGI. The process involves colored ribbon, different sizes of blank paper, different sizes and colors of sticky notes, masking tape, colored thin-tipped markers, etc. It is intended to be conducted in a single room with a large amount of wall space that is also big enough for splitting into breakout groups. The demonstrative workshop only used five people instead of the suggested 15-20, but more participants will be used in the next iteration.

The development workshop was very successful in demonstrating the viability of the adapted scenario network mapping technique. Many valuable lessons were also learned that will be applied to future workshops. Four out of five of the participants found the experience valuable and insightful. I’m looking forward to continuing the development and getting closer to complete maps of the plausible paths to AGI!


AI Forecasting Dinner

I have been working with Katja Grace (AI Impacts, FHI, MIRI) to organize a dinner for people interested in AI forecasting. This dinner will be held in San Francisco during EA Global on June 22nd at 6:30pm. We have rented a private venue and will be having pizza. The venue is a ~12 minute walk from the Westfield Center in San Francisco.

This will be an excellent opportunity for people interested in and working on AI forecasting to meet and chat. I’m looking forward to meeting new people and discussing current work with familiar faces. See you Saturday!

Trends in DeepMind Operating Costs

(Note: an updated version of this post has been created incorporating the 2018 data from DeepMind.)

Google DeepMind is frequently considered to be the world’s leading AI research lab. Alphabet, DeepMind’s parent company, has been investing heavily in DeepMind over recent years. When DeepMind’s operating costs are extrapolated, the trendline indicates that the cumulative investment into DeepMind could approach the levels of historic large-scale national and multinational efforts in the coming years. A plot depicting this extrapolation is shown below.

A plot of DeepMind operating costs extrapolated to 2026. Google’s operating income is also extrapolated here to demonstrate the feasibility of such a sustained level of investment.

The construction of the Large Hadron Collider and the discovery of the Higgs boson, the Manhattan Project, and the Apollo program represent some of the crowning achievements of science and engineering over the past century. Inflation-adjusted estimates for their costs are $13.3B, $23B and $112B in 2018 dollars, respectively. If Alphabet continues to increase investment into DeepMind at the current rate, then the cumulative investment would reach the level of the Apollo program in September 2024.
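For readers who want to experiment with this kind of projection, the sketch below shows one way it could be done. It assumes an exponential trend fitted by least squares on the logarithm of yearly costs, which is an assumption and not necessarily the model used for the figure. The sample inputs are placeholders in arbitrary units, not DeepMind's filings, and the function names are hypothetical. It resolves only to a calendar year, not to a month.

```python
import math


def fit_exponential(years, costs):
    """Least-squares fit of log(cost) = a + b * year; returns (a, b)."""
    n = len(years)
    logs = [math.log(c) for c in costs]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def year_cumulative_reaches(years, costs, target, max_year=2100):
    """First year in which observed plus extrapolated cumulative cost >= target.

    Returns None if the target is not reached by max_year.
    """
    a, b = fit_exponential(years, costs)
    total = sum(costs)
    year = years[-1]
    while total < target and year < max_year:
        year += 1
        total += math.exp(a + b * year)  # extrapolated cost for that year
    return year if total >= target else None


# Placeholder data (NOT real figures): costs that double every year.
sample_years = [2015, 2016, 2017, 2018]
sample_costs = [100.0, 200.0, 400.0, 800.0]
crossing_year = year_cumulative_reaches(sample_years, sample_costs, target=10_000.0)
print(crossing_year)
```

Replacing the placeholder lists with the yearly operating costs from the annual filings, and the target with a figure such as the inflation-adjusted Apollo cost, would give a comparable year-level estimate under the exponential assumption.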

Of particular interest is that it is economically feasible for this trend to continue long enough to reach the levels of investment of these historic scientific feats (Alphabet will generate enough income over the next six years to sustain the required investment). Given that Alphabet does not pay a dividend and that it ended 2018 with more than $109B in cash, this is not only feasible but also plausible. Even if the rate of investment slows, DeepMind could still be reaching these unprecedented levels of investment into scientific research sometime over the next decade.

NOTE: As DeepMind is a subsidiary of Alphabet, its operating costs would not typically be reported publicly. However, DeepMind is still incorporated in the UK and thus has to file an annual financial report with the government. So, DeepMind’s annual operating costs for the preceding year are made public and available online sometime in October of the following year.

NOTE: The Jupyter notebook for generating this figure can be found here.

2018 Forecasting Study Results

The initial results from my 2018 forecasting study have been made available. These include results from the survey conducted at ICML, IJCAI and HLAI. The figure shows the aggregate forecasts and 95% confidence bands for each of the transformative AI scenarios.

Interested readers can find the full manuscript here: Forecasting Transformative AI: An Expert Survey.

To date 25 interviews have been conducted with leading AI researchers including Jürgen Schmidhuber, Rich Sutton, Satinder Singh, Peter Stone, Thomas Dietterich and others who chose to remain anonymous. Analysis of the 2018 interviews is forthcoming. Stay tuned for more details.

Affordable GPUs for Deep Learning Training

THE BOTTOM LINE: For personal use do not use AWS, Google Cloud or Azure. It is best to either build your own machine or to rent GPU instances using the 1080 Ti or 2080 Ti from alternate cloud providers.

I’ve written extensively on this blog about building cheap deep learning machines. Initially, last year, I built a $600, basic deep learning machine for practice. Then, earlier this year, I built a small cluster with 8 Nvidia GTX 1080 Tis in order to finish a study on 3D deep learning for cancer detection (see previous post), which has since been published in the journal JAMIA. My extensive documentation of these builds was intended to offer a roadmap for other students trying to do deep learning research on a budget. More specifically, the intention was to present the argument for building a deep learning machine over using AWS. Jeff Chen has recently done an even better job of presenting this case.

However, the purpose of this post is to present an alternative for cheap deep learning that does not require AWS or building your own machine. The reason for the difference in cost between AWS and building your own machine is simple: AWS uses more expensive GPUs. Technically, the GPUs used by AWS are meant for scientific computing while the cards used to build your own machine are meant for consumers. Practically, the scientific cards don’t have a video port while the consumer cards have multiple video ports.

Nvidia’s scientific cards are sold as part of the Tesla product line. The current generation is the V100 and the previous generation is the P100. The equivalent consumer cards are the RTX 2080 Ti and the GTX 1080 Ti, respectively. The performance of these cards is nearly the same, plus or minus 10%. However, the price of the scientific cards is about 10 times higher. By building our own machine we are able to get a steep discount. Now there are some benefits of the scientific cards, primarily in scalable parallelization. These benefits are useful on very large models and for very large batch sizes. If your research involves these elements, then you’re probably best off using AWS or a similar cloud provider. If not, then you’re probably best off building your own machine, right? No, not necessarily.

Recently I’ve become aware of at least one online cloud provider that offers instances of a variety of Nvidia’s consumer GPUs: Vectordash. There, the 1080 Ti is only $0.64/hr. Leader GPU offers weekly rates for 2 cards that approach $0.74/hr. For my research project I built my cluster for $11,000* and I ran it for 6 weeks. GPU instance pricing has changed since the blog post about building that cluster, so below I compare the system to equivalent alternatives using current pricing.

Had I used Google Cloud, which offers the cheapest P100 instances at present, I would have spent approximately $11,773.44. Because I built my own cluster I saved some cash and I still had the hardware when the project was finished. However, had I used Leader GPU, I would have spent roughly $3600.

After the research project was completed, I kept the cluster in case I had to run any more cases. Then I kept it for some small jobs until just recently, when I sold it on eBay, recouping $5,000 after fees. Thus, the research project ultimately cost me $6,000. Had I used Leader GPU I could have saved $2,400. Without the option to rent 1080 Ti instances, it was definitely cheaper to build my own cluster. However, renting 1080 Ti instances would have saved me a chunk of cash. If I had to do it again**, I would definitely rent 1080 Ti instances.

*The cluster cost $11,000 to build in March, but due to diminished demand for GPUs it would cost roughly $1,600 less to build now.

**Even considering today’s cheaper GPUs, I would still save $800 and a substantial amount of my time necessary for building, configuring and maintaining the cluster.
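The comparison above can be checked with a short calculation. In this sketch the P100 rate of $1.46 per GPU-hour is not stated in the post; it is back-calculated from the $11,773.44 total for 8 GPUs over 6 weeks. The Leader GPU total is taken directly from the post rather than derived from its hourly rate, and the footnote case assumes the same $5,000 resale value. Variable names are illustrative.

```python
# Figures taken from the post
GPUS = 8
WEEKS = 6
BUILD_COST = 11_000         # cluster build cost in March (USD)
BUILD_COST_DROP = 1_600     # how much cheaper the build would be now
RESALE_VALUE = 5_000        # recouped on eBay after fees
LEADER_GPU_COST = 3_600     # rough total quoted for renting from Leader GPU

# Assumption: hourly P100 rate inferred from the $11,773.44 total, not stated in the post
P100_RATE_PER_GPU_HOUR = 1.46

hours = WEEKS * 7 * 24                  # wall-clock hours the cluster ran
gpu_hours = GPUS * hours                # total GPU-hours consumed
google_cloud_cost = gpu_hours * P100_RATE_PER_GPU_HOUR


def net_build_cost(build_cost: float, resale_value: float) -> float:
    """Out-of-pocket cost of building once the hardware has been resold."""
    return build_cost - resale_value


build_net = net_build_cost(BUILD_COST, RESALE_VALUE)
# Footnote case: cheaper GPUs today, assuming the same resale value
build_net_today = net_build_cost(BUILD_COST - BUILD_COST_DROP, RESALE_VALUE)

print(f"GPU-hours: {gpu_hours}")
print(f"Google Cloud (P100): ${google_cloud_cost:,.2f}")
print(f"Build, net of resale: ${build_net:,.2f}")
print(f"Saved by renting instead: ${build_net - LEADER_GPU_COST:,.2f}")
print(f"Saved by renting, at today's build cost: ${build_net_today - LEADER_GPU_COST:,.2f}")
```

Swapping in your own run length, GPU count and current hourly rates makes it easy to redo the build-versus-rent comparison for a different project.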