True Progress in AI
Over the past few years, while interviewing dozens of the world’s leading AI experts, I frequently asked what they thought would be good indicators of true progress in AI. Oftentimes they replied that the latest milestones receiving substantial media coverage were not such signs, and offered different suggestions for what true progress would look like.
The previous post explored the recent progress demonstrated by DeepMind’s Agent57 on Atari, which could be seen as an indicator that gaming milestones no longer carry the weight they used to. Perhaps they were a poor signal of progress all along, or perhaps they were merely foreshadowing an era of true progress. AI has struggled to overcome the productivity paradox, and it remains to be seen whether the economic value of AI recently heralded by leading firms like McKinsey and PwC will in fact materialize. This raises the question: if not games, then what are the indicators of true progress in AI? And what new applications of AI could actually deliver the economic impacts these firms have projected?
Recent Progress in NLP
The past two years have seen tremendous progress in natural language processing (NLP). This began with contextual word embeddings, first with ELMo. Then OpenAI released GPT, the first major transformer-based language model. It was soon followed by BERT, which is now commonly treated as the industry standard. Yet progress did not stop there.
The General Language Understanding Evaluation (GLUE), which aggregated a number of different benchmarks in an effort to assess natural language understanding (NLU), was first released in late spring 2018. By roughly the same time in 2019, a transformer-based language model had exceeded human-level performance on it. This had been anticipated, however, and in late spring 2019 a new, stickier aggregate benchmark called SuperGLUE was released in an effort to establish a target that would not be so easily surpassed. Its reign was also short lived: T5 had closed 95% of the gap to human performance by October. Since then, T5 has raised its SuperGLUE score to 89.3 against a human baseline of 89.8, closing roughly 97.5% of the gap between the BERT baseline and human performance.
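As a quick sanity check on those numbers, the "share of the gap closed" figure can be computed directly. The scores below are approximate published SuperGLUE leaderboard figures (BERT baseline around 69.0, human baseline 89.8, T5 at 89.3); treat them as illustrative:

```python
# Illustrative arithmetic for "share of the gap closed" on SuperGLUE.
# Scores are approximate leaderboard figures at the time of writing.
def gap_closed(baseline: float, human: float, model: float) -> float:
    """Fraction of the baseline-to-human gap closed by a model's score."""
    return (model - baseline) / (human - baseline)

bert, human, t5 = 69.0, 89.8, 89.3
print(f"T5 closes {gap_closed(bert, human, t5):.1%} of the gap")  # ~97.6%
```

By this measure the remaining headroom on SuperGLUE is only about half a point, which is what makes the benchmark effectively saturated.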
So, is this progress in NLP indicative of true progress in AI? Perhaps, and perhaps not. What does seem clear about this progress in NLP, and more specifically in NLU, is that, much like milestones in games, the traditional means of benchmarking performance quickly become obsolete. However, in contrast to deep RL and games, some new and innovative techniques have been developed for assessing progress in NLU.
One such approach, released just this month, addresses the “fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today’s benchmarks for language understanding.” This group of leading researchers from the University of Washington and the Allen Institute for AI proposes the TuringAdvice challenge, which assesses language understanding by eliciting advice from systems on complex real-world situations. To make this concrete, they released RedditAdvice, an open dataset for evaluating natural language advice in real-world scenarios. T5 produces advice judged at least as helpful as human-written advice only 9% of the time, which makes clear that there is indeed a long, long way to go to realize true NLU in machine learning.
Meena, Google’s “Human-like” Chatbot
Do the poor scores from T5 on RedditAdvice mean that current systems have not made true progress toward NLU? Machine learning researchers are well aware of the informal notion dubbed “Goodhart’s law,” commonly summarized as: when a measure becomes a target, it ceases to be a good measure. This may have been the case for GLUE and SuperGLUE – once systems began optimizing for the specific benchmarks, the benchmarks became targets rather than measures representative of NLU. The same may prove true for the TuringAdvice challenge and the RedditAdvice dataset; only time will tell. Regardless, it should be clear by now that this tendency poses numerous challenges for assessing AI progress.
So, what about Google’s human-like Meena chatbot? Human-like is a strong description for Google to use in reference to their system. To be sure, the hype surrounding chatbots has not always ended well for tech giants. Why should we expect anything different from Meena? Well, for one, Meena uses a transformer-based language model, the same fundamental deep learning architecture as the other Google language models (BERT and T5) that have driven NLU progress over the past two years. Moreover, Meena also utilizes a novel evaluation technique that relies on crowd-sourced human judgments of the quality of the system’s responses, in a fashion resembling the evaluation first proposed by Turing in 1950.
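Meena’s crowd-sourced metric, the Sensibleness and Specificity Average (SSA), has workers label each response as sensible (yes/no) and specific (yes/no), then averages the two per-label rates. A minimal sketch of that aggregation (the judgment data below is invented for illustration):

```python
# Minimal sketch of an SSA-style aggregation: crowd workers label each
# chatbot response as sensible (0/1) and specific (0/1), and the final
# score averages the two rates. The judgments below are invented.
def ssa(labels):
    """labels: list of (sensible, specific) binary judgments."""
    n = len(labels)
    sensibleness = sum(s for s, _ in labels) / n
    specificity = sum(p for _, p in labels) / n
    return (sensibleness + specificity) / 2

judgments = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 1)]
print(round(ssa(judgments), 3))  # sensibleness 0.8, specificity 0.6 -> 0.7
```

The appeal of a metric like this is that it measures conversational quality directly through human judgment rather than through a fixed benchmark that systems can overfit.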
While far from passing a Turing test, Meena’s performance is respectable. Perhaps the conversational topics are not as complex as those in the RedditAdvice dataset, but this does not mean that true progress has not been made. In fact, much of the practical communication within organizations concerns very narrow topics that do not require the broad knowledge necessary for general conversational agents or for giving real-world advice. In call centers, operators typically respond to a very narrow set of concerns from customers, and call-center routing has already been automated without even using AI. Because models like T5 can be fine-tuned on proprietary datasets to increase performance on narrow tasks, the progress demonstrated by Meena, while short of the robustness necessary for open-ended real-world applications, is likely well suited for common, rote natural language tasks encountered by organizations.
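To make the narrow-task point concrete, here is a toy sketch of the kind of rule-based call routing that predates modern NLP entirely. The queues and keywords are invented for illustration; real systems are more elaborate, but the principle is the same:

```python
# Toy keyword-based call routing: the kind of narrow, non-AI automation
# referenced above. Queue names and keywords are invented examples.
ROUTES = {
    "billing": ["invoice", "charge", "refund", "payment"],
    "technical": ["error", "crash", "password", "login"],
}

def route(message: str, default: str = "general") -> str:
    """Return the first queue whose keyword appears in the message."""
    text = message.lower()
    for queue, keywords in ROUTES.items():
        if any(word in text for word in keywords):
            return queue
    return default

print(route("I was double charged on my last invoice"))  # billing
print(route("My cat walked across the keyboard"))        # general
```

The gap between this and a fine-tuned language model is exactly where systems like T5 or Meena could add value: handling the phrasings a fixed keyword list misses, while still operating within a narrow, well-defined task.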