
SQ2. What are the most important advances in AI?


In the last five years, the field of AI has made major progress in almost all its standard sub-areas, including vision, speech recognition and generation, natural language processing (understanding and generation), image and video generation, multi-agent systems, planning, decision-making, and integration of vision and motor control for robotics. In addition, breakthrough applications emerged in a variety of domains including games, medical diagnosis, logistics systems, autonomous driving, language translation, and interactive personal assistance. The sections that follow provide examples of many salient developments.

Underlying Technologies

People are using AI more today to dictate to their phones, get recommendations for shopping, news, or entertainment, enhance their backgrounds on conference calls, and much more. The core technology behind the most visible advances is machine learning, especially deep learning (including generative adversarial networks, or GANs) and reinforcement learning, powered by large-scale data and computing resources. GANs are a major breakthrough, endowing deep networks with the ability to produce artificial content, such as fake images, that passes for the real thing. GANs consist of two interlocked components—a generator, responsible for creating realistic content, and a discriminator, tasked with distinguishing the output of the generator from naturally occurring content. The two learn from each other, becoming better and better at their respective tasks over time. One practical application is GAN-based medical-image augmentation, in which artificial images are produced automatically to expand the data set used to train networks for producing diagnoses1. Recognition of the remarkable power of deep learning has been growing steadily over the last decade. Recent studies have begun to uncover why and under what conditions deep learning works well2. In the past ten years, machine-learning technologies have moved from the academic realm into the real world in a multitude of ways that are both promising and concerning.
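The generator-discriminator loop described above can be illustrated with a deliberately tiny sketch: a one-dimensional "GAN" in which each component is a single affine unit trained with hand-derived gradients. Everything here (the data distribution, learning rate, and step count) is illustrative; real GANs use deep networks and automatic differentiation, but the adversarial dynamic is the same.

```python
# Toy 1-D GAN: generator and discriminator are each one affine unit,
# trained against each other with hand-derived gradients (stdlib only).
import math
import random

random.seed(0)

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# "Real" data: samples from a Gaussian centered at 4.0.
def real_sample():
    return random.gauss(4.0, 0.5)

# Generator: x_fake = wg * z + bg, with noise z ~ N(0, 1).
# Discriminator: D(x) = sigmoid(wd * x + bd), probability x is real.
wg, bg = 1.0, 0.0
wd, bd = 0.1, 0.0
lr = 0.02

for step in range(4000):
    z = random.gauss(0.0, 1.0)
    xr = real_sample()
    xf = wg * z + bg

    # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
    pr = sigmoid(wd * xr + bd)
    pf = sigmoid(wd * xf + bd)
    gr = pr - 1.0               # gradient of loss w.r.t. the real logit
    gf = pf                     # gradient of loss w.r.t. the fake logit
    wd -= lr * (gr * xr + gf * xf)
    bd -= lr * (gr + gf)

    # Generator step: minimize -log D(fake), the non-saturating loss.
    pf = sigmoid(wd * xf + bd)
    g = (pf - 1.0) * wd         # gradient of generator loss w.r.t. x_fake
    wg -= lr * g * z
    bg -= lr * g

print(f"generator mean after training: {bg:.2f} (real mean is 4.0)")
```

Because the noise has zero mean, the generator's bias term `bg` is the mean of its fake samples; adversarial training pulls it toward the real data's mean even though the generator never sees a real sample directly.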

Language Processing

Language processing technology made a major leap in the last five years, leading to the development of network architectures with enhanced capability to learn from complex and context-sensitive data. These advances have been supported by ever-increasing data resources and computing power.

Of particular note are neural network language models, including ELMo, GPT, mT5, and BERT.3 These models learn about how words are used in context—including elements of grammar, meaning, and basic facts about the world—from sifting through the patterns in naturally occurring text. They consist of billions of tunable parameters and are engineered to be able to process unprecedented quantities of data (over one trillion words for GPT-3, for example). By stringing together likely sequences of words, several of these models can generate passages of text that are often indistinguishable from human-generated text, including news stories, poems, fiction, and even computer code. Performance on question-answering benchmarks (large quizzes with questions like “Where was Beyoncé born?”) has reached superhuman levels,4 although the models that achieve this level of proficiency exploit spurious correlations in the benchmarks and exhibit a level of competence on naturally occurring questions that is still well below that of human beings.
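The idea of "stringing together likely sequences of words" can be shown at toy scale with a bigram model over a made-up corpus. Real language models replace the frequency counts below with billions of learned parameters, but the generation loop—predict a likely next token, append it, repeat—is the same.

```python
# Toy bigram language model: count word-pair frequencies in a tiny
# hypothetical corpus, then generate text by repeatedly sampling a
# likely next word given the current one.
import random
from collections import Counter, defaultdict

random.seed(1)

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat saw the dog . the dog saw the cat ."
).split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def generate(start, length=8):
    words = [start]
    for _ in range(length):
        followers = bigrams[words[-1]]
        if not followers:
            break
        # Sample the next word in proportion to observed frequency.
        choices, weights = zip(*followers.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))
```

Even this tiny model produces locally plausible word order; the leap to fluent paragraphs comes from conditioning on far longer contexts with far richer statistics, which is what transformer architectures make tractable.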

These models’ facility with language is already supporting applications such as machine translation, text classification, speech recognition, writing aids, and chatbots. Future applications could include improving human-AI interactions across diverse languages and situations. Current challenges include how to obtain quality data for languages used by smaller populations, and how to detect and remove biases in their behavior. In addition, it is worth noting that the models themselves do not exhibit deep understanding of the texts that they process, fundamentally limiting their utility in many sensitive applications. Part of the art of using these models, to date, is in finding scenarios where their incomplete mastery still provides value.

Related to language processing is the tremendous growth in conversational interfaces over the past five years. The near ubiquity of voice-control systems like Google Assistant, Siri, and Alexa is a consequence of both improvements on the voice-recognition side, powered by the AI advances discussed above, and also improvements in how information is organized and integrated for voice-based delivery. Google Duplex, a conversational interface that can call businesses to make restaurant reservations and appointments, was rolled out in 2018 and received mixed initial reviews due to its impressive engineering but off-putting system design.5

Computer Vision and Image Processing

Image-processing technology is now widespread, finding uses ranging from video-conference backgrounds to the photo-realistic images known as deepfakes. Many image-processing approaches use deep learning for recognition, classification, conversion, and other tasks. Training time for image processing has been substantially reduced. Programs running on ImageNet, a massive standardized collection of over 14 million photographs used to train and test visual identification programs, complete their work 100 times faster than just three years ago.6

Real-time object-detection systems such as YOLO (You Only Look Once) that notice important objects when they appear in an image are widely used for video surveillance of crowds and are important for mobile robots including self-driving cars. Face-recognition technology has also improved significantly over the last five years, and now some smartphones and even office buildings rely on it to control access. In China, facial-recognition technology is used widely in society, from security to payment, although there are very recent moves to pull back on the broad deployment of this technology.7 Of course, while facial-recognition technology can be a powerful tool to improve efficiency and safety, it raises issues around bias and privacy. Some companies have suspended providing face-recognition services. And, in fact, the creator of YOLO has said that he no longer works on the technology because “the military applications and privacy concerns became impossible to ignore.”8

It is now possible to generate photorealistic images and even videos using GANs. Sophisticated image-processing systems enhanced by deep learning let users seamlessly replace existing images with new ones, such as inserting someone into a video of an event they did not attend. While such modifications could be carried out by skilled artists decades ago, AI automation has substantially lowered the barriers. These so-called deepfakes are being used in illicit activity such as “revenge porn,” in which an attacker creates artificial sexual content featuring a specific victim, and identity theft, in which a profile of a non-existent person is generated and used to gain access to services, and have spurred research into improving automatic detection of deepfake images.

Caption: The GAN technology for generating images and the transformer technology for producing text can be integrated in various ways. These images were produced by OpenAI’s “DALL-E” given the prompt: “a stained glass window with an image of a blue strawberry.” A similar query to a web-based image search produces blue strawberries, blue stained-glass windows, or stained-glass windows with red strawberries, suggesting that the system is not merely retrieving relevant images but producing novel combinations of visual features.


Games

Developing algorithms for games and simulations in adversarial situations has long been a fertile training ground and a showcase for the advancement of AI techniques. DeepMind’s application of deep networks to Atari video games and the game of Go around 2015 helped bring deep learning to wide public prominence, and the last five years have seen significant additional progress. AI agents have now out-maneuvered their human counterparts in combat and multiplayer situations including the games StarCraft II,9 Quake III,10 and Alpha Dogfight11—a US Defense Department-sponsored jet-fighter simulation—as well as classical games like poker.12

The DeepMind team that developed AlphaGo went on to create AlphaGo Zero,13 which discarded the use of direct human guidance in the form of a large collection of data from past Go matches. Instead, it developed moves and tactics on its own, starting from scratch. This idea was further extended with AlphaZero,14 a single network architecture that could learn to play expert-level Go, shogi, or chess.
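The self-play principle—learning tactics with no human game data at all—can be illustrated with tabular Q-learning on the toy game Nim. This is a sketch of the principle only: AlphaZero's actual method combines deep networks with Monte Carlo tree search, and the game, hyperparameters, and update rule below are illustrative choices.

```python
# Self-play from scratch on Nim: start with 10 stones, each player takes
# 1-3 per turn, and whoever takes the last stone wins. A single tabular
# Q function plays both sides; returns alternate sign between the two
# players' moves, so winning lines are reinforced and losing lines
# discouraged, with no human examples involved.
import random

random.seed(2)

Q = {(s, a): 0.0 for s in range(1, 11) for a in (1, 2, 3) if a <= s}
alpha, epsilon = 0.1, 0.2

def best_action(s):
    return max((a for a in (1, 2, 3) if a <= s), key=lambda a: Q[(s, a)])

for episode in range(20000):
    s = 10
    moves = []                          # (state, action) for each turn
    while s > 0:
        if random.random() < epsilon:   # explore occasionally
            a = random.choice([a for a in (1, 2, 3) if a <= s])
        else:
            a = best_action(s)
        moves.append((s, a))
        s -= a
    # The player who took the last stone wins; propagate the outcome
    # backward, flipping its sign at every turn (zero-sum game).
    reward = 1.0
    for s, a in reversed(moves):
        Q[(s, a)] += alpha * (reward - Q[(s, a)])
        reward = -reward

# Optimal Nim play leaves the opponent a multiple of 4: from 10, take 2.
print("learned move from 10 stones:", best_action(10))
```

After enough self-play games the table recovers the classical winning strategy (always leave a multiple of four stones), a strategy the agent was never told.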


Robotics

The last five years have seen consistent progress in intelligent robotics driven by machine learning, powerful computing and communication capabilities, and increased availability of sophisticated sensor systems. Although these systems are not fully able to take advantage of all the advances in AI, primarily due to the physical constraints of the environments, highly agile and dynamic robotics systems are now available for home and industrial use. In industrial robotics, with the implementation of deep-learning-based vision systems, manipulator-type robots—those that grab things, as opposed to those that roll across the floor—can pick up randomly placed overlapping objects at speeds that are practical for real-world applications.

Bipedal and four-legged robots continue to advance in agility. Atlas, a state-of-the-art humanoid robot built by Boston Dynamics, has demonstrated the ability to jump, run, backflip, and maneuver over uneven terrain—feats that were impossible for robots just a few years ago. Spot, a quadruped robot also from Boston Dynamics,15 likewise maneuvers through difficult environments and is being used on construction sites for the delivery and monitoring of lightweight materials and tools. It is worth noting, however, that these systems are built using a combination of learning techniques honed over the last several years, classical control theory akin to that used in autopilots, and painstaking engineering and design. Cassie, a biped robot developed by Agility Robotics and Oregon State University, uses deep reinforcement learning for its walking and running behaviors.16 Whereas the deployment of AI in user-facing vision and language technologies is now commonplace, most robotics systems remain lab-bound.

During 2020, robotics development was driven in part by the need to support social distancing during the COVID-19 pandemic. A group of restaurants opened in China staffed by a team of 20 robots that help cook and serve food. Some early delivery robots were deployed on controlled campuses18 to carry books and food. A diverse collection of companies worldwide is actively pursuing business opportunities in last-mile autonomous delivery systems. While these types of robots are increasingly being used in the real world, they are by no means mainstream yet and are still prone to mistakes, especially when deployed in unmapped or novel environments. In Japan, a new legal framework is being discussed to ensure that autonomous robotics systems can be safely deployed on public roads at limited speeds.19

The combination of deep learning with agile robotics is opening up new opportunities in industrial robotics as well. Leveraging improvements in vision, robotic grippers are beginning to be able to select and pick randomly placed objects and use them to construct stacks. Being able to pick up and put down diverse objects is a key competence in a variety of potential applications, from tidying up homes to preparing packages for shipping.


Mobility

Autonomous vehicles, or self-driving cars, have been one of the hottest areas in deployed robotics, as they impact the entire automobile industry as well as city planning. The design of self-driving cars requires the integration of a range of technologies, including sensor fusion, AI planning and decision-making, vehicle dynamics prediction, on-the-fly rerouting, inter-vehicle communication, and more. Driver-assist systems are increasingly widespread in production vehicles.20 These systems use sensors and AI-based analysis to carry out tasks such as adaptive cruise control, which safely adjusts speed, and lane-keeping assistance, which keeps vehicles centered on the road.
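Lane-keeping assistance can be sketched, in highly simplified form, as a feedback controller that steers against the vehicle's lateral offset from the lane center. The proportional-derivative gains and one-dimensional dynamics below are hypothetical toy values; production systems fuse camera and radar input with far richer vehicle models, but the closed-loop idea is the same.

```python
# Minimal lane-keeping sketch: a proportional-derivative (PD) controller
# drives the lateral offset from the lane center back toward zero.
# Toy dynamics: the steering command directly sets lateral acceleration.
kp, kd = 0.8, 1.2      # proportional and derivative gains (hypothetical)
dt = 0.05              # control timestep in seconds

offset = 1.0           # initial offset from lane center, in meters
rate = 0.0             # lateral velocity, in m/s

history = [offset]
for _ in range(200):                    # simulate 10 seconds
    steer = -kp * offset - kd * rate    # steer against error and drift
    rate += steer * dt
    offset += rate * dt
    history.append(offset)

print(f"offset after 10 s: {history[-1]:.3f} m")
```

The proportional term pulls the car toward the center while the derivative term damps the correction, so the vehicle settles into the lane rather than oscillating across it.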

The optimistic predictions from five years ago of rapid progress in fully autonomous driving have failed to materialize. The reasons may be complicated,21 but the need for exceptional levels of safety in complex physical environments makes the problem more challenging, and more expensive, to solve than had been anticipated. Nevertheless, autonomous vehicles are now operating in certain locales such as Phoenix, Arizona, where driving and weather conditions are particularly benign, and outside Beijing, where 5G connectivity allows remote drivers to take over if needed.22


Health Care

AI is increasingly being used in biomedical applications, particularly in diagnosis, drug discovery, and basic life science research.

Recent years have seen AI-based imaging technologies move from an academic pursuit to commercial projects.23 Tools now exist for identifying a variety of eye and skin disorders,24 detecting cancers,25 and supporting measurements needed for clinical diagnosis.26 Some of these systems rival the diagnostic abilities of expert pathologists and radiologists, and can help alleviate tedious tasks (for example, counting the number of cells dividing in cancer tissue). In other domains, however, the use of automated systems raises significant ethical concerns.27

AI-based risk scoring in healthcare is also becoming more common. Predictors of health deterioration are now integrated into major health record platforms (for example, EPIC Deterioration Index), and individual health centers are increasingly integrating AI-based risk predictions into their operations.28 Although some amount of bias is evident in these systems,29 they appear exceptionally promising for overall improvements in healthcare.

Beyond treatment support, AI now augments a number of other health operations and measurements, such as helping predict the durations of surgeries to optimize scheduling and identifying patients at risk of needing transfer to intensive care.30 There are technologies for digital medical transcription,31 for reading ECGs, for producing super-resolution images to reduce the amount of time patients spend in MRI machines, and for identifying questions for clinicians to ask pediatric patients.32 While current penetration is relatively low, we can expect to see uses of AI expand in this domain in the future; in many cases, these are already-mature technologies from other areas of operations making their way into healthcare.


Finance

AI is increasingly being adopted in finance. Deep learning models now partially automate lending decisions for several lenders33 and have transformed payments with credit scoring (for example, WeChat Pay).34 These new systems often take advantage of consumer data that are not traditionally used in credit scoring. In some cases, this approach can open up credit to new groups of people; in others, it can be used to force people to adopt specific social behaviors.35

High-frequency trading relies on a combination of models as well as the ability to make fast decisions. In the space of personal finance, so-called robo-advising—automated financial advice—is quickly becoming mainstream for investment and overall financial planning.36 For financial institutions, uses of AI are going beyond detecting fraud and enhancing cybersecurity to automating legal and compliance documentation as well as detecting money laundering.37 The Government Pension Investment Fund (GPIF) of Japan, the world’s largest pension fund, introduced a deep-learning-based system to monitor the investment styles of its contracted fund managers and to identify risks from unexpected changes in market conditions, known as regime switches.38 Such applications enable financial institutions to recognize otherwise invisible risks, contributing to more robust and stable asset-management practices.

Recommender Systems

With the explosion of information available to us, recommender systems that automatically prioritize what we see when we are online have become essential. Such systems have always drawn heavily on AI, and now they have a dramatic influence on people’s consumption of products, services, and content—from news, to music, to videos, and more. Apart from a general trend toward more online activity and commerce, the AI technologies powering recommender systems have changed considerably in the past five years. One shift is the near-universal incorporation of deep neural networks to better predict user responses to recommendations.39 There has also been increased usage of sophisticated machine-learning techniques for analyzing the content of recommended items, rather than using only metadata and user click or consumption behavior. That is, AI systems are making more of an effort to understand why a specific item might be a good recommendation for a particular person or query. Examples include Spotify’s use of audio analysis of music40 and the application of large language models such as BERT to improve recommendations of news or social media posts.41 Another trend is the modeling and prediction of multiple distinct user behaviors, instead of making recommendations for only one activity at a time, a capability facilitated by the use of so-called multi-task models.42 Of course, applying recommendation to multiple tasks simultaneously raises the challenging question of how best to make tradeoffs among these different objectives.
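A useful reference point for how recommenders predict user responses is classical matrix factorization, the technique that neural collaborative filtering generalizes: each user and item gets a small latent vector, learned so that their dot product approximates observed interactions, and unobserved user-item pairs are then scored with the same dot product. The ratings table below is a hypothetical toy example.

```python
# Matrix-factorization recommender sketch on a hypothetical toy
# ratings table. SGD learns a latent vector per user and per item so
# that dot products reproduce observed ratings; the same dot product
# then scores items the user has never rated.
import random

random.seed(3)

# (user, item, rating) observations on a 1-5 scale.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 2, 1), (1, 3, 5),
    (2, 1, 1), (2, 2, 5), (2, 3, 2),
]
n_users, n_items, k = 3, 4, 2   # k latent factors per user/item

U = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[random.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]

def predict(u, i):
    return sum(U[u][f] * V[i][f] for f in range(k))

lr, reg = 0.05, 0.01
for epoch in range(500):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(k):
            uf, vf = U[u][f], V[i][f]
            # Gradient step on squared error with light L2 regularization.
            U[u][f] += lr * (err * vf - reg * uf)
            V[i][f] += lr * (err * uf - reg * vf)

# Score an unobserved pair: how much might user 1 like item 1?
print(f"predicted rating for user 1, item 1: {predict(1, 1):.2f}")
```

Deep models extend this template by replacing the dot product with a learned network and by folding in item content and multiple behavior signals, but the core move—embedding users and items in a shared space—carries over.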

The use of ever-more-sophisticated machine-learned models for recommending products, services, and (especially) content has raised significant concerns about fairness, diversity, polarization, and the emergence of filter bubbles, where the recommender system suggests, for example, news stories that other people like you are reading instead of what is truly most important. While these problems require more than just technical solutions, increasing attention is being paid to technologies that can at least partly address them. Promising directions include research on the tradeoffs between popularity and diversity of content consumption,43 and on the fairness of recommendations among different users and other stakeholders (such as content providers or creators).44

[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards, "Data Augmentation Generative Adversarial Networks," March 2018

[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, "A Convergence Theory for Deep Learning via Over-Parameterization," June 2019; Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, "Understanding deep learning requires rethinking generalization," February 2017

[3] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer, "Deep contextualized word representations," March 2018; Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, "Improving Language Understanding by Generative Pre-Training"; Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," May 2019

[4] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy, "SpanBERT: Improving Pre-training by Representing and Predicting Spans," Transactions of the Association for Computational Linguistics, January 2020


[6] Daniel Zhang, Saurabh Mishra, Erik Brynjolfsson, John Etchemendy, Deep Ganguli, Barbara Grosz, Terah Lyons, James Manyika, Juan Carlos Niebles, Michael Sellitto, Yoav Shoham, Jack Clark, and Raymond Perrault, “The AI Index 2021 Annual Report,” AI Index Steering Committee, Human-Centered AI Institute, Stanford University, Stanford, CA, March 2021, p. 49



[9] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver, "Grandmaster level in StarCraft II using multi-agent reinforcement learning," Nature Volume 575, October 2019

[10]  Max Jaderberg, Wojciech M. Czarnecki, Iain Dunning, Luke Marris, Guy Lever, Antonio Garcia Castañeda, Charles Beattie, Neil C. Rabinowitz, Ari S. Morcos, Avraham Ruderman, Nicolas Sonnerat, Tim Green, Louise Deason, Joel Z. Leibo, David Silver, Demis Hassabis, Koray Kavukcuoglu, and Thore Graepel, "Human-level performance in 3D multiplayer games with population-based reinforcement learning," Science, May 2019 



[13] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis, "Mastering the game of Go without human knowledge," Nature, Volume 550, October 2017 

[14] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis, "A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play," Science, December 2018 





[19] As of May 2021, a draft framework stipulates that autonomous robots traveling under 15 km/h will be legalized. Robotics companies are arguing for 20 km/h.




[23] For example, Path.AI, Paige.AI, Arterys.

[24] For example, IDx-DR.

[25] For example, BioMind, PolypDx.

[26] For example, CheXNet.

[27] Michael Anis Mihdi Afnan, Cynthia Rudin, Vincent Conitzer, Julian Savulescu, Abhishek Mishra, Yanhe Liu, and Masoud Afnan, "Ethical Implementation of Artificial Intelligence to Select Embryos in In Vitro Fertilization," April 2021

[28] For example, infection risk predictors at Vector, Ontario Tech University, McMaster Children’s Hospital, and Southlake Regional Health Centre.


[30] For example, at Vector and St. Michael’s Hospital and also using other forms of risk (for example, AlgoAnalyzer, TruScore, OptimaAI).

[31] For example, Nuance Dragon, 3M M*Modal, Kara, NoteSwift.

[32] For example, Child Health Improvement.







[39] For example, see early research on "neural collaborative filtering": Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua, "Neural Collaborative Filtering," August 2017; or the use of DNNs for YouTube recommendations


[41] See for their use in the 2020 Recommender Systems (RecSys) challenge.

[42] Zhe Zhao, Lichan Hong, Li Wei, Jilin Chen, Aniruddh Nath, Shawn Andrews, Aditee Kumthekar, Maheswaran Sathiamoorthy, Xinyang Yi, and Ed Chi, "Recommending what video to watch next: a multitask ranking system," Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19)


[44] .


Cite This Report

Michael L. Littman, Ifeoma Ajunwa, Guy Berger, Craig Boutilier, Morgan Currie, Finale Doshi-Velez, Gillian Hadfield, Michael C. Horowitz, Charles Isbell, Hiroaki Kitano, Karen Levy, Terah Lyons, Melanie Mitchell, Julie Shah, Steven Sloman, Shannon Vallor, and Toby Walsh. "Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report." Stanford University, Stanford, CA, September 2021. Accessed September 16, 2021.

Report Authors

AI100 Standing Committee and Study Panel 


© 2021 by Stanford University. Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report is made available under a Creative Commons Attribution-NoDerivatives 4.0 License (International):