
SQ3. What are the most inspiring open grand challenge problems?


The concept of a “grand challenge” has played a significant role in AI research at least since 1988, when Turing Award winner Raj Reddy, an AI pioneer, gave a speech titled “Foundations and Grand Challenges of Artificial Intelligence.”1 In the address, Reddy outlined the major achievements of the field and posed a set of challenges as a way of articulating the motivations behind research in the field. Some of Reddy’s challenges have been fully or significantly solved: a “self-organizing system” that can read a textbook and answer questions;2 a world-champion chess machine;3 an accident-avoiding car;4 a translating telephone.5 Others have remained open: mathematical discovery,6 a self-replicating system that enables a small set of machine tools to produce other tools using locally available raw materials.

Some of today’s grand challenges in AI are carefully defined goals with a clear marker of success, similar to Reddy’s chess challenge. The 2016 AI100 report was released just after one such grand challenge was achieved, with DeepMind’s AlphaGo beating a world champion in Go. There are also a number of open grand challenges with less specific criteria for completion, but which inspire AI researchers to achieve needed breakthroughs—such as AlphaFold’s 2020 success at predicting protein structures. 

Reddy’s grand challenges were framed in terms of concrete tasks to be completed—drive a car, win a game of chess. Similar challenges—such as improving accuracy rates on established datasets like ImageNet7—continue to drive creativity and progress in AI research. One of the leading machine-learning conferences, Neural Information Processing Systems, began a “competition track” for setting and solving such challenges in 2017.8

But, as the field of AI has matured, so has the idea of a grand challenge. Perhaps the most inspiring challenge is to build machines that can cooperate and collaborate seamlessly with humans and can make decisions that are aligned with fluid and complex human values and preferences. This challenge cannot be met without collaboration among computer scientists, social scientists, and humanists. But these are domains in which research challenges are not as crisply defined as a measurable task or benchmark. And indeed, a lesson learned from social science and humanities-inspired research over the past five years is that AI research that is overly tuned to concrete benchmarks and tasks—such as accuracy rates on established datasets—can take us further away from the goal of cooperative and well-aligned AI that serves humans’ needs, goals, and values.9 The problems of racial, gender, and other biases in machine-learning models,10 for example, can be at least partly attributed to a blind spot created by research that aimed only to improve accuracy on available datasets and did not investigate the representativeness or quality of the data, or the ways in which different errors carry different human meanings and consequences. Mislabeling a car as an airplane is one thing; mislabeling a person as a gorilla is another.11 So we include in our concept of a grand challenge the open research questions that, like the earlier grand challenges, should inspire a new generation of interdisciplinary AI researchers.

Turing Test

Alan Turing formulated his original challenge in 1950 in terms of the ability of an interrogator to distinguish between a woman and a machine attempting to mimic a woman, through written question-and-answer exchange.12 A machine passes the Turing test if it is able to do as good a job as a man at imitating a woman. Today, the challenge is understood to be more demanding (and less sexist): engaging in fluent human text-based conversation requiring a depth of syntactic, cultural, and contextual knowledge such that the machine could be mistaken for a human being. Attempts have been made over the years to improve on the basic design.13 Barbara Grosz, an AI pioneer on the topic of natural communication between people and machines, proposed a modern version of the Turing Test in 2012:

A computer (agent) team member [that can] behave, over the long term and in uncertain, dynamic environments, in such a way that people on the team will not notice that it is not human.14

Has the Turing challenge been solved? Although language models such as GPT-3 are capable of producing significant amounts of text that are difficult to distinguish from human-generated text,15 these models are still unreliable and often err by producing language that defies human rules, conventions, and common sense, especially in long passages or an interactive setting. Indeed, major risks still surround these errors. Language-generating chatbots easily violate human norms about acceptable language, producing hateful, racist, or sexist statements in contexts where a socially competent human clearly would not.16

Today’s version of the Turing challenge should also take into consideration the very real harms that come from building machines that trick humans into believing they are interacting with other humans. The initial rollout of Google Duplex generated significant public outcry because the system uses an extremely natural voice and injects umms and ahs when booking an appointment; it looked as though it was trying to fool people. (The current version discloses its computer identity.)17 With the enormous advances in the capacity of machine learning to produce images, video, audio, and text that are indistinguishable from human-generated versions have come significant challenges to the quality and stability of human relationships and systems. AI-mediated content on social media platforms, for example, has contributed in the last few years to political unrest and violence.18

A contemporary version of the Turing challenge might therefore be the creation of a machine that can engage in fluent communication with a human without being mistaken for a human, especially because people adapt so readily to human-like conversational interaction.19 Grosz’s version of the test recognizes the importance of this concern: It “does not ask that the computer system act like a person or be mistaken for one. Instead it asks that the computer’s nonhumanness not hit one in the face, that it is not noticeable, and that the computer act intelligently enough that it does not baffle its teammates.” This approach would be consistent with principles that are emerging in regulation, such as the European Union’s 2021 proposal for legislation requiring that providers of AI systems design and develop mechanisms to inform people that they are interacting with AI technology.20


RoboCup

RoboCup is an established grand challenge in AI and robotics with the goal of developing a fully autonomous robot team capable of beating the FIFA World Cup champion soccer (football) team by 2050. Researchers from over 35 countries are involved in this initiative, with a series of international and regional competitions, symposia, summer schools, and other activities. While RoboCup’s main goal is to develop a super-human team of robots, an alternative goal is to form a human-robot hybrid championship team. This alternative goal stresses human-robot collaboration, fostering symbiotic human-robot relationships.

Since 2007, RoboCup has moved toward trials of robots playing soccer on an outdoor field, and has matched a winning robot team against human players indoors. While the level of play remains far from real-world soccer, these steps constitute major progress toward more realistic play. RoboCup has also introduced and fostered novel competitions for intelligent robotics including home-based, industrial, and disaster-response robotics.

RoboCup challenge
Ball control, passing strategy, and shooting accuracy have continued to improve over the quarter century the RoboCup competition has been held. While human players, even in their everyday researcher clothes, still dominate, the best robot teams can occasionally score in the yearly human-robot match. Peter Stone, the AI100 Standing Committee chair, is shown here taking a shot in the RoboCup 2019 match in Sydney, managed by ICMS Australasia. From:…

International Math Olympiad

The International Math Olympiad (IMO) is an international mathematics competition for high-school students. Related to Reddy’s challenge of mathematical discovery is the challenge of building an AI system that can win a gold medal in the IMO. The committee sponsoring this challenge has set precise parameters for success: the AI must be capable of producing, within the same time limit as a human contestant, solutions to the problems in the annual IMO that can be checked by an automated theorem prover in 10 minutes (the time it usually takes a human judge to evaluate a human’s solution), and of achieving a score that would have earned a gold medal in a given year.21

The AI Scientist

The AI Scientist challenge22 envisions the development, by 2050, of AI systems that can engage in autonomous scientific research. They would be “capable of making major discoveries some of which are at the level worthy of the Nobel Prize or other relevant recognition” and “make strategic choices about research goals, design protocols and experiments to collect data, notice and characterize a significant discovery, communicate in the form of publications and other scientific means to explain the innovation and methods behind the discovery and articulate the significance of the discovery [and] its impact.” A workshop organized by the Alan Turing Institute in February 2020 proposed the creation of a global initiative to develop such an AI Scientist.23

Broader Challenges

We now turn to open grand challenges that do not have the structure of a formal competition or crisp benchmark. These research challenges are among the most inspiring.


Modern machine-learning models are trained on increasingly massive datasets (over one trillion words for GPT-3, for example) and optimized to accomplish specific tasks or maximize specified reward functions. While these methods enable surprisingly powerful systems—with performance following a power law, continuing to improve as dataset or model size grows24—many believe that major advances in AI will require developing the capacity to generalize or transfer learning from a training task to a novel one. Although modern machine-learning techniques are making headway on this problem, a robust capacity for generalization and transfer learning will likely require the integration of symbolic and probabilistic reasoning—combining a primary focus on logic with a more statistical point of view. Some characterize the skill of extrapolating from few examples as a form of common sense, meaning that it requires broad knowledge about the world and the ability to adapt that knowledge to novel circumstances. Increasing generality is likely to require that machines learn, as humans do, from small samples and by analogy. Generalizability is also a key component of robustness, allowing an AI system to respond and adapt to shifts in the frequency with which it sees different examples—distribution shifts that continue to interfere with modern machine-learning-based systems.
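The power-law relationship noted above (documented empirically in the work cited in note 24) is often summarized in the following form. This is a sketch of the functional shape rather than a universal law: the constants are fitted from experiments and vary by task and architecture.

```latex
% Empirical scaling laws (see note 24): test loss L falls as a power law
% in model size N and, analogously, in dataset size D. The constants
% N_c, D_c, \alpha_N, \alpha_D are empirically fitted, not fundamental.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}
```

The fitted exponents are typically small, which is why loss keeps falling smoothly, but slowly, as models and datasets grow by orders of magnitude.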


An important source of generality in natural intelligence is knowledge of cause and effect. Current machine-learning techniques are capable of discovering hidden patterns in data, and these discoveries allow the systems to solve ever-increasing varieties of problems. Neural network language models, for example, built on the capacity to predict words in sequence, display tremendous capacity to correct grammar, answer natural language questions, write computer code, translate languages, and summarize complex or extended specialized texts. Today’s machine-learning models, however, have only limited capacity to discover causal knowledge of the world, as Turing Award winner Judea Pearl has emphasized.25 They have very limited ability to predict how novel interventions might change the world they are interacting with, or how an environment might have evolved differently under different conditions. They do not know what is possible in the world. To create systems significantly more powerful than those in use today, we will need to teach them to understand causal relationships. It remains an open question whether we will be able to build systems with good causal models of sufficiently complex systems from text alone, in the absence of interaction.


Nils J. Nilsson,26 one of AI’s pioneers and an author of an early textbook in the field, defined intelligence as the capacity to function appropriately and with foresight in an environment. When that environment includes humans, appropriate behavior is determined by complex and dynamic normative schemes. Norms govern almost everything we do; whenever we make a decision, we are aware of whether others would consider it “acceptable” or “not acceptable.”27 And humans have complex processes for choosing norms with their own dynamics and characteristics.28 Normatively competent AI systems will need to understand and adapt to dynamic and complex regimes of normativity.

Aligning with human normative systems is a massive challenge in part because what is “good” and what is “bad” varies tremendously across human cultures, settings, and time. Even apparently universal norms such as “do not kill” are highly variable and nuanced: Some modern societies say it is okay for the state to kill someone who has killed another or revealed state secrets; historically, many societies approved of killing a woman who has had pre-marital or extra-marital sex or whose family has not paid dowry, and some groups continue to sanction such killing today. And most killing does not occur in deliberate, intentional contexts. Highways and automobiles are designed to trade off speed and traffic flow with a known risk that a non-zero number of people will be killed by the design. AI researchers can choose not to participate in the building of systems that violate the researcher’s own values, by refusing to work on AI that supports state surveillance or military applications, say. But a lesson from the social sciences and humanities is that it is naive to think that there is a definable and core set of universal values that can directly be built into AI systems. Moreover, a core value that is widely shared is the concept of group self-determination and national sovereignty. AI systems built for Western values, with Western tradeoffs, violate other values.

Even within a given shared normative framework, the challenges are daunting. As an example, there has been an explosion of interest in the last five years in the problem of developing algorithms that are unbiased and fair.29 Given the marked cultural differences in what is even considered “fair,” doing this will require going beyond the imposition of statistical constraints on outputs of AI systems. Like a competent human, advanced AI systems will need to be able to both read and interact with cultural and social norms, and sometimes highly local practices, rules, and laws, and to adapt as these features of the normative environment change. At the same time, AI systems will need to have features that allow them to be integrated into the institutions through which humans implement normative systems. For an AI system to be accountable, for example, it will require that accounts of how and why it acted as it did are reviewable by independent third parties tasked with ensuring that the account is consistent with applicable rules. 
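One example of the kind of purely statistical constraint referred to above is demographic parity, a standard criterion in the algorithmic-fairness literature, given here as an illustrative sketch rather than a recommendation:

```latex
% Demographic parity: the prediction \hat{Y} is statistically independent
% of the protected attribute A, i.e., positive-prediction rates are equal
% across groups a and b.
P\bigl(\hat{Y} = 1 \mid A = a\bigr) \;=\; P\bigl(\hat{Y} = 1 \mid A = b\bigr)
\qquad \text{for all groups } a, b
```

Because such a constraint says nothing about context, local practice, or the differing consequences of errors, it illustrates why statistical criteria alone cannot supply the cultural and institutional competence described here.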

Humans who are alleged to have engaged in unlawful conduct are held accountable by independent adjudicators applying consistent rules and procedures. For AI to be ethical, fair, and value-aligned, it needs to have good normative models and to be capable of integrating its behavior into human normative institutions and processes. Although significant progress is being made on making AI more explainable30—and on avoiding opaque models in high-stakes settings when possible31—systems of accountability require more than causal accounts of how a decision was reached; they require normative accounts of how and why the decision is consistent with human values. Explanation is an interaction between a machine and a human; justification is an interaction between a machine and an entire normative community and its institutions.

[1] Raj Reddy, “Foundations and Grand Challenges of Artificial Intelligence,” 1988 AAAI Presidential Address.

[2] Arguably, the challenge is partially solved. For example, GPT-3 can answer many questions reasonably well, and some not at all, based on having trained on many textbooks and other online materials. See Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt, “Measuring Massive Multitask Language Understanding”; and Noam Kolt, “Predicting Consumer Contracts,” Berkeley Technology Law Journal, Vol. 37.

[3] Murray Campbell, A. Joseph Hoane Jr., and Feng-hsiung Hsu, “Deep Blue,” Artificial Intelligence, Volume 134, Issues 1-2, Pages 57-83.

[4] Alex G. Cunningham, Enric Galceran, Dhanvin Mehta, Gonzalo Ferrer, Ryan M. Eustice, and Edwin Olson, “MPDM: Multi-policy decision-making from autonomous driving to social robot navigation.”


[6] There are automated proof checkers and some brute-force theorem provers, but generating a novel interesting mathematical conjecture and proving it in a way humans understand is still an open challenge.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “ImageNet: A large-scale hierarchical image database,” IEEE.


[9] A core result in economic theory is that, when success has measurable and unmeasurable components, incentives tuned to the measurable components can degrade performance overall by distorting effort away from the unmeasurable. See Bengt Holmstrom and Paul Milgrom, “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design,” 7 J. L. Econ. & Org. 24 (1991).

[10] Joy Buolamwini and Timnit Gebru, “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification,” Proceedings of Machine Learning Research 81:1–15, 2018.


[12] A.M. Turing, “Computing Machinery and Intelligence,” Mind, Volume LIX, Issue 236, October 1950, Pages 433–460.

[13] Hector J. Levesque, Ernest Davis, and Leora Morgenstern, “The Winograd Schema Challenge,” Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning; and Paul R. Cohen, “If Not Turing’s Test, Then What?” AI Magazine, Volume 26, Number 4.

[14] Barbara Grosz, “What Question Would Turing Pose Today?” AI Magazine, Vol. 33 No. 4: Winter 2012.

[15] These models now closely approximate human performance on natural language benchmark tasks.







[22] Hiroaki Kitano, “Artificial Intelligence to Win the Nobel Prize and Beyond: Creating the Engine for Scientific Discovery,” AI Magazine, Vol. 37 No. 1: Spring 2016.


[24] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou, “Deep Learning Scaling is Predictable, Empirically”; and Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, “Scaling Laws for Neural Language Models.”

[25] Judea Pearl and Dana Mackenzie, The Book of Why: The New Science of Cause and Effect (2018).

[26] Note that Nilsson’s definition was also featured in the first AI100 report.

[27] Geoffrey Brennan, Robert E. Goodin, and Nicholas Southwood, Explaining Norms (2016); Cristina Bicchieri, The Grammar of Society (2005).

[28] Gillian K. Hadfield and Barry R. Weingast, “Microfoundations of the Rule of Law,” Annual Review of Political Science, Vol. 17: 21–42.

[29] For example, the ACM Conference on Fairness, Accountability, and Transparency began in 2018. 

[30] Arun Rai, “Explainable AI: from black box to glass box,” J. of the Acad. Mark. Sci. 48, 137–141 (2020). 

[31] Cynthia Rudin, “Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead,” Nature Machine Intelligence, volume 1, 206–215 (2019).






Cite This Report

Michael L. Littman, Ifeoma Ajunwa, Guy Berger, Craig Boutilier, Morgan Currie, Finale Doshi-Velez, Gillian Hadfield, Michael C. Horowitz, Charles Isbell, Hiroaki Kitano, Karen Levy, Terah Lyons, Melanie Mitchell, Julie Shah, Steven Sloman, Shannon Vallor, and Toby Walsh. "Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report." Stanford University, Stanford, CA, September 2021. Doc: Accessed: September 16, 2021.

Report Authors

AI100 Standing Committee and Study Panel 


© 2021 by Stanford University. Gathering Strength, Gathering Storms: The One Hundred Year Study on Artificial Intelligence (AI100) 2021 Study Panel Report is made available under a Creative Commons Attribution-NoDerivatives 4.0 License (International):