Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.
In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.
Problems with Human Models
To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)
Less Independent Audits
Imagine we have built an AGI system and we want to use it to design the mass transit system for a new city. The safety problems associated with such a project are well recognised; suppose we are not completely sure we have solved them, but are confident enough to try anyway. We run the system in a sandbox on some fake city input data and examine its outputs. Then we run it on some more outlandish fake city data to assess robustness to distributional shift. The AGI’s outputs look like reasonable transit system designs and considerations, and include arguments, metrics, and other supporting evidence that they are good. Should we be satisfied and ready to run the system on the real city’s data, and to implement the resulting proposed design?
We suggest that an important factor in the answer to this question is whether the AGI system was built using human modelling or not. If it produced a solution to the transit design problem (that humans approve of) without human modelling, then we would more readily trust its outputs. If it produced a solution we approve of with human modelling, then although we expect the outputs to be in many ways about good transit system design (our actual preferences) and in many ways suited to being approved by humans, to the extent that these two targets come apart we must worry about having overfit to the human model at the expense of the good design. (Why not the other way around? Because our assessment of the sandboxed results uses human judgement, not an independent metric for satisfaction of our actual preferences.)
Humans have a preference for not being wrong about the quality of a design, let alone being fooled about it. How much do we want to rely on having correctly captured these preferences in our system? If the system is modelling humans, we strongly rely on the system learning and satisfying these preferences, or else we expect to be fooled to the extent that a good-looking but actually bad transit system design is easier to compose than an actually-good design. On the other hand, if the system is not modelling humans, then the fact that its output looks like a good design is better evidence that it is in fact a good design. Intuitively, if we consider sampling possible outputs and condition on the output looking good (via knowledge of humans), the probability of it being good (via knowledge of the domain) is higher when the system’s knowledge is more about what is good than what looks good.
Here is a handle for this problem: a desire for an independent audit of the system’s outputs. When a system uses human modelling, the mutual information between its outputs and the auditing process (human judgement) is higher. Thus, using human models reduces our ability to do independent audits.
Avoiding human models does not avoid this problem altogether. There is still an “outer-loop optimisation” version of the problem. If the system produces a weird or flawed design in sandbox, and we identify this during an audit, we will probably reject the solution and attempt to debug the system that produced it. This introduces a bias on the overall process (involving multiple versions of the system over phases of auditing and debugging) towards outputs that fool our auditing procedure.
However, outer-loop optimisation pressures are weaker, and therefore less worrying, than in-loop optimisation pressures. We would argue that the problem is much worse, i.e., the bias towards fooling is stronger, when one uses human modelling. This is because the relevant optimisation is in-loop instead and is encountered more often.
As one more analogy to illustrate this point, consider a classic Goodhart’s law example of teaching to the test. If you study the material, then take a test, your test score reveals your knowledge of the material fairly well. If you instead study past tests, your test score reveals your ability to pass tests, which may be correlated with your knowledge of the material but is increasingly less likely to be so correlated as your score goes up. Here human modelling is analogous to past tests and actual preferences are analogous to the material. Taking the test is analogous to an audit, which we want to be independent from the study regimen.
Risk from Bugs
We might implement our first AGI system incorrectly in a mundane sense. Specifically, even if we fully develop a theory of safe or aligned AGI, we might fail to implement that theory due to bugs or problems with our implementation techniques. In this case, we would be relatively better off if the mutual information between the AGI’s knowledge and human preferences is low. We expect the system’s behaviour to be dependent on its knowledge in some way, and we expect implementation errors to shift the nature of that dependence away from our intentions and expectations. Incorrect behaviour that depends on human preferences seems more dangerous than incorrect behaviour that does not.
Consider the space of AGI system implementations, under a metric like similarity to an intended design (equivalently: severity of deviation from the design due to bugs). We want all the points near the first AGI system we build to be safe, because we may end up with a slightly different design than intended for reasons such as being confused about what we are doing or making implementation errors.
There are at least three ways in which the risk from bugs can manifest.
Incorrectly Encoded Values: Supposing we intend the first use of AGI to be solving some bounded and well-specified task, but we misunderstand or badly implement it so much that what we end up with is actually unboundedly optimising some objective function. Then it seems better if that objective is something abstract like puzzle solving rather than something more directly connected to human preferences: consider, as a toy example, if the sign (positive/negative) around the objective were wrong.
Manipulation: The earlier arguments for independent audits do not just apply to the specific tasks we would plan to audit, but also to any activities an AGI system might carry out that humans might disapprove of. Examples include finding ways to hack into our supposedly secure systems, hiding its intentions and activity from us, or outright manipulating us. These tasks are much easier with access to a good psychological model of humans, which can be used to infer what mistakes we might make, or what loopholes we might overlook, or how we might respond to different behaviour from the system.
Human modelling is very close to human manipulation in design space. A system with accurate models of humans is close to a system which successfully uses those models to manipulate humans.
Threats: Another risk from bugs comes not from the AGI system caring incorrectly about our values, but from having inadequate security. If our values are accurately encoded in an AGI system that cares about satisfying them, they become a target for threats from other actors who can gain from manipulating the first system. More examples and perspectives on this problem have been described here.
The increased risk from bugs of human modelling can be summarised as follows: whatever the risk that AGI systems produce catastrophic outcomes due to bugs, the very worst outcomes seem more likely if the system was trained using human modelling because these worst outcomes depend on the information in human models.
Less independent audits and the risk from bugs can both be mitigated by preserving independence of the system from human model information, so the system cannot overfit to that information or use it perversely. The remaining two problems we consider, mind crime and unexpected agents, depend more heavily on the claim that modelling human preferences increases the chances of simulating something human-like.
Many computations may produce entities that are morally relevant because, for example, they constitute sentient beings that experience pain or pleasure. Bostrom calls improper treatment of such entities “mind crime”. Modelling humans in some form seems more likely to result in such a computation than not modelling them, since humans are morally relevant and the system’s models of humans may end up sharing whatever properties make humans morally relevant.
Similar to the mind crime point above, we expect AGI designs that use human modelling to be more at risk of producing subsystems that are agent-like, because humans are agent-like. For example, we note that trying to predict the output of consequentialist reasoners can reduce to an optimisation problem over a space of things that contains consequentialist reasoners. A system engineered to predict human preferences well seems strictly more likely to run into problems associated with misaligned sub-agents. (Nevertheless, we think the amount by which it is more likely is small.)
Safe AGI Without Human Models is Neglected
Given the independent auditing concern, plus the additional points mentioned above, we would like to see more work done on practical approaches to developing safe AGI systems that do not depend on human modelling. At present, this is a neglected area in the AGI safety research landscape. Specifically, work of the form “Here’s a proposed approach, here are the next steps to try it out or investigate further”, which we might term engineering-focused research, is almost entirely done in a human-modelling context. Where we do see some safety work that eschews human modelling, it tends to be theory-focused research, for example, MIRI’s work on agent foundations. This does not fill the gap of engineering-focused work on safety without human models.
To flesh out the claim of a gap, consider the usual formulations of each of the following efforts within safety research: iterated distillation and amplification, debate, recursive reward modelling, cooperative inverse reinforcement learning, and value learning. In each case, there is human modelling built into the basic setup for the approach. However, we note that the technical results in these areas may in some cases be transportable to a setup without human modelling, if the source of human feedback (etc.) is replaced with a purely algorithmic, independent system.
Some existing work that does not rely on human modelling includes the formulation of safely interruptible agents, the formulation of impact measures (or side effects), approaches involving building AI systems with clear formal specifications (e.g., some versions of tool AIs), some versions of oracle AIs, and boxing/containment. Although they do not rely on human modelling, some of these approaches nevertheless make most sense in a context where human modelling is happening: for example, impact measures seem to make most sense for agents that will be operating directly in the real world, and such agents are likely to require human modelling. Nevertheless, we would like to see more work of all these kinds, as well as new techniques for building safe AGI that does not rely on human modelling.
Difficulties in Avoiding Human Models
A plausible reason why we do not yet see much research on how to build safe AGI without human modelling is that it is difficult. In this section, we describe some distinct ways in which it is difficult.
It is not obvious how to put a system that does not do human modelling to good use. At least, it is not as obvious as for the systems that do human modelling, since they draw directly on sources (e.g., human preferences) of information about useful behaviour. In other words, it is unclear how to solve the specification problem—how to correctly specify desired (and only desired) behaviour in complex domains—without human modelling. The “against human modelling” stance calls for a solution to the specification problem wherein useful tasks are transformed into well-specified, human-independent tasks either solely by humans or by systems that do not model humans.
To illustrate, suppose we have solved some well-specified, complex but human-independent task like theorem proving or atomically precise manufacturing. Then how do we leverage this solution to produce a good (or better) future? Empowering everyone, or even a few people, with access to a superintelligent system that does not directly encode their values in some way does not obviously produce a future where those values are realised. (This seems related to Wei Dai’s human-safety problem.)
Implicit Human Models
Even seemingly “independent” tasks leak at least a little information about their origins in human motivations. Consider again the mass transit system design problem. Since the problem itself concerns the design of a system for use by humans, it seems difficult to avoid modelling humans at all in specifying the task. More subtly, even highly abstract or generic tasks like puzzle solving contain information about the sources/designers of the puzzles, especially if they are tuned for encoding more obviously human-centred problems. (Work by Shah et al. looks at using the information about human preferences that is latent in the world.)
Specification Competitiveness / Do What I Mean
Explicit specification of a task in the form of, say, an optimisation objective (of which a reinforcement learning problem would be a specific case) is known to be fragile: there are usually things we care about that get left out of explicit specifications. This is one of the motivations for seeking more and more high level and indirect specifications, leaving more of the work of figuring out what exactly is to be done to the machine. However, it is currently hard to see how to automate the process of turning tasks (vaguely defined) into correct specifications without modelling humans.
Performance Competitiveness of Human Models
It could be that modelling humans is the best way to achieve good performance on various tasks we want to apply AGI systems to for reasons that are not simply to do with understanding the problem specification well. For example, there may be aspects of human cognition that we want to more or less replicate in an AGI system, for competitiveness at automating those cognitive functions, and those aspects may carry a lot of information about human preferences with them in a hard to separate way.
What to Do Without Human Models?
We have seen arguments for and against aspiring to solve AGI safety using human modelling. Looking back on these arguments, we note that to the extent that human modelling is a good idea, it is important to do it very well; to the extent that it is a bad idea, it is best to not do it at all. Thus, whether or not to do human modelling at all is a configuration bit that should probably be set early when conceiving of an approach to building safe AGI.
It should be noted that the arguments above are not intended to be decisive, and there may be countervailing considerations which mean we should promote the use of human models despite the risks outlined in this post. However, to the extent that AGI systems with human models are more dangerous than those without, there are two broad lines of intervention we might attempt. Firstly, it may be worthwhile to try to decrease the probability that advanced AI develops human models “by default”, by promoting some lines of research over others. For example, an AI trained in a procedurally-generated virtual environment seems significantly less likely to develop human models than an AI trained on human-generated text and video data.
Secondly, we can focus on safety research that does not require human models, so that if we eventually build AGI systems that are highly capable without using human models, we can make them safer without needing to teach them to model humans. Examples of such research, some of which we mentioned earlier, include developing human-independent methods to measure negative side effects, to prevent specification gaming, to build secure approaches to containment, and to extend the usefulness of task-focused systems.
Acknowledgements: thanks to Daniel Kokotajlo, Rob Bensinger, Richard Ngo, Jan Leike, and Tim Genewein for helpful comments on drafts of this post.