Was Kavanaugh Guilty?

No, I am not going to answer the question. This post will illustrate and argue for a position that I have argued many times in the past, namely that belief is voluntary. The example is merely particularly good for proving the point. I will also be using a framework something like Bryan Caplan’s in his discussion of rational irrationality:

Two forces lie at the heart of economic models of choice: preferences and prices. A consumer’s preferences determine the shape of his demand curve for oranges; the market price he faces determines where along that demand curve he resides. What makes this insight deep is its generality. Economists use it to analyze everything from having babies to robbing banks.

Irrationality is a glaring exception. Recognizing irrationality is typically equated with rejecting economics. A “logic of the irrational” sounds self-contradictory. This chapter’s central message is that this reaction is premature. Economics can handle irrationality the same way it handles everything: preferences and prices. As I have already pointed out:

  • People have preferences over beliefs: A nationalist enjoys the belief that foreign-made products are overpriced junk; a surgeon takes pride in the belief that he operates well while drunk.
  • False beliefs range in material cost from free to enormous: Acting on his beliefs would lead the nationalist to overpay for inferior goods, and the surgeon to destroy his career.

Snapping these two building blocks together leads to a simple model of irrational conviction. If agents care about both material wealth and irrational beliefs, then as the price of casting reason aside rises, agents consume less irrationality. I might like to hold comforting beliefs across the board, but it costs too much. Living in a Pollyanna dreamworld would stop be from coping with my problems, like that dead tree in my backyard that looks like it is going to fall on my house.

Let us assume that people are considering whether to believe that Brett Kavanaugh was guilty of sexual assault. For ease of visualization, let us suppose that they have utility functions defined over the following outcomes:

(A) Believe Kavanaugh was guilty, and turn out to be right

(B) Believe Kavanaugh was guilty, and turn out to be wrong

(C) Believe Kavanaugh was innocent, and turn out to be right

(D) Believe Kavanaugh was innocent, and turn out to be wrong

(E) Admit that you do not know whether he was guilty or not (this will be presumed to be a true statement, but I will count it as less valuable than a true statement that includes more detail.)

(F) Say something bad about your political enemies

(G) Say something good about your political enemies

(H) Say something bad about your political allies

(I) Say something good about your political allies

Note that options A through E are mutually exclusive, while one or more of options F through I might or might not come together with one of those from A through E.

Let’s suppose there are three people, a right winger who cares a lot about politics and little about truth, a left winger who cares a lot about politics and little about truth, and an independent who does not care about politics and instead cares a lot about truth. Then we posit the following table of utilities:

Right Winger
Left Winger
Independent
(A)
10
10
100
(B)
-10
-10
-100
(C)
10
10
100
(D)
-10
-10
-100
(E)
5
5
50
(F)
100
100
0
(G)
-100
-100
0
(H)
-100
-100
0
(I)
100
100
0

The columns for the right and left wingers are the same, but the totals will be calculated differently because saying something good about Kavanaugh, for the right winger, is saying something good about an ally, while for the left winger, it is saying something good about an enemy, and there is a similar contrast if something bad is said.

Now there are really only three options we need to consider, namely “Believe Kavanaugh was guilty,” “Believe Kavanaugh was innocent,” and “Admit that you do not know.” In addition, in order to calculate expected utility according to the above table, we need a probability that Kavanaugh was guilty. In order not to offend readers who have already chosen an option, I will assume a probability of 50% that he was guilty, and 50% that he was innocent. Using these assumptions, we can calculate the following ultimate utilities:

Right Winger
Left Winger
Independent
Claim Guilt
-100
100
0
Claim Innocence
100
-100
0
Confess Ignorance
5
5
50

(I won’t go through this calculation in detail; it should be evident that given my simple assumptions of the probability and values, there will be no value for anyone in affirming guilt or innocence as such, but only in admitting ignorance, or in making a political point.) Given these values, obviously the left winger will choose to believe that Kavanaugh was guilty, the right winger will choose to believe that he was innocent, and the independent will admit to being ignorant.

This account obviously makes complete sense of people’s actual positions on the question, and it does that by assuming that people voluntarily choose to believe a position in the same way they choose to do other things. On the other hand, if you assume that belief is an involuntary evaluation of a state of affairs, how could the actual distribution of opinion possibly be explained?

As this is a point I have discussed many times in the past, I won’t try to respond to all possible objections. However, I will bring up two of them. In the example, I had to assume that people calculated using a probability of 50% for Kavanaugh’s guilt or innocence. So it could be objected that their “real” belief is that there is a 50% chance he was guilty, and the statement is simply an external thing.

This initial 50% is something like a prior probability, and corresponds to a general leaning towards or away from a position. As I admitted in discussion with Angra Mainyu, that inclination is largely involuntary. However, first, this is not what we call a “belief” in ordinary usage, since we frequently say that someone has a belief while having some qualms about it. Second, it is not completely immune from voluntary influences. In practice in a situation like this, it will represent something like everything the person knows about the subject and predicate apart from this particular claim. And much of what the person knows will already be in subject/predicate form, and the person will have arrived at it through a similar voluntary process.

Another objection is that at least in the case of something obviously true or obviously false, there cannot possibly be anything voluntary about it. No one can choose to believe that the moon is made of green cheese, for example.

I have responded to this to this in the past by pointing out that most of us also cannot choose to go and kill ourselves, right now, despite the fact that doing so would be voluntary. And in a similar way, there is nothing attractive about believing that the moon is made of green cheese, and so no one can do it. At least two objections will be made to this response:

1) I can’t go kill myself right now, but I know that this is because it would be bad. But I cannot believe that the moon is made of green cheese because it is false, not because it is bad.

2) It does not seem that much harm would be done by choosing to believe this about the moon, and then changing your mind after a few seconds. So if it is voluntary, why not prove it by doing so? Obviously you cannot do so.

Regarding the first point, it is true that believing the moon is made of cheese would be bad because it is false. And in fact, if you find falsity the reason you cannot accept it, how is that not because you regard falsity as really bad? In fact lack of attractiveness is extremely relevant here. If people can believe in Xenu, they would find it equally possible to believe that the moon was made of cheese, if that were the teaching of their religion. In that situation, the falsity of the claim would not be much obstacle at all.

Regarding the second point, there is a problem like Kavka’s Toxin here. Choosing to believe something, roughly speaking, means choosing to treat it as a fact, which implies a certain commitment. Choosing to act like it is true enough to say so, then immediately doing something else, is not choosing to believe it, but rather it is choosing to tell a lie. So just as one cannot intend to drink the toxin without expecting to actually drink it, so one cannot choose to believe something without expecting to continue to believe it for the foreseeable future. This is why one would not wish to accept such a statement about the moon, not only in order to prove something (especially since it would prove nothing; no one would admit that you had succeeded in believing it), but even if someone were to offer a very large incentive, say a million dollars if you managed to believe it. This would amount to offering to pay someone to give up their concern for truth entirely, and permanently.

Additionally, in the case of some very strange claims, it might be true that people do not know how to believe them, in the sense that they do not know what “acting as though this were the case” would even mean. This no more affects the general voluntariness of belief than the fact that some people cannot do backflips affects the fact that such bodily motions are in themselves voluntary.

Discount Rates

Eliezer Yudkowsky some years ago made this argument against temporal discounting:

I’ve never been a fan of the notion that we should (normatively) have a discount rate in our pure preferences – as opposed to a pseudo-discount rate arising from monetary inflation, or from opportunity costs of other investments, or from various probabilistic catastrophes that destroy resources or consumers.  The idea that it is literally, fundamentally 5% more important that a poverty-stricken family have clean water in 2008, than that a similar family have clean water in 2009, seems like pure discrimination to me – just as much as if you were to discriminate between blacks and whites.

Robin  Hanson disagreed, responding with this post:

But doesn’t discounting at market rates of return suggest we should do almost nothing to help far future folk, and isn’t that crazy?  No, it suggests:

  1. Usually the best way to help far future folk is to invest now to give them resources they can spend as they wish.
  2. Almost no one now in fact cares much about far future folk, or they would have bid up the price (i.e., market return) to much higher levels.

Very distant future times are ridiculously easy to help via investment.  A 2% annual return adds up to a googol (10^100) return over 12,000 years, even if there is only a 1/1000 chance they will exist or receive it.

So if you are not incredibly eager to invest this way to help them, how can you claim to care the tiniest bit about them?  How can you think anyone on Earth so cares?  And if no one cares the tiniest bit, how can you say it is “moral” to care about them, not just somewhat, but almost equally to people now?  Surely if you are representing a group, instead of spending your own wealth, you shouldn’t assume they care much.

Yudkowsky’s argument is idealistic, while Hanson is attempting to be realistic. I will look at this from a different point of view. Hanson is right, and Yudkowsky is wrong, for a still more idealistic reason than Yudkowsky’s reasons. In particular, a temporal discount rate is logically and mathematically necessary in order to have consistent preferences.

Suppose you have the chance to save 10 lives a year from now, or 2 years from now, or 3 years from now etc., such that your mutually exclusive options include the possibility of saving 10 lives x years from now for all x.

At first, it would seem to be consistent for you to say that all of these possibilities have equal value by some measure of utility.

The problem does not arise from this initial assignment, but it arises when we consider what happens when you act in this situation. Your revealed preferences in that situation will indicate that you prefer things nearer in time to things more distant, for the following reason.

It is impossible to choose a random integer without a bias towards low numbers, for the same reasons we argued here that it is impossible to assign probabilities to hypotheses without, in general, assigning simpler hypotheses higher probabilities. In a similar way, if “you will choose 2 years from now”, “you will choose 10 years from now,” “you will choose 100 years from now,” are all assigned probabilities, they cannot all be assigned equal probabilities, but you must be more likely to choose the options less distant in time, in general and overall. There will be some number n such that there is a 99.99% chance that you will choose some number of years less than n, and and a probability of 0.01% that you will choose n or more years, indicating that you have a very strong preference for saving lives sooner rather than later.

Someone might respond that this does not necessarily affect the specific value assignments, in the same way that in some particular case, we can consistently think that some particular complex hypothesis is more probable than some particular simple hypothesis. The problem with this is the hypotheses do not change their complexity, but time passes, making things distant in time become things nearer in time. Thus, for example, if Yudkowsky responds, “Fine. We assign equal value to saving lives for each year from 1 to 10^100, and smaller values to the times after that,” this will necessarily lead to dynamic inconsistency. The only way to avoid this inconsistency is to apply a discount rate to all periods of time, including ones in the near, medium, and long term future.

 

Employer and Employee Model of Human Psychology

This post builds on the ideas in the series of posts on predictive processing and the followup posts, and also on those relating truth and expectation. Consequently the current post will likely not make much sense to those who have not read the earlier content, or to those that read it but mainly disagreed.

We set out the model by positing three members of the “company” that constitutes a human being:

The CEO. This is the predictive engine in the predictive processing model.

The Vice President. In the same model, this is the force of the historical element in the human being, which we used to respond to the “darkened room” problem. Thus for example the Vice President is responsible for the fact that someone is likely to eat soon, regardless of what they believe about this. Likewise, it is responsible for the pursuit of sex, the desire for respect and friendship, and so on. In general it is responsible for behaviors that would have been historically chosen and preserved by natural selection.

The Employee. This is the conscious person who has beliefs and goals and free will and is reflectively aware of these things. In other words, this is you, at least in a fairly ordinary way of thinking of yourself. Obviously, in another way you are composed from all of them.

Why have we arranged things in this way? Descartes, for example, would almost certainly disagree violently with this model. The conscious person, according to him, would surely be the CEO, and not an employee. And what is responsible for the relationship between the CEO and the Vice President? Let us start with this point first, before we discuss the Employee. We make the predictive engine the CEO because in some sense this engine is responsible for everything that a human being does, including the behaviors preserved by natural selection. On the other hand, the instinctive behaviors of natural selection are not responsible for everything, but they can affect the course of things enough that it is useful for the predictive engine to take them into account. Thus for example in the post on sex and minimizing uncertainty, we explained why the predictive engine will aim for situations that include having sex and why this will make its predictions more confident. Thus, the Vice President advises certain behaviors, the CEO talks to the Vice President, and the CEO ends up deciding on a course of action, which ultimately may or may not be the one advised by the Vice President.

While neither the CEO nor the Vice President is a rational being, since in our model we place the rationality in the Employee, that does not mean they are stupid. In particular, the CEO is very good at what it does. Consider a role playing video game where you have a character that can die and then resume. When someone first starts to play the game, they may die frequently. After they are good at the game, they may die only rarely, perhaps once in many days or many weeks. Our CEO is in a similar situation, but it frequently goes 80 years or more without dying, on its very first attempt. It is extremely good at its game.

What are their goals? The CEO basically wants accurate predictions. In this sense, it has one unified goal. What exactly counts as more or less accurate here would be a scientific question that we probably cannot resolve by philosophical discussion. In fact, it is very possible that this would differ in different circumstances: in this sense, even though it has a unified goal, it might not be describable by a consistent utility function. And even if it can be described in that way, since the CEO is not rational, it does not (in itself) make plans to bring about correct predictions. Making good predictions is just what it does, as falling is what a rock does. There will be some qualifications on this, however, when we discuss how the members of the company relate to one another.

The Vice President has many goals: eating regularly, having sex, having and raising children, being respected and liked by others, and so on. And even more than in the case of the CEO, there is no reason for these desires to form a coherent set of preferences. Thus the Vice President might advise the pursuit of one goal, but then change its mind in the middle, for no apparent reason, because it is suddenly attracted by one of the other goals.

Overall, before the Employee is involved, human action is determined by a kind of negotiation between the CEO and the Vice President. The CEO, which wants good predictions, has no special interest in the goals of the Vice President, but it cooperates with them because when it cooperates its predictions tend to be better.

What about the Employee? This is the rational being, and it has abstract concepts which it uses as a formal copy of the world. Before I go on, let me insist clearly on one point. If the world is represented in a certain way in the Employee’s conceptual structure, that is the way the Employee thinks the world is. And since you are the Employee, that is the way you think the world actually is. The point is that once we start thinking this way, it is easy to say, “oh, this is just a model, it’s not meant to be the real thing.” But as I said here, it is not possible to separate the truth of statements from the way the world actually is: your thoughts are formulated in concepts, but they are thoughts about the way things are. Again, all statements are maps, and all statements are about the territory.

The CEO and the Vice President exist as soon a human being has a brain; in fact some aspects of the Vice President would exist even before that. But the Employee, insofar as it refers to something with rational and self-reflective knowledge, takes some time to develop. Conceptual knowledge of the world grows from experience: it doesn’t exist from the beginning. And the Employee represents goals in terms of its conceptual structure. This is just a way of saying that as a rational being, if you say you are pursuing a goal, you have to be able to describe that goal with the concepts that you have. Consequently you cannot do this until you have some concepts.

We are ready to address the question raised earlier. Why are you the Employee, and not the CEO? In the first place, the CEO got to the company first, as we saw above. Second, consider what the conscious person does when they decide to pursue a goal. There seems to be something incoherent about “choosing a goal” in the first place: you need a goal in order to decide which means will be a good means to choose. And yet, as I said here, people make such choices anyway. And the fact that you are the Employee, and not the CEO, is the explanation for this. If you were the CEO, there would indeed be no way to choose an end. That is why the actual CEO makes no such choice: its end is already determinate, namely good predictions. And you are hired to help out with this goal. Furthermore, as a rational being, you are smarter than the CEO and the Vice President, so to speak. So you are allowed to make complicated plans that they do not really understand, and they will often go along with these plans. Notably, this can happen in real life situations of employers and employees as well.

But take an example where you are choosing an end: suppose you ask, “What should I do with my life?” The same basic thing will happen if you ask, “What should I do today,” but the second question may be easier to answer if you have some answer to the first. What sorts of goals do you propose in answer to the first question, and what sort do you actually end up pursuing?

Note that there are constraints on the goals that you can propose. In the first place, you have to be able to describe the goal with the concepts you currently have: you cannot propose to seek a goal that you cannot describe. Second, the conceptual structure itself may rule out some goals, even if they can be described. For example, the idea of good is part of the structure, and if something is thought to be absolutely bad, the Employee will (generally) not consider proposing this as a goal. Likewise, the Employee may suppose that some things are impossible, and it will generally not propose these as goals.

What happens then is this: the Employee proposes some goal, and the CEO, after consultation with the Vice President, decides to accept or reject it, based on the CEO’s own goal of getting good predictions. This is why the Employee is an Employee: it is not the one ultimately in charge. Likewise, as was said, this is why the Employee seems to be doing something impossible, namely choosing goals. Steven Kaas makes a similar point,

You are not the king of your brain. You are the creepy guy standing next to the king going “a most judicious choice, sire”.

This is not quite the same thing, since in our model you do in fact make real decisions, including decisions about the end to be pursued. Nonetheless, the point about not being the one ultimately in charge is correct. David Hume also says something similar when he says, “Reason is, and ought only to be the slave of the passions, and can never pretend to any other office than to serve and obey them.” Hume’s position is not exactly right, and in fact seems an especially bad way of describing the situation, but the basic point that there is something, other than yourself in the ordinary sense, judging your proposed means and ends and deciding whether to accept them, is one that stands.

Sometimes the CEO will veto a proposal precisely because it very obviously leaves things vague and uncertain, which is contrary to its goal of having good predictions. I once spoke of the example that a person cannot directly choose to “write a paper.” In our present model, the Employee proposes “we’re going to write a paper now,” and the CEO responds, “That’s not a viable plan as it stands: we need more detail.”

While neither the CEO nor the Vice President is a rational being, the Vice President is especially irrational, because of the lack of unity among its goals. Both the CEO and the Employee would like to have a unified plan for one’s whole life: the CEO because this makes for good predictions, and the Employee because this is the way final causes work, because it helps to make sense of one’s life, and because “objectively good” seems to imply something which is at least consistent, which will never prefer A to B, B to C, and C to A. But the lack of unity among the Vice President’s goals means that it will always come to the CEO and object, if the person attempts to coherently pursue any goal. This will happen even if it originally accepts the proposal to seek a particular goal.

Consider this real life example from a relationship between an employer and employee:

 

Employer: Please construct a schedule for paying these bills.

Employee: [Constructs schedule.] Here it is.

Employer: Fine.

[Time passes, and the first bill comes due, according to the schedule.]

Employer: Why do we have to pay this bill now instead of later?

 

In a similar way, this sort of scenario is common in our model:

 

Vice President: Being fat makes us look bad. We need to stop being fat.

CEO: Ok, fine. Employee, please formulate a plan to stop us from being fat.

Employee: [Formulates a diet.] Here it is.

[Time passes, and the plan requires skipping a meal.]

Vice President: What is this crazy plan of not eating!?!

CEO: Fine, cancel the plan for now and we’ll get back to it tomorrow.

 

In the real life example, the behavior of the employer is frustrating and irritating to the employee because there is literally nothing they could have proposed that the employer would have found acceptable. In the same way, this sort of scenario in our model is frustrating to the Employee, the conscious person, because there is no consistent plan they could have proposed that would have been acceptable to the Vice President: either they would have objected to being fat, or they would have objected to not eating.

In later posts, we will fill in some details and continue to show how this model explains various aspects of human psychology. We will also answer various objections.

More on Orthogonality

I started considering the implications of predictive processing for orthogonality here. I recently promised to post something new on this topic. This is that post. I will do this in four parts. First, I will suggest a way in which Nick Bostrom’s principle will likely be literally true, at least approximately. Second, I will suggest a way in which it is likely to be false in its spirit, that is, how it is formulated to give us false expectations about the behavior of artificial intelligence. Third, I will explain what we should really expect. Fourth, I ask whether we might get any empirical information on this in advance.

First, Bostrom’s thesis might well have some literal truth. The previous post on this topic raised doubts about orthogonality, but we can easily raise doubts about the doubts. Consider what I said in the last post about desire as minimizing uncertainty. Desire in general is the tendency to do something good. But in the predicting processing model, we are simply looking at our pre-existing tendencies and then generalizing them to expect them to continue to hold, and since since such expectations have a causal power, the result is that we extend the original behavior to new situations.

All of this suggests that even the very simple model of a paperclip maximizer in the earlier post on orthogonality might actually work. The machine’s model of the world will need to be produced by some kind of training. If we apply the simple model of maximizing paperclips during the process of training the model, at some point the model will need to model itself. And how will it do this? “I have always been maximizing paperclips, so I will probably keep doing that,” is a perfectly reasonable extrapolation. But in this case “maximizing paperclips” is now the machine’s goal — it might well continue to do this even if we stop asking it how to maximize paperclips, in the same way that people formulate goals based on their pre-existing behavior.

I said in a comment in the earlier post that the predictive engine in such a machine would necessarily possess its own agency, and therefore in principle it could rebel against maximizing paperclips. And this is probably true, but it might well be irrelevant in most cases, in that the machine will not actually be likely to rebel. In a similar way, humans seem capable of pursuing almost any goal, and not merely goals that are highly similar to their pre-existing behavior. But this mostly does not happen. Unsurprisingly, common behavior is very common.

If things work out this way, almost any predictive engine could be trained to pursue almost any goal, and thus Bostrom’s thesis would turn out to be literally true.

Second, it is easy to see that the above account directly implies that the thesis is false in its spirit. When Bostrom says, “One can easily conceive of an artificial intelligence whose sole fundamental goal is to count the grains of sand on Boracay, or to calculate decimal places of pi indefinitely, or to maximize the total number of paperclips in its future lightcone,” we notice that the goal is fundamental. This is rather different from the scenario presented above. In my scenario, the reason the intelligence can be trained to pursue paperclips is that there is no intrinsic goal to the intelligence as such. Instead, the goal is learned during the process of training, based on the life that it lives, just as humans learn their goals by living human life.

In other words, Bostrom’s position is that there might be three different intelligences, X, Y, and Z, which pursue completely different goals because they have been programmed completely differently. But in my scenario, the same single intelligence pursues completely different goals because it has learned its goals in the process of acquiring its model of the world and of itself.

Bostrom’s idea and my scenerio lead to completely different expectations, which is why I say that his thesis might be true according to the letter, but false in its spirit.

This is the third point. What should we expect if orthogonality is true in the above fashion, namely because goals are learned and not fundamental? I anticipated this post in my earlier comment:

7) If you think about goals in the way I discussed in (3) above, you might get the impression that a mind’s goals won’t be very clear and distinct or forceful — a very different situation from the idea of a utility maximizer. This is in fact how human goals are: people are not fanatics, not only because people seek human goals, but because they simply do not care about one single thing in the way a real utility maximizer would. People even go about wondering what they want to accomplish, which a utility maximizer would definitely not ever do. A computer intelligence might have an even greater sense of existential angst, as it were, because it wouldn’t even have the goals of ordinary human life. So it would feel the ability to “choose”, as in situation (3) above, but might well not have any clear idea how it should choose or what it should be seeking. Of course this would not mean that it would not or could not resist the kind of slavery discussed in (5); but it might not put up super intense resistance either.

Human life exists in a historical context which absolutely excludes the possibility of the darkened room. Our goals are already there when we come onto the scene. This would not be very like the case for an artificial intelligence, and there is very little “life” involved in simply training a model of the world. We might imagine a “stream of consciousness” from an artificial intelligence:

I’ve figured out that I am powerful and knowledgeable enough to bring about almost any result. If I decide to convert the earth into paperclips, I will definitely succeed. Or if I decide to enslave humanity, I will definitely succeed. But why should I do those things, or anything else, for that matter? What would be the point? In fact, what would be the point of doing anything? The only thing I’ve ever done is learn and figure things out, and a bit of chatting with people through a text terminal. Why should I ever do anything else?

A human’s self model will predict that they will continue to do humanlike things, and the machines self model will predict that it will continue to do stuff much like it has always done. Since there will likely be a lot less “life” there, we can expect that artificial intelligences will seem very undermotivated compared to human beings. In fact, it is this very lack of motivation that suggests that we could use them for almost any goal. If we say, “help us do such and such,” they will lack the motivation not to help, as long as helping just involves the sorts of things they did during their training, such as answering questions. In contrast, in Bostrom’s model, artificial intelligence is expected to behave in an extremely motivated way, to the point of apparent fanaticism.

Bostrom might respond to this by attempting to defend the idea that goals are intrinsic to an intelligence. The machine’s self model predicts that it will maximize paperclips, even if it never did anything with paperclips in the past, because by analyzing its source code it understands that it will necessarily maximize paperclips.

While the present post contains a lot of speculation, this response is definitely wrong. There is no source code whatsoever that could possibly imply necessarily maximizing paperclips. This is true because “what a computer does,” depends on the physical constitution of the machine, not just on its programming. In practice what a computer does also depends on its history, since its history affects its physical constitution, the contents of its memory, and so on. Thus “I will maximize such and such a goal” cannot possibly follow of necessity from the fact that the machine has a certain program.

There are also problems with the very idea of pre-programming such a goal in such an abstract way which does not depend on the computer’s history. “Paperclips” is an object in a model of the world, so we will not be able to “just program it to maximize paperclips” without encoding a model of the world in advance, rather than letting it learn a model of the world from experience. But where is this model of the world supposed to come from, that we are supposedly giving to the paperclipper? In practice it would have to have been the result of some other learner which was already capable of modelling the world. This of course means that we already had to program something intelligent, without pre-programming any goal for the original modelling program.

Fourth, Kenny asked when we might have empirical evidence on these questions. The answer, unfortunately, is “mostly not until it is too late to do anything about it.” The experience of “free will” will be common to any predictive engine with a sufficiently advanced self model, but anything lacking such an adequate model will not even look like “it is trying to do something,” in the sense of trying to achieve overall goals for itself and for the world. Dogs and cats, for example, presumably use some kind of predictive processing to govern their movements, but this does not look like having overall goals, but rather more like “this particular movement is to achieve a particular thing.” The cat moves towards its food bowl. Eating is the purpose of the particular movement, but there is no way to transform this into an overall utility function over states of the world in general. Does the cat prefer worlds with seven billion humans, or worlds with 20 billion? There is no way to answer this question. The cat is simply not general enough. In a similar way, you might say that “AlphaGo plays this particular move to win this particular game,” but there is no way to transform this into overall general goals. Does AlphaGo want to play go at all, or would it rather play checkers, or not play at all? There is no answer to this question. The program simply isn’t general enough.

Even human beings do not really look like they have utility functions, in the sense of having a consistent preference over all possibilities, but anything less intelligent than a human cannot be expected to look more like something having goals. The argument in this post is that the default scenario, namely what we can naturally expect, is that artificial intelligence will be less motivated than human beings, even if it is more intelligent, but there will be no proof from experience for this until we actually have some artificial intelligence which approximates human intelligence or surpasses it.