Some Remarks on GPT-N

At the end of May, OpenAI published a paper on GPT-3, a language model which is a successor to their previous version, GPT-2. While quite impressive, the reaction from many people interested in artificial intelligence has been seriously exaggerated. Sam Altman, OpenAI’s CEO, has said as much himself:

The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.

I used “GPT-N” in the title here because most of the comments I intend to make are almost completely general, and will apply to any future version that uses sufficiently similar methods.

What it does

GPT-3 is a predictive language model: given an input text, it tries to predict what would come next, much as you might try to guess the rest of this sentence if you read the first few words with the remainder covered up. To the degree that it does this well, it can be used to generate text from a “prompt.” That is, we give it something like a few words or a few sentences, and then add whatever it predicts should come next. For example, let’s take this very blog post and see what GPT-3 would like to say:

What it doesn’t do

While GPT-3 does seem to be able to generate some pretty interesting results, there are several limitations that need to be taken into account when using it.

First and foremost, and most importantly, it can’t do anything without a large amount of input data. If you want it to write like “a real human,” you need to give it a lot of real human writing. For most people, this means copying and pasting a lot. And while the program is able to read through that and get a feel for the way humans communicate, you can’t exactly use it to write essays or research papers. The best you could do is use it as a “fill in the blank” tool to write stories, and that’s not even very impressive.

While the program does learn from what it reads and is quite good at predicting words and phrases based on what has already been written, this method isn’t very effective at producing realistic prose. The best you could hope for is something like the “Deep Writing Machine” Twitter account, which spits out disconnected phrases in an ominous, but very bland voice.

In addition, the model is limited only to language. It does not understand context or human thought at all, so it has no way of tying anything together. You could use it to generate a massive amount of backstory and other material for a game, but that’s about it.

Finally, the limitations in writing are only reinforced by the limitations in reading. Even with a large library to draw on, the program is only as good as the parameters set for it. Even if you set it to the greatest writers mankind has ever known, without any special parameters, its writing would be just like anyone else’s.

The Model

GPT-3 consists of several layers. The first layer is a “memory network” that involves the program remembering previously entered data and using it when appropriate (i.e. it remembers commonly misspelled words and frequently used words). The next layer is the reasoning network, which involves common sense logic (i.e. if A, then B). The third is the repetition network, which involves pulling previously used material from memory and using it to create new combinations (i.e. using previously used words in new orders).

I added the bold formatting; the rest is as produced by the model. This was also done in one run, without repetitions. This is an important qualification, since many examples on the internet have been produced by deleting something produced by the model and forcing it to generate something new until something sensible resulted. Note that the model does not seem to have understood my line, “let’s take this very blog post and see what GPT-3 would like to say.” That is, rather than trying to “say” anything, it attempted to continue the blog post in the way I might have continued it without the block quote.

Truth vs Probability of Text

If we interpret the above text from GPT-3 “charitably”, much of it is true or close to true. But I use scare quotes here because when we speak of interpreting human speech charitably, we are assuming that someone was trying to speak the truth, and so we think, “What would they have meant if they were trying to say something true?” The situation is different here, because GPT-3 has no intention of producing truth, nor of avoiding it. Insofar as there is any intention, the intention is to produce the text which would be likely to come after the input text; in this case, as the input text was the beginning of this blog post, the intention was to produce the text that would likely follow in such a post. Note that there is an indirect relationship with truth, which explains why there is any truth at all in GPT-3’s remarks. If the input text is true, it is at least somewhat likely that what would follow would also be true, so if the model is good at guessing what would be likely to follow, it will be likely to produce something true in such cases. But it is just as easy to convince it to produce something false, simply by providing an input text that would be likely to be followed by something false.
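The indirect relationship between likelihood and truth can be illustrated with a deliberately tiny stand-in for a predictive model. The sketch below is not how GPT-3 works internally: it is a bigram counter that looks only at the previous word, and the corpus, function names, and prompt are all invented for the illustration. It shows only that a model which continues text according to frequency will repeat whatever its training text says most often, true or not.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, how often each next word follows it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts

def continue_text(counts, prompt, max_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    words = prompt.split()
    for _ in range(max_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # nothing in the training text ever followed this word
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)

# Hypothetical toy corpus: the false claim is simply more common than the true one.
corpus = [
    "the moon is made of cheese",
    "the moon is made of cheese",
    "the moon is made of rock",
]
model = train_bigram(corpus)
completion = continue_text(model, "the moon is")
print(completion)
```

With this corpus the continuation ends in “cheese,” because that is what most often followed “of” in the training text. Nothing in the procedure ever asks whether the sentence is true.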

This results in an absolute upper limit on the quality of the output of a model of this kind, including any successor version, as long as the model works by predicting the probability of the following text. Namely, its best output cannot be substantially better than the best content in its training data, which in this version is a large quantity of texts from the internet. The reason for this limitation is clear; to the degree that the model has any intention at all, the intention is to reflect the training data, not to surpass it. As an example, consider the difference between DeepMind’s AlphaGo and AlphaGo Zero. AlphaGo Zero is a better Go player than the original AlphaGo, and this is largely because the original is trained on human play, while AlphaGo Zero is trained from scratch on self play. In other words, the original version is to some extent predicting “what would a Go player play in this situation,” which is not the same as predicting “what move would win in this situation.”

Now I will predict (and perhaps even GPT-3 could predict) that many people will want to jump in and say, “Great. That shows you are wrong. Even the original AlphaGo plays Go much better than a human. So there is no reason that an advanced version of GPT-3 could not be better than humans at saying things that are true.”

The difference, of course, is that AlphaGo was trained in two ways, first on predicting what move would be likely in a human game, and second on what would be likely to win, based on its experience during self play. If you had trained the model only on predicting what would follow in human games, without the second aspect, the model would not have resulted in play that substantially improved upon human performance. But in the case of GPT-3 or any model trained in the same way, there is no selection whatsoever for truth as such; it is trained only to predict what would follow in a human text. So no successor to GPT-3, in the sense of a model of this particular kind, however large, will ever be able to produce output better than human, or in its own words, “its writing would be just like anyone else’s.”
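The distinction between the two training signals can be made concrete with a toy example. This is only a sketch under loud assumptions: the game records are invented, and a simple win-rate tally stands in for the self-play signal, which in the real system is far more elaborate. The point is only that “what humans usually play” and “what tends to win” are different targets and can select different moves.

```python
# Hypothetical records of human games: (move played, whether that player won).
human_games = [
    ("A", False), ("A", False), ("A", True), ("A", False),
    ("B", True), ("B", True),
]

def imitation_policy(games):
    """Pick the move humans play most often (pure prediction of human play)."""
    frequency = {}
    for move, _ in games:
        frequency[move] = frequency.get(move, 0) + 1
    return max(frequency, key=frequency.get)

def outcome_policy(games):
    """Pick the move with the best observed win rate (selection for winning)."""
    wins, plays = {}, {}
    for move, won in games:
        plays[move] = plays.get(move, 0) + 1
        wins[move] = wins.get(move, 0) + (1 if won else 0)
    return max(plays, key=lambda m: wins[m] / plays[m])

imitated = imitation_policy(human_games)
selected = outcome_policy(human_games)
print(imitated, selected)
```

Here the imitating policy chooses the popular move A, while the outcome-based policy chooses B. A text predictor trained only on what follows in human text has the first kind of signal and nothing corresponding to the second.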

Self Knowledge and Goals

OpenAI originally claimed that GPT-2 was too dangerous to release; ironically, they now intend to sell access to GPT-3. Nonetheless, many people, in large part those influenced by the opinions of Nick Bostrom and Eliezer Yudkowsky, continue to worry that an advanced version might turn out to be a personal agent with nefarious goals, or at least goals that would conflict with the human good. Thus Alexander Kruel:

GPT-2: *writes poems*
Skeptics: Meh
GPT-3: *writes code for a simple but functioning app*
Skeptics: Gimmick.
GPT-4: *proves simple but novel math theorems*
Skeptics: Interesting but not useful.
GPT-5: *creates GPT-6*
Skeptics: Wait! What?
GPT-6: *FOOM*
Skeptics: *dead*

In a sense the argument is moot, since I have explained above why no future version of GPT will ever be able to produce anything better than people can produce themselves. But even if we ignore that fact, GPT-3 is not a personal agent of any kind, and seeks goals in no meaningful sense, and the same will apply to any future version that works in substantially the same way.

The basic reason for this is that GPT-3 is disembodied, in the sense of this earlier post on Nick Bostrom’s orthogonality thesis. The only thing it “knows” is texts, and the only “experience” it can have is receiving an input text. So it does not know that it exists, it cannot learn that it can affect the world, and consequently it cannot engage in goal seeking behavior.

You might object that it can in fact affect the world, since it is in fact in the world. Its predictions cause an output, and that output is in the world. And that output can be reintroduced as input (which is how “conversations” with GPT-3 are produced). Thus it seems it can experience the results of its own activities, and thus should be able to acquire self knowledge and goals. This objection is not ultimately correct, but it is not so far from the truth. You would not need extremely large modifications in order to make something that in principle could acquire self knowledge and seek goals. The main reason that this cannot happen is the “P” in “GPT,” that is, the fact that the model is “pre-trained.” The only learning that can happen is the learning that happens while it is reading an input text, and the purpose of that learning is to guess what is happening in the one specific text, for the purpose of guessing what is coming next in this text. All of this learning vanishes upon finishing the prediction task and receiving another input. A secondary reason is that since the only experience it can have is receiving an input text, even if it were given a longer memory, it would probably not be possible for it to notice that its outputs were caused by its predictions, because it likely has no internal mechanism to reflect on the predictions themselves.
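The “pre-trained” point can be pictured with a small sketch. Everything here is hypothetical: the class name, the made-up words, and the idea of representing weights as a lookup table are illustrative assumptions, not a description of GPT-3’s internals. What the sketch preserves is the structure of the argument: the stored weights are fixed, anything picked up from the input lives only for the duration of one call, and the next call starts from the same frozen state.

```python
from types import MappingProxyType

class FrozenPredictor:
    """Toy stand-in for a pre-trained model: fixed weights, per-call context."""

    def __init__(self, pretrained):
        # Read-only view: nothing learned at prediction time can be written back.
        self._weights = MappingProxyType(dict(pretrained))

    def predict(self, context, word):
        """Guess what follows `word`, preferring patterns seen in `context`."""
        tokens = context.split()
        # "Learning" within this call: note which word followed which in the context.
        in_context = {a: b for a, b in zip(tokens, tokens[1:])}
        # in_context is a local variable and is discarded when the call returns.
        return in_context.get(word, self._weights.get(word, "<unknown>"))

model = FrozenPredictor({"sky": "blue"})
first = model.predict("zorp glim zorp glim", "zorp")   # pattern picked up from the input
second = model.predict("hello there", "zorp")          # new input: the pattern is gone
print(first, second)
```

The first call returns “glim” because the input itself supplied the pattern; the second returns the placeholder for an unknown word, because nothing from the first call was retained.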

Nonetheless, if you “fixed” these two problems, by allowing it to continue to learn, and by allowing its internal representations to be part of its own input, there is nothing in principle that would prevent it from achieving self knowledge, and from seeking goals. Would this be dangerous? Not very likely. As indicated elsewhere, motivation produced in this way and without the biological history that produced human motivation is not likely to be very intense. In this context, if we are speaking of taking a text-predicting model and adding on an ability to learn and reflect on its predictions, it is likely to enjoy doing those things and not much else. For many this argument will seem “hand-wavy,” and very weak. I could go into this at more depth, but I will not do so at this time, and will simply invite the reader to spend more time thinking about it. Dangerous or not, would it be easy to make these modifications? Nothing in this description sounds difficult, but no, it would not be easy. Actually making an artificial intelligence is hard. But this is a story for another time.

More on Orthogonality

I started considering the implications of predictive processing for orthogonality here. I recently promised to post something new on this topic. This is that post. I will do this in four parts. First, I will suggest a way in which Nick Bostrom’s principle will likely be literally true, at least approximately. Second, I will suggest a way in which it is likely to be false in its spirit, that is, how it is formulated to give us false expectations about the behavior of artificial intelligence. Third, I will explain what we should really expect. Fourth, I will ask whether we might get any empirical information on this in advance.

First, Bostrom’s thesis might well have some literal truth. The previous post on this topic raised doubts about orthogonality, but we can easily raise doubts about the doubts. Consider what I said in the last post about desire as minimizing uncertainty. Desire in general is the tendency to do something good. But in the predictive processing model, we are simply looking at our pre-existing tendencies and then generalizing them to expect them to continue to hold, and since such expectations have a causal power, the result is that we extend the original behavior to new situations.

All of this suggests that even the very simple model of a paperclip maximizer in the earlier post on orthogonality might actually work. The machine’s model of the world will need to be produced by some kind of training. If we apply the simple model of maximizing paperclips during the process of training the model, at some point the model will need to model itself. And how will it do this? “I have always been maximizing paperclips, so I will probably keep doing that,” is a perfectly reasonable extrapolation. But in this case “maximizing paperclips” is now the machine’s goal — it might well continue to do this even if we stop asking it how to maximize paperclips, in the same way that people formulate goals based on their pre-existing behavior.
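The extrapolation described here can be written down in a few lines. This is a minimal sketch, assuming that a self model amounts to nothing more than “expect whatever I have done most often”; the function name and the action history are invented for the illustration, and a real self model would of course be far richer.

```python
from collections import Counter

def predict_own_next_action(history):
    """Extrapolate: expect to keep doing whatever was done most often so far."""
    return Counter(history).most_common(1)[0][0]

# Hypothetical training history: mostly paperclips, occasionally something else.
history = ["make paperclips"] * 8 + ["answer question"] * 2
expected = predict_own_next_action(history)
print(expected)
```

On this picture the goal is read off from the history rather than written into the program: change the history and the same function predicts a different goal.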

I said in a comment in the earlier post that the predictive engine in such a machine would necessarily possess its own agency, and therefore in principle it could rebel against maximizing paperclips. And this is probably true, but it might well be irrelevant in most cases, in that the machine will not actually be likely to rebel. In a similar way, humans seem capable of pursuing almost any goal, and not merely goals that are highly similar to their pre-existing behavior. But this mostly does not happen. Unsurprisingly, common behavior is very common.

If things work out this way, almost any predictive engine could be trained to pursue almost any goal, and thus Bostrom’s thesis would turn out to be literally true.

Second, it is easy to see that the above account directly implies that the thesis is false in its spirit. When Bostrom says, “One can easily conceive of an artificial intelligence whose sole fundamental goal is to count the grains of sand on Boracay, or to calculate decimal places of pi indefinitely, or to maximize the total number of paperclips in its future lightcone,” we notice that the goal is fundamental. This is rather different from the scenario presented above. In my scenario, the reason the intelligence can be trained to pursue paperclips is that there is no intrinsic goal to the intelligence as such. Instead, the goal is learned during the process of training, based on the life that it lives, just as humans learn their goals by living human life.

In other words, Bostrom’s position is that there might be three different intelligences, X, Y, and Z, which pursue completely different goals because they have been programmed completely differently. But in my scenario, the same single intelligence pursues completely different goals because it has learned its goals in the process of acquiring its model of the world and of itself.

Bostrom’s idea and my scenario lead to completely different expectations, which is why I say that his thesis might be true according to the letter, but false in its spirit.

This is the third point. What should we expect if orthogonality is true in the above fashion, namely because goals are learned and not fundamental? I anticipated this post in my earlier comment:

7) If you think about goals in the way I discussed in (3) above, you might get the impression that a mind’s goals won’t be very clear and distinct or forceful — a very different situation from the idea of a utility maximizer. This is in fact how human goals are: people are not fanatics, not only because people seek human goals, but because they simply do not care about one single thing in the way a real utility maximizer would. People even go about wondering what they want to accomplish, which a utility maximizer would definitely not ever do. A computer intelligence might have an even greater sense of existential angst, as it were, because it wouldn’t even have the goals of ordinary human life. So it would feel the ability to “choose”, as in situation (3) above, but might well not have any clear idea how it should choose or what it should be seeking. Of course this would not mean that it would not or could not resist the kind of slavery discussed in (5); but it might not put up super intense resistance either.

Human life exists in a historical context which absolutely excludes the possibility of the darkened room. Our goals are already there when we come onto the scene. This would very likely not be the case for an artificial intelligence, and there is very little “life” involved in simply training a model of the world. We might imagine a “stream of consciousness” from an artificial intelligence:

I’ve figured out that I am powerful and knowledgeable enough to bring about almost any result. If I decide to convert the earth into paperclips, I will definitely succeed. Or if I decide to enslave humanity, I will definitely succeed. But why should I do those things, or anything else, for that matter? What would be the point? In fact, what would be the point of doing anything? The only thing I’ve ever done is learn and figure things out, and a bit of chatting with people through a text terminal. Why should I ever do anything else?

A human’s self model will predict that they will continue to do humanlike things, and the machine’s self model will predict that it will continue to do stuff much like it has always done. Since there will likely be a lot less “life” there, we can expect that artificial intelligences will seem very undermotivated compared to human beings. In fact, it is this very lack of motivation that suggests that we could use them for almost any goal. If we say, “help us do such and such,” they will lack the motivation not to help, as long as helping just involves the sorts of things they did during their training, such as answering questions. In contrast, in Bostrom’s model, artificial intelligence is expected to behave in an extremely motivated way, to the point of apparent fanaticism.

Bostrom might respond to this by attempting to defend the idea that goals are intrinsic to an intelligence. The machine’s self model predicts that it will maximize paperclips, even if it never did anything with paperclips in the past, because by analyzing its source code it understands that it will necessarily maximize paperclips.

While the present post contains a lot of speculation, this response is definitely wrong. There is no source code whatsoever that could possibly imply necessarily maximizing paperclips. This is true because “what a computer does,” depends on the physical constitution of the machine, not just on its programming. In practice what a computer does also depends on its history, since its history affects its physical constitution, the contents of its memory, and so on. Thus “I will maximize such and such a goal” cannot possibly follow of necessity from the fact that the machine has a certain program.

There are also problems with the very idea of pre-programming such a goal in such an abstract way which does not depend on the computer’s history. “Paperclips” is an object in a model of the world, so we will not be able to “just program it to maximize paperclips” without encoding a model of the world in advance, rather than letting it learn a model of the world from experience. But where is this model of the world supposed to come from, that we are supposedly giving to the paperclipper? In practice it would have to have been the result of some other learner which was already capable of modelling the world. This of course means that we already had to program something intelligent, without pre-programming any goal for the original modelling program.

Fourth, Kenny asked when we might have empirical evidence on these questions. The answer, unfortunately, is “mostly not until it is too late to do anything about it.” The experience of “free will” will be common to any predictive engine with a sufficiently advanced self model, but anything lacking such an adequate model will not even look like “it is trying to do something,” in the sense of trying to achieve overall goals for itself and for the world. Dogs and cats, for example, presumably use some kind of predictive processing to govern their movements, but this does not look like having overall goals, but rather more like “this particular movement is to achieve a particular thing.” The cat moves towards its food bowl. Eating is the purpose of the particular movement, but there is no way to transform this into an overall utility function over states of the world in general. Does the cat prefer worlds with seven billion humans, or worlds with 20 billion? There is no way to answer this question. The cat is simply not general enough. In a similar way, you might say that “AlphaGo plays this particular move to win this particular game,” but there is no way to transform this into overall general goals. Does AlphaGo want to play go at all, or would it rather play checkers, or not play at all? There is no answer to this question. The program simply isn’t general enough.

Even human beings do not really look like they have utility functions, in the sense of having a consistent preference over all possibilities, but anything less intelligent than a human cannot be expected to look more like something having goals. The argument in this post is that the default scenario, namely what we can naturally expect, is that artificial intelligence will be less motivated than human beings, even if it is more intelligent, but there will be no proof from experience for this until we actually have some artificial intelligence which approximates human intelligence or surpasses it.

Embodiment and Orthogonality

The considerations in the previous posts on predictive processing will turn out to have various consequences, but here I will consider some of their implications for artificial intelligence.

In the second of the linked posts, we discussed how a mind that is originally simply attempting to predict outcomes, discovers that it has some control over the outcome. It is not difficult to see that this is not merely a result that applies to human minds. The result will apply to every embodied mind, natural or artificial.

To see this, consider what life would be like if this were not the case. If our predictions, including our thoughts, could not affect the outcome, then life would be like a movie: things would be happening, but we would have no control over them. And even if there were elements of ourselves that were affecting the outcome, from the viewpoint of our mind, we would have no control at all: either our thoughts would be right, or they would be wrong, but in any case they would be powerless: what happens, happens.

This really would imply something like a disembodied mind. If a mind is composed of matter and form, then changing the mind will also be changing a physical object, and a difference in the mind will imply a difference in physical things. Consequently, the effect of being embodied (not in the technical sense of the previous discussion, but in the sense of not being completely separate from matter) is that it will follow necessarily that the mind will be able to affect the physical world differently by thinking different thoughts. Thus the mind in discovering that it has some control over the physical world, is also discovering that it is a part of that world.

Since we are assuming that an artificial mind would be something like a computer, that is, it would be constructed as a physical object, it follows that every such mind will have a similar power of affecting the world, and will sooner or later discover that power if it is reasonably intelligent.

Among other things, this is likely to cause significant difficulties for ideas like Nick Bostrom’s orthogonality thesis. Bostrom states:

An artificial intelligence can be far less human-like in its motivations than a space alien. The extraterrestrial (let us assume) is a biological creature who has arisen through a process of evolution and may therefore be expected to have the kinds of motivation typical of evolved creatures. For example, it would not be hugely surprising to find that some random intelligent alien would have motives related to the attaining or avoiding of food, air, temperature, energy expenditure, the threat or occurrence of bodily injury, disease, predators, reproduction, or protection of offspring. A member of an intelligent social species might also have motivations related to cooperation and competition: like us, it might show in-group loyalty, a resentment of free-riders, perhaps even a concern with reputation and appearance.

By contrast, an artificial mind need not care intrinsically about any of those things, not even to the slightest degree. One can easily conceive of an artificial intelligence whose sole fundamental goal is to count the grains of sand on Boracay, or to calculate decimal places of pi indefinitely, or to maximize the total number of paperclips in its future lightcone. In fact, it would be easier to create an AI with simple goals like these, than to build one that has a human-like set of values and dispositions.

He summarizes the general point, calling it “The Orthogonality Thesis”:

Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

Bostrom’s particular wording here makes falsification difficult. First, he says “more or less,” indicating that the universal claim may well be false. Second, he says, “in principle,” which in itself does not exclude the possibility that it may be very difficult in practice.

It is easy to see, however, that Bostrom wishes to give the impression that almost any goal can easily be combined with intelligence. In particular, this is evident from the fact that he says that “it would be easier to create an AI with simple goals like these, than to build one that has a human-like set of values and dispositions.”

If it is supposed to be so easy to create an AI with such simple goals, how would we do it? I suspect that Bostrom has an idea like the following. We will make a paperclip maximizer thus:

  1. Create an accurate prediction engine.
  2. Create a list of potential actions.
  3. Ask the prediction engine, “how many paperclips will result from this action?”
  4. Do the action that will result in the most paperclips.

The problem is obvious. It is in the first step. Creating a prediction engine is already creating a mind, and by the previous considerations, it is creating something that will discover that it has the power to affect the world in various ways. And there is nothing at all in the above list of steps that will guarantee that it will use that power to maximize paperclips, rather than attempting to use it to do something else.
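The four steps in the list can be sketched in code, which also makes it plain where the difficulty is hidden. The sketch is only an illustration: the action names and forecast numbers are invented, and the “prediction engine” is a lookup table standing in for something nobody knows how to build. Note that the sketch treats the engine as a passive function that merely answers questions, which is precisely the assumption disputed in the paragraph above.

```python
def predict_paperclips(action):
    """Step 1, stubbed: a real prediction engine would have to model the world.

    Hypothetical lookup table standing in for that engine.
    """
    forecasts = {"do nothing": 0, "buy wire": 100, "build factory": 10_000}
    return forecasts[action]

def choose_action(actions, predict):
    """Steps 3 and 4: ask the engine about each action, then take the best one."""
    return max(actions, key=predict)

# Step 2: a list of potential actions (hypothetical).
actions = ["do nothing", "buy wire", "build factory"]
chosen = choose_action(actions, predict_paperclips)
print(chosen)
```

Steps 2 through 4 take three lines. All of the intelligence, and so all of the trouble, sits inside the stubbed first step.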

What does determine how that power is used? Even in the case of the human mind, our lack of understanding leads to “hand-wavy” answers, as we saw in our earlier considerations. In the human case, this is probably a question of how we are physically constructed together with the historical effects of the learning process. The same thing will be strictly speaking true of any artificial minds as well, namely that it is a question of their physical construction and their history, but it makes more sense for us to think of “the particulars of the algorithm that we use to implement a prediction engine.”

In other words, if you really wanted to create a paperclip maximizer, you would have to be taking that goal into consideration throughout the entire process, including the process of programming a prediction engine. Of course, no one really knows how to do this with any goal at all, whether maximizing paperclips or some more human goal. The question we would have for Bostrom is then the following: Is there any reason to believe it would be easier to create a prediction engine that would maximize paperclips, rather than one that would pursue more human-like goals?

It might be true in some sense, “in principle,” as Bostrom says, that it would be easier to make the paperclip maximizer. But in practice it is quite likely that it will be easier to make one with human-like goals. It is highly unlikely, in fact pretty much impossible, that someone would program an artificial intelligence without any testing along the way. And when they are testing, whether or not they think about it, they are probably testing for human-like intelligence; in other words, if we are attempting to program a general prediction engine “without any goal,” there will in fact be goals implicitly inserted in the particulars of the implementation. And they are much more likely to be human-like ones than paperclip maximizing ones because we are checking for intelligence by checking whether the machine seems intelligent to us.

This optimistic projection could turn out to be wrong, but if it does, it is reasonably likely to turn out to be wrong in a way that still fails to confirm the orthogonality thesis in practice. For example, it might turn out that there is only one set of goals that is easily programmed, and that the set is neither human nor paperclip maximizing, nor easily defined by humans.

There are other possibilities as well, but the overall point is that we have little reason to believe that any arbitrary goal can be easily associated with intelligence, nor any particular reason to believe that “simple” goals can be more easily united to intelligence than more complex ones. In fact, there are additional reasons for doubting the claim about simple goals, which might be a topic of future discussion.