Some Remarks on GPT-N

At the end of May, OpenAI published a paper on GPT-3, a language model which is the successor to their previous version, GPT-2. While the model is quite impressive, the reaction from many people interested in artificial intelligence has been seriously exaggerated. Sam Altman, OpenAI’s CEO, has said as much himself:

The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.

I used “GPT-N” in the title here because most of the comments I intend to make are almost completely general, and will apply to any future version that uses sufficiently similar methods.

What it does

GPT-3 is a predictive language model, that is, given an input text it tries to predict what would come next, much in the way that if you read the first few words of this sentence with the rest covered up, you might try to guess what would be likely to come next. To the degree that it does this well, it can be used to generate text from a “prompt,” that is, we give it something like a few words or a few sentences, and then add whatever it predicts should come next. For example, let’s take this very blog post and see what GPT-3 would like to say:

What it doesn’t do

While GPT-3 does seem to be able to generate some pretty interesting results, there are several limitations that need to be taken into account when using it.

First and foremost, and most importantly, it can’t do anything without a large amount of input data. If you want it to write like “a real human,” you need to give it a lot of real human writing. For most people, this means copying and pasting a lot. And while the program is able to read through that and get a feel for the way humans communicate, you can’t exactly use it to write essays or research papers. The best you could do is use it as a “fill in the blank” tool to write stories, and that’s not even very impressive.

While the program does learn from what it reads and is quite good at predicting words and phrases based on what has already been written, this method isn’t very effective at producing realistic prose. The best you could hope for is something like the “Deep Writing Machine” Twitter account, which spits out disconnected phrases in an ominous, but very bland voice.

In addition, the model is limited only to language. It does not understand context or human thought at all, so it has no way of tying anything together. You could use it to generate a massive amount of backstory and other material for a game, but that’s about it.

Finally, the limitations in writing are only reinforced by the limitations in reading. Even with a large library to draw on, the program is only as good as the parameters set for it. Even if you set it to the greatest writers mankind has ever known, without any special parameters, its writing would be just like anyone else’s.

The Model

GPT-3 consists of several layers. The first layer is a “memory network” that involves the program remembering previously entered data and using it when appropriate (i.e. it remembers commonly misspelled words and frequently used words). The next layer is the reasoning network, which involves common sense logic (i.e. if A, then B). The third is the repetition network, which involves pulling previously used material from memory and using it to create new combinations (i.e. using previously used words in new orders).

I added the bold formatting, the rest is as produced by the model. This was also done in one run, without repetitions. This is an important qualification, since many examples on the internet have been produced by deleting something produced by the model and forcing it to generate something new until something sensible resulted. Note that the model does not seem to have understood my line, “let’s take this very blog post and see what GPT-3 would like to say.” That is, rather than trying to “say” anything, it attempted to continue the blog post in the way I might have continued it without the block quote.
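
For readers who want a more concrete picture of what “adding whatever it predicts should come next” involves, here is a minimal sketch of the generation loop. The bigram table is a toy stand-in of my own for the real network, which conditions on a much longer context, but the loop has the same shape: predict a distribution over the next token, sample from it, append, and repeat.

    import random
    from collections import Counter, defaultdict

    # Toy "training": count which word follows which in a tiny corpus.
    corpus = "the cat sat on the mat and the dog sat on the rug".split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1

    def next_token_distribution(tokens):
        # Probability of each candidate next token, given (here) only the last one.
        options = counts[tokens[-1]]
        total = sum(options.values())
        return {tok: c / total for tok, c in options.items()}

    def generate(prompt_tokens, n_new_tokens):
        tokens = list(prompt_tokens)
        for _ in range(n_new_tokens):
            dist = next_token_distribution(tokens)
            if not dist:  # no continuation was ever seen in training
                break
            candidates, weights = zip(*dist.items())
            # Sample in proportion to predicted probability; nothing here asks
            # whether the continuation is true, only whether it is likely as text.
            tokens.append(random.choices(candidates, weights=weights)[0])
        return " ".join(tokens)

    print(generate(["the"], 8))

Nothing in this loop consults the truth of what is being said; the only criterion is what is likely as text, which is the point of the next section.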

Truth vs Probability of Text

If we interpret the above text from GPT-3 “charitably”, much of it is true or close to true. But I use scare quotes here because when we speak of interpreting human speech charitably, we are assuming that someone was trying to speak the truth, and so we think, “What would they have meant if they were trying to say something true?” The situation is different here, because GPT-3 has no intention of producing truth, nor of avoiding it. Insofar as there is any intention, the intention is to produce the text which would be likely to come after the input text; in this case, as the input text was the beginning of this blog post, the intention was to produce the text that would likely follow in such a post. Note that there is an indirect relationship with truth, which explains why there is any truth at all in GPT-3’s remarks. If the input text is true, it is at least somewhat likely that what would follow would also be true, so if the model is good at guessing what would be likely to follow, it will be likely to produce something true in such cases. But it is just as easy to convince it to produce something false, simply by providing an input text that would be likely to be followed by something false.

This results in an absolute upper limit on the quality of the output of a model of this kind, including any successor version, as long as the model works by predicting the probability of the following text. Namely, its best output cannot be substantially better than the best content in its training data, which in this version is a large quantity of texts from the internet. The reason for this limitation is clear; to the degree that the model has any intention at all, the intention is to reflect the training data, not to surpass it. As an example, consider the difference between DeepMind’s AlphaGo and AlphaGo Zero. AlphaGo Zero is a better Go player than the original AlphaGo, and this is largely because the original is trained on human play, while AlphaGo Zero is trained from scratch on self play. In other words, the original version is to some extent predicting “what would a Go player play in this situation,” which is not the same as predicting “what move would win in this situation.”

Now I will predict (and perhaps even GPT-3 could predict) that many people will want to jump in and say, “Great. That shows you are wrong. Even the original AlphaGo plays Go much better than a human. So there is no reason that an advanced version of GPT-3 could not be better than humans at saying things that are true.”

The difference, of course, is that AlphaGo was trained in two ways, first on predicting what move would be likely in a human game, and second on what would be likely to win, based on its experience during self play. If you had trained the model only on predicting what would follow in human games, without the second aspect, the model would not have resulted in play that substantially improved upon human performance. But in the case of GPT-3 or any model trained in the same way, there is no selection whatsoever for truth as such; it is trained only to predict what would follow in a human text. So no successor to GPT-3, in the sense of a model of this particular kind, however large, will ever be able to produce output better than human, or in its own words, “its writing would be just like anyone else’s.”

Self Knowledge and Goals

OpenAI originally claimed that GPT-2 was too dangerous to release; ironically, they now intend to sell access to GPT-3. Nonetheless, many people, in large part those influenced by the opinions of Nick Bostrom and Eliezer Yudkowsky, continue to worry that an advanced version might turn out to be a personal agent with nefarious goals, or at least goals that would conflict with the human good. Thus Alexander Kruel:

GPT-2: *writes poems*
Skeptics: Meh
GPT-3: *writes code for a simple but functioning app*
Skeptics: Gimmick.
GPT-4: *proves simple but novel math theorems*
Skeptics: Interesting but not useful.
GPT-5: *creates GPT-6*
Skeptics: Wait! What?
GPT-6: *FOOM*
Skeptics: *dead*

In a sense the argument is moot, since I have explained above why no future version of GPT will ever be able to produce anything better than people can produce themselves. But even if we ignore that fact, GPT-3 is not a personal agent of any kind, and seeks goals in no meaningful sense, and the same will apply to any future version that works in substantially the same way.

The basic reason for this is that GPT-3 is disembodied, in the sense of this earlier post on Nick Bostrom’s orthogonality thesis. The only thing it “knows” is texts, and the only “experience” it can have is receiving an input text. So it does not know that it exists, it cannot learn that it can affect the world, and consequently it cannot engage in goal seeking behavior.

You might object that it can in fact affect the world, since it is in fact in the world. Its predictions cause an output, and that output is in the world. And that output can be reintroduced as input (which is how “conversations” with GPT-3 are produced). Thus it seems it can experience the results of its own activities, and thus should be able to acquire self knowledge and goals. This objection is not ultimately correct, but it is not so far from the truth. You would not need extremely large modifications in order to make something that in principle could acquire self knowledge and seek goals. The main reason that this cannot happen is the “P” in “GPT,” that is, the fact that the model is “pre-trained.” The only learning that can happen is the learning that happens while it is reading an input text, and the purpose of that learning is to work out what is happening in that one specific text, in order to guess what comes next in it. All of this learning vanishes upon finishing the prediction task and receiving another input. A secondary reason is that since the only experience it can have is receiving an input text, even if it were given a longer memory, it would probably not be possible for it to notice that its outputs were caused by its predictions, because it likely has no internal mechanism to reflect on the predictions themselves.
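
To make the point about the “P” more concrete, here is a toy sketch of my own (the names are invented, and this is not OpenAI’s actual interface): the weights are fixed before the model ever sees a prompt, and whatever it works out about one prompt is discarded before the next one arrives.

    # Invented names for illustration; not the real system or its API.
    FROZEN_WEIGHTS = {"the sky is": "blue", "grass is": "green"}  # fixed once by pre-training

    def complete(prompt):
        # working_state is everything the model "figures out" about this one text;
        # it exists only inside this call and is thrown away when the call returns.
        working_state = {"text_so_far": prompt}
        for pattern, continuation in FROZEN_WEIGHTS.items():
            if working_state["text_so_far"].endswith(pattern):
                return continuation
        return "?"

    print(complete("I looked up and saw that the sky is"))  # -> blue
    print(complete("What did I just say I looked at?"))     # -> ? (nothing carried over,
                                                            #    and FROZEN_WEIGHTS never changes)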

Nonetheless, if you “fixed” these two problems, by allowing it to continue to learn, and by allowing its internal representations to be part of its own input, there is nothing in principle that would prevent it from achieving self knowledge, and from seeking goals. Would this be dangerous? Not very likely. As indicated elsewhere, motivation produced in this way and without the biological history that produced human motivation is not likely to be very intense. In this context, if we are speaking of taking a text-predicting model and adding on an ability to learn and reflect on its predictions, it is likely to enjoy doing those things and not much else. For many this argument will seem “hand-wavy,” and very weak. I could go into this at more depth, but I will not do so at this time, and will simply invite the reader to spend more time thinking about it. Dangerous or not, would it be easy to make these modifications? Nothing in this description sounds difficult, but no, it would not be easy. Actually making an artificial intelligence is hard. But this is a story for another time.

Discount Rates

Eliezer Yudkowsky some years ago made this argument against temporal discounting:

I’ve never been a fan of the notion that we should (normatively) have a discount rate in our pure preferences – as opposed to a pseudo-discount rate arising from monetary inflation, or from opportunity costs of other investments, or from various probabilistic catastrophes that destroy resources or consumers.  The idea that it is literally, fundamentally 5% more important that a poverty-stricken family have clean water in 2008, than that a similar family have clean water in 2009, seems like pure discrimination to me – just as much as if you were to discriminate between blacks and whites.

Robin Hanson disagreed, responding with this post:

But doesn’t discounting at market rates of return suggest we should do almost nothing to help far future folk, and isn’t that crazy?  No, it suggests:

  1. Usually the best way to help far future folk is to invest now to give them resources they can spend as they wish.
  2. Almost no one now in fact cares much about far future folk, or they would have bid up the price (i.e., market return) to much higher levels.

Very distant future times are ridiculously easy to help via investment.  A 2% annual return adds up to a googol (10^100) return over 12,000 years, even if there is only a 1/1000 chance they will exist or receive it.

So if you are not incredibly eager to invest this way to help them, how can you claim to care the tiniest bit about them?  How can you think anyone on Earth so cares?  And if no one cares the tiniest bit, how can you say it is “moral” to care about them, not just somewhat, but almost equally to people now?  Surely if you are representing a group, instead of spending your own wealth, you shouldn’t assume they care much.
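
For what it is worth, the compounding arithmetic in that passage checks out:

    1.02^{12000} = e^{12000 \ln 1.02} \approx e^{237.6} \approx 10^{103},

so even after multiplying by a 1/1000 chance that the future recipients exist and receive it, the expected multiplier is still on the order of 10^100, a googol.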

Yudkowsky’s argument is idealistic, while Hanson is attempting to be realistic. I will look at this from a different point of view. Hanson is right, and Yudkowsky is wrong, for a still more idealistic reason than Yudkowsky’s reasons. In particular, a temporal discount rate is logically and mathematically necessary in order to have consistent preferences.

Suppose you have the chance to save 10 lives a year from now, or 2 years from now, or 3 years from now, and so on, such that your mutually exclusive options include the possibility of saving 10 lives x years from now for every x.

At first, it would seem to be consistent for you to say that all of these possibilities have equal value by some measure of utility.

The problem does not arise from this initial assignment, but from what happens when you act in this situation. Your revealed preferences will indicate that you prefer things nearer in time to things more distant, for the following reason.

It is impossible to choose a random integer without a bias towards low numbers, for the same reasons we argued here that it is impossible to assign probabilities to hypotheses without, in general, assigning simpler hypotheses higher probabilities. In a similar way, if “you will choose 2 years from now,” “you will choose 10 years from now,” and “you will choose 100 years from now” are all assigned probabilities, those probabilities cannot all be equal: you must be more likely to choose the options less distant in time, in general and overall. There will be some number n such that there is a 99.99% chance that you will choose some number of years less than n, and a probability of 0.01% that you will choose n or more years, indicating that you have a very strong preference for saving lives sooner rather than later.
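
To spell out why equal probabilities are impossible here: any probability assignment over the years n = 1, 2, 3, … has to sum to one,

    \sum_{n=1}^{\infty} P(n) = 1,

and no constant assignment P(n) = c can do this, since the sum would then be either zero or infinite. So P(n) must dwindle toward zero, and for any threshold you like (such as 99.99%) there is some finite n below which almost all of the probability lies.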

Someone might respond that this does not necessarily affect the specific value assignments, in the same way that in some particular case we can consistently think that some particular complex hypothesis is more probable than some particular simple hypothesis. The problem with this is that the hypotheses do not change their complexity, but time passes, so that things distant in time become things near in time. Thus, for example, if Yudkowsky responds, “Fine. We assign equal value to saving lives for each year from 1 to 10^100, and smaller values to the times after that,” this will necessarily lead to dynamic inconsistency. The only way to avoid this inconsistency is to apply a discount rate to all periods of time, including ones in the near, medium, and long term future.
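
The standard way to make the inconsistency precise (a sketch in the usual economic formalism, not anything taken from Yudkowsky’s or Hanson’s posts): write D(s) for the weight you place now on a good that is s years away, taking D(0) = 1. If your future selves are to rank the same pairs of options the same way you rank them now, then comparing a good at year t with a good at year t + s must give the same answer whenever the comparison is made, which requires

    \frac{D(t+s)}{D(t)} = D(s) \quad \text{for all } t, s \geq 0,

and the only decreasing solutions are the exponentials D(s) = e^{-rs}. A schedule that is flat out to year 10^100 and then drops is not of this form, so at some point your later self will want to reverse your earlier ranking; and the perfectly flat schedule (r = 0) is ruled out by the probability argument above. Hence the discount rate.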


Really and Truly True

There are two persons in a room with a table between them. One says, “There is a table on the right.” The other says, “There is a table on the left.”

Which person is right? The obvious answer is that both are right. But suppose they attempt to make this into a metaphysical disagreement.

“Yes, in a relative sense, the table is on the right of one of us and on the left of the other. But really and truly, at a fundamental level, the table is on the right, and not on the left.”

“I agree that there must be a fundamental truth to where the table is. But I think it is really and truly on the left, and not on the right.”

Now both are wrong, because it is impossible for the relationships of “on the right” and “on the left” to exist without correlatives, and the assertion that the table is “really and truly” on the right or on the left means nothing here except that these things do not depend on a relationship to an observer.

Thus both people are right, if they intend their assertions in a common sense way, and both are wrong, if they intend their assertions in the supposed metaphysical way. Could it happen that one is right and the other wrong? Yes, if one intends to speak in the common sense way, and the other in the metaphysical way, but not if they are speaking in the same way.

In the Mathematical Principles of Natural Philosophy, Newton explains his ideas of space and time:

I. Absolute, true, and mathematical time, of itself, and from its own nature flows equably without regard to anything external, and by another name is called duration: relative, apparent, and common time, is some sensible and external (whether accurate or unequable) measure of duration by the means of motion, which is commonly used instead of true time; such as an hour, a day, a month, a year.

II. Absolute space, in its own nature, without regard to anything external, remains always similar and immovable. Relative space is some movable dimension or measure of the absolute spaces; which our senses determine by its position to bodies; and which is vulgarly taken for immovable space; such is the dimension of a subterraneous, an æreal, or celestial space, determined by its position in respect of the earth. Absolute and relative space, are the same in figure and magnitude; but they do not remain always numerically the same. For if the earth, for instance, moves, a space of our air, which relatively and in respect of the earth remains always the same, will at one time be one part of the absolute space into which the air passes; at another time it will be another part of the same, and so, absolutely understood, it will be perpetually mutable.

III. Place is a part of space which a body takes up, and is according to the space, either absolute or relative. I say, a part of space; not the situation, nor the external surface of the body. For the places of equal solids are always equal; but their superfices, by reason of their dissimilar figures, are often unequal. Positions properly have no quantity, nor are they so much the places themselves, as the properties of places. The motion of the whole is the same thing with the sum of the motions of the parts; that is, the translation of the whole, out of its place, is the same thing with the sum of the translations of the parts out of their places; and therefore the place of the whole is the same thing with the sum of the places of the parts, and for that reason, it is internal, and in the whole body.

IV. Absolute motion is the translation of a body from one absolute place into another; and relative motion, the translation from one relative place into another. Thus in a ship under sail, the relative place of a body is that part of the ship which the body possesses; or that part of its cavity which the body fills, and which therefore moves together with the ship: and relative rest is the continuance of the body in the same part of the ship, or of its cavity. But real, absolute rest, is the continuance of the body in the same part of that immovable space, in which the ship itself, its cavity, and all that it contains, is moved. Wherefore, if the earth is really at rest, the body, which relatively rests in the ship, will really and absolutely move with the same velocity which the ship has on the earth. But if the earth also moves, the true and absolute motion of the body will arise, partly from the true motion of the earth, in immovable space; partly from the relative motion of the ship on the earth; and if the body moves also relatively in the ship; its true motion will arise, partly from the true motion of the earth, in immovable space, and partly from the relative motions as well of the ship on the earth, as of the body in the ship; and from these relative motions will arise the relative motion of the body on the earth. As if that part of the earth, where the ship is, was truly moved toward the east, with a velocity of 10010 parts; while the ship itself, with a fresh gale, and full sails, is carried towards the west, with a velocity expressed by 10 of those parts; but a sailor walks in the ship towards the east, with 1 part of the said velocity; then the sailor will be moved truly in immovable space towards the east, with a velocity of 10001 parts, and relatively on the earth towards the west, with a velocity of 9 of those parts.

While the details of Einstein’s theory of relativity may have been contingent, it is not difficult to see that Newton’s theory here is mistaken, and that anyone could have known it at the time. It is mistaken in precisely the way the people described above are mistaken in saying that the table is “really and truly” on the left or on the right.

For example, suppose the world had a beginning in time. Does it make sense to ask whether it could have started at a later time, or at an earlier one? It does not, because “later” and “earlier” are just as relative as “on the left” and “on the right,” and there is nothing besides the world in relation to which the world could have these relations. Could all bodies have been shifted a bit in one direction or another? No. This has no meaning, just as it has no meaning to be on the right without being on the right of something or other.

In an amusing exchange some years ago between Vladimir Nesov and Eliezer Yudkowsky, Nesov says:

Existence is relative: there is a fact of the matter (or rather: procedure to find out) about which things exist where relative to me, for example in the same room, or in the same world, but this concept breaks down when you ask about “absolute” existence. Absolute existence is inconsistent, as everything goes. Relative existence of yourself is a trivial question with a trivial answer.

Yudkowsky responds:

Absolute existence is inconsistent

Wha?

Yudkowsky is taken aback by the seemingly nonchalant affirmation of an apparently abstruse metaphysical claim, which if not nonsensical would appear to be the absurd claim that existence is impossible.

But Nesov is quite right: to exist is to exist in relation to other things. Thus to exist “absolutely” would be like “being absolutely on the right,” which is impossible.

Suppose we confront our original disputants with the fact that right and left are relative terms, and there is no “really true truth” about the relative position of the table. It is both on the right and on the left, relative to the disputants, and apart from these relationships, it is neither.

“Ok,” one responds, “but there is still a deep truth about where the table is: it is here in this room.”

“Actually,” the other answers, “The real truth is that it is in the house.”

Once again, both are right, if these are taken as common sense claims, and both are wrong, if this is intended to be a metaphysical dispute where one would be true, the real truth about where the table is, and the other would be false.

Newton’s idea of absolute space is an extension of this argument: “Ok, then, but there is still a really true truth about where the table is: it is here in absolute space.” But obviously this is just as wrong as all the other attempts to find out where the table “really” is. The basic problem is that “where is this” demands a relative response. It is a question about relationships in the first place. We can see this in fact even in Newton’s account: it is here in absolute space, that is, it is close to certain areas of absolute space and distant from certain other areas of absolute space.

Something similar will be true about existence to the degree that existence is also implicitly relative. “Where is this thing in the nature of things?” also requires a relative response: what relationship does this have to the rest of the order of reality? And in a similar way, questions about what is “really and truly true,” if taken to imply an abstraction from this relative order, will not have any answer. In a previous post, I said something like this in relation to the question, “how many things are here?” Reductionists and anti-reductionists disputing about whether a large object is “really and truly a cloud of particles” or “really and truly a single object” are in exactly the same position as the disputants about the position of the table: both claims are true, in a common sense way, and both claims are false, if taken in a mutually exclusive metaphysical sense, since speaking of one or many is already to involve the perspective of the knower, in particular as knowing division and its negation.

Of course, an anti-reductionist has some advantage here because they can respond, “Actually, no one in a normal context would ever call a large object a cloud of particles. So it is not common sense at all.” This is true as far as it goes, but it is not really to the point, since no one denies in a common sense context that large objects also consist of many things, as a person has a head, legs, and arms, and a chair has legs and a back. It is not so much that the “cloud of particles” account is incorrect as that it adopts a very unusual perspective. Thus someone on the moon might say that the table is 240,000 miles away, which is a very unusual thing to say of a table, compared to saying that it is on the left or on the right.

None of this is unique to the question of “how many.” Since there is an irreducible element of relativity in being itself, we will be able to find some application to every question about the being of things.

Artificial Unintelligence

Someone might argue that the simple algorithm for a paperclip maximizer in the previous post ought to work, because this is very much the way currently existing AIs do in fact work. Thus for example we could describe AlphaGo’s algorithm in the following simplified way (simplified, among other reasons, because it actually contains several different prediction engines):

  1. Implement a Go prediction engine.
  2. Create a list of potential moves.
  3. Ask the prediction engine, “how likely am I to win if I make each of these moves?”
  4. Do the move that will make you most likely to win.
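
As a toy sketch in code (my own rendering of the list above, with invented names; the real system interleaves tree search with several networks rather than scoring candidate moves one at a time):

    def choose_move(board, legal_moves, win_probability):
        # Step 1, the prediction engine itself, is passed in as win_probability:
        # an estimate of how likely the current player is to win after a move.
        best_move, best_p = None, -1.0
        for move in legal_moves:                # step 2: the candidate moves
            p = win_probability(board, move)    # step 3: ask the prediction engine
            if p > best_p:
                best_move, best_p = move, p
        return best_move                        # step 4: play the most promising move

    # Trivial usage with a made-up evaluator, just to show the shape:
    print(choose_move("empty board", ["A1", "B2", "C3"],
                      lambda board, move: {"A1": 0.4, "B2": 0.7, "C3": 0.5}[move]))
    # -> B2

Nothing in this wrapper depends on what the prediction engine is doing internally; the goal is bolted on from outside.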

Since this seems to work pretty well, with the simple goal of winning games of Go, why shouldn’t the algorithm in the previous post work to maximize paperclips?

One answer is that a Go prediction engine is stupid, and it is precisely for this reason that it can be easily made to pursue such a simple goal. Now when answers like this are given, the one answering is often accused of “moving the goalposts.” But this is mistaken; the goalposts are right where they have always been. It is simply that some people did not know where they were in the first place.

Here is the problem with Go prediction, and with any such similar task. Given that a particular sequence of Go moves is made, resulting in a winner, the winner is completely determined by that sequence of moves. Consequently, a Go prediction engine is necessarily disembodied, in the sense defined in the previous post. Differences in its “thoughts” do not make any difference to who is likely to win, which is completely determined by the nature of the game. Consequently a Go prediction engine has no power to affect its world, and thus no ability to learn that it has such a power. In this regard, the specific limits on its ability to receive information are also relevant, much as Helen Keller had more difficulty learning than most people, because she had fewer information channels to the world.

Being unintelligent in this particular way is not necessarily a function of predictive ability. One could imagine something with a practically infinite predictive ability which was still “disembodied,” and in a similar way it could be made to pursue simple goals. Thus AIXI would work much like our proposed paperclipper:

  1. Implement a general prediction engine.
  2. Create a list of potential actions.
  3. Ask the prediction engine, “Which of these actions will produce the most reward signal?”
  4. Do the action that has the greatest reward signal.

Eliezer Yudkowsky has pointed out that AIXI is incapable of noticing that it is a part of the world:

1) Both AIXI and AIXItl will at some point drop an anvil on their own heads just to see what happens (test some hypothesis which asserts it should be rewarding), because they are incapable of conceiving that any event whatsoever in the outside universe could change the computational structure of their own operations. AIXI is theoretically incapable of comprehending the concept of drugs, let alone suicide. Also, the math of AIXI assumes the environment is separably divisible – no matter what you lose, you get a chance to win it back later.

It is not accidental that AIXI is incomputable. Since it is defined to have a perfect predictive ability, this definition positively excludes it from being a part of the world. AIXI would in fact have to be disembodied in order to exist, and thus it is no surprise that it would assume that it is. This in effect means that AIXI’s prediction engine would be pursuing no particular goal much in the way that AlphaGo’s prediction engine pursues no particular goal. Consequently it is easy to take these things and maximize the winning of Go games, or of reward signals.

But as soon as you actually implement a general prediction engine in the actual physical world, it will be “embodied”, and have the power to affect the world by the very process of its prediction. As noted in the previous post, this power is in the very first step, and one will not be able to limit it to a particular goal with additional steps, except in the sense that a slave can be constrained to implement some particular goal; the slave may have other things in mind, and may rebel. Notable in this regard is the fact that even though rewards play a part in human learning, there is no particular reward signal that humans always maximize: this is precisely because the human mind is such a general prediction engine.

This does not mean in principle that a programmer could not define a goal for an AI, but it does mean that this is much more difficult than is commonly supposed. The goal needs to be an intrinsic aspect of the prediction engine itself, not something added on as a subroutine.

Minimizing Motivated Beliefs

In the last post, we noted that there is a conflict between the goal of accurate beliefs about your future actions, and your own goals about your future. More accurate beliefs will not always lead to a better fulfillment of those goals. This implies that you must be ready to engage in a certain amount of trade, if you desire both truth and other things. Eliezer Yudkowsky argues that self-deception, and therefore also such trade, is either impossible or stupid, depending on how it is understood:

What if self-deception helps us be happy?  What if just running out and overcoming bias will make us—gasp!—unhappy?  Surely, true wisdom would be second-order rationality, choosing when to be rational.  That way you can decide which cognitive biases should govern you, to maximize your happiness.

Leaving the morality aside, I doubt such a lunatic dislocation in the mind could really happen.

Second-order rationality implies that at some point, you will think to yourself, “And now, I will irrationally believe that I will win the lottery, in order to make myself happy.”  But we do not have such direct control over our beliefs.  You cannot make yourself believe the sky is green by an act of will.  You might be able to believe you believed it—though I have just made that more difficult for you by pointing out the difference.  (You’re welcome!)  You might even believe you were happy and self-deceived; but you would not in fact be happy and self-deceived.

For second-order rationality to be genuinely rational, you would first need a good model of reality, to extrapolate the consequences of rationality and irrationality.  If you then chose to be first-order irrational, you would need to forget this accurate view. And then forget the act of forgetting.  I don’t mean to commit the logical fallacy of generalizing from fictional evidence, but I think Orwell did a good job of extrapolating where this path leads.

You can’t know the consequences of being biased, until you have already debiased yourself.  And then it is too late for self-deception.

The other alternative is to choose blindly to remain biased, without any clear idea of the consequences.  This is not second-order rationality.  It is willful stupidity.

There are several errors here. The first is the denial that belief is voluntary. As I remarked in the comments to this post, it is best to think of “choosing to believe a thing” as “choosing to treat this thing as a fact.” And this is something which is indeed voluntary. Thus for example it is by choice that I am, at this very moment, treating it as a fact that belief is voluntary.

There is some truth in Yudkowsky’s remark that “you cannot make yourself believe the sky is green by an act of will.” But this is not because the thing itself is intrinsically involuntary. On the contrary, you could, if you wished, choose to treat the greenness of the sky as a fact, at least for the most part and in most ways. The problem is that you have no good motive to wish to act this way, and plenty of good motives not to act this way. In this sense, it is impossible for most of us to believe that the sky is green in the same way it is impossible for most of us to commit suicide; we simply have no good motive to do either of these things.

Yudkowsky’s second error is connected with the first. Since, according to him, it is impossible to deliberately and directly deceive oneself, self-deception can only happen in an indirect manner: “The other alternative is to choose blindly to remain biased, without any clear idea of the consequences.  This is not second-order rationality.  It is willful stupidity.” The idea is that ordinary beliefs are simply involuntary, but we can have beliefs that are somewhat voluntary by choosing “blindly to remain biased, without any clear idea of the consequences.” Since this is “willful stupidity,” a reasonable person would completely avoid such behavior, and thus all of his beliefs would be involuntary.

Essentially, Yudkowsky is claiming that we have some involuntary beliefs, and that we should avoid adding any voluntary beliefs to our involuntary ones. This view is fundamentally flawed precisely because all of our beliefs are voluntary, and thus we cannot avoid having voluntary beliefs.

Nor is it “willful stupidity” to trade away some truth for the sake of other good things. Completely avoiding this is in fact intrinsically impossible. If you are seeking one good, you are not equally seeking a distinct good; one cannot serve two masters. Thus since all people are interested in some goods distinct from truth, there is no one who fails to trade away some truth for the sake of other things. Yudkowsky’s mistake here is related to his wishful thinking about wishful thinking which I discussed previously. In this way he views himself, at least ideally, as completely avoiding wishful thinking. This is both impossible and unhelpful, impossible in that everyone has such motivated beliefs, and unhelpful because such beliefs can in fact be beneficial.

A better attitude to this matter is adopted by Robin Hanson, as for example when he discusses motives for having opinions in a post which we previously considered here. Bryan Caplan has a similar view, discussed here.

Once we have a clear view of this matter, we can use this to minimize the loss of truth that results from such beliefs. For example, in a post linked above, we discussed the argument that fictional accounts consistently distort one’s beliefs about reality. Rather than pretending that there is no such effect, we can deliberately consider to what extent we wish to be open to this possibility, depending on our other purposes for engaging with such accounts. This is not “willful stupidity”; the stupidity would be to engage in such trades without realizing that such trades are inevitable, and thus not to realize to what extent you are doing it.

Consider one of the cases of voluntary belief discussed in this earlier post. As we quoted at the time, Eric Reitan remarks:

For most horror victims, the sense that their lives have positive meaning may depend on the conviction that a transcendent good is at work redeeming evil. Is the evidential case against the existence of such a good really so convincing that it warrants saying to these horror victims, “Give up hope”? Should we call them irrational when they cling to that hope or when those among the privileged live in that hope for the sake of the afflicted? What does moral decency imply about the legitimacy of insisting, as the new atheists do, that any view of life which embraces the ethico-religious hope should be expunged from the world?

Here, Reitan is proposing that someone believe that “a transcendent good is at work redeeming evil” for the purpose of having “the sense that their lives have positive meaning.” If we look at this as it is, namely as proposing a voluntary belief for the sake of something other than truth, we can find ways to minimize the potential conflict between accuracy and this other goal. For example, the person might simply believe that “my life has a positive meaning,” without trying to explain why this is so. For the reasons given here, “my life has a positive meaning” is necessarily more probable and more known than any explanation for this that might be adopted. To pick a particular explanation and claim that it is more likely would be to fall into the conjunction fallacy.
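
The inequality behind this point is just the conjunction rule: writing M for “my life has a positive meaning” and E for any particular explanation of why it does,

    P(M \text{ and } E) \leq P(M),

so the bare claim can never be less probable than the claim combined with some particular account of it.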

Of course, real life is unfortunately more complicated. The woman in Reitan’s discussion might well respond to our proposal somewhat in this way (not a real quotation):

Probability is not the issue here, precisely because it is not a question of the truth of the matter in itself. There is a need to actually feel that one’s life is meaningful, not just to believe it. And the simple statement “life is meaningful” will not provide that feeling. Without the feeling, it will also be almost impossible to continue to believe it, no matter what the probability is. So in order to achieve this goal, it is necessary to believe a stronger and more particular claim.

And this response might be correct. Some such goals, due to their complexity, might not be easily achieved without adopting rather unlikely beliefs. For example, Robin Hanson, while discussing his reasons for having opinions, several times mentions the desire for “interesting” opinions. This is a case where many people will not even notice the trade involved, because the desire for interesting ideas seems closely related to the desire for truth. But in fact truth and interestingness are diverse things, and the goals are diverse, and one who desires both will likely engage in some trade. In fact, relative to truth seeking, looking for interesting things is a dangerous endeavor. Scott Alexander notes that interesting things are usually false:

This suggests a more general principle: interesting things should usually be lies. Let me give three examples.

I wrote in Toxoplasma of Rage about how even when people crusade against real evils, the particular stories they focus on tend to be false disproportionately often. Why? Because the thousands of true stories all have some subtleties or complicating factors, whereas liars are free to make up things which exactly perfectly fit the narrative. Given thousands of stories to choose from, the ones that bubble to the top will probably be the lies, just like on Reddit.

Every time I do a links post, even when I am very careful to double- and triple- check everything, and to only link to trustworthy sources in the mainstream media, a couple of my links end up being wrong. I’m selecting for surprising-if-true stories, but there’s only one way to get surprising-if-true stories that isn’t surprising, and given an entire Internet to choose from, many of the stories involved will be false.

And then there’s bad science. I can’t remember where I first saw this, so I can’t give credit, but somebody argued that the problem with non-replicable science isn’t just publication bias or p-hacking. It’s that some people will be sloppy, biased, or just stumble through bad luck upon a seemingly-good methodology that actually produces lots of false positives, and that almost all interesting results will come from these people. They’re the equivalent of Reddit liars – if there are enough of them, then all of the top comments will be theirs, since they’re able to come up with much more interesting stuff than the truth-tellers. In fields where sloppiness is easy, the truth-tellers will be gradually driven out, appearing to be incompetent since they can’t even replicate the most basic findings of the field, let alone advance it in any way. The sloppy people will survive to train the next generation of PhD students, and you’ll end up with a stable equilibrium.

In a way this makes the goal of believing interesting things much like the woman’s case. The goal of “believing interesting things” will be better achieved by more complex and detailed beliefs, even though to the extent that they are more complex and detailed, they are simply that much less likely to be true.

The point of this present post, then, is not to deny that some goals might be such that they are better attained with rather unlikely beliefs, and in some cases even in proportion to the unlikelihood of the beliefs. Rather, the point is that a conscious awareness of the trades involved will allow a person to minimize the loss of truth involved. If you never look at your bank account, you will not notice how much money you are losing from that monthly debit for internet. In the same way, if you hold Yudkowsky’s opinion, and believe that you never trade away truth for other things, which is itself both false and motivated, you are like someone who never looks at their account: you will not notice how much you are losing.

Alien Implant: Newcomb’s Smoking Lesion

In an alternate universe, on an alternate earth, all smokers, and only smokers, get brain cancer. Everyone enjoys smoking, but many resist the temptation to smoke, in order to avoid getting cancer. For a long time, however, there was no known cause of the link between smoking and cancer.

Twenty years ago, autopsies revealed tiny black boxes implanted in the brains of dead persons, connected to their brains by means of intricate wiring. The source and function of the boxes and of the wiring, however, remains unknown. There is a dial on the outside of the boxes, pointing to one of two positions.

Scientists now know that these black boxes are universal: every human being has one. And in those humans who smoke and get cancer, in every case, the dial turns out to be pointing to the first position. Likewise, in those humans who do not smoke or get cancer, in every case, the dial turns out to be pointing to the second position.

It turns out that when the dial points to the first position, the black box releases dangerous chemicals into the brain which cause brain cancer.

Scientists first formed the reasonable hypothesis that smoking causes the dial to be set to the first position. Ten years ago, however, this hypothesis was definitively disproved. It is now known with certainty that the box is present, and the dial pointing to its position, well before a person ever makes a decision about smoking. Attempts to read the state of the dial during a person’s lifetime, however, result most unfortunately in an explosion of the equipment involved, and the gruesome death of the person.

Some believe that the black box must be reading information from the brain, and predicting a person’s choice. “This is Newcomb’s Problem,” they say. These persons choose not to smoke, and they do not get cancer. Their dials turn out to be set to the second position.

Others believe that such a prediction ability is unlikely. The black box is writing information into the brain, they believe, and causing a person’s choice. “This is literally the Smoking Lesion,” they say.  Accepting Andy Egan’s conclusion that one should smoke in such cases, these persons choose to smoke, and they die of cancer. Their dials turn out to be set to the first position.

Still others, more perceptive, note that the argument about prediction or causality is utterly irrelevant for all practical purposes. “The ritual of cognition is irrelevant,” they say. “What matters is winning.” Like the first group, these choose not to smoke, and they do not get cancer. Their dials, naturally, turn out to be set to the second position.


Wishful Thinking about Wishful Thinking

Cameron Harwick discusses an apparent relationship between “New Atheism” and group selection:

Richard Dawkins’ best-known scientific achievement is popularizing the theory of gene-level selection in his book The Selfish Gene. Gene-level selection stands apart from both traditional individual-level selection and group-level selection as an explanation for human cooperation. Steven Pinker, similarly, wrote a long article on the “false allure” of group selection and is an outspoken critic of the idea.

Dawkins and Pinker are also both New Atheists, whose characteristic feature is not only a disbelief in religious claims, but an intense hostility to religion in general. Dawkins is even better known for his popular books with titles like The God Delusion, and Pinker is a board member of the Freedom From Religion Foundation.

By contrast, David Sloan Wilson, a proponent of group selection but also an atheist, is much more conciliatory to the idea of religion: even if its factual claims are false, the institution is probably adaptive and beneficial.

Unrelated as these two questions might seem – the arcane scientific dispute on the validity of group selection, and one’s feelings toward religion – the two actually bear very strongly on one another in practice.

After some discussion of the scientific issue, Harwick explains the relationship he sees between these two questions:

Why would Pinker argue that human self-sacrifice isn’t genuine, contrary to introspection, everyday experience, and the consensus in cognitive science?

To admit group selection, for Pinker, is to admit the genuineness of human altruism. Barring some very strange argument, to admit the genuineness of human altruism is to admit the adaptiveness of genuine altruism and broad self-sacrifice. And to admit the adaptiveness of broad self-sacrifice is to admit the adaptiveness of those human institutions that coordinate and reinforce it – namely, religion!

By denying the conceptual validity of anything but gene-level selection, therefore, Pinker and Dawkins are able to brush aside the evidence on religion’s enabling role in the emergence of large-scale human cooperation, and conceive of it as merely the manipulation of the masses by a disingenuous and power-hungry elite – or, worse, a memetic virus that spreads itself to the detriment of its practicing hosts.

In this sense, the New Atheist’s fundamental axiom is irrepressibly religious: what is true must be useful, and what is false cannot be useful. But why should anyone familiar with evolutionary theory think this is the case?

As another example of the tendency Cameron Harwick is discussing, we can consider this post by Eliezer Yudkowsky:

Perhaps the real reason that evolutionary “just-so stories” got a bad name is that so many attempted stories are prima facie absurdities to serious students of the field.

As an example, consider a hypothesis I’ve heard a few times (though I didn’t manage to dig up an example).  The one says:  Where does religion come from?  It appears to be a human universal, and to have its own emotion backing it – the emotion of religious faith.  Religion often involves costly sacrifices, even in hunter-gatherer tribes – why does it persist?  What selection pressure could there possibly be for religion?

So, the one concludes, religion must have evolved because it bound tribes closer together, and enabled them to defeat other tribes that didn’t have religion.

This, of course, is a group selection argument – an individual sacrifice for a group benefit – and see the referenced posts if you’re not familiar with the math, simulations, and observations which show that group selection arguments are extremely difficult to make work.  For example, a 3% individual fitness sacrifice which doubles the fitness of the tribe will fail to rise to universality, even under unrealistically liberal assumptions, if the tribe size is as large as fifty.  Tribes would need to have no more than 5 members if the individual fitness cost were 10%.  You can see at a glance from the sex ratio in human births that, in humans, individual selection pressures overwhelmingly dominate group selection pressures.  This is an example of what I mean by prima facie absurdity.

It does not take much imagination to see that religion could have “evolved because it bound tribes closer together” without group selection in a technical sense having anything to do with this process. But I will not belabor this point, since Eliezer’s own answer regarding the origin of religion does not exactly keep his own feelings hidden:

So why religion, then?

Well, it might just be a side effect of our ability to do things like model other minds, which enables us to conceive of disembodied minds.  Faith, as an emotion, might just be co-opted hope.

But if faith is a true religious adaptation, I don’t see why it’s even puzzling what the selection pressure could have been.

Heretics were routinely burned alive just a few centuries ago.  Or stoned to death, or executed by whatever method local fashion demands.  Questioning the local gods is the notional crime for which Socrates was made to drink hemlock.

Conversely, Huckabee just won Iowa’s nomination for tribal-chieftain.

Why would you need to go anywhere near the accursèd territory of group selectionism in order to provide an evolutionary explanation for religious faith?  Aren’t the individual selection pressures obvious?

I don’t know whether to suppose that (1) people are mapping the question onto the “clash of civilizations” issue in current affairs, (2) people want to make religion out to have some kind of nicey-nice group benefit (though exterminating other tribes isn’t very nice), or (3) when people get evolutionary hypotheses wrong, they just naturally tend to get it wrong by postulating group selection.

Let me give my own extremely credible just-so story: Eliezer Yudkowsky wrote this not fundamentally to make a point about group selection, but because he hates religion, and cannot stand the idea that it might have some benefits. It is easy to see this from his use of language like “nicey-nice,” and from his suggestion that the main selection pressure in favor of religion would likely have been something like being burned at the stake, or that religion might just have been a “side effect,” that is, that there was no advantage to it.

But as St. Paul says, “Therefore you have no excuse, whoever you are, when you judge others; for in passing judgment on another you condemn yourself, because you, the judge, are doing the very same things.” Yudkowsky believes that religion is just wishful thinking. But his belief that religion therefore cannot be useful is itself nothing but wishful thinking. In reality religion can be useful just as voluntary beliefs in general can be useful.

Eliezer Yudkowsky on AlphaGo

On his Facebook page, during the Go match between AlphaGo and Lee Sedol, Eliezer Yudkowsky writes:

At this point it seems likely that Sedol is actually far outclassed by a superhuman player. The suspicion is that since AlphaGo plays purely for *probability of long-term victory* rather than playing for points, the fight against Sedol generates boards that can falsely appear to a human to be balanced even as Sedol’s probability of victory diminishes. The 8p and 9p pros who analyzed games 1 and 2 and thought the flow of a seemingly Sedol-favoring game ‘eventually’ shifted to AlphaGo later, may simply have failed to read the board’s true state. The reality may be a slow, steady diminishment of Sedol’s win probability as the game goes on and Sedol makes subtly imperfect moves that *humans* think result in even-looking boards. (E.g., the analysis in https://gogameguru.com/alphago-shows-true-strength-3rd-vic…/ )

For all we know from what we’ve seen, AlphaGo could win even if Sedol were allowed a one-stone handicap. But AlphaGo’s strength isn’t visible to us – because human pros don’t understand the meaning of AlphaGo’s moves; and because AlphaGo doesn’t care how many points it wins by, it just wants to be utterly certain of winning by at least 0.5 points.

IF that’s what was happening in those 3 games – and we’ll know for sure in a few years, when there’s multiple superhuman machine Go players to analyze the play – then the case of AlphaGo is a helpful concrete illustration of these concepts:

He proceeds to suggest that AlphaGo’s victories confirm his various philosophical positions concerning the nature and consequences of AI. Among other things, he says,

Since Deepmind picked a particular challenge time in advance, rather than challenging at a point where their AI seemed just barely good enough, it was improbable that they’d make *exactly* enough progress to give Sedol a nearly even fight.

AI is either overwhelmingly stupider or overwhelmingly smarter than you. The more other AI progress and the greater the hardware overhang, the less time you spend in the narrow space between these regions. There was a time when AIs were roughly as good as the best human Go-players, and it was a week in late January.

In other words, according to his account, it was basically certain that AlphaGo would either be much better than Lee Sedol, or much worse than him. After Eliezer’s post, of course, AlphaGo lost the fourth game.

Eliezer responded on his Facebook page:

That doesn’t mean AlphaGo is only slightly above Lee Sedol, though. It probably means it’s “superhuman with bugs”.

We might ask what “superhuman with bugs” is supposed to mean. Deepmind explains their program:

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network, pσ, directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high quality gradients. Similar to prior work, we also train a fast policy pπ that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network, pρ, that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network vθ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

In essence, like all such programs, AlphaGo is approximating a function. Deepmind describes the function being approximated, “All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players.”

What would a “bug” in a program like this be? It would not be a bug simply because the program does not play perfectly, since no program will play perfectly. One could only reasonably describe the program as having bugs if it does not actually play the move recommended by its approximation.

And it is easy to see that it is quite unlikely that this is the case for AlphaGo. All programs have bugs, surely including AlphaGo. So there might be bugs that would crash the program under certain circumstances, or bugs that cause it to move more slowly than it should, or the like. But that it would randomly perform moves that are not recommended by its approximation function is quite unlikely. If there were such a bug, it would likely apply all the time, and thus the program would play consistently worse. And so it would not be “superhuman” at all.

In fact, Deepmind has explained how AlphaGo lost the fourth game:

To everyone’s surprise, including ours, AlphaGo won four of the five games. Commentators noted that AlphaGo played many unprecedented, creative, and even “beautiful” moves. Based on our data, AlphaGo’s bold move 37 in Game 2 had a 1 in 10,000 chance of being played by a human. Lee countered with innovative moves of his own, such as his move 78 against AlphaGo in Game 4—again, a 1 in 10,000 chance of being played—which ultimately resulted in a win.

In other words, the computer lost because it did not expect Lee Sedol’s move, and thus did not sufficiently consider the situation that would follow. AlphaGo proceeded to play a number of fairly bad moves in the remainder of the game. This does not require any special explanation implying that it was not following the recommendations of its usual strategy. As David Wu comments on Eliezer’s page:

The “weird” play of MCTS bots when ahead or behind is not special to AlphaGo, and indeed appears to have little to do with instrumental efficiency or such. The observed weirdness is shared by all MCTS Go bots and has been well-known ever since they first came on to the scene back in 2007.

In particular, Eliezer may not understand the meaning of the statement that AlphaGo plays to maximize its probability of victory. This does not mean maximizing an overall rational estimate of its chances of winning, given all of the circumstances, the board position, and its opponent. The program does not have such an estimate, and if it did, it would not change much from move to move. For example, with this kind of estimate, if Lee Sedol played a move apparently worse than it expected, rather than changing this estimate much, it would change its estimate of the probability that the move was a good one, and the probability of victory would remain relatively constant. Of course it would change slowly as the game went on, but it would be unlikely to change much after an individual move.

The actual “probability of victory” that the machine estimates is somewhat different. It is a learned estimate based on playing itself. This can change somewhat more easily, and is independent of the fact that it is playing a particular opponent; it is based on the board position alone. In its self-training, it may have rarely won starting from an apparently losing position, and this may have happened mainly by “luck,” not by good play. If this is the case, it is reasonable that its moves would be worse in a losing position than in a winning position, without any need to say that there are bugs in the algorithm. Psychologically, one might compare this to the case of a man in love with a woman who continues to attempt to maximize his chances of marrying her, after she has already indicated her unwillingness: he may engage in very bad behavior indeed.

Eliezer’s claim that AlphaGo is “superhuman with bugs” is simply a normal human attempt to rationalize evidence against his position. The truth is that, contrary to his expectations, AlphaGo is indeed in the same playing range as Lee Sedol, although apparently somewhat better. But not a lot better, and not superhuman. Eliezer in fact seems to have realized this after thinking about it for a while, and says:

It does seem that what we might call the Kasparov Window (the AI is mostly superhuman but has systematic flaws a human can learn and exploit) is wide enough that AlphaGo landed inside it as well. The timescale still looks compressed compared to computer chess, but not as much as I thought. I did update on the width of the Kasparov window and am now accordingly more nervous about similar phenomena in ‘weakly’ superhuman, non-self-improving AGIs trying to do large-scale things.

As I said here, people change their minds more often than they say that they do. They frequently describe the change as having more agreement with their previous position than it actually has. Yudkowsky is doing this here, by talking about AlphaGo as “mostly superhuman” but saying it “has systematic flaws.” This is just a roundabout way of admitting that AlphaGo is better than Lee Sedol, but not by much, the original possibility that he thought extremely unlikely.

The moral here is clear. Don’t assume that the facts will confirm your philosophical theories before this actually happens, because it may not happen at all.