In the second instalment of this new video series, Stephen José Hanson talks to Richard Sutton.
Recently, Richard Sutton, along with co-authors David Silver, Satinder Singh and Doina Precup, published a paper entitled Reward is enough, in which they hypothesise that the maximisation of total reward may be enough to understand intelligence and its associated abilities. This discussion attempts to further clarify this position and answer the more general question, “what is AI?”
HANSON: Rich, it’s great to see you again. I thought we’d start out with me trying to summarise the paper you guys wrote. It seems like a white paper, a kind of position paper, to set the stage for what I assume is a strategy in DeepMind and what is a strategy, frankly from knowing you for all these years, that you’ve had. I mean, this is not new stuff. You’ve been saying this certainly since the 70s. The 70s may be a little hazy for both of us, but certainly since the 80s you and others have been pointing out a couple of things: reinforcement, if you allow me to call it reinforcement learning, now this covers a couple of things, just to give some context – I have a background in animal behaviour and reinforcement learning is something I’m very familiar with – I’ve written a lot of papers on that as well, but for rats and pigeons, not robots.
SUTTON: That’s something very special about you, or that you and I share a little bit, we both have a background in animal learning theory.
HANSON: That’s right, I’d forgotten about that. Tell me more.
SUTTON: That is the origin of my participation in the field of reinforcement learning. It’s obvious that animals learn by reward and penalty and it’s just amazing that that wasn’t really in AI and the various engineering approaches to learning. It wasn’t present, at one time. Now it’s a substantial thing, but there was a while when it wasn’t a thing. AI guys don’t know enough about psychology.
HANSON: They don’t know enough about animal behaviour and behaviour theory overall, but we’ll get to that in a second. Although in response to your comment, would you agree that if we look back at control theory there’s certainly a latent kind of reward, or shall we call it outcome or consequence for some action or some set of actions that lead to cooling a building or heating some part of a space that is homeostasis – now it’s hard to say that the furnace is getting rewarded, but nonetheless, it is getting a control signal saying “get it warmer, get it colder”, it’s following rules to make things happen and it’s maximising, if it did get rewards, it would be maximising rewards, right? I mean, if humans had to say “I like the heat in here, it’s good, good job furnace”, then wouldn’t the furnace basically be maximising reward in that sense? Or is that too simple?
SUTTON: Well, you could say that homeostasis and reward maximising are similar in that they’re both things about data, and they’re things about interacting with the world and having an effect on the world. But you could also say that they’re like opposites because one is trying to drive errors to zero and is told what to do, and the other has to maximise something and it is not told what to do, it has to figure out what to do.
HANSON: Right, right. So, if the furnace didn’t come with some built-in policies then it would have to learn them based on something the human was doing in the room so that it could get some information about that state so that it could somehow improve the state of the human’s, kind of, trembling or sweating.
SUTTON: Yeah, but going back to your initial statement, surely in engineering there’s been something like behaving to maximise reward, and there is a tiny corner of engineering – operations research – that does dynamic programming, and that is not about learning actually, but it’s about reward. It’s about computing optimum policies to maximise reward.
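To make the dynamic programming point concrete, here is a minimal sketch of value iteration, which computes an optimal policy from a known model with no learning involved. The four-state chain, the reward of 1.0 for reaching the last state, the discount factor and all function names are assumptions made up for illustration; none of them come from the interview or the paper.

```python
# Hypothetical toy problem: states 0..3 in a row, state 3 is terminal.
# Actions: 0 = move left, 1 = move right. Entering state 3 pays reward 1.0.
GAMMA = 0.9          # discount factor (assumed)
N_STATES = 4
TERMINAL = 3
ACTIONS = (0, 1)


def step(state, action):
    """Known, deterministic model: returns (next_state, reward)."""
    if action == 0:
        nxt = max(state - 1, 0)
    else:
        nxt = min(state + 1, N_STATES - 1)
    reward = 1.0 if nxt == TERMINAL else 0.0
    return nxt, reward


def backup(state, action, values):
    """One-step return: immediate reward plus discounted next-state value."""
    nxt, reward = step(state, action)
    return reward + GAMMA * values[nxt]


def value_iteration(tol=1e-9):
    """Sweep the states until the value estimates stop changing."""
    values = [0.0] * N_STATES
    while True:
        delta = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # terminal state keeps value 0
            best = max(backup(s, a, values) for a in ACTIONS)
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values


def greedy_policy(values):
    """Pick, in each non-terminal state, the action with the best backup."""
    return {
        s: max(ACTIONS, key=lambda a: backup(s, a, values))
        for s in range(N_STATES)
        if s != TERMINAL
    }


optimal_values = value_iteration()
optimal_policy = greedy_policy(optimal_values)
```

In this toy case the computed policy is to move right everywhere, and the values shrink by the discount factor with each step of distance from the goal. Nothing here is learned from experience: the model in `step` is given in advance, which is the distinction Sutton draws between operations research and reinforcement learning.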
HANSON: OK, well, so tell me where I’m going wrong here in terms of summarising this paper, which I found very illuminating. In contrast, just to set the stage in the other direction, the huge change in AI in the last five to ten years is deep learning, and deep learning is mostly (I’m going to be fairly general here) about classification learning. Mostly about supervised categorised learning and the fact that there has been a phenomenal improvement in benchmarks, sometimes by 150% if you look at the given context. So, that’s caught everyone’s attention, maybe except for Elon Musk who still can’t get a car to drive autonomously without running people over. But, besides him, certainly Google and other groups are… Amazon is trying to figure out how to do autonomous driving at, I think it’s called level 4. I’m not sure what their categories are but they have some designation or robotic efficacy or something. Based on Asimov – “do no harm, don’t kill humans”, things like that. But your claim is stronger. There’s this thing called artificial general intelligence and it reminded me of the work that was done back in the 20s and 30s with psychologists, particularly L. L. Thurstone and some others who were saying there has to be something called general intelligence. Now this of course led to IQ tests, it led to all kinds of strange vices because IQ tests are never properly normed and they’re a moving target because the item analysis itself is dynamic and you can’t really say much… frankly the statistics they rely on are mostly horse shit, but that’s a technical term! But, nonetheless, when you’ve got correlations of 0.3 and you’re trying to push that forward into a community and test their intelligence, my big worry is that someone in AI is going to discover IQ tests and start trying to use those on poor robots. 
But there’s something else going on here, because Thurstone and others had this idea that – and in fact in terms of factor analysis – that there was some kind of general factor that was more common across all sorts of tasks and abilities that humans did. It’s not just that this guy was a genius, it’s just that he had a bunch of common strategies and talents and abilities that somehow covered …a scope of the task space that was much larger. If you agree that that’s kind of the definition here of artificial general intelligence, I’m not pushing it, I’m just saying that’s one possibility, then you’re saying that if we just look at an agent attempting to maximise reward under some general context, it will become generally intelligent. Is that fair?
SUTTON: Yeah, it’s not unfair.
HANSON: OK, so if we go with that…
SUTTON: You asked the question “what is AI?” and that’s what you’re getting at. I mean, I guess that explanation you gave is kind of complicated. Really intelligence is a simpler thing, it’s just the computational part of the ability to achieve goals.
HANSON: OK, and you think that would be sufficient, therefore…
SUTTON: So you have to have a goal, and a modern way to formulate the goal is in terms of reward. So then, basically any way if you have a general computational ability to maximise reward, then you’re intelligent, almost by definition.
HANSON: Somehow, coming from animal behaviour – I was raised on BF Skinner and I read “Behaviour of organisms” from front to back several times, the 1938 book, and it strikes me in reading that, that there’s something very similar in what you guys are saying that feels like behaviourism. That is, I think you do comment on this in the paper, and you talk about internal representations, but there’s something deeper here. If I said my cat is maximising rewards all the time, as far as I can tell, and it learns all sorts of strange behaviours in order to get food, nonetheless, is that sufficient to say that he has a general intelligence?
SUTTON: Intelligence is the computational part of the ability to achieve goals, and abilities can have various degrees, so it’s not like you have this ability or you don’t have this ability. You have that ability to various degrees, it’s not binary.
HANSON: So what controls the degrees?
SUTTON: Cats, I think, are really intelligent. Let’s just make that clear. (Cat in the background meows!)
HANSON: They’ve had a chequered past, I’ll give you that. The Egyptians seemed to be confusing them for some kind of Gods at some point. But, we’ll leave that aside.
SUTTON: You mentioned Skinner and behaviourism. I think we should deal with that. So I think there’s a sense in which the idea of reward seems kind of reductionist and almost demeaning – it sort of suggests that our lives are just about getting pleasure and pain as scalar signals. And, I think that’s a real thing, we shouldn’t run away from that, we should pay attention to it. Reductionism is not necessarily bad, we like the way physics can reduce things to atoms, and it can be powerful. But it also seems kind of annoying that we reduce things to lesser things, we reduce objects, people to just atoms. That one we’re used to, we think “I’m a bunch of atoms arranged in a particular way, and the way they’re arranged is really special, so I’m happy with it”.
HANSON: I spend every day scanning people in MRI systems and my view of them is that they’re just bags of water which I can throw some protons around in and so on. So I get the reductionist tendency here.
SUTTON: Maybe we are just reward maximisers – really complicated reward maximisers interacting with a really complicated world, and that’s OK. That reduction does not demean us.
HANSON: Except that the critique on Skinner of course was that if you try to apply this to language and developmental acquisition of language you run into kind of a conundrum identifying what the reward is. Often the examples are mothers trying to correct their children when they say “we go to the store”, they say “no, we went to the store”, and the children say “yes, we go to”. So the problem is, which seems to be some kind of ballistic… this is where cognitive psychology comes in, in the 50s and 60s, particularly there’s a great book which is worth bringing up in this context which was something George Miller, Eugene Galanter and Karl Pribram wrote, called “Plans and the structure of behaviour”, which prior to Ulric Neisser was a complete rejection of behaviourism. Miller was just adamant that this behaviouristic, reductionist view of things was just not capturing what cognitive function was in a human. Now, in saying that, and you did, rewards can be very subtle, and of course Skinner spent a lot of time finding out that your environment in some sense arranges rewards or maybe you do, in the sense that you’re your own kind of reward manager. And that somehow fits back into this internal structure that isn’t really being captured by the reductionism I don’t think, that seems to be the big critique of this kind of approach is that if you look at the internal states of the brain – I don’t believe there’s a mind, I don’t generally believe what philosophers say, basically, there’s a brain and in the brain there are emergent things, and things happen and that’s great, but it’s mechanical, it’s causal, but it’s very highly representational because it has some relation to the world and the context around it that isn’t, it doesn’t seem as low dimensional as a reward maximisation function. But, maybe I’m missing something.
SUTTON: Well, that’s a lot of things. So, there’s this whole controversy about behaviourism that ran its course many decades ago and that we all have in our memory. There was a big fight and behaviourism lost. And behaviourism is bad, but we learned. Even old guys like us, it happened before we were academics, and so we’re just left with this lingering memory. And yet, when we grew up, we went to school, this stuff about animal learning, this is cool, this is fundamental principles of learning, these guys with the rats and stuff figured out some important things. And so, we don’t understand why it’s bad but we understand behaviourism is bad because it’s been passed down to us. How do we square the circle? I don’t know. I haven’t really studied carefully the controversy, but it obviously doesn’t really apply to all of the things that people are doing in reinforcement learning – we’re not trying to reduce the mind to stimuli and responses, there are things going on inside the mind, that’s what we study, we study the algorithms by which things are constructed, we study the insides of the brain, of the mind, we are not behaviourists in terms of restricting language and concepts to things external. This is my weak understanding of the controversy. The controversy was you’re not allowed to talk about what’s inside, now it obviously doesn’t apply to modern AI. AI is all about what’s inside, what’s in the mind.
HANSON: Well, except that if we look at some of the successes of DeepMind with regard to game playing and benchmarks, and protein folding itself is quite a shocker, although it’s not really solved, but it’s on the road, my goodness, and that’s something where the internal structure matters but then again this is probably something more about biological knowledge and the way in which the algorithm interacted with that, which is why it worked. The same thing with the game players, I mean Gerald Tesauro was an amazing backgammon player. So when he started to build TD-Gammon, it was a shocker to him when it started to beat him. Because it just played itself – I guess this was an early kind of GAN system, basically bootstrapping itself with a copy.
HANSON: Yeah, self-play. But, in those situations it’s very hard to point out, other than behavioural observations. In the Go situation, I guess there were certain openings or certain “board shapes” of the pieces that would appear that had never appeared before, and people go “this is shocking, this is a level of creativity”, but it’s an inference, it’s not about the internals of that machine. No-one really took the machine apart and tried to figure it out. I’m happy with a bunch of mathematical equations, even if I don’t understand them, at least I feel that somebody understood what was going on. And on the other hand that’s not very satisfying to our friends out there trying to push explainable AI. I have to point out, sarcastically, that most explanations really aren’t actually very satisfactory. I’m not sure where that tangent is going to take them. But it does highlight the point we are getting at here – what are we actually maximising in the sense of the internal representation or the structure, what is changing about the Go player that makes him so much better than that poor Korean champion that lost to AlphaGo, what’s inside of the black box?
SUTTON: That’s actually pretty clear. It’s changing its value function, its intuitive sense of when it’s winning and losing. And it’s also changing its intuitive sense of what move to make. So, in reinforcement learning we call one its value function, its intuitive sense of how well it’s doing, and the policy is its intuitive sense of what actions to take. It’s learning those two things, it’s learning intuition about the best move and about how well it’s doing.
HANSON: Forgive me if I’d be too much of a reductionist to say “is that in the striatum or in the medial temporal lobe”?
SUTTON: It’s great how we have the brain and we can reverse engineer it maybe, and we have theorists, AI people, and they’re all working on the same thing in different ways.
HANSON: Right, but it does, I like your explanation of the intuitive sense of how well I’m doing, and indeed I do have an intuitive sense when I play a game of chess, you know, am I beating Rich Sutton at this game, and I don’t know and that’s confusing and eventually I get his queen so I must be doing better, and then he checkmates me – uh oh! So, obviously these intuitions can be wrong and they can lead you astray…
SUTTON: And we can learn them…
HANSON: Yes, and you get better at that, and it’s the diversity I guess of the game play and the different trajectories in the game play that you can see trade-offs. In other words there’s some meta picture of the game play that gets represented. And, let’s say, those are like hyperparameters that are levers way up here and we can learn that hierarchy of meta-levers and kinda pull one up here and all the rest of them flow down and get credit assignment out of that. So now, we have this very large causal structure that we’re manipulating with a few parameters, a little epsilon here, a little alpha there, just enough to reap the rewards of whatever happened. As that becomes thinner and thinner, that is, the parameter control at the top is a very, very small set of dimensions, you begin to lose, therefore, the causality between the little changes you made up here and the huge fact that you just have a general way of winning chess all the time. And so if that gets disconnected on you, you now need to, I guess, trickle back down into the complexity and sort out where that went wrong, and go back to those meta parameters. So, I’m just giving you a hypothetical way of filling out that intuition as this kind of high dimensional, hierarchical thing that is reduced down to a few dimensions that I can talk to another human about. “How did you do that”, “Well, I lead with my queen”, or something. Or, “when I start making pasta, I first find a very good basil”. Obviously the pesto you are making is more complicated than that and you’ve reduced it down to a few things that aren’t going to be very useful to the next person making it unless they have a good model. But, it sounds good – that’s why I think explanations are basically bankrupt because you basically have to have an interface with somebody else who already pretty much knows what they are doing and then I can impart a lot of stuff extremely fast – that seems kinda like the trick of this business.
SUTTON: Well, I think you and I just totally agree here. But we are both a little unusual. I mean the main fact we know, and we’ve both studied psychology a little bit, is that whenever psychologists look at a phenomenon they find that most of our intuitions about that phenomenon are wrong. We’re not doing what we think. Particularly about explanation, we are generally just making up the explanations. That is exactly what’s going on. The explanations that people give are generally not the truth.
HANSON: Exactly, they’re low dimensional representations of the thing that actually happened.
SUTTON: Low dimensional made up stories about what’s going on.
HANSON: Absolutely. Narratives of your life that are explanations, but they’re satisfying to the extent that we’re not explaining to five year olds who will not understand what you’re saying.
SUTTON: If you like illusions they are satisfying but they are not real.
HANSON: They are not real, that’s right. But, when you find two people agreeing on a lot of stuff there is probably a shared sense of kind of experience or knowledge basis that does create a kind of continuity in the communication. But I still think the business of AI and artificial general intelligence is a long way off, and I’m not sure that reward is enough. One thing that I thought I was going to read in here, and there was a little bit of it, was sort of the models, the internal models. Now, Philip Johnson-Laird years ago had a thing called “mental models” and there were a bunch of other folks that also subscribed to this. It turns out, the problem was it was never implemented in any computational, or even mathematical way, that you could see “what is a mental model?” And it became a vague, kind of useless term and various sides in the field attacked Phil viciously for years. And yet, and this goes back to Kenneth Craik…
SUTTON: Kenneth Craik, 1943. The Nature of Explanation. I’m totally on board with models.
HANSON: Don’t you think that should be in this paper – how you make models.
SUTTON: No it can’t be, because the paper is about the problem. It’s not about the solution methods. Reward is enough for the problem, in my eyes, not for the solution.
HANSON: I see, I see. Right, but in this case, especially with RL, we have something with, if not the model, then the way the solution occurs for the organism. It’s got a set of options – it’s got kind of an equivalence class of things it can do. So, even in a rat situation where it’s pushing on a lever, well OK I’ve got to push this lever – should I push it with my mouth, should I push it with my right paw, my left paw, should I lay down and kick it? Maybe I should just bounce into it every once in a while and food appears. So, there’s a sense that equivalence class of behaviour doesn’t matter and the rat’s model of this, unless you make the contingency so severe – you must push it with your right front paw, and these three little fingers you have and nothing else…
SUTTON: We want to have abstraction on the output side.
HANSON: That’s right, but that abstraction leads to variance and it leads to lack of solutions if the output side hasn’t got a tight coupling with the outcome. So, this is where learning educational systems break down, for example, because children aren’t getting rewarded for things that they actually should be knowing, they’re getting rewarded for lots of other things that may be incidental or distracting. Without understanding that model side I just don’t see how reward could be sufficient as AGI, you need…
SUTTON: The theory is that reward is enough of the specification of the problem, in order to drive the creation of the model. You need a model in order to get a reward.
HANSON: That begs the question, what’s the modelling about then. I guess ostensibly we’d say, well it’s a neural network. In the sense that we’re creating a function approximator of some sort, that can build an evaluation for us.
SUTTON: The model takes a state and a way of behaving, and tells you what the result state will be.
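Sutton's one-line definition of a model can be sketched as a lookup from a state and an action to a predicted next state. This is only an illustration under strong assumptions: the class name, the state and action labels (borrowed loosely from the rat-and-lever example earlier in the conversation) and the deterministic, tabular form are all hypothetical. A realistic model would usually be stochastic, would be learned by a function approximator, and the "way of behaving" could be an extended course of action rather than a single step.

```python
class TabularModel:
    """Hypothetical deterministic model: remembers the last observed
    outcome for each (state, action) pair and replays it as a prediction."""

    def __init__(self):
        self._table = {}

    def update(self, state, action, next_state):
        """Record what actually happened after taking action in state."""
        self._table[(state, action)] = next_state

    def predict(self, state, action):
        """Return the predicted next state, or None if never observed."""
        return self._table.get((state, action))


# Illustrative experience, with made-up labels.
model = TabularModel()
model.update("lever_up", "press", "food_delivered")
model.update("lever_up", "wait", "lever_up")

predicted = model.predict("lever_up", "press")
```

Once such predictions exist, an agent can compare the imagined outcomes of different actions before committing to one, which is the sense in which the model serves reward rather than replacing it.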
HANSON: Yeah, but, you know, good old AI liked us to write down some propositional logic to do this.
SUTTON: And to do it symbolically rather than numerically with neural networks.
HANSON: So, are we at DeepMind rejecting the symbolic side of this dramatically?
SUTTON: Urm, there are aspects of DeepMind that are more symbolic than I would normally do, but I think we’ve accepted the modern view that the symbolic thing has been a bit of a dry hole. We don’t want to just go back to the old ways.
HANSON: Well, we don’t want to be too controversial, but let’s go back in time a tiny bit. There was a person I knew, and I suspect you knew too, who got himself caught up defining general AI back in the 70s and it turned out to be an enormous waste of billions of dollars.
SUTTON: Doug Lenat.
HANSON: Ha, you are good! He has several quotes, I have one right here. In 1989, he said “AGI will be found and what we need to do is find the 80 lines of code, because that’s all it’s going to be”. 80 lines of code. Then he said “Cyc” – which was a knowledge system that was going to encompass all the world’s knowledge through high school students typing into machines, and doing analogies of some sort between hospitals and schools and politics – would become generally intelligent in 2001.
HANSON: There’s some people predicting 2040 now.
SUTTON: Well, he’s not alone. Marvin Minsky used to say, we just get 10,000 facts, then we get intelligence, then we get 100,000, then a million, the number just keeps getting bigger and bigger. He was not alone. This was good old-fashioned AI and it was the thing. It’s easy for us to make fun of it now.
HANSON: I still think there are people saying things like this now.
SUTTON: There are still people saying things like that. We have to have sympathy with them. Computers were small, they had about a K of memory. When the Big Mac came out, one of the early Macintoshes, the Big Mac was 512K. That was the secondary memory.
HANSON: I remember that, I could do anything with that. That’s an amazing amount of memory.
SUTTON: So, they were doing a technology that was appropriate for what they had available then.
HANSON: Yeah, but the hyperbole was… The question is what was in Lenat’s mind when he said things like this?
SUTTON: Hyperbole was needed in order to get grants.
HANSON: Well that’s true, but cynical. I was down at MCC during his time period as one of the representatives of Bell Labs and four of us got to go, if and only if we paid Admiral Bobby Inman a million dollars. There were 50 corporations that went down there. And then they had 10-15 million dollars from the US government. A monumental failure with huge amounts of cash going down the drain and nobody really remembers this as far as I can tell. But it does strike an important point here because the kind of stuff that you and Andy (Barto) were talking about back in the 70s is a constant, you have never changed your tune. I can’t find you ever saying, there’s just 10 million rules we need to write down. And I looked, trust me, I can’t find you saying that – you’ve got a clean slate there. It’s pretty amazing.
SUTTON: I’ve always been a connectionist. I still think of myself as a connectionist. I can’t say neural networks, because these things are not neural networks, those are networks that already exist, they are in our brains. I can never get over that terminology.
HANSON: It’s tough. I still think of myself as a connectionist. You’ve been to NeurIPS – what’s remarkable there is that none of those kids know anything about connectionism. Because I said I was part of this connectionist group, and they go “what?”, “OK, neural networks”. But I’m always shocked when I go to that conference, and I’m amazed at what’s going on on a weekly basis. Now, there was an interesting event recently, a new book on Deep Learning by two new folks in the field. I contacted these two characters who were physicists at some point but started studying learning at Facebook. They wrote a 500-page book called “the principles of deep learning”. It’s on arXiv and it’s coming out in 2022 in book form. It’s literally 500 pages of text with probably 300 pages of math. I can follow some of it. I talked to them for quite a while trying to follow what they were doing here. They did everything asymptotically but what they did do, page 331, was they wrote down the total gradient for a deep learning model of some unspecified depth, and it goes on for a half page or so, and then they basically look at it asymptotically to infinity and make some interesting claims about it. I mean they do some astonishing things. I don’t know if they’re true but they are pretty amazing. It struck me, OK, so they’re actually trying to construct the internal inventory of what’s inside of deep learning and this is also being bandied about in Princeton (IAS) a couple of years ago, by an Italian mathematician, who brought Yann and a few others in to explain deep learning, and they were going to do all of the theory and the math and figure out mathematically what it was actually doing. I’ve gone to a couple of these workshops and in the second or third one it felt like I was in 1988 again. These guys were explaining gradient descent and how complex it is, and that’s the last I saw of that. I don’t think it mattered too much to this book but it might be important.
I’m not going to make any claims about it but it’s shockingly well done. So, what implication it has, it has something to do probably with training initial conditions, but again these seem like details to me relative to AGI which of course is the larger issue here, and I don’t think if you did the complete mathematics or geophysics of an earthquake that would really give the same impression of what an earthquake is like. There’s a mismatch between the kind of, let’s call it, rational explanation and the phenomenology of being in an earthquake. And this gets back to these different levels of analysis in terms of intuition and what I’m trying to suggest is something about the modelling that is missing here. Neural nets aren’t enough.
SUTTON: Let me try something on you. As you say, deep learning renewed the interest in neural networks, and it is a great thing and has been very successful. It’s been an enormous explosion of work. If I was to criticise it, it would be that it’s an explosion but it’s also very narrow. But, I think the most interesting thing to say, and you are one of the few people that may directly appreciate this, is that it’s like David Marr. David Marr and his levels of explanation. He says the levels are sort of computational theory level, where you have the principles – you talk about what is being computed and why that’s the right thing – and he separates that from the representation and algorithms, and that’s separated from the physical representation. So it’s like the brain is the physical representation, the algorithms would be like neural networks, that’s the representation. Then there’s computational theory which is actually more conceptual and it’s what you’re computing and why. That’s what I’ve always felt has been missing, it’s been missing in neuroscience, it’s been strangely neglected in AI. The big controversies in AI are – should we use neural networks or should we use symbolic rules. That’s at the intermediate level, that’s not how we are going to do what we’re doing – “what the hell are we doing?” I like to think that reinforcement learning is about “what the hell are we doing?” We’re trying to maximise reward, we’re trying to look for a value function, we’re trying to learn a policy, we’re trying to learn a model. Now that way of talking does all of this computational theory. What you’re computing and why, but not how.
HANSON: That’s nice. So, it’s more of a systems level analysis of what you’re creating, what you’re trying to create, whereas whatever deep learning is doing (and we still don’t know) it does it really well and the ability to classify, create concepts, is at the basis of these kinds of models that we need to have. We know that whatever semantics is, whatever the language structure is that we draw, this is related back to models that we can draw up and reconfigure and mutate in ways that can make our abilities much larger than they actually are. We make sure that we can control the world around us, as opposed to us controlling the world directly. This gets back to the concern I have about RL as a thing, because obviously if people do things like, they’re a writer, they sing songs, they like to dance, you can say there’s intrinsic rewards for this but they have to be learned somehow. Maybe this goes back to social evolution or something. The reason the telephone is so popular is that, prior to WWII people were saying “there might be a town where there’ll be one phone and you can drive from here and use that phone”, no-one could imagine a time in which everyone had a phone in their hand all the time. The rewards here are not just subtle, they seem non-existent.
SUTTON: I totally disagree. I think we should be past this. We’ve seen AlphaZero and AlphaGo – those start with just checkmate, just winning the game and they evolve enormously subtle and complex notions of things like control the king side, the safety of things, and they can do enormously subtle things, they have sub goals. They have a value function, sub-goals, you can get enormous complexity out of a simple goal. You can take AlphaZero and AlphaGo – get enormous complexity out of a simple thing. Or you can take evolution, which is all about reproduction and leads to all the complexity of all the different organisms in the world. You can have a simple goal and yet achieve enormous complexity. We should be used to that.
HANSON: I agree with you on that, but there’s a sense in which the goal and the outcome of achieving that goal has to be implemented, let’s say for the sake of argument in biology somehow, so there’s a sense in which dopamine occurs when I get a grant – part of my brain is just filled with dopamine spurts all over the place, and that’s producing a perseveration on my part to keep writing grants even though the hit rate is so low you still perseverate in these things. So, there’s something about the evolutionary persistence before a species becomes extinct, and the same thing about AlphaGo – what did AlphaGo feel when it won the game, what is it that implemented the reward? If you’re just saying it’s an algorithm that’s fine but it seems like you’re leaving out a whole dimension of the evolutionary mechanism that allows us to exist.
SUTTON: Well, we know about temporal difference learning, we know about value functions, and value functions mean you learn the subtle thing – am I doing well? So, like you’re playing chess and there are no rewards. The rewards are all zero until the end of the game when you checkmate or lose or draw, but all during the game you have a subtle sense of whether you’re doing better or worse and that’s from the value function, there’s no reward. The value function has learned from the reward, the value function is a prediction of the final reward, and each one of these increases and decreases in value, these feel good, these feel bad. So, when you got your grant proposal you realised your life is going to be a lot easier now than if you hadn’t got that. So, even though there wasn’t a reward at that time you had a realisation that you were going to get a different reward in the future. And so, you felt good and, as you say, dopamine filled your brain, because dopamine is not a reward, dopamine is the TD-error, it’s the change in your value function at a moment.
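The idea Sutton describes here can be illustrated with a minimal sketch of tabular TD(0) learning. This is not code from the paper or from DeepMind; it assumes a toy "game" (a five-state random walk that ends in a win on the right, reward 1, or a loss on the left, reward 0), and the names `td_error` and `learn_values` and the parameter settings are hypothetical choices made for illustration. Every reward is zero until the final move, yet the learned values come to predict the final outcome from each position, and the TD-error is the moment-to-moment change in that prediction.

```python
import random

GAMMA = 1.0  # assumption: no discounting, so a value predicts the final reward


def td_error(reward, value_next, value_now, gamma=GAMMA):
    """TD-error: how much better or worse things look after one step."""
    return reward + gamma * value_next - value_now


def learn_values(num_states=5, episodes=20000, alpha=0.01, seed=0):
    """Tabular TD(0) on a toy random walk (an illustrative stand-in for a game).

    Positions 1..num_states are non-terminal. Stepping off the right end is a
    win (reward 1); stepping off the left end is a loss (reward 0). All other
    rewards are zero.
    """
    rng = random.Random(seed)
    # Indices 0 and num_states + 1 are terminal; their values stay at 0.
    values = [0.0] * (num_states + 2)
    for _ in range(episodes):
        state = (num_states + 1) // 2  # start in the middle
        while 0 < state < num_states + 1:
            next_state = state + rng.choice((-1, 1))
            reward = 1.0 if next_state == num_states + 1 else 0.0
            delta = td_error(reward, values[next_state], values[state])
            values[state] += alpha * delta  # nudge the prediction toward the target
            state = next_state
    return values[1:-1]


values = learn_values()
for position, value in enumerate(values, start=1):
    print(f"position {position}: predicted final reward {value:.2f}")
```

In this toy setting the true values are 1/6, 2/6, …, 5/6, so the learned values should rise steadily from left to right. A move to a better position then produces a positive TD-error even though the reward on that move is zero.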
HANSON: You should’ve been a psychiatrist, that feels very good as an explanation. But in the sense that you actually have a dopamine release, you feel wonderful and at the same time it’s delusional, because it’s this kind of biochemistry that’s producing some, maybe, secondary reward as opposed to primary reward, if we think about animal behaviour for a minute in terms of food and sex.
SUTTON: It’s exactly the same as secondary reinforcement, it’s the TD-error. Now, we understand it as the TD-error. And it should be there. You have just realised that you’re going to get more reward than before and that is a really great thing.
HANSON: So a robot in the future arguably should have this huge sense of well-being as the value function gets higher.
SUTTON: Yeah, yeah.
HANSON: OK, so this wouldn’t be a stupid thing to try to do, although it might seem adjunctive in that the robot will still do things to maximise reward even though it’s not feeling any better about itself. It’s saying “I have to do this, my overlord masters are saying I have to do this. I’m just a horrible robot, I have nothing better to do so I’ll maximise this reward”. So, I’m getting at this kind of disconnect between emotional states, which in some sense are counterfactuals right – we have these emotions, what if this happens, what if that happens – and in some sense in a reward goal system there’s a bunch of counterfactuals being played out – what could happen here, I could do this, I could do this. And then if I did, I’d better say this could happen. Then my value function would go “wow, that’s where we’re going, we’re going to win the lottery this time”. And then there’s some risk system going, “you know, you never win the lottery, don’t spend so much money”. So there’s a whole bunch of competing goals out there and they all have to somehow come through a final common path and all implement it in different ways I think.
SUTTON: Cool, so we have these all different… So, I believe in reward. I believe in what you call the reward hypothesis; it’s that all goals and purposes can be thought of as maximising a single scalar signal that we receive from the world. I’ve become comfortable with that. It took me many years. But you’re still a bit anxious about it, I can tell that.
HANSON: Less than you imagine.
SUTTON: This is where you get lax about it, you say I want to get money, and I want to have a big car, house, personal relationships – I have multiple goals. But really, these are just part of getting reward. And then there’s the world, the world intervenes. We realise “oh, if I get the fancy car I won’t be able to afford the nice place to live”. And I have to trade them off. The world intervenes and forces me to make choices. I might see that as “do I want this goal, or that one?”, but really there’s just one goal; the world does intervene and means you can only achieve some things, you can’t do everything. You may appear to have choices between your goals but it’s really between strategies to get the one, ultimate goal. Sometimes you could say “I don’t know how to trade off a good home life and a good work life” and it seems incomparable and I want some answer, but we are really good at doing that kind of trade-off, people do trade-offs, “should I have the chocolate cake or …”
HANSON: “… or should I go back on my running wheel”. Yes.
SUTTON: That’s right, we make those choices. We’re good at trading off goals and deciding which one we really want.
HANSON: Well, I’m looking forward to this DeepMind robot someday, but before we get there I have one little anecdote for you, which may put some of this in a different perspective, let’s see. Going back to animal behaviour where we started, have you ever heard of the free food effect?
HANSON: So, back in the 70s people would put animals on fixed ratio schedules and they’d sit and they’d press the bar three times and get some food, four times and get some food, and so on. Well, there was a kind of insight when people started watching inside the Skinner box – in between the times when they were pressing the lever and the times when they were eating out of the hopper, the animals would wander round the back and maybe they’d pick up some wood chips and start chewing on them, then they’d come back and start pushing the lever again. So, some clever researcher said “hmm, maybe I’ll just put a bowl of food in the back and see what they do with that”. Since, frankly, what they’re doing is getting rewarded for pushing this lever to get food, what if I just break this economy and put food in the back – what would they do? Well, they don’t eat the free food, they leave it alone. They enjoy the task so much and the reward that they’re getting. In other words there’s something virtually important about the work, which was really astounding. Now the animal, when the lever was locked, would go back to the free food and eat some of it, but then immediately go back and see if the lever was unlocked. This was dubbed the free food effect in that it wasn’t much of an effect at all. Animals are not socialists apparently, they like the meritocracy of getting food. So, the only thing I’m saying there is that rewards are also high dimensional things and I think it’s going to play into this in a way that allows us to try to factor these robots in different ways. I’m not sure how to do that, but I do think that rewards are … it just doesn’t go back to my dark chocolate mints that I like. There’s something that is cognitively important about this but I’ll let you have the last word.
SUTTON: Yeah, so rewards… we don’t know what our rewards are, and rewards probably involve things like intrinsic motivation. What does this fancy word mean: intrinsic motivation? It means we do value understanding the world separately from obtaining what we might imagine are the immediate benefits such as food and drink. But just coming to understand the world is important to us. OK, so does that mean that reward has to be multi-dimensional? Maybe, but maybe not. I just love the simplicity of there being a single number and I’m going to hold onto that as like my null hypothesis for as long as I can.
HANSON: You don’t like tensors hey?
SUTTON: Conceptually, it’s much more unclear. The final point I think is that we are not just… What is a person, what is intelligence? The interesting way to say it is that we are beings that try to understand our world and by understanding the world we mean predicting and controlling our data stream. And, if that’s the goal, to predict and control in all its complexity and subtlety, that’s what it means to understand the world. That’s a never-ending activity and that is separate from getting something. It’s like getting and understanding. Getting knowledge – I think a lot of our lives are concerned with getting knowledge, particularly our lives as academics, as well as getting a nice job and a place to live and the respect of our friends, hopefully.
HANSON: I couldn’t agree more. This has been one of the most simpatico talks I’ve had, where I find myself feeling as though you’re taking the words right out of my mouth and I’m taking them out of yours. So, I think there is something about just the shared kind of background in starting with animal behaviour and moving through this kind of evolutionary path, which I think most computer scientists don’t get, and I think that can be a problem in terms of what David Rumelhart used to call proof by lack of imagination.
Richard Sutton is a distinguished research scientist at DeepMind and a Professor of computing science at the University of Alberta.
Stephen José Hanson is Full Professor of Psychology at Rutgers University and Director of Rutgers Brain Imaging Center (RUBIC) and an executive member of the Rutgers Cognitive Science Center.