In this episode, Stephen Hanson talks to Yoshua Bengio about deep learning 2.0, consciousness, neural networks, representations, explanations, causality, and more.
Hanson: OK Yoshua, thanks for joining me on the AIhub video blog, as I'm starting to call it now. I'd like to start with a little story that I think is relevant to you, probably because you did a postdoc with Michael Jordan in cognitive science at MIT back in the 90s, and I had spent a lot of time there. Four or five years before that period there was a thing called NETtalk, which you may recall. The student, Charlie Rosenberg, was actually at Princeton and I was one of his advisors. So, he came back from a summer vacation with Terry Sejnowski, and George Miller was his senior advisor, so George and I sat down on a Sunday afternoon when he came back from Terry's and he played NETtalk for us. And I went "that's what happened with my DECTALK you took…". He says "no, no, this is a decoding from the one-hidden-layer network, from backpropagation". I said "wow". And George, very intuitively, sat back and said "you know, this is like a trap door, isn't it: things go in and you don't know what's happening and you can't get them back out". Well, that caught my attention. And it was sort of prescient about what is happening now in deep learning representations.
Another story related to this was when I was a program chair of NeurIPS. Yann LeCun, who you also worked with, showed up to the meeting, and he was program chair for what accounted for almost 50% of the submissions, the area that was at that time called algorithms and architectures.
Bengio: Right, right.
Hanson: You may remember that – you were probably reviewer or area chair at some time as well.
Bengio: I was a general and program chair later.
Hanson: Right, right, I knew that… But now algorithms and architectures eventually disappeared. But here's a wonderful thing that Yann did. He showed up (this was at Yale, because John Moody was the general chair), and everyone is remarking that he's a little late, and he says "I can't accept any papers here", and this was like 50% of the papers. He rejected every paper. He just didn't like algorithms and architectures. He thought that it was neither theoretical nor applied, it was just people making crap up, and he didn't like it. God bless him. But eventually he came around and we were able to accept about half of the things he rejected, through other people arguing with him. But what was interesting about that is that, you know, NeurIPS was about neural computation, about neuroscience and cognitive science. In fact, those were the two largest areas in the very beginning of the conference. By the fourth or fifth year algorithms and architectures had taken over, along with some applications – speech recognition of course is the obvious case, it was incredibly popular and there were some very… Alex Waibel was doing just amazing things at the time.
Bengio: I was working on speech recognition during my PhD actually.
Hanson: I imagined that might be the case. It turns out to be extremely interesting now as well. But the question that comes up, or the thing that I muse about when I think about your research (and I think you're unique in this regard), is the way that you have essentially absorbed a lot of different scientific fields and tried to merge them into some coherent whole. Part of this is actually your embrace of cognitive science, or cognitive psychology, and what I know and love as memory systems. So one of the things we study in our lab is in fact implicit/explicit memory, and we've been doing this since the late 70s; my wife (Catherine Hanson), who is a collaborator, did most of her thesis work on implicit/explicit memory, and we have subsequently published a lot of brain imaging work in this area. One of the real confusions about this is that explicit memory is this declarative thing, which is probably more linguistically based and has verbal structure to it. The implicit thing is, you know, like when you make risotto, my favorite example. You put the risotto down, some olive oil, some garlic, and then somehow it gets made, and you go "what! How did that happen?", and you don't really go through a recipe, if you know what you're doing. If it actually works out you don't have a recipe, you just do it. And this is part of the problem, right, because a lot of what we're doing right now is implicit problem solving, it's not explicit. But it does raise the question: what's the neuroscience or cognitive science connection between explicit and implicit systems? I believe this connects back to what you're calling deep learning 2.0. Please, correct me if I'm wrong.
Bengio: That’s exactly spot on.
Hanson: Perfect. OK, well, can you say more?
Bengio: We haven’t incorporated these notions from cognitive science and neuroscience into machine learning, and yet they may actually explain a large part of the gap that we observe between the abilities of humans and machines, in particular when it comes to something that looks unrelated, and that is out-of-distribution generalization. So, one of the big question marks in machine learning in the last few years is: we want to be able to generalize not just on the training distribution, like a test set that comes from the same distribution as the training set, but to different distributions that somehow have to do with the same concepts, although they come up in a different way.
Hanson: You think that’s a human problem… I mean, is that something that we humans deal with as a natural consequence of the distributions in the world?
Bengio: Yeah. I mean, we have to. If we weren’t able to do that evolution would have deleted those genes. So, you know we have evolved a way to deal with those changes in the world. So let me give you an example and you’ll see the connection to higher level cognition and explicit memory and so on. Let’s say you’ve been driving cars in North America only, all of your life. Then you go to London, you rent a car, traffic law is almost the same except for this little detail: you have to drive on the left.
Hanson: You’re on the wrong side of the car too!
Bengio: Exactly. So now what’s going on? It’s very interesting. What’s going on is you can’t just drive by your habits, otherwise you’re gonna have an accident.
Hanson: Right.
Bengio: And you know this. You don’t even need to think about it, but what is actually happening is you keep paying attention to what’s going on on the road in a different way than in normal driving. In particular, you keep in mind “oh, I have to drive on the left-hand side, everything is reversed”, and that allows you to survive. So there is this ability to work with explicit memory of this “rule that has changed”, to think about what you’re doing before acting on impulse, and maybe revise your impulse because, you know, “I was going to do the usual thing and then I would have had an accident”. That’s an ability that allows us to transfer what we’ve learned about driving and handle a change in the things that, here, have been verbally pointed out as different. So the distribution has changed.
Hanson: Well…
Bengio: So humans are really good at that, and we don’t have anything like this in machine learning these days, yet.
Hanson: Let’s back up. Humans I don’t think are very good at this.
Bengio: Well, they’re better than machines.
Hanson: Well, they’re better than machines, in that machines don’t do it at all.
Bengio: Exactly.
Hanson: But, here’s the thing. This touches on all kinds of what I would call sort of third rails in machine learning and AI, and in cognitive science, and it has to do with consciousness for one thing.
Bengio: Exactly.
Hanson: It has to do with conscious awareness – the C word. And this brings up the idea of what an explanation is. And I don’t wanna get too philosophical. I mean, I’ve got a lot of philosophy and linguistics friends with whom I argue about explanations…
Bengio: I’ve been thinking about “what’s an explanation” for a couple of years now, and I think it’s really important that machine learning tackles that. Because it has to do with abstraction.
Hanson: Right.
Bengio: So, just as context, the deep motivation we had for deep learning in the early 2000s, at least the one that drove me, is the notion that we could learn these deep nets where, at the top layers, we would have really abstract representations that capture the kind of abstract concepts that we manipulate to understand the world, including the verbalizable stuff, the conscious stuff. We haven’t achieved that, and if we want to achieve it, and I think we need to in order to approach human intelligence, we need to tackle this question of how we come up with these abstract explanations. And we have a lot of clues, because that’s the part that’s the tip of the iceberg, the part we can see through our inner observation of what’s going on in our mind; cognitive science is trying to do this more formally, by asking people what they’re thinking and by observing brain images. So we have a lot of information; we even have neuroscience information about what’s going on in the brain when you become conscious of something, or pay attention to something or not. That’s a huge amount of information that we’re not exploiting in machine learning.
Hanson: Well, OK, so let me disagree with you, but in a very pleasant way. So I’m sitting in RUBIC, the Rutgers brain imaging center. I have two scanners. We’ve scanned about 25,000 brains over probably thousands of studies. I’m very interested in brain networks and brain connectivity, and we’ve done a lot of graph theory, trying to characterize fairly small systems, cases with 1000 ROIs, or let’s call them nodes. You can then instantiate the network through structural equation modeling or something similar, and you have something you can actually do some dynamics with. You can run inputs through and watch the network change. We’ve done stuff like this, and these turn out to be biomarkers, by the way, for things like schizophrenia or Alzheimer’s, things where you can see over time, in an archive, the brain slowly falling off a cliff. But there’s something that strikes me about the things you said in terms of the cognitive science of it. It never seemed to be the case, and in fact a lot of the neuroscience and brain imaging data shows this, that the implicit and explicit systems are not in close contact in a way that, let’s say from a computer science point of view, would let you copy information back and forth…
Bengio: No, but I don’t think that’s how it is. I don’t think it’s like we have a part of our brain that does system one, implicit stuff and then the part of our brain that does system two, explicit stuff.
Hanson: Right.
Bengio: The much more plausible theory in my view is related to the global workspace theory: we have this communication bottleneck between different parts of the brain. The stuff that gets communicated across the brain, that’s what we have in current working memory, and you know it gets updated every half a second or something. So in a way it’s all a neural net and it’s all system one. But there’s this bottleneck there, and it’s the stuff that goes through that bottleneck, the information that gets exchanged in some way through a global coordination, which is probably a dynamical system: the different parts of the brain are kind of coordinating on some interpretation or something.
Hanson: I like this picture. I like this picture a lot.
Bengio: Whatever the form, that’s the part that we call system two or explicit. Everything else is happening under the hood.
Hanson: Right. Well, except the part that… well, I agree with all that, and that’s very nice. I mean, one of the things that I’ve sort of learned over the last almost 20 years of doing brain imaging is that, when we talk about what the brain does, it has a thing called the resting state. I don’t know if you’ve ever heard of this, but it’s basically a kind of background tonic state that has some sort of low-frequency structure that then communicates. It’s almost like a priming system; it’s waiting for you to decide “I’m gonna pick up this cup and drink some coffee”. All of a sudden the resting state recedes, this dynamical system moves back and says “go ahead”, and parts of it are picked off to actually implement picking up the cup. And as soon as I’m done with the cup and I fall back into some kind of alpha band, then I’m kind of tired. But the thing that is then a problem here is the explanation part of this. I think humans are really bad at explanation. I mean, one of the reasons professors spend so much time trying to figure out how to explain things is because it’s hard to explain things to people who don’t know what you’re talking about. And explanations, you’ve got to realize, not in the academic world but in the real world, people say “well, how come that light came on?” “Well, someone must have switched something.” Not a good explanation, but it’s satisfying. So there’s a kind of Herbert Simon sense of satisficing (non-optimal, but OK) here, in the sense that we like explanations, but generally they’re not very useful and they don’t do anything. They mainly make you feel better about a situation you don’t understand. You know: why is there light, why are there shiny things floating through the sky out there that look like the aurora borealis, what is that? And you say “well, that’s, like, space clouds”. Not a good explanation, but I’m just saying that’s what parents tell their children…
Bengio: I think you have to be careful about the definition of explanation. The way that you seem to refer to it is like in classical AI, where we have the explanation. But if you think in probabilistic terms, if you think about the Bayesian posterior over explanations, then first of all there’s an uncountable number of possible explanations for the stuff you’re seeing. It’s just combinatorial. And of course not a full explanation, just a piece of it, comes to our mind: “somebody maybe turned off the light switch” (but really there is much more going on in the scene). So we have this very partial view, and it’s only one of many possible explanations. But we have this amazing machinery in our brain that imagines stuff. Some piece of explanation is gonna come up to our consciousness. And, you’re right, you could think of it like our brain is making a hypothesis. It’s not sure, and sometimes we’re too sure of our thoughts, but really we need to think of what’s going on here as hypothesis generation, which helps because it helps us to connect the dots. I mean, sometimes it’s useless, right, but sometimes it’s super important. That’s really one of the strengths of humans as scientists. We come up with these explanations and sometimes they’re wrong. We can do experiments and we see that it doesn’t work, or somebody tells us there’s a hole in our reasoning, whatever. OK, because they’re not perfect. Our thoughts are not perfect explanations, they are probabilistic…
Hanson: It’s also arguably incomplete. I mean that’s part of the problem.
Bengio: It has to be incomplete.
Hanson: And explanations, you know in the AI sense of having the explanation, that’s not what’s happening with humans.
Bengio: Exactly, I agree.
Hanson: Humans actually just get by most of the time, and they get by on this kind of business of “hey, how did that happen, blah, blah, blah”, and, you know, that’s not an explanation, it’s a lot of people babbling together. OK, so that does leave your theoretical underpinnings a little tricky here, and it brings up another C word, by the way: causality. Because the one thing that we know biological systems love to do, we love to do this, is we want things to be causal. “Why did that happen, who did that, what’s going on?”
Bengio: It’s an obsession.
Hanson: It’s an obsession, and it’s very… you know, my cat actually believes in causality up to a point, and then he gives up. He says “that’s it, where’s the food. I’m done doing your work problem, I don’t want it. I don’t wanna solve problems, just feed me”. So there’s a sense that we’re trading off, and I’m half serious about this, something that has to do with evolutionary viability or reproductive success against causality. The fact that we have to get through the world without being run over by cars or attacked by IRS agents.
Bengio: So wait, I disagree a bit here.
Hanson: Good.
Bengio: I agree we have a very strong inductive bias to look for explanations that are causal. And very often they’re wrong. But sometimes they’re right. But it’s not just about being wrong or right. It helps us organize our world model. So my friend Yann LeCun thinks that the biggest thing we miss in machine learning is a good way to construct a world model. And I think that this causality business is essential, and it’s connected to the out-of-distribution problem that I was talking about at the beginning.
Hanson: Right.
Bengio: So, when I go from North America to London, the laws of physics didn’t change. The way people behave, for the most part, didn’t change. There’s this little part of the whole system that changed, and that’s just this particular traffic law. So in order to be able to generalize out-of-distribution like this, we need to be able to break down knowledge in our brain into what causality researchers call causal mechanisms. Knowledge about the world, in the causal picture, is broken down into these fundamental mechanisms about how cause and effect are related to each other. Of course you could have multiple causes and so on, but our human causal models are very sparse. We can’t even imagine something where you have 1000 causes that explain an effect. It doesn’t fit in our conscious mind. We are constrained to have these very sparse causal mental explanations, sparse causal graphs. And this is an inductive bias, it’s a prior which is often not appropriate, ’cause the world is more complicated than that. But here’s the trick: there are aspects of the world for which it works, and there are aspects of the world for which it doesn’t. So let me try to be more clear. Aspects of the world that come to our consciousness, that we can verbalize, are, by construction, by the constraints of how our brain is designed, limited to these very sparse dependencies between entities that typically have some causal relationship between each other. So we can talk about “oh, the cat was chasing the squirrel, and the squirrel was afraid and climbed the tree”. That’s the sort of thing our system two model of the world is able to handle. But of course in that scene, in that little video, there’s a lot more going on, about how they run and the physics of what’s happening and so on. And that’s hard to verbalize because it doesn’t fit that sparsity assumption, but we still understand it. It’s just happening at this implicit, or intuitive, level, and our brain is able to simultaneously embrace these two sources of information for modeling the world and take the right decisions, especially when our life is at stake. So you can make assumptions about the world, but you can mitigate the fact that they’re often not correct by having this system one vs system two division: for the things where my assumptions don’t apply I use system one machinery, and I only talk about the abstract stuff that I can verbalize, where these sparsity and causality assumptions make sense. The rest is just gonna be under the radar of consciousness.
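To make the modularity idea concrete, here is a minimal Python sketch. It is purely illustrative, not taken from Bengio's work, and all the function names are invented: knowledge is factored into separate causal mechanisms, and the London-style distribution shift is modeled by swapping exactly one mechanism while everything downstream is reused unchanged.

```python
import random

# Toy structural causal model for the driving example. Each variable has its
# own mechanism; a "distribution shift" replaces exactly one mechanism while
# the rest of the model transfers unchanged.

def weather():                       # exogenous cause, same everywhere
    return random.choice(["clear", "rain"])

def traffic_side_north_america():    # the one mechanism that differs by country
    return "right"

def traffic_side_uk():
    return "left"

def steering(side, w):               # downstream mechanism, reused as-is
    caution = 0.9 if w == "rain" else 0.5
    return {"keep_to": side, "caution": caution}

def drive(side_mechanism):
    return steering(side_mechanism(), weather())

# Moving from North America to the UK swaps one sparse, verbalizable rule;
# everything learned about weather and steering carries over.
print(drive(traffic_side_north_america))
print(drive(traffic_side_uk))
```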
Hanson: Well, that does sound good. But I think there’s still this issue of explanation that in the version you’re telling seems to be…
Bengio: A partial explanation…
Hanson: Yeah, they seem to be like a support vector, in a way, providing just enough local support to get some data explained. And the problem of course is that when you’re explaining things to another human (typically; unless you’re obsessed with your cat and you’re trying to explain to it all the time, which is not a good thing…), there’s a sense in which you have to do some kind of mental calculus about what they might know, sometimes talked about as theory of mind, like…
Bengio: Yeah we’re not always very good at that.
Hanson: What is it that Yoshua knows, that I know, such that he and I can talk? Whereas if I’m talking to a student who barely knows calculus, I’m gonna have a really hard time describing the predator/prey (Lotka-Volterra) dynamical system to them without starting with, you know, foxes and rabbits, and them sort of oscillating and eating each other, or why they are either starving or being eaten. So there’s a hook here, and I think you were talking about a kind of minimal amount of explanatory variance, or explanatory structure, that needs to be there. But then, why do we need any larger explanation, other than, let’s say, my specific experience with a car in London, or my specific experience making risotto? I don’t need to explain. I can’t explain how to hit a tennis ball, or how to make risotto, to someone who’s never done it. They’re just gonna have to do it a couple hundred times.
Bengio: Absolutely. Because the communication channel between humans is very limited, just like our workspace bottleneck. Maybe it’s not a coincidence that language and the workspace have the same kind of limitation. So there are these aspects of the world that can fit into that bottleneck. They often have a discrete nature, and to a large extent they go into language. I think there’s also a benefit to this bottleneck and discreteness… it’s not just a tool for communication, though clearly it is that; it’s also for organizing your own understanding of the world. So here’s the thing that bothers me about the current state of the art of deep learning: these humongous neural nets are like one big homogeneous soup of everything connected to everything.
Hanson: The blob.
Bengio: Yes, it’s the blob. OK, I’m exaggerating, because the state of the art now has a lot of attention mechanisms, which by the way my group introduced in 2014/15, and they’re extremely useful. But still…
Hanson: We should come back to attention in a minute, because I think that’s…
Bengio: It’s related to attention because attention helps to break down knowledge into these pieces that you can focus on, pieces that are gonna be easily re-composable, including pieces that are not verbalizable. You need a sort of guidance for what the right pieces are, for factorizing knowledge. This consciousness stuff helps us to do that. You don’t even need to have a full understanding of how you walk, but you do have a word for it, right. So breaking down knowledge into the right pieces is important for out-of-distribution generalization because it allows us to pick, on the fly with attention, the right pieces of knowledge, and maybe form a sentence. A sentence, of course, is the tip of the iceberg of what you really know; it just helps to organize your knowledge into the right pieces and select the right ones on the fly.
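As a rough illustration of attention picking out pieces of knowledge on the fly, here is a generic dot-product attention sketch in Python/NumPy. It is not Bengio's actual architecture, and the sizes and names are invented for the example.

```python
import numpy as np

# A query ("what am I attending to right now?") scores a set of key/value
# slots, the "pieces of knowledge"; a softmax makes the pieces compete, and
# the output is a weighted recombination of the selected pieces.

rng = np.random.default_rng()
d, n_pieces = 32, 5
keys = rng.normal(size=(n_pieces, d))     # one key per knowledge piece
values = rng.normal(size=(n_pieces, d))   # the content of each piece

def attend(query):
    scores = keys @ query / np.sqrt(d)    # relevance of each piece to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax competition between pieces
    return weights, weights @ values      # soft selection and recombination

query = keys[2] + 0.1 * rng.normal(size=d)   # a cue resembling piece 2
weights, readout = attend(query)
print(weights.round(3))                   # most of the mass typically lands on piece 2
```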
Hanson: This does bring up the linguistic aspect of all this too, because there is a philosopher here at Rutgers who had a love-hate relationship with Jerry Fodor, and he would often go to a talk where there’d be pizza, and then he’d sit and yell at me about connectionism. And he says “you know there’s only one damn good connectionist out there, and that’s that Geoff Hinton guy”. And I said “uh huh”. He says “well, maybe you”. But he had this theory about the language of thought, and on the face of it it’s relevant, in that he’s saying that our consciousness is really about the development of language and the way language is organized in the brain, and the way that it ties back to our explicit/implicit interactions. So we think of the explicit/implicit system as a set of interactions. The question is, what’s being passed back and forth? See, I don’t think that you’re arguing for hybrid AI here. It’s not like we’re gonna basically build some kind of theorem prover and then stick deep learning on the side of it and things will work out. No, they probably won’t work out very well if you do that. But clearly you can see behavior that does seem organized in logical ways and has structure to it, and that obviously gets at this implicit/explicit interaction. But I don’t see how in your theory that can actually emerge. In other words, if you’re not willing to design this in, as an engineering tactic, how does it emerge from something that’s just exposed to distributions and data and has an architecture? Unless you think the architecture itself is part of the programming language?
Bengio: Yeah, I mean, I think the traditional view in artificial neural nets connected to theories of the brain is that you have inductive biases that, say, evolution has put into us, and they come in different places: they come in the architecture, but they also come in the training objective, in the learning rules. So the sort of stuff I’m developing right now in my group has both of these. In other words, our training procedure is different from the standard end-to-end backprop on some objective function. There is an objective, it’s just intractable.
Hanson: There is what?
Bengio: There is an objective, but it’s intractable.
Hanson: OK, I see.
Bengio: Normally, in typical supervised or unsupervised learning, you can write an objective function and you just take the derivative with respect to the parameters; that’s the standard way. But actually we have examples where you can’t do that. In RL, of course, you can write it down but it’s intractable; you have some reward. If you look at things like Boltzmann machines, well, there is an objective function but you can’t optimize it directly… so the sort of stuff I’m talking about is more of that kind, where you have an objective but you have to do some sort of sampling, maybe like the stuff that comes to your mind that looks a bit random, in order to get a training signal that you can use to train everything.
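The general pattern Bengio is pointing at, an objective you can write down but can only optimize through sampling, can be sketched in a few lines. This is not his group's training procedure, just the standard score-function (REINFORCE) estimator on a toy problem, where the sampled actions play the role of the "random-looking stuff that comes to mind" providing a training signal.

```python
import numpy as np

# Objective: maximize E_{a ~ pi_theta}[R(a)]. The expectation is treated as
# intractable, so its gradient is estimated from samples:
#   grad ≈ (R(a) - baseline) * grad log pi_theta(a)

rng = np.random.default_rng()
reward = np.array([0.1, 0.2, 1.0, 0.2])   # only revealed through sampling
logits = np.zeros(4)                      # policy parameters theta

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

lr, baseline = 0.1, 0.0
for step in range(5000):
    p = softmax(logits)
    a = rng.choice(4, p=p)                # sample: a "thought" comes to mind
    r = reward[a]
    baseline += 0.05 * (r - baseline)     # running baseline for variance reduction
    grad_logp = -p
    grad_logp[a] += 1.0                   # d log pi(a) / d logits
    logits += lr * (r - baseline) * grad_logp

print(softmax(logits).round(3))           # the policy concentrates on the best action
```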
Hanson: OK, well.
Bengio: Again, I agree that it’s not a clean separation between the explicit and the implicit. Such a clean separation, or hybrid, is not gonna work, for a number of reasons. One reason is that one of the failures of classical AI is search. Search gets really hard in higher dimensions. And neural nets are really good at guessing good solutions, so they can do that. That’s what AlphaGo really does. So if you’ve got a neural net that tells you… well, you don’t need to actually try a zillion possible games, future trajectories.
Hanson: Try this one, it has a high probability. Right, and I think what you said before, in this kind of Bayesian probabilistic approach, is that there are things that are likely, and the world will present itself in this sort of modal way. It’s not like a flat space where you have to search everything to find your keys. They’re going to be in, like, exactly three places. And those kinds of priors, I think, are something you said earlier this implicit/explicit interaction is getting at, which I think is crucial to what you’re trying to do, at least from what I can tell. And it’s sort of: where do these little tiny explanations come from? Now, there is a lot of older research on memory, and I’m sure you’re probably aware of it. One very famous Canadian psychologist, Endel Tulving, who I have known for many, many years (I think he’s retired now), had come up with, in fact, these different kinds of memory systems. He and I were on an advisory board for like 10 years and he would just come to me and say “hey, here’s the hippocampus, Steve, if you look, this is where episodic memories are coming from, and…”, so he had this very tight view of the way the memory systems are organized. And the one thing that people don’t talk about in AI and deep learning is his notion of episodic memory. I think that’s kind of crucial here, because one of the things you find people doing is, they don’t remember everything. In fact one of the best things about memory systems in humans is that we forget things. And that’s not a bug, that’s a feature. Let’s go to the inverse: suppose we remembered everything, every episodic lunch you ever had, all the details. Even though you always have lunch with the same three graduate students every day, you don’t remember every one of those lunches specifically. You don’t remember their clothing, you don’t remember what you wore, you don’t remember where you had lunch, or maybe you have lunch at the same place. So what happens is there’s an abstraction, and sometimes this is kind of a cortical abstraction where you basically see the memory dissolving, and then it goes into some cortical region. It starts out somewhere near the hippocampus, but it gets filtered and kind of crystallized, compressed into a zip file of sorts in the cortex… there’s a little crystal, and then whenever your student says “do you remember the lunch, we were talking about the thing and the thing and then the person dropped something over there”, all of a sudden that little crystal will pop back down and unfold itself. And that turns out to be where implicit/explicit interactions are more likely occurring, at that episodic level, which is usually not represented in this story in general. And frankly, all we have from a memory point of view, in terms of our own individuality and personality, are these episodic things. Sometimes we generalize them. They become metamorphic.
Bengio: I think thoughts can fall into this category. It doesn’t have to be…
Hanson: Yeah, absolutely. Sitting around thinking, and this in fact is also being played out, because you are unfolding something that is episodic, and it probably is grounded, to use our friend Steve Harnad’s grounding idea, which I never quite understood exactly… he explained this once to me on an airplane and I said “you’re not gonna talk about this symbol stuff at the meeting (Los Alamos)…”. He says “yes”. And in the end it worked out beautifully as far as I can tell, but I still don’t quite understand how it happens. But I think it’s important that there’s some sort of crystallization, and I wouldn’t call it symbolic, I’d call it episodic. There’s something about the episodic stuff that turns out to be important, and it gets abstracted, and that’s the place where we can compute, if you will. That’s where we’re computing stuff.
Bengio: Yeah, so memory, I think, is a really important part of the system that machine learning hasn’t paid much attention to, I agree. There’s a little bit of it: for example, in RL most people use what are called replay buffers, where snippets of task experience can be replayed to train the policy or the value function. But it’s not enough… so, for example, one thing that’s missing in replay buffers, but that you see in memory-augmented neural nets, connected to the examples you gave, is that you have a very selective choice of what you bring back from memory. It has to be relevant in some sense. One way I think about this is that every memory is competing against all the other memories…
Hanson: That’s … exactly, I think that’s right. I think that’s right.
Bengio: And if it’s too weak, it doesn’t have a big voice, maybe because it was not a very emotional memory or something, so it never wins the competition, even if you look for it. But if somebody spells out exactly the right words about this event, then suddenly it has enough support and it emerges into your consciousness.
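A small sketch of that competition, again illustrative Python rather than a model from the literature (the salience vector and the threshold are invented for the example): each stored memory is scored by how well it matches the cue, weighted by its own strength, and only a decisive winner "reaches consciousness".

```python
import numpy as np

# Memories compete to be recalled: a weak (low-salience) memory only wins
# when the cue matches its content very precisely.

rng = np.random.default_rng()
d, n = 16, 6
memories = rng.normal(size=(n, d))                     # content of n stored episodes
salience = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.2])    # memory 5 is weak

def recall(cue, temperature=0.1, threshold=0.5):
    match = memories @ cue / (np.linalg.norm(memories, axis=1) * np.linalg.norm(cue))
    scores = salience * np.exp(match / temperature)    # strength-weighted competition
    p = scores / scores.sum()
    winner = int(p.argmax())
    return winner if p[winner] > threshold else None   # only clear winners surface

vague_cue = memories[5] + 2.0 * rng.normal(size=d)     # a vague reminder of the weak memory
exact_cue = memories[5] + 0.05 * rng.normal(size=d)    # "exactly the right words"
print(recall(vague_cue))    # the weak memory usually loses here, or nothing surfaces
print(recall(exact_cue))    # with a precise cue it gets enough support to win (index 5)
```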
Hanson: That’s right. Well, there’s a good example of that from a disease argument. If you look at mental illness, particularly schizophrenia, one of the things that happens, just prior to the first episode, which is probably in the teenage years (and you see this behaviourally; psychiatrists use it as an example all the time), is that there’s kind of a deep well, an organization around a very specific episodic thing. It could be a music event, it could be a girlfriend, it could be… you don’t know, and what’ll happen is, whenever that person is stressed, they’ll always go back to that well.
Bengio: I see.
Hanson: And then they perseverate around that well constantly, and the kinds…
Bengio: The more you recall something, the more it gets easier to recall it…
Hanson: That’s right, the more you increase the probability of returning to this well. So what good antipsychotic drugs do is basically remove that big local minimum that you’re falling into. Maybe by changing something about the dimensionality of the synaptic space, I don’t know.
Bengio: But it could be, I mean the fact that it’s altering a disease also suggests that there’s something else broken that leads to this thing – it’s not just a local minimum. So one idea I have is that one of the factors in deciding what is going to come to mind is how much information it brings about the things you’re currently seeing or the things you’ve seen in the past.
Hanson: That’s interesting, that’s interesting
Bengio: So, if it doesn’t bring any new information, for example if you think about things that happened that are already explained, then the propensity to pick those memories or those thoughts is just very, very small. We’re more likely to think about something that we would not have expected, but that suddenly becomes probable given the new information we’re seeing around us. That’s mutual information between the outside world, the things we’re seeing, and our mental constructions. So the things that don’t add to that information, that are essentially random, probably wouldn’t be selected for coming into the global workspace.
Hanson: Yeah, those are some very interesting ideas. I like that. Let’s go back to architectures for a minute. Let’s get off of psychology and talk about layers.
Bengio: The things we’ve been talking about are the sort of public discussions cognitive scientists may have, going “I think it’s like this, or I think it’s like that”. But what I’m trying to do is different. I’m not a cognitive scientist, I’m a machine learning researcher. So I’m trying to design architectures and training frameworks where these sorts of things we’re talking about will emerge as a consequence of the training objectives and the architecture.
Hanson: Yeah, yeah, I agree. That was always my belief in the beginning, and I had done various kinds of studies. One of the things I did was with a very early autoencoder that learned on the Brown corpus; I called it Parsnip, and it was a single-hidden-layer thing, and we had sensors, and the sensors would learn… it had a whole bunch of interesting grammar behavior, and the problem was that I thought “gee, if I had more layers here, maybe it would do something else.” So I added more layers, and of course all my derivatives went to Ohio, and that was the end of that, I never learned. But I mean, that’s sort of, I guess, what makes GPT-3 and the transformers interesting, in that they are learning something more than just a phrase-structure similarity blob. They’re clearly doing something more complicated than that. And part of this may have a lot to do with what we were just talking about in terms of episodic memory. I think that a lot of what is being learned in the GPT-Xs is this episodic memory, and then what to score as important or not important, possibly based on familiarity and probability. And it’s been…
Bengio: Yeah, but I don’t think GPT has anything like episodic memory in the sense of specific memories or specific aspects …I mean you have attention on the input, that’s the self attention on the words, but there’s no self attention on either old stuff, old memories, or pieces of knowledge. There’s no explicit attention on these things. So I’m not convinced that GPT has these abilities. It’s very impressive but I think it’s missing that explicit attention-driven reasoning layer that needs to be added on top and it’s also missing the grounding, by the way. It’s pretty obvious to me that if you train a machine learning system that only sees texts it’s gonna miss a whole lot of what the world is about. All the implicit stuff is gonna be missing. And just having the words for things doesn’t mean you understand them.
Hanson: Right, and actually that’s an interesting speculation: let’s say the GPT kind of framework is inserted in a robot, this starts to become… I think the experimental work there starts to get more interesting. But I’m not sure I agree. I think that if GPT-3 is exposed to a lot of Wikipedia, and Wikipedia has connections about Romance languages and different kinds of poetry and so on and so forth, those elements will start to reinforce themselves as a semantic network.
Bengio: What about the knowledge of the world that is condemned to stay implicit, that is not verbalizable?
Hanson: Yes, yes, that’s interesting.
Bengio: That’s never gonna be in GPT-X.
Hanson: Well I mean you could certainly train it to have implicit knowledge.
Bengio: It’s not exposed to it. Unless you stick images into it.
Hanson: No, no, I think the most likely application here is some kind of home nursing assistant for me in a few years, which will come in and figure out, well, Steve needs some soup. So there’ll be a whole bunch of implicit cues that the entity, the agent, will have to learn just from me being Steve. The explicit rules it might have gotten from its owners are gonna be useless. They may set up boundaries: don’t kill Steve, don’t tip Steve over.
Bengio: That’s exactly my point.
Hanson: OK, so we agree. OK, fine. But I think it does bring up the issue of this combination of robotics and what is happening with GPT systems, because I think that’s another phase transition in my mind. There’s something going to happen there that’s going to be really interesting. I don’t know what, but I bet we’ll see a headline.
Bengio: It’s an interesting question why it is that we haven’t figured out robotics.
Hanson: Let me go to one last topic here before we stop. Oh gosh, this has been so much fun, we’ve just about talked the time away. Here’s the thing: I wanna go back to something you’ve been thinking about for a long time, and it’s the layers. It’s the depth of these things and what the layers are doing. It seems to me that if I time-traveled back and forth from the 80s to here, I’d say there’s dropout, there’s noise, there’s some vanishing-derivative mitigation, but it’s the same thing, it’s backprop, being run in a big system, and with some tweaks it works. So what’s really different here are the layers. The layers themselves, the thousands of layers, the millions and millions of connections – this turns out to be crucial, and yet we don’t have much of a theory about it.
Bengio: Actually I did write some theory papers about this in the 2000s…
Hanson: I saw those, yes.
Bengio: …about why depth could be useful to represent more abstract things. The idea is that, although you can represent any function with a single hidden layer, for the same number of parameters there are functions that you can represent with a deep network that you just can’t represent with a shallow one. You need a deeper network to do it efficiently.
Hanson: Right, that’s a beautiful result.
Bengio: That’s something we could show theoretically, and what it means is that, maybe, the sort of abstract, high-level things (it doesn’t have to be at the level of conscious processing, but things that are higher up in the food chain here) might be functions of the input that would be very difficult to learn with a shallow network. I mean, you could do it if you had a big enough one, but it might not generalize as well. So there’s a question of generalization. Another way to put it is: if you have the right architecture, generalization is gonna be easier. And our plain one-layer nets have very few constraints. When you say “it has to have this deep thing”, you’re actually putting in constraints. People don’t understand this; they think “oh, it’s deeper, it has more capacity”.
Hanson: It has more parameters.
Bengio: But when you make it deeper for the same number of weights, you’re saying it’s only certain kinds of functions that I’m gonna really like: functions that can be represented through the composition of many steps. And similarly, by the way, for recurrent nets: if you have a longer sequence, then you can represent more complex things. So it’s all about imposing architectural inductive biases that can, if they capture something important, yield better generalization.
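One way to see "deeper as a constraint, not extra capacity" is simply to compare parameter budgets. Here is a tiny PyTorch sketch; the layer widths are arbitrary, chosen only so the two budgets roughly match, and it illustrates the counting point rather than the depth-separation theorems themselves.

```python
import torch.nn as nn

def n_params(model):
    return sum(p.numel() for p in model.parameters())

# Roughly the same number of weights, organised very differently.
shallow = nn.Sequential(nn.Linear(64, 1900), nn.ReLU(), nn.Linear(1900, 10))
deep = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

print(n_params(shallow), n_params(deep))   # both around 1.4e5 parameters
# With the same budget, the deep net can only realise functions expressible
# as a composition of several simpler steps: that is the inductive bias.
```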
Hanson: Right, so the whole idea of the bias-variance tradeoff and the standard statistical view here is just not very useful. And it’s definitely the case, and you may have mentioned this in one of your papers, that more recently people have been talking about implicit regularization. And the regularization that’s occurring seems to be directly related to the number of layers. There is some work, more empirical than theoretical, by a statistician at Berkeley, I forget his name right now, but I’ll remember later (Mahoney). What he did was kind of interesting. He took deep learning networks, like a kind of species that he found in the wild, about 300 of them, and he took the weights at each layer, created a correlation matrix for each layer of weights, and then pulled out the eigenvalues of that correlation matrix to create an eigenvalue distribution. And he did this at different stages of learning – early in learning, in the middle of learning, and so on – through all the dense nets. He found that there seemed to be about five stages. That is, there seems to be a kind of random-noise stage where the eigenvalues are just in a huge bulk, and then a stage where covariances, or eigenvalues, just popped out. It’s almost like, early on, the deep learning is saying “these are interesting things I’m finding, let’s let them go do something for us”. So it’s almost like precursors to feature detection. I mean, they’re just starting to… assuming these are all classifiers. Then there’s a lot of bleed-out in the third stage, and then there’s a flip where the distribution becomes almost hyperbolic. It’s basically flattening out, and all of the information is being pushed out into that tail. So there’s a lot of regularization theory, and I published some very early stuff here in the 80s on regularizers and noise. I’ve been fascinated by noise and regularizers for a long time, and I think that has a lot to do with what this fellow is showing. But then all of a sudden you start seeing a rank collapse. The matrix just goes “pum”, and guess what – it learned it! So there’s this kind of slow curation process, and then something happens and, bam, it shoots off.
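For readers who want to try the kind of analysis described here, a rough Python/NumPy sketch of the basic step, in the spirit of Mahoney's empirical work rather than a reproduction of it (the "trained" matrix is a synthetic stand-in): form a correlation-style matrix from each layer's weights and look at its eigenvalue spectrum, repeating per layer and per training checkpoint.

```python
import numpy as np

def eigenspectrum(W):
    """Eigenvalues of the Gram (covariance-style) matrix of a weight matrix W (n_out x n_in)."""
    C = W @ W.T / W.shape[1]
    return np.sort(np.linalg.eigvalsh(C))[::-1]

rng = np.random.default_rng()
# Stand-ins for two checkpoints: random initialisation vs. a layer whose
# weights have collapsed onto a few directions (low rank plus noise).
W_init = 0.1 * rng.normal(size=(100, 300))
W_trained = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 300)) + 0.05 * rng.normal(size=(100, 300))

for name, W in [("init", W_init), ("trained", W_trained)]:
    ev = eigenspectrum(W)
    print(name, "top eigenvalues:", ev[:5].round(3), "bulk median:", round(float(np.median(ev)), 4))
# At initialisation the spectrum is a featureless bulk; in the low-rank
# stand-in a handful of eigenvalues separate far from the bulk, the kind of
# signature the staged picture above describes.
```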
Bengio: Yeah, there is of course a connection to all the recent work, both empirical and theoretical, to try to understand why it is that these huge networks do not overfit so much. My view on this is simply implicit regularization as you were calling it. The learning dynamics prevents it from overfitting, it stops exploiting the capacity before… as soon as it nails the training data, the weights don’t grow anymore, they don’t need to.
Hanson: But don’t you think this has something to do with what we were talking about earlier in terms of this implicit explicit trade-off, even though I agree it’s not explicit.
Bengio: I think it’s a different meaning of explicit and implicit.
Hanson: Well, except that there’s an abstraction that may be going on. I’m just thinking that at some point, as the layers are evolving, as you say there are so many constraints, it’s not going to basically start fitting every data point it could find. It’s basically fitting interesting data points and then creating these abstract structures. And the abstract structures get to hang around a lot longer. I mean, in the single-hidden-layer networks, one problem was this complexity business about loading data. You had enough capacity with a good representation, but then you couldn’t learn anything: the learning was NP-complete. I think Judd showed that, and others… But this is the same kind of thing. All of a sudden it’s easy to learn. But there is some trade-off with regard to the representations that are being formed, and I think this is still a very open question.
Bengio: There’s so much we need to learn.
Hanson: There’s so much we need to learn. And on that note, I’m going to stop, unless you want to say something else… This has just been… I feel like even though we haven’t met much, it seems like we’ve been talking a long time, it’s just very pleasant to talk to someone who’s got similar thoughts occasionally. Maybe that’s what explanation actually is in this context. But, in any case, Yoshua, thanks again for coming and I hope we can talk again sometime and look forward to running into you in person instead of on a small TV screen or in a mask!
Bengio: Thanks for the discussion, it was very enjoyable.
Hanson: Alright well you take care, bye.