In this episode, Stephen José Hanson talks to Geoff Hinton about neural networks, backpropagation, overparameterization, digit recognition, brain imaging, syntax and semantics, Winograd sentences, and more.
Hanson: OK Geoff, thanks for joining me in this chat. This is for AIhub, and I’ve recorded three or four different conversations. It sort of started out as thinking about what AI is, but it really began with an old friend of mine (we overlapped in graduate school), Michael Jordan, who had written several articles (one in Medium), and I wrote a reply, which got some attention, mainly from Mike. He and I had this discussion, and I disagreed so much with him that I wanted to just see what was going on. Even if you haven’t been paying attention you’ll notice that something is happening. He was basically saying that the deep-learning phenomenon that’s happening right now is – I almost think of it as like The Beatles, when Beatlemania started: we’re in deep-learning mania. But there’s a lot of good things happening too, and as I pointed out to him, protein folding… He said “I agree, but of course, they didn’t solve the problem!”. I said “you’re making these diminishing comments to create an atmosphere of ‘this is going to fail, the AI winter is going to come’. It’s like some kind of self-fulfilling prophecy on your part. Why are you doing this? Don’t you realize you’re like the only person who doesn’t get this?”
Hinton: He’s not the only person. There’s Gary Marcus…
[This section has been edited. For a more in-depth discussion please see this Connectionists discussion].
Hanson: Anyway, I knew Michael back in grad school and at that point he was always focused on the margins of things – I mean, important things. There’s a sense in which he really is rejecting the whole DL thing strongly, and he’s an interesting character in this. Now, you on the other hand have said, at least in other contexts I’ve heard, that deep learning concerns you. I think that Yoshua Bengio has had a lot of concerns as well about DL. So we’re doing classification and it works well, but how does it compare to human thought and reasoning, and all the wonderful things humans do? Now, being a psychologist, and having a couple of brain imaging scanners, and having scanned ten thousand brains, I’m much less impressed by what humans can do. As far as we can see, it appears to reduce to a couple of networks in the brain that are interacting in interesting ways (we don’t know what this means yet). And neural imaging is moving towards network science; there’s no doubt there’s a collision coming between artificial intelligence and the network science coming out of brain imaging, which has just taken the field over. So I was very excited when I started thinking about this, and I just want to toss it out to see what you think is happening right now. What’s going on?
Hinton: OK, so I think one thing that’s been fairly clearly established by the research on deep learning is that if you do stochastic gradient descent on a lot of parameters, then amazing things will happen. Like you’ll be able to generate whole stories. Or you’ll be able to integrate symbolic expressions and compete with things like Macsyma at integrating symbolic expressions, which is quite extraordinary. Or you’ll be able to do machine translation, which was supposed to be the preserve of pure symbolic AI. What it doesn’t show, for example, is that the brain is using backpropagation. I spent a long time trying to figure out how the brain could do backprop. I’m still trying to figure it out. I still don’t think it’s impossible. My argument is, you can take a stem cell and you can turn it into an eyeball or a tooth. If you can do that, you can surely turn it into something that does backprop. The question is: is backprop the algorithm that the brain is using? I’m beginning to think it’s probably not.
Hanson: So really, that’s kind of what struck me about this time period, 2007, 2008, 2009, up to 2012. If I was still in the 80s and I time travelled to this period, I’d say “ReLU units and dropout – it just looks like backpropagation to me, in huge networks”. Not much has changed, except in some sense the parameterization, which you just brought up. There’s a huge number of parameters in this thing, and from a statistical point of view it can’t possibly work. It can’t do anything without strong regularization, and with 1000 layers and billions, trillions of weights, this is absurd, right? It can’t work. But it does.
Hinton: Well, there are two issues. Firstly, does it fit the statistician’s idea of how you build a model, which is that you’d better have fewer parameters than data.
Hanson: It had better be linear, Gaussian, with five parameters and five times the amount of data… yes?
Hinton: I think that’s a bit unfair. We can go beyond all that. But they certainly believed you ought to have fewer parameters than bits in the data. And what we’ve discovered is that great big neural nets work pretty well, surprisingly well. And so what’s really weird is that you can have a neural net with enough parameters that it can learn to associate outputs with inputs even if the outputs are made entirely random. That same net, when the outputs aren’t random, even though it’s way overparameterized, will go off and find the structure in the data and generalize really nicely. And that’s very weird.
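[Editor’s note: for readers who want to see this phenomenon for themselves, here is a minimal illustrative sketch in the spirit of Zhang et al.’s “rethinking generalization” experiments, using scikit-learn’s small digits dataset. It is not an experiment from the interview; the dataset and network sizes are editorial choices.]

```python
# A wide MLP with far more parameters than training examples can memorize
# random labels, yet the same architecture trained on the real labels
# finds the structure and generalizes. (Illustrative sketch only.)
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)              # 1797 tiny 8x8 digit images
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
y_rand = rng.integers(0, 10, size=y_tr.shape)    # labels with no structure at all

def make_net():
    # ~1M weights for ~1350 training examples: heavily over-parameterized.
    return MLPClassifier(hidden_layer_sizes=(1024, 1024),
                         max_iter=2000, tol=1e-6, random_state=0)

net_real = make_net().fit(X_tr, y_tr)
net_rand = make_net().fit(X_tr, y_rand)

print("real labels:   train", net_real.score(X_tr, y_tr),
      "test", net_real.score(X_te, y_te))
print("random labels: train", net_rand.score(X_tr, y_rand),
      "test (true labels)", net_rand.score(X_te, y_te))
# Typically the real-label net fits its training set and generalizes well,
# while the same architecture can drive its error on the random labels
# toward zero yet stays near chance on the true test labels.
```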
Hanson: It is, and I don’t see why statisticians aren’t – and I take Mike Jordan as a particularly good statistician – why they’re not worried about what that means. The attitude is to fall back to “it shouldn’t work”. I’ve heard very sophisticated statisticians tell me that that can’t work, and people who helped define certain fields in multivariate statistics say that can’t work. But I point out that it does work, so wouldn’t it be better to try to develop some theory about this, about why it’s working?
Hinton: Indeed. And there is a very simple theory, which is that it’s basically early stopping. For example, if you take MNIST. There are 60,000 training examples. So I tried training a network, using backprop, with a few tricks, that has the same ratio of parameters to training examples as the brain has. So if you take the brain and let’s suppose that each fixation is a training example. In your life you make about 5-10 billion fixations, so let’s say 10 billion fixations. But you have like 100 trillion synapses. So given that ratio, you can burn 10,000 or 100,000 synapses per fixation. So if you do the same with MNIST you end up thinking that you’re going to need something like half a billion parameters for MNIST at that level of profligacy. And if you train a network like that, it works. And the reason it works is because it very quickly gets to making zero errors on the training data.
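[Editor’s note: the back-of-the-envelope arithmetic behind that ratio, using the rough orders of magnitude quoted above, works out as follows.]

```python
# Rough arithmetic for the parameters-per-example ratio described above.
synapses = 100e12                 # ~100 trillion synapses
fixations = 10e9                  # ~10 billion fixations in a lifetime
per_example = synapses / fixations
print(per_example)                # ~10,000 synapses "spent" per fixation

mnist_examples = 60_000
print(mnist_examples * per_example)   # ~6e8: roughly half a billion parameters
                                      # for MNIST at the same level of profligacy
```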
Hanson: Right, so…
Hinton: Certainly with early stopping you can understand it: a big system with a lot of parameters, once it gets the training data right, will stop learning.
Hanson: Right. I’ve actually looked at a lot of learning curves in deep learning systems, and I was also really interested in learning curves going back to when I was first doing animal conditioning research (1970s). Learning curves in that case were always negative exponentials, based on Hull’s theories and other early psychological learning theorists, and of course consistent with the Perceptron. But there was an interesting variation due to Thurstone and some of the early mathematical psychologists, who had derived hyperbolic curves. Now, these types of learning curves do exactly what you said: they have a small incubation time, and then they shoot up to 100% accuracy as if there’s a kind of tipping point, as if there’s a resolution of the feature structures that appear to crystallize, and then the system basically goes “I can get this to zero error right now. It’s done”. In an examination of many DL systems during learning, it appears they behave this way, especially when there are many, many, many layers. Which led me to hypothesize that it’s not so much the various tweaks to encourage learning efficiency, or whether backpropagation is what the brain does, but that the layers are terribly important in the kind of reconnoitring of the information somehow. The exploration and its reduction are critical to… lots of layers. Now I had a student who went off to Google, and he’s running some group up there, and I talked to him a couple of years ago and I said “what did you decide to do to make it work better?” (I forget their particular task, probably speech recognition), and he said “well, we just add more layers”. Sounded like a bad idea to me. He said “but it works”. It’s the old saying… you remember Alan Lapedes had an old phrase. He said “the fairy dust approach: just sprinkle with hidden units and it will work”. And this is almost like, sprinkle with layers and it will work.
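[Editor’s note: for reference, the classical negative-exponential and hyperbolic learning-curve families mentioned here are commonly written, in one standard parameterization with accuracy normalized to 1, as $P_{\text{exp}}(t) = 1 - e^{-\lambda t}$ and $P_{\text{hyp}}(t) = t/(t + c)$, where $t$ counts trials or training time, $\lambda$ sets the learning rate, and $c$ is the half-saturation point. These forms are supplied editorially and are not taken from the interview.]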
Hinton: But it’s fairly clear what the number of layers is in the brain. The equivalent of layers in the brain is cortical areas.
Hanson: Right, and you only have six of them.
Hinton: You don’t have a very deep hierarchy of cortical areas. Even Van Essen’s hierarchy is only about five to ten areas deep between the input and the hippocampus. The cortex isn’t like that.
Hanson: Right, as I said, just six in cortex. But the argument might be that if we consider recurrent networks or feedback networks, some cyclic structures, then we could have many, many layers in the virtual structure as we unfold it. It’s just that we’re basically processing in a kind of recursive way until we…
Hinton: Sure, yeah. But then it gets even less brain-like because as soon as you take these recurrent nets and ask “well, how do they train them?”, they train them by backprop through time. And there’s no way in hell the brain is doing backprop through time.
Hanson: Well, I don’t know why you say that. Seems strange. I think that, I mean there are many ways in which one could construct some kind of wet tissue to do something like error propagation if there’s symmetry, you know if there’s like two parallel systems that have some way to communicate. One of them is sending errors and the other is dealing with the information. And they’re connected together in a sympathetic way.
Hinton: No.
Hanson: No?
Hinton: Think of video. Think of these systems that do backprop through time for understanding video, a recurrent net. There’s no way you can be pipelining the visual input and using your hardware in a pipeline so you can process it in real time. And do backprop through time. They’re just completely incompatible.
Hanson: Well, you need some delay. There has to be some way to create some off-phase but symmetric information. With a delay…
Hinton: Basically, you can’t do pipelining if you have backprop through time.
Hanson: Oh, I understand. But if you consider that the visual input from the retina back to V1 is almost 200 ms, I mean we’re living in a very slow world as it is. And so you could have a slower network, say… the cerebellum’s got a lot of room, we could stick it in there and basically simulate the cortex and then “pop” out every once in a while for proper updates, I mean, there’s bigger systems here. And of course, the feed-forward structures that have been the focus for 40-50 years are not in the brain, I grant you that. But I think this kind of argument borders on proof by lack of imagination at this point, like Rumelhart used to say.
Hinton: I agree. So I have my favourite example of this type of argument, which is that Chomsky thought language can’t be learned because he couldn’t see any good way to do the learning.
Hanson: Right, that’s right. And he’s still saying it. I think he was presented recently with some deep learning results and he just shook his head and said “ridiculous, neural networks can’t work” or something, I don’t know. One of the things historically that caught me here too, not that I want to digress in various directions, but I will a little bit, is that it seems to me, unlike the first AI winter, which had to do with the deconstruction of the Perceptron, there’s a sense in which there was more like a pause, almost like an incubation. The whole thing was being marinated. It wasn’t stopped, it was just kind of sitting waiting for something to happen. And these other kinds of innovations and technology and compute cycles and data, all of that then appeared with time… I don’t know, I mean you were there when this happened. I was still wondering what was going on. It just exploded. I mean, is that how it felt? You for one, never gave up on any of these ideas. So when it hit, you had to have actually predicted this somehow.
Hinton: Yeah, I mean it seemed to me that the brain’s got to learn somehow. The brain doesn’t learn by being taught statements in symbolic logic and figuring out their entailments. That’s not how the brain works. It clearly works by having huge vectors of neural activity that interact with each other. So we’ve just got to figure out how these vectors get constructed and how they interact. That’s got to be what’s going on. And so, from my point of view, there never were alternatives to neural networks. They were just silliness, and they were silliness that depended on people not having any biological insight. They thought that somehow the essence of intelligence was symbolic reasoning, and that symbolic reasoning had to involve using symbols inside your head. And all of that was just crazy. So I had a very nice test for whether people were crazy or not, which is: suppose that you could understand a rat, really, really understand a rat, in the way a physicist understands a ball falling off a tower. Would you be more or less than halfway to understanding human intelligence? And it seemed to me just obvious that if you could understand a rat you are most of the way there. On top of that you have language and other stuff, but basically you understood how the thing worked. And I tried this question on Steve Pinker. He was convinced that you were much less than halfway there. It’s just a question of hubris. It’s a question of thinking there’s really something special about us. Of course, there are lots of really special things about us, but the essence of how brains work and how they manage to do things in the world applies to rats as well as to us, and if you can understand a rat, you’ll be most of the way there.
Hanson: Well, this was the view in psychology in the early 20th century, that we’re going to start with simple organisms and somehow this will generalize to human behavior. Now, to some extent, that program failed; it didn’t work. I suppose it’s partly because language had become kind of the hobby horse of certain psychologists, not all of them. But the basic processes really haven’t changed that much since the 50s, in terms of attention and memory and episodic memory and stuff like that. It’s just that we know more about them. Do we know enough to actually simulate it? Maybe. Maybe not, I don’t know. But we’re on the right track. So I agree with you in that perspective. But then we’ve got this thing that just happened again. There are so many good retreads these days, back to the 80s. One of them being explainable AI. And I just love the term explainable, because it gets down to what an explanation is here. Do we really want the CIFAR network that just learned about cats and dogs to say “I know what a cat is”? You might be able to construct explanations from information inside the network, but that seems to be an entirely different function. If I can explain to you how to play tennis, you’re not going to be able to play tennis when I’m done. It’s not going to work.
Hinton: My favourite example is handwritten digit recognition. I can make a neural network that does very well at handwritten digit recognition, and people would like an explanation of how it works. The way it works is that it’s got a whole bunch of weights in it, and you put these weights through these functions, and out comes the answer. That’s how it works. And if you want to know more about how it works, I’d better tell you what the weights are. Or at least I’d better show you, for a very large set of examples, the mapping from input to output. If I show you that, that’s enough, because you know you can distil another network from that, so that contains the information: just the mapping from inputs to outputs. But that’s not what people want; they want a different kind of explanation. I’ve seen people in the press say, “well, we can explain how neural networks work. The way they recognise objects is they first derive features, and then they combine these features”, and there’s a sense in which that’s true. But if you ask, “how do you recognise a two? Can you explain it?”, most people think they can. That’s what’s interesting. So I ask “how do you recognise that something is a two?” People will give me an explanation, then I’ll find a two that completely violates that explanation. I can find a two where the tail of the two is vertical. The fact is, we don’t know how we recognise a two; we can’t explain it. So if you want to be confident in what the systems do, you have to do it in the same way you get confidence in people. If I show you a lot of twos and you recognise them all, then I think you’re a pretty good two recogniser. I’m fairly confident you can read the two on my cheque. But I don’t get confidence that you can recognise a two by looking inside your brain, or by asking you how you do it. Because if I ask you how you do it, you’ll give me a bunch of rubbish.
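[Editor’s note: a minimal sketch of the “distil another network from the input-output mapping” point above. A second “student” network is trained only on the first network’s outputs, never on the true labels, and recovers most of the behaviour. The dataset, network sizes, and the use of hard labels (rather than the soft probabilities real distillation normally uses) are editorial choices.]

```python
# Train a teacher on real labels, then train a student only on the
# teacher's input->output mapping. The student never sees y_tr.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

teacher = MLPClassifier(hidden_layer_sizes=(256,), max_iter=1000, random_state=0)
teacher.fit(X_tr, y_tr)

student = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0)
student.fit(X_tr, teacher.predict(X_tr))   # only the teacher's mapping

print("teacher test accuracy:", teacher.score(X_te, y_te))
print("student test accuracy:", student.score(X_te, y_te))
```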
Hanson: Yeah, and again, in brain science we know there are implicit and explicit learning systems in the brain, and they do different things. Now, we also know they communicate, but not that well. So I can know how to make risotto, or something, but I can’t explain it to you. You’re going to have to watch me do it, and then maybe you’ll get lucky the first time. There’s already that breakdown, so I just find there’s this false dichotomy that existed in the 80s and now it’s come back again. Why? Because neural networks are working and people don’t like it. That seems to be the context with that.
Hinton: The big difference between what happened in the last neural network winters and now is that when the last neural network winters came along, neural networks weren’t running our lives. There were a few applications. There was Yann’s network that read the amount on cheques, and that was used quite a lot. Neural networks were used for reading postcodes and things like that. But actually not that many applications. Now they really work. The reason there’s not going to be another winter is because they really work.
Hanson: Yeah, yeah. So, one of the trends out there, one of the things I also have noticed recently: there was a group of mathematicians at the Institute for Advanced Study that ran a bunch of workshops. They brought people up to give talks (Yann was there and a few others) and the talks were singularly unimpressive. And I thought, gosh, I’m back in the 80s again, just talking about the same thing. One of the keynotes was talking about gradient descent, and I thought, gosh. But then I went over to the poster session where all these kids were, and I went “what are you doing here, this is amazing. How did you do that? What’s this?” And there was this recent book that appeared on DL principles… these two kids from Facebook, ah… Yaida and Roberts?, and they wrote a 500-page book (I got a copy from the arXiv) and I started talking to them and I realized… They are physicists by training and they took quantum-mechanics-style math, applied it to deep learning, and started to prove lots of theorems. And there were some astonishing things I read in the book that I thought were interesting, about layers, and about how quickly things learn and so on. But on the other hand they couldn’t answer simple questions like, what’s a learning curve going to look like, what about generalization? They couldn’t say anything; they said “well, we’ve done the general case for everything. And that’s equation 247”. I looked at it for a while and gave up. I don’t understand quantum mechanics. So, anyway, in this same thread, I remember, almost 40-50 years ago, I was at Bell Labs, and you came and gave a talk on Boltzmann machines. It was the first modern neural network I had ever heard of, and to me it seemed game-changing.
Hinton: I remember that.
Hanson: And the guy who was the director (Max Mathews) showed up. It was a full crowd; everyone wanted to know what a Boltzmann machine was. And I remember David Krantz, whom I knew pretty well, asked you a question: “So, Geoff, this is really interesting, have you proved any theorems about it?” Even at the time I thought, why would you want to prove theorems about it? This thing works. And then of course I realized he was asking something a little deeper. He was thinking, is there a kind of foundational basis for this where we can understand what the family of these things is, or how they’re related in a larger sense, maybe to the brain or biology, or something like that. But, you know, in that context, do you think this kind of thing, this Yaida and Roberts book, will have… I mean, I can’t tell whether it will have much of an effect on anything.
Hinton: I don’t know.
Hanson: OK, fair enough.
Hinton: It would be very nice if we really understood why overparameterized big networks actually extract the regularities that we’re interested in. As opposed to just finding some random way to fit the data. And it’s presumably because fitting the data by extracting all the regularities that are really there is easier than fitting it by just doing random things, by extracting lots of regularities that are just there because of sampling error. But, in the meantime, there’s lots of experimental work to be done, lots of work developing algorithms. So, I had another question I used to ask people. Suppose there were two talks at NeurIPS, which would you prefer to go to? One talk is about a really clever new way of proving a known result. The other talk is about a new learning algorithm that’s different from the existing algorithms with no proofs at all, with no idea how it works, but it seems to work. I’m clearly on one end of the spectrum.
Hanson: Me too. So, there’s a sense that this is the thing Michael and I always argue about, going right back to the McDonnell-Pew meetings (Hanson was a member of the McDonnell-Pew Cognitive Neuroscience advisory board), and he was part of that Center at MIT at the time. Michael and I would argue, and he sort of had this “neats” versus “scruffies” idea, and the idea that statistical models were really models that were specified, and then you exploited that specification with data, and not too much data, just enough data to get the model configured. And if you found that there was covariance amongst the parameters then you should get rid of some of the parameters and just compact it. I was trained in a very different way, and the idea was that, well, I’m not sure I know what the model is. Why would I? I’m modelling pigeon behaviour. I realized even in a simple system you don’t know what the right model is. So I had a statistics professor tell me “well, in that case you talk about model mis-specification. And the question comes down to the statistician who believes they know the model: how much risk are you willing to suffer for model mis-specification? If your linear model is that wrong and it’s still OK, then somehow you’ve left out so much variance, so much information, that you’ve lost the fundamental aspect of the thing you’re modelling”. And this was the thing that caught my attention about neural networks in that time period, because I was more interested in general mathematical modelling. My God, I said to myself, this is a giant fiesta of model mis-specification! I love this, this is exactly what I want to do. I want to be able to have some universal general approximator and then be able to extract the model out of it that got discovered. I don’t know how the motor system works; I want to discover something about something. So, learning to me was just good model mis-specification. And of course, statisticians just hate this stuff in that context, right? Because they don’t want to suffer the risk of having a model that’s outside the equivalence class of models they think are correct.
Hinton: I think it’s even worse than that. Backpropagation is very good at squeezing a lot of information into not many parameters. We run it in great big networks that have a lot of parameters, but if you run it in a smaller network, it’ll do the best it can, and it’ll get a lot of information into the parameters. And I don’t believe that’s what the brain’s like. So, with the neural nets we run, typically we’re not in the regime where you have very little data and a huge number of parameters. We are often in regimes where we have more parameters than you’d expect you can get away with. But if you look at the brain, and you say it’s got 100 trillion parameters, so maybe 10 trillion of those are used for vision – we’re very visual animals. So, you’ve got these 10 trillion parameters used for vision, and in your entire lifetime you only make 10 billion fixations. So, you’ve got at least a thousand parameters per fixation, and probably many more. That’s a very different regime from what statisticians are thinking about. And the reason we’re in that regime is because supporting a synapse for your whole lifetime takes a lot less energy than making one fixation. So, we’re in the regime where parameters are cheap and doing computation to fit parameters is cheap. Data is expensive, because that involves possibly getting killed.
Hanson: But it’s also cheap in the sense that you point out – fixations produce an enormous amount of data in a one year old.
Hinton: Particularly for unsupervised learning.
Hanson: yeah right, particularly for unsupervised learning. And to the extent that one can build some kind of modal distributional information from that then you can use that to train things.
Hinton: But it’s still the case that we have to understand what learning’s like when you have lots of parameters per training case. And the brain is clearly optimized for sucking as much information as it can out of a training case, not for squeezing as much information as it can into a parameter.
Hanson: Right, that’s an interesting distinction.
Hinton: And that’s a very different perspective from what most statisticians have. It argues against backprop. And it’s one of the reasons I now don’t believe that backprop is how the brain works; here’s what I think is a pretty good reason. Suppose you ask: how many parameters do you need, to make a system that can translate moderately well between 50 different languages? It can translate any language into any other language, without necessarily going via English. And it can do it moderately well. How many parameters do I need for that? Well, it turns out you can do that with just a few billion parameters. And if you ask, well, in terms of brain imaging what’s a few billion parameters. Well, I can tell you if it was a mouse, a mouse has a billion weights, a billion synapses, per cubic millimetre. They have to be optimized for being compact. So, we probably need a few cubic millimetres to have a billion synapses. So, a billion synapses is about one voxel in a brain scan.
Hanson: More like maybe 10 million, 100 million.
Hinton: How big are voxel cells in the brain?
Hanson: Voxels, um, in fMRI, let’s say two millimetres isotropic.
Hinton: OK, so that’s eight cubic millimetres per voxel cell. So, if it was a mouse cortex, that would be eight billion parameters. We assume for a human cortex that’s well over a billion parameters.
Hanson: OK, right, in terms of connections, OK, I see, I agree, yes.
Hinton: There are enough parameters in one voxel in a brain scan to do this translation between 50 different languages. So, the brain is clearly not that efficient. That is, you would be very surprised if all that knowledge of all those different languages is fitted into one voxel – you’d expect it to take several voxels at least. So, I think backpropagation is much better at squeezing a lot of information into not many synapses than the brain is. Because, that’s not what the brain is optimized for. The brain is optimized for extracting a lot of information from not many fixations, without much experience.
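[Editor’s note: the rough arithmetic behind this exchange, using only the order-of-magnitude numbers quoted above.]

```python
# Order-of-magnitude voxel arithmetic from the exchange above.
synapses_per_mm3 = 1e9        # ~1 billion synapses per cubic mm (the mouse figure quoted)
voxel_side_mm = 2             # a typical isotropic fMRI voxel
voxel_volume_mm3 = voxel_side_mm ** 3          # 8 cubic mm per voxel
synapses_per_voxel = synapses_per_mm3 * voxel_volume_mm3
print(synapses_per_voxel)     # ~8e9: a few billion synapses per voxel, on the
                              # order of the parameter count quoted for a
                              # 50-language translation model
```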
Hanson: Well, the thing I really enjoyed about this second version of Google Translate… I think I was at NeurIPS in 2017 or 2019, I forget which one, and I went to a Google Brain poster and they were doing a bag of words between different languages and getting remarkable results. Now, in the 80s I had done a simple autoencoder on the Brown corpus, and was able to pull out what I thought was syntactic structure, because I was told by linguists that you must do syntactic structure first, and then semantic structure. And so I had a model in the 80s, PARSNIP, which would pull out some interesting syntactic structure. At least it would get philosophers to write about it; that’s the important thing here. But in this case, I talked to a very pleasant, bright gentleman and I said “do you have a linguist working with you? What’s the syntax structure? There are many ways to code the syntax. Are you doing it in a kind of corpus way, or…”. And he said “what’s syntax?”
Hinton: I had a very interesting conversation recently with Fernando Pereira, who’s in charge of natural language processing at Google. It was about whether we’ll ever go back to the idea of taking a sentence and extracting a logical form from it. And his answer was “no, we’ll never go back to it”. We know the right symbolic language to use: it’s called English. The thing is, we operate on this symbolic language with a very fancy neural net processor, and so it doesn’t need to be unambiguous. You can use pronouns in it because the neural net processor is fine with that. He thinks the only symbolic language you have is natural language. You don’t have another symbolic language inside your head for creating symbol strings that are unambiguous. People at Google have been working on taking a sentence in a context. So you might have a sentence “and then he gave it back to him”, and the problem is who’s he and who’s him and what’s it? And in the context you know what they are. So now you can make a neural net that will take a sentence in context and give you a version of the sentence that doesn’t require the context. So given “and then he gave it back to him” in a context, it will return “and then Jack gave the soccer ball back to Bill”. And that’s symbolic processing in the sense that you start with some sentence which is a string of symbols and you produce a new sentence that is a new string of symbols. But the only place the symbols are is the input and output. There are no symbols elsewhere; it’s all embedding vectors.
Hanson: I almost wish Jerry Fodor could’ve lived long enough to see this, because he would’ve hated it so much (I miss his counter-arguments!). Syntax and semantics were kind of squished together. It’s interesting, over the last decades I spent a lot of time with linguists. One of my best friends is Edwin Williams, who’s got a Chomsky number, and he always said something interesting… he’s a syntactician… he’d say this funny thing. I’d say, well, “how does this connect back to semantics? We’ve got this lexicon, and these things are selected…” And he’d say, “no, no, no, syntax is semantics”. And I’d say, “what do you mean by that?” And he’d say, “well, it’s very clear that a lexicon carries this information (“logical form”); I’m just extracting part of it, therefore it must carry semantic information (markers)”. And that sort of made sense, but no one else I had heard talked like him. So there’s this idea that syntax and semantics are just part of one large parcel, and you could grab out the information you needed to map to other information (with the detail that, to linguists, the sentence is the logical form). So maybe there is a kind of intermediary thing that’s being extracted. Or do you not think the representations are like that?
Hinton: I mean, you’re extracting stuff. You’re extracting these big vectors. But if you ask, in terms of symbol strings, is there a lingua franca inside the head? We don’t need one; what we need is vectors inside the head. You know that little symbol on some computers that says “Intel inside”? We need one on our foreheads that says “vectors inside”.
Hanson: Interesting… a couple of years ago we did a simple bilingual experiment with Spanish and English speakers. We’d give them an English sentence with a Spanish word inserted, or we’d give them Spanish with an English word inserted, and we’d have them guess what the word meant in different contexts while we scanned their brains. What we found was that English and Spanish, in particular, would occupy the same part of the visual word form area (VWFA). So it’s almost like that area was able to accommodate both, as long as the languages were close enough, and then the lexical access would be relatively automatic. But it gave you this idea that there was this kind of mosaic of things… Of course, if you know Chinese and English, there is no mosaic; they lateralize… It suggests the language structures can have some modular structure, again within a larger system of networks.
Hinton: Let me go back to the question of whether we really extract syntax. So, I think we really do get the difference between two syntactic forms, if we have a sentence like “next weekend we shall be visiting relatives”. That has two completely different senses that happen to have the same truth conditions. So one is: next weekend, what we will be is visiting relatives. Versus, next weekend what we will be doing is visiting relatives. Those are two completely different senses. I think when you hear the sentence, you can’t think of both senses at the same time. When you first hear the sentence you fix on one of those senses. Just like with one of those Necker cube pictures: you see it one way round. Or you see it the other way round.
Hanson: Sure, that might be driven by context or the audience…
Hinton: The point is that we really do disambiguate it. If that’s what you mean by syntax, that’s fine. They are different syntactic structures. We hit on one or we hit on the other. The question is whether, to get the parse tree, you have to have a lingua franca inside. I think not. I actually now believe we have these layers of agreement in the syntactic structure, so we do parse things. But that doesn’t mean there’s a symbolic structure. Neural nets can parse things.
Hanson: Well, I’m not saying it’s symbolic. If you have a system that can parse a hundred and nine languages pairwise, OK, five thousand different parsing outcomes, it does seem likely that if you analyzed it and looked inside the network in various ways, you might be able to extract something that looked like a staging area, some kind of small infrastructure that allows you to push off very rapidly in all directions. I don’t know what that is. I’m not going to call it a lingua franca. I’m just going to say that there’s some enabler there that has learned to do this.
Hinton: So, earlier this year I put a paper on the web about how neural networks can represent part-whole structures, which is about how you can essentially parse visual scenes without having to have anything explicitly symbolic inside the net. It’s about how you can make a system that runs on big vectors that can do parsing. And I do believe that we really do parse things – I just don’t think it requires a lingua franca.
Hanson: I agree; I’m not trying to reduce it to symbolic versus non-symbolic representation. I’m saying there’s some kind of glue, or something that develops, that just isn’t a map. I’m not denigrating maps. They’re all over the brain and they’re obviously critical. This translation machine is much more interesting than mapping bags of words to other bags of words. I don’t think it has much to do with the statistical aspects of the word distributions themselves. It might have to do with how language evolved and the way we see it throughout the world, and certain languages have more similarity than others. So, I did want to ask you something about these architectures, which worries me, but maybe not you. Google announced that there’s this new GPT-like thing called GLaM, and it has a trillion weights in it and many, many layers, and it does interesting things like GPT-3 does. Of course, these things are a bit of a Rorschach test: if you ask GPT-3 if it’s going to take over the world it says “no, I won’t do it”. So there’s a sense in which I see it as a huge phrase-structure blob that can find a phrase structure in its space, pull out a similar blob, a phrase structure, and put it out there, and some people will say “that sounds good to me”.
Hinton: I think it’s much more than that.
Hanson: But, I’m saying, it’s untestable in that sense. And the size of these things is out of control…
Hinton: People are untestable too.
Hanson: In the sense that…right, but if we give them a task…
Hinton: The same with these things. You can give them a task and you can see the task that they fail at. You can understand that they really don’t understand things properly in the sense people do, by the fact they can’t deal properly with Winograd sentences. They can do some of them…
Hanson: Wait a minute, what are Winograd sentences (Winograd schema)?
Hinton: OK, if I tell you “the trophy would not fit in the suitcase because it was too big”, then “it” refers to the trophy, but if I say “the trophy would not fit in the suitcase because it was too small”, “it” refers to the suitcase. But if you translate those two sentences into French, you have to know whether the “it” refers to trophy or suitcase because they are different genders in French.
Hanson: Right…
Hinton: But you can’t translate that sentence into French without resolving the ambiguity about what “it” refers to, and that depends on real-world knowledge about how big things don’t fit into small things.
Hanson: I see. OK, the Winograd part of that I wouldn’t… linguists wouldn’t talk that way, or refer to Winograd schemas as language. They would refer to what you say, in that there is some kind of world context in which…
Hinton: The Winograd sentences were proposed by very good symbolic people like Hector Levesque as a real test for these networks.
Hanson: I saw that, and I asked other linguists what Winograd sentences were, and they said they’d never heard of them. Just saying, there’s a grain of salt to be taken with that.
Hinton: No, it’s not a grain of salt. It is a very good test. It’s a test of whether you really understand.
Hanson: The test is fine. I’m just arguing about the context of the Winograd schemas. The test is fine, I agree. But it would seem that if you don’t have world knowledge, and you don’t have experience with small suitcases and big trophies, you’re gonna have trouble interpreting what’s going on.
Hinton: The point is, once they can do that, then it’s going to be quite hard to say they didn’t understand.
Hanson: But they may need more contextual information along with it… maybe it’s the visual context of seeing suitcases and trophies together and understanding something about the large and small relational structures they’re looking at. And I agree with you, the GPT-3 thing is impressive and amazing, and it’s very much what we were trying to do in the 80s, by training on lots of text and trying to understand whether the system would understand language.
Hinton: Let me make one more comment about this. I thought that once you could get neural nets to translate from one language to a very different language, people would stop saying they didn’t understand, because how can you translate to a different language if you don’t understand? People kept on saying they didn’t understand. A few people. So, right now you can get a neural net, give it a sentence that says “a hamster wearing a red hat”, and it’ll draw you a picture of a hamster wearing a red hat. Now, it’s quite hard to say it’s just learning to convert phonemes into pixels, that it’s just a coincidence that it looks like a hamster with a red hat, that it doesn’t really understand. Once you can say “a hamster wearing a red hat” and it can draw a hamster wearing a red hat, that seems to be pretty definitive evidence that it understood. And I’m sure people like Gary Marcus will say it doesn’t even understand what a pixel is. But for any normal person, if you could say “hamster with a red hat” and they can draw one, that’s pretty good evidence that they understood what you meant.
Hanson: But psychologists have spent a long time creating task paradigms that do exactly that; that’s the point. Psychologists don’t really understand the taxonomy of all these tasks, but the tasks have certain kinds of properties that can inform us of psychological and brain activity that’s relevant to the outcome. So I agree with you. I don’t see why it’s any different for a neural network to be able to draw a hamster with a red hat; it’s the same test I would give to a twelve year old. And they would draw a hamster with a red hat. It may not look like a hamster, maybe it would look like their pet cat. But the hat would be red. So we agree at some level. Let me change the topic a bit; I don’t know how much time you have, but I’m enjoying this so much. One of the other reasons I like doing this is because I get to talk to people I haven’t seen for a long time and actually just enjoy the conversation, without bumping into them in a Starbucks by accident every five years. One of the things I thought was funny, that I remembered because I have this old video (it’s got to be from the 70s), is you being interviewed by John Searle. Do you recall that at all?
Hinton: Oh, I recall that extremely well. It was a very painful episode. Before I did it I talked to Dan Dennett and said, “should I do this?” and Dan Dennett said “no”. And so I got Searle to agree to something: that we wouldn’t talk about the Chinese Room argument, and he agreed to that.
Hanson: Oh, lucky you.
Hinton: So then we started the interview, and the producer of the programme was an Israeli who was a friend of John Searle’s, and John Searle introduced me by saying something like “today we’re going to talk to Geoff Hinton. Geoff Hinton is a connectionist so of course he has no problem with the Chinese Room argument”.
Hanson: [Laughs]
Hinton: And I had to object and say “look” – I didn’t feel like I could say “look, you agreed just before we went on camera not to talk about that” – I just put up an objection. The whole interview consisted of me being badgered by John Searle. They made it an hour programme, and they allowed two hours of filming to make an hour programme. And then about halfway through, the Israeli producer stopped the filming and said “stop the cameras”, and he turned to me and said “you have to be more assertive”. And I thought, “oh my God, I have to be more assertive”.
Hanson: So, the only thing I liked about the whole interview… except that you were terribly uncomfortable through most of it, and you were sort of sitting and staring off in one direction. Was this a BBC…?
Hinton: No, it was… I’m fairly sure it was ITV. The reason I’m fairly sure it was ITV is because there was a green room before the thing, and in the green room they gave me an envelope stuffed with money.
Hanson: I’ve given talks somewhere and walked out the back and they gave me an envelope of money and I went “what’s this for?” I guess they were taking a gate fee. Anyway, one of the things he did, and I won’t try the accent [adopts a Midwest accent], he’s from the Midwest like me, and he had this Iowa kind of “what if we replaced every brain cell in your brain with a chip? And as I did that, slowly we’d lose Geoff Hinton, he’d just disappear, right?”. And when he finished, you said something like “no, actually it’s worse than that. It’s the software.” And Searle just looked at you like you’d slapped him in the head with a dead fish, you know.
Hinton: I wouldn’t have said it was the software. What I would’ve said is that if you replaced every neuron with a piece of silicon that behaved exactly the same way, I’d still be Geoff Hinton.
Hanson: But that’s when he said something about AI, and he was implying that AI was this kind of hardware transformation, and you said “no, it’s much worse than that. I believe that software will actually be able to simulate human intelligence someday”. Not neural networks specifically, but just the idea that this was not about hardware, and so on and so forth. That’s the thing that also struck me about your persistence and consistency: if I read the hype out there, I don’t think people realize how brain-oriented you really are in terms of the actual theorizing. Well, we’re at an early stage here of what’s happening. Other than backpropagation, there must be other learning algorithms that are waiting to be picked out of the air by some young postdoc somewhere who’s right now sitting thinking, “oh, I know how this works”.
Hinton: Not necessarily a young postdoc. It might be a very old researcher who thinks they can get lucky one more time. [Both laugh.]
Hanson: Nice of you to point that out. So, I’m gonna thank you again. This was just so much fun. I enjoyed it a lot and it’s good to see you. I hope you are doing well with our days of plague here. I’m going to say – see you around. Take care Geoff. Bye.
Hinton: Thanks for inviting me. Bye.
Geoff Hinton is Emeritus Professor at University of Toronto and also works at Google Brain. In 2017, he co-founded and became the Chief Scientific Advisor of the Vector Institute in Toronto. Hinton received the 2018 Turing Award, together with Yoshua Bengio and Yann LeCun, for their work on deep learning.
Stephen José Hanson is Full Professor of Psychology at Rutgers University and Director of Rutgers Brain Imaging Center (RUBIC) and an executive member of the Rutgers Cognitive Science Center.
Other interviews in this series:
What is AI? Stephen Hanson in conversation with Michael Jordan
What is AI? Stephen Hanson in conversation with Richard Sutton
What is AI? Stephen Hanson in conversation with Terry Sejnowski