In this instalment, Stephen José Hanson talks to Terry Sejnowski about the history of neural networks, neural modelling, biophysics, explainable AI, language modelling, deep learning, protein folding, and much more.
Hanson: Terry, thanks so much for joining this videocast or podvideo, I don’t really know what to call it. When I started trying to conceptualize what I was getting at, I wanted to talk to people who had a clear and obvious perspective on what they thought AI is. And you’re particularly unique and special in this context, because you have been consistent since… Well, there’s a great book you have a chapter in, which I think Jim Anderson edited in 1981, called “Parallel Models of Associative Memory”.
Sejnowski: It’s interesting you brought that up because I met Geoff Hinton in San Diego in 1979 at a workshop he and Jim organized that resulted in that book. It was my first neural network workshop.
Hanson: But not Steve Grossberg yet?
Sejnowski: No, but Dave Rumelhart, Jay McClelland, Teuvo Kohonen and a handful of others were there. We were all interested in the same things. There was no neural network organization or community at that time – we were a bunch of isolated researchers working on our own.
Hanson: And probably not well appreciated for talking about neural networks, or neural modelling.
Sejnowski: We were the outliers. But we had a great time talking with each other.
Hanson: Going back to the book, you had a chapter called “Skeleton Filters in the Brain” – I think that was the name of it. Perhaps not the best title in the world, but still… “Skeleton filters” is a little scary, I gotta say. But it was an incredibly easy read – I just read it again the other day. And in it you move in a subtle way from biophysics – modelling a neuron and referencing everybody, you know, Cowan and everybody who’d developed a differential equation or anything – all the way up to semantics and cognition. There’s this category you might associate with biophysical neural modelling, where neurons and circuits matter and that’s what we’re modelling – that’s the purpose of it. For example, I think you mentioned Hartline and Ratliff and the Limulus crab retina. That preparation provided an enormous amount of data well into the 60s; people were actually modelling it, there were predictions, and it was very tightly tied to the crab.
Sejnowski: By the way, although it’s called a Horseshoe Crab, and looks like one, Limulus has eight legs, so it’s an arachnid.
Hanson: Ha! Not a fun thing to bring into the lab then!
Sejnowski: The Limulus was an important model system in neuroscience because Hartline and Ratliff working at the Rockefeller University were able to record from single optic fibers and could uncover what the Limulus was seeing. This was amenable to neural modeling and mathematical analysis. Skeleton filters came out of my PhD thesis with John Hopfield. Skeleton is another word for sparsity. The idea is that the nonlinear responses of neurons select a sparse subset that are above threshold and can linearly filter correlations. By choosing different subsets of neurons, the same network can instantiate an infinite number of linear filters. We now know that sparse activity is common at the highest levels of the cortex.
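(To make the skeleton-filter idea concrete, here is a minimal numpy sketch – an illustrative toy, not code from the thesis: a threshold nonlinearity picks out a sparse “skeleton” of units, and whichever subset is active then behaves as an ordinary linear filter, so different inputs recruit different linear filters from the same fixed weights.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "retina": n neurons, each with a fixed linear receptive field over the input.
n_neurons, n_inputs = 50, 20
W = rng.normal(size=(n_neurons, n_inputs))   # receptive fields (one per row)
theta = 1.5                                  # firing threshold

def skeleton_filter(x):
    """The nonlinearity selects a sparse 'skeleton' of above-threshold neurons;
    those neurons then act together as an ordinary linear filter on x."""
    drive = W @ x                        # linear drive to every neuron
    active = drive > theta               # sparse subset above threshold
    output = np.where(active, drive, 0)  # only the skeleton responds
    return output, active

x1, x2 = rng.normal(size=n_inputs), rng.normal(size=n_inputs)
_, s1 = skeleton_filter(x1)
_, s2 = skeleton_filter(x2)

# Different inputs recruit different sparse subsets, so the same weights
# effectively implement many different linear filters.
print("active fraction for each input:", s1.mean(), s2.mean())
print("overlap of the two skeletons:", int(np.sum(s1 & s2)))
```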
Hanson: Right, right
Sejnowski: By the way, that was a prelude to the same experiment done in mammalian retinas in vivo by Steve Kuffler. Much later, I was his postdoc at Harvard Neurobiology.
Hanson: OK, so that brings up the first category that I think we want to just set aside here for a minute, this kind of biophysical neural modeling. But then there’s the stuff that you and other people have often called brain-inspired computing, and obviously this picks up on a lot of the work at NeurIPS. It picks up on the idea that there are tasks – object recognition, speech recognition – things we want to model in this behavioral way, but the actual reference back to the neural substrate could be thin. It could be more of a caricature: we’ve got units and connections and such, but we’re not really modeling the synapse or any kind of dendritic structure, because – you know this, I don’t have to tell you – the neuron is a very complicated thing. Forget networks, a single neuron is a nightmare to model correctly. And none of that actually gets into this second category very much. I mean, there are all kinds of neurotransmitters, and various kinds of noise and structure that are there but not really modeled, let’s say routinely, by people at NeurIPS. Then there’s computational neuroscience, which may be some kind of combination of modeling circuits and how they relate to behavior. So the physical aspect of that is also true, but you could just study the circuit to look at the circuit and not worry about its function out in the world; so that function starts being a part of computational neuroscience.
And then there’s this thing that I know Demis Hassabis has talked about at DeepMind, which is – of course it’s more of a metaphor – reverse engineering the brain: the sense in which, whatever DeepMind is doing, they’re taking biological insights and applying them somehow. Now, many people I’ve talked to, in this series and outside it, just say “no, that can’t be true, it’s not possible”. Essentially, the deep learning moment that we’re in is something special, something very important, but it’s very hard to relate back to, say, neuroscience, and you and I know neuroscientists – I can probably find them in the hall here in our neuroscience center – who will tell me “no, no, no, that has nothing to do with the brain”. There were DiCarlo and Yamins and a couple of other people who, in the early days when AlexNet first appeared, tried to relate it to visual recordings, with again the mismatch – they were collecting a couple of hundred neurons and AlexNet has, you know, 10,000 units. So yes, you could do the Rorschach testing, and maybe it’s stronger than that. Clearly there are strong correlations, and they did fit the data a lot to make that work out, I think. But that’s sort of where I am now. Do you think any of that is misleading?
Sejnowski: You’ve covered a lot of ground there. I want to just start out by addressing the issue of the relationship between artificial networks and biological real-world networks. I think there’s a real misunderstanding here. There are two different goals. One is to try and understand the brain and how it works, and the other one is building things that work like the brain. The bottom line is that there has been a great explosion in both of these fields and, for the first time, the researchers in these two groups – the AI engineers on the one hand who don’t know biology, and the biologists on the other hand who don’t know AI – are talking to each other. We have something in common to talk about. We can discuss progress of mutual interest. The point is that finally AI has a vocabulary and conceptual framework that makes sense to neuroscientists. We’re talking about units, we’re talking about weights, we’re talking about activities, we’re talking about short-term memory and attention; there is real crosstalk taking place. That convergence never happened with good old-fashioned AI, based on symbols, rules and logic. The AI guy would ask “where are the symbols”, and the neuroscientist would say, “Tell me how to find them.” That was the end of the dialog. Back in the 80s, AI researchers told us that we were misguided and used birds as an example: If you want to build an airplane, they said, you won’t learn anything by studying flapping wings and feathers. There is a wonderful biography of the Wright brothers by David McCullough that tells a different story. They were fascinated by birds, which could glide for long distances with minimal energy. The shape of birds’ wings gave them ideas for airfoils, which they tested in a wind tunnel that they built. They were inspired by feathers, which were stiff, lightweight and had a high surface area, to build wings covered by canvas between wooden spars. They looked to nature for inspiration and general principles of aerodynamics and materials. This is what we are trying to do with brains, to extract general principles and instantiate them in man-made materials. The old AI crowd still doesn’t get it. As Max Planck once said, it’s not old ideas that die, it’s the people who believe in them.
Hanson: Well, they’re still doing this, right? Explainable AI is simply a retread from the 90s brought back into the present.
Sejnowski: We should discuss what explainable means. I think traditional AI wanted explanations in English. That may not be the best way to explain something complex. For me, coming from physics, explanations are mathematical. When physics began building mathematical explanations it opened up a new level of understanding, pushing the boundaries of mathematics over centuries. We should expect the same to happen to AI. But to return to the point. What is important is that the dialog is happening, a lot of back and forth, a lot of advances are occurring and that’s really what science is all about. But here’s the other point that I want to make, which is that what we’re trying to extract from the brain on a computational level are general principles. Some of these are well known, it’s just that we haven’t gotten around to actually exploring them in a computational way until recently. Here are some of the general principles. First of all, much of computer science is about computing on a von Neumann architecture. Algorithms are designed to run efficiently on that architecture. And there are many wonderful algorithms. Digital computers are remarkably useful. But to solve difficult problems in vision and language you need to scale up to massive amounts of computation. We need to find algorithms that scale up. And none of the traditional AI algorithms developed on the von Neumann architecture scale up because of combinatorial explosion. Nature has explored the space of massively parallel architectures. We can now explore that space with large-scale digital computing. Neural network algorithms, and many others to come, use parallel digital computers highly efficiently. It’s a natural fit, since supercomputers now have hundreds of thousands of CPUs and GPUs. Each neuron in the brain works asynchronously with all the others, so the computation can be spread out over a lot of processors in a distributed way, but you still have to find some way to integrate all that information. We have much to learn from brains about the global control of distributed computing. The second general principle that was missing from AI, which we now can appreciate was incredibly important, is learning from massive data streams. Humans learn through sensory experience and motor exploration. Adaptation is an essential part of signal processing and control theory. Biology adapts by changing the actual structure of the brain — the hardware adapts in response to experience in the physical world. Look at a baby: It’s unbelievable how much is going on in the brain of a baby adapting to the environment it happens to be born into: the language of the parents, the culture of the society, and the properties of the world. Learning is a very, very fundamental principle.
Hanson: So, let me just interject something here. Some of the critics, you know – I can bring up some names later, but Tommy Poggio comes to mind as somebody I think highly of, though I think he’s wrong – say “well, these things are really interesting but they really can’t learn like babies, because they’re really learning on labeled data, they’re learning on millions and millions of labeled examples, and somebody’s got to label these, or they have to be labeled somehow in some automated process”. Also, the standard objection that really comes from the 1980s is about one-trial learning. You know, “babies start speaking after hearing one word”. Of course they don’t, but there’s that sense in which people will say that, and I can provide counterexamples to it. But what amazes me is I read something the other day that says the reason deep learning is on the verge of failing is because it can’t do one-trial learning.
Sejnowski: It’s interesting. These people are generally not the ones that are trying to fix it. Actually, Tommy Poggio is an exception. He has been a leader in analyzing learning algorithms. I think that these are important issues. But one thing that you have to keep in mind is that we’re at the very beginning, we’re at the Wright brothers’ stage, and we just got off the ground.
Hanson: That’s a good point.
Sejnowski: We’re not going to the moon in the next five years. That is asking for too much, like asking a baby to walk. But we are laying the foundation that will eventually take us to the moon and beyond. Nature gave us existence proofs that this is possible. Supervised learning is just one of many types of learning. The emphasis so far has been on supervised learning because it’s fast. Look how many years it takes for a baby to learn about the world largely using unsupervised and self-supervised learning. If we want to develop a general AI we may have to take this route.
Hanson: But there is GPT-3 and GPT-4 and so on…
Sejnowski: These natural language networks are harbingers of future networks driven by increasing computer power. They have hundreds of billions of weights. I heard that it took $20 million to teach GPT-3 how to construct sentences in response to queries. I gave GPT-3 a press announcement from Neuralink about their new brain-computer interface and asked it to write ad copy. I was amazed when it came back with “Don’t wait, upgrade your brain!” I don’t think that even Elon Musk could have come up with that. What these language models have taught us is that few-shot learning becomes possible as you scale up the size of the network. The same massive network can perform many different language tasks. This is a third general principle, the principle of scale, and it is the answer to critics who complain about few-shot learning. It magically becomes possible in very large networks like our brain. Human language is a latecomer in evolution. Deep learning is a model for the neocortex, which evolved 200 million years ago in mammals. The neocortex vastly expanded in primates and especially in humans. The cerebral cortex has an architecture that is scalable. But before mammals arrived, reptiles, dinosaurs, and birds were quite capable of surviving without a neocortex. Learning to survive depended on reinforcement learning, an ancient part of our brain, and we now know the algorithm — temporal difference learning — something we share with all species. Reinforcement learning makes it possible to learn sequences of actions in order to gain future rewards. This is how AlphaGo beat the world champion Go player. There are other parts of the brain that are equally important but haven’t yet made it into AI and are incredibly important for autonomy, like the cerebellum that creates a forward model of your actions.
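(For readers who have not met temporal difference learning, here is a minimal tabular TD(0) sketch on a made-up five-state chain – an illustrative toy, not the AlphaGo machinery: each state’s value estimate is nudged toward the immediate reward plus the discounted value of the next state, so credit for a delayed reward propagates back through the sequence of actions.)

```python
import numpy as np

# Tabular TD(0) on a 5-state chain: start at state 0, move right one step at a
# time, and receive a reward of +1 only on reaching the terminal state.
n_states, alpha, gamma = 5, 0.1, 0.9
V = np.zeros(n_states + 1)          # value estimates; last entry is the terminal state

for episode in range(500):
    s = 0
    while s < n_states:
        s_next = s + 1
        r = 1.0 if s_next == n_states else 0.0
        # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

# Earlier states end up valued at the discounted future reward (0.9, 0.81, ...).
print(np.round(V[:n_states], 3))
```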
Hanson: So, it would sound like some kind of deep learning robotics would be crucial as a research direction, and it seems to me that, you know, if you look at Boston Dynamics, they’re certainly not using that technology per se.
Sejnowski: Boston Dynamics uses traditional control theory with a central controller. BigDog is impressive, but not as impressive as real dogs, which use distributed control, making it possible to react rapidly in a reflexive way and to plan movements flexibly at progressively longer time scales in higher brain areas. The point I’m trying to make is that the brain is not a one-trick pony. It doesn’t have one learning algorithm. There are dozens of learning algorithms that we haven’t tapped yet, and these are the ones that babies rely on to ground their brain and body in the physical and social world.
Hanson: OK, that’s an important point, because that’s something I was going to ask and go off on a tangent about: different kinds of learning algorithms and what’s yet to be discovered, you know, from looking at the brain. And obviously, discovering some kind of credit assignment using error propagation – you’re right, it certainly has something to do with mapping and cortical maps. It’s great, but obviously we have a lot more functionality available, so there must be a lot more learning rules. But then this goes back to this architecture business, because you and I remember – we go back to 1986 – we know about back propagation, and we know that that architecture and the way in which it could learn turned out to be limited. It’s not just adding more layers, it’s obviously adding tweaks that make the error signal sustain itself longer, and other interesting computational gradient approaches that are really solid technology and optimization, not necessarily understood within neuroscience. So, the people commenting on the explosion… I read something, I think it was Josh Tenenbaum: “well, you know, they’re just doing classification”. BUT humans basically do classification – that’s basically what we do all the time!
Sejnowski: Classification isn’t our only special power. Deep learning networks have an even deeper capability: they’re universal function approximators.
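(To make “universal function approximator” concrete, here is a hedged numpy sketch – an illustrative shortcut using random hidden features and a least-squares readout rather than full gradient training: as the number of nonlinear hidden units grows, a single hidden layer fits an arbitrary smooth target curve increasingly well.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: a nonlinear function we pretend not to know.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(2 * x) + 0.3 * x**2

def hidden_features(x, n_hidden):
    """One hidden layer of random tanh units (hidden weights fixed, not trained)."""
    W = rng.normal(size=(1, n_hidden))
    b = rng.normal(size=n_hidden)
    return np.tanh(x @ W + b)

for n_hidden in (3, 30, 300):
    H = hidden_features(x, n_hidden)
    # Fit only the output weights by least squares.
    w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
    err = np.mean((H @ w_out - y) ** 2)
    print(f"{n_hidden:4d} hidden units -> mean squared error {err:.5f}")
```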
Hanson: There’s that. Yes, there’s that.
Sejnowski: We are just beginning to appreciate how important this is. It’s having a huge impact on science, in every area of science. Just look at protein folding.
Hanson: Protein folding is amazing.
Sejnowski: That’s a problem that biologists thought would never be solved, because in nature proteins fold through physical interactions and computing the molecular dynamics is intractable. How did DeepMind do it? There were enough protein structures solved by x-ray crystallography to allow deep learning to generalize to all the proteins in the human body. By looking at the structure on the output, it was able to figure out how it got there from the input string of amino acids, without having to go through every femtosecond of folding. The universal function approximator is infiltrating every area of biology, every area of physics and every area of science, because of the big data that are piling up. To give one more example from biology: the resolution of light microscopes is limited by the diffraction of light to around half a micron. But biologists have figured out how to beat the diffraction limit and have achieved super-resolution. So, how does that work? You label a protein with a fluorescent tag and image the photons; the image is a fat Gaussian, but you can estimate the center of that Gaussian with nanometer precision. Cell biologists can now track single molecules with nanometer resolution.
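(A back-of-the-envelope sketch of the localization trick, with illustrative numbers and in one dimension only: each photon is blurred by roughly the diffraction-limited width, but the center of N photons is pinned down to roughly that width divided by √N, so thousands of photons give nanometer precision.)

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy single-molecule localization: each detected photon lands at the true
# emitter position plus diffraction-limited blur. The 250 nm standard deviation
# and the emitter position are illustrative assumptions.
true_position_nm = 1234.5
blur_sigma_nm = 250.0

for n_photons in (10, 100, 10_000):
    photons = rng.normal(true_position_nm, blur_sigma_nm, size=n_photons)
    estimate = photons.mean()                      # center of the "fat Gaussian"
    expected_precision = blur_sigma_nm / np.sqrt(n_photons)
    print(f"{n_photons:6d} photons: estimated center {estimate:8.1f} nm "
          f"(theoretical precision ~ {expected_precision:5.1f} nm)")
```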
Hanson: Amazing.
Sejnowski: The downside is that a super-resolution microscope costs a million dollars, and not many labs can afford one. It takes days to collect enough photons, so the throughput is low. But now you can take an image with a conventional microscope and use it as the input to a deep learning network trained to give a super-resolution output. After training on enough pairs, the network generalizes to other low-resolution images and voilà – the cheap light microscope becomes a super-resolution microscope. You get the answer within minutes, so you don’t have to wait all day. This is democratizing super-resolution for cell biologists. Deep learning is a function approximator. It can take what looks like a noisy image and extract hidden information that transforms it into a super-resolution image. This is just one out of many amazing applications that are revolutionizing science and engineering. In the 1980s neural networks were dismissed as just doing “statistics.” Well, they vastly underestimated the power of statistics. It’s like saying that all of physics is just arithmetic.
Hanson: Yeah, that’s the reason I’m bringing this up in the context of this conversation. The other thing about the protein folding work that is so amazing – and I just watched some of the videos and read the paper too – is that there are also some causal principles they seem to be extracting from it about the biology. But, like a lot of deep learning work, the post-hoc analysis is difficult, because the network is doing something that we may not immediately understand. I mean, function approximation is a wonderful thing, but it’s universal, which means we may not know exactly what it did until later.
Sejnowski: This is where mathematicians are needed. A few years ago the National Academies organized a symposium called the Science of Deep Learning. It wasn’t organized by AI engineers, but by mathematicians who are creating mathematics. Deep learning is paradoxical and an opportunity to create new mathematics. Something is working and it shouldn’t work according to old mathematics.
Hanson: That’s right. Called the Science of Deep Learning.
Sejnowski: Deep learning shouldn’t work based on sample complexity in statistics – networks are vastly overparameterized and should overfit the data. We were told back in the 80s that our neural networks had way too many parameters, but that didn’t stop us.
Hanson: And simple regularizers, which I cared about a lot back then, basically worked amazingly well. John Moody had introduced a kind of Hessian analysis of all the weights, and he showed that, you know, “crap, some of the weights aren’t doing anything”! There’s some kind of natural regularization, and of course there have been theses written on this, and people are still asking why, in these high-dimensional spaces, the system just says “I’m going to use the data I need right now and wait for more data”. I mean, this is a very interesting phenomenon about the layers themselves…
Sejnowski: Absolutely, and there’s so much more to be done there. At the symposium, Peter Bartlett had an insight that overparameterization was less of a problem in high-dimensional spaces. Our intuitions and theorems are for low-dimensional spaces. The geometry of spaces with billions of dimensions is completely different from our intuition about three-dimensional space. I wrote a paper on this theme entitled “The Unreasonable Effectiveness of Deep Learning in Artificial Intelligence” in a special issue of PNAS.
Hanson: I read that. I loved that paper. This is a very critical point, as I’m sometimes confused for a statistician. Statisticians, if you come to them with your problem, will say “well, just get rid of all those other variables, you need to get this as low-dimensional as possible so it’s linear, damn it!” Of course, most statistics deals with cases where the central limit theorem actually applies properly – a small number of random variables that you’re summing over, so it will be Gaussian. But, by the way, nothing’s Gaussian and nothing’s linear, for God’s sake. Everything is non-linear and non-Gaussian.
Sejnowski: In the brain, what’s remarkable is that everything from distributions of activity to distributions of synaptic weights is log-normal; that is to say, they have long tails.
Hanson: I heard Buzsáki saying something about this a few years ago.
Sejnowski: Yes, György Buzsáki has written a book about this, “The Brain from Inside Out.” The consequence of long tails in the synaptic weights is that the big weights out on the tail are doing the heavy lifting. They also tend to be the most stable – most other synapses are turning over all the time.
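(A quick illustration of what a long-tailed weight distribution implies, using made-up log-normal parameters rather than measured ones: sort the synapses by strength and a small fraction at the top of the tail accounts for a disproportionate share of the total synaptic weight.)

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy population of synaptic weights drawn from a log-normal distribution
# (the parameters are illustrative, not fitted to any dataset).
weights = rng.lognormal(mean=0.0, sigma=1.5, size=100_000)

sorted_w = np.sort(weights)[::-1]                 # strongest synapses first
cumulative = np.cumsum(sorted_w) / sorted_w.sum() # running share of total weight

for top_fraction in (0.01, 0.05, 0.10):
    k = int(top_fraction * len(sorted_w))
    print(f"top {top_fraction:4.0%} of synapses carry "
          f"{cumulative[k - 1]:5.1%} of the total weight")
```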
Hanson: That’s right.
Sejnowski: If you start from different random places in weight space, at the end of learning you end up with different networks that are equally good solutions. You don’t have one function and one set of parameters that solves the problem, which is what most statisticians assume. Finding the optimum set of parameters to solve a low-dimensional nonlinear problem is like finding a needle in a haystack. Non-convex gradient descent learning is possible because singular points of the loss function are saddle points that take you to the bottom, like wormholes in spacetime. But if there’s a huge number of possible networks, then getting a good solution is like finding a needle in a haystack of needles. In other words it’s a completely different problem in high-dimensional spaces of parameters.
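(Here is a small, hedged demonstration of that point – a toy overparameterized network of our own construction, trained by plain gradient descent from several random seeds: every run reaches a comparably low training loss, yet the final weight vectors are far apart, i.e. many different networks solve the same problem about equally well.)

```python
import numpy as np

# Toy demonstration: the same overparameterized network, trained from different
# random starting points, ends up at different weights that fit the data about
# equally well. Architecture, data and hyperparameters are illustrative choices.
x = np.linspace(-2, 2, 40).reshape(-1, 1)
y = np.sin(2 * x)

def train(seed, n_hidden=50, lr=0.01, steps=30000):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(1, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
    for _ in range(steps):
        h = np.tanh(x @ W1 + b1)            # hidden layer activity
        err = h @ W2 - y                    # prediction error
        # Backpropagate the squared-error loss through both layers.
        grad_W2 = h.T @ err / len(x)
        grad_h = (err @ W2.T) * (1 - h**2)
        grad_W1 = x.T @ grad_h / len(x)
        grad_b1 = grad_h.mean(axis=0)
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
    loss = np.mean((np.tanh(x @ W1 + b1) @ W2 - y) ** 2)
    return loss, np.concatenate([W1.ravel(), b1, W2.ravel()])

runs = [train(seed) for seed in range(3)]
for seed, (loss, _) in enumerate(runs):
    print(f"seed {seed}: final training loss {loss:.4f}")
print("distance between solutions 0 and 1:",
      round(float(np.linalg.norm(runs[0][1] - runs[1][1])), 2))
```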
Hanson: This is the lottery ticket hypothesis kind of argument – I forget the kids who made this – but it’s the idea that there’s an equivalence class of solutions and one of them is going to get picked out in a lottery. And then you can scale the entire network down to that particular solution and it still works once you remove, you know, 75% of the original parameters. There’s also a book that’s coming out by another two kids, two guys at Facebook, who pushed it onto the arXiv a month or two ago, called “The Principles of Deep Learning Theory”. It’s a 500-page book that is mostly math, and I’ve talked to them a bit. My goodness, it’s like they took a quantum mechanics beer truck and smashed it through a deep learning wall! They just put all this physics in it and they’re looking at it. And I was very impressed with… I mean, they seem both naive and very impressive.
Sejnowski: This is an indication of how much energy is going into analyzing deep learning networks. That must’ve taken — 500 pages, come on — a huge investment of time and effort. But, it’s happening all over the world. Mathematicians are making real progress. This is where explainability is coming from.
Hanson: There was a deep learning workshop that all of a sudden appeared in Princeton at the Institute for Advanced Study, and there was a mathematician – I’m sorry, I don’t remember his name – who had a group and ran a couple of workshops. I attended a couple of them, and the people giving plenary talks were singularly unimpressive, but all the young people – amazing! And the posters – “what the hell are you doing here?” – I was shocked at every poster I went to. It’s just one of those early signs that, you know, the weather’s changing, folks.
Sejnowski: Yes, Steve, you’ve just put your finger on what’s happening right now, which is that for senior people who have already made their careers in an area that was, at the time, very exciting – and I won’t mention names like Michael Jordan – a lot of things are now passing them by. It’s the younger people jumping in who have the new ideas. They have the energy and the enthusiasm. My students are so much further ahead than I was at their age, in terms of sophistication and the computational facilities that they have.
Hanson: Well, this is right. I mean, I had started writing something of a historical exegesis. I got this great book (U. Chicago Press) and it covers all the transactions between von Neumann, Wiener, Shannon and just about everybody who was anybody in computation. Now, they had a whole bunch of psychiatrists there, and they had Margaret Mead and her awful husband Bateson.
Sejnowski: Cybernetics?
Hanson: Yeah, transactions on all of the original cybernetics meetings.
Sejnowski: It was Warren McCulloch who organized it.
Hanson: Correct, McCulloch was given a grant by the Josiah Macy Foundation to run nine years of this thing.
Sejnowski: When I was a grad student, when I should have been studying for my general exam, I read all of those transcripts. It was like being a fly on the wall listening in on what they discussed.
Hanson: I got it a couple of years ago.
Sejnowski: I just bought a compilation of all the conference transcripts – there were five, from 1946 to 1953.
Hanson: Oh yeah I know there’s a book you can get that has all the transcripts in it.
Sejnowski: I’ve been perusing it and was reminded that they had some really good ideas. The subtitle of Wiener’s book on Cybernetics is: “Control and Communication in the Animal and the Machine.” This was a forerunner of modern control theory, but somewhere along the way they dropped the “Animal.” We can put animals back into cybernetics now that we have better models. John Doyle at Caltech has a new theory for distributed control that explains the presence of massive internal feedback pathways in brains.
Hanson: They did, and a lot of it is familiar from what we’re going through now. They didn’t really have a computational substrate, but they had a lot of theory. I mean, his colleague Pitts set the stage for propositional logic being in the brain. Then there was all this other stuff. Now, what was interesting when I was reading more about it – there were so many psychiatrists there, and they were all pushing things… One of them was actually connected with the CIA, and he was the guy who pushed MKUltra. And the Macy foundation funded that stuff. The other thing was an endocrinologist who attended the 3rd and 4th meetings, and he was talking about reproductive control… at this point they were all racists and eugenicists anyway. They said “anything you could do here about reproductive control?” He says “yeah, probably”, and so he invented the pill. So the Macy foundation, despite the fact that they started out chasing artificial intelligence landmarks, basically funded MKUltra and the pill. Josiah Macy was just this awful character, but he believed in science, and the whole idea after World War II was really to go across the sciences – that’s why they brought all these people together. The second thing that happened from this is that the computer scientists and mathematicians who were there hated the psychiatrists so much that they organized a meeting with a little assistant professor at Dartmouth named John McCarthy, and it was that sort of backlash from the Macy meetings that initiated the first AI conference. It wouldn’t have happened except these guys said “we don’t want these psychiatrists here. Let’s study computation!”
Sejnowski: It’s wonderful to go back and visit these historical moments. I’ll bet 99% of those following your blog have never heard of Josiah Macy. This set of conferences is an obscure corner of science history. I took several history of science classes when I was a graduate student. Basically, when we go back and find out what was actually happening during a period of scientific discovery, it was completely chaotic. They didn’t know for sure what was right and where things were going. People were interacting, they had many ideas. The history of science in textbooks is mainly focused on great men — Isaac Newton had all the ideas and everyone else just worked out the details. But science emerges from a community effort. Newton and others were part of a social network, interacting and fighting with each other. And that’s what’s going on right now. We have a new wave of people coming in from all areas of science and engineering including the social sciences with ethical concerns. NeurIPS had its 34th annual conference recently with 17,000 attendees from all these diverse tribes, a nontrivial scaling problem. When we founded NeurIPS, we thought we were going to change the world, but what we didn’t know was that it would take 30 years.
Hanson: Of course, the advantage younger people have over that kind of historical exegesis we were talking about is that they’re worried about solving problems. They don’t want to know who McCulloch was, and they don’t really care. They are just doing it. So, it’s like saying “we’re going to make a risotto tonight”, and while we start debating about the best way of doing it, someone goes to the kitchen and just does it – they make it.
Sejnowski: You’re absolutely right. This reminds me of the old Chinese proverb: “People who say it can’t be done should not interrupt those doing it.” What’s important is not to argue about philosophy or who did what, but to make a contribution. There’s also another good reason why it’s impossible to accurately reconstruct the past: It’s too complicated. You only have enough room in the textbook for a thumbnail sketch. And history books have biases.
Hanson: It’s a narrative, you have to have the narrative and it has to be pigeonholed so people can go yeah that’s the thing.
Sejnowski: It’s all about stories: Origin stories, mystical stories, scientific stories.
Hanson: I have kind of an architecture question, but I want to go back… The architecture question is, you know, we’ve got ZFNet, AlexNet and LeNet and all the name-nets, and the architectures vary a bit, but the principles under which these folks are constructing the architectures still seem to me a little bit after the fact. A little bit seat of the pants. We probably need something more narrow than broad. So there’s a sense in which there’s not a principle, there’s not what you’d call a handbook of architectures. If there were such a thing, written like “The Handbook of Deep Learning Architectures”, what would it actually be? There would have to be some kind of hierarchy.
Sejnowski: Someone will write that book someday, but it will be based on a richer set of architectures than the CNNs and the transformers we have today. It’s funny you bring this up because I’m revising a paper that I think is going to illuminate that question.
Hanson: Can you send it to me?
Sejnowski: I’d be happy to. There’s an earlier arXiv version; I’ll send you the revision. So, with a deep learning network, we fix the architecture and connectivity: the number of layers, the numbers of units, all the hyperparameters. Then we put all the data in and train it up. At the end we have a network that is fixed and does remarkably well at whatever it’s supposed to do. One application per network. Ben Tsuda and I have been collaborating with Hava Siegelmann at the University of Massachusetts and Kay Tye, who works on neuromodulation here at the Salk Institute. We have uncovered something interesting. Here’s the motivation: the ganglion in the lobster that controls how food moves through its stomach is a tiny network with 26 neurons. How does such a tiny brain control all the muscles needed to ingest whatever the lobster finds at the bottom of the ocean? It uses neuromodulation. Modulatory inputs can shift the network to do different things, different functions, by changing either the strengths of the weights or the excitability, or other physiological parameters. You can make the same network do more than one thing.
Hanson: So, it taps different dynamics.
Sejnowski: Yes. For example, in the lobster stomach there’s something called a gastric mill that chews up the food but it can have two different frequencies. One has a period of one second and the other is 10 seconds. They work sequentially to prepare food and move it along the digestive tract. This is orchestrated by dozens of neuromodulators. In our brains, neuromodulators project diffusely throughout the cortical mantle and other brain areas and are very important for regulating cognitive function. We trained a recurrent neural network to do two different things, one with the modulation that simply increased the strengths of all the synapses by 50% and one without modulation. We discovered that yes, we can train the network to give two different outputs to the same input depending on the state of modulation. And then we said, “OK, that’s interesting, it means that by changing a single parameter, you can embed two different functions in the same network”. Then we showed you can embed 9 different behaviors in a recurrent network with 200 units by just modulating 10% of them at a time. This is how even a small network with just a handful of neurons can be flexibly reconfigured to have many functions.
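(A minimal sketch of the mechanism, not the trained networks from the paper: one fixed recurrent network, one fixed input channel, and a single global “neuromodulatory” gain that scales the recurrent weights by 1.5, as in the 50% increase described above. Under modulation, the same network computes a measurably different input-output function.)

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy recurrent network with fixed, random weights; all sizes are illustrative.
n_units, n_inputs = 50, 10
W_rec = rng.normal(scale=1.0 / np.sqrt(n_units), size=(n_units, n_units))
W_in = rng.normal(scale=1.0, size=(n_units, n_inputs))
w_out = rng.normal(scale=1.0 / np.sqrt(n_units), size=n_units)

def output(x, gain, steps=50):
    """Readout of the network's activity after `steps` updates of
    r <- tanh(gain * W_rec @ r + W_in @ x); `gain` plays the neuromodulator."""
    r = np.zeros(n_units)
    for _ in range(steps):
        r = np.tanh(gain * (W_rec @ r) + W_in @ x)
    return float(w_out @ r)

# Same weights, same test inputs; only the modulatory gain differs.
test_inputs = rng.normal(size=(200, n_inputs))
unmodulated = np.array([output(x, gain=1.0) for x in test_inputs])
modulated = np.array([output(x, gain=1.5) for x in test_inputs])

# The two settings implement different input-output mappings; the correlation
# below quantifies how far apart they are on this test set.
print("correlation between the two functions:",
      round(float(np.corrcoef(unmodulated, modulated)[0, 1]), 2))
```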
Hanson: So they’re not orthogonal functions, but they’re functions that might create a mosaic across the function space so that they can coexist and be tapped by the neuromodulatory kinds of effects.
Sejnowski: Exactly. By varying what subset is being modulated you vary the output for the same input. You can even have the modulator change the behavior to its opposite. So, we analyzed the trajectory of activity in the network in this high-dimensional space. Activity is confined to a low-dimensional hypertube moving through the high-dimensional space. Neuromodulators can move the hypertube around. Interestingly, there’s experimental evidence for this in rodents. David Tank has a recent paper in Nature where, during a task, activities in large populations of neurons go through hypertubes in activity space. It’s really interesting.
Hanson: That sounds terribly exciting and I look forward to looking at that. I see what you’re saying, but if you’re a Google engineer sitting there trying to tweak some task a little bit to get, you know, one tenth of a percent less error… usually when I talk to some of these folks – at least one was my student – he says “well, we just add more hidden units, you know”. So, like Alan Lapedes – the fairy dust approach – just sprinkle hidden units. Sadly, at least from the engineering point of view, that seems to be the prevailing strategy, right?
Sejnowski: We have an established paradigm that is being optimized. This is what happens to all technologies. Let’s look at the steam engine. The first working steam engines, which James Watt famously improved, were inefficient and dangerous – they tended to blow up. He and other engineers tinkered with governors and other improvements over decades. And then the theory of thermodynamics made it possible to optimize the design. So now we have a safer and much more efficient steam engine, but it took a hundred years to get to the point where steam engines could pull a million pounds of freight across the continent. You start with an intuition, you build a device that barely works, you optimize it, you keep improving and improving, and eventually you start solving more and more diverse problems, like plowing fields and running factories. We’re just at the very beginning of that process. What we now have is a new computer architecture based on a new set of principles that will undergo many generations of improvements. Nature is a wonderful source of ideas, because nature has honed processes through evolution and found ingenious solutions. There are hundreds of different types of neurons in the cortex. Nature has optimized each brain component and given them inductive biases, starting out with generic types of neurons connected in generic ways and then specializing them. These biases have evolved over hundreds of millions of years. That’s why babies, when they come out of the box, are already pretty far along in the development of their brain systems. They don’t have to learn everything. But what they do learn makes all the difference in the world between a helpless baby and a mature adult.
Hanson: But the inductive bias – part of it is what you’re describing, as opposed to how, let’s say, Chomsky and various of his students might describe language acquisition: like a bump in the head, the so-called LAD! So for you, inductive bias is a lot more subtle. It somehow reduces the function class you’re living in; maybe that is what you have to deal with as a baby.
Sejnowski: Up until recently computer science was living in a focused part of computation space. The von Neumann architecture is very powerful but its inductive biases have severe limitations in the time domain. As you say, inductive biases in computer architectures constrain the function class they can compute. But by constraining it in the right way, it can become more flexible. We’ve explored a few architectures in the space of massively parallel architectures, and we have found a few algorithms that work really well for difficult problems in vision, speech and natural language. But that space is vast and we’ve only explored a tiny part. We are sending search parties out into a vast space where no man or woman has gone before. They’re coming back with marvelous new architectures that they’ve discovered. It’s going to amaze us because we’re still living here in Flatland (one of my favorite books). And the mathematics will be different from anything we now know.
Hanson: Maybe in one dimension. So yeah, there’s actually a quote here from your PNAS paper. “Once regarded as ‘just statistics’, deep recurrent networks are high-dimensional dynamical systems through which information flows much as electrical activity flows through the brain”.
Sejnowski: It’s a dynamical system and a very complex one.
Hanson: Well, it’s a great quote.
Sejnowski: This is the key to the AI-Neuro convergence.
Hanson: That’s one of my favorite quotes in the paper, and you’re creating a space so people can understand what’s going on. And the problem is, what we’re seeing is that there are just a lot of people looking for spotlights, coming in and criticizing and saying “oh, we’re just on the verge of another AI winter”, and of course the media has basically taken deep learning and called it AI – that’s just the reality of our situation. So whatever happens, you know, if Elon Musk develops the algorithm and smashes cars into people, it’s awful, and it will be blamed on deep learning. We know he doesn’t use deep learning, so, so far we are OK.
Sejnowski: This is another lesson. So it took 100 years to perfect the steam engine. It’s going to take decades to get to the point where we have safe and reliable self-driving cars. All technologies take time. There’s nothing special about this technology. There will be unintended consequences along the way. When something happens that you don’t anticipate, you don’t give up but work hard to fix it.
Hanson: One could say the so-called AI winter from the 1980s and 90s was really just part of this longer process; it really wasn’t an AI winter, it was kind of a pause. And here we are with the whole system rebooting itself, but really not that far from where it was – if you remember back propagation… In other words, if you came into this right now and asked “what is this deep learning stuff”, and you didn’t know there was back propagation, you would say “oh my God, it’s all brand new and amazing…”. No, it’s this continuous thing that is unfolding in front of us. I think this AI winter stuff is the kind of made-up narrative people use to explain things to themselves.
Sejnowski: Ironically, the so-called AI winter in the 80s and 90s was neural network spring: We just didn’t know at the time that it would someday become a foundational part of AI. Scalability is what is driving this. You brought this up earlier. The main challenge in solving difficult problems is how to scale up computation. We didn’t know in the 80s whether we could scale up our tiny networks to solve these problems. You’re right, it’s the same learning algorithms, but on steroids.
Hanson: Right, it’s different. It does feel like the thing became more of an amoeba that’s covering history in an interesting way that we wouldn’t have predicted in 1998. I wouldn’t have guessed this.
Sejnowski: Here’s a little retrospective: Remember NETtalk?
Hanson: I do remember NETtalk. (Charlie Rosenberg – who was in the same Princeton lab as Hanson – worked with Sejnowski for a summer when NETtalk was created.)
Sejnowski: It was a language app in the 1980s, a network that learned how to pronounce the letters in words, which are highly ambiguous in English with a lot of exceptions. It worked. The same architecture could both generalize and learn the exceptions. How could a tiny network with a few thousand weights have solved a difficult language problem? But what I could not imagine back then was what a network with billions of weights could solve. Well, we now know that neural networks love language. That shouldn’t surprise us because speech is just another complex motor activity and language is at the top of that motor hierarchy. As human cortex expanded, new capabilities became possible, but with the same architecture. That’s why GPT-3 needed such a big network to become so effective. It’s all about matching the computation to the architecture and having enough representational power. Language loves neural networks.
Hanson: I had written somewhere, in a review of somebody’s book – I can’t remember whose it was now, it was back then – “well, for sure we know that you’re never gonna have a machine just use back propagation and read the encyclopedia and learn language – that’ll never happen”. I was giving a talk somewhere and I put that quote up and I said “wrong”. And then we get to the GPTs and you go “well, I don’t know what’s happening there, I mean probably these are high-dimensional spaces with clusters of phrase structure that get parsed out by similarity”, but whatever – it’s still incredible.
Sejnowski: The beauty is that the very same network does both syntax and semantics …
Hanson: Which it should. That’s what we do, right.
Sejnowski: Talking about syntax – Noam Chomsky, in his influential essay “The Case Against B.F. Skinner” in the New York Review of Books, had a powerful argument that dismissed the possibility that learning could create language capabilities. He said that he could not imagine being able to express the relevant properties of language without using abstract theories. Philosophers call this an “argument from ignorance”: I can’t imagine it, therefore it’s impossible. Never believe an expert who tells you something is impossible when someone else is doing it.
Hanson: That’s right. Dave Rumelhart had a phrase you might remember: “proof by lack of imagination”.
Hanson: You know what, I think we’ll end it here. We are probably past time. This has been so much fun, I could just go on talking and talking with you, and who knows – if I continue doing this before I’m dead, I may come back some other time if you’re available. But this has been fantastic, thanks again.
Sejnowski: Looking forward to that!
Terry Sejnowski is Professor and Laboratory Head of the Computational Neurobiology Laboratory at the Salk Institute. He is also Professor of Biology at the University of California, San Diego. He is the President of the Neural Information Processing Systems (NeurIPS) Foundation.
Stephen José Hanson is Full Professor of Psychology at Rutgers University and Director of Rutgers Brain Imaging Center (RUBIC) and an executive member of the Rutgers Cognitive Science Center.
Other interviews in this series:
What is AI? Stephen Hanson in conversation with Richard Sutton
What is AI? Stephen Hanson in conversation with Michael Jordan