The AIhub coffee corner captures the musings of AI experts over a 30-minute conversation. This month, we discuss an article that appeared recently in IEEE Spectrum entitled Deep learning’s diminishing returns. The article reports that deep-learning models are becoming more and more accurate, but the computing power needed to achieve this accuracy is increasing at such a rate that, to further reduce the error rates, the cost and environmental impact are going to be unsustainably high.
Joining the discussion this time are: Tom Dietterich (Oregon State University), Stephen Hanson (Rutgers University), Sabine Hauert (University of Bristol), and Sarit Kraus (Bar-Ilan University).
Sarit Kraus: I would like to start by considering the research aspect. Suppose a PhD student has a great idea about how to improve some machine learning algorithm. Now they need to show that this improved algorithm is much better than all those before. In other, non-machine-learning settings, the student could implement the new algorithm, implement the previous algorithm, and compare the two in several testbeds. But in machine learning the student needs to find the specific features and the hyperparameters of the network. For this, they need to run a huge number of experiments with many possible parameters, and do smoothing and all sorts of tricks, in order to show that their version of the algorithm is better than previous ones. Maybe the new algorithm is great, maybe it isn’t, but proving it requires a lot of energy and a lot of data. Similarly, when you want to implement something, the time and GPU time it takes are huge, and this is one of the problems.
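Sarit’s point about the experimental burden can be made concrete with a toy calculation: even a modest hyperparameter grid multiplies into hundreds of training runs. The hyperparameter names, values, and per-run cost below are illustrative assumptions, not figures from any particular paper.

```python
import itertools

# Illustrative hyperparameter grid (all values made up for this sketch)
grid = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [32, 64, 128],
    "num_layers": [4, 8, 12],
    "dropout": [0.0, 0.1, 0.3],
    "seed": [0, 1, 2],  # repeated runs for statistical robustness
}

# Cartesian product of all settings = one training run per combination
configs = list(itertools.product(*grid.values()))
print(f"training runs for one full sweep: {len(configs)}")

# Hypothetical cost of 2 GPU-hours per run
gpu_hours = len(configs) * 2
print(f"approx. GPU-hours for the sweep: {gpu_hours}")
```

Five modest-looking axes already produce 324 runs; add a few more choices (optimiser, schedule, architecture variants) and the sweep grows multiplicatively, which is exactly the cost Sarit describes.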
Tom Dietterich: It seems to me that it’s the same as if you propose a new algorithm for simulating what’s happening inside a nuclear fusion reactor. Or, you have a proposal to improve world-wide weather forecasts. Some computations are just big – they require supercomputers, and that’s just the reality and graduate students really can only contribute to those if they get an internship at the UK Met Office or the US Department of Energy, for example.
Sarit: Right, but taking the energy example, it’s a very well-defined problem you want to study and there are not many people trying to solve it. But the deep-learning people said “well, you can use deep learning to do anything”, so the large-scale use of this tool means that for everything you do you need so much computational power. So it’s not just for the specific case of a nuclear reactor, it’s for almost any application. I don’t have a solution. Somebody told me that one graph in a DeepMind paper would probably cost you $100,000 to reproduce. I’m not that worried about the money, though; the point here is the environment. Think how much we harm the environment by producing this research, which may be interesting or it might not be.
Tom: Right now we’re in the age of brute-force deep learning. We don’t know why these things really work, so we lack the theory to tell us how we could do it much more efficiently. But I think that theory is emerging. We’ve had a lot of progress in the last couple of years in understanding what’s going on with over-parameterisation and why it’s working. There are some people exploring alternatives, but until we really understand why this strategy is succeeding, we can’t figure out how to remove all the redundancy. I would be amazed if ten years from now we’re still spending this kind of money. Even DeepMind doesn’t want to spend $10 million to get a graph in a paper.
Of course, there has been a huge flowering of start-up companies all trying to build special purpose chips to do the same computation for much less. I thought the article was very misleading in the sense that it mentioned special purpose chips like tensor processing units and then it said later “but nobody’s using them”, but Google does all their computation on tensor processing units. If they used standard GPUs it would cost another order of magnitude, at least.
I really think we’re in a temporary period, in the sense that there’s low-hanging fruit if we just bring a lot of computation to bear, but that is definitely hitting diminishing returns. There is a lack of good theory and we actually need to think for a while about why these things work.
Stephen Hanson: Just a couple of comments on this. First off, DeepMind actually started making money, I think yesterday, which is a shocker since they were something like $400 million in debt. Joking aside, the protein-folding thing is a very important event. One thing DeepMind has been doing very well is benchmarks. Benchmarks are very nice for algorithm development, perhaps, but they don’t really help humanity much, and what we’re seeing is a precursor to something. I just thought this article was way too premature. It’s way too early to start predicting AI winters. The cost of this is going to go down; there are some very good people working on trying to make these networks smaller. There’s the original lottery ticket idea, which I think still hasn’t been exploited very well, but can be, and it will be very interesting once we figure out what its relationship is to these architectures, which are, from what I can tell, totally arbitrary. In all, I agree that the article was highly misleading. I think it’s way too early to make this argument.
Sabine Hauert: You mentioned DeepMind, and the latest in the news this week from them was their prediction of the weather. One of the challenges with these models is that you can look at the energy costs – which are very high – but if the model allows for energy savings, because lots of houses heat less since they know exactly what the temperature is going to be, or they know there’s going to be a downpour or something, then the overall balance would be interesting to explore.
I think the other challenge is the competition. So, if everyone is building their own model for the sake of owning that model, and is training their system, there’s just redundancy in the models. There’s probably an opportunity to share a little bit, although I don’t know what the business model would be.
Steve: I think that’s natural in terms of the experimental nature of everything. Again, I have to agree with Tom here, there’s not much theory. There was a book that dropped recently – The Principles of Deep Learning Theory – it’s a 500-page book and it’s mostly mathematical analysis. It’s on arXiv right now. There are people out there trying to figure this out, but this whole process is all experimental and people are just trying things out. So, you’re right Sabine. The sharing really occurs at the conferences and workshops, but it doesn’t prevent redundancy and replication.
Tom: Certainly, Google has trained its own version of GPT-3. I assume Microsoft has as well. OpenAI has given lots of academics access to GPT-3, at no cost.
Steve: OpenAI has also announced that they have GPT-4 coming out, which has something like 5-6 trillion weights, I think. Again, I don’t know what that means; it’s worrisome. Just in terms of counting bits, I don’t really understand: if you’re reading all of Wikipedia, how many bits is that, and how many trillions of weights do you need to actually store it, or whatever they are doing? That is something that is not testable. There’s no linguistic analysis of this that makes any sense to me.
Sabine: How iterative is it to go from GPT-3 to GPT-4? Do they retrain the network, or do they bootstrap from the previous version? Maybe it’s increasing returns: if we have a baseline, maybe we don’t need that much training to get the improvement?
Steve: That would make a lot of sense, except OpenAI (it’s kind of an oxymoron) is the least transparent group on the entire planet – no-one knows exactly what they’re doing and they don’t really release any of the algorithm details. So, as Tom says, Google has replicated this, and I suppose at least they published a paper on it. I think you’re right, it should actually be cumulative. You should take GPT-3 and train it with more data and more layers, or whatever, and that would be additive. I don’t know if that’s true. It’s certainly true in human knowledge acquisition. When you’re two years old you don’t know a lot, so the things you learn come fairly slowly. But once you get to three or four, language acquisition is exponential – it just jumps amazingly.
Tom: There is research on continual learning which is trying to learn over time cumulatively. But, again, we don’t understand the dynamics of what’s going on inside the networks to know when we can resume learning and carry on, or when we need to start fresh. Most people start fresh because the risk of it not working otherwise is too high.
Sabine: From the robotics angle, we’re trying to put all of our computation on board the robots and have a bit more of that AI happen at the edge. We can’t do any of this as we’re always constrained by the size of the models we have.
Tom: Every day on arXiv there are probably a dozen papers on edge computing for deep learning: quantization, shrinking, compressing, and pruning. So there’s a huge amount of research on taking an overlarge trained network and working out how to make it tiny and fit on low-energy devices. I don’t know if they’re getting to the point where it can be on your units…
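The simplest of the shrinking techniques Tom lists is magnitude pruning: zero out the weights with the smallest absolute values and keep the rest. A minimal sketch in plain Python, with made-up weights; real pipelines operate per-layer on tensors and usually fine-tune the network after pruning:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude.

    Ties at the threshold may prune slightly more than the requested fraction.
    """
    k = int(len(weights) * sparsity)  # number of weights to zero out
    if k == 0:
        return list(weights)
    # k-th smallest magnitude becomes the pruning threshold
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

# Toy weight vector (illustrative values only)
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02, 0.3, -0.6]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # half of the weights are now exactly zero
```

Zeroed weights need not be stored or multiplied, which is where the size and energy savings on edge devices come from; quantization then shrinks the surviving weights further by storing them in fewer bits.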
Sabine: So, they still need to be trained off-board then?
Tom: Yes, absolutely.
Sabine: Any other thoughts on the article? What can be done to fix this problem of diminishing returns?
Tom: We need a theoretical understanding. To get ahead, get a theory.
Steve: I think a lot of the DeepMind and Google folks assume there’s a theory. There’s some kind of axiomatic belief that it’s working because it has some sound theory behind it (even if they are clueless as to what it is) but, in reality, we have no idea how this works, we don’t understand the dynamics, it’s just a mess.
Tom: There have been some very interesting papers recently on this so-called “double descent phenomenon”. So, traditionally we’ve seen that as you increase the capacity of the network, you’ll start to overfit, and at the point where the number of parameters in the network matches the number of data points, your error goes off toward infinity because of the very high variance in the fit. What’s weird is that if you keep adding capacity to the network the error comes down again. It often comes down lower than it did in the under-parameterised regime. There’s a very nice review that came out on arXiv a couple of weeks ago from some folks in the signal processing community analysing this and there have been some other papers on this. So, we’re starting to understand why that’s working, and why stochastic gradient descent actually finds the right optima. As you add capacity there are infinitely many bad optima in the resulting network, but stochastic gradient descent, because it’s so noisy, doesn’t find the bad optima, it can only find the big, flat basins. That smoothness seems to be a critical part of the answer. We’re getting some ideas. None of them tell us how to make it work with less computation.
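The double descent curve Tom describes can be reproduced in a few lines with random-feature regression: as the number of features passes the number of training points, the minimum-norm least-squares fit typically peaks in test error near the interpolation threshold and then improves again. This is a sketch under assumed toy dimensions and a fixed seed; the exact shape of the curve depends on the noise level and the random draw.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 30, 300, 10  # toy sizes; interpolation threshold near p = 30

# Noisy linear ground truth
w_true = rng.normal(size=d)
X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = X_tr @ w_true + 0.3 * rng.normal(size=n_train)
y_te = X_te @ w_true

def minnorm_test_mse(p):
    """Test MSE of min-norm least squares on p random ReLU features."""
    V = rng.normal(size=(d, p)) / np.sqrt(d)     # fixed random first layer
    F_tr = np.maximum(X_tr @ V, 0.0)
    F_te = np.maximum(X_te @ V, 0.0)
    # lstsq returns the minimum-norm solution when p > n_train
    w, *_ = np.linalg.lstsq(F_tr, y_tr, rcond=None)
    return float(np.mean((F_te @ w - y_te) ** 2))

widths = [5, 10, 20, 30, 40, 80, 160, 320]
errors = [minnorm_test_mse(p) for p in widths]
for p, e in zip(widths, errors):
    print(f"p = {p:4d}   test MSE = {e:10.3f}")
```

The minimum-norm (flattest) interpolating solution is what makes the over-parameterised regime benign here, mirroring Tom’s point that noisy stochastic gradient descent finds the big, flat basins rather than the bad optima.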
Steve: Regularization is an important part, and has been since the 90s. I think the first person to publish something on this was John Moody, obviously back when we only had one-layer networks, but if you had a very large network and you trained it, he basically looked at the Hessian, the derivative information, at each weight, and what he showed was that, as it started to overfit, the weights that were contributing began to collapse down to the complexity of the dataset that he was fitting. This is a beautiful paper – it’s actually a precursor to what I call naturalistic regularisation.
Sabine: Going back to what you were saying, Sarit, that we shouldn’t be doing this for the sake of doing it, and it’s causing challenges and competition for researchers, is there something you think we should do to improve things? Should we ask people who submit papers to tell us about their computational costs? Should there be more training on the costs?
Steve: I think it’s premature. We’re in an experimental phase. We’re not even 10% of the way through what’s going to happen. I for one would predict that deep learning is not going anywhere. There’s no AI winter on the horizon here. One can complain about the way folks are working with it and the cost of it, but there’s something happening that’s quite unique and novel. I could see back in the 1980s that we weren’t scaling, that neural networks weren’t working on hard problems. I had a lot of speech recognition friends tell me that it wasn’t really working. I knew there was a reckoning coming; I could feel it. I don’t see that here.
Tom: From the research community standpoint, there’s certainly a move in the US to fund a National AI computational resource that academics can use to run larger-scale experiments. Presumably, they are going to have to get their experiments reviewed.
Sabine: There was a big Facebook failure this week. Are we going to have a single point of failure because everyone’s feeding off GPT-4 or some other AI model that could fail?
Tom: Or even worse, they are all training on the same data. They are all scraping all these websites, Wikipedia and so on. I’m quite worried about what I call the “garbage out, garbage in” problem, which is that now we’re going to have a lot of GPT-3-generated text which will be scraped and used as input to train GPT-n. And so we get this “garbage out, garbage in”. I think that’s the biggest risk.
Sabine: So it’s a single point of failure even when it’s working then.
Tom: One final thought, there used to be this joke that the best thing to do if you had a fixed computational budget was just to invest in the stock market for a couple of years and then buy a computer. It’s always too early to buy computers because of Moore’s Law. Maybe there’s a reverse phenomenon happening here. If I have an application that I want to use deep learning on, and it’s not urgent timewise, maybe I should wait a couple of years for the techniques to improve, because then I’ll need less computing in two years than I need today. That might be something to think about. We are in this brute force phase and it would be good if we could deepen our understanding before we spend a lot more computing time on frivolous things.
Steve: Unfortunately, by that time, Google and DeepMind would have done it, patented it, and sold it to somebody…
You can find all of our previous coffee corner discussions here.