The virtual International Conference on Learning Representations (ICLR) was held on 26-30 April and included eight keynote talks. In part two of our round-up we summarise the final four presentations. Courtesy of the conference organisers you can watch the talks in full and see the question and answer sessions.
Mihaela van der Schaar, University of Cambridge, The Alan Turing Institute, University of California Los Angeles
The aim of Mihaela’s research is to contribute to the transformation of healthcare by rigorous formulation and development of diverse new tools in machine learning and AI. Her group has worked on many problems in medicine and healthcare, including risk prognosis, modelling disease trajectories, adaptive clinical trials, individualised treatment, early-warning systems in hospitals, and personalised screening. They needed to develop a variety of machine learning methods to carry out this work. These include novel deep-learning methods, causal inference models, automated machine learning, time-series models, reinforcement learning and transfer learning.
To date, machine learning has proved very successful where there are well-posed problems, the notion of a “solution” is well-defined, and these solutions are verifiable. In medicine, however, the problems are not well-posed, the notion of a solution is often not well-defined, and solutions are hard to verify. This presents an enormous challenge, but also offers an opportunity to transform medicine.
To demonstrate what is possible, Mihaela presented a decision support system for cancer care that her lab has developed in partnership with Public Health England. This system combines machine learning with data for entire clinical histories (drawn from a UK nationwide database that is constantly updated in near real time), including the trajectories of each person’s disease. Importantly, the team have built interpretability into their model – the user can be provided with an explanation as to why a particular prediction has been made. This system could aid clinicians by providing patient-specific predictions of expected outcomes if no treatment is provided, and also the risks for a variety of treatment choices. Watch Mihaela speak more about this system at a Turing Lecture.
A number of interesting topics were covered in the talk:
Automating the design of clinical predictive analytics
Mihaela and her team have developed AutoPrognosis – a system for automating the design of predictive modelling pipelines tailored for clinical prognosis. This tool (which combines Bayesian Optimization with Structured Kernel Learning) enables researchers to study many diseases using an automated pipeline rather than having to build a new model for every single disease.
Interpreting and explaining predictions
In the field of medicine it is critical to have not only predictions, but also interpretability. The models should be transparent, and users need to understand, quantify and manage risk. The users also need to be able to check that the models do not learn biases. Read more about Mihaela’s interpretable INVASE method here. Further to this work, the team developed an approach to demystify black-box models with symbolic metamodels.
Issuing dynamic forecasts
Mihaela and her team recently developed the first clinically actionable model for patient-level trajectories. The attentive state space model combines probabilistic hidden Markov models with recurrent neural networks to model trajectories. Combining these two powerful methods gives predictions that are interpretable.
Estimating individualised treatment effects
Moving from making predictions for a general population to making predictions for an individual is a hard, causal inference problem. Mihaela’s lab have published a number of models on this topic, including using Generative Adversarial Nets and Counterfactual RNNs.
You can view the talk and the Q&A session here.
Devi Parikh, Georgia Tech and Facebook AI Research
Devi’s talk focussed on AI systems that combine computer vision and language processing. Examples of such vision and language systems include: taking an image or video and describing it in a sentence, taking a short phrase and returning a relevant image, and enabling a user to interact and ask questions about what is going on in an image. This final example is known as visual question answering (VQA) and formed the basis of Devi’s presentation.
Vision and language systems have a number of potential applications. Importantly, they will be useful for visually impaired people, through image description. These systems will also aid navigation of unstructured visual data and enable the use of multi-modal data. They could also be used to detect harmful content or as part of an augmented virtual reality assistant.
Given an input image and a free-form question, the typical model architecture for a VQA system consists of the following: a mechanism to encode the question (usually a deep neural network) and a mechanism to extract visual features from the image (again, typically a deep neural network). These are fed into an attention mechanism which determines which parts of the image are more relevant for that particular question. There follows a fusion step where information from the question and image is combined. Finally, there is often a classifier layer that returns an answer. There has been a lot of research on each of these individual components. Devi’s team has been organising a competition on VQA accuracy over the past few years. There has been tremendous progress, with accuracy increasing from around 55% in 2015 to 75% in 2019.
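To make the pipeline described above more concrete, here is a minimal sketch in plain Python. It is not Devi’s model or any published VQA system: the two encoders are replaced by hand-picked toy vectors, the attention is a simple softmax over dot products, and element-wise multiplication is just one possible fusion choice. All function names, vectors and weights are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(question_vec, region_feats):
    """Attention step: weight each image region by its relevance to the question."""
    weights = softmax([dot(question_vec, r) for r in region_feats])
    dim = len(region_feats[0])
    attended = [sum(w * r[i] for w, r in zip(weights, region_feats))
                for i in range(dim)]
    return weights, attended

def fuse(question_vec, image_vec):
    """Fusion step: element-wise product (an assumed, commonly used choice)."""
    return [q * v for q, v in zip(question_vec, image_vec)]

def classify(fused, answer_weights):
    """Classifier step: a linear score per candidate answer, highest wins."""
    scores = {ans: dot(fused, w) for ans, w in answer_weights.items()}
    return max(scores, key=scores.get)

# Toy stand-ins for the outputs of the question and image encoders (assumptions).
question_vec = [1.0, 0.0]                  # pretend encoding of the question
region_feats = [[0.0, 1.0], [2.0, 0.0]]    # region 0: background, region 1: object
answer_weights = {"brown": [1.0, 0.0], "blue": [0.0, 1.0]}

weights, image_vec = attend(question_vec, region_feats)
answer = classify(fuse(question_vec, image_vec), answer_weights)
print(weights, answer)
```

In a real system each of these steps would be a learned neural network component; the sketch only shows how the pieces connect.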
As a demonstration of the kind of things these models can do, Devi directed listeners to the CloudCV site, specifically the section for VQA. Users can choose from a database of photographs, or upload their own, and ask questions about that image, such as: “How many people are in the image?”, “What colour are the walls?”, “What is the dog doing?”.
Devi moved on to talk about some of the challenges in the vision and language space. One of these concerns image captioning models; these tend to have fairly strong language priors and often the model will not ground the caption sufficiently on the image. For example, if the model has been trained on many images where a dog is on a couch with a toy it may miscaption an image where the dog is sat at a table with a piece of cake as “dog sat on couch with a toy”. To counter this problem, Devi and colleagues proposed the “neural baby talk” method which can produce natural language explicitly grounded in entities that object detectors find in the image. Their approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate).
Developed by Devi and colleagues, ViLBERT is a model for visual grounding which can be used for a number of vision and language tasks. The team have been further refining the model and it can now be used for 12 tasks. They have reached state-of-the-art on seven of these tasks. You can try out the model here.
You can view the talk and the Q&A session here.
Yann LeCun, Facebook AI Research and New York University, and Yoshua Bengio, Montreal Institute for Learning Algorithms
This keynote slot was shared by two Turing Award winners, who each gave an overview of the research problems they are concerned with at present.
One question that Yann has been asking himself for many years is: how do humans and animals learn so quickly? The learning requires very little supervision and it is rarely reinforced. Babies and animals learn mostly through observation and they accumulate enormous amounts of background knowledge about the world, such as intuitive physics (gravity being an example). Being able to recreate this kind of learning in machines would be incredibly powerful. In his opinion the next revolution in AI will be neither supervised nor reinforced.
Yann outlined what he views as the three main challenges in deep learning today: 1) learning with fewer labelled samples and fewer trials, 2) learning to reason, 3) learning to plan complex action sequences.
The answer to solving the first problem could lie with “self-supervised learning”. In this method one is trying to predict a subset of information using the rest. The idea is to generate labels from existing information and use that to learn the representations for the problem in hand.
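As a rough illustration of generating labels from existing information, the sketch below hides one word of a sentence at a time and uses the hidden word as the label. This is an assumed toy setup, not a method presented in the talk; the function name and the "[MASK]" placeholder are hypothetical.

```python
def make_masked_examples(tokens, mask_token="[MASK]"):
    """Build (input, label) pairs from unlabelled text.

    Each example hides one token; the hidden token becomes the label,
    so no human annotation is needed.
    """
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((masked, target))
    return examples

sentence = "the cat sat on the mat".split()
examples = make_masked_examples(sentence)
for masked, target in examples:
    print(" ".join(masked), "->", target)
```

A model trained to fill in the hidden word has to learn something about how words relate to one another, and that learned representation is what can then be reused for the problem in hand.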
Such self-supervised learning has been widely used in language modelling, where the default task is to predict the next word given the previous word sequence. Things become a little more tricky when applying this method to images and videos. It is much more difficult to represent uncertainty and prediction in images and videos than it is in text because the systems are not discrete. If you ask a neural network to predict the next frames in a video, it will produce a blurry image. That is because it cannot predict exactly what will happen, so it shows an average over many possible outcomes. There are a few possible options to solve the problem and Yann described latent variable energy-based models. These behave much like probabilistic models, and their latent variables allow the system to make multiple predictions. As an example, he presented theoretical research on using a latent variable model to train neural networks for autonomous driving.
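The blurriness can be reproduced with a toy calculation. The sketch below is not an energy-based model; it only assumes two equally likely one-dimensional "next frames" and shows that the single prediction with the lowest expected squared error is their pixel-wise average, which matches neither outcome. The hypothetical predict_with_latent function then shows how an extra latent variable z could let a predictor return each sharp outcome instead.

```python
def mse(pred, target):
    """Mean squared error between two frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Two equally likely futures (assumed toy data): the object moves left or right.
future_left = [1.0, 0.0, 0.0]
future_right = [0.0, 0.0, 1.0]
futures = [future_left, future_right]

def expected_mse(pred):
    """Average error of one fixed prediction over all possible futures."""
    return sum(mse(pred, f) for f in futures) / len(futures)

# The pixel-wise mean is a "blurry" frame: half an object in two places.
blurry = [sum(px) / len(futures) for px in zip(*futures)]

def predict_with_latent(z):
    """Hypothetical latent-variable predictor: z selects which future to render."""
    return futures[z]

print("blurry frame:", blurry)
print("error of blurry frame:", expected_mse(blurry))
print("error of committing to one sharp frame:", expected_mse(future_left))
print("sharp predictions:", [predict_with_latent(z) for z in (0, 1)])
```

Because the averaged frame has a lower expected error than either sharp frame, a network trained on squared error alone is pushed towards the blurry answer.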
Yoshua’s talk focussed on consciousness and deep learning. Over the past couple of decades neuroscience researchers have made much progress in the study of consciousness. Yoshua believes it’s time for machine learning to consider these advances and incorporate them in machine learning models.
He began by describing the two ways in which we can think: “system 1” and “system 2”, as proposed by Daniel Kahneman. System 1 thinking is intuitive, fast, unconscious and habitual. Current deep learning is very good at these things. System 2 processing allows us to do things that require consciousness, things that take more time to compute, things such as planning and reasoning. This kind of thinking allows humans to deal with very novel situations that are very different from what we’ve been trained on.
In his talk, Yoshua considered the priors that could be incorporated into deep learning to enable system 2 processing. These are:
1) Sparse factor graph in space of high-level semantic variables – for learning representations of high-level concepts of the kind we manipulate with language.
2) Semantic variables are causal
3) Simple mapping between high-level semantic variables, such as words and sentences
4) Shared rules across instance tuples, requiring variables and indirection
5) Distributional changes due to localised causal interventions – the innate ability that humans (and many animals) have enabling them to deal with new scenarios, containing many agents, without specific training.
6) Meaning is stable and robust with respect to changes in distribution
7) Credit assignment is only over short causal chains
Yoshua concluded by saying that knowledge can be decomposed into recombinable pieces corresponding to dependencies involving very few variables at a time. The way that knowledge changes over time is local, involving interventions that rely on only a few variables. This allows agents to quickly learn and react.
You can view the talk and the Q&A session here.
Michael I Jordan, University of California, Berkeley
To begin his talk, Michael summarised progress in the field of machine learning to date, noting that we have now reached the point where research on decision-making systems will play a major role. Many advances are needed with regard to large-scale networks and flows if we are to further integrate AI systems into our lives.
Michael has been researching areas at the interface between machine learning and economics. Topics that fit into this space include: multi-way markets in which individual agents need to explore to learn their preferences, large-scale multi-way markets in which agents view other sides of the market via recommendation systems, inferential methods for mitigating information asymmetries, and latent variable inference in game theory. He presented a few projects that he and his research teams have been working on.
The multi-armed bandit problem is Michael’s favourite learning problem. It involves a situation where there is a decision maker who is trying to decide between a number of options and doesn’t know which is the best, and has to explore to find the best option. This work, carried out with PhD students Lydia Liu and Horia Mania, proposes a statistical learning model in which one side of the market does not have a priori knowledge about its preferences for the other side and is required to learn these from stochastic rewards. Their model extends the standard multi-armed bandits framework to multiple players, with the added feature that arms have preferences over players.
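For readers unfamiliar with the standard single-player setting that this model extends, the following is a minimal sketch of the classic UCB1 strategy. It does not implement the multi-player matching-market model described above, and the reward probabilities, number of rounds and seed are arbitrary assumptions chosen for illustration.

```python
import math
import random

def ucb1(reward_probs, n_rounds, seed=0):
    """Play a Bernoulli bandit with the UCB1 rule and return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    counts = [0] * n_arms    # how often each arm has been pulled
    totals = [0.0] * n_arms  # summed reward per arm
    for t in range(1, n_rounds + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise the estimates
        else:
            # Choose the arm with the highest optimistic estimate:
            # empirical mean plus an upper-confidence bonus.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts

counts = ucb1([0.2, 0.5, 0.8], n_rounds=2000)
print(counts)
```

The bonus term shrinks as an arm is pulled more often, so the decision maker explores uncertain options early on and gradually concentrates on the option that appears best.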
A further piece of work investigates what happens when there isn’t just one single action (as in the bandit example above) but a sequence of actions. Michael and colleagues have used Q-learning with UCBs (upper confidence bounds). You can read more about their work here.
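The general idea of combining Q-learning with upper confidence bounds can be sketched as follows. This is a simplified illustration, not the algorithm from the work mentioned above: the tiny deterministic chain environment, the learning-rate schedule, the form of the bonus and the constant c are all assumptions made for the example.

```python
import math

def step(state, action, n_states):
    """Toy deterministic chain (assumed): action 1 moves right, action 0 stays.
    Acting while in the last state yields reward 1, otherwise 0."""
    reward = 1.0 if state == n_states - 1 else 0.0
    next_state = min(state + 1, n_states - 1) if action == 1 else state
    return next_state, reward

def q_learning_ucb(n_episodes=500, horizon=3, n_states=3, n_actions=2, c=0.5):
    H = horizon
    # Optimistic initialisation: every value starts at the maximum possible return.
    Q = [[[float(H)] * n_actions for _ in range(n_states)] for _ in range(H)]
    N = [[[0] * n_actions for _ in range(n_states)] for _ in range(H)]
    returns = []
    for _ in range(n_episodes):
        state, total = 0, 0.0
        for h in range(H):
            # Act greedily with respect to the optimistic value estimates.
            action = max(range(n_actions), key=lambda a: Q[h][state][a])
            next_state, reward = step(state, action, n_states)
            N[h][state][action] += 1
            t = N[h][state][action]
            alpha = (H + 1) / (H + t)  # assumed learning-rate schedule
            # Exploration bonus that shrinks as the pair is visited more often.
            bonus = c * math.sqrt(H ** 3 * math.log(n_episodes * H + 1) / t)
            if h == H - 1:
                next_value = 0.0  # no value beyond the final step
            else:
                next_value = min(float(H), max(Q[h + 1][next_state]))
            target = reward + next_value + bonus
            updated = (1 - alpha) * Q[h][state][action] + alpha * target
            Q[h][state][action] = min(float(H), updated)
            state, total = next_state, total + reward
        returns.append(total)
    return Q, N, returns

Q, N, returns = q_learning_ucb()
print("mean return over the last 50 episodes:", sum(returns[-50:]) / 50)
```

As with the bandit case, the bonus makes rarely tried state-action pairs look attractive, so exploration is driven by uncertainty rather than by random action choices.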
Another strand of research looks at situations where one is not looking at one decision at a time, or a sequence of decisions, but at a whole host of decisions happening at the same time. In addition, there could be a whole network of agents where the overall group is making a large number of decisions. The aim is for the overall fraction of good decisions to be high.
A significant project that Michael has been working on is Ray – a distributed platform for emerging decision-focussed AI applications. Ray is an open-source distributed system that can work on a laptop, on a powerful multi-core machine, or on any cloud provider. Users can access scalable machine learning libraries out-of-the-box for hyperparameter searches, reinforcement learning, training, serving, and more.
You can view the talk and the Q&A session here.
Read our summary of the first four keynote talks here.