about

resources

events

contribute

republishing

☰

ΑΙhub.org

Crowdsourced clustering via active querying

by Yi Chen, Ramya Korlakai Vinayak and Babak Hassibi

18 January 2024

share this:

Above: Illustration of our algorithm Active Crowdclustering. A pair is queried to different crowdworkers until the upper or the lower bound is below or above 1/2, respectively. The confidence bound is based on the law of the iterated logarithm. Below: Illustration of how the confidence interval changes as we obtain more queries. Eventually, the upper or the lower bound will cross 1/2.

The success of supervised learning relies on a good-quality labeled dataset. A prime example is ImageNet [1], which played an indispensable role in advancing object recognition; it contains millions of labeled images. Moreover, the internet is a huge repository of images and textual content that could potentially be leveraged for supervised learning. However, the sheer volume and diversity of this data present significant challenges, as it is impractical for researchers themselves to label these data manually.

Consider building a vision system that can identify various dog breeds. Let us say we have obtained $n$ images of dogs of various breeds. To label the breed of each dog, we could hire an expert. However, experts are rare, costly, and challenging to train. An alternative way is to hire crowdworkers. However, due to their lack of expertise, asking them to directly label the breed is not ideal. Instead, we could ask them to compare the similarity between two images of dogs. Such a comparison task is much easier since it does not require expertise:

Do dog image $i$ and image $j$ belong to the same breed?

By treating these dog images as nodes and using the comparison query’s answer to indicate whether there is an edge between two images, we can construct a graph. We could apply graph clustering algorithms on the graph to obtain the clustering and label at cluster level.

Naturally, one would ask which pairs of images should we query to the crowds. A type of approach involves querying a random subset of all possible pairs before running a graph clustering algorithm. This type of method is known as the passive method. The drawback of these approaches is that they can only recover clusters of size $\mathcal{\Omega}(\sqrt{ n })$ , which is related to the hidden clique conjecture.

Figure 1: A sample of our pairwise queries interface displayed to the crowdworkers on Amazon Mechanical Turk.

To address this issue, many methods [2][3] have tried to actively select which images to be queried. However, they typically come with severe limitations, such as they need to know the number of clusters a priori, or they assume that crowdworkers’ error is parametrized by a single scaler. These limitations make these methods impractical.

Therefore, we propose Active Crowdclustering, which does not rely on any unknown parameters and can recover clusters regardless of their sizes. Moreover, it is computationally efficient and simple to implement. Lastly, we provide a theoretical guarantee, under mild assumptions that crowdworkers are independent and better than random guessers, that we can recover exactly the clusters with high probability.

Our algorithm works as follows:

A randomly chosen item forms the first singleton cluster.
Query a non-clustered item $i$ with each representative item $j$ of the existing clusters.
To decide if $i$ belongs to the cluster that contains $j$ , we repeatedly query different crowdworkers until the membership of $i$ can be established with confidence, which is based on the law of the iterated logarithm (LIL) [4, 5, 6].
Item $i$ forms a new cluster itself if it is determined not to belong to any of the existing clusters.

It’s important to reiterate that our algorithm does not require prior knowledge of the problem parameters, as we do not predetermine the number of times to repeat the query. Instead, we use LIL-based time varying confidence intervals and decide when to stop on the fly.

We conducted real world experiments on Amazon Mechanical Turk using the Dogs3 (3 large clusters $\mathcal{O}(n)$ ) and Birds20 (20 small clusters $\mathcal{O}(\sqrt{ n })$ ) dataset. Figures 2 and 3 illustrate the clustering outcome. Experiments on Birds20 show that our Active Crowdclustering succeeds regardless of cluster sizes, in stark contrast to the notable failures of passive methods. Moreover, although passive methods give good clustering outcomes on Dogs3 (large clusters) our method can pick up more granular differences within each ground-truth cluster.

Figure 2: Clusters obtained by our method on the Dogs3 dataset. More granular clusters are obtained, e.g., differently groomed Poodles formed their own cluster, images of Norfolk Terriers that look different from the large subset of Norfolk Terriers form their own clusters.

Figure 3: Clusters obtained by our method on the Birds20 dataset. In this small cluster regime, our active method provides much better clustering outcomes than passive clustering.

In summary, we propose a practical active clustering algorithm which

does not require the knowledge of problem parameters;
can recover clusters regardless of their sizes;
can pick up more granular differences;
has theoretical guarantee under mild assumptions.

For more details, please read our full paper, Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees, from HCOMP 2023.

References

[1] Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution), ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[2] Yun, Se-Young and Proutiere, Alexandre, Community detection via random and adaptive sampling. Conference on learning theory, pp. 138–175,(2014), PMLR.
[3] Mazumdar, Arya and Saha, Barna, Clustering with noisy queries. Advances in Neural Information Processing Systems 30 (2017), NeurIPS.
[4] Khintchine, Aleksandr, Über einen Satz der Wahrscheinlichkeitsrechnung. Fundamenta Mathematicae 6.1 (1924): 9-20.
[5] Kolmogoroff, A., Über das Gesetz des iterierten Logarithmus. Mathematische Annalen 101 (1929): 126-135.
[6] Jamieson, Kevin, et al., lil’ucb: An optimal exploration algorithm for multi-armed bandits. Conference on Learning Theory (2014), PMLR.

Funding Acknowledgement

This work was partially supported by NSF grants NCS-FO 2219903 and NSF CAREER Award CCF 2238876.

tags: AAAI, HCOMP2023

Yi Chen is a Ph.D. student at the University of Wisconsin-Madison.

Ramya Korlakai Vinayak is an assistant professor at the University of Wisconsin-Madison

Babak Hassibi is the Mose and Lilian S. Bohn Professor of Electrical Engineering at the California Institute of Technology

AIhub is supported by:

Introducing the NASA Onboard Artificial Intelligence Research (OnAIR) platform: an interview with Evana Gizzi

Lucy Smith 03 Jul 2025

Find out about the OnAIR platform, some of the particular challenges of deploying AI-based solutions in space, and how the tool has been used so far.

An interview with Nicolai Ommer: the RoboCupSoccer Small Size League

Lucy Smith 01 Jul 2025

We caught up with Nicolai to find out more about the Small Size League, how the auto referees work, and how teams use AI.

Forthcoming machine learning and AI seminars: July 2025 edition

Lucy Smith 30 Jun 2025

A list of free-to-attend AI-related seminars that are scheduled to take place between 1 July and 31 August 2025.

monthly digest

What is vibe coding? A computer scientist explains what it means to have AI write computer code − and what risks that can entail

The Conversation 19 Jun 2025

Until recently, most computer code was written, at least originally, by human beings. But with the advent of GenAI, that has begun to change.

Crowdsourced clustering via active querying

References

Funding Acknowledgement

Related posts :

Introducing the NASA Onboard Artificial Intelligence Research (OnAIR) platform: an interview with Evana Gizzi

An interview with Nicolai Ommer: the RoboCupSoccer Small Size League

Forthcoming machine learning and AI seminars: July 2025 edition

AIhub monthly digest: June 2025 – gearing up for RoboCup 2025, privacy-preserving models, and mitigating biases in LLMs

RoboCupRescue: an interview with Adam Jacoff

Making optimal decisions without having all the cards in hand

Exploring counterfactuals in continuous-action reinforcement learning

What is vibe coding? A computer scientist explains what it means to have AI write computer code − and what risks that can entail

↑