ΑΙhub.org
 

Crowdsourced clustering via active querying


by , and
18 January 2024



share this:

Above: Illustration of our algorithm Active Crowdclustering. A pair is queried to different crowdworkers until the upper or the lower bound is below or above 1/2, respectively. The confidence bound is based on the law of the iterated logarithm. Below: Illustration of how the confidence interval changes as we obtain more queries. Eventually, the upper or the lower bound will cross 1/2.

The success of supervised learning relies on a good-quality labeled dataset. A prime example is ImageNet [1], which played an indispensable role in advancing object recognition; it contains millions of labeled images. Moreover, the internet is a huge repository of images and textual content that could potentially be leveraged for supervised learning. However, the sheer volume and diversity of this data present significant challenges, as it is impractical for researchers themselves to label these data manually.

Consider building a vision system that can identify various dog breeds. Let us say we have obtained n images of dogs of various breeds. To label the breed of each dog, we could hire an expert. However, experts are rare, costly, and challenging to train. An alternative way is to hire crowdworkers. However, due to their lack of expertise, asking them to directly label the breed is not ideal. Instead, we could ask them to compare the similarity between two images of dogs. Such a comparison task is much easier since it does not require expertise:

Do dog image i and image j belong to the same breed?

By treating these dog images as nodes and using the comparison query’s answer to indicate whether there is an edge between two images, we can construct a graph. We could apply graph clustering algorithms on the graph to obtain the clustering and label at cluster level.

Naturally, one would ask which pairs of images should we query to the crowds. A type of approach involves querying a random subset of all possible pairs before running a graph clustering algorithm. This type of method is known as the passive method. The drawback of these approaches is that they can only recover clusters of size \mathcal{\Omega}(\sqrt{ n }), which is related to the hidden clique conjecture.

Figure 1: A sample of our pairwise queries interface displayed to the crowdworkers on Amazon Mechanical Turk.

To address this issue, many methods [2][3] have tried to actively select which images to be queried. However, they typically come with severe limitations, such as they need to know the number of clusters a priori, or they assume that crowdworkers’ error is parametrized by a single scaler. These limitations make these methods impractical.

Therefore, we propose Active Crowdclustering, which does not rely on any unknown parameters and can recover clusters regardless of their sizes. Moreover, it is computationally efficient and simple to implement. Lastly, we provide a theoretical guarantee, under mild assumptions that crowdworkers are independent and better than random guessers, that we can recover exactly the clusters with high probability.

Our algorithm works as follows:

  1. A randomly chosen item forms the first singleton cluster.
  2. Query a non-clustered item i with each representative item j of the existing clusters.
  3. To decide if i belongs to the cluster that contains j, we repeatedly query different crowdworkers until the membership of i can be established with confidence, which is based on the law of the iterated logarithm (LIL) [4, 5, 6].
  4. Item i forms a new cluster itself if it is determined not to belong to any of the existing clusters.

It’s important to reiterate that our algorithm does not require prior knowledge of the problem parameters, as we do not predetermine the number of times to repeat the query. Instead, we use LIL-based time varying confidence intervals and decide when to stop on the fly.

We conducted real world experiments on Amazon Mechanical Turk using the Dogs3 (3 large clusters \mathcal{O}(n)) and Birds20 (20 small clusters \mathcal{O}(\sqrt{ n })) dataset. Figures 2 and 3 illustrate the clustering outcome. Experiments on Birds20 show that our Active Crowdclustering succeeds regardless of cluster sizes, in stark contrast to the notable failures of passive methods. Moreover, although passive methods give good clustering outcomes on Dogs3 (large clusters) our method can pick up more granular differences within each ground-truth cluster.

Figure 2: Clusters obtained by our method on the Dogs3 dataset. More granular clusters are obtained, e.g., differently groomed Poodles formed their own cluster, images of Norfolk Terriers that look different from the large subset of Norfolk Terriers form their own clusters.

Figure 3: Clusters obtained by our method on the Birds20 dataset. In this small cluster regime, our active method provides much better clustering outcomes than passive clustering.

In summary, we propose a practical active clustering algorithm which

  1. does not require the knowledge of problem parameters;
  2. can recover clusters regardless of their sizes;
  3. can pick up more granular differences;
  4. has theoretical guarantee under mild assumptions.

For more details, please read our full paper, Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees, from HCOMP 2023.

References

[1] Olga Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg and Li Fei-Fei (* = equal contribution), ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[2] Yun, Se-Young and Proutiere, Alexandre, Community detection via random and adaptive sampling. Conference on learning theory, pp. 138–175,(2014), PMLR.
[3] Mazumdar, Arya and Saha, Barna, Clustering with noisy queries. Advances in Neural Information Processing Systems 30 (2017), NeurIPS.
[4] Khintchine, Aleksandr, Über einen Satz der Wahrscheinlichkeitsrechnung. Fundamenta Mathematicae 6.1 (1924): 9-20.
[5] Kolmogoroff, A., Über das Gesetz des iterierten Logarithmus. Mathematische Annalen 101 (1929): 126-135.
[6] Jamieson, Kevin, et al., lil’ucb: An optimal exploration algorithm for multi-armed bandits. Conference on Learning Theory (2014), PMLR.

Funding Acknowledgement

This work was partially supported by NSF grants NCS-FO 2219903 and NSF CAREER Award CCF 2238876.



tags: ,


Yi Chen is a Ph.D. student at the University of Wisconsin-Madison.
Yi Chen is a Ph.D. student at the University of Wisconsin-Madison.

Ramya Korlakai Vinayak is an assistant professor at the University of Wisconsin-Madison
Ramya Korlakai Vinayak is an assistant professor at the University of Wisconsin-Madison

Babak Hassibi is the Mose and Lilian S. Bohn Professor of Electrical Engineering at the California Institute of Technology
Babak Hassibi is the Mose and Lilian S. Bohn Professor of Electrical Engineering at the California Institute of Technology

            AUAI is supported by:



Subscribe to AIhub newsletter on substack



Related posts :

Interview with AAAI Fellow Tanya Berger-Wolf: AI for ecology, biodiversity, and conservation

  11 Jun 2026
Find out about Tanya work on a foundation model for biology and the insights that this can provide.

Statistical or embodied? Comparing people and LLMs in their processing of color metaphors: an interview with Douglas Guilbeault

  09 Jun 2026
We learn what implications color metaphors and synaesthesia have for human and AI cognition.

The Good Robot podcast: the battle over data centres with Tara Merk

  08 Jun 2026
Eleanor Drage speaks with Tara Merk about how community-owned data centers could transform digital ownership and challenge the dominance of Big Tech.

Congratulations to the #AAMAS2026 best paper award winners

  05 Jun 2026
Find out who won in the categories of best paper, best student paper, and best blue sky paper.

Interview with AAAI Fellow Sanmay Das: multiagent systems

  04 Jun 2026
We find out more about multi-agent research for the allocation of scarce societal resources.

Design tweaks promote responsible AI use for environmental protection, research shows

  03 Jun 2026
Systems that ask users to pause to consider AI’s energy consumption and environmental impacts are likely to reduce unnecessary AI use

An AI solution to an 80‑year‑old problem has shocked mathematicians

  02 Jun 2026
An OpenAI model has been used to find a counterexample to a famous conjecture made by legendary Hungarian mathematician Paul Erdős.

Forthcoming machine learning and AI seminars: June 2026 edition

  01 Jun 2026
A list of free-to-attend AI-related seminars that are scheduled to take place between 1 June and 31 July 2026.



AUAI is supported by:







Subscribe to AIhub newsletter on substack




 















©2026.05 - Association for the Understanding of Artificial Intelligence