
RLPrompt: Optimizing discrete text prompts with reinforcement learning

07 March 2023




Figure 1: Overview of RLPrompt for discrete prompt optimization. All language models (LMs) are frozen. We build our policy network by training a task-specific multi-layer perceptron (MLP) network inserted into a frozen pre-trained LM. The figure above illustrates 1) generation of a prompt (left), 2) example usages in a masked LM for classification (top right) and a left-to-right LM for generation (bottom right), and 3) update of the MLP using RL reward signals (red arrows).

By Mingkai Deng

TL;DR: Prompting enables large language models (LLMs) to perform various NLP tasks without changing the model. Discrete prompts have many desirable properties, but are difficult to optimize. We propose an efficient approach using reinforcement learning, which shows superior performance and facilitates rich interpretations and analyses. You can easily adapt it for your own tasks using our library here.

Prompting has emerged as a promising approach to solving a wide range of NLP problems using large pre-trained language models (LMs), including left-to-right models such as GPTs and masked LMs such as BERT, RoBERTa, etc.

Compared to conventional fine-tuning that expensively updates the massive LM parameters for each downstream task, prompting concatenates the inputs with an additional piece of text that steers the LM to produce the desired outputs. A key question with prompting is how to find the optimal prompts to improve the LM’s performance on various tasks, often with only a few training examples.
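For a concrete picture of what this looks like in code, here is a minimal sketch of prompt-based sentiment classification with a frozen masked LM using Hugging Face transformers. The prompt text and the label words ("great" / "terrible") are illustrative assumptions, not prompts or verbalizers from the paper.

```python
# Minimal sketch of prompt-based classification with a frozen masked LM.
# The prompt and label words are illustrative, not taken from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large").eval()  # the LM stays frozen

# Assumed single-token label words ("verbalizers") for each class.
LABEL_WORDS = {"positive": " great", "negative": " terrible"}
label_ids = {lab: tokenizer(w, add_special_tokens=False)["input_ids"][0]
             for lab, w in LABEL_WORDS.items()}

def classify(text: str, prompt: str) -> str:
    # Concatenate the input with the prompt and a mask token the LM must fill in.
    template = f"{text} {prompt} {tokenizer.mask_token}."
    inputs = tokenizer(template, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Compare the mask-position logits of the label words.
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    return max(label_ids, key=lambda lab: logits[0, mask_pos, label_ids[lab]].item())

print(classify("The food was amazing.", "Overall it was"))  # expected: "positive"
```

Finding a prompt that makes this kind of template work well is exactly the optimization problem discussed next.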

Most existing work resorts to tuning soft prompts (e.g., continuous embeddings), which fall short in terms of interpretability, reusability across LMs, and applicability when gradients are not accessible. Discrete prompts, on the other hand, are difficult to optimize and are often created by “enumeration (e.g., paraphrasing)-then-selection” heuristics that do not explore the prompt space systematically.

In our EMNLP 2022 paper, we instead propose RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt is flexibly applicable to different types of LMs (e.g., BERT and GPTs) for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing fine-tuning or prompting methods.

Interestingly, the resulting optimized prompts are often ungrammatical gibberish text; and, surprisingly, those gibberish prompts are transferable between different LMs while retaining significant performance, indicating that LMs may have grasped shared structures for prompting but do not follow human language patterns.

Discrete Prompt Optimization with RL

This paper presents RLPrompt, a new discrete prompt optimization approach based on reinforcement learning (RL). This approach brings together a wide range of desirable properties for efficient use on diverse tasks and LMs (see the table below). 

RLPrompt unites the desirable properties of a wide range of previous prompt optimization approaches

Crucially, rather than directly editing the discrete tokens, which has been difficult and inefficient, RLPrompt trains a policy network that generates the desired prompts. Discrete prompt optimization thus amounts to learning a small number of policy parameters which we set as an MLP layer inserted into a frozen compact model such as distilGPT-2. We describe the specific formulations in Section §2.1-2.3 of our paper.
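As a rough sketch of this parameterization (the layer sizes and the residual connection below are assumptions, not the paper's exact architecture), the only trainable parameters are a small MLP sitting between the frozen distilGPT-2 backbone and its frozen output head:

```python
# Sketch of a prompt-generation policy: a trainable MLP inserted into a frozen
# distilGPT-2 (layer sizes and residual wiring are illustrative assumptions).
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class PromptPolicy(nn.Module):
    def __init__(self, model_name: str = "distilgpt2", hidden: int = 2048):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(model_name)
        for p in self.lm.parameters():
            p.requires_grad = False          # backbone and output head stay frozen
        d = self.lm.config.n_embd            # distilGPT-2 hidden size (768)
        self.mlp = nn.Sequential(            # the only trainable parameters
            nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d)
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # The frozen transformer produces hidden states; the MLP adjusts the last
        # one before the frozen LM head scores candidate next prompt tokens.
        h = self.lm.transformer(input_ids).last_hidden_state   # (B, T, d)
        h = h + self.mlp(h)                                     # residual insertion (assumed)
        return self.lm.lm_head(h[:, -1, :])                     # next-token logits (B, vocab)

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
policy = PromptPolicy()
ids = tokenizer("Prompt so far:", return_tensors="pt")["input_ids"]
next_token = torch.distributions.Categorical(logits=policy(ids)).sample()
```

Because only the MLP receives gradient updates, training touches a few million parameters rather than the billions in the task LM.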

This formulation also allows us to employ off-the-shelf RL algorithms (e.g., soft Q-learning) that learn the policy with arbitrary reward functions—defined either with available data (e.g., in few-shot classification) or other weak signals when no supervised data is accessible (e.g., in controllable text generation).
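For intuition about what "off-the-shelf" means here, the sketch below shows the generic soft Q-learning (entropy-regularized) Bellman target that such algorithms regress a Q-network towards; the discount and temperature values are placeholders, and this is not necessarily the exact variant used in the paper.

```python
# Generic soft Q-learning target for one step of prompt generation (a sketch,
# not necessarily the exact formulation used in the paper).
import torch

def soft_q_target(reward: torch.Tensor,   # (B,) reward at this step (often nonzero only at the last token)
                  q_next: torch.Tensor,   # (B, V) Q-values over the vocabulary at the next step
                  done: torch.Tensor,     # (B,) 1.0 if the prompt is complete, else 0.0
                  gamma: float = 1.0, alpha: float = 1.0) -> torch.Tensor:
    # Soft state value: V(s') = alpha * logsumexp(Q(s', .) / alpha).
    v_next = alpha * torch.logsumexp(q_next / alpha, dim=-1)
    return reward + gamma * (1.0 - done) * v_next   # no bootstrapping past the final token
```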

Reward Stabilization 

On the other hand, RL for prompt optimization poses new challenges to learning efficiency: the large black-box LM presents a highly complex environment that, given the prompt (i.e., actions), goes through a long series of complex transitions (e.g., reading the input and inferring the output) before computing the rewards. This makes the reward signals extremely unstable and hard to learn from. 

To overcome this difficulty, we propose two simple yet surprisingly effective ways to stabilize the rewards and improve the optimization efficiency.

  1. Normalizing the training signal by computing the z-score of rewards for the same input.
  2. Designing piecewise reward functions that provide a sparse, qualitative bonus for desirable behaviors (e.g., reaching a certain accuracy on a given class).

We describe more details in Section §2.4 of our paper.
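The sketch below illustrates both ideas under assumed shapes and thresholds: raw rewards of several candidate prompts for the same input are z-scored, and a sparse bonus (the value 2.0 is an arbitrary placeholder) is added when the desirable behavior is achieved.

```python
# Sketch of the two reward-stabilization tricks (shapes, threshold, and bonus
# values are illustrative placeholders, not the paper's exact design).
import torch

def z_score(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts,) raw rewards of candidate prompts for the SAME input,
    # so every input contributes a comparable training signal.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def piecewise_bonus(class_accuracy: float, threshold: float = 1.0, bonus: float = 2.0) -> float:
    # Sparse, qualitative bonus for desirable behavior, e.g. hitting a target
    # accuracy on a given class, added on top of the dense reward.
    return bonus if class_accuracy >= threshold else 0.0

rewards = torch.tensor([0.10, 0.12, 0.50, 0.11])  # made-up rewards for 4 candidate prompts
print(z_score(rewards))                            # the clearly better prompt stands out
```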

Experiments

We evaluate our approach on both classification (in the few-shot setting) and generation (unsupervised text style transfer), and perform rich analyses for new insights on LM prompting. We describe implementation details such as reward function design in Section §3 of our paper, and publish the code at our GitHub codebase.

Few-Shot Text Classification

For few-shot classification, we follow previous work and experiment on popular sentiment and topic classification tasks, using 16 examples per class for both training and validation. Results using RoBERTa-large (left table below) show that our approach improves over a wide range of fine-tuning and prompting methods, and is as efficient to optimize as comparable methods that tune soft prompts (right figure below). We report detailed dataset-level results in Section §3.1 of our paper.

Table 1: Average accuracy for few-shot text classification across all tested datasets. All methods use RoBERTa-large for fine-tuning or prompting.
Figure 2: Comparison of our method (orange) and BlackBox (BB) Tuning (blue) in terms of training efficiency. The solid curves are the mean and the shaded regions are the max. and min. test accuracies over 5 trials.

Unsupervised Text Style Transfer

For text style transfer, we evaluate on the popular Yelp sentiment transfer dataset using standard automatic metrics for content preservation, style accuracy, and fluency, and report their sentence-level joint product J(\cdot) below. Our full paper also includes few-shot experiments on the Shakespeare dataset and human evaluations.
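Read literally, the joint score multiplies the three sentence-level components before averaging; here is a minimal sketch under that reading (the actual component scorers, e.g. the style classifier and language model used, are specified in the paper).

```python
# Sketch of the sentence-level joint metric J(.): the product of content
# preservation, style accuracy, and fluency, averaged over output sentences.
# Component scores here are placeholder numbers, not real metric outputs.
def joint_score(content, style, fluency):
    per_sentence = [c * s * f for c, s, f in zip(content, style, fluency)]
    return sum(per_sentence) / len(per_sentence)

print(joint_score([0.9, 0.8], [1.0, 0.0], [0.95, 0.9]))  # 0.4275
```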

Results using GPT-2 (left table below) show that our method outperforms or competes with various fine-tuning and prompting baselines, including DiRR, which expensively fine-tunes all parameters of a GPT-2 model. An ablation study (right figure below) shows that our proposed reward normalization technique is crucial to optimization success. We describe the full evaluation results in Section §3.2 of our paper.

Table 2: Automatic evaluation of our method vs. baselines on the Yelp sentiment transfer dataset. J(\cdot) is our main metric which measures the average joint sentence-level scores of content preservation, style accuracy, and fluency. Numbers in (parentheses) are standard deviations across 3 sets of prompts.
Figure 3: Comparison of our method with (orange) and without (purple) z-score reward normalization. The format is the same as Figure 2.

Analysis

Optimal Prompts Don’t Follow Human Language

The resulting discrete prompts also facilitate rich interpretations and analyses for new insights into LM prompting. In particular, the optimized prompts, though inducing strong task performance, tend to be gibberish text without clear human-understandable meaning (e.g., table below), echoing recent research (e.g., Webson and Pavlick (2021), Zhao et al. (2021), and Prasad et al. (2022)) showing that LMs making use of prompts do not necessarily follow human language patterns.

Table 3: Comparison of our method (RLPrompt) with manually-written (Manual) prompts for text style transfer performance on Yelp. For the manual prompts, we take one from this paper and write two more for this experiment. J(\cdot) is the main metric introduced in Table 2. All outputs are generated using GPT-2-xl and metrics are averaged over 5 runs.

Learned Prompts Transfer Trivially Across LMs

Perhaps surprisingly, those gibberish prompts learned with one LM can be used with other LMs and still retain significant performance, indicating that those different pre-trained LMs have grasped shared structures for prompting (e.g., figures below).

Figure 4: Heatmap of sentiment analysis performance with transferred discrete prompts of 2 tokens. The columns represent the models used to learn the prompts, and the rows represent the models we perform classification with. Brighter color represents higher accuracy.
Figure 5: Heatmap of text style transfer performance with transferred discrete prompts. The columns represent the models used to learn the prompts, and the rows represent the models we perform text generation with. Manual and Random refer to manual prompts and random tokens, respectively. Brighter color represents better joint score J(\cdot).

Conclusion

We have presented RLPrompt, an efficient and flexible approach for discrete prompt optimization using RL, which improves over a wide range of fine-tuning and prompting methods in experiments on few-shot classification and unsupervised text style transfer.

Analysis reveals that strong optimized prompts are incoherent, yet transfer between LMs while retaining remarkable performance. This observation opens up many promising possibilities for prompting, such as learning prompts cheaply with smaller models and performing inference with larger models. We are excited to explore further.


This article was initially published on the ML@CMU blog and appears here with the authors’ permission.


