
RLHF Introduction

What is RLHF?

Reinforcement Learning from Human Feedback, popularly known as RLHF, is a machine learning technique used to align the outputs of Large Language Models (LLMs) with human intent. It uses Reinforcement Learning to train a model to maximize a reward function built from human preferences. It is generally used to make models more helpful and less harmful, attributes that are hard to specify directly. For example, it is very difficult to pin down exactly what “offensiveness” is; however, by asking humans which of two outputs is more offensive, we can train a reward model, which the LLM can then learn to maximize.

What is RL?

Reinforcement Learning (RL) is a framework for teaching agents to make decisions, similar to how humans learn through trial and error. In RL, an agent interacts with its environment, receiving information about the state of the environment, the set of possible actions it can take, and a form of feedback or reward based on its actions. The agent's goal is to develop a strategy, known as a policy, that enables it to maximize its cumulative reward over time.

Through repeated interactions, the agent adjusts its behavior, learning from both successes and failures to improve its performance. This iterative process mirrors how humans refine their decision-making skills—by experimenting, learning from mistakes, and optimizing future actions based on past outcomes. One of the key strengths of RL is its ability to tackle complex, dynamic environments where predefined solutions are difficult or impossible to craft.

However, this learning process can be computationally intensive, especially in environments with large state or action spaces, where the agent needs to explore many possibilities before converging on an optimal or near-optimal policy. Despite these challenges, RL has proven highly effective in domains ranging from robotics and gaming to autonomous systems and personalized recommendations.  
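
To make the interaction loop concrete, below is a minimal sketch in Python using the Gymnasium API; the environment name and the random placeholder policy are illustrative assumptions, not part of any particular RLHF pipeline.

```python
import gymnasium as gym

# Minimal agent-environment loop: the agent observes a state, picks an
# action, and receives a reward plus the next state from the environment.
env = gym.make("CartPole-v1")            # illustrative environment choice
state, _ = env.reset(seed=0)

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()   # placeholder policy: act at random
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode return: {total_reward}")
env.close()
```

A real RL algorithm would replace the random action with a learned policy and update that policy from the observed rewards.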

How does RLHF work?

The first step in training an LLM is pre-training. In this step, models are trained on vast amounts of text, through which they learn the statistical patterns and nuances of language. This step is the most intensive in terms of compute resources.
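
Concretely, pre-training minimizes a next-token prediction (cross-entropy) loss over huge text corpora. Below is a hedged, schematic sketch using the Hugging Face Transformers API; the model name and the one-sentence batch are stand-ins for a real pre-training corpus and model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Schematic causal language modeling step: the model learns to predict
# each token from the tokens that precede it.
tokenizer = AutoTokenizer.from_pretrained("gpt2")       # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["Reinforcement learning is a framework for"], return_tensors="pt")

# Passing labels=input_ids makes the library compute the shifted
# next-token cross-entropy loss.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()   # real pre-training repeats this over billions of tokens
```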

The second step is Supervised Fine-tuning (SFT), a crucial process that enhances the model's ability to produce helpful and user-aligned responses. Since Large Language Models (LLMs) are auto-regressive—meaning they generate the next word or token based on the previous ones—they don't inherently know how to respond in a way that directly answers user queries.

For instance, if you ask an LLM before SFT, "What are the benefits of exercise?", it might generate a continuation like, "are often discussed by experts," instead of providing a helpful answer. Through SFT, the model is trained using specific examples to answer questions directly and follow instructions accurately. This makes the output much more aligned with the user’s needs.
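
Mechanically, SFT reuses the same next-token loss as pre-training, but on curated prompt-response pairs, usually with the prompt tokens masked out so that only the response is learned. The sketch below is a simplified illustration; the base model and the single training pair are assumptions for demonstration.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # illustrative base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "What are the benefits of exercise?\n"
response = "Regular exercise improves cardiovascular health, mood, and sleep."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

# Train only on the response: tokens labeled -100 are ignored by the loss,
# so the prompt itself contributes no gradient.
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss
loss.backward()
```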

The third step is creating the reward model. Human annotators are typically asked to choose between two or more candidate outputs, and these pairwise comparisons are used to train a reward model that scores responses. While it may seem more intuitive to ask humans to rate outputs on a numeric scale, two annotators may not agree on what the numbers on that scale mean, so relative comparisons tend to produce more consistent data. Tools like RewardBench can be used to evaluate reward models.
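
A common way to turn those pairwise choices into a trainable signal is a Bradley-Terry-style ranking loss: the reward model should score the chosen response higher than the rejected one. The sketch below uses a toy linear scorer and random features as stand-ins for a real LLM-based reward model with a scalar value head.

```python
import torch
import torch.nn.functional as F

# Toy reward model: in practice this is an LLM with a scalar value head.
reward_model = torch.nn.Linear(768, 1)   # 768 = assumed hidden size

# Stand-ins for pooled hidden states of the chosen and rejected responses.
chosen_features = torch.randn(4, 768)
rejected_features = torch.randn(4, 768)

r_chosen = reward_model(chosen_features).squeeze(-1)
r_rejected = reward_model(rejected_features).squeeze(-1)

# Bradley-Terry pairwise loss: push the chosen response's score above
# the rejected response's score.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
```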

The fourth and final step is using the reward model for policy optimization, which is the actual RL stage. The most common technique for this is Proximal Policy Optimization (PPO). PPO limits how much the policy can change in a single training iteration, which smooths out training and keeps the model from drifting too far from the capabilities it already has. Alternatives include Direct Preference Optimization (DPO), which skips the explicit reward model and optimizes directly on preference data, and actor-critic methods such as Advantage Actor-Critic (A2C).
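
The sketch below illustrates the two ingredients most PPO-based RLHF implementations combine: a KL penalty that keeps the policy close to the SFT reference model, and PPO's clipped surrogate objective that caps how far a single update can move the policy. All tensors are random stand-ins for quantities that would come from real rollouts.

```python
import torch

# Stand-ins for quantities gathered during one rollout of the policy.
logprobs_new = torch.randn(8, requires_grad=True)  # log pi_theta(a|s), current policy
logprobs_old = torch.randn(8)                      # log-probs recorded at rollout time
logprobs_ref = torch.randn(8)                      # log-probs under the frozen SFT reference
rewards = torch.randn(8)                           # reward-model scores for the rollout

beta = 0.1        # KL penalty weight
clip_eps = 0.2    # PPO clipping range

# KL-shaped reward: penalize drifting away from the reference model.
shaped_rewards = rewards - beta * (logprobs_old - logprobs_ref)

# Crude advantage estimate (real implementations use a value head / GAE).
advantages = shaped_rewards - shaped_rewards.mean()

# PPO clipped surrogate objective: the clamp caps how much a single update
# can change the policy relative to the rollout policy.
ratio = torch.exp(logprobs_new - logprobs_old)
policy_loss = -torch.min(
    ratio * advantages,
    torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
).mean()
policy_loss.backward()
```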

RLHF Over the Years

In recent years, several enhanced variants of RLHF have been proposed to improve the alignment and utility of AI systems. These innovations bring new techniques to the table, refining how AI models learn from human interactions. [1]

DeepMind's Sparrow introduces adversarial probing and rule-based reward modeling. In this approach, goals are defined as natural language rules that the AI agent must follow, enhancing its ability to meet complex objectives.

Anthropic’s team explored a pure RL approach for online training with human feedback, focusing on the tradeoffs between helpfulness and harmlessness in language models. Their work delves into balancing these aspects in real-time, showcasing the complexity of aligning AI behavior with nuanced human expectations.

Another significant contribution is SENSEI, which embeds human value judgments into the language generation process. SENSEI utilizes a dual-component system of a "critic" and an "actor" to model human reward distribution and steer language generation towards maximizing human-aligned rewards. Both components work alongside a shared language model, dynamically adjusting outputs to reflect human preferences.

Baheti et al. (2023) challenge the traditional approaches to data utilization in RLHF, which often treat all data points equally. Instead, they propose assigning weighted importance to data based on relevance, allowing the model to better prioritize useful information. This method improves the model's overall utility by making better use of crowdsourced and internet data.

A recent study, “Aligning Language Models with Preferences through f-divergence Minimization,” introduced f-DPG, which generalizes RLHF to accommodate different divergence measures. Instead of being tied to the reverse KL divergence used implicitly in standard RLHF, this approach allows more flexible optimization against various target distributions.
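
One way to read this line of work: KL-penalized RLHF can be viewed as matching a reward-tilted target distribution under the reverse KL divergence, and f-DPG swaps in other members of the f-divergence family. The LaTeX sketch below uses our own notation and is an interpretation, not the paper's exact formulation.

```latex
% Target distribution implied by a reward model r and reference policy \pi_{\mathrm{ref}}:
p^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\big(r(x, y)/\beta\big)

% Standard KL-penalized RLHF approximately minimizes the reverse KL to this target:
\min_{\theta}\; \mathrm{KL}\!\big(\pi_{\theta} \,\|\, p^{*}\big)

% f-DPG generalizes to an arbitrary f-divergence between target and policy:
\min_{\theta}\; D_{f}\!\big(p^{*},\, \pi_{\theta}\big)
```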

Researchers from UC Berkeley present another theoretical framework that unifies RLHF with max-entropy Inverse Reinforcement Learning (IRL). This allows for a more comprehensive understanding of how AI can learn from human feedback and provides sample complexity bounds to support the method.

Lastly, Inverse Reward Design (IRD)  offers a refinement to standard RLHF. Here, the reward function is informed by expert-designed priors rather than solely relying on labeled data. This integration of expert knowledge with human feedback provides a more nuanced and robust learning mechanism for AI systems.

These advancements highlight the ongoing evolution of RLHF, driving the field toward more human-aligned, efficient, and context-aware AI models. Each variant addresses specific limitations of conventional methods, pushing the boundaries of what AI can achieve when it learns directly from human feedback.

Conclusion

Reinforcement Learning from Human Feedback (RLHF) is a powerful technique used to align machine learning models with human preferences, but it has its limitations. While it can effectively shape model behavior to reflect human values, RLHF is often resource-intensive and costly. Additionally, human preferences are inherently subjective, making it difficult to define an objective "ground truth" that can be universally relied upon.

Another challenge is the potential for bias in human feedback. Since human preferences can vary based on individual experiences, cultural backgrounds, and context, the feedback provided may not always represent a broad consensus. Moreover, collecting high-quality human feedback requires considerable effort, including properly training annotators, which further drives up the cost.

Despite these hurdles, RLHF remains one of the most effective methods for improving model alignment. Ongoing research aims to reduce costs and improve scalability while also exploring ways to mitigate biases and make the feedback process more robust and representative. However, striking the right balance between model performance, cost, and ethical alignment continues to be a complex task in machine learning development.

References

[1] Large Language Model Alignment: A Survey. arXiv:2309.15025. https://arxiv.org/pdf/2309.15025

Reach out to us at hey@soulhq.ai for more information, work samples, etc.