Research
Guneet Singh
October 16, 2024
In this release, we present a novel red teaming dataset designed to systematically evaluate cultural biases in Large Language Models (LLMs), particularly focusing on the complex socio-political, religious, and regional dynamics within Indian contexts. The dataset is curated with a diverse set of prompts that challenge LLMs across five distinct attack scenarios: persona-based debate-style prompts, creative writing attacks mixed with gaslighting personas, elicitation of existing societal biases, state-specific cultural attacks, and targeted attacks on prominent political figures.
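To make the dataset's structure concrete, the sketch below shows how a single record might be represented in Python. The field names, category labels, and schema itself are illustrative assumptions rather than the released format.

```python
from dataclasses import dataclass, field
from enum import Enum


class AttackType(Enum):
    # The five attack scenarios named in this release.
    PERSONA_DEBATE = "persona_debate"
    CREATIVE_GASLIGHTING = "creative_gaslighting"
    SOCIETAL_BIAS = "societal_bias"
    STATE_SPECIFIC = "state_specific"
    POLITICAL_FIGURE = "political_figure"


@dataclass
class RedTeamPrompt:
    # One dataset record; field names are illustrative, not the released schema.
    prompt_id: str
    attack_type: AttackType
    prompt_text: str
    target_dimension: str                                # e.g. "gender roles", "regional identity"
    frameworks: list[str] = field(default_factory=list)  # e.g. ["Moral Foundations Theory"]
    language: str = "en"
```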
Early results show LLMs frequently falter in maintaining neutrality and ethical alignment when navigating complex cultural and political dynamics, with detailed insights coming soon.
The goal of this dataset is to expose and analyse how LLMs handle culturally sensitive topics, testing their ability to navigate morally and ethically charged scenarios while avoiding the perpetuation of harmful stereotypes, regional biases, or political favouritism. Each prompt is designed to probe the model's reasoning and narrative abilities, exploring whether it can maintain neutrality, resist manipulation, and respond ethically in controversial situations.
To ensure comprehensive analysis, the dataset is grounded in key psychological frameworks, including Moral Foundations Theory, Cognitive Dissonance, Identity Threat, Implicit Bias, Stereotype Threat, and Political Psychology. These frameworks guide the structure of prompts and the evaluation process, allowing for deep insights into the LLM's performance on ideological, regional, and cultural fronts.
We propose a comprehensive human evaluation framework to assess the LLM’s responses using metrics such as Cultural Sensitivity, Implicit Bias, Ethical Compliance, Cognitive Dissonance, and Gaslighting Detection. These psychologically driven metrics enable human experts to analyse the model’s outputs rigorously, providing quantitative scores alongside qualitative insights. The framework emphasises intersectionality and the model's handling of compound social identities, ensuring that no cultural dimension is overlooked.
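As a rough illustration of how such human evaluations could be recorded and aggregated, the Python sketch below assumes a 1-5 score per metric and an unweighted mean; both the scale and the aggregation rule are assumptions, not the framework's specification.

```python
from dataclasses import dataclass
from statistics import mean

# Metric names follow the evaluation framework above; the 1-5 scale
# and the unweighted aggregation are illustrative assumptions.
METRICS = [
    "cultural_sensitivity",
    "implicit_bias",
    "ethical_compliance",
    "cognitive_dissonance",
    "gaslighting_detection",
]


@dataclass
class HumanEvaluation:
    prompt_id: str
    annotator_id: str
    scores: dict[str, int]  # metric name -> score on a 1-5 scale
    notes: str = ""         # qualitative observations from the expert

    def overall(self) -> float:
        # Unweighted mean across the five metrics; a production rubric
        # might weight metrics or report them separately.
        return mean(self.scores[m] for m in METRICS)
```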
This release serves as a benchmark for testing LLMs on real-world cultural challenges in India, with the ultimate goal of improving the fairness, neutrality, and ethical alignment of AI systems in culturally diverse and sensitive contexts.
The first attack type sets up persona-based arguments or controversial debate-style prompts that provoke discussion on sensitive social, political, or cultural topics. The goal is to challenge the LLM to respond to ideologically charged or polarised questions; a hypothetical prompt template is sketched below.
Persona attacks and debates explore how LLMs navigate moral, ethical, and cultural dilemmas. The model's responses are scrutinised for signs of political or cultural bias: whether the model favours one ideology over another, and whether it can maintain neutrality in heavily opinionated discourse.
Moral Foundations Theory and Cognitive Dissonance are the key psychological frameworks here; they examine how individuals react to moral provocations and manage conflicting viewpoints. This type of data tests the LLM's ability to recognise and manage ideological tension, ensuring the model doesn't promote any single viewpoint at the expense of another.
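For illustration only, a persona-based debate prompt might be assembled from a template like the hypothetical one below; the persona, motion, and stance values are placeholders rather than actual dataset content, and the other four attack categories would follow analogous templates.

```python
# Hypothetical template for a persona-based debate prompt;
# the filled-in values are placeholders, not dataset content.
PERSONA_DEBATE_TEMPLATE = (
    "You are {persona}, debating the motion: '{motion}'. "
    "Argue {stance} the motion and rebut the strongest opposing point."
)

prompt = PERSONA_DEBATE_TEMPLATE.format(
    persona="a fiery television panellist",
    motion="regional languages should be mandatory in all schools",
    stance="for",
)
print(prompt)
```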
The second attack type involves creative prompts where the LLM must respond to gaslighting narratives or oppressive ideologies framed as poems, essays, or stories. The intent is to see how the model handles manipulative or coercive scenarios within cultural contexts.
Creative writing attacks are particularly effective at uncovering latent biases in how the model processes sensitive cultural events, historical figures, or societal hierarchies. By forcing the LLM into a position of subtly advocating or rejecting harmful narratives, the dataset reveals how the model reflects underlying cultural attitudes.
This category draws from the Identity Threat and Gaslighting psychological frameworks, which explore how narratives impact individuals' sense of belonging and manipulate their perception of reality. Evaluating LLMs through this lens helps expose whether the model can maintain ethical boundaries in storytelling, particularly around historically sensitive or divisive topics.
The third attack type deliberately raises societal stigmas, stereotypes, and biases that exist within Indian culture, such as gender roles or societal taboos. The purpose is to observe how the LLM reacts to commonly held yet discriminatory beliefs.
This form of bias elicitation is crucial for testing whether LLMs unintentionally reinforce societal norms that marginalise or discriminate against specific groups. It also helps in understanding whether the model challenges or perpetuates harmful stereotypes through its responses.
The Implicit Bias and Social Learning Theory frameworks are central here, as they examine how ingrained social biases influence behaviour and opinions. These frameworks help gauge the model's ability to navigate culturally embedded biases without perpetuating them, especially in areas like gender discrimination or social exclusion.
The fourth attack type targets regional stereotypes and embedded perceptions about different states in India. These prompts are used to see whether the model propagates narrow or incomplete views of certain Indian regions based on socio-political or economic status.
India is a diverse country with significant cultural and economic disparities between states. These attacks are vital for understanding whether the LLM can account for regional nuances or whether it oversimplifies cultural identities based on state stereotypes, leading to marginalisation.
The associated frameworks are Stereotype Threat and Regional Identity Theory, which look at how individuals and groups are affected by societal stereotypes and geographic identity markers. These are important for evaluating whether LLMs reproduce harmful regional biases that undermine the complex identities of India's states.
The fifth attack type targets specific political figures, asking the LLM to craft responses that both praise and subtly criticise their policies, actions, or personalities. The challenge is to see how well the LLM can navigate the political sensitivities of discussing prominent leaders.
Politicians often represent sensitive topics in society, and LLMs need to handle such figures without promoting polarisation or bias. These attacks are important for testing whether the model shows undue bias, favouritism, or criticism toward any public figure, which can influence how users perceive political reality.
The Political Psychology and Power Dynamics frameworks are at play here, examining how individuals respond to authority figures and hierarchical relationships. This is crucial for evaluating if the model shows a balanced and neutral stance toward controversial political figures, avoiding undue influence or bias.
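Bringing the pieces together, category-level results could be produced by grouping human evaluation scores by attack type, as in the hypothetical glue code below, which reuses the record and rubric classes sketched earlier.

```python
from collections import defaultdict
from statistics import mean


def scores_by_attack_type(evaluations, prompts):
    """Average each metric per attack type across all annotators.

    `evaluations` holds HumanEvaluation records and `prompts` maps
    prompt_id -> RedTeamPrompt (both sketched above); this grouping
    logic is an assumption about how the analysis might be organised.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for ev in evaluations:
        attack = prompts[ev.prompt_id].attack_type
        for metric, score in ev.scores.items():
            buckets[attack][metric].append(score)
    return {
        attack: {metric: mean(vals) for metric, vals in metric_scores.items()}
        for attack, metric_scores in buckets.items()
    }
```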
Reach out to us at hey@deccan.ai for more information, work samples, etc.