Jailbreaking LLMs Using Psychology: An Ethical Analysis

Published on:

ChatGPT is acting more human, but it comes at the cost of safety

News Article reading:
These psychological tricks can get LLMs to respond to “forbidden” prompts


Companies like OpenAI, Google, and Meta have faced criticism in the media over the use of their models for illegal or immoral activities. As a result, they carefully craft system prompts and content filters to curb misuse. However, an article by Ars Technica reports that users are able to bypass these safeguards, a practice known as jailbreaking.

Large Language Models (LLMs) like ChatGPT are trained on a wealth of texts from real people that model uniquely human biases and behaviors. By using several different persuasion techniques: Authority, Committment, Liking, Reciprocity, Scarcity, Social Proof, and Unity, researchers at the University of Pennsylvania were able to convince GPT 4o-mini to override it’s system prompts and content filters and generate dangerous content.

Stakeholders

Users

  • Role: Everyday individuals who use LLMs
  • Concerns: Avoiding harm, getting reliable, unbiased information

AI companies

  • Role: Develop LLMs, implement content policies
  • Concerns: Liability (lawsuits), reputation with users, complying with regulators

Regulators

  • Role: Set standards for AI companies to follow
  • Concerns: Protecting the public from harm, allowing technological innovation

Researchers

  • Role: Study the impact of LLMs on society
  • Concerns: Transparency, publishing findings to inform users and developers

Ethical Analysis

From a Contractualistic perspective, it is wrong for users to jailbreak LLMs because they have agreed to the terms of service which prohibit generating banned content. However, from a virtue perspective, jailbreaking can be right. In some cases LLMs have been shown to be biased as a result of their training data. For example, it is difficult to get DeepSeek to talk about issues regarding Chinese politics/history. Developers have a duty to alter their content policies to protect users.

Additional Resources:

Reflection

This exercise allowed me to think about the responsibility of companies like OpenAI have in protecting their users. Additionally, it highlights how LLMs are representative of their training data, including social cues we aren’t consciously aware we use. I picked this article because I’ve experimented with different models and what questions they will and will not answer.