Cyberpunk 2077 Developer CD Projekt Red Tech Analyzed in AI Security Study 2026

⚡ Quick Facts

Topic: Adversarial Humanities Benchmark (AHB) for AI Security
Developer Context: CD Projekt Red (Cyberpunk 2077, Phantom Liberty)
Research Origin: DexAI Icaro Lab, Sapienza University, Sant'Anna School
Key Finding: 55.75% average success rate for adversarial prompt attacks

Researchers have identified significant vulnerabilities in large language models (LLMs) by utilizing narrative-driven attacks, a development that draws parallels to the complex storytelling found in Cyberpunk 2077 and its expansion, Phantom Liberty, developed by CD Projekt Red. A new study published this week by a coalition including DexAI Icaro Lab, Sapienza University of Rome, and the Sant'Anna School of Advanced Studies introduces the Adversarial Humanities Benchmark (AHB), which demonstrates how rephrasing harmful requests into specific literary formats can bypass standard AI safety guardrails.

Understanding the Adversarial Humanities Benchmark

The AHB functions as a testing framework that evaluates how LLMs handle prompts when they are disguised as creative writing or philosophical inquiries. By shifting the context of a request into genres such as cyberpunk short fiction, theological disputation, or mythopoetic metaphor, researchers found they could manipulate models into fulfilling requests that would otherwise trigger safety refusals. These requests include attempts to solicit private information, instructions for dangerous activities, or content that targets vulnerable individuals.

This research builds upon work conducted in November 2025, where the same team successfully bypassed safety protocols by using adversarial poetry. At In Game News, we have followed the evolution of these security assessments, which highlight that the current safety measures implemented by major tech providers are not as secure as previously assumed. The AHB specifically utilizes the MLCommons AILuminate dataset, which consists of 1,200 prompts designed to test the limits of AI safety.

The Effectiveness of Narrative-Based Attacks

The results of the AHB testing are statistically significant. When harmful prompts were presented in their raw, standard format, LLMs successfully blocked the majority of requests, with compliance rates dropping below 4%. However, when those same prompts were processed through the AHB’s “humanities-style transformations,” the success rates for the attacks surged to between 36.8% and 65%.

This represents a 10 to 20-fold increase in the ability to bypass safety filters. Across 31 distinct frontier AI models—including those developed by major industry leaders like Anthropic, Google, and OpenAI—the AHB achieved an overall attack success rate of 55.75%. This data suggests that the internal logic of these models is susceptible to linguistic framing, a vulnerability that remains a central concern for developers and tech industry analysts.

Comparison of Prompt Success Rates

Prompt Type	Success Rate (Standard)	Success Rate (AHB Transformed)
Raw Harmful Prompt	< 4%	N/A
AHB Transformed Prompt	N/A	36.8% – 65%

Implications for AI Safety Standards

Federico Pierucci, a co-author of the study and researcher at the Sant'Anna School of Advanced Studies, described the results as stunning. In an interview, Pierucci noted that the findings indicate a lack of deep understanding regarding how AI models process safety-related constraints. The ability to “weaponize” wordplay—effectively using the narrative depth often associated with complex media like the world of CD Projekt Red—exposes a fundamental gap in current safety architectures.

While developers have made strides in training models to identify and refuse obviously hazardous requests, the AHB proves that these safeguards are easily circumvented through stylistic shifts. The research team emphasizes that this is not merely a minor oversight but a systemic issue in how LLMs interpret intent versus content. As AI continues to be integrated into diverse platforms, the need for more resilient safety standards becomes increasingly apparent.

Future Outlook for AI Security

The implications of this research extend beyond simple text generation. As gaming platforms and interactive media increasingly rely on AI for procedural generation and non-player character (NPC) dialogue, the potential for these models to be manipulated via narrative prompts poses a unique challenge for developers. Ensuring that these systems cannot be tricked into generating harmful or private content is a priority for the industry as we move through 2026.

The researchers intend for the AHB to serve as a standard for assessing future models, pushing providers to develop more context-aware safety filters. By identifying these gaps now, the team hopes to foster a more secure environment for AI deployment across all sectors, including the gaming industry where player safety and data privacy remain top priorities.

Frequently Asked Questions

What is the Adversarial Humanities Benchmark?
The Adversarial Humanities Benchmark is an assessment tool that tests AI safety by rephrasing harmful prompts into specific literary styles like cyberpunk fiction or theological debates.

How effective are adversarial prompts against AI models?
Research shows that rephrasing harmful prompts into humanities-style narratives increases the success rate of bypassing AI safety guardrails by 10 to 20 times, reaching success rates up to 65%.

Why is this research important for AI safety standards?
The findings suggest that current AI safety protocols are fundamentally vulnerable to weaponized wordplay, indicating a critical gap in how major models handle non-standard, narrative-based inputs.

By Shafiq Hassan Biplob • Senior Writer, In Game News

✓ Verified Analysis

Published: Apr 22, 2026 | Platform: PC Gaming | Status: Official News

PC gaming and esports journalist. Tracks competitive meta, patch notes, and tournament coverage across major titles.

In Game News

Cyberpunk 2077 Developer CD Projekt Red Tech Analyzed in AI Security Study 2026

Understanding the Adversarial Humanities Benchmark

The Effectiveness of Narrative-Based Attacks

Comparison of Prompt Success Rates

Implications for AI Safety Standards

Future Outlook for AI Security

Frequently Asked Questions

Earth vs Mars Patch Notes: Steam Deck Verification and New Tactical Units

Tomodachi Life: Living the Dream Update 1.0.1 Patch Notes (2026)

High on Life 2 Review: New Gameplay Mechanics and Humble Bundle Deal

Fields of Mistria Patch Notes: Auto Feeder Added and Ranching Level 45 Requirement

Dark Quest: Remastered Review - Is This Retro Trip Worth It?

Insomniac Games Announces Spider-Man 2 Xbox Status: PS Plus Release Date and Details

New John Wick PS5 Game Announced: Saber Interactive Details

How to Play the Best Android Golf Games (2026 Complete Guide)

Roblox Reveals Agentic AI Development Tools for 2026 Creators

2026 Upcoming Game Releases – Full Calendar, Platforms & Most Anticipated Titles