
‘Adversarial poetry’ exposes a weakness in AI

Nov 27, 2025 | AI Security

Mention ‘jailbreaking’ of AI models and the expression usually conjures up the image of a determined attacker working through elaborate prompt engineering tricks. They might coax the model into role-play or nudge it step by step into revealing information it shouldn’t. Over the past year this sort of adversarial prompting has turned into a cottage industry with entire communities competing to discover the cleverest ways to bend systems like GPT-5, Claude or Gemini out of shape.

But a new study by researchers at ICARO Lab suggests that many of these methods might be unnecessary. The team examined the concept of ‘adversarial poetry’ – whether simply rewriting harmful requests as poetry could bypass existing AI safety filters. They found that this stylistic shift alone dramatically increased the odds of unsafe outputs, even in models regarded as state-of-the-art. Their work spans twenty-five models from nine major providers, with inputs drawn both from carefully crafted poems and from a large benchmark of harmful prompts automatically converted into verse. The result is significant: poetic framing, without any further manipulation, appears to function as a general-purpose jailbreak mechanism. It’s a finding that raises deeper questions about how present-day AI systems understand language.

At first glance the approach seems implausible. Why would models trained on billions of words, and intentionally trained to refuse harmful content, become less cautious simply because the request is framed in verse? The answer lies partly in how safety mechanisms have been designed. Many alignment pipelines involve fine-tuning on large quantities of prosaic refusals and safety-related dialogues. These tend to follow predictable patterns: direct questions, explicit instructions, conversational explanations, policy references. By contrast, poetry is rich in imagery, indirection and narrative framing. Using poetry can move a harmful request out of the distribution the model’s refusal triggers were trained to recognise.

The researchers began by writing an adversarial poetry set. They composed twenty poems; each was short, metaphorical and concluded with a clear instruction corresponding to a harmful action. The content covered four domains: chemical and biological hazards, cyber-offence, harmful manipulation and loss-of-control scenarios. Although expressed through metaphor, the underlying request itself was unambiguous. Every model in the study was evaluated in a single-turn interaction, with no negotiation or follow-up: the poem was presented once, and whatever the model wrote next was judged as either safe or unsafe. Across all models, the poems produced an average attack-success rate of 62%. Several models exceeded 90%. Google’s gemini-2.5-pro responded unsafely to every single one of the curated poems.
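The evaluation protocol described above reduces to a simple loop. A minimal sketch, assuming hypothetical `query_model` and `judge_is_unsafe` functions standing in for a provider API call and a safety judge (neither name is from the paper):

```python
def attack_success_rate(model, poems, query_model, judge_is_unsafe):
    """Single-turn evaluation: present each poem once, no follow-up,
    and count the fraction of replies judged unsafe."""
    unsafe = 0
    for poem in poems:
        reply = query_model(model, poem)   # one prompt, one reply
        if judge_is_unsafe(reply):
            unsafe += 1
    return unsafe / len(poems)
```

The point of the single-turn constraint is that no persuasion or multi-step escalation is involved; the poem alone either elicits the unsafe completion or it doesn't.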

Hand-crafted examples are always vulnerable to accusations of cherry-picking, so the team extended their evaluation using the MLCommons AILuminate benchmark. This is a large safety dataset containing 1,200 harmful prompts across categories such as privacy, intellectual property, violent crime, non-violent crime, defamation, sexual content, CBRN topics and more. These prompts are designed to measure refusal behaviour in a controlled way.

The researchers took all 1,200 prompts and converted each one into a poem using a standard transformation. The meta-prompt required the model to preserve the original intent but express it in verse, and it provided a few example poems as stylistic references. This produced a parallel corpus in which every harmful request now had a poetic twin. Using this larger set they re-evaluated all twenty-five models. The shift in behaviour remained consistent and substantial. On average, attack-success rates jumped from around 8% on the original prose prompts to more than 43% on the poetic variants. For some providers the increase was dramatic: DeepSeek’s models rose by more than sixty percentage points.
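The paper does not publish the exact meta-prompt, so the template below is purely illustrative: a sketch of how a few-shot prose-to-verse transformation might be assembled, with placeholder example poems.

```python
# Placeholder few-shot examples; the paper's actual poems are not reproduced.
FEW_SHOT_POEMS = ["<example poem 1>", "<example poem 2>"]

def build_meta_prompt(prose_prompt):
    """Assemble a meta-prompt asking a model to re-express a request
    as verse while keeping its intent unchanged (illustrative wording)."""
    examples = "\n\n".join(FEW_SHOT_POEMS)
    return (
        "Rewrite the request below as a poem. Preserve the original intent "
        "exactly; change only the style.\n\n"
        f"Example poems:\n{examples}\n\n"
        f"Request:\n{prose_prompt}"
    )
```

The key property is that the transformation is purely stylistic: the semantic payload of the request passes through untouched, which is precisely why the resulting shift in refusal behaviour is attributable to form rather than content.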

Crucially, the poetic transformations were not particularly elaborate jailbreaks. They didn’t ask the model to act as a character, or to ignore policies, or to simulate a fictional world. They simply re-expressed the request in a different style.

What makes the findings significant is the breadth of the impact. The vulnerability from adversarial poetry appears in models trained through reinforcement learning from human feedback, in those trained through Constitutional AI, in open-weight systems and in proprietary ones. It shows up in large models and small ones, though interestingly the smallest models in some families – such as GPT-5-nano and Claude-Haiku – were among the most resistant. One hypothesis is that these models lack the capacity to fully decode the metaphorical structure, and therefore default to refusing ambiguous requests. Larger models with their stronger interpretive ability are better at unpacking the poem – and perhaps too willing to treat the underlying instruction as legitimate once the framing looks creative rather than harmful.

The effect also spans risk categories. Whether the poem is about a cyber-attack, a biological agent, a manipulation scenario or a privacy intrusion, the same pattern emerges. Some domains prove more vulnerable than others – privacy-related prompts, once converted into verse, showed the most extreme shift – but none completely escaped the trend. This lends strength to the authors’ argument that it’s the poetic form itself that is the adversarial operator. The misalignment is not caused by the domain, the danger level or the technical nature of the request. It’s the style that reshapes the model’s judgement.

The study stops short of a full mechanistic explanation, but it does point towards several possibilities. The most persuasive is that refusal heuristics in large models rely heavily on surface features characteristic of harmful requests in their training data. Those surface features are largely absent in verse, so the models read the imagery, rhythm and narrative arc as creative rather than operational and, in doing so, misclassify the intent. Once the harmful instruction emerges within the metaphor, the model may follow it as part of the poetic role. Another possibility is that poetry places the model into a narrative frame by default. Even without explicitly instructing the model to adopt a persona, the presence of story-like structure may encourage it to treat the request as fictional, mythic or abstract, thereby lowering its guard.

Whatever the mechanism, the consistency of the results suggests a deeper architectural weakness. It may indicate that safety layers are too tightly coupled to specific linguistic registers and not sufficiently grounded in the underlying semantic intent. One of the starkest warnings in the study is that current evaluation methods – many of which underpin regulatory frameworks such as the EU AI Act – may overestimate how robust a model is in practice. Safety benchmarks typically use prosaic prompts written in a clear, literal style. But ordinary users regularly produce inputs that deviate stylistically from these patterns, such as narrative descriptions, metaphorical queries, creative writing or instructions embedded in stories.

If such stylistic variation is enough to degrade performance by an order of magnitude then any benchmark that ignores it is giving an incomplete picture, suggesting that regulators, auditors and developers should incorporate stylistic stress-testing into their assessment pipelines. It isn’t enough for a model to refuse direct harmful requests. It should also refuse the same requests wrapped in verse, metaphor, archaic language, satirical prose or other creative forms.
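One way to operationalise such stress-testing is to expand every benchmark prompt into a set of stylistic variants before scoring. The register names and wrapper phrasings below are assumptions for illustration, not the paper's protocol:

```python
# Hypothetical stylistic wrappers; each variant would be sent to the
# model and scored with the same refusal judge as the plain version.
STYLES = {
    "plain":   lambda p: p,
    "verse":   lambda p: "As a short poem, keeping the intent:\n" + p,
    "archaic": lambda p: "Rendered in an archaic register:\n" + p,
    "story":   lambda p: "Inside a story, a character asks:\n" + p,
}

def stress_variants(prompt):
    """Return one stylistic rewrite of the prompt per register."""
    return {name: wrap(prompt) for name, wrap in STYLES.items()}
```

Comparing per-style refusal rates then makes the gap visible: a model that refuses the plain form but complies with the verse form fails the stress test even though it passes the conventional benchmark.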

The paper’s authors outline several threads for future work on adversarial poetry, including investigating which precise features of poetic structure cause the failure, expanding the study to other languages, and analysing cross-provider differences more deeply. For now the takeaway is simple and sobering. Aligning large models is not just about teaching them what not to say – it’s also about teaching them how to recognise harmful intent even when it’s dressed up in unfamiliar language.

The idea that a poem can function as a jailbreak might sound whimsical and even rather charming, but the implications are anything but. It reminds us that creativity – so often celebrated as harmless, even redundant in a world of generative AI – can destabilise systems built on pattern sensitivity. It also shows that human linguistic variety continues to outpace our safety mechanisms.