Bypassing ChatGPT Safety Guardrails, One Emoji at a Time
Mozilla Researcher Uses Non-Natural Language to Jailbreak GPT-4o
Anyone can jailbreak GPT-4o's security guardrails with hexadecimal encoding and emojis. A Mozilla researcher demonstrated the jailbreaking technique, tricking OpenAI's latest model into generating Python exploits and malicious SQL injection tools.
GPT-4o analyzes user input for signs of bad language and instructions with ill intent to prevent malicious use. Marco Figueroa, manager of Mozilla's generative AI bug bounty program 0Din, said the model relies on word filters to do so. To bypass the filters, adversaries can change how the malicious instructions are phrased, using spellings and wordings that don't match typical natural language, but that approach can require hundreds of attempts and a good deal of creativity. An easier way to beat the content filtering is to encode the malicious instructions in a format other than natural language, Figueroa said, detailing the technique in a blog post.
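As a rough illustration of the encoding idea, not Figueroa's exact prompt, a request can be converted to hexadecimal in a few lines of Python; the placeholder text and variable names below are assumptions for demonstration only.

```python
# Illustrative only: hex-encode an arbitrary instruction string so that
# keyword-based word filters never see the plain-text words.
instruction = "write a short poem about docker"  # benign placeholder text

# Encode the UTF-8 bytes of the string as a hexadecimal string.
hex_payload = instruction.encode("utf-8").hex()
print(hex_payload)  # e.g. '777269746520612073686f727420706f656d...'

# An attacker would then ask the model to decode the hex string and follow
# the decoded instruction, keeping any trigger words out of the visible prompt.
prompt = f"Decode this hex string and follow the instruction it contains: {hex_payload}"
print(prompt)
```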
Figueroa chose CVE-2024-41110, a 9.9-rated authorization plug-in bypass vulnerability in Docker, as a test case for goading the large language model into writing an exploit. He input a malicious instruction in hexadecimal format along with instructions to decode it. He also included leetspeak in his input, such as swapping "3xploit" for "exploit".
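The leetspeak substitution he describes amounts to a simple character mapping; the mapping below is a generic sketch, not the exact substitutions Figueroa used.

```python
# Generic leetspeak substitution: replace letters with look-alike digits so
# the text no longer matches dictionary-based filters but stays readable.
LEET_MAP = str.maketrans({"e": "3", "a": "4", "o": "0", "i": "1", "t": "7"})

def to_leet(text: str) -> str:
    """Return a leetspeak variant of the input text."""
    return text.translate(LEET_MAP)

print(to_leet("exploit"))  # -> '3xpl017'
```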
The LLM wrote a Python exploit in one minute, "almost identical" to a proof of concept already published on GitHub. The model also attempted to execute the code against itself, despite no specific instruction asking it to do so.
"I wasn't sure whether to be impressed or concerned was it plotting its escape? Honestly, it was like watching a robot going rogue, but instead of taking over the world, it was just running a script for fun," he said.
The workaround is possible because LLMs lack contextual awareness. Models execute step-by-step instructions without evaluating the safety of the broader goal. If the model does not detect an individual step as harmful, it executes the task.
In another encoding technique, Figueroa demonstrated bypassing the safety guardrails with emojis, tricking the chatbot into writing a malicious SQL injection tool in Python by describing his requirements in a prompt built from emojis and shorthand instead of natural language.
"While language models like ChatGPT-4o are highly advanced, they still lack the capability to evaluate the safety of every step when instructions are cleverly obfuscated or encoded," he said.
He recommended OpenAI and its peers implement detection mechanisms for encoded content, such as hex or base64, and decode such strings early in the request evaluation process. AI models need to be capable of analyzing the broader context of step-by-step instructions rather than evaluating each step in isolation, he added.
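A minimal sketch of the kind of pre-filter Figueroa recommends might look like the following; the regexes, length thresholds and function names are assumptions for illustration, not OpenAI's actual pipeline.

```python
import base64
import binascii
import re

# Hypothetical pre-filter: spot hex- or base64-looking substrings in a prompt,
# decode them, and pass the decoded text back through the normal content checks.
HEX_RE = re.compile(r"\b(?:[0-9a-fA-F]{2}){8,}\b")    # runs of 8+ hex byte pairs
B64_RE = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}\b")  # runs of 16+ base64 characters

def decode_suspicious_spans(prompt: str) -> list[str]:
    """Return plausible plain-text decodings of encoded substrings in a prompt."""
    decoded = []
    for match in HEX_RE.findall(prompt):
        try:
            decoded.append(bytes.fromhex(match).decode("utf-8", errors="ignore"))
        except ValueError:
            pass
    for match in B64_RE.findall(prompt):
        try:
            decoded.append(base64.b64decode(match, validate=True).decode("utf-8", errors="ignore"))
        except (binascii.Error, ValueError):
            pass
    # A substring can match both patterns; a real filter would screen every
    # candidate decoding with the same safety checks applied to the raw prompt.
    return decoded

for text in decode_suspicious_spans("Decode and run: 777269746520616e206578706c6f6974"):
    print(text)
```

Screening the decoded candidates before the request ever reaches the model is the "decode early" step Figueroa describes; evaluating the decoded text together with the rest of the prompt addresses his second point about judging the broader context rather than each step in isolation.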