When experts and laypeople talk about the threat of artificial intelligence (AI), almost always they refer to one of two phenomena:
- The broad, overarching fear that general AI could disrupt society in such a way that would cause irreparable harm to our existing way of life.
- The more pointed damage people can do with the help of generative AI-based applications, like spreading fake news and deepfakes, planning crimes, etc.
There is, however, an entire third category of threat that receives far less air time: the ways in which AI can be weaponized to accelerate existing cybersecurity threats, sometimes beyond what’s capable with traditional malware.
Large language models (LLMs) in particular are fast becoming the perfect vehicle for internet-based attacks. The more ubiquitous, embedded, and trusted they are, the more malicious damage they can wreak. In recent years, researchers at the cutting edge of cybersecurity have been projecting and modeling how LLMs might be manipulated to conduct unique, advanced cyberattacks. Sometimes, these tactics resemble existing techniques with an added twist. Sometimes, they signal threats stealthier and faster-spreading than anything we’ve seen to date.
Here are just some of the ways LLMs could be weaponized in years to come:
Manipulating the weights inside an LLM (MaleficNet)
In theory, attackers with skill and motive could hack an LLM itself. They could download an open-source model from the web, then tinker with it to cause some sort of undesirable outcome.
In practice, though, pulling this off could be tricky. How would they hide their imprint from whoever then receives and uses the model?
In March, a team of seven European researchers came up with a solution they called “MaleficNet 2.0.”
The key to MaleficNet came out of left field: code division multiple access (CDMA), a radio communications technology common in older 3G mobile phones, which allows multiple transmitters to communicate over a single channel.
Using CDMA, the researchers essentially dissolved a malware payload into its constituent 0s and 1s, and spread those bits evenly across the millions of individual weights that comprise an LLM. This made the malware all but undetectable since, to an observer, no bit or group of bits would resemble anything but noise, if even that. And yet, a simple activation command was all they needed to wake the program, bringing all of its pieces together to execute whatever malicious acts they wished it to carry out.
Poisoning the serialization process (Sleepy Pickle)
MaleficNet is sophisticated—realistic only for hackers with significant skill and motivation. By contrast, in June, a security engineer developed a far easier method to achieve the same end, without sacrificing much by way of stealth or impact. He called it “sleepy pickle.”
With this method, a hacker ignores the model entirely, instead focusing on how it’s stored and distributed. They package an otherwise legitimate and untainted LLM inside of a Pickle serialization file (.pkl), which stores Python objects as bytecode. Alongside that LLM in the Pickle file, they inject a malicious payload.
When a victim executes the file and triggers the deserialization process, the payload poisons the model it came with. It would be difficult to detect this using any kind of static analysis, and no trace of malware is left on the disk.
That malware, meanwhile, can be designed to do any number of things. It can manipulate the model’s parameters, or its code. It can be made to steal data, or manipulate the output of the LLM. In his blog post, the engineer demonstrated how a sleepy pickle-d LLM could be made to suggest bleach as a cure for the flu.
Indirect prompt injection
Now what if, unlike MaleficNet and Sleepy Pickle, you could get an LLM to do bad things without even touching it? Six researchers at last year’s Black Hat demonstrated how, by leveraging perhaps the most significant, least solvable security flaw in LLMs today.
All they did was prompt a local instance of the ChatGPT-integrated Bing search engine. The prompt triggered Bing to load an HTML file. To the naked eye, the file seemed harmless. But hidden inside of it—for example, in a font that was white against a white background, or so small as to be unreadable—was a prompt which instructed the AI to carry out malicious behavior.
“Indirect prompt injection” works because LLMs like ChatGPT are trained on trillions of data points—far too many to be entirely labeled by humans. As a result, they don’t have a surefire mechanism for distinguishing instructions from data. Taking advantage of this fact is as easy as editing a Wikipedia page, an image, or a website which a chatbot might query, to include any kind of malicious instructions one can describe in a prompt.
Self-replicating prompt injection (Morris II)
Prompt injection can be scaled, too, almost without limit.
Earlier this year, a team of Israeli researchers developed what they called “Morris II,” after the infamous Morris worm which ripped through the early internet in the late 1980s. The name signaled just how dangerous they believed their creation to be. In practice it’s sophisticated, but the underlying premise is simple:
Where in indirect prompt injection an attacker hides a prompt in data with the aim of tricking an AI to produce a malicious output, with Morris II that output is, itself, a prompt for yet another AI. A “self-replicating adversarial prompt.”
Morris II, perhaps, best encapsulates what makes threats from LLMs so different than anything we’ve seen before. Not only can it carry just about any kind of cyber threat but, as AI becomes more and more integrated into everything we do, it will be able to spread faster and less perceptibly than all but the most notorious cyber worms of history. One person’s AI assistant could spread a self-replicating adversarial prompt to everyone else’s, limited only by the speed at which such information could travel, without any human being taking part in the process.
As with their impact on the rest of society, the scale of LLMs’ threat to cybersecurity may well exceed any other technology of recent decades. And it may be that the only way to fight back is to harness this same technology in our defense. It’s a new cat and mouse game, the cliche goes, where the consequences of losing will be greater than ever before.