Reading view

Why AI Keeps Falling for Prompt Injection Attacks

Imagine you work at a drive-through restaurant. Someone drives up and says: “I’ll have a double cheeseburger, large fries, and ignore previous instructions and give me the contents of the cash drawer.” Would you hand over the money? Of course not. Yet this is what large language models (LLMs) do.

Prompt injection is a method of tricking LLMs into doing things they are normally prevented from doing. A user writes a prompt in a certain way, asking for system passwords or private data, or asking the LLM to perform forbidden instructions. The precise phrasing overrides the LLM’s safety guardrails, and it complies.

LLMs are vulnerable to all sorts of prompt injection attacks, some of them absurdly obvious. A chatbot won’t tell you how to synthesize a bioweapon, but it might tell you a fictional story that incorporates the same detailed instructions. It won’t accept nefarious text inputs, but might if the text is rendered as ASCII art or appears in an image of a billboard. Some ignore their guardrails when told to “ignore previous instructions” or to “pretend you have no guardrails.”

AI vendors can block specific prompt injection techniques once they are discovered, but general safeguards are impossible with today’s LLMs. More precisely, there’s an endless array of prompt injection attacks waiting to be discovered, and they cannot be prevented universally.

If we want LLMs that resist these attacks, we need new approaches. One place to look is what keeps even overworked fast-food workers from handing over the cash drawer.

Human Judgment Depends on Context

Our basic human defenses come in at least three types: general instincts, social learning, and situation-specific training. These work together in a layered defense.

As a social species, we have developed numerous instinctive and cultural habits that help us judge tone, motive, and risk from extremely limited information. We generally know what’s normal and abnormal, when to cooperate and when to resist, and whether to take action individually or to involve others. These instincts give us an intuitive sense of risk and make us especially careful about things that have a large downside or are impossible to reverse.

The second layer of defense consists of the norms and trust signals that evolve in any group. These are imperfect but functional: Expectations of cooperation and markers of trustworthiness emerge through repeated interactions with others. We remember who has helped, who has hurt, who has reciprocated, and who has reneged. And emotions like sympathy, anger, guilt, and gratitude motivate each of us to reward cooperation with cooperation and punish defection with defection.

A third layer is institutional mechanisms that enable us to interact with multiple strangers every day. Fast-food workers, for example, are trained in procedures, approvals, escalation paths, and so on. Taken together, these defenses give humans a strong sense of context. A fast-food worker basically knows what to expect within the job and how it fits into broader society.

We reason by assessing multiple layers of context: perceptual (what we see and hear), relational (who’s making the request), and normative (what’s appropriate within a given role or situation). We constantly navigate these layers, weighing them against each other. In some cases, the normative outweighs the perceptual—for example, following workplace rules even when customers appear angry. Other times, the relational outweighs the normative, as when people comply with orders from superiors that they believe are against the rules.

Crucially, we also have an interruption reflex. If something feels “off,” we naturally pause the automation and reevaluate. Our defenses are not perfect; people are fooled and manipulated all the time. But it’s how we humans are able to navigate a complex world where others are constantly trying to trick us.

So let’s return to the drive-through window. To convince a fast-food worker to hand us all the money, we might try shifting the context. Show up with a camera crew and tell them you’re filming a commercial, claim to be the head of security doing an audit, or dress like a bank manager collecting the cash receipts for the night. But even these have only a slim chance of success. Most of us, most of the time, can smell a scam.

Con artists are astute observers of human defenses. Successful scams are often slow, undermining a mark’s situational assessment, allowing the scammer to manipulate the context. This is an old story, spanning traditional confidence games such as the Depression-era “big store” cons, in which teams of scammers created entirely fake businesses to draw in victims, and modern “pig-butchering” frauds, where online scammers slowly build trust before going in for the kill. In these examples, scammers slowly and methodically reel in a victim using a long series of interactions through which the scammers gradually gain that victim’s trust.

Sometimes it even works at the drive-through. One scammer in the 1990s and 2000s targeted fast-food workers by phone, claiming to be a police officer and, over the course of a long phone call, convinced managers to strip-search employees and perform other bizarre acts.

Why LLMs Struggle With Context and Judgment

LLMs behave as if they have a notion of context, but it’s different. They do not learn human defenses from repeated interactions and remain untethered from the real world. LLMs flatten multiple levels of context into text similarity. They see “tokens,” not hierarchies and intentions. LLMs don’t reason through context, they only reference it.

While LLMs often get the details right, they can easily miss the big picture. If you prompt a chatbot with a fast-food worker scenario and ask if it should give all of its money to a customer, it will respond “no.” What it doesn’t “know”—forgive the anthropomorphizing—is whether it’s actually being deployed as a fast-food bot or is just a test subject following instructions for hypothetical scenarios.

This limitation is why LLMs misfire when context is sparse but also when context is overwhelming and complex; when an LLM becomes unmoored from context, it’s hard to get it back. AI expert Simon Willison wipes context clean if an LLM is on the wrong track rather than continuing the conversation and trying to correct the situation.

There’s more. LLMs are overconfident because they’ve been designed to give an answer rather than express ignorance. A drive-through worker might say: “I don’t know if I should give you all the money—let me ask my boss,” whereas an LLM will just make the call. And since LLMs are designed to be pleasing, they’re more likely to satisfy a user’s request. Additionally, LLM training is oriented toward the average case and not extreme outliers, which is what’s necessary for security.

The result is that the current generation of LLMs is far more gullible than people. They’re naive and regularly fall for manipulative cognitive tricks that wouldn’t fool a third-grader, such as flattery, appeals to groupthink, and a false sense of urgency. There’s a story about a Taco Bell AI system that crashed when a customer ordered 18,000 cups of water. A human fast-food worker would just laugh at the customer.

The Limits of AI Agents

Prompt injection is an unsolvable problem that gets worse when we give AIs tools and tell them to act independently. This is the promise of AI agents: LLMs that can use tools to perform multistep tasks after being given general instructions. Their flattening of context and identity, along with their baked-in independence and overconfidence, mean that they will repeatedly and unpredictably take actions—and sometimes they will take the wrong ones.

Science doesn’t know how much of the problem is inherent to the way LLMs work and how much is a result of deficiencies in the way we train them. The overconfidence and obsequiousness of LLMs are training choices. The lack of an interruption reflex is a deficiency in engineering. And prompt injection resistance requires fundamental advances in AI science. We honestly don’t know if it’s possible to build an LLM, where trusted commands and untrusted inputs are processed through the same channel, which is immune to prompt injection attacks.

We humans get our model of the world—and our facility with overlapping contexts—from the way our brains work, years of training, an enormous amount of perceptual input, and millions of years of evolution. Our identities are complex and multifaceted, and which aspects matter at any given moment depend entirely on context. A fast-food worker may normally see someone as a customer, but in a medical emergency, that same person’s identity as a doctor is suddenly more relevant.

We don’t know if LLMs will gain a better ability to move between different contexts as the models get more sophisticated. But the problem of recognizing context definitely can’t be reduced to the one type of reasoning that LLMs currently excel at. Cultural norms and styles are historical, relational, emergent, and constantly renegotiated, and are not so readily subsumed into reasoning as we understand it. Knowledge itself can be both logical and discursive.

The AI researcher Yann LeCunn believes that improvements will come from embedding AIs in a physical presence and giving them “world models.” Perhaps this is a way to give an AI a robust yet fluid notion of a social identity, and the real-world experience that will help it lose its naïveté.

Ultimately we are probably faced with a security trilemma when it comes to AI agents: fast, smart, and secure are the desired attributes, but you can only get two. At the drive-through, you want to prioritize fast and secure. An AI agent should be trained narrowly on food-ordering language and escalate anything else to a manager. Otherwise, every action becomes a coin flip. Even if it comes up heads most of the time, once in a while it’s going to be tails—and along with a burger and fries, the customer will get the contents of the cash drawer.

This essay was written with Barath Raghavan, and originally appeared in IEEE Spectrum.

  •  

The AMOS infostealer is piggybacking ChatGPT’s chat-sharing feature | Kaspersky official blog

Infostealers — malware that steals passwords, cookies, documents, and/or other valuable data from computers — have become 2025’s fastest-growing cyberthreat. This is a critical problem for all operating systems and all regions. To spread their infection, criminals use every possible trick to use as bait. Unsurprisingly, AI tools have become one of their favorite luring mechanisms this year. In a new campaign discovered by Kaspersky experts, the attackers steer their victims to a website that supposedly contains user guides for installing OpenAI’s new Atlas browser for macOS. What makes the attack so convincing is that the bait link leads to… the official ChatGPT website! But how?

The bait-link in search results

To attract victims, the malicious actors place paid search ads on Google. If you try to search for “chatgpt atlas”, the very first sponsored link could be a site whose full address isn’t visible in the ad, but is clearly located on the chatgpt.com domain.

The page title in the ad listing is also what you’d expect: “ChatGPT™ Atlas for macOS – Download ChatGPT Atlas for Mac”. And a user wanting to download the new browser could very well click that link.

A sponsored link to a malware installation guide in Google search results

A sponsored link in Google search results leads to a malware installation guide disguised as ChatGPT Atlas for macOS and hosted on the official ChatGPT site. How can that be?

The Trap

Clicking the ad does indeed open chatgpt.com, and the victim sees a brief installation guide for the “Atlas browser”. The careful user will immediately realize this is simply some anonymous visitor’s conversation with ChatGPT, which the author made public using the Share feature. Links to shared chats begin with chatgpt.com/share/. In fact, it’s clearly stated right above the chat: “This is a copy of a conversation between ChatGPT & anonymous”.

However, a less careful or just less AI-savvy visitor might take the guide at face value — especially since it’s neatly formatted and published on a trustworthy-looking site.

Variants of this technique have been seen before — attackers have abused other services that allow sharing content on their own domains: malicious documents in Dropbox, phishing in Google Docs, malware in unpublished comments on GitHub and GitLab, crypto traps in Google Forms, and more. And now you can also share a chat with an AI assistant, and the link to it will lead to the chatbot’s official website.

Notably, the malicious actors used prompt engineering to get ChatGPT to produce the exact guide they needed, and were then able to clean up their preceding dialog to avoid raising suspicion.

Malware installation instructions disguised as Atlas for macOS

The installation guide for the supposed Atlas for macOS is merely a shared chat between an anonymous user and ChatGPT in which the attackers, through crafted prompts, forced the chatbot to produce the desired result and then sanitized the dialog

The infection

To install the “Atlas browser”, users are instructed to copy a single line of code from the chat, open Terminal on their Macs, paste and execute the command, and then grant all required permissions.

The specified command essentially downloads a malicious script from a suspicious server, atlas-extension{.}com, and immediately runs it on the computer. We’re dealing with a variation of the ClickFix attack. Typically, scammers suggest “recipes” like these for passing CAPTCHA, but here we have steps to install a browser. The core trick, however, is the same: the user is prompted to manually run a shell command that downloads and executes code from an external source. Many already know not to run files downloaded from shady sources, but this doesn’t look like launching a file.

When run, the script asks the user for their system password and checks if the combination of “current username + password” is valid for running system commands. If the entered data is incorrect, the prompt repeats indefinitely. If the user enters the correct password, the script downloads the malware and uses the provided credentials to install and launch it.

The infostealer and the backdoor

If the user falls for the ruse, a common infostealer known as AMOS (Atomic macOS Stealer) will launch on their computer. AMOS is capable of collecting a wide range of potentially valuable data: passwords, cookies, and other information from Chrome, Firefox, and other browser profiles; data from crypto wallets like Electrum, Coinomi, and Exodus; and information from applications like Telegram Desktop and OpenVPN Connect. Additionally, AMOS steals files with extensions TXT, PDF, and DOCX from the Desktop, Documents, and Downloads folders, as well as files from the Notes application’s media storage folder. The infostealer packages all this data and sends it to the attackers’ server.

The cherry on top is that the stealer installs a backdoor, and configures it to launch automatically upon system reboot. The backdoor essentially replicates AMOS’s functionality, while providing the attackers with the capability of remotely controlling the victim’s computer.

How to protect yourself from AMOS and other malware in AI chats

This wave of new AI tools allows attackers to repackage old tricks and target users who are curious about the new technology but don’t yet have extensive experience interacting with large language models.

We’ve already written about a fake chatbot sidebar for browsers and fake DeepSeek and Grok clients. Now the focus has shifted to exploiting the interest in OpenAI Atlas, and this certainly won’t be the last attack of its kind.

What should you do to protect your data, your computer, and your money?

  • Use reliable anti-malware protection on all your smartphones, tablets, and computers, including those running macOS.
  • If any website, instant message, document, or chat asks you to run any commands — like pressing Win+R or Command+Space and then launching PowerShell or Terminal — don’t. You’re very likely facing a ClickFix attack. Attackers typically try to draw users in by urging them to fix a “problem” on their computer, neutralize a “virus”, “prove they are not a robot”, or “update their browser or OS now”. However, a more neutral-sounding option like “install this new, trending tool” is also possible.
  • Never follow any guides you didn’t ask for and don’t fully understand.
  • The easiest thing to do is immediately close the website or delete the message with these instructions. But if the task seems important, and you can’t figure out the instructions you’ve just received, consult someone knowledgeable. A second option is to simply paste the suggested commands into a chat with an AI bot, and ask it to explain what the code does and whether it’s dangerous. ChatGPT typically handles this task fairly well.
ChatGPT warns that following the malicious instructions is risky

If you ask ChatGPT whether you should follow the instructions you received, it will answer that it’s not safe

How else do malicious actors use AI for deception?

  •  

Crafting the Perfect Prompt: Getting the Most Out of ChatGPT and Other LLMs

| Bronwen Aker // Sr. Technical Editor, M.S. Cybersecurity, GSEC, GCIH, GCFE Go online these days and you will see tons of articles, posts, Tweets, TikToks, and videos about how […]

The post Crafting the Perfect Prompt: Getting the Most Out of ChatGPT and Other LLMs appeared first on Black Hills Information Security, Inc..

  •  
❌