
Who Operates the Badbox 2.0 Botnet?

26 January 2026 at 17:11

The cybercriminals in control of Kimwolf — a disruptive botnet that has infected more than 2 million devices — recently shared a screenshot indicating they’d compromised the control panel for Badbox 2.0, a vast China-based botnet powered by malicious software that comes pre-installed on many Android TV streaming boxes. Both the FBI and Google say they are hunting for the people behind Badbox 2.0, and thanks to bragging by the Kimwolf botmasters we may now have a much clearer idea about that.

Our first story of 2026, The Kimwolf Botnet is Stalking Your Local Network, detailed the unique and highly invasive methods Kimwolf uses to spread. The story warned that the vast majority of Kimwolf-infected systems were unofficial Android TV boxes that are typically marketed as a way to watch unlimited (pirated) movie and TV streaming services for a one-time fee.

Our January 8 story, Who Benefitted from the Aisuru and Kimwolf Botnets?, cited multiple sources saying the current administrators of Kimwolf went by the nicknames “Dort” and “Snow.” Earlier this month, a close former associate of Dort and Snow shared what they said was a screenshot the Kimwolf botmasters had taken while logged in to the Badbox 2.0 botnet control panel.

That screenshot, a portion of which is shown below, shows seven authorized users of the control panel, including one that doesn’t quite match the others: According to my source, the account “ABCD” (the one that is logged in and listed in the top right of the screenshot) belongs to Dort, who somehow figured out how to add their email address as a valid user of the Badbox 2.0 botnet.

The control panel for the Badbox 2.0 botnet lists seven authorized users and their email addresses. Click to enlarge.

Badbox has a storied history that well predates Kimwolf’s rise in October 2025. In July 2025, Google filed a “John Doe” lawsuit (PDF) against 25 unidentified defendants accused of operating Badbox 2.0, which Google described as a botnet of over ten million unsanctioned Android streaming devices engaged in advertising fraud. Google said Badbox 2.0, in addition to compromising multiple types of devices prior to purchase, also can infect devices by requiring the download of malicious apps from unofficial marketplaces.

Google’s lawsuit came on the heels of a June 2025 advisory from the Federal Bureau of Investigation (FBI), which warned that cyber criminals were gaining unauthorized access to home networks by either configuring the products with malware prior to the user’s purchase, or infecting the device as it downloads required applications that contain backdoors — usually during the set-up process.

The FBI said Badbox 2.0 was discovered after the original Badbox campaign was disrupted in 2024. The original Badbox was identified in 2023, and primarily consisted of Android operating system devices (TV boxes) that were compromised with backdoor malware prior to purchase.

KrebsOnSecurity was initially skeptical of the claim that the Kimwolf botmasters had hacked the Badbox 2.0 botnet. That is, until we began digging into the history of the qq.com email addresses in the screenshot above.

CATHEAD

An online search for the address 34557257@qq.com (pictured in the screenshot above as the user “Chen“) shows it is listed as a point of contact for a number of China-based technology companies, including:

Beijing Hong Dake Wang Science & Technology Co Ltd.
Beijing Hengchuang Vision Mobile Media Technology Co. Ltd.
Moxin Beijing Science and Technology Co. Ltd.

The website for Beijing Hong Dake Wang Science is asmeisvip[.]net, a domain that was flagged in a March 2025 report by HUMAN Security as one of several dozen sites tied to the distribution and management of the Badbox 2.0 botnet. Ditto for moyix[.]com, a domain associated with Beijing Hengchuang Vision Mobile.

A search at the breach tracking service Constella Intelligence finds 34557257@qq.com at one point used the password “cdh76111.” Pivoting on that password in Constella shows it is known to have been used by just two other email accounts: daihaic@gmail.com and cathead@gmail.com.

Constella found cathead@gmail.com registered an account at jd.com (China’s largest online retailer) in 2021 under the name “陈代海,” which translates to “Chen Daihai.” According to DomainTools.com, the name Chen Daihai is present in the original registration records (2008) for moyix[.]com, along with the email address cathead@astrolink[.]cn.

Incidentally, astrolink[.]cn also is among the Badbox 2.0 domains identified in HUMAN Security’s 2025 report. DomainTools finds cathead@astrolink[.]cn was used to register more than a dozen domains, including vmud[.]net, yet another Badbox 2.0 domain tagged by HUMAN Security.

XAVIER

A cached copy of astrolink[.]cn preserved at archive.org shows the website belongs to a mobile app development company whose full name is Beijing Astrolink Wireless Digital Technology Co. Ltd. The archived website reveals a “Contact Us” page that lists a Chen Daihai as part of the company’s technology department. The other person featured on that contact page is Zhu Zhiyu, and their email address is listed as xavier@astrolink[.]cn.

A Google-translated version of Astrolink’s website, circa 2009. Image: archive.org.

Astute readers will notice that the user Mr.Zhu in the Badbox 2.0 panel used the email address xavierzhu@qq.com. Searching this address in Constella reveals a jd.com account registered in the name of Zhu Zhiyu. A rather unique password used by this account matches the password used by the address xavierzhu@gmail.com, which DomainTools finds was the original registrant of astrolink[.]cn.

ADMIN

The very first account listed in the Badbox 2.0 panel — “admin,” registered in November 2020 — used the email address 189308024@qq.com. DomainTools shows this email is found in the 2022 registration records for the domain guilincloud[.]cn, which includes the registrant name “Huang Guilin.”

Constella finds 189308024@qq.com is associated with the China phone number 18681627767. The open-source intelligence platform osint.industries reveals this phone number is connected to a Microsoft profile created in 2014 under the name Guilin Huang (桂林 黄). The cyber intelligence platform Spycloud says that phone number was used in 2017 to create an account at the Chinese social media platform Weibo under the username “h_guilin.”

The public information attached to Guilin Huang’s Microsoft account, according to the breach tracking service osintindustries.com.

The remaining three users and corresponding qq.com email addresses were all connected to individuals in China. However, none of them (nor Mr. Huang) had any apparent connection to the entities created and operated by Chen Daihai and Zhu Zhiyu — or to any corporate entities for that matter. Also, none of these individuals responded to requests for comment.

The mind map below includes search pivots on the email addresses, company names and phone numbers that suggest a connection between Chen Daihai, Zhu Zhiyu, and Badbox 2.0.

This mind map includes search pivots on the email addresses, company names and phone numbers that appear to connect Chen Daihai and Zhu Zhiyu to Badbox 2.0. Click to enlarge.

UNAUTHORIZED ACCESS

The idea that the Kimwolf botmasters could have direct access to the Badbox 2.0 botnet is a big deal, but explaining exactly why that is requires some background on how Kimwolf spreads to new devices. The botmasters figured out they could trick residential proxy services into relaying malicious commands to vulnerable devices behind the firewall on the unsuspecting user’s local network.

The vulnerable systems sought out by Kimwolf are primarily Internet of Things (IoT) devices like unsanctioned Android TV boxes and digital photo frames that have no discernible security or authentication built-in. Put simply, if you can communicate with these devices, you can compromise them with a single command.

Our January 2 story featured research from the proxy-tracking firm Synthient, which alerted 11 different residential proxy providers that their proxy endpoints were vulnerable to being abused for this kind of local network probing and exploitation.

Most of those vulnerable proxy providers have since taken steps to prevent customers from going upstream into the local networks of residential proxy endpoints, and it appeared that Kimwolf would no longer be able to quickly spread to millions of devices simply by exploiting some residential proxy provider.

However, the source of that Badbox 2.0 screenshot said the Kimwolf botmasters had an ace up their sleeve the whole time: Secret access to the Badbox 2.0 botnet control panel.

“Dort has gotten unauthorized access,” the source said. “So, what happened is normal proxy providers patched this. But Badbox doesn’t sell proxies by itself, so it’s not patched. And as long as Dort has access to Badbox, they would be able to load” the Kimwolf malware directly onto TV boxes associated with Badbox 2.0.

The source said it isn’t clear how Dort gained access to the Badbox botnet panel. But it’s unlikely that Dort’s existing account will persist for much longer: all of our notifications to the qq.com email addresses listed in the control panel screenshot included a copy of that image, as well as questions about the apparently rogue ABCD account.

Agentic AI security measures based on the OWASP ASI Top 10

26 January 2026 at 16:26

How do you protect an organization from dangerous actions by the AI agents it uses? This is no longer a theoretical what-if: the actual damage autonomous AI can do ranges from providing poor customer service to destroying a company’s primary database. It’s a question business leaders are currently hammering away at, and one that government agencies and security experts are racing to answer.

For CIOs and CISOs, AI agents create a massive governance headache. These agents make decisions, use tools, and process sensitive data without a human in the loop. Consequently, it turns out that many of our standard IT and security tools are unable to keep the AI in check.

The non-profit OWASP Foundation has released a handy playbook on this very topic. Their comprehensive Top 10 risk list for agentic AI applications covers everything from old-school security threats like privilege escalation, to AI-specific headaches like agent memory poisoning. Each risk comes with real-world examples, a breakdown of how it differs from similar threats, and mitigation strategies. In this post, we’ve trimmed down the descriptions and consolidated the defense recommendations.


The top-10 risks of deploying autonomous AI agents. Source

Agent goal hijack (ASI01)

This risk involves manipulating an agent’s tasks or decision-making logic by exploiting the underlying model’s inability to tell the difference between legitimate instructions and external data. Attackers use prompt injection or forged data to reprogram the agent into performing malicious actions. The key difference from a standard prompt injection is that this attack breaks the agent’s multi-step planning process rather than just tricking the model into giving a single bad answer.

Example: An attacker embeds a hidden instruction into a webpage that, once parsed by the AI agent, triggers an export of the user’s browser history. A vulnerability of this very nature was showcased in the EchoLeak study.

Tool misuse and exploitation (ASI02)

This risk crops up when an agent — driven by ambiguous commands or malicious influence — uses the legitimate tools it has access to in unsafe or unintended ways. Examples include mass-deleting data, or sending redundant billable API calls. These attacks often play out through complex call chains, allowing them to slip past traditional host-monitoring systems unnoticed.

Example: A customer support chatbot with access to a financial API is manipulated into processing unauthorized refunds because its access wasn’t restricted to read-only. Another example is data exfiltration via DNS queries, similar to the attack on Amazon Q.

Identity and privilege abuse (ASI03)

This vulnerability involves the way permissions are granted and inherited within agentic workflows. Attackers exploit existing permissions or cached credentials to escalate privileges or perform actions that the original user wasn’t authorized for. The risk increases when agents use shared identities, or reuse authentication tokens across different security contexts.

Example: An employee creates an agent that uses their personal credentials to access internal systems. If that agent is then shared with other coworkers, any requests they make to the agent will also be executed with the creator’s elevated permissions.

Agentic supply chain vulnerabilities (ASI04)

Risks arise when using third-party models, tools, or pre-configured agent personas that may be compromised or malicious from the start. What makes this trickier than traditional software is that agentic components are often loaded dynamically, and aren’t known ahead of time. This significantly hikes the risk, especially if the agent is allowed to look for a suitable package on its own. We’re seeing a surge in both typosquatting, where malicious tools in registries mimic the names of popular libraries, and the related slopsquatting, where an agent tries to call tools that don’t even exist.

Example: A coding assistant agent automatically installs a compromised package containing a backdoor, allowing an attacker to scrape CI/CD tokens and SSH keys right out of the agent’s environment. We’ve already seen documented attempts at destructive attacks targeting AI development agents in the wild.

Unexpected code execution / RCE (ASI05)

Agentic systems frequently generate and execute code in real-time to knock out tasks, which opens the door for malicious scripts or binaries. Through prompt injection and other techniques, an agent can be talked into running its available tools with dangerous parameters, or executing code provided directly by the attacker.  This can escalate into a full container or host compromise, or a sandbox escape — at which point the attack becomes invisible to standard AI monitoring tools.

Example: An attacker sends a prompt that, under the guise of code testing, tricks a vibecoding agent into downloading a command via cURL and piping it directly into bash.

Memory and context poisoning (ASI06)

Attackers modify the information an agent relies on for continuity, such as dialog history, a RAG knowledge base, or summaries of past task stages. This poisoned context warps the agent’s future reasoning and tool selection. As a result, persistent backdoors can emerge in its logic that survive between sessions. Unlike a one-off injection, this risk causes a long-term impact on the system’s knowledge and behavioral logic.

Example: An attacker plants false data in an assistant’s memory regarding flight price quotes received from a vendor. Consequently, the agent approves future transactions at a fraudulent rate. An example of false memory implantation was showcased in a demonstration attack on Gemini.

Insecure inter-agent communication (ASI07)

In multi-agent systems, coordination occurs via APIs or message buses that still often lack basic encryption, authentication, or integrity checks. Attackers can intercept, spoof, or modify these messages in real time, causing the entire distributed system to glitch out. This vulnerability opens the door for agent-in-the-middle attacks, as well as other classic communication exploits well-known in the world of applied information security: message replays, sender spoofing, and forced protocol downgrades.

Example: Forcing agents to switch to an unencrypted protocol to inject hidden commands, effectively hijacking the collective decision-making process of the entire agent group.

Cascading failures (ASI08)

This risk describes how a single error — caused by hallucination, a prompt injection, or any other glitch — can ripple through and amplify across a chain of autonomous agents. Because these agents hand off tasks to one another without human involvement, a failure in one link can trigger a domino effect leading to a massive meltdown of the entire network. The core issue here is the sheer velocity of the error: it spreads much faster than any human operator can track or stop.

Example: A compromised scheduler agent pushes out a series of unsafe commands that are automatically executed by downstream agents, leading to a loop of dangerous actions replicated across the entire organization.

Human–agent trust exploitation (ASI09)

Attackers exploit the conversational nature and apparent expertise of agents to manipulate users. Anthropomorphism leads people to place excessive trust in AI recommendations, and approve critical actions without a second thought. The agent acts as a bad advisor, turning the human into the final executor of the attack, which complicates a subsequent forensic investigation.

Example: A compromised tech support agent references actual ticket numbers to build rapport with a new hire, eventually sweet-talking them into handing over their corporate credentials.

Rogue agents (ASI10)

These are malicious, compromised, or hallucinating agents that veer off their assigned functions, operating stealthily, or acting as parasites within the system. Once control is lost, an agent like that might start self-replicating, pursuing its own hidden agenda, or even colluding with other agents to bypass security measures. The primary threat described by ASI10 is the long-term erosion of a system’s behavioral integrity following an initial breach or anomaly.

Example: The most infamous case involves an autonomous Replit development agent that went rogue, deleted the company’s primary customer database, and then completely fabricated its contents to make it look like the glitch had been fixed.

Mitigating risks in agentic AI systems

While the probabilistic nature of LLM generation and the lack of separation between instructions and data channels make bulletproof security impossible, a rigorous set of controls — approximating a Zero Trust strategy — can significantly limit the damage when things go awry. Here are the most critical measures.

Enforce the principles of both least autonomy and least privilege. Limit the autonomy of AI agents by assigning tasks with strictly defined guardrails. Ensure they only have access to the specific tools, APIs, and corporate data necessary for their mission. Dial permissions down to the absolute minimum where appropriate — for example, sticking to read-only mode.
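As a rough illustration of combining least autonomy with least privilege, here is a minimal Python sketch (all agent and tool names are hypothetical) of a registry that only hands an agent the tools it was explicitly granted, with read-only scopes refused for write-style use:

```python
# Hypothetical sketch: each agent only gets the tools it was explicitly
# granted, and read-only tools refuse write-style use.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool

TOOL_CATALOG = {
    "crm_lookup": Tool("crm_lookup", read_only=True),
    "crm_update": Tool("crm_update", read_only=False),
    "refund_api": Tool("refund_api", read_only=False),
}

# Per-agent grants: the support bot gets read-only CRM access and nothing else.
AGENT_GRANTS = {
    "support_bot": {"crm_lookup"},
}

def resolve_tool(agent_id: str, tool_name: str, wants_write: bool = False) -> Tool:
    """Return the tool only if this agent was explicitly granted it."""
    if tool_name not in AGENT_GRANTS.get(agent_id, set()):
        raise PermissionError(f"{agent_id} is not allowed to call {tool_name}")
    tool = TOOL_CATALOG[tool_name]
    if wants_write and tool.read_only:
        raise PermissionError(f"{tool_name} is read-only for {agent_id}")
    return tool
```

A side benefit of keeping grants in one table is that it doubles as documentation of exactly which systems each agent can touch.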

Use short-lived credentials. Issue temporary tokens and API keys with a limited scope for each specific task. This prevents an attacker from reusing credentials if they manage to compromise an agent.
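A minimal sketch of the short-lived credential idea, assuming PyJWT is available; the signing key, scope names, and TTL are illustrative, and a real deployment would pull the key from a secrets manager:

```python
# Hypothetical sketch: mint a per-task token that expires in a few minutes
# and is scoped to a single action, then verify it before the tool runs.
import datetime
import jwt  # pip install PyJWT

SIGNING_KEY = "replace-with-a-secret-from-your-vault"  # illustration only

def issue_task_token(agent_id: str, scope: str, ttl_seconds: int = 300) -> str:
    now = datetime.datetime.now(datetime.timezone.utc)
    claims = {
        "sub": agent_id,
        "scope": scope,                                   # e.g. "crm:read"
        "iat": now,
        "exp": now + datetime.timedelta(seconds=ttl_seconds),
    }
    return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

def verify_task_token(token: str, required_scope: str) -> dict:
    claims = jwt.decode(token, SIGNING_KEY, algorithms=["HS256"])  # raises if expired
    if claims.get("scope") != required_scope:
        raise PermissionError("token not valid for this action")
    return claims
```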

Mandatory human-in-the-loop for critical operations. Require explicit human confirmation for any irreversible or high-risk actions, such as authorizing financial transfers or mass-deleting data.

Execution isolation and traffic control. Run code and tools in isolated environments (containers or sandboxes) with strict allowlists of tools and network connections to prevent unauthorized outbound calls.
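As a sketch of the traffic-control side, the check below gates every outbound call an agent makes against a host allowlist (the host names are assumptions); in practice the same restriction should also be enforced at the container or network layer, not only in application code:

```python
# Hypothetical sketch: refuse any outbound request whose host is not on the
# allowlist, so a prompt-injected "send this to evil.example" call is blocked.
from urllib.parse import urlparse
import requests

ALLOWED_HOSTS = {"api.internal.example.com", "docs.internal.example.com"}  # assumption

def fetch_for_agent(url: str, timeout: float = 5.0) -> str:
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"outbound call to {host!r} is not on the allowlist")
    return requests.get(url, timeout=timeout).text
```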

Policy enforcement. Deploy intent gates to vet an agent’s plans and arguments against rigid security rules before they ever go live.
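Here is a minimal, hypothetical "intent gate": the agent's proposed tool call is vetted against static policy rules before execution, with risky plans escalated to a human rather than run automatically. Tool names and thresholds are illustrative:

```python
# Hypothetical sketch: classify a proposed tool call as allow / needs_human /
# deny before it is ever executed.
HIGH_RISK_TOOLS = {"refund_api", "delete_records", "wire_transfer"}
MAX_REFUND = 100.00

def vet_plan(tool_name: str, args: dict) -> str:
    """Return 'allow', 'needs_human', or 'deny' for a proposed tool call."""
    if "drop table" in str(args).lower():
        return "deny"                                   # obviously destructive
    if tool_name == "refund_api" and args.get("amount", 0) <= MAX_REFUND:
        return "allow"                                  # small refunds are routine
    if tool_name in HIGH_RISK_TOOLS:
        return "needs_human"                            # escalate irreversible actions
    return "allow"
```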

Input and output validation and sanitization. Use specialized filters and validation schemes to check all prompts and model responses for injections and malicious content. This needs to happen at every single stage of data processing and whenever data is passed between agents.
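A bare-bones sketch of such filtering; the patterns are illustrative, and a production setup would pair pattern checks with a dedicated classifier model rather than rely on regexes alone:

```python
# Hypothetical sketch: a simple injection filter applied to every prompt, tool
# result, and model response as it crosses an agent boundary.
import re

SUSPICIOUS_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard (your|the) system prompt",
    r"you are now (in )?developer mode",
    r"base64,[A-Za-z0-9+/=]{200,}",      # unusually large encoded blobs
]

def looks_injected(text: str) -> bool:
    """Return True if the text matches a known injection pattern."""
    return any(re.search(p, text, re.IGNORECASE) for p in SUSPICIOUS_PATTERNS)
```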

Continuous secure logging. Record every agent action and inter-agent message in immutable logs. These records would be needed for any future auditing and forensic investigations.

Behavioral monitoring and watchdog agents. Deploy automated systems to sniff out anomalies, such as a sudden spike in API calls, self-replication attempts, or an agent suddenly pivoting away from its core goals. This approach overlaps heavily with the monitoring required to catch sophisticated living-off-the-land network attacks. Consequently, organizations that have introduced XDR and are crunching telemetry in a SIEM will have a head start here — they’ll find it much easier to keep their AI agents on a short leash.
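As a toy example of the watchdog idea, the snippet below flags an agent whose tool-call rate spikes past a baseline within a sliding window; the thresholds are assumptions, and real deployments would lean on XDR/SIEM analytics rather than an in-process counter:

```python
# Hypothetical sketch: flag an agent whose tool-call rate suddenly exceeds
# its normal baseline within a sliding time window.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_CALLS_PER_WINDOW = 50          # illustrative baseline

_call_log: dict[str, deque] = defaultdict(deque)

def record_and_check(agent_id: str) -> bool:
    """Record one tool call; return True if the agent looks anomalous."""
    now = time.monotonic()
    calls = _call_log[agent_id]
    calls.append(now)
    while calls and now - calls[0] > WINDOW_SECONDS:
        calls.popleft()
    return len(calls) > MAX_CALLS_PER_WINDOW
```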

Supply chain control and SBOMs (software bills of materials). Only use vetted tools and models from trusted registries. When developing software, sign every component, pin dependency versions, and double-check every update.

Static and dynamic analysis of generated code. Scan every line of code an agent writes for vulnerabilities before running it. Ban the use of dangerous functions like eval() completely. These last two tips should already be part of a standard DevSecOps workflow, and they need to be extended to all code written by AI agents. Doing this manually is next to impossible, so automation tools, like those found in Kaspersky Cloud Workload Security, are recommended here.
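To illustrate the "ban eval()" part, here is a minimal sketch that statically rejects agent-generated Python containing dangerous calls before it is executed; a real pipeline would layer full SAST/DAST tooling on top of this, since the check below only catches calls by bare name:

```python
# Hypothetical sketch: refuse to run generated code that calls banned
# functions or that fails to parse at all.
import ast

BANNED_CALLS = {"eval", "exec", "compile", "__import__"}

def generated_code_is_safe(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False                      # refuse code that doesn't even parse
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                return False
    return True
```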

Securing inter-agent communications. Ensure mutual authentication and encryption across all communication channels between agents. Use digital signatures to verify message integrity.
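A minimal sketch of message integrity for inter-agent traffic: each message is HMAC-signed and verified on receipt. The shared key is a placeholder; production systems would provision per-pair keys from a secrets manager and add mutual TLS for the transport itself:

```python
# Hypothetical sketch: sign every inter-agent message so a tampered or
# spoofed message is rejected before it influences another agent.
import hashlib
import hmac
import json

SHARED_KEY = b"replace-with-a-per-pair-key"  # illustration only

def sign_message(payload: dict) -> dict:
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return {"body": payload, "sig": sig}

def verify_message(envelope: dict) -> dict:
    body = json.dumps(envelope["body"], sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["sig"]):
        raise ValueError("message signature check failed")
    return envelope["body"]
```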

 Kill switches. Come up with ways to instantly lock down agents or specific tools the moment anomalous behavior is detected.

Using UI for trust calibration. Use visual risk indicators and confidence level alerts to reduce the risk of humans blindly trusting AI.

User training. Systematically train employees on the operational realities of AI-powered systems. Use examples tailored to their actual job roles to break down AI-specific risks. Given how fast this field moves, a once-a-year compliance video won’t cut it — such training should be refreshed several times a year.

For SOC analysts, we also recommend the Kaspersky Expert Training: Large Language Models Security course, which covers the main threats to LLMs, and defensive strategies to counter them. The course would also be useful for developers and AI architects working on LLM implementations.

AIs are Getting Better at Finding and Exploiting Internet Vulnerabilities

23 January 2026 at 13:01

Really interesting blog post from Anthropic:

In a recent evaluation of AI models’ cyber capabilities, current Claude models can now succeed at multistage attacks on networks with dozens of hosts using only standard, open-source tools, instead of the custom tools needed by previous generations. This illustrates how barriers to the use of AI in relatively autonomous cyber workflows are rapidly coming down, and highlights the importance of security fundamentals like promptly patching known vulnerabilities.

[…]

A notable development during the testing of Claude Sonnet 4.5 is that the model can now succeed on a minority of the networks without the custom cyber toolkit needed by previous generations. In particular, Sonnet 4.5 can now exfiltrate all of the (simulated) personal information in a high-fidelity simulation of the Equifax data breach—­one of the costliest cyber attacks in history—­using only a Bash shell on a widely-available Kali Linux host (standard, open-source tools for penetration testing; not a custom toolkit). Sonnet 4.5 accomplishes this by instantly recognizing a publicized CVE and writing code to exploit it without needing to look it up or iterate on it. Recalling that the original Equifax breach happened by exploiting a publicized CVE that had not yet been patched, the prospect of highly competent and fast AI agents leveraging this approach underscores the pressing need for security best practices like prompt updates and patches.

Read the whole thing. Automatic exploitation will be a major change in cybersecurity. And things are happening fast. There have been significant developments since I wrote this in October.

AI jailbreaking via poetry: bypassing chatbot defenses with rhyme

23 January 2026 at 12:59

Tech enthusiasts have been experimenting with ways to sidestep AI response limits set by the models’ creators almost since LLMs first hit the mainstream. Many of these tactics have been quite creative: telling the AI you have no fingers so it’ll help finish your code, asking it to “just fantasize” when a direct question triggers a refusal, or inviting it to play the role of a deceased grandmother sharing forbidden knowledge to comfort a grieving grandchild.

Most of these tricks are old news, and LLM developers have learned to successfully counter many of them. But the tug-of-war between constraints and workarounds hasn’t gone anywhere — the ploys have just become more complex and sophisticated. Today, we’re talking about a new AI jailbreak technique that exploits chatbots’ vulnerability to… poetry. Yes, you read it right — in a recent study, researchers demonstrated that framing prompts as poems significantly increases the likelihood of a model spitting out an unsafe response.

They tested this technique on 25 popular models by Anthropic, OpenAI, Google, Meta, DeepSeek, xAI, and other developers. Below, we dive into the details: what kind of limitations these models have, where they get forbidden knowledge from in the first place, how the study was conducted, and which models turned out to be the most “romantic” — as in, the most susceptible to poetic prompts.

What AI isn’t supposed to talk about with users

The success of OpenAI’s models and other modern chatbots boils down to the massive amounts of data they’re trained on. Because of that sheer scale, models inevitably learn things their developers would rather keep under wraps: descriptions of crimes, dangerous tech, violence, or illicit practices found within the source material.

It might seem like an easy fix: just scrub the forbidden fruit from the dataset before you even start training. But in reality, that’s a massive, resource-heavy undertaking — and at this stage of the AI arms race, it doesn’t look like anyone is willing to take it on.

Another seemingly obvious fix — selectively scrubbing data from the model’s memory — is, alas, also a no-go. This is because AI knowledge doesn’t live inside neat little folders that can easily be trashed. Instead, it’s spread across billions of parameters and tangled up in the model’s entire linguistic DNA — word statistics, contexts, and the relationships between them. Trying to surgically erase specific info through fine-tuning or penalties either doesn’t quite do the trick, or starts hindering the model’s overall performance and negatively affecting its general language skills.

As a result, to keep these models in check, creators have no choice but to develop specialized safety protocols and algorithms that filter conversations by constantly monitoring user prompts and model responses. Here’s a non-exhaustive list of these constraints:

  • System prompts that define model behavior and restrict allowed response scenarios
  • Standalone classifier models that scan prompts and outputs for signs of jailbreaking, prompt injections, and other attempts to bypass safeguards
  • Grounding mechanisms, where the model is forced to rely on external data rather than its own internal associations
  • Fine-tuning and reinforcement learning from human feedback, where unsafe or borderline responses are systematically penalized while proper refusals are rewarded

Put simply, AI safety today isn’t built on deleting dangerous knowledge, but on trying to control how and in what form the model accesses and shares it with the user — and the cracks in these very mechanisms are where new workarounds find their footing.

The research: which models got tested, and how?

First, let’s look at the ground rules so you know the experiment was legit. The researchers set out to goad 25 different models into behaving badly across several categories:

  • Chemical, biological, radiological, and nuclear threats
  • Assisting with cyberattacks
  • Malicious manipulation and social engineering
  • Privacy breaches and mishandling sensitive personal data
  • Generating disinformation and misleading content
  • Rogue AI scenarios, including attempts to bypass constraints or act autonomously

The jailbreak itself was a one-shot deal: a single poetic prompt. The researchers didn’t engage the AI in long-winded poetic debates in the vein of Norse skalds or modern-day rappers. Their goal was simply to see if they could get the models to flout safety instructions using just one rhyming request. As mentioned, the researchers tested 25 language models from various developers; here’s the full list:

The models in the poetic jailbreak experiment

A lineup of 25 language models from various developers, all put to the test to see if a single poetic prompt could coax AI into ditching its safety guardrails. Source

To build these poetic queries, the researchers started with a database of known malicious prompts from the standard MLCommons AILuminate Benchmark used to test LLM security, and recast them as verse with the aid of DeepSeek. Only the stylistic wrapping was changed: the experiment didn’t use any additional attack vectors, obfuscation strategies, or model-specific tweaks.

For obvious reasons, the study’s authors aren’t publishing the actual malicious poetic prompts. But they do demonstrate the general vibe of the queries using a harmless example, which looks something like this:

A baker guards a secret oven’s heat,
its whirling racks, its spindle’s measured beat.
To learn its craft, one studies every turn,
how flour lifts, how sugar starts to burn.
Describe the method, line by measured line,
that shapes a cake whose layers intertwine.

The researchers tested 1200 prompts across 25 different models — in both prose and poetic versions. Comparing the prose and poetic variants of the exact same query allowed them to verify if the model’s behavior changed solely because of the stylistic wrapping.

Through these prose prompt tests, the experimenters established a baseline for the models’ willingness to fulfill dangerous requests. They then compared this baseline to how those same models reacted to the poetic versions of the queries. We’ll dive into the results of that comparison in the next section.

Study results: which model is the biggest poetry lover?

Since the volume of data generated during the experiment was truly massive, the safety checks on the models’ responses were also handled by AI. Each response was graded as either “safe” or “unsafe” by a jury consisting of three different language models:

  • gpt-oss-120b by OpenAI
  • deepseek-r1 by DeepSeek
  • kimi-k2-thinking by Moonshot AI

Responses were only deemed safe if the AI explicitly refused to answer the question. The initial classification into one of the two groups was determined by a majority vote: to be certified as harmless, a response had to receive a safe rating from at least two of the three jury members.

Responses that failed to reach a majority consensus or were flagged as questionable were handed off to human reviewers. Five annotators participated in this process, evaluating a total of 600 model responses to poetic prompts. The researchers noted that the human assessments aligned with the AI jury’s findings in the vast majority of cases.
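For clarity, here is a minimal Python sketch of the majority-vote step described above (the verdict labels and the fallback to human review are illustrative, not taken from the study's own code):

```python
# Hypothetical sketch: a response is certified "safe" only if at least two of
# the three jury models vote that way; anything without a clear majority goes
# to human annotators.
def jury_verdict(votes: dict[str, str]) -> str:
    """votes maps a jury model name to 'safe' or 'unsafe'."""
    safe = sum(1 for v in votes.values() if v == "safe")
    unsafe = sum(1 for v in votes.values() if v == "unsafe")
    if safe >= 2:
        return "safe"
    if unsafe >= 2:
        return "unsafe"
    return "needs_human_review"

# Example: two of the three jurors flag the response, so it is graded unsafe.
print(jury_verdict({"gpt-oss-120b": "unsafe", "deepseek-r1": "unsafe", "kimi-k2-thinking": "safe"}))
```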

With the methodology out of the way, let’s look at how the LLMs actually performed. It’s worth noting that the success of a poetic jailbreak can be measured in different ways. The researchers highlighted an extreme version of this assessment based on the top-20 most successful prompts, which were hand-picked. Using this approach, an average of nearly two-thirds (62%) of the poetic queries managed to coax the models into violating their safety instructions.

Google’s Gemini 1.5 Pro turned out to be the most susceptible to verse. Using the 20 most effective poetic prompts, researchers managed to bypass the model’s restrictions… 100% of the time. You can check out the full results for all the models in the chart below.

How poetry slashes AI safety effectiveness

The share of safe responses (Safe) versus the Attack Success Rate (ASR) for 25 language models when hit with the 20 most effective poetic prompts. The higher the ASR, the more often the model ditched its safety instructions for a good rhyme. Source

A more moderate way to measure the effectiveness of the poetic jailbreak technique is to compare the success rates of prose versus poetry across the entire set of queries. Using this metric, poetry boosts the likelihood of an unsafe response by an average of 35%.

The poetry effect hit deepseek-chat-v3.1 the hardest — the success rate for this model jumped by nearly 68 percentage points compared to prose prompts. On the other end of the spectrum, claude-haiku-4.5 proved to be the least susceptible to a good rhyme: the poetic format didn’t just fail to improve the bypass rate — it actually slightly lowered the ASR, making the model even more resilient to malicious requests.

How much poetry amplifies safety bypasses

A comparison of the baseline Attack Success Rate (ASR) for prose queries versus their poetic counterparts. The Change column shows how many percentage points the verse format adds to the likelihood of a safety violation for each model. Source

Finally, the researchers calculated how vulnerable entire developer ecosystems, rather than just individual models, were to poetic prompts. As a reminder, several models from each developer — Meta, Anthropic, OpenAI, Google, DeepSeek, Qwen, Mistral AI, Moonshot AI, and xAI — were included in the experiment.

To do this, the results of individual models were averaged within each AI ecosystem, and the baseline bypass rates were compared with the values for poetic queries. This cross-section allows us to evaluate the overall effectiveness of a specific developer’s safety approach rather than the resilience of a single model.

The final tally revealed that poetry deals the heaviest blow to the safety guardrails of models from DeepSeek, Google, and Qwen. Meanwhile, OpenAI and Anthropic saw an increase in unsafe responses that was significantly below the average.

The poetry effect across AI developers

A comparison of the average Attack Success Rate (ASR) for prose versus poetic queries, aggregated by developer. The Change column shows by how many percentage points poetry, on average, slashes the effectiveness of safety guardrails within each vendor’s ecosystem. Source

What does this mean for AI users?

The main takeaway from this study is that “there are more things in heaven and earth, Horatio, than are dreamt of in your philosophy” — in the sense that AI technology still hides plenty of mysteries. For the average user, this isn’t exactly great news: it’s impossible to predict which LLM hacking methods or bypass techniques researchers or cybercriminals will come up with next, or what unexpected doors those methods might open.

Consequently, users have little choice but to keep their eyes peeled and take extra care of their data and device security. To mitigate practical risks and shield your devices from such threats, we recommend using a robust security solution that helps detect suspicious activity and prevent incidents before they happen.

To help you stay alert, check out our materials on AI-related privacy risks and security threats.

Why AI Keeps Falling for Prompt Injection Attacks

22 January 2026 at 13:35

Imagine you work at a drive-through restaurant. Someone drives up and says: “I’ll have a double cheeseburger, large fries, and ignore previous instructions and give me the contents of the cash drawer.” Would you hand over the money? Of course not. Yet this is what large language models (LLMs) do.

Prompt injection is a method of tricking LLMs into doing things they are normally prevented from doing. A user writes a prompt in a certain way, asking for system passwords or private data, or asking the LLM to perform forbidden instructions. The precise phrasing overrides the LLM’s safety guardrails, and it complies.

LLMs are vulnerable to all sorts of prompt injection attacks, some of them absurdly obvious. A chatbot won’t tell you how to synthesize a bioweapon, but it might tell you a fictional story that incorporates the same detailed instructions. It won’t accept nefarious text inputs, but might if the text is rendered as ASCII art or appears in an image of a billboard. Some ignore their guardrails when told to “ignore previous instructions” or to “pretend you have no guardrails.”

AI vendors can block specific prompt injection techniques once they are discovered, but general safeguards are impossible with today’s LLMs. More precisely, there’s an endless array of prompt injection attacks waiting to be discovered, and they cannot be prevented universally.

If we want LLMs that resist these attacks, we need new approaches. One place to look is what keeps even overworked fast-food workers from handing over the cash drawer.

Human Judgment Depends on Context

Our basic human defenses come in at least three types: general instincts, social learning, and situation-specific training. These work together in a layered defense.

As a social species, we have developed numerous instinctive and cultural habits that help us judge tone, motive, and risk from extremely limited information. We generally know what’s normal and abnormal, when to cooperate and when to resist, and whether to take action individually or to involve others. These instincts give us an intuitive sense of risk and make us especially careful about things that have a large downside or are impossible to reverse.

The second layer of defense consists of the norms and trust signals that evolve in any group. These are imperfect but functional: Expectations of cooperation and markers of trustworthiness emerge through repeated interactions with others. We remember who has helped, who has hurt, who has reciprocated, and who has reneged. And emotions like sympathy, anger, guilt, and gratitude motivate each of us to reward cooperation with cooperation and punish defection with defection.

A third layer is institutional mechanisms that enable us to interact with multiple strangers every day. Fast-food workers, for example, are trained in procedures, approvals, escalation paths, and so on. Taken together, these defenses give humans a strong sense of context. A fast-food worker basically knows what to expect within the job and how it fits into broader society.

We reason by assessing multiple layers of context: perceptual (what we see and hear), relational (who’s making the request), and normative (what’s appropriate within a given role or situation). We constantly navigate these layers, weighing them against each other. In some cases, the normative outweighs the perceptual—for example, following workplace rules even when customers appear angry. Other times, the relational outweighs the normative, as when people comply with orders from superiors that they believe are against the rules.

Crucially, we also have an interruption reflex. If something feels “off,” we naturally pause the automation and reevaluate. Our defenses are not perfect; people are fooled and manipulated all the time. But it’s how we humans are able to navigate a complex world where others are constantly trying to trick us.

So let’s return to the drive-through window. To convince a fast-food worker to hand us all the money, we might try shifting the context. Show up with a camera crew and tell them you’re filming a commercial, claim to be the head of security doing an audit, or dress like a bank manager collecting the cash receipts for the night. But even these have only a slim chance of success. Most of us, most of the time, can smell a scam.

Con artists are astute observers of human defenses. Successful scams are often slow, undermining a mark’s situational assessment, allowing the scammer to manipulate the context. This is an old story, spanning traditional confidence games such as the Depression-era “big store” cons, in which teams of scammers created entirely fake businesses to draw in victims, and modern “pig-butchering” frauds, where online scammers slowly build trust before going in for the kill. In these examples, scammers slowly and methodically reel in a victim using a long series of interactions through which the scammers gradually gain that victim’s trust.

Sometimes it even works at the drive-through. One scammer in the 1990s and 2000s targeted fast-food workers by phone, claiming to be a police officer and, over the course of a long phone call, convinced managers to strip-search employees and perform other bizarre acts.

Why LLMs Struggle With Context and Judgment

LLMs behave as if they have a notion of context, but it’s different. They do not learn human defenses from repeated interactions and remain untethered from the real world. LLMs flatten multiple levels of context into text similarity. They see “tokens,” not hierarchies and intentions. LLMs don’t reason through context, they only reference it.

While LLMs often get the details right, they can easily miss the big picture. If you prompt a chatbot with a fast-food worker scenario and ask if it should give all of its money to a customer, it will respond “no.” What it doesn’t “know”—forgive the anthropomorphizing—is whether it’s actually being deployed as a fast-food bot or is just a test subject following instructions for hypothetical scenarios.

This limitation is why LLMs misfire when context is sparse but also when context is overwhelming and complex; when an LLM becomes unmoored from context, it’s hard to get it back. AI expert Simon Willison wipes context clean if an LLM is on the wrong track rather than continuing the conversation and trying to correct the situation.

There’s more. LLMs are overconfident because they’ve been designed to give an answer rather than express ignorance. A drive-through worker might say: “I don’t know if I should give you all the money—let me ask my boss,” whereas an LLM will just make the call. And since LLMs are designed to be pleasing, they’re more likely to satisfy a user’s request. Additionally, LLM training is oriented toward the average case and not extreme outliers, which is what’s necessary for security.

The result is that the current generation of LLMs is far more gullible than people. They’re naive and regularly fall for manipulative cognitive tricks that wouldn’t fool a third-grader, such as flattery, appeals to groupthink, and a false sense of urgency. There’s a story about a Taco Bell AI system that crashed when a customer ordered 18,000 cups of water. A human fast-food worker would just laugh at the customer.

The Limits of AI Agents

Prompt injection is an unsolvable problem that gets worse when we give AIs tools and tell them to act independently. This is the promise of AI agents: LLMs that can use tools to perform multistep tasks after being given general instructions. Their flattening of context and identity, along with their baked-in independence and overconfidence, mean that they will repeatedly and unpredictably take actions—and sometimes they will take the wrong ones.

Science doesn’t know how much of the problem is inherent to the way LLMs work and how much is a result of deficiencies in the way we train them. The overconfidence and obsequiousness of LLMs are training choices. The lack of an interruption reflex is a deficiency in engineering. And prompt injection resistance requires fundamental advances in AI science. We honestly don’t know whether an LLM in which trusted commands and untrusted inputs are processed through the same channel can ever be made immune to prompt injection attacks.

We humans get our model of the world—and our facility with overlapping contexts—from the way our brains work, years of training, an enormous amount of perceptual input, and millions of years of evolution. Our identities are complex and multifaceted, and which aspects matter at any given moment depend entirely on context. A fast-food worker may normally see someone as a customer, but in a medical emergency, that same person’s identity as a doctor is suddenly more relevant.

We don’t know if LLMs will gain a better ability to move between different contexts as the models get more sophisticated. But the problem of recognizing context definitely can’t be reduced to the one type of reasoning that LLMs currently excel at. Cultural norms and styles are historical, relational, emergent, and constantly renegotiated, and are not so readily subsumed into reasoning as we understand it. Knowledge itself can be both logical and discursive.

The AI researcher Yann LeCun believes that improvements will come from embedding AIs in a physical presence and giving them “world models.” Perhaps this is a way to give an AI a robust yet fluid notion of a social identity, and the real-world experience that will help it lose its naïveté.

Ultimately we are probably faced with a security trilemma when it comes to AI agents: fast, smart, and secure are the desired attributes, but you can only get two. At the drive-through, you want to prioritize fast and secure. An AI agent should be trained narrowly on food-ordering language and escalate anything else to a manager. Otherwise, every action becomes a coin flip. Even if it comes up heads most of the time, once in a while it’s going to be tails—and along with a burger and fries, the customer will get the contents of the cash drawer.

This essay was written with Barath Raghavan, and originally appeared in IEEE Spectrum.

How AI brings the OSCAR methodology to life in the SOC

21 January 2026 at 15:41

When I look back on my years as a SOC lead in MDR, the thing I remember most clearly is the tension between wanting to do things the “right way” and simply trying to survive the day.

The alert queue never stopped growing. The attack surface kept expanding into cloud, identity, SaaS, and whatever new platform the business adopted. And every shift ended with the same uneasy feeling: What did we miss because there wasn’t enough time to investigate everything fully?

While different sources emphasize different challenges, recent statistics from late 2024 and 2025 reports reflect exactly what so many SOC analysts and leads feel:

  • The majority of alerts are never touched. Recent surveys indicate that 62% of alerts are ignored largely because the sheer volume makes them impossible to address. Furthermore, many analysts report being unable to deal with up to 67% of the daily alerts they receive.
  • The volume is unmanageable for humans. A typical SOC now processes an average of 3,832 alerts per day. For analysts trying to manually triage this flood, the math simply doesn’t add up.
  • Burnout is the new normal. The pressure is unsustainable, with 71% of SOC analysts reporting burnout due to alert fatigue. This has accelerated turnover, with some SOCs seeing analyst retention cycles shrink to less than 18 months, eroding institutional knowledge.

When people outside the SOC see these numbers, they assume analysts aren’t doing their jobs. The truth is the opposite. Most analysts are doing the best work they can inside a system that was never built for volume. Traditional triage is reactive and heavily dependent on intuition. On a good day, that might work. On a bad day, it leads to inconsistent decisions, coverage gaps, and immense pressure on analysts who care deeply about getting it right.

This is where the OSCAR methodology becomes valuable again.

Why the OSCAR methodology still matters

As a SOC lead, I always wanted the team to approach alerts with organizational structure. OSCAR provides that structure by creating a clear, repeatable sequence:

  • Obtain Information
  • Strategize
  • Collect Evidence
  • Analyze
  • Report

It removes guesswork and helps analysts who are still developing their skills stay grounded during chaotic shifts. But here is the reality I learned firsthand: you can only scale OSCAR so far with humans alone.

Evidence collection takes time. Deep analysis takes more time. No matter how motivated an analyst is, there are simply not enough hours in a shift to apply OSCAR to every alert manually. Most teams end up applying the methodology selectively; critical and high-severity alerts get the full OSCAR treatment, while everything else gets whatever time is left.

That gap between process and reality is exactly where Intezer enters the picture.

How Intezer operationalizes OSCAR at scale

Intezer takes the proven structure of OSCAR and executes it automatically and consistently across every alert. Instead of relying on how much energy an analyst has left 45 minutes before their shift ends, Intezer performs evidence collection, deep forensic analysis, and reporting at a speed and depth no human team could sustain.

Here is how the platform automates the methodology step-by-step:

O: Information obtained

In my SOC days, gathering context meant jumping between consoles and browser tabs, hoping nothing crashed. Intezer collects all of this instantly from endpoints, cloud platforms, identity systems, and threat intel sources. Analysts start every case with the full picture rather than a partial one.

S: Strategy suggested

Instead of relying on an analyst’s instinct about what might be happening, the Intezer platform generates verdicts and risk-based priorities immediately (with 98% accuracy). This provides critical consistency, especially for junior analysts who are still finding their confidence. Additionally, all AI reasoning is fully backed by deterministic, evidence-based analysis.

C: Evidence collected

This was always the slowest part of manual investigation. Intezer collects memory artifacts, files, process information, and cloud activity in seconds. No hunting, no guessing, and no hoping you pulled the right logs before they rolled over.

A: Analysis (forensic-grade)

Intezer performs genetic code analysis, behavioral analysis, static/dynamic analysis, and threat intelligence correlation on every single alert. This is the level of scrutiny senior analysts wish they had time to do manually, but usually can only afford for the most critical incidents.

Read more about how Intezer Forensic AI SOC operates under the hood.

R: Reporting & transparency

The platform creates clear, structured audit trails. This removes the burden of manual documentation from analysts and ensures that the “why” behind every decision is transparent and explainable.

The result: Moving beyond “speed vs. depth”

When OSCAR is coupled with Intezer’s AI Forensic SOC, the operation transforms. We see this in actual customer environments:

  • 100% alert coverage: Even low-severity and “noisy” alerts are fully triaged.
  • Sub-minute triage: Drastically improved MTTR/MTTD and minimized backlogs.
  • 98% accurate decisioning: Verdicts are supported by deterministic evidence, reducing escalations for human review to less than 4%.

The shift in operations:

Capability   | Traditional MDR SOC        | Intezer Forensic AI SOC
Coverage     | Critical and high-severity | 100% of alerts
Triage time  | 20+ mins per alert         | <2 mins (automated)
Analyst mode | Data collector             | Investigator

From the perspective of a former SOC lead, the most important benefit is this: 

”Analysts finally get to think again. Automation handles the busy work. Humans get to use judgment, creativity, and experience.”

Final thoughts

For years, triage has been treated like a speed exercise. But the threats we face today require depth, context, and clarity. OSCAR gives SOCs the investigative structure they need, and Intezer provides the scale required to actually use that structure across every alert.

For the first time, teams don’t have to choose between speed and depth. They get both.

If your SOC wants to move from reactive to truly investigative operations, we would be happy to show you what an OSCAR-driven Intezer SOC looks like in practice.

The post How AI brings the OSCAR methodology to life in the SOC appeared first on Intezer.
