MalTree: Tracing Malware Evolution from Embeddings at Scale

A phylogenetic lens on malware: reconstructing how families evolve across 103,883 samples and 538 families.

Akash Amalan, Georgios Smaragdakis, Tom Viering
Delft University of Technology
International Conference on Machine Learning (ICML) 2026 · Seoul, South Korea

A simplified view of the malware phylogeny: 32 representative families grouped into IoT botnets, banking and stealer malware, RATs, and ransomware. The full tree of all 538 families is interactive below.

Abstract

Malware detection remains largely reactive: machine learning models trained on known samples degrade as threats evolve. Understanding evolutionary relationships among malware families can inform proactive defense, but traditional reverse engineering can take months to years to uncover such lineage relationships. We propose MalTree, a framework that applies bioinformatics-inspired phylogenetic techniques (UPGMA and Neighbor-Joining) at scale to model malware evolution automatically using structural, behavioral, and image-based features. We introduce temporal validation using VirusTotal timestamps to assess whether inferred trees reflect actual evolutionary order. MalTree achieves 87% temporal consistency, indicating that inferred evolutionary relationships closely align with real-world emergence timelines. Our analysis shows that some families mutate over 10 times faster than others, suggesting that detection strategies should be tailored to family-specific evolutionary tempos. Case studies, including the Mirai botnet, confirm that inferred relationships from our phylogenetic tree align with documented threat intelligence. Our framework provides a foundation for shifting malware analysis from sample-by-sample classification toward lineage-aware evolutionary modeling.

The MalTree Pipeline

MalTree extracts three complementary embeddings from every sample: pseudo-static features from memory dumps, dynamic features from sandbox behavior, and an image embedding of the binary. The embeddings are fused and reduced into a single representation, turned into a pairwise distance matrix, and resolved into a phylogenetic tree with Neighbor-Joining or UPGMA.

Static, dynamic, and image embeddings are concatenated and reduced to a pairwise distance matrix D, from which a phylogenetic tree is built. Clades correspond to functional malware categories.
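The fuse, distance, and tree-building steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MalTree implementation: the toy embeddings and the helper names `fuse`, `distance_matrix`, and `upgma` are hypothetical, Euclidean distance is assumed, the dimensionality-reduction step is omitted, and only the UPGMA (average-linkage) variant is shown.

```python
import math

def fuse(static, dynamic, image):
    """Concatenate the three per-sample embeddings into one vector."""
    return list(static) + list(dynamic) + list(image)

def distance_matrix(vectors):
    """Pairwise Euclidean distances D[i][j] between fused embeddings (assumed metric)."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

def upgma(names, D):
    """Average-linkage (UPGMA) clustering; returns the topology as a Newick string."""
    # Each active cluster maps id -> (Newick label, indices of member samples).
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)

    def avg(a, b):
        # Mean pairwise distance between the members of clusters a and b.
        ma, mb = clusters[a][1], clusters[b][1]
        return sum(D[i][j] for i in ma for j in mb) / (len(ma) * len(mb))

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = ((x, y) for k, x in enumerate(ids) for y in ids[k + 1:])
        a, b = min(pairs, key=lambda p: avg(*p))  # closest pair of clusters
        label = f"({clusters[a][0]},{clusters[b][0]})"
        members = clusters[a][1] + clusters[b][1]
        del clusters[a], clusters[b]
        clusters[next_id] = (label, members)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"

# Hypothetical toy data: (static, dynamic, image) embeddings for four samples.
samples = {
    "a1": ([0.0, 0.1], [1.0], [0.2]),
    "a2": ([0.1, 0.1], [1.0], [0.3]),
    "b1": ([5.0, 5.0], [0.0], [4.0]),
    "b2": ([5.1, 4.9], [0.1], [4.1]),
}
names = list(samples)
vectors = [fuse(*samples[name]) for name in names]
tree = upgma(names, distance_matrix(vectors))
print(tree)  # ((a1,a2),(b1,b2));
```

The brute-force search over cluster pairs is only meant for reading; at the scale of roughly 100,000 samples an optimized library implementation would be needed.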

Interactive Phylogenetic Tree

Explore the full tree of 538 malware families below. Pan and zoom to follow how families branch, cluster, and diverge across the malware landscape.

Key Findings

A validated phylogeny at unprecedented scale

MalTree builds validated phylogenetic trees from 103,883 malware samples across 538 families, the largest such analysis to date. On 20 cores, tree construction takes 11 hours with UPGMA or 3 days with Neighbor-Joining. Neighbor-Joining with outgroup rooting reaches 87.1% temporal consistency against VirusTotal first-submission timestamps, where random ordering would score about 50%. Those timestamps are independent of the family labels used to train the embeddings, so the agreement indicates that the trees capture evolutionary order rather than mere feature similarity.
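One way to read a temporal-consistency score is as a pairwise agreement rate between tree order and timestamp order. The sketch below assumes that definition, which may differ from the metric used in the paper: it treats a sample's distance from the root as a proxy for evolutionary order and counts the fraction of sample pairs in which the deeper sample was also first seen later. The function name `temporal_consistency` and all input values are hypothetical.

```python
from itertools import combinations

def temporal_consistency(depth, first_seen):
    """Fraction of sample pairs whose tree order matches their timestamp order.

    depth:      name -> distance from the root (assumed proxy for evolutionary order)
    first_seen: name -> first-submission time (any comparable number, e.g. a year)
    Pairs tied in either depth or time are skipped.
    """
    agree = total = 0
    for a, b in combinations(sorted(depth), 2):
        d = depth[a] - depth[b]
        t = first_seen[a] - first_seen[b]
        if d == 0 or t == 0:
            continue
        total += 1
        agree += (d > 0) == (t > 0)  # True when both orders point the same way
    return agree / total if total else float("nan")

# Hypothetical toy inputs: s3 is shallower than s4 but was first seen later.
depth = {"s1": 0.2, "s2": 0.5, "s3": 0.9, "s4": 1.4}
first_seen = {"s1": 2016, "s2": 2017, "s3": 2019, "s4": 2018}
score = temporal_consistency(depth, first_seen)
print(f"{score:.3f}")  # 0.833
```

Under this definition a tree whose order is unrelated to the timestamps agrees on about half of the pairs, which matches the roughly 50% baseline quoted above.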

87.1% temporal consistency against VirusTotal timestamps
103,883 malware samples placed in one tree
538 malware families spanning 2010 to 2023

Malware families evolve at very different tempos

Embedding drift per family (distance per year, log scale). Maximum drift varies by more than two orders of magnitude across families; fast movers such as Bashlite and Syslogk drift over ten times faster than slow movers like Loda.

This rate heterogeneity violates the constant-rate (molecular clock) assumption behind UPGMA, and explains why Neighbor-Joining, which lets lineages evolve at different rates, produces better-rooted trees. In practice, detection strategies may need to be tuned to each family's evolutionary tempo.
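A small worked example shows why unequal rates matter. The sketch below uses a hypothetical four-taxon distance table (not data from the paper) that is exactly additive on the tree ((A,B),(C,D)), with B and D on long, fast-evolving branches. It compares the first pair each method would merge: UPGMA picks the smallest raw distance, while Neighbor-Joining picks the pair minimizing the standard criterion Q(i,j) = (n-2)·d(i,j) - r_i - r_j, where r_i is the sum of distances from taxon i to all others.

```python
from itertools import combinations

# Hypothetical additive distances for the true tree ((A,B),(C,D)).
# Branch lengths: A=1, B=4, C=1, D=4, internal edge=1 (B and D evolve fast).
D = {
    ("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
    ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5,
}
taxa = ["A", "B", "C", "D"]

def dist(x, y):
    """Symmetric lookup into the distance table."""
    return 0 if x == y else D[tuple(sorted((x, y)))]

def upgma_first_join():
    """UPGMA merges the closest pair, which is only safe under a constant rate."""
    return min(combinations(taxa, 2), key=lambda p: dist(*p))

def nj_first_join():
    """Neighbor-Joining merges the pair minimizing Q(i,j) = (n-2)*d(i,j) - r_i - r_j."""
    n = len(taxa)
    r = {x: sum(dist(x, y) for y in taxa) for x in taxa}
    return min(
        combinations(taxa, 2),
        key=lambda p: (n - 2) * dist(*p) - r[p[0]] - r[p[1]],
    )

print(upgma_first_join())  # ('A', 'C')
print(nj_first_join())     # ('A', 'B')
```

Here UPGMA joins the two slow-evolving taxa A and C, which are not siblings in the true tree, whereas Neighbor-Joining joins a true sibling pair (A,B ties with C,D at Q = -24, and the first is returned).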

The Mirai lineage, recovered automatically

The inferred Mirai inter-family subgraph. Red edges are validated by public threat intelligence; gray edges lack corroborating evidence. Lower edge weights indicate stronger phylogenetic support.

The 2016 Mirai source-code leak spawned many variants. Without supervision, MalTree recovers five Mirai descendants (Bashlite, Okiru, MooBot, Gafgyt, and RapperBot), each corroborated by documented threat intelligence. Weaker, unverified links to Conti, Turla, and Rdat carry visibly higher edge weights. MalTree thus both confirms known relationships and flags hypotheses that warrant further investigation.

BibTeX

@inproceedings{amalan2026maltree,
  title     = {MalTree: Tracing Malware Evolution from Embeddings at Scale},
  author    = {Amalan, Akash and Smaragdakis, Georgios and Viering, Tom},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  series    = {PMLR},
  volume    = {306},
  year      = {2026}
}