MalwareLens
Problem & Motivation
In today’s rapidly evolving cybersecurity landscape, the volume and sophistication of malware threats continue to outpace traditional detection methods. Security analysts are overwhelmed, and many organizations, especially small to mid-sized businesses, lack the resources or expertise to identify and respond to malicious files effectively. Meanwhile, existing malware detection tools often require advanced knowledge, leaving non-specialists and entry-level analysts at a significant disadvantage.
The challenge is clear: How can organizations of all sizes accurately detect and understand malware without relying on scarce expert resources? MalwareLens was created to solve this pressing issue, leveraging cutting-edge AI to democratize malware detection, reduce manual effort, and empower users with intuitive, real-time threat insights.
Data Source & Data Science Approach
MalwareLens leverages state-of-the-art AI to transform malware detection into a scalable, intuitive, and user-friendly experience. Here's how we built it:
Data Source:
Our pipeline is trained on a curated dataset of over 200,000 benign and malicious executables, drawn primarily from publicly available malware repositories. To reflect real-world diversity, the dataset spans malware families such as ransomware, spyware, and trojans, with benign and malicious samples up to 200 MB in size and malware reported as recently as April 2025. The dataset underwent preprocessing to transform raw binaries into structured formats suitable for deep learning, including byteplot representations of the raw bytes and frequency-domain transformations.
Model Development:
Our detection engine is powered by a Convolutional Neural Network (CNN) trained to classify malware using a novel image-based approach. Executable files are converted into visual representations through two transformations:
Discrete Cosine Transform (DCT): Enables the model to detect patterns and anomalies in the frequency domain. Applied in two dimensions to an N×N matrix f, the DCT-II is:

D(u, v) = α(u) α(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) cos[π(2x+1)u / 2N] cos[π(2y+1)v / 2N],

where α(0) = √(1/N) and α(k) = √(2/N) for k > 0.
Header-Footer Byteplot Transformation: Enables the model to detect critical structural and metadata patterns often present in malware.
Key Steps in the Data Science Workflow:
- Preprocessing & Transformation:
- Executable binaries are read as byte sequences.
- Discrete Cosine Transform:
- Byte bigram frequencies are computed across the entire file.
- DCT is then applied to the frequency counts, mapping them into a matrix that captures underlying frequency patterns as a combination of cosine functions.
- Header-Footer Byteplot Transform:
- The first and last 32,768 bytes from each file are extracted.
- The decimal representation of each byte (0-255) is mapped to a matrix entry, representing the header and footer structures.
- Image Formation:
- The matrices from both transformations are stacked to create a dual-channel image representation of the binary file for prediction.
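The two transformations above can be sketched as follows. This is a minimal illustration using NumPy and SciPy; the 256×256 output size, zero-padding for short files, and orthonormal DCT normalization are assumptions, not the production pipeline:

```python
import numpy as np
from scipy.fft import dct

CHUNK = 32768  # header/footer bytes, per the pipeline description
N = 256        # one row/column per possible byte value

def bigram_dct(data: bytes) -> np.ndarray:
    """2-D DCT of the 256x256 byte-bigram frequency matrix."""
    freq = np.zeros((N, N), dtype=np.float64)
    arr = np.frombuffer(data, dtype=np.uint8)
    if arr.size >= 2:
        np.add.at(freq, (arr[:-1], arr[1:]), 1)  # count adjacent byte pairs
    # Orthonormal DCT-II applied along both axes
    return dct(dct(freq, axis=0, norm="ortho"), axis=1, norm="ortho")

def header_footer_byteplot(data: bytes) -> np.ndarray:
    """First and last 32,768 bytes laid out as a 256x256 matrix."""
    head = np.frombuffer(data[:CHUNK], dtype=np.uint8)
    tail = np.frombuffer(data[-CHUNK:], dtype=np.uint8)
    plane = np.zeros(2 * CHUNK, dtype=np.float64)  # zero-pad short files
    plane[:head.size] = head
    plane[CHUNK:CHUNK + tail.size] = tail
    return plane.reshape(N, N)  # 2 * 32768 = 65536 = 256 * 256

def to_dual_channel(data: bytes) -> np.ndarray:
    """Stack both transforms into a 2-channel image for the CNN."""
    return np.stack([bigram_dct(data), header_footer_byteplot(data)])
```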
- Model Architecture:
- Best-performing architecture: CNN with 2 convolutional layers, achieving 98% test accuracy.
- Input: Transformed DCT and Byteplot images of binaries.
- Output: A malware probability score indicating the likelihood that the file is malicious.
- Loss Function & Training:
- Utilized standard cross-entropy loss with accuracy metrics for classification performance.
- Training was performed on AWS SageMaker with containerized models deployed via ECR for scalability.
- Challenges & Tradeoffs:
- Limited access to recent malware datasets due to security restrictions.
- Tradeoff between visual pattern accuracy and raw hexadecimal fidelity in DCT transformation.
- Dataset aging: the core 2021 dataset offered scale but raised concerns about relevance to current threats.
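The architecture and training objective described above can be sketched in PyTorch. This is a hypothetical reconstruction: only the two-convolutional-layer depth, dual-channel input, and cross-entropy loss come from the description, so layer widths, kernel sizes, and pooling are assumptions:

```python
import torch
import torch.nn as nn

class MalwareCNN(nn.Module):
    """Sketch of a 2-conv-layer classifier over dual-channel 256x256 inputs."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),  # 2 channels: DCT + byteplot
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 256 -> 128
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 128 -> 64
        )
        self.classifier = nn.Linear(32 * 64 * 64, 2)     # benign vs. malicious logits

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = MalwareCNN()
loss_fn = nn.CrossEntropyLoss()  # standard cross-entropy, as in the writeup
```

Softmax over the two output logits yields the malware probability reported to the user.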
Production-Grade Deployment:
The full AI pipeline is deployed on AWS Cloud Infrastructure, ensuring seamless, scalable service delivery:
- Compute & Storage:
- AWS Fargate for serverless model execution.
- Amazon S3 for storing user-submitted files and model outputs.
- Model Serving:
- Inference hosted on AWS SageMaker, enabling real-time predictions.
- API endpoints allow direct interaction between the LLM agent and malware classifiers.
- Intelligent Agent Integration:
- A LangChain-powered LLM agent (Gemini) enables users to ask natural language questions about file structure, function calls, or other malware characteristics.
- Integration with tools like VirusTotal and Ghidra enhances static analysis.
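Conceptually, the agent maps each user question to a subset of registered analysis tools. A framework-free sketch of that dispatch pattern is below; the tool names, stub outputs, and keyword heuristic are illustrative only, since the real system uses LangChain's agent loop with an LLM performing the selection:

```python
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {}

def register(name: str):
    # Decorator that adds a function to the tool registry
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return wrap

@register("cnn_classifier")
def classify(path: str) -> str:
    return f"CNN malware score for {path}: 0.97"  # stub result

@register("virustotal_lookup")
def vt_lookup(path: str) -> str:
    return f"VirusTotal report for {path}: 12/70 engines flagged"  # stub result

def select_tools(question: str) -> list[str]:
    # Keyword heuristic standing in for the LLM's tool-selection step
    keywords = {"cnn_classifier": ["malware", "classify", "score"],
                "virustotal_lookup": ["virustotal", "reputation", "engines"]}
    q = question.lower()
    return [name for name, kws in keywords.items() if any(k in q for k in kws)]
```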
Evaluation
Model performance was primarily assessed using accuracy on a labeled malware test dataset: the proportion of benign and malicious samples the model classifies correctly. While this metric captures overall effectiveness, we also qualitatively reviewed the model’s interpretability.
To benchmark our performance, we compared our custom CNN with two convolutional layers against other conventional machine learning approaches found in academic literature. Our model achieved a 95% accuracy on the test dataset, closely aligning with the top results reported in similar studies using frequency-domain transformations such as DCT.
However, evaluation came with several caveats:
- Data Constraints: Due to limited access to recent and diverse malware samples, we relied on a 2021 dataset. Although this provided volume, it may lack the latest malware variants.
- Transformation Tradeoffs: Converting binaries into DCT-based image representations improved CNN compatibility but came at the cost of losing raw byte-level granularity, which may hinder the detection of certain obfuscation techniques.
- Model Generalization: The image-based classification approach showed strong accuracy but may require further tuning when applied to novel or packed malware, where visual signatures deviate significantly.
Despite these constraints, the model demonstrated robust generalization within the test dataset. Its integration with a large language model (LLM) further enhances usability by offering interpretive, natural language insights, bridging the gap between technical analysis and human decision-making.
The LLM agent has multiple tools at its disposal for analyzing executables beyond our CNN classification model. To evaluate its performance and its ability to select the correct set of tools for each question, we created a synthetic dataset pairing questions with expected tool sets. The agent’s performance was assessed using the following metrics.
- Exactness: A binary measure of whether the agent’s called tool set matches the expected set exactly (all expected tools, and no others).
- Precision: Measures how many of the tools called by the agent are useful to answer the question.
- Jaccard Similarity: Measures the overall similarity between the set of called tools and the expected set.
- LLM-Judge Score: A qualitative measure generated by a separate LLM (GPT-4) acting as a judge. It assigns a score between 1 and 5 based on clarity, helpfulness, and appropriateness of the generation, acting as a proxy for user satisfaction in real-world interactions.
- P90 Latency: The 90th-percentile response time of the agent, essential for real-time applications.
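The set-based metrics above can be computed directly from the called and expected tool sets. A small sketch, assuming tools are compared as sets of names:

```python
def tool_metrics(called: set[str], expected: set[str]) -> dict[str, float]:
    """Exactness, precision, and Jaccard similarity for one agent tool call."""
    # Exactness: 1.0 only when the called set matches the expected set exactly
    exactness = 1.0 if called == expected else 0.0
    # Precision: fraction of called tools that were actually expected
    precision = len(called & expected) / len(called) if called else 0.0
    # Jaccard similarity: intersection over union of the two sets
    union = called | expected
    jaccard = len(called & expected) / len(union) if union else 1.0
    return {"exactness": exactness, "precision": precision, "jaccard": jaccard}
```

For example, calling {virustotal, ghidra} when only {virustotal} was expected yields precision 0.5, Jaccard 0.5, and exactness 0.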
We evaluated a couple of commercially available LLMs, varying the temperature and system prompts. Google’s Gemini Flash demonstrated consistently high precision and Jaccard scores, indicating that it accurately selected relevant tools while avoiding unnecessary calls. Additionally, Gemini’s latencies were consistently lower, making it a faster and more efficient choice for real-time cybersecurity analysis.
Key Learnings & Impact
Impact
- MalwareLens empowers users, regardless of their technical background, to identify and understand malware through an intuitive, AI-powered platform.
- CNN-based malware classification achieves high accuracy while reducing the manual effort typically required by cybersecurity analysts.
- LLM Integration provides real-time, natural language insights into malware behavior, structure, and classification, bridging the cybersecurity skills gap for entry-level analysts and general users.
- Cloud-Native Architecture enables scalable, accessible malware detection without requiring local software installation.
Top Technical Challenges
- Data Availability: Access to recent and diverse malware datasets is limited due to strict containment policies and licensing restrictions.
- Non-Standard Input Representation: Converting binary executables into frequency-domain images introduced transformation tradeoffs, sacrificing some raw byte fidelity for model compatibility.
- Model Complexity vs. Interpretability: Balancing CNN performance with the ability to explain model outputs in an educational, user-friendly way.
- Tool Integration: Customizing LangChain’s agent framework to support real-time tool invocation via LLMs while managing bugs in third-party libraries (e.g., HuggingFace tools).
- Cloud Deployment: Deploying a secure, containerized architecture using AWS SageMaker, ECS, and Fargate posed technical and cost-related hurdles.
Future Work
- Expand Dataset Coverage by incorporating newer malware variants and obfuscation techniques to improve real-world generalization.
- Implement Dynamic Analysis through more advanced sandboxing capabilities that allow deeper behavioral tracking of suspicious files.
- Improve LLM Agent Autonomy by refining the ReAct framework to reduce reliance on rigid tool-chaining and improve multi-step reasoning.
- Introduce RAG-Enhanced Intelligence using retrieval-augmented generation (RAG) for the LLM to better source context from malware knowledge bases and documentation.
- Optimize Model Efficiency to enable local, lightweight deployment for organizations with limited cloud access.
User Feedback
“From an advanced user’s perspective, I see a lot of potential here. I’d love to see deeper integration with CVE databases, sandboxing, or GHIDRA disassembly summaries. Right now, it’s a great assistant, but it could evolve into something seriously powerful.”
“The chatbot was surprisingly easy to use. I liked the natural flow of conversation—it guided me step by step through the analysis. If they fix the little UX quirks and add downloadable reports or export options, this would be a game-changer.”
Acknowledgements
We extend our sincere gratitude to our project advisors and the UC Berkeley W210 course instructors for their invaluable guidance and support throughout this project. Special thanks to the organizations and open-source communities that provided access to malware datasets, tools like Ghidra, and APIs such as VirusTotal. We’d also like to thank the developers and contributors behind LangChain and AWS for their robust infrastructure and documentation. Lastly, we are grateful to cybersecurity professionals who shared their insights and feedback, helping us shape MalwareLens into a more impactful and user-centered solution.
References
Mohammed, Tajuddin Manhar, et al. 2021. “Malware Detection Using Frequency Domain-Based Image Visualization and Deep Learning.” arXiv. https://doi.org/10.48550/arXiv.2101.10578.