MIDS Capstone Project Spring 2024

Just.Ethel: Decompiling Software With Neural Machine Translation (NMT)

Team members

Just.Ethel aims to empower malware analysts by offering invaluable tools that unravel complex assembly code to help reveal the true intentions behind malware.

Just.Ethel, an innovative web-based platform, bridges the gap between complex assembly code and more accessible high-level languages like C. By leveraging advanced neural machine translation (NMT), we are streamlining the decompilation process, making it faster and more intuitive. Our technology expands the accessibility of cybersecurity expertise beyond specialists proficient in low-level machine code.

Problem

In recent years, the cybersecurity landscape has been increasingly dominated by ransomware attacks, with projections suggesting that global damage costs could surpass $250 billion by 2031. These attacks not only cause significant financial losses but also disrupt operations across many sectors, posing a severe threat to global economic security and stability. Malware has grown more complex and sophisticated, making traditional cybersecurity measures less effective. A critical challenge in this evolving threat landscape is that malware is difficult to understand and counteract without access to its source code.

Our Work

In response to the escalating threat and sophistication of malware, our team developed Just.Ethel. The system uses Neural Machine Translation (NMT) to translate assembly code back into C source code, addressing a critical challenge in cybersecurity: the difficulty of analyzing malware without access to its source code. We used the dataset from the Beyond the C research, which draws on two sources: a smaller set derived from programming competitions and interview sites, used for model development and testing, and a larger, more comprehensive set of 50k projects from Debian Linux, encompassing 2 million functions, used for fine-tuning our final model.

We compared one NMT model (CodeT5) and one large language model (GPT-4), assessing their ability to translate zero-shot and to generate fluent, human-like code, as well as their performance across tasks of various sizes. We evaluated both models using ROUGE-1 scores; GPT-4 performed better at generating readable C code from assembly language.
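To make the metric concrete, here is a minimal sketch of how a ROUGE-1 score (unigram overlap between a reference and a candidate) can be computed for a pair of C snippets. The tokenizer and the function names `tokenize` and `rouge1` are assumptions made for illustration; the project's actual tokenization and scoring setup are not described here and may differ.

```python
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    # Hypothetical tokenizer (an assumption, not the project's):
    # identifiers and numbers become single tokens, and every other
    # non-whitespace character becomes its own token.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def rouge1(reference: str, candidate: str) -> dict[str, float]:
    # ROUGE-1 counts unigrams shared by reference and candidate,
    # clipping each token's count to the smaller of the two sides.
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    overlap = sum((ref_counts & cand_counts).values())

    ref_total = sum(ref_counts.values())
    cand_total = sum(cand_counts.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    original = "int add(int a, int b) { return a + b; }"
    decompiled = "int add(int x, int y) { return x + y; }"
    print(rouge1(original, decompiled))
```

In this example the decompiled function differs from the original only in its variable names, so most tokens still match. ROUGE-1 rewards that kind of surface overlap; it does not check whether the generated code compiles or behaves the same as the original.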

We believe that our product not only makes the reverse engineering process more accessible but also enhances malware analysis capabilities, offering a vital tool in the fight against the evolving cyber threat landscape.

Acknowledgements

We would like to extend our gratitude towards Fred Nugen and Korin Reid, our capstone instructors, for their invaluable support and guidance.

We would also like to thank Mark Long and Jamie Dicken from New Relic for providing subject matter expertise, as well as colleagues at The MITRE Corporation.

Course

Data Science 210. Capstone, Spring 2024

Class Project Gallery

More Information

Just.Ethel website

Video

Last updated: April 15, 2024