Protocol Fuzzing - A Comparison between LLM-Assisted and Mutation-Based Fuzzers
(2025) EITM01 20251, Department of Electrical and Information Technology
- Abstract
- As the demand for secure software and hardware solutions increases in our modern digitalized society, the importance of reliable and cost-effective product testing is apparent. One such testing technique is fuzzing, which involves generating large amounts of malformed and randomized data to stress-test system interfaces. This study aimed to compare traditional fuzzing methods with modern approaches assisted by Large Language Models (LLMs) for the data generation, focusing on efficiency and performance.
The fuzzing targets were three consumer-grade routers (the D-Link Eagle Pro AI G416, R03 and R15), which received the input data via a direct Ethernet LAN connection. The tested interface was each router's internal HTTP-server, which was fuzzed with data packets aimed at triggering errors based on previously published CVEs.
The traditional fuzzing model used in the comparison was a mutation-based fuzzer, while the LLM-assisted approach employed OpenAI's GPT-4.1 to generate the fuzzing data. Both models operated without prior knowledge of the target systems, qualifying the experiments as black-box fuzzing. Each router underwent five test sessions, during which HTTP-requests, timestamps, and potential error states were logged, with Wireshark capturing background traffic. The collected data included benchmark performance metrics, counts of discovered vulnerabilities, request deviation matrices, and energy usage estimates.
The outcome of the project showed that the traditional fuzzer performed quantitatively better with regard to the benchmark metrics and energy cost per generated byte, but did not perform as well as the LLM-fuzzer with regard to fuzz data variation, alteration dispersion and complexity. Since no errors or vulnerabilities were found during any of the tests, it was concluded that the published CVEs did not aid enough in the context of black-box fuzzing.
- Popular Abstract
- In 1990, the first paper on fuzzing was published by Barton Miller, who explored a simple form of randomized testing on UNIX utility programs, revealing crashes or vulnerabilities in about a quarter of them. These tests are now known as black-box fuzzing, a technique that operates without any knowledge of the target system. A fuzzer can be described as a software program that generates malformed or unexpected inputs to test the robustness of software or hardware interfaces. Over time, fuzzing has evolved from simple randomization to more intelligent techniques, where modern fuzzers can be aware of the target and analyse code coverage. One common method is mutation-based fuzzing, which alters known valid inputs by flipping bits or injecting incorrect data.
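As a minimal sketch (not the thesis's actual implementation), bit-flipping mutation over a known valid input can look like this in Python; the seed request and router address are illustrative placeholders:

```python
import random

def mutate(seed: bytes, n_flips: int = 4) -> bytes:
    """Return a copy of `seed` with a few randomly chosen bits flipped."""
    data = bytearray(seed)
    for _ in range(n_flips):
        pos = random.randrange(len(data))      # pick a random byte...
        data[pos] ^= 1 << random.randrange(8)  # ...and flip one of its bits
    return bytes(data)

# A known valid input (a minimal HTTP request) used as the mutation seed:
seed = b"GET /index.html HTTP/1.1\r\nHost: 192.168.0.1\r\n\r\n"
fuzz_input = mutate(seed)
```

Because only a few bits change, most mutated inputs stay close enough to valid syntax to reach deeper parsing code, which is the usual rationale for mutation over pure randomization.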
Today, fuzzing has progressed to use Large Language Models (LLMs) to generate input data. An LLM is an Artificial Intelligence (AI) tool based on Natural Language Models (NLMs), which are trained on large datasets to provide responses to user queries. Examples include ChatGPT, Google Gemini, and Claude. Due to recent technological advancements, LLMs are a relatively new tool for use in fuzzing.
Fuzzing can also target protocol message testing, which is more complex due to communication rules. For example, with the HyperText Transfer Protocol (HTTP), a Transmission Control Protocol (TCP) session must first be established using the "three-way handshake". Inputs that violate protocol syntax are discarded and never reach the intended target.
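To illustrate the session requirement (a hypothetical Python harness; the target address would be the router's), a raw HTTP fuzz input can be delivered over TCP like this. The operating system performs the three-way handshake inside `connect()`, so the fuzzer only needs a connected socket before sending:

```python
import socket

def send_raw_http(host: str, port: int, request: bytes,
                  timeout: float = 5.0) -> bytes:
    """Establish a TCP session (three-way handshake happens in the OS on
    connect) and send a raw HTTP request, returning the raw response bytes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        try:
            while True:
                chunk = sock.recv(4096)
                if not chunk:          # server closed the connection
                    break
                chunks.append(chunk)
        except socket.timeout:
            pass                       # connection kept open; return what we have
    return b"".join(chunks)
```

Sending the request as raw bytes (rather than through an HTTP client library) matters for fuzzing: a library would reject or normalize malformed requests before they ever left the machine.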
This thesis compares traditional fuzzing methods with modern LLM-based approaches to assess whether LLM-based fuzzers are more efficient. It also investigates whether using known Common Vulnerabilities and Exposures (CVEs) improves vulnerability detection. To achieve these goals, three consumer-grade D-Link routers were selected as fuzzing targets, chosen based on price and number of reported CVEs. A CVE is a reported vulnerability discovered through testing of software or hardware products. One of the routers had 21 HTTP-related CVEs reported. Based on these, a mutation-based black-box fuzzing strategy was developed to generate malformed HTTP-requests.
The traditional fuzzer used three mutation test strategies: large message body buffer overflows, command injections, and a combination involving nested HTTP body mutations. The LLM-based fuzzer was prompted to generate mutated HTTP-requests using both zero-shot and few-shot prompting. A prompt is the input given to an LLM to generate a response.
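The first two mutation strategies can be sketched as follows. This is an illustrative reconstruction, not the thesis's code: the `/login` endpoint, host address, and injection payload list are assumptions, since the abstract does not give the actual request templates:

```python
import random

def http_post(body: bytes) -> bytes:
    """Wrap a body in a syntactically valid HTTP POST so the request
    is not discarded before reaching the server's parsing logic."""
    return (b"POST /login HTTP/1.1\r\nHost: 192.168.0.1\r\n"
            b"Content-Length: %d\r\n\r\n%s" % (len(body), body))

def overflow_body(size: int = 65536) -> bytes:
    """Strategy 1: an oversized message body to probe buffer handling."""
    return http_post(b"A" * size)

def command_injection() -> bytes:
    """Strategy 2: shell metacharacters injected into a form field."""
    payload = random.choice([b"; reboot", b"| cat /etc/passwd",
                             b"`id`", b"$(id)"])
    return http_post(b"username=admin" + payload + b"&password=x")
```

Note that both strategies keep the HTTP envelope valid (correct request line and `Content-Length`) and malform only the body, consistent with the protocol-fuzzing constraint described above.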
Results of the fuzzing showed that the traditional fuzzer was more efficient in terms of speed, the total number of generated HTTP-requests, and energy consumption. However, the LLM-based fuzzer produced a greater variety of malformed input injections. Due to the limited scope of basing tests solely on reported CVEs and the low number of selected valid HTTP-requests, no new vulnerabilities were discovered during the tests performed.
Please use this url to cite or link to this publication:
http://lup.lub.lu.se/student-papers/record/9202449
- author
- Eberhardt, Moa and Karlsson, Lucas
- supervisor
- organization
- course
- EITM01 20251
- year
- 2025
- type
- H2 - Master's Degree (Two Years)
- subject
- keywords
- Fuzzing, Large Language Models, Protocol fuzzing, Mutation-based fuzzing, HTTP
- report number
- LU/LTH-EIT 2025-1080
- language
- English
- id
- 9202449
- date added to LUP
- 2025-06-24 14:04:53
- date last changed
- 2025-06-24 14:04:53
@misc{9202449, abstract = {{As the demand for secure software and hardware solutions increases in our modern digitalized society, the importance of reliable and cost-effective product testing is apparent. One such testing technique is fuzzing, which involves generating large amounts of malformed and randomized data to stress-test system interfaces. This study aimed to compare traditional fuzzing methods with modern approaches assisted by Large Language Models (LLMs) for the data generation, focusing on efficiency and performance. The fuzzing targets were three consumer-grade routers (the D-Link Eagle Pro AI G416, R03 and R15), which received the input data via a direct Ethernet LAN connection. The tested interface was each router's internal HTTP-server, which was fuzzed with data packets aimed at triggering errors based on previously published CVEs. The traditional fuzzing model used in the comparison was a mutation-based fuzzer, while the LLM-assisted approach employed OpenAI's GPT-4.1 to generate the fuzzing data. Both models operated without prior knowledge of the target systems, qualifying the experiments as black-box fuzzing. Each router underwent five test sessions, during which HTTP-requests, timestamps, and potential error states were logged, with Wireshark capturing background traffic. The collected data included benchmark performance metrics, counts of discovered vulnerabilities, request deviation matrices, and energy usage estimates. The outcome of the project showed that the traditional fuzzer performed quantitatively better with regard to the benchmark metrics and energy cost per generated byte, but did not perform as well as the LLM-fuzzer with regard to fuzz data variation, alteration dispersion and complexity. Since no errors or vulnerabilities were found during any of the tests, it was concluded that the published CVEs did not aid enough in the context of black-box fuzzing.}}, author = {{Eberhardt, Moa and Karlsson, Lucas}}, language = {{eng}}, note = {{Student Paper}}, title = {{Protocol Fuzzing - A Comparison between LLM-Assisted and Mutation-Based Fuzzers}}, year = {{2025}}, }