Embedded fuzzing: a review of challenges, tools, and solutions

Fuzzing has become one of the best-established methods to uncover software bugs. Meanwhile, the market of embedded systems, which binds the software execution tightly to the very hardware architecture, has grown at a steady pace, and that pace is anticipated to become yet more sustained in the near future. Embedded systems also benefit from fuzzing, but the innumerable existing architectures and hardware peripherals complicate the development of general and usable approaches, hence a plethora of tools have recently appeared. Here comes a stringent need for a systematic review in the area of fuzzing approaches for embedded systems, which we term “embedded fuzzing” for brevity. The inclusion criteria chosen in this article are semi-objective in their coverage of the most relevant publication venues as well as of our personal judgement. The review rests on a formal definition we develop to represent the realm of embedded fuzzing. It continues by discussing the approaches that satisfy the inclusion criteria, then defines the relevant elements of comparison and groups the approaches according to how the execution environment is served to the system under test. The resulting review produces a table with 42 entries, which in turn supports discussion suggesting vast room for future research due to the limitations noted.


Introduction
Fuzzing is an increasingly popular technique for software testing, namely for finding bugs that could represent either functional problems or vulnerabilities that a malicious attacker could exploit. It uses randomness to generate test data for a target with the goal of triggering faults. Faults indicate bugs and may pose a security vulnerability. Because fuzzing is a dynamic method, it analyzes the software while it is executed. By design, dynamic analysis only allows us to find faults that actually occur during execution. Consequently, it is necessary to exercise as many parts of the code and interleavings of branches as possible.
Since fuzzing with purely random input has a small chance of reaching large parts of the code, sophisticated fuzzing tools make use of additional information, such as input structure or code coverage, to generate inputs. A simple but effective approach is to gather code coverage information while the System Under Test (SUT) processes an input and to collect inputs that trigger previously unreached code parts. This growing collection of inputs, called the corpus, is used continuously to generate further inputs.
Despite its simple underlying principles, fuzzing has proved to be an effective method for system and software testing and is recommended by several industry standards. For example, in ISO 26262, Road vehicles, Functional safety (Road Vehicles 2018), fuzzing is advocated as one of the testing methods to ensure robustness. Fuzzing is also recommended in ISA/IEC 62443-4-1, Secure product development lifecycle requirements (Secure Product Development Lifecycle Requirements).
Eisele et al. Cybersecurity (2022)
Fuzzing user (software) applications is perhaps the best-established use of fuzzing, and there are several consolidated techniques for gathering feedback from a target process. For example, the OSS-Fuzz (Serebryany 2017) project revealed over 30,000 bugs in 500 open source projects by using coverage-guided fuzzers such as libFuzzer (LLVM 2021), AFL++ (Fioraldi et al. 2020), and honggfuzz (Swiecki 2021).
Another important, growing area for fuzzing pertains to embedded systems, which are microcontroller-based devices in conjunction with their dedicated software. Typically developed for specific purposes, embedded systems are used pervasively in modern society, and innumerable examples could be made, including smart meters, pacemakers, and factory robots, to name just a few. The market of embedded computing has been growing constantly and this trend is expected to continue in the near future (Alsop 2019). Notably, embedded systems are key components for the Internet of Things (IoT) and for Cyber Physical Systems (CPSs). Therefore, the motivation for fuzzing embedded systems is remarkable.
A first essential feature of embedded systems is that their firmware is tightly coupled with the specific hardware, including connected peripherals. For example, the firmware of a smart light bulb or of a central heating control panel is extremely unlikely to work seamlessly on different hardware. A second essential feature of embedded systems is their inherent diversity, which is reflected in the operating systems, CPU architectures, communication mechanisms, and hardware peripherals adopted. For example, while some embedded systems may run Linux-based operating systems, some run without any operating system at all. Also, while desktop and server systems mainly rely on a few CPU architectures and operating systems, these may vary significantly for embedded systems.
We contend that these two features also form the two essential reasons why fuzzing embedded systems is still an open challenge at present. For example, compiling distinct modules, such as libraries, into common user applications and exercising fuzzing on them is not an effective means of testing code portions that interact directly with the hardware; incidentally, because of the diverging compiler and environment, this would not test the exact code that ends up on the actual device. It becomes apparent that reliable, holistic fuzz testing of embedded systems ought to cover both the firmware code and the appropriate environment for that firmware. Moreover, the aforementioned diversity poses the biggest challenge due to the need for the fuzzer to scale up to innumerable variants of hardware and firmware that are often poorly documented.
Therefore, we hypothesize that a golden tool and solution for fuzzing embedded systems (embedded fuzzing for short) does not exist yet. To verify this hypothesis, we formulate the following research question: What are the main features and limitations of current tools for fuzzing embedded systems? To address this question, this article conducts a systematic review of the state of the art of approaches to embedded fuzzing. Our review rests on a formal description of fuzzing for embedded systems and leverages it to advance a clustering of the reviewed works on the basis of their underlying mechanisms. The taxonomy criteria used to categorize the reviewed works are presented in "Section Taxonomy criteria".
The treatment highlights that emulation-based approaches work well for academic examples but may fail on real-world use cases. By contrast, hardware-based approaches with all their incarnations may yield the best results, albeit not without limitations. Hybrid approaches seem to bear disadvantages from both worlds. By presenting the whole picture of fuzzing for embedded systems, this article demonstrates features as well as limitations of each reviewed work, ultimately demonstrating what kind of future research is needed and deriving directions on how to pursue it.
"Section Inclusion criteria" defines the criteria for a piece of research to be included in our review, and "Section Background and notation" introduces our extended model for fuzzing embedded systems. Thereafter, we review related work of hardware-based and emulation-based embedded fuzzing in "Section Hardware-based embedded fuzzing" and "Section Emulation-based embedded fuzzing", respectively. Abstraction-based approaches are reviewed in "Section Abstraction-based execution environment". We review the relevant works for embedded fuzzing in "Section Reviewing embedded fuzzing works", discuss future trends in "Section Discussion and future directions", and related work in "Section Related work". We conclude the article in "Section Conclusion".

Inclusion criteria
The inclusion criteria for published material to be included in this review are:
C1 Research papers that are published in the top five venues in the category "Engineering & Computer Science", sub-category "Computer Security & Cryptography" according to Google Scholar (Scholar 2021).
C2 Research papers that are published during the five years between 2017 and 2021.
C3 Research papers that mention "fuzzing" and "firmware" or, alternatively, "fuzzing" and "embedded".
C4 Research papers or tools that we feel convey relevant approaches to embedded fuzzing.
The first two criteria are objective, as Scholar offers convenient selection and sorting facilities for research venues. The chosen area of security is the one that we found most relevant to fuzzing in general, considering fuzzing as a technique for unveiling software vulnerabilities that an attacker could exploit. To confirm this, we also tried the subcategories "Software Systems" and "Computing Systems", but none of the corresponding papers survived criterion C4. The five venues arising through the first criterion are:
Criterion C3 is also objective. Scholar offers a convenient search facility for the contents of published papers. We searched in each of the five identified venues with the following search string:
However, many papers identified this way were not relevant to our purposes for a variety of reasons, ranging from fuzzing being treated only marginally to it being mentioned only in the paper references. Here is where criterion C4 comes into play: we had to exercise manual scrutiny to further select the very contributions that would convey relevant approaches and tools for embedded fuzzing.
Moreover, we decided to appeal to an additional, purposely subjective, inclusion criterion in order to freely represent our experience through the review. Criterion C4 deliberately does not refer to a specific time window or venue, hence applying it in isolation from the previous criteria provides us with the freedom of selection we also wanted to have. Therefore, our resulting inclusion criteria can be represented as a sentence in propositional logic:
$$(C1 \land C2 \land C3 \land C4) \lor C4$$
Clearly, this sentence is logically equivalent to C4 because our personal judgement had to be applied to all possible candidates. However, its construction allows us to represent the numbers of papers for the meaningful combinations of criteria and venue as well as the papers that we freely decide to consider. Such numbers, in particular for the two main disjuncts in the sentence, can be found in Table 1. The selection process is additionally depicted in Fig. 1.
These numbers explain why our review features a total of 42 papers.
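The claimed equivalence to C4 can be checked mechanically. The following is a minimal sketch, assuming the sentence has the form (C1 and C2 and C3 and C4) or C4, which is the reading suggested by its two main disjuncts and by its stated equivalence to C4; the function name `inclusion` is hypothetical.

```python
from itertools import product


def inclusion(c1: bool, c2: bool, c3: bool, c4: bool) -> bool:
    # Assumed form of the sentence: (C1 and C2 and C3 and C4) or C4
    return (c1 and c2 and c3 and c4) or c4


# Exhaustively compare the sentence against C4 over all 16 truth assignments.
equivalent_to_c4 = all(
    inclusion(c1, c2, c3, c4) == c4
    for c1, c2, c3, c4 in product([False, True], repeat=4)
)
print(equivalent_to_c4)
```

Under this assumed form the equivalence follows from the absorption law, so the first disjunct only serves to count the papers that also satisfy the objective criteria.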

Background and notation
In this section, a formal description of embedded fuzzing is proposed to mathematically describe fuzzing as a stochastic process. To this end, the distinct tasks an embedded fuzzer must fulfil are described in an algorithmic manner.
Fig. 1 The selection process for finding relevant works, including the numbers of papers each step has mined (recoverable step label: "Apply C4 to papers outside of C1, C2, and C3", 18)
We use the notation introduced by Böhme (2018) and apply it to fuzzing systems. Let a system $S$ be our target that we fuzz. The sample space for system $S$ is the input space $D$. Fuzzing is then a stochastic process $(D, F, P)$ of selecting inputs $t_i$ from the input space $D$. The event space $F$, or fuzzing campaign, is then the collection of all drawn inputs, i.e.
$$F = \{t_1, t_2, \ldots\}, \quad t_i \in D \qquad (1)$$
The probability function $P$ dictates the selection of an input $t_i$ with probability $p_i$ to be part of the fuzzing campaign $F$. Note that we leave out the often used but poorly specified terms black-box, gray-box, or white-box fuzzing. The degree of smartness is modeled by adjusting the probability function $P$, i.e. the probability $p_i$ for each drawn test input. A tool that implements the sampling function of $(D, F, P)$ is called a fuzzer.
The probability function $P$ can depend on observations of the system $S$. If no observations influence the probability $p_i$ for selecting a new input $t_i$ (all $p_i$ are equal), the fuzzing campaign is a uniform random tester.
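To make the role of the probability function concrete, the following is a minimal sketch of sampling inputs from a tiny, hypothetical input space. The names `sample_input`, `D`, `uniform_p`, and `biased_p` are illustrative assumptions; real input spaces are far too large to enumerate.

```python
import random


def sample_input(inputs, weights, rng):
    """Draw one test input according to the given selection probabilities."""
    return rng.choices(inputs, weights=weights, k=1)[0]


# Hypothetical, tiny input space; chosen only so that it can be listed.
D = [b"\x00", b"\x01", b"\x02", b"\x03"]

# Uniform random tester: no observation influences P, so all probabilities are equal.
uniform_p = [1 / len(D)] * len(D)

# A "smarter" fuzzer shifts probability mass, e.g. towards an input
# that previously led to an interesting observation.
biased_p = [0.7, 0.1, 0.1, 0.1]

rng = random.Random(0)
campaign = [sample_input(D, uniform_p, rng) for _ in range(1000)]
```

Swapping `uniform_p` for `biased_p` changes nothing else in the sampling code, which mirrors the point that the degree of smartness is captured entirely by the probability function.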
Sampled inputs $t_i$ are processed by system $S$ with its configuration $C$, as in equation 2. The configuration $C$ describes the static environment of the system, including hardware properties.
In contrast to existing formal definitions, we introduce an observing mechanism that can observe system $S$ in desired dimensions that are not further specified. The observation of the system's behavior when processing input $t_i$ is then described by $O_{t_i} \in O$ and is obtained by
$$O_{t_i} \xleftarrow{\text{observe}} S_C(t_i) \qquad (2)$$
where $\xleftarrow{\text{observe}}$ describes the observations of the system during the execution. This construction allows us, for example, to gather code coverage of a system or to observe whether exceptional states of the system have been reached. It also allows us to monitor emitted physical side-channel data or perform liveness checks of the system after a processed input. Further observations can be execution time or the output of a system. The specific observation space depends on the actual device and observer.
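As an illustration of such an observing mechanism, the following sketch wraps a toy system and records four dimensions: line coverage, output, fault, and execution time. It is an assumption-laden example rather than a definitive implementation: `Observation`, `observe`, and `toy_system` are hypothetical names, and the coverage is collected with Python's `sys.settrace`, which stands in for whatever instrumentation a real device would offer.

```python
import sys
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """One observation of the system; the chosen dimensions are illustrative."""
    coverage: set = field(default_factory=set)  # executed (function, line) pairs
    output: Optional[int] = None                # value returned by the system
    fault: Optional[str] = None                 # exception name, if one was raised
    duration: float = 0.0                       # execution time in seconds


def observe(system, config, t_i):
    """Run the system with its configuration on one input and record what happened."""
    obs = Observation()

    def tracer(frame, event, arg):
        if event == "line":
            obs.coverage.add((frame.f_code.co_name, frame.f_lineno))
        return tracer

    start = time.perf_counter()
    sys.settrace(tracer)
    try:
        obs.output = system(config, t_i)
    except Exception as exc:
        obs.fault = type(exc).__name__
    finally:
        sys.settrace(None)
        obs.duration = time.perf_counter() - start
    return obs


def toy_system(config, data):
    """Hypothetical target with a length limit taken from its configuration."""
    if len(data) > config["max_len"]:
        raise OverflowError("input exceeds buffer")
    return sum(data) % 256
```

Calling `observe(toy_system, {"max_len": 4}, b"\x01\x02")` yields an observation without a fault, whereas an eight-byte input yields one whose `fault` field names the raised exception and whose coverage set differs.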
For fuzzing, algorithm 1 is built around equation 2, which is called in line 4, where $O_{t_i}$ is the concrete observation of system $S_C$ on processing input $t_i$. The algorithm continuously samples inputs $t_i \in D$ according to the probability function $P$, which are then processed by system $S$. The observation $O_{t_i}$ is inspected for unspecified behavior in function specified. For example, the specification can contain maximum execution durations or illegal states of the system. If unspecified behavior is discovered, the (hopefully) responsible input $t_i$ is preserved in $T_\times$.
Finally, the probability function $P$ may be adjusted by function adjust, based on the new observation $O_{t_i}$. For example, mutation-based coverage-guided fuzzers implicitly alter their probability function when a new execution path has been discovered, by adding the responsible input to an input corpus. On each iteration, a seed is picked from the input corpus and mutated randomly to generate a new input, so the seeds directly influence the probability space of newly sampled inputs.
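The loop described above can be sketched as follows for a mutation-based, coverage-guided fuzzer. This is a toy reading of the algorithm, not a reproduction of it: the target `system`, the four-symbol alphabet, the mutation operators, and the iteration budget are all assumptions made so that the example runs quickly.

```python
import random

ALPHABET = b"FUZ!"  # tiny alphabet so the toy campaign converges quickly


def system(config, t_i):
    """Toy target: returns the observation (covered branches, crashed flag)."""
    covered, crashed = {"entry"}, False
    if t_i[:1] == b"F":
        covered.add("F")
        if t_i[1:2] == b"U":
            covered.add("FU")
            if t_i[2:3] == b"Z":
                covered.add("FUZ")
                crashed = len(t_i) > config["max_len"]
    return covered, crashed


def specified(observation):
    """The specification here only says: the system must not crash."""
    _, crashed = observation
    return not crashed


def mutate(seed, rng):
    data = bytearray(seed)
    symbol = rng.choice(ALPHABET)  # indexing bytes yields an int
    if rng.random() < 0.5:
        data[rng.randrange(len(data))] = symbol  # overwrite one byte
    else:
        data.append(symbol)                      # grow by one byte
    return bytes(data)


def fuzz(config, seeds, iterations, rng):
    corpus = list(seeds)   # the corpus implicitly encodes the probability function
    seen = set()           # union of all coverage observed so far
    for seed in corpus:
        seen |= system(config, seed)[0]
    failing = []           # inputs that triggered unspecified behavior
    for _ in range(iterations):
        t_i = mutate(rng.choice(corpus), rng)   # sample an input
        observation = system(config, t_i)       # obtain the observation
        if not specified(observation):
            failing.append(t_i)                 # preserve the responsible input
        covered = observation[0]
        if not covered <= seen:                 # adjust on new coverage
            seen |= covered
            corpus.append(t_i)
    return corpus, failing


corpus, failing = fuzz({"max_len": 3}, [b"A"], 5000, random.Random(1))
```

The corpus grows one prefix at a time, because only inputs that reach a new branch are kept as seeds; every preserved failing input therefore starts with the full prefix and exceeds the configured length.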
Differential Fuzzing (Nilizadeh et al. 2019; Noller et al. 2020; He 2020) refers to fuzzing of different programs with respect to differences between the observations $O_{t_i}$, such as coverage or execution time. With an adaptation of algorithm 1, systems can be fuzzed differentially, e.g. to test two implementations of the same algorithm for deviating behavior.
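A minimal sketch of this idea compares the outputs of two implementations of the same function on randomly drawn inputs. Both implementations are hypothetical, and the second one contains a deliberately planted off-by-one bug; the observed dimension here is the output, although coverage or execution time could be compared in the same way.

```python
import random


def sat_add_reference(a, b):
    """Saturating 8-bit addition, reference implementation."""
    return min(a + b, 255)


def sat_add_optimized(a, b):
    """Second implementation with a planted off-by-one in the overflow check."""
    total = a + b
    return 255 if total > 256 else total & 0xFF


def differential_fuzz(impl_a, impl_b, iterations, rng):
    deviations = []
    for _ in range(iterations):
        t_i = (rng.randrange(256), rng.randrange(256))
        o_a, o_b = impl_a(*t_i), impl_b(*t_i)   # one observation per system
        if o_a != o_b:                          # deviating behavior found
            deviations.append((t_i, o_a, o_b))
    return deviations


deviations = differential_fuzz(
    sat_add_reference, sat_add_optimized, 20000, random.Random(7)
)
```

No explicit specification is needed: each implementation serves as the oracle for the other, and every recorded deviation here involves operands that sum to exactly 256.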
We model stateful fuzzing by allowing $t_i$ to contain a sequence of multiple inputs. Executing such a sequence on system $S$ brings it to a state $s$, which we collect as part of the observation $O_{t_i}$ of $S$.
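The following sketch illustrates this with a hypothetical three-state protocol: one test input is a sequence of messages, and the state reached after the whole sequence is reported as part of the observation. The transition table and message names are invented for illustration.

```python
# Hypothetical protocol state machine: (current state, message) -> next state.
TRANSITIONS = {
    ("idle", "hello"): "greeted",
    ("greeted", "login"): "authenticated",
    ("authenticated", "logout"): "idle",
}


def run_sequence(t_i):
    """Execute one test input, which is a sequence of messages, on the system."""
    state = "idle"
    for message in t_i:
        # Messages that are invalid in the current state leave it unchanged.
        state = TRANSITIONS.get((state, message), state)
    return {"state": state}   # the reached state is part of the observation
```

For example, `run_sequence(["hello", "login"])` reports the state `authenticated`, while `run_sequence(["login"])` stays in `idle`, which shows why single messages alone cannot reach deeper states.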
Ensemble Fuzzing, as introduced by Chen et al. (2019), is when multiple fuzzers execute algorithm 1. The main idea is that the different tools synchronize their observations. The same system $S$ can be run with different configurations $C$ and $C'$. For example, configuration $C'$ can have the input validation, such as a checksum, turned off to allow a fuzzer to get deeper into the SUT more quickly. The original configuration $C$ is then used to validate inputs from configuration $C'$ to reduce false positives.
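The interplay of the two configurations can be sketched with a toy packet parser. Everything below is an assumption for illustration: the packet layout (a 4-byte CRC32 followed by the payload), the two planted crashes, and in particular the step of repairing the checksum before replaying a finding under the original configuration.

```python
import zlib


def crc(payload):
    return zlib.crc32(payload).to_bytes(4, "big")


def system(config, packet):
    """Toy target: 4-byte CRC32, then payload. Returns 'rejected', 'crash' or 'ok'."""
    checksum, payload = packet[:4], packet[4:]
    if config["validate"]:
        if checksum != crc(payload) or len(payload) > 8:
            return "rejected"
    if payload[:2] == b"\xff\xff":
        return "crash"      # genuine bug, reachable with a valid packet
    if len(payload) > 8:
        return "crash"      # only reachable when validation is switched off
    return "ok"


C = {"validate": True}         # original configuration
C_PRIME = {"validate": False}  # relaxed configuration used for fuzzing


def confirm(packet):
    """Replay a finding from the relaxed configuration under the original one.

    Assumption: the checksum is recomputed first, otherwise every finding
    would be rejected trivially.
    """
    payload = packet[4:]
    return system(C, crc(payload) + payload) == "crash"
```

A finding whose payload starts with two `0xff` bytes is confirmed under the original configuration, whereas an oversized payload crashes only in the relaxed configuration and is discarded as a false positive.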
Fuzzing Harness, or Fuzz Wrapper, is an adapter between a fuzzer and a specific target. Applications that process data directly from a file or console input channel can most likely be fuzzed without any adapter in between. For all other cases, a (typically lightweight) fuzzing harness is necessary to route input data from the fuzzer to the target's interface.
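As a minimal sketch, the harness below routes raw fuzzer bytes to a target whose interface takes typed arguments instead of a byte stream. The target function `set_brightness` and the byte layout chosen by the harness are hypothetical.

```python
def set_brightness(channel: int, level: int) -> str:
    """Hypothetical target interface: typed arguments, not a byte stream."""
    if not 0 <= level <= 100:
        raise ValueError("level out of range")
    return f"channel {channel} set to {level}%"


def harness(data: bytes):
    """Route raw fuzzer bytes to the target's typed interface."""
    if len(data) < 2:
        return None                     # not enough bytes to build the arguments
    channel, level = data[0] % 4, data[1]
    try:
        return set_brightness(channel, level)
    except ValueError:
        return None                     # documented rejection, not a fault
```

The harness decides which exceptions count as expected rejections; anything else that escapes it would be reported to the fuzzer as a fault.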

Hardware-based embedded fuzzing
The tight coupling of software and hardware in embedded systems suggests that fuzz testing is to be performed on the actual device. However, observing the device, i.e. implementing $\xleftarrow{\text{observe}}$, already poses a challenge. In this section, we present approaches that aim to run the target application in its designed hardware environment.
Fan (2020) ported the popular fuzzer AFL to ARM-based IoT devices. Within their ARM-AFL project they developed a code instrumentation strategy for ARM assembly and implemented a lightweight heap memory corruption detector. The whole fuzzing process runs on the target device itself, leading to a high throughput. In principle, the fuzzing process works exactly like fuzzing on a desktop PC. Crash signals and code coverage of the target process are observed in each $O_{t_i}$. ARM-AFL requires Linux as the operating system and the source code of the target program.
Frida (FRIDA 2020) is a dynamic code instrumentation toolkit that can hook into arbitrary user processes enabling transparent access to the execution. It can also be controlled remotely, allowing for hooking into Linux, QNX, Android, and iOS applications. In addition, Frida enables the collection of code coverage data from the hooked process to facilitate fuzzing. However, the Frida server application must be executed on the target device, which can be challenging on closed/commercial devices.
Bogad and Huber (2019) developed Harzer Roller, a linker-based instrumentation tool for embedded security testing. They address the problem that embedded firmware often needs closed-source libraries in order to communicate with the hardware, which cannot be instrumented by the compiler. These libraries are usually shipped as an object file and are integrated into the firmware by the linker. To be able to generate call traces, all functions within the object file are renamed and appropriate proxy functions are generated. For detecting stack overflows, a stack canary can be generated by the framework before calling the original function. The authors state that this technique is meant for simple embedded devices with limited debug capabilities. The instrumentation of an object file increases its size up to 150%, which usually makes it impossible to instrument all libraries on memory-limited targets. The framework has been used for fuzzing an ESP8266 using Boofuzz (Pereyda 2017) as a black-box fuzzer.
Oh et al. (2015) present a simple Dynamic Binary Instrumentation (DBI) method for embedded systems without any dependency on the operating system. They connect the target device with a debugger and insert software breakpoints at manually chosen locations. When a breakpoint is reached, the instrumentation framework is notified, and the breakpoint is removed for further execution. This method enables observation of manually selected, executed code parts in $O_{t_i}$ and could be used for coverage-guided fuzzing of any embedded system that provides a suitable debugger. According to the measurements of the authors, the overhead of this method is only around 1%. However, the measurements have only been performed on one device.
Börsig et al. (2020) present a method to instrument code for ESP32 microcontrollers, whereby the coverage data is returned to the fuzzer's host via a JTAG connection. For this, the source code must be available and the GCC coverage instrumentation mechanism is used. The input data is sent to the target via the original channel, e.g. WiFi. However, the transfer of the coverage data via the JTAG interface slows down the fuzzing process roughly by a factor of ten.
Tychalas et al. (2021) investigate security evaluation of Programmable Logic Controllers (PLCs). Although PLC binaries are not regular programs, the authors show that they can introduce vulnerabilities into systems. To reveal such vulnerabilities, they propose a method to instrument PLC binaries and enable coverage-guided fuzzing on them.
Song et al. (2019) presented PERISCOPE to examine communication between devices and drivers over Memory-Mapped IO (MMIO) and Direct Memory Access (DMA). The extension PERIFUZZ allows fuzzing on this hardware-OS boundary. PERISCOPE needs to be compiled directly into the target's kernel. Analysis and fuzzing can then be performed directly on present MMIO and DMA regions. For demonstration, AFL is used, but the actual fuzzer is interchangeable.
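The breakpoint-based observation described above for debugger-attached targets can be simulated in a few lines. This is a sketch under stated assumptions: the class `OneShotBreakpoints`, the addresses, and the idea of feeding an execution trace to `run` stand in for a real debugger connection, which is not modeled here.

```python
class OneShotBreakpoints:
    """Simulated debugger state: each breakpoint reports once, then is removed."""

    def __init__(self, addresses):
        self.armed = set(addresses)   # breakpoints still inserted in the target
        self.hits = set()             # addresses reported so far

    def on_execute(self, address):
        """Called for every executed address; returns True on new coverage."""
        if address in self.armed:
            self.armed.remove(address)   # removed, so later hits cost nothing
            self.hits.add(address)
            return True
        return False


def run(trace, breakpoints):
    """Replay one execution trace and count newly covered locations."""
    return sum(breakpoints.on_execute(address) for address in trace)
```

Because a breakpoint is removed after its first hit, only inputs that reach a location for the first time cause any interruption, which is consistent with a low overhead once most locations have been visited.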
Delshadtehrani et al. (2020) designed the programmable hardware monitor PHMon for debugging, assisting vulnerability detection, and enforcing security policies. A prototype of the hardware monitor has been deployed on a Field Programmable Gate Array (FPGA) in conjunction with a RISC-V processor. It can be used to generate coverage feedback directly from the execution on the hardware. The authors state that coverage-guided fuzzing with PHMon and AFL is 16 times faster than fuzzing in a full-system emulator. However, the hardware monitor module needs to be included directly on the hardware chip, to enable this performance advantage.
Sperl and Böttinger (2019) present a side-channel approach to gathering code coverage from embedded systems by precisely monitoring the power consumption of the target device during execution. For this purpose, an oscilloscope is used to record power traces, which are processed further on a host PC to recognize the different executed basic blocks. The recognition is realized by machine learning classification algorithms. With this technique, they are able to approximate the Control Flow Graph (CFG) with correlation coefficients of up to 0.9. For correct results, the setup needs to be calibrated and trained on the actual Device under Test (DUT).
García et al. (2020) use timing and electromagnetic emanation side channels from embedded devices for analyzing implementations of cryptographic algorithms. They use these side channels in a specialized feedback-driven fuzzing algorithm to recover cryptographic private keys.
Chen et al. (2018) present IoTFuzzer, which aims at fuzzing IoT devices that are controlled by mobile phone applications (in this case, Android apps only). It makes use of the fact that accompanying mobile apps of IoT devices are aware of the exact protocol and encryption for controlling the device. The idea is to reuse the mobile app to send correct messages to the target device, thereby enabling protocol-aware fuzzing. For this, the mobile app is initially scanned for functions that consume user input and send it to the IoT device. These functions are then reused to send fuzzing messages to the target device. This way, the generation of syntactically and semantically correct fuzzing messages is ensured. Crashes are detected by observing the communication or performing liveness checks.
Redini et al. (2021) have refined this method in their tool DIANE. In contrast to IoTFuzzer, DIANE does not hook into the first function that consumes user input, but into the last possible one before the message is encoded and sent to the SUT. Thereby, any sanitization of the user input within the mobile application is bypassed and the possible input space is enlarged.
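The effect of the hook position can be illustrated with a toy model of a companion app. This is not code from IoTFuzzer or DIANE: the functions `sanitize`, `build_message`, `encode`, and `app_send`, as well as the two hook parameters, are hypothetical and only mimic an app pipeline of user input, sanitization, message construction, and encoding.

```python
import json


def sanitize(user_input: str) -> str:
    """App-side input sanitization: alphanumeric only, at most 8 characters."""
    return "".join(ch for ch in user_input if ch.isalnum())[:8]


def build_message(value: str) -> dict:
    return {"cmd": "set_name", "value": value}


def encode(message: dict) -> bytes:
    """Stands in for the app's protocol encoding (and encryption)."""
    return json.dumps(message, sort_keys=True).encode()


def app_send(user_input, hook_ui=None, hook_pre_encode=None):
    """Toy app pipeline with two possible hook points for a fuzzer."""
    if hook_ui:
        user_input = hook_ui(user_input)        # early hook: at the user input
    message = build_message(sanitize(user_input))
    if hook_pre_encode:
        message = hook_pre_encode(message)      # late hook: just before encoding
    return encode(message)


payload = "A" * 64 + "%n"
early = app_send("x", hook_ui=lambda _: payload)
late = app_send("x", hook_pre_encode=lambda m: {**m, "value": payload})
```

In this model the early hook delivers only eight sanitized characters to the device, while the late hook delivers the full payload, yet both messages remain validly encoded because the app's own encoder is reused.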
Snipuzz (Feng et al. 2021) also aims to fuzz test IoT devices with accompanying mobile applications. Unlike IoTFuzzer and DIANE, it additionally analyzes responses from the target device to enable feedback-driven fuzzing. Appropriate message sequences are gathered by reading the public API, when it is available, or by analyzing the communication between the accompanying mobile application and the target device. As an alternative, the accompanying mobile application can also be disassembled, but this usually requires more effort. Although Snipuzz aims to be lightweight, it requires some manual analysis to gather valid initial seeds and select the right message sequences for fuzzing.
Aafer et al. (2021) present a technique to perform feedback-driven fuzzing of Android TV boxes based on logging outputs. First, static analysis is applied to extract logging statements within the target's firmware. With taint analysis, the collected logging statements are classified according to whether they are related to input validation. This labelled collection of logging statements is then used to train a Convolutional Neural Network (CNN) model, which serves as a classifier for logging outputs. During fuzzing, output logs are analyzed by using the model to detect diverging behavior of the target and to provide feedback to the fuzzer. In addition, they introduce an external component that detects visual and auditory anomalies by capturing and comparing video and audio signals before and after each fuzzing step. This method generates coarse-grained feedback compared to branch code coverage and is intended for rather talkative devices that give feedback via logs.

Emulation-based embedded fuzzing
Emulators offer transparency and control of the emulated subject and enable a precise observation $O_{t_i}$ of internal operations in manifold dimensions. Furthermore, multiple instances of an emulator can be created easily, enabling horizontal scaling of the fuzzing process.
However, running firmware of embedded devices in an emulator presents several challenges, which are laid out well by Wright et al. (2021). Most notable for fuzzing are the fidelity and the effort needed to adapt an emulator to a specific target. Figure 2 shows an architecture model for embedded systems. While the application logic is contained in the application layer, potential operating systems are located within the system software layer. However, there are embedded systems without a dedicated operating system, often referred to as bare-metal systems. The system software layer then may contain bootloader, drivers, and Hardware Abstraction Layer (HAL) modules. Executing the application within an emulator can be realized by either replacing the hardware layer with a system emulator or by moving only the application into a user-mode emulator.

Fig. 2 Embedded systems architecture model according to Noergaard (2012), with the layers Application, System Software, and Hardware (HW)
In this section, the most notable approaches are presented that enable embedded fuzzing in an emulator.

User mode emulation fuzzing
User applications that are built for running in an operating system can potentially be executed very easily in an emulator, because of the well-defined operating system interfaces at the application layer. User mode emulation enables fuzzing of binary-only applications with coverage guidance.
It is also possible to transfer user applications from (in particular Linux-based) embedded systems into a user mode emulator like QEMU to perform coverage-guided fuzzing, independently of the instruction set architecture. However, accesses to the hardware that embedded applications normally rely on need to be treated adequately by the emulator.
All investigated fuzzing frameworks in this category use a custom kernel for this purpose, as also depicted in Fig. 3. The thick boxes depict the parts that originate from the actual target.
Chen et al. (2016) developed the Firmadyne framework, which allows for automated dynamic analysis of Linux-based embedded firmware images. It extracts the root filesystem from a binary firmware image and utilizes a custom kernel to run the image within the QEMU full-system emulator. With this setup, dynamic analysis of the user applications in the firmware can be performed, which is demonstrated by providing a set of known exploits that can be tried on the emulated device. Even though the full-system mode of QEMU is used, Firmadyne should be considered to enter at the application layer, because it deploys its own customized kernel and only the user space applications from the firmware are executed. The custom kernel partially compensates for missing hardware emulation, for example, by providing an emulated NVRAM that embedded devices often use.
The Firmadyne framework is enhanced by Kim et al. in FirmAE (Kim et al. 2020). They claim that the Firmadyne framework could only get 16.28% of their tested set of firmware images up and running for dynamic analysis. To solve this problem, they introduced heuristics to configure boot parameters, kernel parameters, network interfaces, and file systems correctly. With these modifications, they were able to automatically run 79.36% of the aforementioned set of firmware images within QEMU.
FirmFuzz (Srivastava et al. 2019) is an automated introspection and analysis framework for IoT firmware. It is designed for embedded devices that offer user interfaces through a webpage and are based on Linux. The QEMU system emulator is set up with a customized kernel in conjunction with fake peripheral drivers to compensate for potential missing hardware emulation. A headless browser is used to communicate with the device automatically through a virtual network interface to find user interfaces. After the static analysis of the firmware, a generation-based fuzzer is set up. Seed input data is generated, using the contextual information that is gathered from the firmware image. The target is monitored for faults by the modified Linux kernel within the emulator.
FIRM-AFL (Zheng et al. 2019) is based on AFL and Firmadyne. The idea is to speed up fuzzing within QEMU by letting the target user process run in user mode for as long as possible. When necessary, the user process is transferred to the full-system emulator of the appropriate device hardware. As a result, the overhead of full-system emulation is largely avoided. The authors state that with this mechanism, the fuzzing process can be sped up by a factor of ten. However, it is required that the target device runs a POSIX-compatible operating system and that the hardware can be emulated by QEMU.
Transferring embedded applications from Linux-based devices into an emulator by providing a customized kernel can be successful in some cases, in particular when the target application does not rely on special hardware peripherals. Nevertheless, there remain many embedded systems to which this does not apply, and which demand a different approach for emulation-based fuzzing.

Full-system emulation fuzzing
Once an embedded system can be emulated adequately, code coverage, fault states, and other meta information of the execution can be obtained easily. The next section is about methods that enable full-system emulation of embedded devices. For a correct emulation of embedded firmware, all hardware peripheral accesses must be treated in the emulator.

Peripheral emulation
A hardware access manifests itself in read and write operations on the hardware address space. Additionally, hardware interrupts are a mechanism to let hardware peripherals trigger code areas from the firmware. Implementing software equivalents of hardware peripherals and providing them at their expected locations in the hardware address space is a way to enable emulation. When all peripherals from a target device can be emulated, an unmodified firmware image can be executed and fuzzing can be enabled with little effort, as depicted in Fig. 4.
Fig. 3 Scheme of fuzzing applications in a user-mode emulator
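The principle of placing software equivalents of peripherals at their expected addresses can be sketched as follows. The example is a simplification under stated assumptions: the UART register layout, the base address `UART_BASE`, and the classes `Bus`, `Uart`, and `BusFault` are hypothetical, interrupts are not modeled, and `firmware_echo` merely stands in for firmware code that polls the peripheral.

```python
class BusFault(Exception):
    """Raised when the firmware touches an address with no peripheral behind it."""


class Uart:
    """Software model of a hypothetical UART: DATA at +0x0, STATUS at +0x4."""
    SIZE = 8

    def __init__(self, rx_bytes=b""):
        self.rx = list(rx_bytes)   # bytes waiting to be read by the firmware
        self.tx = bytearray()      # bytes the firmware has written

    def read(self, offset):
        if offset == 0x0:
            return self.rx.pop(0) if self.rx else 0
        if offset == 0x4:
            return 1 if self.rx else 0   # bit 0: receive data available
        return 0

    def write(self, offset, value):
        if offset == 0x0:
            self.tx.append(value & 0xFF)


class Bus:
    """Dispatches reads and writes to the peripheral mapped at that address."""

    def __init__(self):
        self.regions = []

    def map(self, base, peripheral):
        self.regions.append((base, peripheral))

    def _resolve(self, address):
        for base, peripheral in self.regions:
            if base <= address < base + peripheral.SIZE:
                return peripheral, address - base
        raise BusFault(hex(address))

    def read(self, address):
        peripheral, offset = self._resolve(address)
        return peripheral.read(offset)

    def write(self, address, value):
        peripheral, offset = self._resolve(address)
        peripheral.write(offset, value)


UART_BASE = 0x4000_0000  # assumed address, as a datasheet would specify it


def firmware_echo(bus):
    """Stand-in for firmware code that polls the UART and echoes each byte."""
    while bus.read(UART_BASE + 0x4) & 1:
        bus.write(UART_BASE, bus.read(UART_BASE))
```

A fuzzer can then supply its input through the receive queue of the UART model, and an access to an unmapped address surfaces as a `BusFault`, which hints at a peripheral that still needs to be modeled.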
The QEMU system mode is a popular full-system emulator that already provides configurations for several microcontrollers and peripherals and supports a large variety of architectures. TriforceAFL (Hertz and Newsham 2021) combines AFL with QEMU and enables emulation-based, coverage-guided fuzzing for targets that can be emulated with QEMU. If the desired target device is not supported, the implementation and configuration can be very laborious and require deep knowledge of the hardware. Herdt et al. (2020) present a different solution for emulating the whole hardware of an embedded system. They apply libFuzzer to a SystemC virtual prototype. SystemC is defined in the IEEE 1666 standard (Group S-SCSW 2011) and provides a set of C++ libraries to define virtual prototypes. Virtual prototypes are models of the entire hardware system and allow an accurate simulation. They are an established way of testing systems during their development in industry. Fuzzing is performed on the virtual hardware by using a fully booted state of the system, which is preserved by a fork-server mechanism. However, the complete system must be described in SystemC, which requires deep insights into the SUT and can again require a lot of manual work. Clements et al. (2020) present HALucinator to address the problem of emulating peripherals by using the HAL as an entry point. First, it locates HAL functions in the firmware through binary analysis. Second, it intercepts the execution of the HAL functions and instead mimics their expected behavior. Handlers for each HAL function must be implemented manually once. Besides correct emulation, HALucinator can intercept functions that provide random values and is able to replace them with deterministic functions, which can render fuzzing more efficient.
Kim et al. (2019) proposed RVFuzzer for detecting input validation bugs in robotic vehicles. Robotic vehicles are cyber-physical systems managed in real time by a microcontroller, which needs to control actuators, process sensor data, and react to control commands. A careful validation of incoming control commands is therefore required, especially if they are received from an unencrypted broadcast medium. RVFuzzer tries to detect (sequences of) control commands that bring the robotic vehicle into an unstable state. To this end, the control program is connected to a physical simulation of the robotic vehicle, and input commands as well as environment parameters are mutated. Instabilities are detected by observing whether the presumed state in the control program deviates too much from that in the simulation.
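Returning to the HAL-based interception described for HALucinator above, the mechanism can be illustrated with a minimal sketch. It is not HALucinator's implementation: the CPU state is modeled as a plain dictionary, the register names assume an ARM-style calling convention, and all addresses and function names are hypothetical.

```python
import itertools


class HalInterceptor:
    """Maps entry addresses of HAL functions to replacement handlers."""

    def __init__(self):
        self.handlers = {}

    def register(self, entry_address, handler):
        self.handlers[entry_address] = handler

    def try_intercept(self, cpu):
        """Run a handler instead of the HAL function if pc is a known entry point."""
        handler = self.handlers.get(cpu["pc"])
        if handler is None:
            return False              # not a HAL entry point: execute normally
        cpu["r0"] = handler(cpu)      # return value in r0 (assumed convention)
        cpu["pc"] = cpu["lr"]         # skip the function body, return to the caller
        return True


def make_deterministic_rng(seed=0):
    """Replacement for a hardware random number function: a simple counter."""
    counter = itertools.count(seed)
    return lambda cpu: next(counter) & 0xFFFFFFFF
```

An emulator loop would call `try_intercept` before executing each basic block. Replacing a random number source with a counter in this way makes repeated executions of the same input follow the same path, which is the property that benefits the fuzzer.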

Peripheral proxying
When deep knowledge about the SUT is missing, hardware accesses of the firmware must be treated differently. An alternative solution is to forward each hardware access to the real device. Therefore, a proxy application is introduced to route appropriate values and triggered interrupts between the actual hardware and the emulation, as shown in Fig. 5.
PROSPECT (Kammerstetter et al. 2014) uses a TCP/IP connection to forward hardware accesses, Avatar (Zaddach et al. 2014) uses a debugging connection, and SURROGATES (Koscher et al. 2015) routes hardware accesses through a dedicated FPGA to the actual hardware.
Regarding mobile system drivers, Talebi et al. (2018) developed Charm, which enables fuzzing of device drivers by forwarding hardware peripheral accesses through a USB-based connection. Since the drivers need to be modified for this method, Charm works only with open-source drivers.
Avatar has a successor, Avatar2 (Muench et al. 2018), which is intended not only for rerouting hardware accesses but, more broadly, for orchestrating different frameworks to enable dynamic analysis. Its flexibility is proven by .
They enable coverage-guided fuzzing on a wide variety of devices by using PANDA as the emulation platform and boofuzz (Pereyda 2017) as the fuzzer. Furthermore, they uncover the issue of silent memory corruptions that can occur in embedded devices without Memory Management Units (MMUs) or operating systems that take care of memory accesses. These are memory corruptions that do not result in a crash of the device upon occurrence and are therefore not easily observable. To detect silent memory corruptions, they present heuristics that can be applied to an emulator, regardless of the manner of hardware access treatment. When using these heuristics, all occurring memory corruptions of a device can be discovered.
Peripheral proxying offers a solution for emulating an embedded device without excessive implementation effort. However, the forwarding of peripheral accesses to the real hardware can present a bottleneck, depending on the number of requests to the hardware. Additionally, manual configuration and setup of the proxying mechanism are required.
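The forwarding principle and its cost can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the design of any of the cited tools: `RecordedDevice` stands in for the physical device, which a real setup would reach over a debug probe, TCP/IP, USB, or an FPGA, and the MMIO address range in the usage note is hypothetical.

```python
class RecordedDevice:
    """Stand-in for the real hardware behind the proxy (hypothetical transport)."""

    def __init__(self):
        self.registers = {}

    def read(self, address):
        return self.registers.get(address, 0)

    def write(self, address, value):
        self.registers[address] = value & 0xFFFFFFFF


class PeripheralProxy:
    """Serves RAM accesses locally and forwards peripheral accesses to the device."""

    def __init__(self, device, mmio_start, mmio_end):
        self.device = device
        self.mmio_start = mmio_start
        self.mmio_end = mmio_end
        self.ram = {}
        self.forwarded = 0   # number of round trips to the real device

    def _is_peripheral(self, address):
        return self.mmio_start <= address < self.mmio_end

    def read(self, address):
        if self._is_peripheral(address):
            self.forwarded += 1
            return self.device.read(address)
        return self.ram.get(address, 0)

    def write(self, address, value):
        if self._is_peripheral(address):
            self.forwarded += 1
            self.device.write(address, value)
        else:
            self.ram[address] = value
```

With `PeripheralProxy(RecordedDevice(), 0x40000000, 0x60000000)`, only accesses inside the peripheral range reach the device. The `forwarded` counter grows with every such access, and each increment corresponds to one round trip over the link, which is where the bottleneck noted above originates.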

Peripheral modeling
Where implementing virtual hardware requires too much effort and peripheral proxying is too slow for fuzzing, automated hardware modeling can be a solution. The idea is to learn how to respond to hardware accesses such that the firmware continues its execution. The peripheral model is thereby directly connected to the MMIO address space and can be supported by the fuzzer, as depicted in Fig. 6. Gustafson et al. (2019) present a semi-automated rehosting framework called PRETENDER. They solve the modeling of hardware peripherals by means of preliminary observation and recording of the behavior of the real device with Avatar2. As a result, not only accesses to the hardware are recorded, but also the timings and orders of interrupts. Next comes a rather complex step of categorizing MMIO registers and initializing a State Approximation model. This should allow for smart responses to hardware accesses of the firmware. Finally, human interaction is needed to define the entry point of the fuzzing data. The authors state that PRETENDER allows for a survivable execution, which can be just sufficient for a dynamic analysis of the device. Spensky et al. (2021) refined this approach with Conware, which can also learn hardware peripheral behavior by first recording interactions between the firmware and the real hardware peripheral and subsequently extracting models for each of them. The extracted models can then be used for a full-system emulation. In contrast to PRETENDER, Conware claims to be more generic and can even model peripheral behavior that has not been recorded directly.
Another hardware-agnostic approach for embedded fuzzing is presented by Feng et al. (2020). Their framework P2IM responds to each peripheral access (a read from the MMIO address space) with input data from the fuzzer. To this end, the MMIO registers are categorized into Control Registers, Status Registers, Data Registers, and Control-Status Registers by observing how the firmware accesses the registers. Depending on the category, interaction with the registers is treated differently. Most important is the treatment of Data Registers, where P2IM directly injects input data from the fuzzer. Thereby, the fuzzer itself models all of the peripheral input generically, omitting the need for finding and choosing the correct input vector for the target. The interrupt emulation is implemented quite pragmatically by sequentially firing one interrupt per 1000 executed basic blocks. When the initially supplied fuzz input buffer is exhausted, the execution is terminated and the code coverage is fed back to the fuzzer. The explorative nature of the fuzzer is used to improve the hardware peripheral modeling successively. The framework allows existing fuzzers to be added as a drop-in component, offering AFL as default. However, peripherals that use DMA are not modeled by P2IM, as this would require insights into the internal design of the target device.
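Three of the mechanisms described above (serving Data Register reads from the fuzz input, firing interrupts sequentially at a fixed basic-block interval, and terminating when the input is exhausted) can be sketched as follows. This is a simplified illustration, not P2IM's code: the register categorization is assumed to have been done already, reads from all other categories return a placeholder value of 0, and the class and parameter names are hypothetical.

```python
class InputExhausted(Exception):
    """Raised when the fuzz input buffer has been fully consumed."""


class FuzzedPeripherals:
    def __init__(self, fuzz_input, data_registers, irqs, interval=1000):
        self.data = fuzz_input
        self.pos = 0
        self.data_registers = set(data_registers)  # addresses categorized as Data Registers
        self.irqs = list(irqs)                     # interrupt numbers to fire in turn
        self.interval = interval                   # basic blocks between two interrupts
        self.blocks = 0
        self.next_irq = 0

    def read(self, address):
        """Answer an MMIO read; Data Registers consume one byte of fuzz input."""
        if address in self.data_registers:
            if self.pos >= len(self.data):
                raise InputExhausted()
            value = self.data[self.pos]
            self.pos += 1
            return value
        return 0  # placeholder for the category-specific handling of other registers

    def on_basic_block(self):
        """Return an interrupt number every `interval` blocks, otherwise None."""
        self.blocks += 1
        if self.irqs and self.blocks % self.interval == 0:
            irq = self.irqs[self.next_irq]
            self.next_irq = (self.next_irq + 1) % len(self.irqs)
            return irq
        return None
```

The emulator would call `on_basic_block` once per executed block and end the run when `InputExhausted` is raised, at which point the collected coverage goes back to the fuzzer.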
For automatic emulation of DMA input channels in P2IM, Mera et al. (2020) present the drop-in solution DICE. It observes the behavior of running firmware in the emulator and recognizes candidates for DMA input channels heuristically. In principle, it searches for pointers to the internal RAM that are written to memory-mapped I/O registers. The authors claim that, during their tests, DICE did not create any false positive categorization and successfully detected 21 out of 22 actively used DMA input channels. With negligible overhead, it enables fuzzing of firmware that processes DMA input without further hardware knowledge. Johnson et al. (2021) present a more targeted peripheral modeling approach with Jetset. In this case, an analyst manually defines a goal address in the firmware that should be reached, and Jetset tries to derive the necessary hardware peripheral responses to reach this address with symbolic execution. For instance, the transition from kernel space to user space can be used as such a goal address. The explicit goal address allows Jetset to mitigate path explosion during symbolic execution. Zhou et al. (2021) enable peripheral modeling in their tool µEmu by mixing symbolic and concrete execution to calculate appropriate responses to hardware accesses. First, all inputs that depend on hardware peripherals are treated symbolically. To avoid path explosion, symbolically calculated values are cached and reused during concrete execution. When invalid execution states are reached, the responsible cached values and the state itself are marked as invalid and different paths are taken by future symbolic executions. This way, the hardware peripheral models are enhanced iteratively. Scharnowski et al. (2020) refine the mechanism of P2IM. Instead of putting a memory-mapped register into a category, their framework Fuzzware handles each individual access to a memory-mapped register by additionally considering the program counter on each access.
On the first occurrence of an access, the emulator is reset to the instruction right before accessing the memory-mapped register and Dynamic Symbolic Execution (DSE) is used to determine whether and how the value affects the further execution. Accordingly, the individual memory-mapped register access is assigned just enough random input bits to ensure that all dependent branches can be reached. This leads to a minimal consumption of input bits from the fuzzer while fuzzing the whole peripheral interaction. The authors claim that DMA could also be modeled with further effort, but this is considered out of scope of their work.
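The idea of assigning each access just enough input bits can be illustrated with a small sketch. It is not Fuzzware's implementation: the bit mask that states which bits of a register value influence later branches is supplied by hand here, whereas the text above attributes that analysis to DSE, and the class names, the default 32-bit mask, and the addresses in the usage note are assumptions.

```python
class BitStream:
    """Hands out fuzz input bit by bit, lowest bit of each byte first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # number of bits consumed so far

    def take(self, count):
        value = 0
        for _ in range(count):
            byte_index, bit_index = divmod(self.pos, 8)
            if byte_index >= len(self.data):
                raise EOFError("fuzz input exhausted")
            bit = (self.data[byte_index] >> bit_index) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


def scatter(bits, mask):
    """Place the low bits of `bits` into the set positions of `mask`, lowest first."""
    result = 0
    position = 0
    while mask:
        if mask & 1:
            if bits & 1:
                result |= 1 << position
            bits >>= 1
        mask >>= 1
        position += 1
    return result


class AccessModels:
    """One model per (program counter, register address) pair."""

    def __init__(self, stream):
        self.stream = stream
        self.masks = {}

    def set_mask(self, pc, address, mask):
        self.masks[(pc, address)] = mask

    def read(self, pc, address):
        mask = self.masks.get((pc, address), 0xFFFFFFFF)  # unknown access: full word
        width = bin(mask).count("1")                      # bits that actually matter
        return scatter(self.stream.take(width), mask)
```

If the firmware at a hypothetical program counter `0x100` only tests bit 0 of the register at `0x40000000`, then `set_mask(0x100, 0x40000000, 0x1)` makes each such read consume a single input bit instead of a full 32-bit word.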

Sandbox emulation fuzzing
In cases where a full-system emulation is not feasible, lightweight sandbox emulation can be a solution. Thereby, the binary code is executed from a manually chosen point with a manually created context. The idea is to fuzz functions that do not communicate with peripherals at all, meaning that the hardware peripherals do not need to be emulated. This technique is almost hardware-independent since only a simulator for the respective instruction set is required. Fuzzing a function from a binary firmware file within a sandbox can be realized as shown in Fig. 7.
Miasm is a reverse engineering tool to analyze, modify, and partially emulate binary programs. It offers features such as assembling and disassembling for various architectures, emulation with Just-In-Time (JIT) compilation, and symbolic execution. In combination with Python-AFL, Miasm can be used to perform fuzzing (Guedou 2017). To this end, a sandbox is created by Miasm, input data needs to be mapped to appropriate memory addresses, and registers need to be initialized correctly. This technique is mainly interesting for penetration testers, who reverse engineer binaries and want to perform fuzzing of interesting functions in this way. If the source code is available, it is easier to perform fuzzing of hardware-independent functions by compiling them into a user application and using a general purpose fuzzer.
The Unicorn CPU Simulator (Nguyen and Dang 2015) was used by Voss (2021) in a similar way. Maier et al. (2020) present BaseSAFE, where they also used the Unicorn CPU Simulator to fuzz different layers of a smartphone baseband chip on manually selected target functions and manually created memory contexts. The downside of these sandbox emulation fuzzing approaches is the constrained, manual selection of the target function and manual creation of the execution context.
A semi-automated approach of supplying an execution context to the target code is presented by Harrison et al. (2020) with their tool PartEMU. They present the steps that allow experts to set up and configure an emulator to enable dynamic analysis of TrustZones from embedded systems. They explain when hardware and software components should be emulated or reused, and how specific emulation stubs can be implemented. Nevertheless, developing such an emulation-based execution context can involve huge manual effort and requires expert knowledge. Ruge et al. (2020) present Frankenstein, a highly specialized framework for fuzzing wireless modem firmware in an emulated environment. They run the firmware of a Broadcom Bluetooth chip within QEMU user mode. Through sophisticated reverse engineering, about 100 locations in the code have been determined where the execution needs to be redirected and substituted manually. This hooking is required to ensure correct emulation of the firmware. With this setup, they were able to fuzz the Bluetooth modems of popular mobile phones from Apple and Samsung and unveiled several security problems. However, the setup is highly customized and requires a lot of manual effort to adapt it to other embedded firmware.
An automated sandbox-based fuzzing tool for IoT firmware is presented by Gui et al. (2020) with FIRMCORN. First, the firmware image is disassembled and detected functions are rated based on the memory operations they contain and the use of predetermined sensitive functions, such as read, strcpy, and execve. For highly rated functions, a context dump (memory and register values) at the starting point of the function is gathered from the actual device. This allows specific fuzzing of potentially vulnerable functions within the CPU emulator Unicorn. An automated mechanism detects crashes of the emulator, which result from missing emulated hardware, and skips these crashing functions during further virtual execution. They state that the tool is developed for Linux-based devices only, but it should be possible to extend it to further platforms.
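The rating step described above can be pictured with a short sketch. The scoring scheme is an assumption made for illustration and is not taken from FIRMCORN: each memory instruction adds one point, each call to one of the sensitive functions named in the text adds a fixed weight, and the mnemonic set assumes ARM-style disassembly given as (mnemonic, operand) pairs.

```python
SENSITIVE_CALLS = {"read", "strcpy", "execve"}    # functions named in the text
MEMORY_MNEMONICS = {"ldr", "str", "ldm", "stm"}   # assumption: ARM-style loads and stores


def rate_function(instructions, call_weight=5):
    """Score one disassembled function; `call_weight` is a hypothetical tuning knob."""
    score = 0
    for mnemonic, operand in instructions:
        if mnemonic in MEMORY_MNEMONICS:
            score += 1
        elif mnemonic == "bl" and operand in SENSITIVE_CALLS:
            score += call_weight
    return score


def select_targets(functions, threshold):
    """Return the names of functions rated at or above `threshold`, best first."""
    rated = {name: rate_function(body) for name, body in functions.items()}
    chosen = [name for name, score in rated.items() if score >= threshold]
    return sorted(chosen, key=lambda name: -rated[name])
```

Functions selected in this way would then be the ones for which a context dump is collected from the device and which are fuzzed inside the CPU emulator.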

Abstraction-based execution environment
Symbolic execution has been known for several decades (King 1976) and, at first glance, does not seem to belong to the domain of fuzzing. It analyzes the target program independently from its execution environment. The core idea is to treat all input vectors of a program symbolically (similarly to a variable in a mathematical formula) and derive input constraints for all possible program paths. From these constraints, concrete inputs can be extracted that are known to trigger all possible program paths, which is exactly the goal of fuzzing.
However, for each conditional branch in a program, each possible path must be considered in different states. This can lead to the state explosion problem and usually prevents the use of pure symbolic execution in real-life applications.
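A toy example shows both the principle and the growth in the number of paths. It is a sketch under simplifying assumptions: the program is reduced to a sequence of branch conditions on a single 8-bit input, every branch is assumed to be reached on every path, and a brute-force search over the input domain replaces the constraint solver that a real symbolic execution engine would use.

```python
from itertools import product


def solve(constraints, domain=range(256)):
    """Return the first input satisfying all (predicate, wanted outcome) pairs, else None."""
    for candidate in domain:
        if all(predicate(candidate) == wanted for predicate, wanted in constraints):
            return candidate
    return None


def explore(branch_conditions):
    """Enumerate every combination of branch outcomes and find an input for each."""
    inputs = {}
    for outcomes in product([True, False], repeat=len(branch_conditions)):
        inputs[outcomes] = solve(list(zip(branch_conditions, outcomes)))
    return inputs
```

For n branch conditions, `explore` considers 2^n combinations of outcomes, so each additional branch doubles the work. Combinations for which `solve` returns `None` correspond to infeasible paths.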

Symbolic execution of embedded firmware
Symbolic execution does not execute the program code directly, but rather interprets it. It is therefore a good candidate for tackling the challenge of lacking hardware peripheral emulation. All values from hardware peripherals can be symbolized and possible program paths can be calculated. However, the more hardware values are symbolized, the more constraints and paths are present (usually growing exponentially). Davidson et al. (2013) implemented FIE, which allows symbolic execution of firmware for MSP430 microcontrollers by using a modified version of KLEE (Cadar et al. 2008). They assume that software of embedded systems is simple enough to allow symbolic execution. The target firmware is compiled into a representation that can be symbolically executed with KLEE. FIE includes two notable optimizations: state pruning and memory smudging. State pruning detects whether the current state has already been reached before and prunes it instead of adding it to the set of active states. Memory smudging makes it possible to avoid an intractable state, e.g., an infinite loop with an increment inside. In this case, state pruning cannot work because the states are not equivalent due to the incremented variable. Memory smudging sets a threshold for consecutive states that differ only in one memory location. Corteggiani et al. (2018) present Inception, a symbolic execution engine for embedded firmware, also based on the KLEE engine. They added a mechanism to symbolically execute assembly code, which is commonly found in embedded firmware code. Additionally, they enable hardware access forwarding for retrieving concrete values from the actual hardware to reduce the symbolic input space.
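The interplay of state pruning and memory smudging can be sketched as follows. This is a simplified illustration and not FIE's implementation: a state is modeled as a dictionary from memory locations to concrete values, all states are assumed to share the same locations, and a smudged location is replaced by a wildcard marker, where FIE would use an unconstrained symbolic value. The threshold value is hypothetical.

```python
WILDCARD = "*"


class StateSet:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.seen = set()        # states reached so far
        self.smudged = set()     # locations replaced by the wildcard
        self.last = None
        self.streak_location = None
        self.streak = 0

    def add(self, memory):
        """Return True if the state is new and stays active, False if it is pruned."""
        memory = {loc: (WILDCARD if loc in self.smudged else val)
                  for loc, val in memory.items()}
        if self.last is not None:
            changed = [loc for loc in memory if memory[loc] != self.last.get(loc)]
            if len(changed) == 1 and changed[0] == self.streak_location:
                self.streak += 1
            elif len(changed) == 1:
                self.streak_location, self.streak = changed[0], 1
            else:
                self.streak_location, self.streak = None, 0
            if self.streak >= self.threshold:
                # Memory smudging: stop distinguishing values of this location.
                self.smudged.add(self.streak_location)
                memory[self.streak_location] = WILDCARD
        self.last = memory
        key = frozenset(memory.items())
        if key in self.seen:
            return False         # state pruning: already reached before
        self.seen.add(key)
        return True
```

For a loop that only increments a counter, the first few states are all distinct and would never be pruned. Once the counter location has changed alone for `threshold` consecutive states, it is smudged, subsequent iterations map to the same state, and pruning takes effect.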

Concolic execution of embedded firmware
Concolic execution refers to the combination of CONCrete and symbOLIC execution. In this case, traces are used to analyze reached conditions during a concrete execution, and related constraints are derived. These constraints can be used to generate new input data that exercises a different path of the code. This idea is also termed hybrid or concolic fuzzing.
Several general-purpose hybrid fuzzers, such as QSYM (Yun et al. 2018) and SymCC (Poeplau and Francillon 2020), are available, as well as frameworks that focus on concolic execution for embedded firmware. Herdt et al. (2019) present an approach to integrate a concolic testing engine with SystemC-based virtual prototypes for the RISC-V architecture. This is once again subject to all the requirements of virtual prototypes. Ai et al. (2020) propose a concolic execution approach for embedded devices that supports various architectures. They perform the concrete execution on the physical device and move the symbolic execution to the host via a debugging connection.
Although concolic execution is a promising method to test code, it faces challenges similar to those of other embedded fuzzers, because it relies on concrete program traces.
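The loop of concrete execution, trace collection, and constraint negation can be sketched with a toy example. It is an illustration under stated assumptions and not the design of any cited tool: the program reports its branch conditions through a `branch` callback, the input is a single 8-bit value, and a brute-force search stands in for the constraint solver. The `firmware` function is a hypothetical target.

```python
def run(program, value):
    """Execute the program concretely and record (predicate, outcome) per branch."""
    trace = []

    def branch(predicate):
        outcome = predicate(value)
        trace.append((predicate, outcome))
        return outcome

    program(branch)
    return trace


def solve(constraints, domain=range(256)):
    """Brute-force stand-in for a constraint solver."""
    for candidate in domain:
        if all(predicate(candidate) == wanted for predicate, wanted in constraints):
            return candidate
    return None


def concolic(program, seed):
    """Explore paths by negating one recorded branch at a time."""
    worklist, covered = [seed], set()
    while worklist:
        value = worklist.pop()
        trace = run(program, value)
        path = tuple(outcome for _, outcome in trace)
        if path in covered:
            continue
        covered.add(path)
        for index, (predicate, outcome) in enumerate(trace):
            flipped = trace[:index] + [(predicate, not outcome)]
            candidate = solve(flipped)
            if candidate is not None:
                worklist.append(candidate)
    return covered


def firmware(branch):
    """Hypothetical target with a nested condition that random input rarely hits."""
    if branch(lambda x: x > 100):
        if branch(lambda x: x % 7 == 0):
            return "deep"
    return "shallow"
```

Starting from the seed 0, `concolic(firmware, 0)` reaches all three paths of the target. Every step depends on a concrete trace, which on an embedded target has to come from an emulator or the device itself, and this is the dependency the preceding paragraph refers to.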

Reviewing embedded fuzzing works
A summary of the relevant embedded fuzzing works is given in Table 2.

Taxonomy criteria
This section summarizes the criteria used to cluster the relevant embedded fuzzing works.
The columns in Table 2 show what we feel are the relevant elements of comparison for each work.
• Source Code Agnostic: This criterion indicates whether the fuzzer needs the source code of the SUT to run, which is a major factor for many application scenarios.
The rows in Table 2 categorize the works based on the execution environment. The categories are as follows.
• Hardware-based
• Emulation-based
• Abstraction-based
Overall, the wide variety of approaches in Table 2 demonstrates the diversity in the steadily growing research field of embedded fuzzing. Therefore, devising meaningful categories for the existing approaches in order to effectively group the rows in Table 2 requires care and consideration of existing attempts.
Notably, general principles for evaluating and benchmarking traditional fuzzers exist, as proposed by Klees et al. (2018). Fuzzers should be tested against a large set of benchmark programs, such as CGC (Cyber grand challenge 2014) or LAVA-M (Dolan-Gavitt et al. 2016), multiple times for at least 24 hours, with the performance plotted over time. The performance should ideally be measured in the number of detected bugs. The reached code coverage can be used as a secondary performance measure. Additionally, different sets of seeds should be considered and documented. Arguably, a transfer of these principles to embedded fuzzers would be useful. However, current research on embedded fuzzing still faces more fundamental issues of portability and scalability, namely enabling a fuzzing approach over the widest possible variety of embedded systems of any complexity. Wright et al. (2021) propose to compare different re-hosting frameworks particularly with regard to the amount of user interaction needed for the setup, termed application effort. The application effort refers to the ease of adapting a framework to new targets. Preferably, a framework can be adapted with little knowledge of the target and low configuration effort. It could be measured as an estimate of the time needed for the setup, but this would heavily depend on the developer, thus making the results highly subjective.
In light of the existing classification attempts, we feel that the relatively young field of embedded fuzzing may currently be partitioned most beneficially on the basis of how the execution environment is served to the SUT. Therefore, we build three essential categories: hardware-based approaches for those that use the very hardware of the SUT to operate, emulation-based approaches for those that re-host the firmware of the SUT into an emulator, and abstraction-based approaches for those that abstract away the details of the hardware. We further classify each category according to finer observations. Hardware-based approaches let the target software run in its designated environment. Therefore, we decide to further divide these approaches on the basis of how they gather feedback from the hardware about the execution of the software. Thus the hardware category features the three sub-categories Instrumentation, Side-Channel, and Message Interface Reusing.
A defining feature of emulation-based approaches is the way they treat hardware peripheral accesses. Accordingly, we define the five sub-categories User Mode Emulation, Full-System Emulation, Peripheral Proxying, Peripheral Modeling, and Sandboxing.
The last category features abstraction-based approaches, with the two sub-categories Symbolic Execution and Concolic Execution for enabling the abstraction process. It should be noted that concolic approaches usually need traces from the execution environment and therefore a concrete execution environment, but (manually) selected input vectors can be made symbolic. Therefore, we decide to keep these with abstraction-based approaches.

Discussion and future directions
Desktop user programs communicate via well-defined syscalls and run in their own virtual address space. Therefore, fuzzing such programs can benefit from different flavours of feedback and sanitizing options. Similarly, well-defined target constraints and boundaries are present for hardware fuzzing. Hardware designs are usually represented in HDLs, on which hardware fuzzing approaches can be based (Trippel et al. 2021; Laeufer et al. 2018). In between, embedded fuzzing faces a much less precisely specified environment. Generalized statements about interfaces, the environment, and other circumstances cannot be made for embedded applications. In fact, an embedded program is an accumulation of machine code instructions that only function properly together with their intended environment and the assumptions made about it. This is why, despite the growing attention and proliferation of embedded systems, the research field of embedded fuzzing still lacks generic solutions. Even comparing different tools remains a big challenge. It would seem that most tools are evaluated on a small set of targets, chosen by the authors themselves, whereas it would be useful to devise public, independent benchmarks.
The effectiveness of embedded fuzzers can only be evaluated when testing can be performed on a large collection of test subjects. A benchmarking suite for embedded fuzzers may consist of open-source embedded firmwares in conjunction with appropriate hardware peripheral emulation solutions. In this way, different fuzzing strategies can be evaluated on embedded systems instead of relying on the ones that are developed for user applications.
Furthermore, the different characteristics of embedded systems in contrast to user applications should be considered. Traditional fuzzing originates from quickly terminating data processing applications. Embedded systems, on the other hand, are continuously running systems that usually do not terminate after processing a single input. If the internal state of a system changes during sequences of inputs, it is called stateful. Recently, several fuzzers for stateful software have been proposed (Yu et al. 2019; Pham et al. 2020; Natella 2021; Schumilo et al. 2021). In particular, Pham et al. (2020) showed that stateful programs, like network servers, have to be fuzzed with awareness of their state to be efficient. Since embedded systems typically are stateful, stateful embedded fuzzing approaches are needed as well.
Most reviewed papers are emulation-based, and emulators currently seem to be the preferred way of enabling embedded fuzzing. Besides the advantages mentioned, there is always the disadvantage of lower fidelity, which makes it necessary to validate all found bugs on the actual hardware or at least on an accurate model of it. This process may be automated by putting the actual device in the loop and testing input candidates directly.
The other disadvantage of emulators is the setup and configuration effort required to imitate the whole execution environment. However, with the actual hardware, there is an environment already present in which the embedded software runs as expected. Therefore, we see more research potential in performing fuzzing on the actual hardware and extracting feedback from existing functionalities, e.g., debug interfaces. Common embedded debugging tools from Lauterbach (Lauterbach 2021) or SEGGER (Segger 2021) provide real-time tracing mechanisms for a wide variety of microcontrollers, which may be used for fuzzing feedback.
Another, albeit rarely addressed, aspect is that an embedded system has multiple interfaces that can be highly entangled. Further research is needed to consider the whole system, and not only individual functions, interfaces, or processes, while fuzzing. Such a fuzzer could fuzz on multiple interfaces simultaneously while observing the whole system. Multiple fuzzers or harnesses would need to synchronize their observations, similarly to ensemble fuzzing.
Recently, plenty of automated peripheral modeling approaches, such as P2IM (Feng et al. 2020) and Fuzzware (Scharnowski et al. 2020), have been proposed.
For now, they seem to target rather simple embedded systems. Since they need to model all hardware peripherals that are accessed by the firmware, the approaches do not scale well for more complex systems. Nevertheless, automated peripheral modeling remains one of the most promising methods to enable generic embedded fuzzing. Further research in this area could also enable emulation-based fuzzing with low application effort for more complex embedded systems. Another option could be to design generic and reusable HALs to ease re-hosting and enable efficient fuzz testing of hardware-related code. Moreover, as highlighted by Boehme et al. (2020) for traditional fuzzing, we also advocate a larger scope for embedded fuzzers, which should identify a range of vulnerabilities, such as information and timing leakages, and not just bugs.
Future research and tools should aim to unite existing techniques in an embedded ensemble fuzzing framework in order to eliminate their current, individual disadvantages. In addition, such a framework should be cross-architecture, state-aware, and compatible with emulated and real devices. Embedded fuzzing should consider the whole system in all its details.

Related work
Detailed summaries of the challenges of fuzzing embedded systems and security analysis of embedded systems (Fasano et al. 2021; Wright et al. 2021) have been published. However, these reviews concentrate almost solely on emulation-based approaches. We agree that emulation-based approaches are on the rise, but to get the whole picture of embedded fuzzing, hardware-based approaches in all their facets need to be considered, too. We aim to draw such a complete picture and particularly want to highlight the diversity and creativity of the reviewed methods in this article.

Conclusion
This article reviewed the current state of the art of embedded fuzzing. To structure the field, we proposed a formal definition of embedded fuzzing and suggested a taxonomy for it. We carved out the additional challenges of embedded fuzzing compared to the research field of traditional fuzzing. Furthermore, we showed that no easily applicable solution for embedded fuzzing exists. As traditional fuzzing has already found numerous vulnerabilities in non-embedded software, efficient and easily applicable embedded fuzzing would increase the security and integrity of the ubiquitous embedded systems people interact with every day.