EuroP4 '22: Proceedings of the 5th International Workshop on P4 in Europe

Full Citation in the ACM Digital Library

SESSION: Security

BACKORDERS: using random forests to detect DDoS attacks in programmable data planes

Networks and the services they support form the communication backbone of our society, and it is important that potential Distributed Denial of Service (DDoS) attacks are detected quickly, in order to avoid or minimize the impact they may have on the availability of services. Recent technological advances in programmable networks - specifically the programmability of data planes in switches and routers, have made available new ways of detecting such attacks. By relying on this newfound possibility, this paper proposes the utilization of a Random Forest (RF) to aid in quickly and accurately detecting DDoS attacks in a programmable switch. Random forests utilize several classification trees, each of them for independently classifying an input as one of a set of classes. Here, each decision tree will classify a network flow as potentially malicious, i.e. part of a DDoS attack, or a legitimate user flow. Despite utilizing multiple classification trees to improve accuracy, random forests are relatively lightweight, with each tree requiring few and simple computations to arrive at a classification. Our results show that even small RFs, requiring as few as 63 match+action table entries, can achieve F1-Scores of over 90%.

A P4-based content-aware approach to mitigate slow HTTP POST attacks

A slow HTTP POST attack is an application-layer distributed denial-of-service attack targeting web servers. The attacker simulates a legitimate user with a slow network speed and continues to send requests, resulting in server resources being unavailable for a long time to other users. The similarity to legitimate behavior makes it challenging to identify such attack traffic. To address this issue, this paper proposes a responsive defense mechanism that exploits programmable network devices to identify attack traffic based on HTTP headers. With information that is not available from legacy network devices, this method can identify different types of requests and apply limitations. This approach achieves a distributed, source-based defense capability by utilizing data plane programmability, making it a scalable solution. The simulation results show that the approach is effective and accurate against slow HTTP POST attacks.

On implementing ChaCha on a programmable switch

This paper presents an implementation of a practical cryptographic primitive based on ChaCha on a Tofino programmable switch. A key challenge is optimizing the implementation by leveraging the structure of ChaCha operations and hardware features of Tofino. Our implementation outperforms the AES-based approach in terms of performance and small memory footprint and achieves up to 203 Gbps of throughput.

SESSION: Architectures and language

Reducing P4 language's voluminosity using higher-level constructs

Over the last years, P4 has positioned itself as the primary language for data-plane programming. Despite its constant evolution, the P4 language still "suffers" from one significant limitation: the voluminosity of its code. P4 applications easily reach thousands of lines of code, becoming hard to develop, debug, and maintain. The reason is twofold: P4 requires many characters to express individual concepts (verbosity), and it relies on code repetition (lack of parametrization).

Today, P4 users overcome this limitation by relying on templating tools, hand-crafted scripts, and complicated macros. Unfortunately, these methods are not optimal: they make the development process difficult and do not generalize well beyond one codebase.

In this work, we propose reducing the voluminosity of P4 code by introducing higher-level language constructs. We present O4, an extended version of P4, that includes three such constructs: arrays (which group same-type entities together), loops (which reduce simple repetitions), and factories (which enable code parametrization).

We evaluate O4 on several state-of-the-art programs and show how, with respect to P4: (i) it reduces code volumes by up to 80%, (ii) it decreases code verbosity by 44% on average, and (iii) it cuts duplicated code by 60%. We contribute a compiler implementation that provides said benefits with just a 3.5% increase in compilation time.

Compiling packet programs to dRMT switches: theory and algorithms

A critical step in P4 compilation is finding an efficient mapping of the high-level P4 source code constructs to the physical resources exposed by the underlying hardware, while meeting data and control flow dependencies in the program. In this paper, we take a new look at the algorithmic aspects of this problem, with the motivation to understand the fundamental theoretical limits and obtain better P4 pipeline embeddings in the dRMT (disaggregated Match-Action Table) switch architecture. We report mixed results. We find that optimizing P4 program embedding for maximizing throughput is computationally intractable even when some architectural constraints are relaxed, and there is no hope for a tractable approximation with arbitrary precision unless P = NP. At the same time, we find that the maximal throughput embedding is approximable in quasi-linear time with a small constant bound. Our evaluations show that the proposed algorithm outperforms the heuristics of prior work both in terms of throughput and compilation speed.

In-network fractional calculations using P4 for scientific computing workloads

Recent P4 research has motivated the need for in-network fractional calculations to support functions in Networking (for calculations related to active queue management and load balancing) and in Machine Learning. The P4 language and ASICs do not natively support fractional types (e.g., float). Existing P4 techniques provide incomplete emulations of the IEEE-754 standard, which was designed as a generic approach that can benefit from dedicated hardware acceleration, but whose features are difficult to fully support in P4.

This paper re-thinks the foundation of in-network fractional calculation and proposes a new approach that is more resource conscious and is straightforward to encode in P4. Instead of floating-point, it uses a fixed-point encoding of numerals; and instead of sampling functions into tables it uses Taylor Approximation to reduce data-plane calculations to simple arithmetic over pre-calculated coefficients, requiring constant space and linear time. The paper describes and evaluates a P4 code synthesis algorithm that allows users to trade-off switch resources for accuracy, grounded on an application of a well-understood mathematical theory. It describes how to encode π and various functions including cos, log and exp.

This technique is being developed to support Scientific Computing (SC) applications which typically make heavy use of fractional approximations of Real numbers. The paper applies this technique in a novel P4 program that is being open-sourced: in-network Monte Carlo simulation of photon propagation that models the analysis that is carried out in a class of cancer treatments. This technique is also being used in ongoing work on another SC application: online event detection in a large-scale neutrino detection experiment.

HOL4P4: semantics for a verified data plane

We introduce a formal semantics of P4 for the HOL4 interactive theorem prover. We exploit properties of the language, like the absence of call by reference and the copy-in/copy-out mechanism, to define a heapless small-step semantics that is abstract enough to simplify verification, but that covers the main aspects of the language: interaction with the architecture via externs, table match, and parsers. Our formalization is written in the Ott metalanguage, which allows us to export definitions to multiple interactive theorem provers. The exported HOL4 semantics allows us to establish machine-checkable proofs regarding the semantics, properties of P4 programs, and soundness of analysis tools.

SESSION: Monitoring and applications

Causal network telemetry

Current approaches to network observability rely on techniques like active probing, packet sampling, and path-level telemetry, which only provide a partial view. This paper presents causal telemetry, a new model that adapts ideas from distributed systems to the network setting. Causal telemetry captures causal relationships between events, including those that take place on physically separated devices. We motivate causal telemetry through examples, we show how it can be used to diagnose anomalies and faults, and we present algorithms for constructing the needed causal graphs from network executions. We develop a P4-based prototype implementation, CoCaTel, and discuss a case study that uses causal telemetry to detect Priority-Based Flow Control (PFC) deadlocks.

Innovative network monitoring techniques through in-band inter packet gap telemetry (IPGNET)

Network monitoring is a fundamental task for proper network troubleshooting and performance management. Recently, in-band Network Telemetry (INT) has been demonstrated as a powerful and efficient network monitoring framework. Using INT, network information hop-by-hop can be collected directly from the data plane by gathering this information in the production traffic. However, INT data collection is limited by available packet size and processing overhead, making it critical to choose what data to collect and when to collect it. In this demo, we propose the In-band Inter Packet Gap Network Telemetry (IPGNET) per-hop monitoring. We argue that by monitoring the IPG hop-by-hop, it is possible to correlate the data and identify: (i) Network problems like congestion and delays, finding their root cause, and (ii) Microbursts and their contributing flows. Our preliminary results show that IPGNET can detect microbursts on multiple queues and report all the contributing flows with high efficiency in terms of control/data plane overhead.

Sketch-based entropy estimation: a tabular interpolation approach using P4

This work presents the implementation of a tabular interpolation approach to estimate empirical Shannon entropy on programmable data plane ASICs using P4. The technique transforms the complex computations of the random projection into fast lookup over pre-computed tables in the match-action pipeline. Likewise, the interpolation heuristic further reduces the table size substantially. Thus, more tables can be accommodated, achieving higher estimation accuracy. Simulations based on real-world network traffic traces are performed to evaluate the estimation accuracy. The scheme is deployed in a Barefoot Tofino2 switch connected to the International Center for Advanced Internet Research (iCAIR) national testbed. The system can estimate the entropy of network traffic accurately at 400 Gbps throughput.

In-network angle approximation for supporting adaptive beamforming

There is a great interest in utilizing P4 for in-network computing along with programmable data planes. This use is emerging as a new network paradigm that can not just reduce the complexity but the delay as well. Beamforming is now an integral feature of modern wireless communication systems and its implementation calls for an accurate beam alignment by estimating the direction of signal arrival. However, this estimation is computationally complex, especially in a dynamic environment where a user is constantly on the move. In this paper, we propose a user-assisted in-network method to optimally approximate the angle of arrival by segmenting the cell area into an exponentially binned grid and make use of the advantages offered by programmable data planes and their match-action table (MAT) logic. The method expects location messages periodically reported by user equipment, processes them in the network and reconfigures the base station antennas accordingly, implementing user-assisted in-network beam control. The proposed method is implemented in P4 and runs on a Tofino ASIC. Our evaluation proves a theoretical bound on the absolute error of the proposed MAT-based angle approximation and shows that it is in accordance with the empirical error distributions. Moreover, there is no significant increase in errors attributed to the latency of various control cycle times (less than 100ms) and the user's movement at moderate speeds (of less than 90km/h.) We also show that the resource usage is only affected by the size of the TCAM table used to store the angle approximation values and that the proposed method has no significant per-stage resource usage on the pipeline.

Distributed DNN serving in the network data plane

Programmable networks have received tremendous attention recently. Apart from exciting network innovations, in-network computing has been explored as a means to accelerate a variety of distributed systems concerns, by leveraging programmable network devices. In this paper, we extend in-network computing to an important class of applications called deep neural network (DNN) serving. In particular, we propose to run DNN inferences in the network data plane in a distributed fashion and make our programmable network a powerful accelerator for DNN serving. We demonstrate the feasibility of this idea through a case study with a real-world DNN on a typical data center network architecture.