Publications

Introducing multi-layer concatenation as a scheme to combine information in water distribution cyber-physical systems

Côme Frappé-Vialatoux · Pierre Parrend

Water Distribution
Cyber-Physical Systems
Machine Learning
Cyber-Security

As water distribution infrastructures age, their modernization increasingly incorporates connected devices into these physical systems. This transition is changing the nature of water distribution control systems from physical systems to cyber-physical systems (CPS). However, this evolution also brings an increased vulnerability to cyber-attacks. Detecting such attacks in CPS is gaining traction in the scientific community with the recent release of cyber-physical datasets that simultaneously capture the network traffic and the physical state of a water distribution testbed. The joint availability of these two types of data from a common source infrastructure raises a new question: how should their information be combined when training machine learning models for attack detection? As an alternative to previous approaches that rely on model aggregation, this paper introduces Multi-Layer Concatenation, a combination scheme that merges the information from the physical and network parts of a CPS from a data perspective, through a time-based join operation coupled with a propagation process that preserves the coherence of the global system. Its benefits for machine learning-based detection are assessed on three cyber-physical datasets, by measuring machine learning models’ performance on physical and network data separately, and then on data combined through the proposed scheme.
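The paper's exact merging scheme is not reproduced here; the following is a minimal sketch of the core idea, a time-based join followed by forward propagation of the last known physical state, using hypothetical column names and pandas' `merge_asof`:

```python
import pandas as pd

# Hypothetical physical-sensor readings (timestamps in seconds).
physical = pd.DataFrame({
    "time": [0.0, 1.0, 2.0, 3.0],
    "tank_level": [4.2, 4.1, 3.9, 3.8],
})

# Hypothetical network-traffic features, sampled at different instants.
network = pd.DataFrame({
    "time": [0.4, 1.6, 2.5],
    "packet_count": [120, 340, 95],
})

# Time-based join: each network record picks up the most recent
# physical state; forward-filling then propagates the last known
# value so the merged view stays coherent between physical samples.
merged = pd.merge_asof(network, physical, on="time", direction="backward")
merged = merged.ffill()
```

The `backward` direction matches each network record with the latest physical sample at or before it, which mimics how a physical state remains valid until the next reading arrives.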

Translation of semi-extended regular expressions using derivatives

Antoine Martin · Etienne Renault · Alexandre Duret-Lutz

We generalize Antimirov’s notion of the linear form of a regular expression to the Semi-Extended Regular Expressions typically used in the Property Specification Language or SystemVerilog Assertions. Doing so requires extending the construction to handle more operators, and dealing with expressions over alphabets $\Sigma=2^{AP}$ of valuations of atomic propositions. Using linear forms to construct automata labeled by Boolean expressions suggests heuristics that we evaluate. Finally, we study a variant of this translation that produces automata with accepting transitions: this construction is more natural and yields smaller automata.
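For background, here is a minimal sketch of the classical Brzozowski derivative for plain regular expressions, the construction that Antimirov's linear forms refine; the paper's extension to semi-extended operators and alphabets $\Sigma=2^{AP}$ is not reproduced:

```python
from dataclasses import dataclass

# Minimal regular-expression AST: empty language, empty word,
# single symbol, concatenation, union, and Kleene star.
@dataclass(frozen=True)
class Empty: pass
@dataclass(frozen=True)
class Eps: pass
@dataclass(frozen=True)
class Sym:
    c: str
@dataclass(frozen=True)
class Cat:
    l: object
    r: object
@dataclass(frozen=True)
class Alt:
    l: object
    r: object
@dataclass(frozen=True)
class Star:
    e: object

def nullable(e):
    """True iff the expression accepts the empty word."""
    if isinstance(e, (Eps, Star)):
        return True
    if isinstance(e, Cat):
        return nullable(e.l) and nullable(e.r)
    if isinstance(e, Alt):
        return nullable(e.l) or nullable(e.r)
    return False  # Empty and Sym

def deriv(e, a):
    """Brzozowski derivative: the expression accepting words w
    such that aw is accepted by e."""
    if isinstance(e, Sym):
        return Eps() if e.c == a else Empty()
    if isinstance(e, Cat):
        left = Cat(deriv(e.l, a), e.r)
        return Alt(left, deriv(e.r, a)) if nullable(e.l) else left
    if isinstance(e, Alt):
        return Alt(deriv(e.l, a), deriv(e.r, a))
    if isinstance(e, Star):
        return Cat(deriv(e.e, a), e)
    return Empty()  # Empty and Eps

def matches(e, word):
    """Match by repeated derivation, then test nullability."""
    for a in word:
        e = deriv(e, a)
    return nullable(e)
```

Translating expressions to automata with derivatives amounts to taking the derived expressions as states; linear forms compute all derivatives of an expression at once.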

Koopman ensembles for probabilistic time series forecasting

Anthony Frion · Lucas Drumetz · Guillaume Tochon · Mauro Dalla Mura · Abdeldjalil Aïssa El Bey

With the increasing popularity of data-driven models for representing dynamical systems, many machine learning-based implementations of the Koopman operator have recently been proposed. However, the vast majority of these works are limited to deterministic predictions, while knowledge of uncertainty is critical in fields like meteorology and climatology. In this work, we investigate the training of ensembles of models to produce stochastic outputs. We show, through experiments on real remote sensing image time series, that ensembles of independently trained models are highly overconfident, and that using a training criterion that explicitly encourages the members to produce predictions with high inter-model variance greatly improves the uncertainty quantification of the ensembles.
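The paper's training criterion is not reproduced here; as a hedged sketch, one standard way to make inter-model variance enter the objective is to score the ensemble with a Gaussian negative log-likelihood built from the members' mean and variance, which penalizes confident (low-variance) ensembles that are wrong:

```python
import numpy as np

def ensemble_nll(preds, target, eps=1e-6):
    """Gaussian negative log-likelihood of the target under the
    ensemble mean and inter-model variance.
    preds: (n_members, n_points) array of member predictions.
    A tight but inaccurate ensemble is punished harder than a
    spread-out one covering the target.
    """
    mu = preds.mean(axis=0)
    var = preds.var(axis=0) + eps
    nll = 0.5 * (np.log(2 * np.pi * var) + (target - mu) ** 2 / var)
    return nll.mean()
```

Minimizing such a criterion jointly over the members rewards disagreement where the ensemble mean is inaccurate, which is the kind of behaviour the abstract describes.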

Weakly supervised training for hologram verification in identity documents

Glen Pouliquen · Guillaume Chiron · Joseph Chazalon · Thierry Géraud · Ahmad Montaser Awal

Know Your Consumer (KYC)
Identity Documents
Hologram Verification
Weakly Supervised Learning
Contrastive Loss

We propose a method to remotely verify the authenticity of Optically Variable Devices (OVDs), often referred to as “holograms”, in identity documents. Our method processes video clips captured with smartphones under common lighting conditions, and is evaluated on two public datasets: MIDV-HOLO and MIDV-2020. Thanks to weakly supervised training, we optimize a feature extraction and decision pipeline which achieves new leading performance on MIDV-HOLO, while maintaining a high recall on documents from MIDV-2020 used as attack samples. It is also the first method to date to effectively address the photo replacement attack, and it can be trained on genuine samples, attack samples, or both for increased performance. By enabling the verification of OVD shapes and dynamics with very little supervision, this work opens the way towards the use of massive amounts of unlabeled data to build robust remote identity document verification systems on commodity smartphones. Code is available at https://github.com/EPITAResearchLab/pouliquen.24.icdar.
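The released code at the URL above is the reference implementation; as an illustration of the contrastive loss listed in the keywords, here is its classical margin-based form (the embedding vectors and `same` label below are hypothetical inputs, not the paper's actual pipeline):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Margin-based contrastive loss on a pair of embeddings.
    Pulls together embeddings of matching pairs (same == 1) and
    pushes non-matching pairs (same == 0) at least `margin` apart.
    """
    d = np.linalg.norm(emb_a - emb_b)
    return same * d ** 2 + (1 - same) * max(margin - d, 0.0) ** 2
```

Losses of this family need only pair-level labels (same/different), which is what makes weakly supervised training with little annotation possible.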

Combining physical and network data for attack detection in water distribution networks

Côme Frappé-Vialatoux · Pierre Parrend

Cyber Physical System
Machine Learning
Cyberattack
Dataset
Water Distribution Network

Water distribution infrastructures are increasingly incorporating IoT, in the form of sensing and computing power, to improve control over the system and achieve greater adaptability to water demand. This evolution, from physical towards cyber-physical systems, extends the attack perimeter into cyberspace. Detecting this novel kind of attack is gaining traction in the scientific community. However, machine learning detection algorithms, which are showing encouraging results in cybersecurity applications, need training data as close as possible to real-world data in order to perform well in production environments. The availability of such data, with complexity levels on par with real-world infrastructures and with acquisitions from both the physical and cyber spaces, is a bottleneck for the development of machine learning algorithms. This paper addresses this problem by providing an analysis of the currently available cyber-physical datasets in the water distribution field, together with a multi-layer comparison methodology to assess their complexity. This multi-layer approach to dataset complexity evaluation is based on three major axes, namely attack scenarios, network topology, and network communications, allowing a precise look at the strengths and weaknesses of available datasets across a wide spectrum. The results show that currently available datasets each emphasize one aspect of real-world complexity but lack the others, highlighting the need for a more global approach in future work to propose datasets with increased complexity on multiple aspects at the same time.

Neural Koopman prior for data assimilation

Anthony Frion · Lucas Drumetz · Mauro Dalla Mura · Guillaume Tochon · Abdeldjalil Aissa El Bey

With the increasing availability of large-scale datasets, computational power, and tools like automatic differentiation and expressive neural network architectures, sequential data are now often treated in a data-driven way, with a dynamical model trained from the observation data. While neural networks are often seen as uninterpretable black-box architectures, they can still benefit from physical priors on the data and from mathematical knowledge. In this paper, we use a neural network architecture which leverages the long-known Koopman operator theory to embed dynamical systems in latent spaces where their dynamics can be described linearly, enabling a number of appealing features. We introduce methods that make it possible to train such a model for long-term continuous reconstruction, even in difficult contexts where the data come as irregularly sampled time series. The potential for self-supervised learning is also demonstrated, as we show the promising use of trained dynamical models as priors for variational data assimilation techniques, with applications to, e.g., time series interpolation and forecasting.
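A minimal sketch of the forecasting primitive such a model provides: encode once, iterate a linear map in latent space, decode each step. Here the encoder and decoder are stubbed as identities and `K` is a hand-picked matrix; in the paper, both maps are learned neural networks.

```python
import numpy as np

def koopman_forecast(encode, decode, K, x0, n_steps):
    """Advance a state linearly in the latent (Koopman) space,
    z_{t+1} = K z_t, decoding each latent state to a prediction."""
    z = encode(x0)
    traj = []
    for _ in range(n_steps):
        z = K @ z
        traj.append(decode(z))
    return np.array(traj)

# Toy usage: a 90-degree rotation as the latent dynamics, with
# identity encoder/decoder standing in for learned networks.
K = np.array([[0.0, -1.0], [1.0, 0.0]])
traj = koopman_forecast(lambda x: x, lambda z: z, K,
                        np.array([1.0, 0.0]), 4)
```

Because the latent dynamics are linear, long-horizon forecasts reduce to matrix powers of `K`, which is what makes such models convenient priors for assimilation.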

Transforming gradient-based techniques into interpretable methods

Caroline Mazini-Rodriguez · Nicolas Boutry · Laurent Najman

Explaining Convolutional Neural Networks (CNNs) through xAI techniques often poses interpretation challenges. The inherent complexity of input features, notably pixels extracted from images, engenders complex correlations. Gradient-based methodologies, exemplified by Integrated Gradients (IG), effectively demonstrate the significance of these features. Nevertheless, converting these explanations into images frequently yields considerable noise. Here, we introduce GAD (Gradient Artificial Distancing) as a supportive framework for gradient-based techniques. Its primary objective is to accentuate influential regions by establishing distinctions between classes. The essence of GAD is to limit the scope of analysis during visualization and, consequently, to reduce image noise. Empirical investigations involving occluded images demonstrate that the regions identified by this methodology indeed play a pivotal role in facilitating class differentiation.
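GAD itself is not reproduced here; as a sketch of the gradient-based foundation it builds on, Integrated Gradients averages the gradient along the straight path from a baseline to the input (the model and its gradient below are illustrative stand-ins, not a CNN):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, n_steps=50):
    """Integrated Gradients: attribution_i equals
    (x_i - baseline_i) times the mean of dF/dx_i along the
    baseline-to-x path, approximated with the midpoint rule."""
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    grads = np.array([grad_f(baseline + a * (x - baseline))
                      for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Illustrative model F(x) = x0**2 + 3*x1 with its analytic gradient.
grad_f = lambda x: np.array([2.0 * x[0], 3.0])
attributions = integrated_gradients(grad_f, np.array([1.0, 2.0]),
                                    np.zeros(2))
```

For this toy model the attributions are [1.0, 6.0], and their sum equals F(x) - F(baseline) = 7, the completeness property of IG.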

Graph-based spectral analysis for detecting cyber attacks

Majed Jaber · Nicolas Boutry · Pierre Parrend

Cybersecurity
Graphs
Spectral Graph Analysis
Spectrum
Anomaly Detection
Laplacian Matrix

Spectral graph theory delves into graph properties through their spectral signatures. The eigenvalues of a graph’s Laplacian matrix are crucial for grasping its connectivity and overall structural topology. This research capitalizes on the inherent link between graph topology and spectral characteristics to enhance spectral graph analysis applications. In particular, such connectivity information is key to detecting the low signals that betray the occurrence of cyberattacks. This paper introduces SpectraTW, a novel spectral graph analysis methodology tailored for monitoring anomalies in network traffic. SpectraTW relies on four spectral indicators (Connectedness, Flooding, Wiriness, and Asymmetry), derived from network attributes and topological variations, which are defined and evaluated. The method interprets networks as evolving graphs, leveraging the Laplacian matrix’s spectral insights to detect shifts in network structure over time. The significance of spectral analysis becomes especially pronounced in the medical IoT domain, where the complex web of devices and the critical nature of healthcare data amplify the need for advanced security measures. Spectral analysis’s ability to swiftly pinpoint irregularities and shifts in network traffic aligns well with the medical IoT’s requirements for prompt attack detection.
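As a minimal illustration of the machinery the abstract relies on (not SpectraTW's actual indicators), the Laplacian spectrum of a traffic graph directly exposes connectivity: the multiplicity of the zero eigenvalue counts connected components, the kind of structural signal a Connectedness-style indicator can track.

```python
import numpy as np

def laplacian_spectrum(adj):
    """Sorted eigenvalues of the combinatorial Laplacian L = D - A.
    The number of (near-)zero eigenvalues equals the number of
    connected components; the second-smallest eigenvalue (algebraic
    connectivity) drops towards zero as the graph fragments."""
    adj = np.asarray(adj, dtype=float)
    L = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(L))

# A connected triangle vs. two disconnected edges.
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
two_edges = [[0, 1, 0, 0], [1, 0, 0, 0],
             [0, 0, 0, 1], [0, 0, 1, 0]]
```

Tracking such eigenvalues over snapshots of an evolving graph is the basic operation behind spectral monitoring of network traffic.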

The Quickref cohort

Didier Verna

The internal architecture of Declt, our reference manual generator for Common Lisp libraries, is currently evolving towards a three-stage pipeline in which the information gathered for documentation purposes is first reified into a formalized set of object-oriented data structures. A side-effect of this evolution is the ability to dump that information for other purposes than documentation. We demonstrate this ability applied to the complete Quicklisp ecosystem. The resulting "cohort" includes more than half a million programmatic definitions, and can be used to gain insight into the morphology of Common Lisp software.