IWSPA '22: Proceedings of the 2022 ACM on International Workshop on Security and Privacy Analytics


SESSION: Keynote Talk: Trust Architectures

SDP Based Zero-Trust Architectures

In an increasingly decentralized networked environment, the security controls provided by traditional perimeter security architectures are becoming ineffective. Software Defined Perimeter (SDP), an implementation of a Zero Trust Network Architecture (ZTNA), provides the ability to deploy perimeter-like security controls across any network. To achieve this, an SDP controller acts as a central point that manages endpoint agents on all nodes in the SDP. This paper introduces modifications to the SDP that incorporate real-time protocol monitoring (RTPM) to extend session security beyond the initial authentication phase. RTPM further enables scalable security and reliability management of the system.
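To make the controller/agent split and the RTPM extension concrete, the following minimal sketch (hypothetical names, a toy HMAC credential, and invented thresholds; not the authors' implementation) shows a controller that authorizes agents before granting access and then keeps re-checking protocol messages after authentication:

import hmac, hashlib, time

SHARED_KEY = b"demo-key"              # hypothetical pre-provisioned agent credential

def authorization_token(agent_id: str, timestamp: int) -> str:
    msg = f"{agent_id}:{timestamp}".encode()
    return hmac.new(SHARED_KEY, msg, hashlib.sha256).hexdigest()

class Controller:
    def __init__(self):
        self.sessions = {}            # agent_id -> session state

    def authenticate(self, agent_id: str, timestamp: int, token: str) -> bool:
        # Single-packet-authorization-style check: anything unauthenticated is dropped.
        fresh = abs(time.time() - timestamp) < 30
        valid = hmac.compare_digest(token, authorization_token(agent_id, timestamp))
        if fresh and valid:
            self.sessions[agent_id] = {"established": time.time(), "violations": 0}
        return fresh and valid

    def monitor(self, agent_id: str, message: dict) -> bool:
        # RTPM hook: session security does not end at authentication; every
        # protocol message is re-checked and the session is revoked on anomaly.
        session = self.sessions.get(agent_id)
        if session is None:
            return False
        if message.get("type") not in {"heartbeat", "data"}:
            session["violations"] += 1
        if session["violations"] > 3:
            del self.sessions[agent_id]   # tear down access mid-session
            return False
        return True

ctrl = Controller()
ts = int(time.time())
assert ctrl.authenticate("agent-1", ts, authorization_token("agent-1", ts))
assert ctrl.monitor("agent-1", {"type": "heartbeat"})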

SESSION: Session 1: Adversarial Machine Learning

Adversarial Robustness is Not Enough: Practical Limitations for Securing Facial Authentication

The current body of work on adversarial robustness seems to imply that theoretical robustness against adversarial examples leads to more secure systems. In this paper, we demonstrate that this premise is erroneous by assessing the strengths and limitations of prominent robustness methods in the context of facial authentication, a realistic use case where adversarial perturbations pose a real threat and which allows for a natural reflection on the security gained from the obtained robustness. The main contribution of this paper is an evaluation and critical reflection upon why prominent robustness methods fail to deliver a secure system despite living up to their promise of added robustness. Our analysis shows that state-of-the-art robustness methods such as Adversarial Training and Guided Complement Entropy struggle to accommodate two key requirements of facial authentication: (1) the threat model of facial authentication assumes physical adversarial examples that can be added to the scene, as opposed to "classical" adversarial examples that are applied to the digital input; moreover, an attacker who can directly perturb the digital input does not need adversarial perturbations to impersonate their victim; (2) robustness properties are only validated for standard classification problems and often ignore the impact of more practical training paradigms that re-purpose models. Our extensive evaluation of robustness in the context of facial authentication allowed us to pinpoint the limitations of these methods. To couple the concepts of adversarial robustness and security more tightly, we recommend evaluating new defences on applications where adversarial perturbations pose a security threat.

Enhancing Boundary Attack in Adversarial Image Using Square Random Constraint

An adversarial image is a sample with intentionally small perturbations that cause a deep learning model to classify the image incorrectly. In the image recognition field, adversarial images have become an attractive research topic because they can efficiently attack many state-of-the-art and even commercial models. The challenge for any deep learning model is how to find potentially sophisticated adversarial images and prepare proactive defences against adversarial attacks. Among the various existing adversarial attacks, the Boundary Attack, proposed in 2018, is one of the state-of-the-art attack methods due to its efficiency, extreme flexibility, simplicity, and high applicability in real-world settings. However, we found a significant drawback in the Boundary Attack: when randomizing the direction for the next perturbation, it draws from a Gaussian distribution over the entire image space. This discards useful statistical information about the models, such as their heavy use of convolutional layers. In this paper, we therefore investigate an enhancement to the Boundary Attack. In the perturbation direction randomization step, we restrict the perturbation to a square-shaped region in the geometric representation of the image. Compared to the existing randomization strategy, as described in more detail in Section 1.2, our approach can exploit the fact that most image recognition models are built from convolutional layers that capture image features in square patterns. We evaluated our proposed method on the well-known CIFAR-10 image dataset with a ResNet-v2 model. Our experimental results show that the proposed method reduces the dissimilarity between the adversarial image and the original image by 41.06% with the same number of queries.
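As an illustration of the difference between the two sampling strategies, the sketch below (our reading of the idea, with illustrative parameters; not the authors' code) contrasts the standard Gaussian direction over the whole image with a square-constrained direction confined to a random patch, inside one simplified Boundary Attack step:

import numpy as np

rng = np.random.default_rng(0)

def gaussian_direction(shape):
    # Standard Boundary Attack: Gaussian noise over the entire image space.
    eta = rng.normal(size=shape)
    return eta / np.linalg.norm(eta)

def square_constrained_direction(shape, side=8):
    # Square-constrained variant: Gaussian noise inside one side x side patch,
    # matching the square receptive fields of convolutional layers.
    h, w, c = shape
    eta = np.zeros(shape)
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    eta[top:top + side, left:left + side, :] = rng.normal(size=(side, side, c))
    return eta / np.linalg.norm(eta)

def boundary_step(adv, original, direction, delta=0.05, epsilon=0.01):
    # One simplified Boundary Attack step: a perturbation scaled to the current
    # distance, then a small contraction toward the original image; a real attack
    # would also query the model and keep the step only if it stays adversarial.
    candidate = adv + delta * np.linalg.norm(adv - original) * direction
    candidate = original + (1 - epsilon) * (candidate - original)
    return np.clip(candidate, 0.0, 1.0)

original = rng.random((32, 32, 3))   # stand-in for a CIFAR-10 image
adv = rng.random((32, 32, 3))        # stand-in for an adversarial starting point
adv = boundary_step(adv, original, square_constrained_direction(original.shape))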

Data Poisoning in Sequential and Parallel Federated Learning

Federated Machine Learning has recently become a prominent approach to leveraging data that is distributed across different clients without the need to centralize it. Models are trained locally, and only model parameters are shared and aggregated into a global model. Federated learning can increase the privacy of sensitive data, as the data itself is never shared, and it benefits from the distributed setting by utilizing the computational resources of the clients. Adversarial Machine Learning attacks machine learning systems with respect to their confidentiality, integrity, or availability, and recent research has shown that many forms of machine learning are susceptible to these types of attacks. Besides its advantages, federated learning opens new attack surfaces due to its distributed nature, which amplifies concerns about adversarial attacks. In this paper, we evaluate data poisoning attacks in federated settings. By altering certain training inputs with a specific pattern during the training phase, an adversary may later trigger malicious behavior in the prediction phase. We show on datasets for traffic sign and face recognition that federated learning is effective at a level similar to centralized learning, but is indeed vulnerable to data poisoning attacks. We test both parallel and sequential (incremental, cyclic) federated learning, and perform an in-depth analysis of several hyper-parameters of the adversaries.
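The following sketch shows the two ingredients of such an attack (illustrative only; the trigger shape, poisoning rate, and aggregation details are assumptions rather than the paper's exact setup): a malicious client that stamps a trigger pattern onto part of its training data, and the weighted averaging used in parallel federated learning:

import numpy as np

def poison(images, labels, target_label, rate=0.1, seed=0):
    # Stamp a 3x3 white square into the corner of a fraction of the images and
    # flip their labels to the attacker's target class (a backdoor-style trigger).
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -3:, -3:, :] = 1.0
    labels[idx] = target_label
    return images, labels

def fedavg(client_weights, client_sizes):
    # Parallel federated learning: weighted average of client model parameters.
    total = sum(client_sizes)
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(len(client_weights[0]))
    ]

imgs = np.zeros((100, 32, 32, 3))
lbls = np.zeros(100, dtype=int)
poisoned_imgs, poisoned_lbls = poison(imgs, lbls, target_label=7)

# In the sequential (incremental, cyclic) variant there is no averaging step:
# the current global model is passed from one client to the next and fine-tuned
# locally, so a single poisoned client influences every round it participates in.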

SESSION: Session 2: Privacy

PriveTAB: Secure and Privacy-Preserving sharing of Tabular Data

Machine Learning has increased our ability to model large quantities of data efficiently in a short time. Machine learning approaches in many application domains require collecting large volumes of data from distributed sources and combining them. However, sharing data from multiple sources leads to concerns about privacy. Privacy regulations like the European Union's General Data Protection Regulation (GDPR) have specific requirements on when and how such data can be shared. Even when there are no specific regulations, organizations may have concerns about revealing their data. For example, in cybersecurity, organizations are reluctant to share the network-related data needed to build machine learning-based intrusion detectors. This has, in particular, hampered academic research. We need an approach to make confidential data widely available for accurate data analysis without violating the privacy of the data subjects. Privacy in shared data has been discussed in prior work focusing on anonymization and encryption of data. An alternative approach to make data available for analysis without sharing sensitive information is to replace sensitive information with synthetic data that behaves like the original data for all analytical purposes. Generative Adversarial Networks (GANs) are well-known models for generating synthetic samples with the same distributional characteristics as the original data. However, modeling tabular data using GANs is a non-trivial task. Tabular data contain a mix of categorical and continuous variables and require specialized constraints, as described in the CTGAN model. In this paper, we propose a framework to generate privacy-preserving synthetic data suitable for release for analytical purposes. The data is generated using the CTGAN approach and is therefore analytically similar to the original dataset. To ensure that the generated data meet the privacy requirements, we use the principle of t-closeness: we ensure that the distribution of attributes in the released dataset is within a certain threshold distance from the real dataset. We also encrypt sensitive values in the final released version of the dataset to minimize information leakage. We show that in a variety of cases, models trained on this synthetic data instead of the real data perform nearly as well when tested on the real data. Specifically, we show that machine learning models used for network event/attack recognition tasks do not suffer a significant loss in accuracy when trained on data generated from our framework in place of the real dataset.
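A minimal sketch of the acceptance check follows, assuming pandas DataFrames, categorical attributes, and total variation distance as a simplified stand-in for the Earth Mover's Distance usually paired with t-closeness; the column names and threshold are illustrative:

import pandas as pd

def total_variation(real: pd.Series, synthetic: pd.Series) -> float:
    # Distance between the attribute's distribution in the real and synthetic data.
    p = real.value_counts(normalize=True)
    q = synthetic.value_counts(normalize=True)
    support = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

def passes_t_closeness(real_df, synth_df, columns, t=0.2):
    # A generator (e.g. CTGAN) produces synth_df; it is released only if every
    # checked attribute distribution is within threshold t of the real data.
    return all(total_variation(real_df[c], synth_df[c]) <= t for c in columns)

real = pd.DataFrame({"proto": ["tcp", "tcp", "udp", "icmp"]})
synth = pd.DataFrame({"proto": ["tcp", "udp", "udp", "tcp"]})
print(passes_t_closeness(real, synth, ["proto"]))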

The Proof is in the Glare: On the Privacy Risk Posed by Eyeglasses in Video Calls

The past few years have seen a rapid growth of video conferencing as a communication mechanism of choice for personal and professional collaboration. This increased dependence on video conferencing brings with it a host of privacy challenges regarding audio/video data that could be inadvertently leaked during a call. In this paper, we show that, even when equipped with anti-glare functionality, eyeglasses worn during a video call can leak information that the user privately views in their computer window. The attack exploits patterns in the eyeglass reflection that, despite not being discernible by humans, can be picked up by machine learning algorithms. We rigorously investigate this line of attack under a wide range of environmental conditions (such as room illumination, lighting color temperature, and graphical attributes of the content viewed) and show it to be highly effective. While video conferencing platforms already obfuscate information in the user's background, they do not yet perform such obfuscation for eyeglass reflections. Our paper calls for the incorporation of filters designed specifically to obfuscate content in eyeglass reflections.

How Attacker Knowledge Affects Privacy Risks: An Analysis Using Probabilistic Programming

Governments and businesses routinely disclose large amounts of private data on individuals for data analytics. However, despite attempts by data controllers to anonymise data, attackers frequently deanonymise the disclosed data by matching it with their prior knowledge. When is a chosen anonymisation method adequate? To answer this, a data controller must consider attackers befitting their scenario: how does attacker knowledge affect disclosure risk? We present a multi-dimensional conceptual framework for assessing privacy risks given prior knowledge about the data. The framework defines three dimensions: distinctness (of input records), informedness (of the attacker), and granularity (of the anonymisation program's output). We model three well-known types of disclosure risk: identity disclosure, attribute disclosure, and quantitative attribute disclosure. We demonstrate how to apply this framework in a health record privacy scenario, analysing how informing the attacker with COVID-19 infection rates affects privacy risks. We perform this analysis using Privug, a method that uses probabilistic programming to carry out standard statistical analysis with Bayesian inference.
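The following sketch (not Privug itself; the scenario and numbers are invented for illustration) shows the underlying idea with plain Monte Carlo inference: encode the attacker's knowledge as a prior, push it through an anonymisation program that releases only an aggregate, and read off the posterior attribute-disclosure risk for one individual:

import numpy as np

rng = np.random.default_rng(1)
N, SAMPLES = 100, 50_000
infection_rate = 0.05                       # attacker knowledge: published infection rate

# Prior: the attacker only knows each person is infected with the published rate.
prior_draws = rng.random((SAMPLES, N)) < infection_rate

# Anonymisation program: release only the number of infected individuals.
released_count = prior_draws.sum(axis=1)

# The attacker observes the released output for the real dataset (say, 7 infected)
# and conditions on it; the posterior for person 0 quantifies attribute disclosure.
observed = 7
consistent = prior_draws[released_count == observed]
posterior_person0 = consistent[:, 0].mean()
print(f"prior risk {infection_rate:.3f} -> posterior risk {posterior_person0:.3f}")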

SESSION: Keynote Talk: Deception

Shadows Behind the Keyboard: Dark Personalities and Deception in Cyberattacks

Understanding the psychology of cyberattacks is critical for finding ways to minimize their efficacy and harm. Specifically, there are multiple types of attackers, and different attackers have different goals and varied approaches. Through the study of individual differences, we can better understand not only who is most likely to engage in criminal activity in cyberspace but also the dispositional tendencies towards specific types of online attacks. Thus, by understanding the psychology of malevolent characters, we can better understand (a) how different attackers go about launching their attacks, (b) who these attackers are likely to target, and (c) the tactics and strategies these different attackers are likely to use.

SESSION: Session 3: Vulnerabilities/Anomalies

Chronos vs. Chaos: Timing Weak Memory Executions in the Linux Kernel

Timing is one of the key metrics by which side-channel attacks distinguish between classes of executions. For example, a speculative execution may be specified in the architecture as having no visible side effects, yet the cache may still be accessed in a concrete micro-architectural implementation. Cache side-channel attacks interpret this signal by measuring the time a memory access takes to complete under some set of cache preconditions, in turn revealing machine state that is expected to remain opaque. Some of the speculative structures in the micro-architecture (such as the store buffer) responsible for these behaviours also expose visible out-of-order, or "weak memory", execution under the right conditions. This work investigates the environmental conditions under which visible weak memory executions occur and whether there is a micro-architectural "signal" associated with those executions that can be exposed. Our hypothesis is that these characteristics can be used to identify micro-architectural speculation that may lead to weak-memory behaviour, and that the mechanisms at play in these executions, if subsequently rolled back, may induce the cache side effects necessary for building transient execution attacks. We present kerntime, a kernel-mode utility that provides cycle-level granularity for the execution time of weak memory litmus tests. We use kerntime to analyse the timing profile of Store Buffering behaviour present on x86 and develop characteristics based on observations from the dataset generated by kerntime, including a thread-local indicator of Store Buffering behaviour.

No Features Needed: Using BPE Sequence Embeddings for Web Log Anomaly Detection

Problem: Manual data analysis for extracting useful features for web log anomaly detection can be costly and time-consuming. Automated techniques, on the other hand (e.g., those based on Auto-Encoders and CNNs), usually require supplemental network training for feature extraction. Systems trained on these features often suffer from high False Positive Rates (FPRs), and rectifying them can negatively impact accuracy and add training/tuning delays. Thus, manual analysis delays, mandatory supplementary training, and inferior detection outcomes are the limitations of contemporary web log anomaly detection systems. Proposal: Byte Pair Encoding (BPE) is an automated data representation scheme that requires no training and only needs a single parsing run to tokenize the available data. Models trained on BPE-based vectors have been shown to outperform models trained on similar representations in tasks such as neural machine translation (NMT) and natural language generation (NLG). We therefore propose to use BPE tokens obtained from web log data, vectorized by a pre-trained sequence embedding model, to perform web log anomaly detection. Our experiments on two public datasets show that ML models trained on BPE sequence vectors achieve better results than training on both manually and automatically extracted features. Moreover, our technique for obtaining log representations is fully automated (requiring only a single hyper-parameter), needs no additional network training, and provides representations that give consistent performance across different ML algorithms (a property absent from feature-based techniques). The only trade-off with our method is a higher upper limit on system memory consumption compared to using manual features, due to the higher dimensionality of the pre-trained embeddings; reducing it is our motivation for future work.
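A minimal end-to-end sketch of the proposed flow follows, with a hashing-based embedder standing in for the pre-trained sequence embedding model and an off-the-shelf detector standing in for the ML models evaluated in the paper (the log lines, vocabulary size, and model choice are illustrative assumptions):

import numpy as np
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from sklearn.ensemble import IsolationForest

logs = [
    'GET /index.html HTTP/1.1 200',
    'GET /images/logo.png HTTP/1.1 200',
    "GET /login.php?user=admin'-- HTTP/1.1 500",
]

# 1. BPE tokenization: no manual feature engineering, one pass over the log lines.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(logs, BpeTrainer(vocab_size=200, special_tokens=["[UNK]"]))

# 2. Sequence embedding: average of per-token vectors; a pre-trained sequence
#    embedding model would replace this hashing trick in the actual approach.
def embed(line, dim=64):
    ids = tokenizer.encode(line).ids
    vecs = [np.random.default_rng(i).standard_normal(dim) for i in ids]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.stack([embed(line) for line in logs])

# 3. Any standard ML model can consume the vectors; here an unsupervised detector.
scores = IsolationForest(random_state=0).fit(X).score_samples(X)
print(scores)   # lower score = more anomalous request line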

SESSION: Session 4: Model Evaluation and Datasets

An Empirical Evaluation of Adversarial Examples Defences, Combinations and Robustness Scores

Over the past few years, deep learning has been dominating the field of machine learning in applications such as speech, image, and text recognition, which has led to an increased use of deep learning techniques in safety-critical tasks. However, Neural Networks are vulnerable to adversarial examples, i.e. well-crafted small perturbations of the input that aim to disturb prediction correctness. Therefore, the robustness and security of deep learning models have become a major concern, indirectly also affecting safety. In this paper, we evaluate several state-of-the-art white- and black-box adversarial attacks against Convolutional Neural Networks for image recognition, for various attack targets. Further, defences such as adversarial training and pre-processors are evaluated, and we investigate whether combinations of them can improve these defences. Finally, we examine whether attack-agnostic robustness scores such as CLEVER are able to correctly estimate the robustness against our large range of attacks. Our results indicate that pre-processors are very effective against attacks whose adversarial examples are very close to the original images, that combinations can improve defence strength, and that CLEVER is insufficient as the sole indicator of robustness.
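As one concrete example of the kind of white-box attack evaluated here, the sketch below implements FGSM in PyTorch against a toy classifier (the model, epsilon, and data are stand-ins; this is not the paper's evaluation harness):

import torch
import torch.nn as nn

def fgsm(model, images, labels, epsilon=0.03):
    # Fast Gradient Sign Method: perturb each input by epsilon in the direction
    # of the sign of the loss gradient with respect to the input.
    images = images.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy stand-in; a real evaluation would use a CNN
x = torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = fgsm(model, x, y)

# A pre-processor defence (e.g. quantisation or blurring applied to x_adv before
# classification) can then be slotted in front of the model, which is the
# combination-of-defences setting the paper measures.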

On the Effectiveness of Dataset Watermarking

In a data-driven world, datasets constitute significant economic value. Dataset owners who spend time and money to collect and curate data are incentivized to ensure that their datasets are not used in ways they did not authorize. When such misuse occurs, dataset owners need technical mechanisms for demonstrating their ownership of the dataset in question. Dataset watermarking provides one approach for ownership demonstration which can, in turn, deter unauthorized use. In this paper, we investigate a recently proposed data provenance method, radioactive data, to assess whether it can be used to demonstrate ownership of (image) datasets used to train machine learning (ML) models. The original paper reported that radioactive data is effective in white-box settings. We show that while this is true for large datasets with many classes, it is not as effective for datasets where the number of classes is low (≤ 30) or the number of samples per class is low (≤ 500). We also show that, counter-intuitively, the black-box verification technique described in the original paper is effective for all datasets used in this paper, even when white-box verification is not. Given this observation, we show that the confidence in white-box verification can be improved by using watermarked samples directly during the verification process. We also highlight the need to assess the robustness of radioactive data if it is to be used for ownership demonstration, since that is an adversarial setting, unlike provenance identification.

Compared to dataset watermarking, ML model watermarking has been explored more extensively in recent literature. However, most state-of-the-art model watermarking techniques can be defeated via model extraction. We show that radioactive data can effectively survive model extraction attacks, which raises the possibility that it can be used for ML model ownership verification that is robust against model extraction.
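A simplified sketch of a black-box check in this spirit follows, assuming per-sample losses can be estimated from the suspect model's outputs on marked and unmarked versions of the same images; the numbers are synthetic and the paired test is our illustration, not the verification procedure from the original radioactive data paper:

import numpy as np
from scipy import stats

def blackbox_evidence(loss_on_marked, loss_on_unmarked, alpha=0.01):
    # One-sided paired t-test: a model trained on the marked data is expected to
    # show systematically lower loss on marked samples than on unmarked ones.
    t, p = stats.ttest_rel(loss_on_marked, loss_on_unmarked, alternative="less")
    return p < alpha, p

rng = np.random.default_rng(0)
clean = rng.normal(1.0, 0.2, size=500)                 # hypothetical per-sample losses
suspect_marked = clean - rng.normal(0.05, 0.02, 500)   # slightly lower on marked data
print(blackbox_evidence(suspect_marked, clean))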

A Dataset of Networks of Computing Hosts

We are making public a dataset of 21 disjoint graphs representing communications among machines running different distributed applications in various enterprises. For one graph, we provide a ground-truth grouping of the hosts, which is useful for evaluating tasks such as clustering hosts based on their network communications. We describe the graphs and present a brief exploratory analysis to illustrate some of their properties, possible uses of the data, and some of the challenges.
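A small sketch of the kind of task the ground-truth grouping supports, clustering hosts by their communications; the generated graph stands in for one of the released graphs, whose actual file format should be taken from the release:

import networkx as nx
from networkx.algorithms import community

# Stand-in communication graph; in practice one of the 21 released graphs would
# be loaded instead (e.g. via nx.read_edgelist on the released edge list).
G = nx.planted_partition_graph(4, 25, p_in=0.3, p_out=0.01, seed=0)

clusters = community.greedy_modularity_communities(G)
print([len(c) for c in clusters])   # size of each detected host cluster

# Agreement with the provided ground-truth grouping can then be scored with
# standard measures such as adjusted Rand index or normalized mutual information.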