# syscall-trace-classification

**Repository Path**: chen_yu_no/syscall-trace-classification

## Basic Information

- **Project Name**: syscall-trace-classification
- **Default Branch**: master

## Statistics

- **Created**: 2026-02-24
- **Last Updated**: 2026-02-24

## README

# Syscall Trace Classification

This project classifies syscall traces using two distinct approaches: Graph Neural Networks (GNNs) and sequence models such as BERT/RoBERTa. The primary goal is to compare how effectively these approaches identify patterns in syscall data, which matter for applications such as anomaly detection and malware identification.

## Graph Models

Uses GNNs like `['MLP', 'GCN', 'Sage', 'GIN']`, as well as Graph Attention Networks `['GAT', 'GATv2']`, to model syscall traces as graphs.

- A syscall trace is encoded as a graph in which system calls are nodes with attributes derived from their type, token, and centrality measures (betweenness, closeness, Katz, and PageRank). The sequential relationships between syscalls are encoded as directed edges whose weights indicate the frequency of transitions between syscalls.
- An embedding layer is added to the GNN to generate a vector representation of each node's token. The token is given by the system call that the node represents. These vector representations serve as node features alongside the centrality measures.

## Transformer-based Models

Employs sequence models (such as BERT/TinyBERT/RoBERTa) to analyze syscall sequences for pattern detection.

### Training

During training, the model processes syscall traces by taking a crop of `max_len` tokens from each trace.
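
As a rough illustration of the cropping step, the sketch below takes a contiguous window of at most `max_len` tokens from a trace. It assumes the window start is chosen uniformly at random, which the description above does not state; `random_crop` and the sample syscall IDs are hypothetical names and values, not taken from the repository.

```python
import random


def random_crop(tokens, max_len, rng=random):
    """Return a contiguous window of at most max_len tokens from a trace.

    Assumption: the start offset is drawn uniformly at random.
    """
    if len(tokens) <= max_len:
        # Short traces are returned whole (padding, if any, is left to the caller).
        return list(tokens)
    start = rng.randrange(len(tokens) - max_len + 1)
    return list(tokens[start:start + max_len])


# Made-up syscall IDs, for illustration only.
trace = [6, 6, 63, 6, 42, 120, 6, 195, 120, 6]
crop = random_crop(trace, max_len=4, rng=random.Random(0))
```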
To enhance the model's robustness and ability to generalize, `15%` of the tokens in each crop are randomly masked.

### Evaluation

For evaluation, each syscall trace is chunked into `n` samples of `max_len` tokens with an overlap coefficient of `0.1`. Each chunk is then fed through the model, and the outputs (either `logits` or `[CLS]` representations) are pooled to form the final result for the entire trace.

## Dataset Preparation

Before running the models, the ADFA-LD dataset must be placed in the appropriate directory and preprocessed using the provided scripts:

- Download the ADFA-LD dataset from the [ADFA-LD Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/IFTZPF).
- Place the dataset in the `/datasets/ADFA-LD` directory within the project structure.
- Run the `graph_preprocess` and `sequence_preprocess` scripts to prepare the data for training.

Note that for datasets whose traces are not already tokenized (unlike ADFA-LD), the tokenizer must be adapted accordingly.

## Initial Results

### Graph Models' Performance

The figures below show the initial results for the graph models: loss, accuracy, and F1 score over the training epochs.

**Loss**

![Graph Loss](experiments/plots/GNNs/train_loss_loss_plot.png)

**Accuracy (Train/Test)**

![Graph Accuracy](experiments/plots/GNNs/train_acc_accuracy_plot.png)

**F1 Score (Test)**

![Graph F1 Score](experiments/plots/GNNs/f1_score_f1_score_plot.png)

### Transformer-based Models

Work on these models is still in progress. So far they achieve performance similar to the GNNs, but with more instability. It's plausible that the models' complexity exceeds what the ADFA-LD dataset requires.
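
For reference, the chunked evaluation described in the Evaluation section could look roughly like the sketch below. It rests on several assumptions the README does not confirm: the overlap coefficient is read as the fraction of `max_len` shared by consecutive chunks, pooling is a plain mean over per-chunk logits, and `chunk_trace`, `mean_pool`, `classify_trace`, and `model_fn` are hypothetical names standing in for the project's own code.

```python
def chunk_trace(tokens, max_len, overlap=0.1):
    """Split a trace into windows of max_len tokens that share a fraction of tokens."""
    # Assumption: consecutive chunks share int(max_len * overlap) tokens.
    stride = max(1, max_len - int(max_len * overlap))
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the trace
        start += stride
    return chunks


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def classify_trace(tokens, model_fn, max_len, overlap=0.1):
    """Run model_fn on every chunk and pool the logits into one prediction."""
    chunk_logits = [model_fn(chunk) for chunk in chunk_trace(tokens, max_len, overlap)]
    pooled = mean_pool(chunk_logits)
    label = max(range(len(pooled)), key=pooled.__getitem__)
    return label, pooled


# Stand-in for a real model: returns two fake "logits" per chunk.
def dummy_model(chunk):
    return [float(len(chunk)), 0.0]


label, pooled = classify_trace(list(range(25)), dummy_model, max_len=10)
```

With `max_len=10` and an overlap of `0.1`, the stride is 9 tokens, so a 25-token trace yields three chunks, the last one shorter than `max_len`. A real pipeline would likely pad that final chunk before feeding it to the model.
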
**Loss**

![Transformer Loss](experiments/plots/Transformers/train_loss_loss_plot.png)

**Accuracy (Train/Test)**

![Transformer Accuracy](experiments/plots/Transformers/train_acc_accuracy_plot.png)

**F1 Score (Test)**

![Transformer F1 Score](experiments/plots/Transformers/f1_score_f1_score_plot.png)
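
As a closing illustration of the graph encoding described under Graph Models, the sketch below turns a syscall trace into a node list and a dictionary of weighted directed edges. It assumes edge weights are raw transition counts (the README only says they indicate transition frequency), and `trace_to_graph` is a hypothetical helper, not the project's `graph_preprocess` script.

```python
from collections import Counter


def trace_to_graph(trace):
    """Encode a syscall trace as (nodes, weighted directed edges).

    Nodes are the distinct syscalls in the trace. Each edge (a, b) records
    how often syscall b directly follows syscall a.
    Assumption: weights are raw counts, not normalized frequencies.
    """
    nodes = sorted(set(trace))
    edges = Counter(zip(trace, trace[1:]))  # consecutive pairs
    return nodes, dict(edges)


# Made-up syscall IDs, for illustration only.
trace = [3, 4, 3, 4, 5]
nodes, edges = trace_to_graph(trace)
# nodes -> [3, 4, 5]
# edges -> {(3, 4): 2, (4, 3): 1, (4, 5): 1}
```

The centrality attributes (betweenness, closeness, Katz, and PageRank) could then be computed on this structure with a graph library such as NetworkX. Which library the project actually uses is not stated here.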