mirror of https://github.com/vulnerability-lookup/VulnTrain.git synced 2026-03-16 08:13:23 +00:00

A tool to generate datasets and models based on vulnerability descriptions from @vulnerability-lookup.

VulnTrain

VulnTrain offers a suite of commands to generate diverse AI datasets and train models using comprehensive vulnerability data from Vulnerability-Lookup. It harnesses over one million JSON records from all supported advisory sources (CVE, GitHub advisories, CSAF, PySecDB, CNVD) to build high-quality, domain-specific models.

Additionally, data from the vulnerability-lookup:meta container, including enrichment sources such as vulnrichment and Fraunhofer FKIE, is incorporated to enhance model quality.

Check out the datasets and models on Hugging Face.

For more information about the use of AI in Vulnerability-Lookup, please refer to the user manual.

Installation

pipx install VulnTrain

For development:

git clone https://github.com/vulnerability-lookup/VulnTrain.git
cd VulnTrain/
poetry install

Usage

Three types of commands are available:

Dataset generation: Create and prepare datasets from vulnerability sources.
Model training: Train models using the prepared datasets.
Model validation: Assess the performance of trained models (validations, benchmarks, etc.).

CLI commands

Command	Purpose
`vulntrain-dataset-generation`	Generate datasets from vulnerability sources
`vulntrain-train-severity-classification`	Train severity classifier (RoBERTa/DistilBERT)
`vulntrain-train-severity-cnvd-classification`	Train severity classifier for CNVD data
`vulntrain-train-description-generation`	Train GPT-2 vulnerability description generator
`vulntrain-train-cwe-classification`	Train CWE classifier from patches
`vulntrain-validate-severity-classification`	Validate severity model
`vulntrain-validate-text-generation`	Validate text generation model
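
After installation, it can be useful to confirm which of the commands listed above are actually available on the PATH. The following is a minimal sketch, not part of VulnTrain itself: the command names are copied from the table, while the `find_commands` helper and the `VULNTRAIN_COMMANDS` list are hypothetical names introduced for illustration.

```python
import shutil

# Command names copied from the CLI commands table.
VULNTRAIN_COMMANDS = [
    "vulntrain-dataset-generation",
    "vulntrain-train-severity-classification",
    "vulntrain-train-severity-cnvd-classification",
    "vulntrain-train-description-generation",
    "vulntrain-train-cwe-classification",
    "vulntrain-validate-severity-classification",
    "vulntrain-validate-text-generation",
]


def find_commands(names=VULNTRAIN_COMMANDS, which=shutil.which):
    """Map each command name to its resolved path, or None if it is not on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return {name: which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_commands().items():
        print(f"{name}: {path if path else 'not found'}")
```

If a command is reported as not found after a `pipx install`, the pipx binary directory is probably missing from the PATH.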

Models

Severity classification:
Description generation:

Distributed training on HPC clusters

VulnTrain supports distributed multi-GPU training via SLURM, making it suitable for EuroHPC-style GPU clusters. See the HPC documentation for Conda environment setup, single-node and multi-node SLURM job scripts, and NCCL configuration.
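
To illustrate what a SLURM-launched training process typically needs to know about itself, here is a minimal sketch that derives the process rank, world size, and local rank from standard SLURM environment variables (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`). This is an assumption about a common pattern, not a description of how VulnTrain's trainers are implemented; `SlurmTaskInfo` and `read_slurm_task_info` are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class SlurmTaskInfo:
    rank: int        # global index of this task across all nodes
    world_size: int  # total number of tasks in the job step
    local_rank: int  # index of this task on its own node


def read_slurm_task_info(env=None):
    """Read task placement from SLURM variables, falling back to a
    single-process layout when the variables are not set."""
    env = os.environ if env is None else env
    return SlurmTaskInfo(
        rank=int(env.get("SLURM_PROCID", 0)),
        world_size=int(env.get("SLURM_NTASKS", 1)),
        local_rank=int(env.get("SLURM_LOCALID", 0)),
    )


if __name__ == "__main__":
    print(read_slurm_task_info())
```

Outside of a SLURM job the function falls back to rank 0 of 1, so the same script can be run on a workstation. The actual job scripts and NCCL settings are covered in the HPC documentation.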

Documentation

Check out the full documentation for detailed usage instructions, dataset generation examples, and training recipes.

How to cite

Bonhomme, C., & Dulaunoy, A. (2025). VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification (Version 1.4.0) [Computer software]. https://doi.org/10.48550/arXiv.2507.03607

@misc{bonhomme2025vlai,
    title={VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification},
    author={Cédric Bonhomme and Alexandre Dulaunoy},
    year={2025},
    eprint={2507.03607},
    archivePrefix={arXiv},
    primaryClass={cs.CR}
}

License

VulnTrain is licensed under the GNU General Public License version 3.

Copyright (c) 2025-2026 Computer Incident Response Center Luxembourg (CIRCL)
Copyright (C) 2025-2026 Cédric Bonhomme - https://github.com/cedricbonhomme
Copyright (C) 2025 Léa Ulusan - https://github.com/3LS3-1F