This data article describes a flow-level dataset derived from paired captures on both sides of a WireGuard virtual private network tunnel. Pre-tunnel traffic was recorded on the inner tunnel interface before encapsulation, and encrypted transport traffic was recorded on the outer side, using a GL.iNet Flint 2 (GL-MT6000) router, an inline network TAP, and a Linux capture host. Two capture sessions totaling approximately 80 hours of residential broadband traffic from 10 devices were recorded with nanosecond-precision packet timestamps; the released flow-level data uses millisecond resolution as exported by NFStream. The raw captures were cleaned to retain TCP and UDP packets and to remove non-initial IPv4 fragments. Flow records were generated from the cleaned inner-side captures using NFStream, which assigned each flow an application name and application category label via deep packet inspection. Inner packets were matched to outer WireGuard transport data packets using time alignment and a padded-length consistency rule, and matched packets were attributed to flows using 5-tuple keys with temporal and capacity constraints. Encrypted-side statistics were then aggregated per flow. The released dataset consists of two Parquet files, one per capture session, that combine NFStream flow fields, including application labels and inner-side per-packet sequences for the first 255 packets, with encrypted-side derived attributes such as matched packet counts, byte totals, durations, rates, direction-specific byte volumes, packet-size statistics, inter-arrival time distributions, size-ratio metrics, and outer-side per-packet sequences for the first 255 packets. This cross-correlation structure, which pairs pre-tunnel application labels with encrypted tunnel-side features, can support research on encrypted traffic classification, application identification, VPN detection, and feature engineering for flow-level analysis under encryption.
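The padded-length consistency rule mentioned above can be illustrated with a minimal sketch. It relies only on the public WireGuard transport data message layout: a 16-byte header, the inner packet zero-padded to a multiple of 16 bytes, and a 16-byte authentication tag. The function names, the 5 ms tolerance, and the symmetric time window are hypothetical choices for illustration, not the procedure used to build the dataset, and the sketch ignores the case where padding is capped by the tunnel MTU.

```python
# Sketch of a padded-length consistency check for WireGuard transport data.
# Constants follow the WireGuard message format; everything else is assumed.

WG_HEADER_LEN = 16  # type (1) + reserved (3) + receiver index (4) + counter (8)
WG_TAG_LEN = 16     # Poly1305 authentication tag
WG_PAD_BLOCK = 16   # plaintext is zero-padded to a multiple of 16 bytes


def padded_len(inner_len: int) -> int:
    """Round the inner IP packet length up to the next multiple of 16."""
    return -(-inner_len // WG_PAD_BLOCK) * WG_PAD_BLOCK


def expected_outer_payload_len(inner_len: int) -> int:
    """Expected outer UDP payload length for an inner packet (no MTU cap)."""
    return WG_HEADER_LEN + padded_len(inner_len) + WG_TAG_LEN


def is_candidate_match(
    inner_len: int,
    inner_ts: float,
    outer_payload_len: int,
    outer_ts: float,
    max_skew: float = 0.005,  # hypothetical tolerance in seconds
) -> bool:
    """True if the outer packet is close in time and length-consistent."""
    close_in_time = abs(outer_ts - inner_ts) <= max_skew
    length_ok = outer_payload_len == expected_outer_payload_len(inner_len)
    return close_in_time and length_ok
```

For example, a 100-byte inner packet pads to 112 bytes, so a matching transport data message would carry a 144-byte UDP payload. Several inner packets can share the same padded length, which is why a length check alone is not sufficient and a time constraint is needed as well.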
Razooqi et al. (Sun,) studied this question.