Abstract: The current surge in network traffic has caused a sharp rise in network security incidents and placed an added burden on network management. In response, this paper proposes a deep-learning architecture that uses ResNet and a one-dimensional Vision Transformer in parallel to identify and classify network traffic. ResNet extracts deep spatial features from flow data, which supports high recognition accuracy, while the one-dimensional Vision Transformer captures more representative temporal features. An attention mechanism adaptively merges the two feature types into a more comprehensive representation, strengthening the network's traffic-identification capability. In application-based traffic classification experiments on the ISCX VPN-nonVPN dataset, the proposed method achieves an accuracy of 99.5%. Compared with standalone ResNet, the standalone one-dimensional Vision Transformer, a classical one-dimensional Convolutional Neural Network (1D-CNN), and a CNN combined with Long Short-Term Memory (CNN+LSTM), it improves accuracy by 0.9%, 3.6%, 6.6%, and 3.3%, respectively. On the USTC-TFC 2016 dataset, the proposed method not only identifies malicious traffic readily but also classifies 13 different applications, with an average classification accuracy of 98.92%. These results demonstrate its ability to recognize malicious traffic and to perform fine-grained classification.
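To make the attention-based fusion step concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes each branch has already produced a feature vector of the same length, and it weights the two branches with a softmax over scaled dot-product scores against a query vector. The names `attention_fuse`, `spatial`, `temporal`, and `query` are hypothetical; in a trained network the query would be a learned parameter, and the actual fusion module may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(spatial, temporal, query):
    """Fuse two equal-length feature vectors with attention weights.

    spatial  -- hypothetical stand-in for the ResNet branch output
    temporal -- hypothetical stand-in for the 1D Vision Transformer branch output
    query    -- hypothetical scoring vector (assumed to be learned in practice)
    """
    if not (len(spatial) == len(temporal) == len(query)):
        raise ValueError("all vectors must have the same length")
    scale = math.sqrt(len(query))
    # One scaled dot-product score per branch.
    scores = [
        sum(q * x for q, x in zip(query, branch)) / scale
        for branch in (spatial, temporal)
    ]
    # Softmax turns the two scores into weights that sum to 1.
    weights = softmax(scores)
    # The fused vector is the weighted sum of the two branch vectors.
    fused = [weights[0] * s + weights[1] * t for s, t in zip(spatial, temporal)]
    return fused, weights
```

For example, `attention_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])` scores both branches equally, so each receives weight 0.5. A query aligned with one branch shifts weight toward that branch, which is the adaptive behaviour the abstract refers to.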