Abstract: At present, machine-learning-based identification of malicious encrypted traffic mainly uses supervised learning and relies on a large number of labeled samples. In real environments, however, malicious traffic is scarce and labeling it depends on expert experience, so the labeling cost is high. Active learning iteratively selects hard samples for training, which reduces the number of training samples to some extent, but the current hard-sample selection strategy based on committee votes is coarse-grained, and the samples it selects are of poor quality. To address this problem, a Committee-based Uncertainty (CBU) strategy is proposed to improve the Query by Committee (QBC) method for identifying malicious encrypted traffic. CBU analyzes similarity to labeled samples to measure sample uncertainty effectively, and selects high-quality hard samples to reduce the amount of sample labeling and training. The experiments use the industry-standard CTU data set and real malicious data sets. The results show that, compared with the traditional committee voting strategy, CBU roughly halves the amount of sample labeling and reaches a recognition accuracy of 96% using only 15% of the data, which effectively reduces the sample labeling and training volume and shows strong practicability.
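The granularity problem with committee votes can be illustrated with a small sketch. It is not the CBU measure itself, which the abstract does not specify; it only contrasts the classic QBC vote entropy with a finer-grained soft measure (entropy of the committee's averaged probabilities). The function names and the two example flows are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Classic QBC disagreement: entropy of the committee's hard votes."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())


def consensus_entropy(probs):
    """Entropy of the class probabilities averaged over committee members."""
    n_members = len(probs)
    n_classes = len(probs[0])
    mean = [sum(p[k] for p in probs) / n_members for k in range(n_classes)]
    return -sum(m * math.log(m) for m in mean if m > 0)


def hard_votes(probs):
    """Turn each member's probability vector into a single class vote."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


# Hypothetical committee outputs [P(benign), P(malicious)] for two unlabeled flows.
flow_a = [[0.55, 0.45], [0.52, 0.48], [0.45, 0.55]]  # every member is unsure
flow_b = [[0.95, 0.05], [0.90, 0.10], [0.40, 0.60]]  # two members are confident

for name, flow in (("flow_a", flow_a), ("flow_b", flow_b)):
    print(name, vote_entropy(hard_votes(flow)), consensus_entropy(flow))
```

Both flows receive a 2-to-1 vote, so vote entropy cannot tell them apart, whereas the soft measure ranks flow_a as the more uncertain sample. With a committee of three members, hard votes can only produce a handful of distinct scores, which is one way to read the "coarse granularity" the abstract refers to.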