Abstract: In image matting, the erroneous extraction of fine structures such as human hair essentially stems from inaccurate alpha matte prediction in regions where foreground and background information is mixed. To address this problem, a novel end-to-end hierarchical feature aggregation matting network is proposed. The model comprises a shared encoder and two independent decoders, and leverages channel and positional attention mechanisms to hierarchically aggregate low-level texture cues and high-level semantic information. This enables the network to perceive the foreground transparency mask from fine portrait boundaries and adaptive semantics without any auxiliary inputs. To guide the network in refining the overall foreground structure and restoring hair texture details, a cross-entropy loss, an alpha matte prediction loss over unknown regions, and structural losses are integrated. To validate the effectiveness of the proposed model, experiments were conducted on the self-constructed MCP-1k dataset and the publicly available P3M-500-NP dataset. The proposed model achieved an MSE of 0.0076 and a SAD of 25.59 on the MCP-1k dataset, and an MSE of 0.0072 and a SAD of 25.52 on the P3M-500-NP dataset. Compared with other representative deep matting models, it shows significant improvements in restoring fine hair detail and enhancing the semantic structure of portraits, effectively alleviating erroneous extraction in human hair regions.
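The abstract does not specify implementation details; the following is a minimal PyTorch sketch of the shared-encoder / dual-decoder design it describes, combined with the stated loss composition. All module names, channel widths, the attention variants (an SE-style channel attention and a single-head spatial self-attention standing in for "positional attention"), the fusion scheme, and the loss weights are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch (assumed architecture): shared encoder, two decoders,
# channel + positional attention for hierarchical feature aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """SE-style channel attention (assumed variant)."""
    def __init__(self, ch, r=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))            # global average pool -> channel weights
        return x * w[:, :, None, None]             # reweight channels

class PositionalAttention(nn.Module):
    """Single-head spatial self-attention (assumed variant; use on small feature maps)."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // 8, 1)
        self.k = nn.Conv2d(ch, ch // 8, 1)
        self.v = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # B x HW x C'
        k = self.k(x).flatten(2)                   # B x C' x HW
        attn = torch.softmax(q @ k, dim=-1)        # B x HW x HW spatial affinity
        v = self.v(x).flatten(2).transpose(1, 2)   # B x HW x C
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x                             # residual connection

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class HierarchicalMattingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared encoder: three stages, each pooling halves resolution.
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.enc3 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.chn_attn = ChannelAttention(32)       # attends low-level texture cues
        self.pos_attn = PositionalAttention(128)   # attends high-level semantics
        # Decoder 1: coarse trimap-like semantics (3 classes: fg / unknown / bg).
        self.sem_dec = nn.Sequential(conv_block(128, 64), nn.Conv2d(64, 3, 1))
        # Decoder 2: fuses attended low- and high-level features into an alpha matte.
        self.mat_dec = nn.Sequential(conv_block(32 + 128, 32), nn.Conv2d(32, 1, 1))

    def forward(self, img):
        f1 = self.enc1(img)                        # low-level, full resolution
        f2 = self.enc2(self.pool(f1))
        f3 = self.pos_attn(self.enc3(self.pool(f2)))
        size = img.shape[2:]
        sem = F.interpolate(self.sem_dec(f3), size,
                            mode='bilinear', align_corners=False)
        f3_up = F.interpolate(f3, size, mode='bilinear', align_corners=False)
        fused = torch.cat([self.chn_attn(f1), f3_up], dim=1)
        alpha = torch.sigmoid(self.mat_dec(fused))
        return sem, alpha                          # semantic logits, alpha matte

def matting_loss(sem_logits, alpha_pred, trimap_gt, alpha_gt):
    """Combined loss with assumed unit weights: cross-entropy on the semantic
    branch, L1 alpha loss restricted to unknown pixels, and a gradient-based
    structural term (one plausible reading of 'structural losses')."""
    ce = F.cross_entropy(sem_logits, trimap_gt)
    unknown = (trimap_gt == 1).unsqueeze(1).float()          # class 1 = unknown region
    l_alpha = (unknown * (alpha_pred - alpha_gt).abs()).sum() \
              / unknown.sum().clamp(min=1)
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    l_struct = (dx(alpha_pred) - dx(alpha_gt)).abs().mean() + \
               (dy(alpha_pred) - dy(alpha_gt)).abs().mean()
    return ce + l_alpha + l_struct
```

A quick smoke test under these assumptions: `sem, alpha = HierarchicalMattingNet()(torch.rand(1, 3, 64, 64))` yields semantic logits of shape (1, 3, 64, 64) and an alpha matte of shape (1, 1, 64, 64); no trimap or other auxiliary input is required, matching the trimap-free setting the abstract describes.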