Abstract: To address the poor reconstruction quality of multi-view stereo networks in challenging regions such as weakly textured or non-Lambertian surfaces, this paper first proposes a multi-scale feature extraction module based on three parallel dilated convolutions and an attention mechanism. The module enlarges the receptive field to obtain global context while capturing dependencies between features, thereby enhancing the multi-view stereo network's ability to represent features in challenging regions and enabling robust feature matching. Second, an attention mechanism is introduced into the 3D CNN used for cost volume regularization so that the network focuses on the important regions of the cost volume during smoothing. In addition, a neural rendering network is built: a rendering reference loss is used to accurately recover the geometric and appearance information expressed by the radiance field, and a depth consistency loss is introduced to maintain geometric consistency between the multi-view stereo network and the neural rendering network, which effectively mitigates the detrimental effect of noisy cost volumes on the multi-view stereo network. On the indoor DTU dataset, the algorithm achieves completeness and overall metrics of 0.289 and 0.326, improvements of 24.9% and 8.2% over the baseline CasMVSNet, and it produces high-quality reconstructions even in challenging regions. On the outdoor Tanks and Temples intermediate dataset, the average F-score of the reconstructed point clouds is 60.31, a 9.9% improvement over UCS-Net, demonstrating the algorithm's strong generalization capability.
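For illustration, the following is a minimal PyTorch sketch of a feature extraction block built from three parallel dilated convolutions followed by a channel-attention gate, in the spirit of the module described above. The dilation rates, channel sizes, and the squeeze-and-excitation-style attention are assumptions made for this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn


class MultiScaleFeatureBlock(nn.Module):
    """Sketch of a multi-scale feature block: three parallel dilated
    convolutions, a 1x1 fusion conv, and channel attention.
    All hyperparameters here are illustrative assumptions."""

    def __init__(self, in_ch=32, out_ch=32, dilations=(1, 2, 4)):
        super().__init__()
        # One 3x3 conv branch per dilation rate; padding=d keeps spatial size.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # Fuse the concatenated branches back to out_ch channels.
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)
        # Squeeze-and-excitation style channel attention (an assumption).
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch // 4, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch // 4, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Concatenate the multi-scale responses along the channel dimension.
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        fused = self.fuse(multi_scale)
        # Reweight channels with the attention gate.
        return fused * self.attn(fused)


if __name__ == "__main__":
    feats = MultiScaleFeatureBlock()(torch.randn(1, 32, 128, 160))
    print(feats.shape)  # torch.Size([1, 32, 128, 160])
```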