Abstract: Pruning is an effective approach to reducing the weights and computation of convolutional neural networks (CNNs), providing a path to efficient CNN implementation. However, the irregular weight distribution in a pruned sparse CNN leads to uneven workloads among the hardware computing units, which lowers the computing efficiency of the hardware. In this paper, a fine-grained CNN pruning method is proposed that divides the overall weights into several local weight groups according to the architecture of the hardware accelerator. Each group of local weights is then pruned independently, and the resulting sparse CNN is workload-balanced on the accelerator. Furthermore, a sparse CNN accelerator with efficient processing elements (PEs) and configurable sparsity is designed and implemented on an FPGA. The efficient PE improves multiplier throughput, and the configurability makes the accelerator flexible enough to compute CNNs with different sparsity levels. Experimental results show that the proposed pruning algorithm reduces the weight parameters of a CNN by 70% with an accuracy loss of less than 3%. Compared with dense accelerators, the proposed accelerator achieves up to 3.65x speedup, and it improves hardware efficiency by 28% to 167% compared with other sparse accelerators.
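The group-wise pruning idea described above can be sketched as follows. This is an illustrative sketch only, not the paper's exact scheme: the grouping along flattened weight rows and the magnitude-based pruning criterion are assumptions. Because every group is pruned to the same sparsity, each hardware PE assigned one group receives the same number of nonzero weights, which is what balances the workload.

```python
import numpy as np

def groupwise_prune(weights, num_groups, sparsity):
    """Prune each local weight group to the same target sparsity so that
    the nonzero count per group (and thus per PE) is identical.

    Assumes weights.size is divisible by num_groups; the mapping of
    groups to PEs is hypothetical and accelerator-specific.
    """
    flat = weights.reshape(num_groups, -1)
    pruned = flat.copy()
    k = int(flat.shape[1] * sparsity)  # weights to zero out per group
    for g in range(num_groups):
        # zero the k smallest-magnitude weights in this group
        idx = np.argsort(np.abs(flat[g]))[:k]
        pruned[g, idx] = 0.0
    return pruned.reshape(weights.shape)
```

With, say, 4 groups of 16 weights and 70% sparsity, every group keeps exactly 5 nonzeros, so no PE is left idle waiting for a more heavily loaded one.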