Chinese spelling error correction model for transcribed text

Affiliation:

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

CLC number:

TP3

Fund project:

Supported by the Ministry of Education-China Mobile Research Fund (MCM20190701)

    Abstract:

    To address the high error rate of speech-transcribed text, this paper proposes a detect-then-correct model based on MacBERT for correcting text produced by speech transcription. In the detection stage, a MacBERT-BiLSTM-CRF model determines whether the text contains errors and where they occur. In the correction stage, each candidate character is assessed along two dimensions, confidence and phonetic similarity, and a "confidence-phonetic similarity" curve decides whether the candidate replaces the original character. Candidate confidence is computed with the MacBERT language model, and a phonetic similarity measure based on pinyin codes is proposed. Experiments were run on the public speech dataset Thchs-30, transcribed by calling the Baidu speech recognition API. Compared with existing methods, precision, recall, and F1 all improve in both the detection and correction stages; precision in the correction stage reaches 83.32%, which improves the correctness of the transcribed text.
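The abstract does not specify the pinyin code or the shape of the decision curve, so the following is only a minimal sketch of how the correction-stage decision might be organized. Everything in it is an assumption made for illustration: the tone-numbered syllable format, the lists of confusable initials and finals, the weights, and the straight line that stands in for the "confidence-phonetic similarity" curve. The confidence value is taken as an input here; in the paper it is computed by the MacBERT language model.

```python
# Hypothetical sketch: the pinyin encoding, weights, and decision curve
# below are illustrative assumptions, not the paper's actual definitions.

# Two-letter initials come first so that 'zh' is matched before 'z'.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

# Sound pairs assumed to be easily confused in transcription.
SIMILAR_INITIALS = [{"z", "zh"}, {"c", "ch"}, {"s", "sh"}, {"n", "l"}]
SIMILAR_FINALS = [{"an", "ang"}, {"en", "eng"}, {"in", "ing"}]


def split_pinyin(syllable):
    """Split a tone-numbered syllable such as 'zhang1' into (initial, final, tone)."""
    tone = syllable[-1] if syllable[-1:].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    return "", body, tone  # zero-initial syllable such as 'an4'


def part_score(a, b, similar_groups):
    """Score 1.0 for identical parts, 0.5 for confusable parts, 0.0 otherwise."""
    if a == b:
        return 1.0
    if any(a in group and b in group for group in similar_groups):
        return 0.5
    return 0.0


def phonetic_similarity(p1, p2, weights=(0.4, 0.4, 0.2)):
    """Weighted similarity of two syllables over initial, final, and tone."""
    i1, f1, t1 = split_pinyin(p1)
    i2, f2, t2 = split_pinyin(p2)
    w_initial, w_final, w_tone = weights
    return (w_initial * part_score(i1, i2, SIMILAR_INITIALS)
            + w_final * part_score(f1, f2, SIMILAR_FINALS)
            + w_tone * (1.0 if t1 == t2 else 0.0))


def should_correct(confidence, similarity, base=0.9, slope=0.6):
    """Accept a candidate whose (similarity, confidence) point lies on or above
    the assumed boundary: the closer the sound, the less confidence is required."""
    required_confidence = base - slope * similarity
    return confidence >= required_confidence
```

Under these assumed weights, `phonetic_similarity("zhang1", "zang1")` comes out at about 0.8, so a candidate with language-model confidence 0.5 would be accepted by `should_correct`, while the same confidence with no phonetic overlap would be rejected. The point of combining the two signals is that a transcription error usually sounds like the intended character, so a phonetically close candidate can be trusted at a lower confidence than a distant one.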

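Cite this article

Xing Yuehan (邢月晗), Zheng Yan (郑岩). Chinese spelling error correction model for transcribed text[J]. Electronic Measurement Technology, 2023, 46(6): 57-61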

History
  • Online publication date: 2024-02-19