Notes on Using tess-two

2020-06-21

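Introduction to OCR

OCR (Optical Character Recognition) is the process by which an electronic device, such as a scanner or digital camera, examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses character recognition methods to translate those shapes into computer text.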

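Introduction to Tesseract

Tesseract is an OCR engine that Ray Smith developed at HP Labs Bristol between 1985 and 1995. It ranked among the top engines in the 1995 UNLV accuracy test, but development largely stopped after 1996. In 2006 Google brought Smith on board and restarted the project. The project is licensed under Apache 2.0 and supports the mainstream platforms: Windows, Linux, and Mac OS. As an engine, it has no graphical interface of its own and is used through its command-line tool or its library API. Tesseract, whose development Google sponsored for many years, is one of the best open-source OCR engines and supports Chinese.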

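Home page: https://github.com/tesseract-ocr

The Tesseract source code and language packs can be downloaded from the home page. Commonly used language packs are:

Chinese (Simplified): chi_sim.traineddata
English: eng.traineddata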

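The Origin of tess-two

Because Tesseract is implemented in C++, it cannot be used directly on Android; it has to be wrapped in a Java API before it can be called from an Android app. Here we use the tess-two project directly. tess-two is a fork of Tesseract Tools for Android (tesseract-android-tools); it is simple to use and also bundles leptonica. You can download the source from GitHub and build it yourself, although the steps here use the published Gradle artifact instead.

tess-two on GitHub: https://github.com/rmtheis/tess-two
Trained data (language files) for tess-two: https://github.com/tesseract-ocr/tessdata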

Integration

1. Add the tess-two dependency

    implementation 'com.rmtheis:tess-two:9.1.0'

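2. Download the required trained data

Since I mainly need to recognize Simplified Chinese, I chose this trained data file: https://github.com/tesseract-ocr/tessdata/blob/master/chi_sim.traineddata
Once downloaded, the file can be stored wherever you like. Because trained data files tend to be large, they are usually hosted on a server rather than bundled, so that the app package stays small: the app downloads the file before first use, caches it on the device, and loads it from there afterwards.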

In my case the traineddata file is saved at: xxxx/tesserart/tessdata/chi_sim.traineddata
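
The download-then-cache approach described above can be sketched roughly as follows. This is a minimal sketch in plain Java and not part of tess-two: the class `TrainedDataCache`, the method `ensureTrainedData`, and the idea of passing the download as a `Callable<InputStream>` are all assumptions made for illustration. On Android the base directory would typically sit under the app's own files or cache directory.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of tess-two): caches <lang>.traineddata locally.
final class TrainedDataCache {
    private TrainedDataCache() {}

    /**
     * Returns baseDir/tessdata/<lang>.traineddata, calling `download`
     * only when the file is not already cached.
     */
    static Path ensureTrainedData(Path baseDir, String lang,
                                  Callable<InputStream> download) throws Exception {
        Path tessdata = baseDir.resolve("tessdata");
        Files.createDirectories(tessdata);
        Path target = tessdata.resolve(lang + ".traineddata");
        if (Files.isRegularFile(target) && Files.size(target) > 0) {
            return target; // already cached, skip the download
        }
        // Write to a temporary file first so an interrupted download
        // never leaves a truncated .traineddata behind.
        Path tmp = Files.createTempFile(tessdata, lang, ".part");
        try (InputStream in = download.call()) {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
        return target;
    }
}
```

The returned file's grandparent directory (`baseDir` here) is what would later be handed to `init`, since `init` expects the directory that contains `tessdata`.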

3. Initialize tess-two

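    // "xxxx/tesserart" is the parent directory of tessdata, not tessdata itself
    val language = "chi_sim" // matches chi_sim.traineddata
    val tessBaseAPI = TessBaseAPI()
    tessBaseAPI.init("xxxx/tesserart", language)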

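When initializing, the path we pass only needs to point to the parent directory of tessdata, because init itself appends tessdata to that path and looks for the trained data files there. The relevant part of the library source: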

    public boolean init(String datapath, String language, @OcrEngineMode int ocrEngineMode) {
        if (datapath == null)
            throw new IllegalArgumentException("Data path must not be null!");
        if (!datapath.endsWith(File.separator))
            datapath += File.separator;

        File datapathFile = new File(datapath);
        if (!datapathFile.exists())
            throw new IllegalArgumentException("Data path does not exist!");

        File tessdata = new File(datapath + "tessdata");
        if (!tessdata.exists() || !tessdata.isDirectory())
            throw new IllegalArgumentException("Data path must contain subfolder tessdata!");
        // ...
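
Since `init` throws `IllegalArgumentException` when the directory layout is wrong, it can help to validate the path before calling it. The following is a sketch in plain Java under stated assumptions: `TessDataCheck` and `describeProblem` are hypothetical names, not part of tess-two. The first three checks mirror the excerpt above; the final check for `<language>.traineddata` is an extra assumption that covers only the single-language case.

```java
import java.io.File;

// Hypothetical pre-flight check (not part of tess-two).
final class TessDataCheck {
    private TessDataCheck() {}

    /** Returns null when the layout looks usable, otherwise a short reason. */
    static String describeProblem(String datapath, String language) {
        if (datapath == null) return "data path is null";
        File dir = new File(datapath);
        if (!dir.exists()) return "data path does not exist";
        File tessdata = new File(dir, "tessdata");
        if (!tessdata.isDirectory()) return "missing tessdata subfolder";
        // Assumption: one language, stored as tessdata/<language>.traineddata.
        File trained = new File(tessdata, language + ".traineddata");
        if (!trained.isFile()) return "missing " + trained.getName();
        return null;
    }
}
```

Returning a reason string instead of throwing lets the caller decide whether to trigger a download, show a message, or fall back to another language.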

4. Recognize text

Once initialization succeeds, we can use the tessBaseAPI object to load an image and then read back the recognition result.
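
The recognition step can be sketched as below. To keep the sketch runnable outside Android, it talks to a hypothetical `OcrEngine` interface rather than to `TessBaseAPI` directly. Its four methods are named after the `TessBaseAPI` methods `init`, `setImage`, `getUTF8Text`, and `end`; in an app you would call those on the `TessBaseAPI` instance itself. `OcrRunner` and `recognize` are likewise illustrative names, and trimming the result is a choice made for this example.

```java
import java.io.File;

// Hypothetical stand-in for the TessBaseAPI methods used in this sketch.
interface OcrEngine {
    boolean init(String datapath, String language);
    void setImage(File image);
    String getUTF8Text();
    void end();
}

final class OcrRunner {
    private OcrRunner() {}

    /** Runs one recognition pass; once init succeeds, end() is always called. */
    static String recognize(OcrEngine engine, String datapath,
                            String language, File image) {
        if (!engine.init(datapath, language)) {
            throw new IllegalStateException("OCR engine failed to initialize");
        }
        try {
            engine.setImage(image);              // load the picture to recognize
            String text = engine.getUTF8Text();  // run OCR and fetch the text
            return text == null ? "" : text.trim();
        } finally {
            engine.end();                        // release the engine's resources
        }
    }
}
```

With tess-two itself the same order applies: `init`, then `setImage`, then `getUTF8Text`, and finally `end` once the instance is no longer needed.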