Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text

Using libsane and tesseract, you can scan from an ADF (or non-ADF) scanner in Ubuntu 7.10 to a PDF and an OCR’d text document in a few easy steps.

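First we need to make sure we have the necessary packages installed. As noted in the comments, the script below also needs libtiff-tools (for tiffcp and tiff2pdf) and a tesseract language package such as tesseract-ocr-eng.

apt-get install tesseract-ocr tesseract-ocr-eng sane-utils libtiff-tools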
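
Package names differ between releases, so it can help to check for the commands themselves before scanning. The following is only a minimal sketch, assuming a POSIX shell; the function name missing_tools is made up for this example and is not part of SANE or tesseract.

```shell
#!/bin/sh
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper name chosen for this sketch.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# The four commands used in this post.
missing=$(missing_tools scanimage tesseract tiffcp tiff2pdf)
if [ -n "$missing" ]; then
    # $missing is unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All tools found."
fi
```

If anything is reported missing, install the package that provides it before going further.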

The tesseract-ocr package gives us a utility called tesseract, which takes a TIFF file as input and writes the recognized text to a .txt file. The second argument is the base name of the output, so the command below writes output.txt.

tesseract my.tif output

Now we need a command-line method to grab the TIFF, and sane-utils comes to the rescue with the “scanimage” command. It is a great little utility, and I recommend reading up on its features and options, as they vary with the type of scanner you have. My scanner has an Automatic Document Feeder (ADF), so be aware that my instructions are specific to an ADF scanner.

This example scans letter-sized paper in batch mode and saves the output as TIFF. The -x and -y values are the page width and height in millimetres (8.5 x 11 inches is 215.9 x 279.4 mm):

scanimage -y 279.4 -x 215.9 --batch --format=tiff --mode Lineart --resolution 300 --source ADF

This will output a new TIFF for each page that is scanned.
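
One thing to watch with per-page files is their sort order: once there are ten or more pages, a glob such as out*.tif no longer lists unpadded names in page order. The sketch below shows the effect with empty placeholder files, so no scanner is needed. The helper name list_pages and the file names are assumptions made for the demonstration.

```shell
#!/bin/sh
# Compare glob order for unpadded and zero-padded page names.
LC_ALL=C; export LC_ALL          # make the sort order predictable

demo_dir=$(mktemp -d)

# list_pages (hypothetical helper): create the named empty files in the
# demo directory, then print them in the order the shell glob returns them.
list_pages() (
    cd "$demo_dir" || exit 1
    rm -f out*.tif
    touch "$@"
    echo out*.tif
)

unpadded=$(list_pages out1.tif out2.tif out10.tif)
padded=$(list_pages out001.tif out002.tif out010.tif)

echo "unpadded: $unpadded"       # out1.tif out10.tif out2.tif
echo "padded:   $padded"         # out001.tif out002.tif out010.tif
rm -rf "$demo_dir"
```

With scanimage, zero-padded names can be requested through the batch format, for example --batch=out%03d.tif. Starting the counter at a three-digit number with --batch-start=101 has a similar effect.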

The script below combines these steps to produce a single PDF document and a .txt file for each scan job.


#!/bin/bash
# Usage: scandoc myoutput.pdf
outname=$1
startdir=$(pwd)
tmpdir=/tmp/scan-$RANDOM

mkdir "$tmpdir" || exit 1
cd "$tmpdir" || exit 1
echo "################## Scanning ###################"
# Zero-padded page names keep out*.tif in page order past page 9.
scanimage -y 279.4 -x 215.9 --batch=out%03d.tif --format=tiff --mode Lineart --resolution 300 --source ADF
echo "################### OCRing ####################"
i=1
for page in out*.tif; do
        echo -n "Page: $i - "
        # Run tesseract on each page and append its output to a single .txt file.
        tesseract "$page" "$page"
        echo "---BEGIN PAGE: $i ---" >> "$outname.txt"
        cat "$page.txt" >> "$outname.txt"
        echo "---END PAGE: $i ---" >> "$outname.txt"
        i=$((i + 1))
done
mv "$outname.txt" "$startdir"
echo "############## Converting to PDF ##############"
# Use tiffcp to combine the page TIFFs into a single multi-page TIFF.
tiffcp -c lzw out*.tif output.tif
# Convert the multi-page TIFF to PDF.
tiff2pdf output.tif > "$startdir/$outname"
echo "################ Cleaning Up ################"
cd "$startdir"
rm -rf "$tmpdir"

I named the script above “scandoc”. Running “scandoc myoutput.pdf” scans every page in the ADF and drops a PDF (myoutput.pdf) and a text file (myoutput.pdf.txt) in the current directory. Very handy!
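
Because the text file wraps each page in BEGIN and END marker lines, a single page can be pulled back out later. Here is a rough sketch using awk. The function name extract_page and the sample text are invented for illustration, and it assumes the OCR text itself never contains a line that starts with one of the markers.

```shell
#!/bin/sh
# extract_page (hypothetical helper): print the text of page N from a
# combined file that uses "---BEGIN PAGE: N ---" marker lines.
# Usage: extract_page PAGE_NUMBER FILE
extract_page() {
    awk -v n="$1" '
        /^---END PAGE:/                      { inside = 0 }
        inside                               { print }
        $0 == ("---BEGIN PAGE: " n " ---")   { inside = 1 }
    ' "$2"
}

# Build a small sample file in the same layout and read page 2 back.
sample=$(mktemp)
printf '%s\n' \
    '---BEGIN PAGE: 1 ---' 'first page text' '---END PAGE: 1 ---' \
    '---BEGIN PAGE: 2 ---' 'second page text' '---END PAGE: 2 ---' > "$sample"
page2=$(extract_page 2 "$sample")
echo "$page2"
rm -f "$sample"
```

On a real scan this would be called as, for example, extract_page 3 myoutput.pdf.txt.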

EDIT:
I’ve added Joe’s contributions in the comments to a gist at github.

8 Responses to “Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text”


  1. jeromio

    This is a very convenient script. I did have two small problems, though: I had to install the tiff package and an English language package for tesseract. I’m using this script with an HP C2780 and it works great. I’m emptying out my filing cabinet (and filling up my HD). Thanks.

    – Jeremy

  2. Nathan

    Looks interesting. Have you had any luck getting the scan button to work in Linux?

  3. Charles

    Thank you; works nicely and smoothly. I have added the option --batch-start=101 to the scanimage command, because as written the order of pages in the single TIFF file is not correct when ten or more pages are scanned (starting at 101, you can scan about 900 pages).

  4. elio

    Excellent, Jonah. It works very well! Let me add a few annoyances I bumped into, so that other people can take advantage.
    a) You have to find out the device name of your scanner. Running scanimage -L will tell you. In my case:
    elio@gazelle:$ scanimage -L
    device `hpaio:/net/Officejet_Pro_L7500?ip=192.168.1.98' is a Hewlett-Packard Officejet_Pro_L7500 all-in-one

    b) tesseract also installs language-dependent data. On my Ubuntu 9.04 (English, US) it installed the German files by default, so install the package for your language. I installed tesseract-ocr-eng.

    c) In the last step of your program, tiffcp and tiff2pdf could not be found. Fixed by installing the package libtiff-tools.

    Although this note is somewhat long, I want to state that your solution is very simple and effective. Again, my compliments. I encourage everyone to adopt your solution; it took me five minutes and three tries to be up and running.

    I still have to find out how to scan a double-sided document. I’m investigating the scanimage command and will post again if I discover how to do it.
    Cheers, Elio

  5. Joe

    Hi,
    thanks for this nice script.
    I’ve made some changes and improvements:

    - added a format to the scanimage batch option (--batch=out%02d.tif)
    - compress with zip when calling tiff2pdf
    - added ImageMagick image enhancement (a little more contrast) and two-bit TIFF (scans in Gray and reduces the colors to 4)

    Scanning in Gray causes no decrease in speed on my HP 5590.
    I’ve also added the possibility to scan more than one document (page sequence) with the ADF.
    When you have only one page to scan, it’s faster not to use the ADF.

    So you can call scan2pdf like this:
    scan2pdf myDocument -> scans one page, without the ADF (saved as myDocument.pdf)
    scan2pdf 99 myDocument -> uses the ADF to scan into myDocument.pdf
    scan2pdf 3,8,2 myDocument -> uses the ADF and scans 3 page sequences; files are saved to myDocument.01.pdf (3 pages), myDocument.02.pdf (8 pages) and myDocument.03.pdf (2 pages)

    Maybe someone will like it:
    [code]

    #!/bin/bash
    # Usage: scan2pdf [SEQUENCE_LIST] NAME   (e.g. scan2pdf 3,8,2 myDocument)

    if [ $# -gt 1 ]
    then
        # Two arguments: scan from the ADF and split pages by the sequence list.
        SOURCE="--source ADF -l 3"
        outname=$2
        pbreak=$1

        # Reject anything other than digits and commas in the sequence list.
        if echo "$pbreak" | grep -Eq "[^0-9,]"
        then
            echo "Check Sequence List!"
            exit 1
        fi
    else
        # One argument: scan a single page without the ADF.
        pbreak=99
        outname=$1
        SOURCE="--batch-count=1"
    fi

    startdir=$(pwd)
    tmpdir=/tmp/scan-$RANDOM

    mkdir "$tmpdir" || exit 1
    cd "$tmpdir" || exit 1
    echo "################## Scanning ###################"
    # $SOURCE is left unquoted on purpose so it splits into separate options.
    scanimage -x 210 -y 297 --batch=out%02d.tif --format=tiff --mode Gray --resolution 300 $SOURCE

    start=1
    cnt=1
    # Number of page sequences in the list.
    sc=$(echo "$pbreak" | tr ',' ' ' | wc -w)
    for pb in $(echo "$pbreak" | tr ',' ' ')
    do
        ende=$(expr $start + $pb - 1)
        pnr=0
        echo "############ Page-Sequence ($cnt), Pages: $pb, Start: $start, End: $ende ############"
        # With several sequences, number the output files: NAME.01, NAME.02, ...
        if [ $sc -gt 1 ]
        then
            out="$outname.$(printf %02d "$cnt")"
        else
            out="$outname"
        fi
        tpages=""
        # out[0-9]*.tif matches the scanned pages but not output.tif.
        for page in out[0-9]*.tif; do
            pnr=$(expr $pnr + 1)
            if [ "$pnr" -ge "$start" ] && [ "$pnr" -le "$ende" ]
            then
                echo "... Converting"
                # Increase contrast and reduce the color depth to 2 bits.
                convert "$page" -level 15%,85% -depth 2 "b$page"
                tpages="$tpages b$page"
                echo "... OCRing"
                # deu = German; use eng for English text.
                tesseract "$page" "$page" -l deu
                cat "$page.txt" >> "$out.txt"
            fi
        done

        echo "... Converting to PDF"
        # Use tiffcp to combine the enhanced TIFFs into a single multi-page TIFF.
        # $tpages is unquoted on purpose: it holds a space-separated file list.
        tiffcp $tpages output.tif
        # Convert the TIFF to a zip-compressed PDF.
        tiff2pdf -z output.tif > "$startdir/$out.pdf"
        mv "$out.txt" "$startdir"

        start=$(expr $start + $pb)
        cnt=$(expr $cnt + 1)
    done

    echo "################ Cleaning Up ################"
    cd "$startdir"
    rm -rf "$tmpdir"

    [/code]

  6. Jonah

    Thanks for the contribution, Joe. I no longer have a scanner with an ADF, but I’m happy to see this is the most popular post on my seldom-updated blog. I’ve added your code as a revision to my original script in a gist at github for others to work with and fork as they please.
