Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text

Using libsane and tesseract, you can scan from an ADF (or non-ADF) scanner in Ubuntu 7.10 to a PDF and an OCRed text document in a few easy steps.

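First we need to make sure we have the necessary packages installed. Besides tesseract-ocr and sane-utils, the script further down needs tiffcp and tiff2pdf from libtiff-tools, and tesseract needs a language pack (tesseract-ocr-eng for English).

apt-get install tesseract-ocr tesseract-ocr-eng sane-utils libtiff-tools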

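The tesseract-ocr package gives us a utility called tesseract, which takes a TIFF file as input and writes the recognized text to a .txt file. The second argument is the output base name; tesseract adds the .txt extension itself, so the command below produces output.txt.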

tesseract my.tif output
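Because tesseract appends .txt to whatever base name it is given, passing the TIFF name itself yields files like my.tif.txt. A minimal sketch of one way to get my.txt instead; ocr_base is a hypothetical helper name, not part of tesseract:

```shell
# Hypothetical helper: strip a trailing .tif so it can serve as
# tesseract's output base name.
ocr_base() {
    printf '%s\n' "${1%.tif}"
}

# Usage (needs tesseract installed and an existing my.tif):
#   tesseract my.tif "$(ocr_base my.tif)"
```

The script later in this post keeps the simpler form and lives with the doubled extension, since the per-page files are temporary anyway.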

Now we need a command-line way to grab the TIFF, and sane-utils comes to the rescue with its “scanimage” command. It is a great little utility, and I recommend reading up on its features and options, as they vary with the type of scanner you have. My scanner has an Auto Document Feeder (ADF), so be aware that my instructions are specific to an ADF scanner.

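This example scans a letter-sized sheet of paper in batch mode and saves the output as TIFF. The -x and -y values give the width and height of the scan area, in millimetres on most backends (215.9 x 279.4 mm is US Letter).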

scanimage -y 279.4 -x 215.9 --batch --format=tiff --mode Lineart --resolution 300 --source ADF

This will output a new TIFF for each page that is scanned, named out1.tif, out2.tif and so on.
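The 215.9 and 279.4 in the command are just 8.5 and 11 inches converted to millimetres. For other paper sizes, a small conversion helper along these lines may save some arithmetic; inches_to_mm is a hypothetical name, and the sketch assumes awk is available:

```shell
# Hypothetical helper: convert inches to millimetres, one decimal place.
inches_to_mm() {
    awk -v inches="$1" 'BEGIN { printf "%.1f\n", inches * 25.4 }'
}

# US Letter is 8.5 x 11 inches.
width_mm=$(inches_to_mm 8.5)
height_mm=$(inches_to_mm 11)
echo "scanimage -x $width_mm -y $height_mm ..."
```

Check your scanner's own options first, since the maximum scan area differs between devices.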

The script below combines several steps to output a single PDF document and .txt file for a scan job.

#!/bin/bash
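# Abort with a usage message if no output name was given
outname=${1:?usage: scandoc output.pdf}
startdir=$(pwd)
# mktemp picks an unused directory name, avoiding collisions between runs
tmpdir=$(mktemp -d /tmp/scan-XXXXXX)

cd "$tmpdir"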
echo "################## Scanning ###################"
# Zero-padded page numbers keep the *.tif globs in page order past page 9
scanimage -y 279.4 -x 215.9 --batch=out%03d.tif --format=tiff --mode Lineart --resolution 300 --source ADF
echo "################### OCRing ####################"
i=1
for page in out*.tif; do
        echo -n "Page: $i - "
        # Run tesseract on each page (it writes $page.txt) and append the
        # result to a single file with a .txt extension.
        tesseract "$page" "$page"
        echo "---BEGIN PAGE: $i ---" >> "$outname.txt"
        cat "$page.txt" >> "$outname.txt"
        echo "---END PAGE: $i ---" >> "$outname.txt"
        i=$((i + 1))
done
mv "$outname.txt" "$startdir"
echo "############## Converting to PDF ##############"
# Use tiffcp to combine the output TIFFs into a single multi-page TIFF
tiffcp -c lzw out*.tif output.tif
# Convert the TIFF to PDF
tiff2pdf output.tif > "$startdir/$outname"
cd ..
echo "################ Cleaning Up ################"
rm -rf "$tmpdir"
cd "$startdir"

I named the above script “scandoc”. Running “scandoc myoutput.pdf” drops a PDF file (myoutput.pdf) and a text file (myoutput.pdf.txt) containing all the pages from the ADF into the current directory. Very handy!
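The script depends on four external commands (scanimage, tesseract, tiffcp and tiff2pdf), and a missing one only shows up after the pages have already gone through the feeder. A preflight check near the top of the script could catch that earlier. This is only a sketch; missing_tools is a hypothetical helper name:

```shell
# Hypothetical helper: print each named command that is not on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools scanimage tesseract tiffcp tiff2pdf)
if [ -n "$missing" ]; then
    # Report on stderr; a real script would probably exit here as well.
    echo "Missing commands:" $missing >&2
fi
```

On Ubuntu, tiffcp and tiff2pdf come from libtiff-tools and scanimage from sane-utils.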

2 Responses to “Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text”


  1. jeromio

    This is a very convenient script. I did have 2 small problems though. I had to install the tiff package and an english language package for tesseract. I’m using this script with an HP C2780 and it works great. I’m emptying out my filing cabinet (and filling up my HD). Thanks.

    - Jeremy
