<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Scanning with sane&#8217;s scanimage from an ADF scanner to PDF and OCRed Text</title>
	<atom:link href="http://jduck.net/2008/01/05/ocr-scanning/feed/" rel="self" type="application/rss+xml" />
	<link>http://jduck.net/2008/01/05/ocr-scanning/</link>
	<description></description>
	<lastBuildDate>Mon, 22 Feb 2010 21:48:18 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Jonah</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-3472</link>
		<dc:creator>Jonah</dc:creator>
		<pubDate>Mon, 22 Feb 2010 21:48:18 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-3472</guid>
		<description>Thanks for the contribution, Joe.  I no longer have a scanner with an ADF, but I&#039;m happy to see this is the most popular post on my seldom-updated blog.  I&#039;ve added your code as a revision to my original script as a &lt;a href=&quot;http://gist.github.com/311548&quot; rel=&quot;nofollow&quot;&gt;gist&lt;/a&gt; at GitHub for others to work with and fork as they please.</description>
		<content:encoded><![CDATA[<p>Thanks for the contribution, Joe.  I no longer have a scanner with an ADF, but I&#8217;m happy to see this is the most popular post on my seldom-updated blog.  I&#8217;ve added your code as a revision to my original script as a <a href="http://gist.github.com/311548" rel="nofollow">gist</a> at GitHub for others to work with and fork as they please.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-3467</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Sun, 21 Feb 2010 12:37:09 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-3467</guid>
		<description>Hi,
thanks for this nice script.
I&#039;ve made some changes and improvements:

- added a zero-padded format to the scanimage batch option (--batch=out%02d.tif)
- enabled zip compression when calling tiff2pdf
- added ImageMagick image enhancement (slightly more contrast) and two-bit TIFF output (scans in Gray and reduces the image to 4 gray levels)

Scanning in Gray is no slower on my HP 5590.
I&#039;ve also added the possibility to scan more than one document (page sequence) with the ADF.
When you have only one page to scan, it&#039;s faster not to use the ADF.

You can call scan2pdf like this:
scan2pdf myDocument -&gt; scans one page without the ADF (saved as myDocument.pdf)
scan2pdf 99 myDocument -&gt; uses the ADF to scan into myDocument.pdf
scan2pdf 3,8,2 myDocument -&gt; uses the ADF and scans 3 page sequences; files are saved to myDocument.01.pdf (3 pages), myDocument.02.pdf (8 pages) and myDocument.03.pdf (2 pages)

Maybe someone will like it:
[code]

#!/bin/bash

SOURCE=&quot;&quot;

if [ $# -gt 1 ]
then

  SOURCE=&quot;--source ADF -l 3&quot;
  outname=$2
  pbreak=$1

  echo &quot;$pbreak&quot; &#124; egrep &quot;[^0-9,]+&quot;
  if [ $? -ne 1 ]
  then
    echo &quot;Check sequence list!&quot;
    exit 1
  fi
else

  pbreak=99
  outname=$1
  SOURCE=&quot;--batch-count=1&quot;

fi

startdir=$(pwd)
tmpdir=scan-$RANDOM

cd /tmp
mkdir $tmpdir
cd $tmpdir
echo &quot;################## Scanning ###################&quot;
scanimage -x 210 -y 297 --batch=out%02d.tif --format=tiff --mode Gray --resolution 300 $SOURCE

start=1
cnt=1
sc=$(echo &quot;$pbreak&quot; &#124; cut -d&quot;,&quot; -f1-99 --output-delimiter=&quot; &quot; &#124; wc -w)
for pb in $(echo &quot;$pbreak&quot; &#124; cut -d &quot;,&quot; -f1-99 --output-delimiter=&quot; &quot;)
do
    ende=$(expr $start + $pb - 1)
    pnr=0
    i=1
    echo &quot;############ Page-Sequence ($cnt), Pages: $pb, Start: $start, End: $ende ############&quot;
    tpages=&quot;&quot;
    for page in $(ls out*.tif); do
	pnr=$(expr $pnr + 1)
	if [ $pnr -ge $start -a $pnr -le $ende ]
	then
	    echo &quot;... Converting&quot;
	    # increase contrast and reduce color depth
	    convert $page -level 15%,85% -depth 2 &quot;b$page&quot; 
	    echo &quot;... OCRing&quot;
	    tpages=&quot;$tpages b$page&quot;
	    i=$(expr $i + 1)
	    echo -n &quot;    &quot;
            tesseract $page $page -l deu
            if [ $sc -gt 1 ]
            then
        	cnts=`printf %02d $cnt`
    		cat $page.txt &gt;&gt; $outname.$cnts.txt
    	    else
    		cat $page.txt &gt;&gt; $outname.txt
    	    fi

	fi
    done

    echo &quot;... Converting to PDF&quot;
    #Use tiffcp to combine output TIFFs into a single multi-page TIFF
    tiffcp $tpages output.tif
    #Convert the tiff to PDF
    if [ $sc -gt 1 ]
    then
    	cnts=`printf %02d $cnt`
        tiff2pdf -z output.tif &gt; $startdir/$outname.$cnts.pdf
	mv $outname.$cnts.txt $startdir
    else
        tiff2pdf -z output.tif &gt; $startdir/$outname.pdf
	mv $outname.txt $startdir
    fi

    start=$(expr $start + $pb)
    cnt=$(expr $cnt + 1)

done

cd ..
echo &quot;################ Cleaning Up ################&quot;
rm -rf $tmpdir
cd $startdir


[/code]</description>
		<content:encoded><![CDATA[<p>Hi,<br />
thanks for this nice script.<br />
I&#8217;ve made some changes and improvements:</p>
<p>- added a zero-padded format to the scanimage batch option (--batch=out%02d.tif)<br />
- enabled zip compression when calling tiff2pdf<br />
- added ImageMagick image enhancement (slightly more contrast) and two-bit TIFF output (scans in Gray and reduces the image to 4 gray levels)</p>
<p>Scanning in Gray is no slower on my HP 5590.<br />
I&#8217;ve also added the possibility to scan more than one document (page sequence) with the ADF.<br />
When you have only one page to scan, it&#8217;s faster not to use the ADF.</p>
<p>You can call scan2pdf like this:<br />
scan2pdf myDocument -&gt; scans one page without the ADF (saved as myDocument.pdf)<br />
scan2pdf 99 myDocument -&gt; uses the ADF to scan into myDocument.pdf<br />
scan2pdf 3,8,2 myDocument -&gt; uses the ADF and scans 3 page sequences; files are saved to myDocument.01.pdf (3 pages), myDocument.02.pdf (8 pages) and myDocument.03.pdf (2 pages)</p>
<p>Maybe someone will like it:<br />
[code]</p>
<p>#!/bin/bash</p>
<p>SOURCE=""</p>
<p>if [ $# -gt 1 ]<br />
then</p>
<p>  SOURCE="--source ADF -l 3"<br />
  outname=$2<br />
  pbreak=$1</p>
<p>  echo "$pbreak" | egrep "[^0-9,]+"<br />
  if [ $? -ne 1 ]<br />
  then<br />
    echo "Check sequence list!"<br />
    exit 1<br />
  fi<br />
else</p>
<p>  pbreak=99<br />
  outname=$1<br />
  SOURCE="--batch-count=1"</p>
<p>fi</p>
<p>startdir=$(pwd)<br />
tmpdir=scan-$RANDOM</p>
<p>cd /tmp<br />
mkdir $tmpdir<br />
cd $tmpdir<br />
echo "################## Scanning ###################"<br />
scanimage -x 210 -y 297 --batch=out%02d.tif --format=tiff --mode Gray --resolution 300 $SOURCE</p>
<p>start=1<br />
cnt=1<br />
sc=$(echo "$pbreak" | cut -d"," -f1-99 --output-delimiter=" " | wc -w)<br />
for pb in $(echo "$pbreak" | cut -d "," -f1-99 --output-delimiter=" ")<br />
do<br />
    ende=$(expr $start + $pb - 1)<br />
    pnr=0<br />
    i=1<br />
    echo "############ Page-Sequence ($cnt), Pages: $pb, Start: $start, End: $ende ############"<br />
    tpages=""<br />
    for page in $(ls out*.tif); do<br />
	pnr=$(expr $pnr + 1)<br />
	if [ $pnr -ge $start -a $pnr -le $ende ]<br />
	then<br />
	    echo "... Converting"<br />
	    # increase contrast and reduce color depth<br />
	    convert $page -level 15%,85% -depth 2 "b$page"<br />
	    echo "... OCRing"<br />
	    tpages="$tpages b$page"<br />
	    i=$(expr $i + 1)<br />
	    echo -n "    "<br />
            tesseract $page $page -l deu<br />
            if [ $sc -gt 1 ]<br />
            then<br />
        	cnts=`printf %02d $cnt`<br />
    		cat $page.txt &gt;&gt; $outname.$cnts.txt<br />
    	    else<br />
    		cat $page.txt &gt;&gt; $outname.txt<br />
    	    fi</p>
<p>	fi<br />
    done</p>
<p>    echo "... Converting to PDF"<br />
    #Use tiffcp to combine output TIFFs into a single multi-page TIFF<br />
    tiffcp $tpages output.tif<br />
    #Convert the tiff to PDF<br />
    if [ $sc -gt 1 ]<br />
    then<br />
    	cnts=`printf %02d $cnt`<br />
        tiff2pdf -z output.tif &gt; $startdir/$outname.$cnts.pdf<br />
	mv $outname.$cnts.txt $startdir<br />
    else<br />
        tiff2pdf -z output.tif &gt; $startdir/$outname.pdf<br />
	mv $outname.txt $startdir<br />
    fi</p>
<p>    start=$(expr $start + $pb)<br />
    cnt=$(expr $cnt + 1)</p>
<p>done</p>
<p>cd ..<br />
echo "################ Cleaning Up ################"<br />
rm -rf $tmpdir<br />
cd $startdir</p>
<p>[/code]</p>
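<p>The page-sequence arithmetic in the script can be tried on its own, without a scanner. What follows is only a minimal sketch: page_ranges is a hypothetical helper name that does not appear in the script, and it prints the sequence number, first page and last page for each entry of the comma-separated list:</p>

```shell
#!/bin/sh
# Hypothetical helper (not part of the script above): for each entry in a
# comma-separated list of page counts, print "sequence first-page last-page".
page_ranges() {
    start=1
    cnt=1
    # Turn "3,8,2" into "3 8 2" so the for loop can split it on spaces
    for pb in $(echo "$1" | tr ',' ' '); do
        end=$((start + pb - 1))
        echo "$cnt $start $end"
        # The next sequence begins right after the last page of this one
        start=$((start + pb))
        cnt=$((cnt + 1))
    done
}

page_ranges 3,8,2
```

<p>For the list 3,8,2 this yields pages 1 to 3 for the first sequence, 4 to 11 for the second and 12 to 13 for the third, which matches the page counts given for myDocument.01.pdf, myDocument.02.pdf and myDocument.03.pdf.</p>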
]]></content:encoded>
	</item>
	<item>
		<title>By: elio</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-2460</link>
		<dc:creator>elio</dc:creator>
		<pubDate>Wed, 19 Aug 2009 12:24:47 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-2460</guid>
		<description>Excellent, Jonah. It works very well! Let me add a few annoyances I bumped into, so that other people can benefit.
a) You&#039;ve got to find out the device name of your scanner. Running scanimage -L will tell you. In my case:
elio@gazelle:$ scanimage -L
device `hpaio:/net/Officejet_Pro_L7500?ip=192.168.1.98&#039; is a Hewlett-Packard Officejet_Pro_L7500 all-in-one

b) tesseract also installs language-dependent resources. On my Ubuntu 9.04 (English US) it installs the German files by default. Go and install a matching language package; I also installed tesseract-ocr-eng.

c) In the last step of your program, tiffcp and tiff2pdf could not be found. Fixed by installing the package libtiff-tools.

Although this note is somewhat long, I want to state that your solution is very simple and effective. Again, my compliments. I encourage everyone to adopt your solution; it took me five minutes and three tries to be up and running.

Still, I have yet to discover how to scan a double-sided document. I&#039;m investigating the scanimage command. I&#039;ll post again if I find out how to do it.
Cheers, Elio</description>
		<content:encoded><![CDATA[<p>Excellent, Jonah. It works very well! Let me add a few annoyances I bumped into, so that other people can benefit.<br />
a) You&#8217;ve got to find out the device name of your scanner. Running scanimage -L will tell you. In my case:<br />
elio@gazelle:$ scanimage -L<br />
device `hpaio:/net/Officejet_Pro_L7500?ip=192.168.1.98' is a Hewlett-Packard Officejet_Pro_L7500 all-in-one</p>
<p>b) tesseract also installs language-dependent resources. On my Ubuntu 9.04 (English US) it installs the German files by default. Go and install a matching language package; I also installed tesseract-ocr-eng.</p>
<p>c) In the last step of your program, tiffcp and tiff2pdf could not be found. Fixed by installing the package libtiff-tools.</p>
<p>Although this note is somewhat long, I want to state that your solution is very simple and effective. Again, my compliments. I encourage everyone to adopt your solution; it took me five minutes and three tries to be up and running.</p>
<p>Still, I have yet to discover how to scan a double-sided document. I&#8217;m investigating the scanimage command. I&#8217;ll post again if I find out how to do it.<br />
Cheers, Elio</p>
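<p>The missing-package problems from b) and c) can be spotted before the first scan. This is only a sketch under assumptions: missing_tools is a made-up helper name, and the list of commands assumes the script calls scanimage, tesseract, tiffcp, tiff2pdf and ImageMagick&#8217;s convert:</p>

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given commands that cannot
# be found on PATH, separated by spaces (prints an empty line if none).
missing_tools() {
    missing=""
    for tool in "$@"; do
        # command -v prints the resolved path, or nothing if the tool is absent
        if [ -z "$(command -v "$tool")" ]; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing
    echo "${missing# }"
}

# Assumed list of commands the scan script depends on
missing_tools scanimage tesseract tiffcp tiff2pdf convert
```

<p>If tiffcp or tiff2pdf show up in the output, installing libtiff-tools as described in c) should take care of them.</p>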
]]></content:encoded>
	</item>
	<item>
		<title>By: Charles</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-2246</link>
		<dc:creator>Charles</dc:creator>
		<pubDate>Sat, 27 Jun 2009 21:59:53 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-2246</guid>
		<description>Thank you; works nicely and smoothly.  I have added the option --batch-start=101 to the scanimage command, because as written the pages in the single TIFF file end up in the wrong order when more than 10 pages are scanned (starting at 101, you can scan about 900 pages).</description>
		<content:encoded><![CDATA[<p>Thank you; works nicely and smoothly.  I have added the option --batch-start=101 to the scanimage command, because as written the pages in the single TIFF file end up in the wrong order when more than 10 pages are scanned (starting at 101, you can scan about 900 pages).</p>
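<p>The ordering problem comes from file names being sorted character by character rather than by number. A small sketch, assuming illustrative file names such as out1.tif and out10.tif (the names the script actually produces may differ), shows the effect and how fixed-width numbers avoid it:</p>

```shell
#!/bin/sh
# Sort a few unpadded page names the way a shell glob would (byte order).
unpadded=$(printf '%s\n' out1.tif out2.tif out10.tif | LC_ALL=C sort | paste -sd ' ' -)

# Sort the same pages with zero-padded, fixed-width numbers.
padded=$(printf 'out%03d.tif\n' 1 2 10 | LC_ALL=C sort | paste -sd ' ' -)

echo "unpadded: $unpadded"
echo "padded:   $padded"
```

<p>In the unpadded case page 10 sorts before page 2; with fixed-width numbers the pages stay in scan order. Starting the batch at 101 has the same effect, because every page number then has three digits.</p>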
]]></content:encoded>
	</item>
	<item>
		<title>By: Nathan</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-1516</link>
		<dc:creator>Nathan</dc:creator>
		<pubDate>Fri, 07 Nov 2008 22:05:49 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-1516</guid>
		<description>Looks interesting. Have you had any luck getting the scan button to work in Linux?</description>
		<content:encoded><![CDATA[<p>Looks interesting. Have you had any luck getting the scan button to work in Linux?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Links For &#124; Delodder.be</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-1515</link>
		<dc:creator>Links For &#124; Delodder.be</dc:creator>
		<pubDate>Fri, 07 Nov 2008 08:06:10 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-1515</guid>
		<description>[...] Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text at Jonah M. Duckles [...]</description>
		<content:encoded><![CDATA[<p>[...] Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text at Jonah M. Duckles [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jeromio</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-1340</link>
		<dc:creator>jeromio</dc:creator>
		<pubDate>Tue, 10 Jun 2008 00:26:19 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-1340</guid>
		<description>This is a very convenient script. I did have two small problems, though: I had to install the tiff package and an English language package for tesseract. I&#039;m using this script with an HP C2780 and it works great. I&#039;m emptying out my filing cabinet (and filling up my HD). Thanks.

 - Jeremy</description>
		<content:encoded><![CDATA[<p>This is a very convenient script. I did have two small problems, though: I had to install the tiff package and an English language package for tesseract. I&#8217;m using this script with an HP C2780 and it works great. I&#8217;m emptying out my filing cabinet (and filling up my HD). Thanks.</p>
<p> &#8211; Jeremy</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: links for 2008-04-01 &#171; The Adventures of Geekgirl</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/comment-page-1/#comment-1305</link>
		<dc:creator>links for 2008-04-01 &#171; The Adventures of Geekgirl</dc:creator>
		<pubDate>Tue, 01 Apr 2008 04:41:55 +0000</pubDate>
		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/#comment-1305</guid>
		<description>[...] Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text at Jonah M. Duckles (tags: linux hack scanning imaging pdf ocr tif) [...]</description>
		<content:encoded><![CDATA[<p>[...] Scanning with sane’s scanimage from an ADF scanner to PDF and OCRed Text at Jonah M. Duckles (tags: linux hack scanning imaging pdf ocr tif) [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>
