<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>jduck.net &#187; hacks</title>
	<atom:link href="http://jduck.net/category/hacks/feed/" rel="self" type="application/rss+xml" />
	<link>http://jduck.net</link>
	<description></description>
	<lastBuildDate>Wed, 14 Mar 2012 21:21:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Scanning with sane&#8217;s scanimage from an ADF scanner to PDF and OCRed Text</title>
		<link>http://jduck.net/2008/01/05/ocr-scanning/</link>
		<comments>http://jduck.net/2008/01/05/ocr-scanning/#comments</comments>
		<pubDate>Sat, 05 Jan 2008 17:34:21 +0000</pubDate>
		<dc:creator>Jonah</dc:creator>
				<category><![CDATA[hacks]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[scanning]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[ubuntu]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://jduck.net/2008/01/05/ocr-scanning/</guid>
		<description><![CDATA[Using libsane and tesseract, you can scan from an ADF (or non ADF) scanner in Ubuntu 7.10 to a PDF and OCR&#8217;ed text document with a few easy steps. First we need to make sure we have the necessary packages installed. apt-get install tesseract-ocr sane-utils The tesseract-ocr package gives us a utility called tesseract which [...]]]></description>
			<content:encoded><![CDATA[<p>Using libsane and tesseract, you can scan from an ADF (or non-ADF) scanner in Ubuntu 7.10 to a PDF and an OCR&#8217;d text document in a few easy steps.</p>
<p>First we need to make sure the necessary packages are installed.  The script further down also uses tiffcp and tiff2pdf, which Ubuntu ships in the libtiff-tools package.</p>
<pre>
apt-get install tesseract-ocr sane-utils
</pre>
<p><span id="more-110"></span></p>
<p>The tesseract-ocr package provides a utility called tesseract, which takes a TIFF file as input and writes the recognized text to a .txt file.  The second argument is the output base name, so the command below writes output.txt.</p>
<pre>
tesseract my.tif output
</pre>
<p>Now we need a command-line way to grab the TIFF, and sane-utils comes to the rescue.  The &#8220;scanimage&#8221; command from SANE does exactly that.  It is a great little utility, and I recommend reading up on its features and options, as they vary with the type of scanner you have.  My scanner has an Automatic Document Feeder (ADF), so be aware that my instructions are specific to an ADF scanner.</p>
<p>This example scans a letter-sized page (215.9 mm by 279.4 mm) in batch mode and saves the output as TIFF:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">scanimage <span style="color: #660033;">-y</span> <span style="color: #000000;">279.4</span> <span style="color: #660033;">-x</span> <span style="color: #000000;">215.9</span> <span style="color: #660033;">--batch</span> <span style="color: #660033;">--format</span>=tiff <span style="color: #660033;">--mode</span> Lineart <span style="color: #660033;">--resolution</span> <span style="color: #000000;">300</span> <span style="color: #660033;">--source</span> ADF</pre></div></div>

<p>This will output a new TIFF for each page that is scanned.</p>
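<p>A note on file names: the script further down relies on the batch pages being called out1.tif, out2.tif and so on, and a plain shell glob sorts out10.tif before out2.tif.  The following is a minimal sketch, not part of the original script, of listing the pages in numeric order.  The function name list_pages is made up for illustration, and it assumes scanimage's default batch naming with the counter starting at 1.</p>

```shell
# List scanimage batch pages in numeric page order.
# Assumption: pages are named out1.tif, out2.tif, ... with no gaps.
list_pages() {
    dir=${1:-.}
    n=1
    # Walk the counter upwards instead of globbing, so out10.tif
    # comes after out9.tif rather than after out1.tif.
    while [ -e "$dir/out$n.tif" ]; do
        printf '%s\n' "$dir/out$n.tif"
        n=$((n + 1))
    done
}
```

<p>Calling list_pages with no argument lists the pages in the current directory, one per line.</p>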
<p>The script below combines several steps to produce a single PDF document and a .txt file for a scan job.</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;"><span style="color: #007800;">outname</span>=<span style="color: #007800;">$1</span>
<span style="color: #007800;">startdir</span>=$<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #7a0874; font-weight: bold;">pwd</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #007800;">tmpdir</span>=scan-<span style="color: #007800;">$RANDOM</span>
&nbsp;
<span style="color: #7a0874; font-weight: bold;">cd</span> <span style="color: #000000; font-weight: bold;">/</span>tmp
<span style="color: #c20cb9; font-weight: bold;">mkdir</span> <span style="color: #007800;">$tmpdir</span>
<span style="color: #7a0874; font-weight: bold;">cd</span> <span style="color: #007800;">$tmpdir</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;################## Scanning ###################&quot;</span>
scanimage <span style="color: #660033;">-y</span> <span style="color: #000000;">279.4</span> <span style="color: #660033;">-x</span> <span style="color: #000000;">215.9</span> <span style="color: #660033;">--batch</span> <span style="color: #660033;">--format</span>=tiff <span style="color: #660033;">--mode</span> Lineart <span style="color: #660033;">--resolution</span> <span style="color: #000000;">300</span> <span style="color: #660033;">--source</span> ADF
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;################### OCRing ####################&quot;</span>
<span style="color: #007800;">i</span>=<span style="color: #000000;">1</span>
<span style="color: #000000; font-weight: bold;">for</span> page <span style="color: #000000; font-weight: bold;">in</span> <span style="color: #000000; font-weight: bold;">*</span>.tif; <span style="color: #000000; font-weight: bold;">do</span>
        <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #660033;">-n</span> <span style="color: #ff0000;">&quot;Page: <span style="color: #007800;">$i</span> - &quot;</span>
        <span style="color: #666666; font-style: italic;">#run tesseract on each page and combine the outputs in a single file with a .txt extension.</span>
        tesseract <span style="color: #007800;">$page</span> <span style="color: #007800;">$page</span>
        <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;---BEGIN PAGE: <span style="color: #007800;">$i</span> ---&quot;</span> <span style="color: #000000; font-weight: bold;">&gt;&gt;</span> <span style="color: #007800;">$outname</span>.txt
        <span style="color: #c20cb9; font-weight: bold;">cat</span> <span style="color: #007800;">$page</span>.txt <span style="color: #000000; font-weight: bold;">&gt;&gt;</span> <span style="color: #007800;">$outname</span>.txt
        <span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;---END PAGE: <span style="color: #007800;">$i</span> ---&quot;</span> <span style="color: #000000; font-weight: bold;">&gt;&gt;</span> <span style="color: #007800;">$outname</span>.txt
        <span style="color: #007800;">i</span>=$<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #c20cb9; font-weight: bold;">expr</span> <span style="color: #007800;">$i</span> + <span style="color: #000000;">1</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>
<span style="color: #000000; font-weight: bold;">done</span>
<span style="color: #c20cb9; font-weight: bold;">mv</span> <span style="color: #007800;">$outname</span>.txt <span style="color: #007800;">$startdir</span>
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;############## Converting to PDF ##############&quot;</span>
<span style="color: #666666; font-style: italic;">#Use tiffcp to combine output tiffs to a single multi-page tiff</span>
tiffcp <span style="color: #660033;">-c</span> lzw out<span style="color: #000000; font-weight: bold;">*</span>.tif output.tif 
<span style="color: #666666; font-style: italic;">#Convert the tiff to PDF</span>
tiff2pdf output.tif <span style="color: #000000; font-weight: bold;">&gt;</span> <span style="color: #007800;">$startdir</span><span style="color: #000000; font-weight: bold;">/</span><span style="color: #007800;">$outname</span>
<span style="color: #7a0874; font-weight: bold;">cd</span> ..
<span style="color: #7a0874; font-weight: bold;">echo</span> <span style="color: #ff0000;">&quot;################ Cleaning Up ################&quot;</span>
<span style="color: #c20cb9; font-weight: bold;">rm</span> <span style="color: #660033;">-rf</span> <span style="color: #007800;">$tmpdir</span>
<span style="color: #7a0874; font-weight: bold;">cd</span> <span style="color: #007800;">$startdir</span></pre></div></div>

<p>I named the script above &#8220;scandoc&#8221;.  Running &#8220;scandoc myoutput.pdf&#8221; drops a PDF (myoutput.pdf) and a text file (myoutput.pdf.txt) in the current directory, containing all the pages from the ADF.  Very handy!</p>
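<p>For comparison, here is a more defensive sketch of the same steps wrapped in a shell function.  It is one possible hardening rather than the script I actually run: it quotes variables, uses mktemp instead of scan-$RANDOM, and walks the pages in numeric order.  It assumes scanimage, tesseract, tiffcp and tiff2pdf are on the PATH, and that batch pages are named out1.tif, out2.tif and so on; the intermediate names page1, page2 and combined.tif are made up for illustration.</p>

```shell
# Defensive variant of scandoc, written as a function.
# Assumption: batch pages are named out1.tif, out2.tif, ... with no gaps.
scandoc() {
    outname=$1
    [ -n "$outname" ] || { echo "usage: scandoc OUTPUT.pdf" 1>&2; return 2; }
    startdir=$(pwd)
    tmpdir=$(mktemp -d) || return 1   # unique working directory
    status=0

    (
        cd "$tmpdir" || exit 1
        # The exit status is deliberately ignored: scanimage may report an
        # error once the feeder runs empty. The check below catches the
        # case where nothing was scanned at all.
        scanimage -y 279.4 -x 215.9 --batch --format=tiff \
            --mode Lineart --resolution 300 --source ADF || true
        [ -e out1.tif ] || exit 1

        : > "$startdir/$outname.txt"  # start with an empty text file
        i=1
        set --                        # collect pages in numeric order
        while [ -e "out$i.tif" ]; do
            tesseract "out$i.tif" "page$i"   # writes page$i.txt
            {
                echo "---BEGIN PAGE: $i ---"
                cat "page$i.txt"
                echo "---END PAGE: $i ---"
            } >> "$startdir/$outname.txt"
            set -- "$@" "out$i.tif"
            i=$((i + 1))
        done

        # Combine the pages into one multi-page TIFF, then convert to PDF.
        tiffcp -c lzw "$@" combined.tif &&
            tiff2pdf -o "$startdir/$outname" combined.tif
    ) || status=$?

    rm -rf "$tmpdir"                  # clean up even if a step failed
    return $status
}
```

<p>To use this as a standalone script, save it to a file and add a final line that calls the function with the script's arguments.  Running the work in a subshell means the caller's working directory is never changed, so no trailing cd back is needed.</p>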
<p><strong>EDIT:</strong><br />
I&#8217;ve added Joe&#8217;s contributions from the comments to a <a href="http://gist.github.com/311548">gist</a> on GitHub.</p>
]]></content:encoded>
			<wfw:commentRss>http://jduck.net/2008/01/05/ocr-scanning/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
		</item>
		<item>
		<title>Wikifying my life</title>
		<link>http://jduck.net/2007/03/02/wikifying-my-life/</link>
		<comments>http://jduck.net/2007/03/02/wikifying-my-life/#comments</comments>
		<pubDate>Sat, 03 Mar 2007 01:24:44 +0000</pubDate>
		<dc:creator>Jonah</dc:creator>
				<category><![CDATA[gradwork]]></category>
		<category><![CDATA[hacks]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://jduck.net/2007/03/02/wikifying-my-life/</guid>
		<description><![CDATA[I started working on my PhD proposal and decided that it would be best done as a wiki. I have been using MoinMoin for a wiki in my lab with my labmates. I like MoinMoin, but I don&#8217;t like how much of a pain it is to install it on debian. I decided to go [...]]]></description>
			<content:encoded><![CDATA[<p>I started working on my PhD proposal and decided that it would be best done as a wiki.  I have been using <a href="http://moinmoin.wikiwikiweb.de/">MoinMoin</a> for a wiki in my lab with my labmates.  I like MoinMoin, but I don&#8217;t like how much of a pain it is to install on Debian.  I decided to go back to <a href="http://www.mediawiki.org/">MediaWiki</a>, as it seems to be performing a bit better these days and its markup is increasingly the standard for wikis.  So I now have a MediaWiki set up for tracking my research, my reading and my PhD proposal.  I might even get really brave and do my whole dissertation in wiki form.<br />
<span id="more-78"></span><br />
Once I got started on the proposal in wiki form, I decided I could probably track all of my simulation analysis by logging the steps I take on the wiki.  This required finding some code that acts as a wiki client so that I can effectively pipe output directly to a wiki page, say Log:2007-XX-XX, or to a new topic-appropriate log for every day.  After looking for a while, I found two tools that can do this in one way or another: <a href="http://search.cpan.org/~markj/WWW-Mediawiki-Client/bin/mvs">mvs</a> and <a href="http://wikipediafs.sourceforge.net/">wikipediafs</a>.  mvs will work wherever there is Perl with the appropriate Perl modules, and wikipediafs depends on the FUSE system in Linux.  They both look promising.  More on this later as I get some examples going.</p>
]]></content:encoded>
			<wfw:commentRss>http://jduck.net/2007/03/02/wikifying-my-life/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

