thewayne | Mar. 4th, 2019

About 12:30am my wife was either checking email from work or signed on to the work chat channel, and found out some pretty rough news: the dog of a friend of ours, with whom we game occasionally, got skunked - at work! I have never seen, much less smelled, a skunk at the observatory. But it makes sense: if I see and smell skunks on a regular basis around our house, 15 or so miles away from the observatory and at roughly the same altitude, give or take 300 feet, then they'd also be at the observatory. The owner isn't on my wife's crew; she's on the 2.5 meter telescope.

So Russ changed clothes (fortunately I'd just finished laundry and they were in the dryer), grabbed our dog washing kit, complete with anti-skunk enzyme, and drove off to the observatory.

I don't envy the concept of washing a dog outside. If it wasn't below freezing, it had to be in the low 30s. But the bathrooms in the main building are toilets-only, and as I recall the dorms are showers-only, so it had to be a hose bib and bucket bath outside. Plus, they'd probably get in a lot of trouble for washing a skunked dog in the dorms!

Fortunately Russet's working Wednesday/Thursday, so this'll help get her on a day sleep schedule. Lexie, the dog, is a medium-hair, and I'm not sure how that'll affect how long the smell persists. When Rupert, our bluetick hound with short hair, got skunked last year, it took one bath and just a couple of days and he was scent-free. Hopefully Lexie won't persist in being odoriferous like our poodles do.

I'm doing an internship in our local university library through April, and my main task is scanning their annual 'Reports To The President', a précis of college activity sent up to main campus and bound in a book, usually hard-bound. The oldest book was 1965-66, the newest that I've seen thus far is '98-99. I believe there are newer already in PDF format online on the local network. Apparently by scanning them and then coding an RDA record for each file, we can get them hosted by the state academic library organization, or somebody, for free.

So that's cool.

I'm using a fairly spiffy Fujitsu specialized document scanner that can scan two pages of a bound book in one pass, but I don't think their software is as good as they claim it to be. It can handle a pretty significant amount of curvature in the books - for example, I was scanning three pages starting at page 385 of 427, so LOTS of curve when you're that far in. I was holding up the left side to get the right page reasonably flat, then holding down both sides with one finger.

And yes, the fingers were captured by the scan.

After you've done your scan, you get into the next phase, where you drag a wire frame to line up one line down the spine between the two pages, then you align four corners to the outside corners of the pages. The program does a good job of detecting the edges and snapping to them, but sometimes you have to do some dragging to improve alignment. Once you've aligned all the scanned pages correctly, you click an Apply button and it re-cuts the scans into individual pages and flattens them, programmatically removing the curve. It does a very good job, though not perfect.

THEN you have to go back through every page and remove the fingertips! The program has a special tool just for this that works a lot like Photoshop's patch tool, except that it auto-selects the fingertip. Click Apply, and the fingertip vanishes.

Once you've removed the fingertips, you can save it to PDF. Theoretically the program performs OCR (optical character recognition), but I can't see that it has any effect. I end up loading the PDF into Acrobat Pro and running OCR there.

And this is where I learned something tonight. While you can't do a spell-check on a scanned document because you're dealing with a scanned image, not words, there's something similar: a fragment check. Fragments are things that Adobe Acrobat flags as 'I think this is a word or something, but I'm not sure, therefore I didn't map it into the OCR side of the document. Fix it.' Acrobat can't provide a dictionary of suggestions like Word, so when it sees something that it thinks was a word but couldn't map, you have to type the correction, whether that's a word, a page number, or a budget number. Or tell it to ignore it.
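To make the idea concrete, here is a rough Python sketch of what a fragment check amounts to: flag any token the recognizer couldn't match to a known word, then apply hand-typed corrections. This is not how Acrobat does it internally; the function names, the tiny word list, and the misread "5an" are all made up for illustration.

```python
import re

def find_fragments(ocr_text, known_words):
    """Return tokens from ocr_text that are not in known_words (case-insensitive)."""
    known = {w.lower() for w in known_words}
    tokens = re.findall(r"[A-Za-z0-9']+", ocr_text)
    return [t for t in tokens if t.lower() not in known]

def apply_corrections(ocr_text, corrections):
    """Replace each whole-token fragment with the correction a human typed in."""
    for wrong, right in corrections.items():
        ocr_text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, ocr_text)
    return ocr_text

# Hypothetical OCR output and word list, invented for this example.
sample = "Enrollment at the 5an Juan campus rose"
words = ["enrollment", "at", "the", "san", "juan", "campus", "rose"]

suspects = find_fragments(sample, words)
fixed = apply_corrections(sample, {"5an": "San"})
print(suspects)
print(fixed)
```

Until the fragment is corrected, a search for "San Juan" in the sample text finds nothing, which is the whole reason the check matters.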

It took me a good half an hour to fix a three-page document. I don't know how many times I typed the San of San Juan. Just the San; apparently Juan was recognizable.

And that was a three-page document from '67-68. The latest document from '98-99? That was 40-some pages. I'm going to run a fragment check on it tomorrow afternoon and we shall see how long it takes to fix.

One very odd thing about scanning two pages at once in bound books - the page sequence is reversed! This is easily fixed in Acrobat Pro when you're dealing with a handful or two of pages; you just slide page thumbnails around. But dealing with 30 or 40 pages? Next week I'll try scanning a book starting with the last pages and working my way forward, and see how that works.
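For longer documents, the reordering could in principle be scripted instead of dragged. Below is a minimal Python sketch that only computes the corrected page order as a list of 0-based indexes. It covers two guesses about what "reversed" means, since that depends on the scanner output: either the whole file is back to front, or each two-page spread has its pages swapped. The function names are hypothetical, and the resulting index list would still have to be handed to whatever PDF tool can copy pages by index.

```python
def reverse_all(page_count):
    """Order that flips the whole document back to front."""
    return list(range(page_count - 1, -1, -1))

def swap_pairs(page_count):
    """Order that swaps the two pages of each spread; a trailing odd page stays put."""
    order = []
    for first in range(0, page_count, 2):
        second = first + 1
        if second < page_count:
            order.extend([second, first])
        else:
            order.append(first)
    return order

def reorder(pages, order):
    """Return the pages rearranged according to a list of source indexes."""
    return [pages[i] for i in order]

# Stand-in page labels; a real run would use page objects from a PDF library.
scanned = ["p2", "p1", "p4", "p3", "p5"]
fixed = reorder(scanned, swap_pairs(len(scanned)))
print(fixed)  # ['p1', 'p2', 'p3', 'p4', 'p5']
```

Checking the first few pages of a scan by eye should be enough to tell which of the two orderings applies.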

So important tip when creating PDFs from scanned docs for public consumption: running OCR is only half the job. If you need the document to be searchable, you MUST spend the time to run a fragment check on it and fix all of the problems! Otherwise you're going to frustrate anyone needing to do anything serious with the document.

One thing that makes me really wish I had a working Mac laptop: I'd like to take an unfixed doc and run it through text to speech and see how it works. Then run the fixed doc through TTS. Might be interesting.



My wife was quite the good samaritan last night

Learned some interesting things about scanning books and OCR processing today
