thewayne | More on PDFs and OCR

More things learned today:

Quality of OCR in Acrobat Pro decreases (fragment rate increases) if the:
-- Font is Bold
-- Page isn't flat
-- Page is skewed

I knew about the second and third; the first one surprised me. The effect of the bold type may have been made worse by pages that weren't perfectly flat or square.

Another surprise came when I did a little digging beneath the shortcut that launches the program. The executable was dated 2014! They bought the system in '15 and the software was pre-installed, so nothing has been updated since it was first set up. The head librarian is going to look for the salesman's contact info to see what can be done about an update. The software neither checks for updates on its own nor has a menu function for checking. I went to the manufacturer's web site, and the actual software that we use isn't listed! I think they've moved to something with a different name, so I don't know if we'll be able to update it.

I scanned one of the hardbounds from back to front, and the software's habit of building the PDF backwards held up: because I scanned the book in reverse, the PDF came out in the correct sequence.
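Here's a toy sketch of why scanning in reverse works. It assumes the scanning software puts each newly scanned page in front of the ones already in the PDF, which is my guess at what "building the PDF backwards" amounts to. The function name `build_pdf` is made up for illustration and has nothing to do with the real software.

```python
def build_pdf(scan_order):
    """Model a scanner that places each new page at the front of the PDF.

    Assumption: this prepend behavior is only a guess at how the real
    scanning software assembles its output.
    """
    pdf = []
    for page in scan_order:
        pdf.insert(0, page)  # newest scan goes in front
    return pdf


book = [1, 2, 3, 4, 5]

# Scanning front to back leaves the PDF reversed.
front_to_back = build_pdf(book)

# Scanning back to front cancels the reversal out.
back_to_front = build_pdf(list(reversed(book)))

print(front_to_back)
print(back_to_front)
```

Under that assumption the two reversals cancel, so a back-to-front scan needs no page reordering afterwards.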

I tested how long it takes to fix fragments by repairing the first ten pages of one of the 40+ page PDFs. It came to 3-4 minutes per page, meaning 2+ hours for each of the large PDFs. The head librarian didn't think that was a good use of my limited time right now, so we're not going to do it. I think it's a good call: we can focus on the core job of getting all of the reports scanned, which is the main goal of my internship. The OCR is good enough that anyone who wants to do Finds on these documents will get reasonable hit rates. Still, the half hour or so I spent fixing 10 pages was worth it, since it gave us that 3-4 minutes per page figure.
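The back-of-the-envelope math, as a quick sketch. The 3-4 minutes per page is my measured figure from those ten pages; the function name `repair_minutes` is just something I made up for the example.

```python
def repair_minutes(pages, minutes_per_page=(3, 4)):
    """Return the (low, high) estimate in minutes for repairing fragments."""
    low, high = minutes_per_page
    return pages * low, pages * high


# The ten-page test run and a 40-page report.
for pages in (10, 40):
    low, high = repair_minutes(pages)
    print(f"{pages} pages: {low}-{high} minutes "
          f"({low / 60:.1f}-{high / 60:.1f} hours)")
```

For a 40-page report that works out to 120-160 minutes, which is where the 2+ hours comes from, and longer reports only go up from there.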

I noticed that bold seems to increase fragment rates while I was doing the edit. The second page of these reports is a table of contents with an index entry on the left and then periods filling to the page number on the right, and the entire page is bold. On many lines the program identified these repeating periods as fragments, and I had to tell it that they are not words. If the page had been laid out in an actual publishing program with proper kerning and such, maybe it would have scanned better and the OCR would have performed better; I don't know. What is clear is that bolding the entire page made it harder for the OCR to read.

And this highlights a programming weakness in Acrobat Pro. The fragment interface has an option to highlight all fragments in the document, but if you skip over one, you have no way of going back to it. Bad interface design! But it's also Acrobat v11, which is long past its end-of-support date, so I guess I shouldn't be surprised. It's possible that current AcroPro versions have improved functionality, but I wouldn't know: the copy on my Mac is version 10, a generation older!


"If the page had been laid out in an actual publishing program with proper kerning and such"

I.e., they shouldn't have used MS Word to create the document. The table of contents you describe is exactly what MS Word produces if you want a table of contents that updates as you update the document. Usually it winds up in bold because the headers for the various sections of your document are in bold.

Not a good use of your time, but hey, it's another sacrifice being made because it's you doing the project, without someone on the staff following behind you to clean things up and make sure it's good.

It sounds interesting, having to learn the quirks of the system.

True. The format was probably a standard laid down by the main campus Office of the President, possibly even a master document. It's been a while since I did hard-core work in a word processor, and I've forgotten a lot (and glad that I have!).
