Job Update

May. 5th, 2006 04:39 pm
[personal profile] kirby1024
I don't often talk about my job on LJ. In many ways, it's a pretty boring one - I scan documents, then correct and mark up the output so that it's readable for blind students at the university. It's one of those unfortunate jobs that is rather boring and monotonous, but requires a human. On the other hand, I get good uni pay, a workplace that happens to be my place of study, flexible hours, no supervision, and constant access to an internet connection. So, while the work itself gets boring, the fringe benefits more than make up for it.

Currently I'm scanning in an education textbook on classroom discipline. Now, as a scanner, I'm not a fan of this book. The ink throughout the book is in various shades of green (which is fine, since the scanner scans in B/W full contrast anyway), the pages are very thin (and as such, I keep getting faded bits of text from the other side of the page, which gives the OCR program hissy fits), it's big enough that I can't do double-page scanning on the scanner I'm using, and it has various paper planes flying around in the margins. It's also got a very poor margin system, in which both pages have the most margin on the left-hand side of the page, and very little on the right side. Which means I really have to put my arms into it when I scan the damn thing, or else I miss the last character of each line. And until my boss manages to grab a reading list for the book, I have to scan the whole damn thing in. All 330 pages of it. Oh, and there are comic strips throughout the book. Which I have to transcribe.

The only thing that makes it bearable is that all the tables are really simple, and the diagrams are few and far between. It's mostly just nice text, which means it won't be a horror come correcting and mark-up time.

So, for those of you intending to write textbooks, please, for the love of god, use decent-thickness paper, don't even consider putting comics throughout the book, and make sure you've left at least some margin on the spine side of the book. For our sakes, if nothing else...

(no subject)

Date: 2006-05-05 07:19 am (UTC)
From: [identity profile] carlos-v-b.livejournal.com
Or, you know, provide electronic copies if requested *grin*

(no subject)

Date: 2006-05-05 10:25 am (UTC)
From: [identity profile] kirby1024.livejournal.com
Doesn't help that much - still have to run the PDFs through OCR to extract the text, and transcribe all the images and tables...

Been doing that for a particular book for about 2 months...

(no subject)

Date: 2006-05-05 10:50 am (UTC)
From: [identity profile] nifwlseirff.livejournal.com
Isn't there some automated software that can pull text out of PDFs (if they have allowed text selection)?
I'm sure I heard about it on one of the techwriting lists...
off to search.....

(no subject)

Date: 2006-05-05 11:00 am (UTC)
From: [identity profile] nifwlseirff.livejournal.com
heh.... search yields many options...
(pdf to text software)

I don't recognise the tool names, but then I've never had to do that... normally it's taking Word docs and getting them (by hand) into FrameMaker or, recently, XML, and using PDF as the publishable output.

Please tell me you use software to pull out the text and not cut-paste?

;>

(Note - I use cut-paste to get data into XML topics, cos it does a much better (and faster) job than any extraction to XML... software really doesn't understand DITA yet....)

(no subject)

Date: 2006-05-08 01:23 am (UTC)
From: [identity profile] kirby1024.livejournal.com
We use a program called ABBYY FineReader, which allows us to do what are effectively "batch scans" (so it goes through and opens the entire PDF file, converts every page into an image, then reads it all as if it were a set of very high-quality scans). We take the output of these batch scans, then use the program's export utility to export it directly into MS Word, and then work on the reformatting there.

We could extract the text directly from the PDF files (and I've worked on extracted text before), but it turns out it's not nearly as fast: we have to spend much longer transcribing tables and images, since we don't have anything close to an imprint of these in the file already. Also, extracting all the text from a PDF tends to throw a lot of the formatting of little text boxes entirely out of whack. Believe it or not, it's generally faster to send a publisher's PDF through the OCR program, then tidy it up, rather than pull all the text out and then add everything in afterwards.

Of course, for a lot of PDFs, this is all academic - a lot of the PDFs are just PDFised image scans of small document fragments, and so we have to treat them like a set of images anyway, because they are. Thank god for ABBYY FineReader, since it can scan from both a scanner and a PDF file...
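
For the curious, the image-only case can be spotted before any work starts. Here's a minimal Python sketch, assuming some extraction tool has already handed you each page's text layer as a string. The names `needs_ocr` and `plan_pages` and the 20-character threshold are made up for illustration; this is not how FineReader decides anything, just one way the check might look.

```python
from typing import Dict, List


def needs_ocr(page_text: str, min_chars: int = 20) -> bool:
    """Guess whether a page is an image-only scan.

    Assumption: a page whose extracted text layer holds fewer than
    `min_chars` letters or digits has no usable text and must be OCRed.
    """
    # Count only letters and digits so stray whitespace or
    # punctuation does not make an empty page look like real text.
    meaningful = sum(ch.isalnum() for ch in page_text)
    return meaningful < min_chars


def plan_pages(page_texts: List[str]) -> Dict[int, str]:
    """Map each 1-based page number to "ocr" or "text"."""
    return {
        number: ("ocr" if needs_ocr(text) else "text")
        for number, text in enumerate(page_texts, start=1)
    }


if __name__ == "__main__":
    # Hypothetical text layers for a three-page file.
    pages = ["", "   \n", "Chapter 1: Classroom discipline starts with clear rules."]
    print(plan_pages(pages))
```

A library-style scan would come back as "ocr" for every page, which tells you straight away that the whole file has to be treated as images.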

(no subject)

Date: 2006-05-05 04:21 pm (UTC)
From: [identity profile] http://users.livejournal.com/_sabik_/
Note that it's possible to set up PDFs so that they are accessible to "approved" screen readers, but it's not possible to cut'n'paste from them.

How evil.

η

(no subject)

Date: 2006-05-08 01:26 am (UTC)
From: [identity profile] kirby1024.livejournal.com
Thankfully, such PDFs respond very well to OCR software. Once again, so glad we have ABBYY FineReader, considering how it natively handles PDF files...

(no subject)

Date: 2006-05-05 04:11 pm (UTC)
From: [identity profile] http://users.livejournal.com/_sabik_/
Umm, electronic copies as in whatever they had before it turned into PDF? Doesn't help with images or (necessarily) tables, but the rest of it should just copy over.

Everybody does typesetting electronically these days - there's just no excuse.


η

(no subject)

Date: 2006-05-08 01:25 am (UTC)
From: [identity profile] kirby1024.livejournal.com
Also doesn't help with a lot of Library-scanned PDFs - the Library seems to just scan the document in, then take all the images and throw them into a PDF file - that needs to be sent through OCR software (which we can do).
