kirby1024 | Job Update

I don't often talk about my job on LJ. In many ways, it's a pretty boring one - I scan documents, then correct and mark up the output so that it's readable for blind students at the university. It's one of those unfortunate jobs that is monotonous, but requires a human. In return, I get good uni pay, a workplace that happens to be my place of study, flexible hours, no supervision, and constant access to an internet connection. So, while the work itself gets boring, the fringe benefits more than make up for it.

Currently I'm scanning in an education textbook on classroom discipline. Now, as a scanner, I'm not a fan of this book. The inking throughout the book is various shades of green (which is fine, since the scanner scans in B/W full contrast anyway), the pages are very thin (and as such, I keep getting faded bits of text from the other side of the page, which gives the OCR program hissy fits), it's big enough that I can't do double-page scanning on the scanner I'm using, and it has various paper planes flying around in the margins. It's also got a very poor margin system, in which both pages have the most margin on the left-hand side of the page, and very little on the right side. Which means I really have to put my arms into it when I scan the damn thing, or else I miss the last character of each line. And until my boss manages to grab a reading list for the book, I have to scan the whole damn thing in. All 330 pages of it. Oh, and there are comic strips throughout the book. Which I have to transcribe.

The only thing that makes it bearable is that all the tables are really simple, and the diagrams are few and far between. It's mostly just nice text, which means it won't be a horror come correcting and mark-up time.

So, for those of you intending to write textbooks, please, for the love of god, use decent-thickness paper, don't even consider comics throughout the book, and make sure you've left at least some margin on the spine side of the book. For our sakes, if nothing else...




carlos-v-b.livejournal.com

Or, you know, provide electronic copies if requested *grin*

kirby1024.livejournal.com

Doesn't help that much - still have to put the PDFs through to extract the text and transcribe all the images and tables...

Been doing that for a particular book for about 2 months...

nifwlseirff.livejournal.com

Isn't there some automated software that can pull text out of PDFs (if they have allowed text selection)?
I'm sure I heard about it on one of the techwriting lists...
off to search.....

heh.... search yields many options...
(pdf to text software)

I don't recognise the toolnames, but then I've never had to do that.. normally it's taking Word docs and getting (by hand) into FrameMaker or recently, XML, and using PDF as the publishable output.

Please tell me you use software to pull out the text and not cut-paste?

;>

(Note - I use cut-paste to get data into XML topics, cos it does a much better (and faster) job than any extraction to XML... software really doesn't understand DITA yet....)

We use a program called ABBYY FineReader, which allows us to do effective "Batch scans" (so it goes through and opens the entire PDF file, converts everything into images, then reads it all as if it were very high-quality images). We take the output of these batch scans, then use the program's export utility to export them directly into MS Word, and then work on the reformatting there.

We could extract the text directly from the PDF files (and I've worked on extracted text before), but it turns out it's not nearly as fast: we have to spend much longer transcribing tables and images, since we don't have anything close to an imprint of these in the file already. Also, extracting all the text from a PDF tends to throw the formatting of a lot of little text boxes entirely out of whack. Believe it or not, it's usually faster to send a publisher's PDF through the OCR program and then tidy it up, rather than pull all the text out and then add everything in afterwards.

Of course, for a lot of PDFs, this is all academic - a lot of the PDFs are just PDFised image scans of small document fragments, and so we have to treat them like a set of images anyway, because they are. Thank god for ABBYY FineReader, since it can scan from both a scanner and a PDF file...

http://users.livejournal.com/_sabik_/

Note that it's possible to set up PDFs so that they are accessible to "approved" screen readers, but it's not possible to cut'n'paste from them.

How evil.

η

Thankfully, such PDFs respond very well to OCR software. Once again, so glad we have ABBYY FineReader, considering how it natively handles PDF files...

Umm, electronic copies as in whatever they had before it turned into PDF? Doesn't help with images or (necessarily) tables, but the rest of it should just copy over.

Everybody does typesetting electronically these days - there's just no excuse.

η

Also doesn't help with a lot of Library-scanned PDFs - the Library seems to just scan the document in, then take all the images and throw them into a PDF file - that needs to be sent through OCR software (which we can do).

The Journal of Lee Davis-Thalbourne

Musings of a Very Odd Boy

