Webmonkey is a property of Wired Digital.

OCR Tech Allows Google to Index Millions of Scanned Documents

Scanned PDFs are a kind of darknet on the web: at best, search engines see an image inside a PDF but can’t parse out the actual text. That has now changed. Google recently announced that it will begin using OCR (optical character recognition) technology to index the text inside scanned PDF documents.

Although there’s no flashy new interface or anything tangibly different in Google’s search results page, the new technology means that the full text of some 300 million PDF files in Google’s index will soon be converted to searchable text.

That’s quite a boost for your search results, though whether the PDFs show up in your searches depends a lot on what you search for. Google’s examples suggest that many of these documents are very technical, like this guide to repairing aluminum wiring (follow the link and then click “view as HTML” to see what the results look like).

Lifehacker has a fairly novel way to put the new features to work for you — upload your scanned PDFs, tell Google about them with a link and then sit back and wait for your free OCR conversion.

Certainly there are faster ways of converting scanned documents and, given that most scanners ship with free OCR programs, we’re not sure how practical the idea is, but they get points for creativity.
