File Under: APIs, Web Apps

Google Docs Can Now Convert Images and PDFs to Text

Google’s web-based document editor can now convert the text inside your PDFs and images into text you can edit.

When you upload a file to Google Docs, you’ll see the option to “Convert text from PDF or image files to Google Docs documents.” You can upload any PDF, PNG, JPG or GIF.

To do the conversion, Google is relying on a technology commonly known as Optical Character Recognition, or OCR. The company began using OCR for web searches in 2008, then released experimental support for OCR-based conversion as part of its Documents List Data API in 2009.

Google has been improving the technology since then, and this is the conversion feature’s first appearance in a Google product. Because it’s part of the API, you can also roll it into an app of your own creation, and we can expect the conversion tool to improve and yield some pretty cool applications down the road.

The conversion isn’t perfect, and the results will vary based on the resolution and visual clarity of whatever you’re uploading.

We converted Mark Klein’s public declaration from the AT&T/NSA wiretapping case. Here’s the original PDF from the Electronic Frontier Foundation, and here’s our Googlefied MS Word .doc file.

The cleaner the layout and the text rendering, the cleaner the result.

Below is a screenshot of Wired magazine’s iPad app, followed by the Google Docs conversion of it. You’ll notice it had some problems with the pull quote and the hyphens, but it navigated the two-column layout pretty well.