Webmonkey is a property of Wired Digital.

Google Spiders to Start Crawling The ‘Deep’ Web

Google recently announced it will soon begin indexing the so-called “deep” web, those pages hiding behind HTML forms and other inadvertently spider-blocking HTML elements. The move will potentially open up a whole new range of webpages that were previously invisible to the search engine.

Among the possible wins for Google users is the ability to find pages within sites based on searches of those sites. As the Google Webmaster blog explains:

For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made.
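
To make that description concrete, here is a rough sketch of how candidate URLs could be generated from a form. It is not Google's actual code; the function name `candidate_urls`, the example.com address, and the field names and values are all made up for illustration. It simply pairs every candidate value for each input and builds one GET URL per combination.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, fields):
    """Yield one GET URL per combination of candidate field values."""
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical form with a text box and a select menu.
form_fields = {
    "q": ["spiders", "forms"],          # words picked from the site's own text
    "category": ["news", "tutorials"],  # values listed in the select menu's HTML
}

urls = list(candidate_urls("http://example.com/search", form_fields))
for url in urls:
    print(url)
```

Two values for each of two inputs gives four URLs to try. A real crawler would presumably need to prune this list, since the number of combinations grows quickly as inputs are added.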

The results of those crawls would then show up in your Google search results, potentially offering a faster, more direct way to reach the information you’re searching for.

Before any webmasters out there freak out about the possibility that Google will index pages you don’t want indexed, note that the Google spiders will still obey any robots.txt, nofollow, and noindex rules. However, if you have a site you don’t want crawled and you’ve been relying on a form as a means of blocking spiders, it’s time to break out the robots.txt file and specifically disallow your pages.
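
If you want to check your rules before the spiders arrive, Python's standard `urllib.robotparser` module can evaluate a robots.txt file against sample URLs. The sketch below is only an illustration: the disallowed paths (`/members/` and `/search`) and the example.com URLs are assumptions, so substitute your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all crawlers out of the member
# area and away from form-generated search result pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /search
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="Googlebot"):
    """Return True if the rules above let the given agent fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "http://example.com/articles/deep-web",
    "http://example.com/members/profile",
    "http://example.com/search?q=spiders",
):
    print(url, allowed(url))
```

Note that this only tests the rules as Python's parser reads them. Each crawler applies its own matching logic, so treat the result as a sanity check rather than a guarantee.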

Another fairly humorous scenario mentioned on Hacker News serves as a reminder that using GET to modify content is a very bad idea. One poor webmaster discovered the Google spider accidentally deleted his whole site by following GET-based delete URLs. Don’t be that guy.
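
One way to avoid that fate is to refuse destructive actions unless they arrive as POST requests, since a crawler following links only issues GETs. Here is a minimal sketch of the idea. The `handle` function, the `/delete/` URL scheme, and the in-memory article store are hypothetical stand-ins for whatever your framework provides.

```python
def handle(method, path, articles):
    """Return (status, body) for a tiny hypothetical article endpoint."""
    prefix = "/delete/"
    if path.startswith(prefix):
        if method != "POST":
            # A crawler following a link sends GET, so it is refused here.
            return 405, "Method Not Allowed"
        articles.pop(path[len(prefix):], None)
        return 200, "Deleted"
    if method == "GET":
        # Reads are safe to repeat, so GET is fine for listing articles.
        return 200, ", ".join(sorted(articles))
    return 405, "Method Not Allowed"


store = {"1": "Hello", "2": "World"}

spider_result = handle("GET", "/delete/1", store)   # what a spider would send
user_result = handle("POST", "/delete/1", store)    # a deliberate form submit
print(spider_result, user_result, sorted(store))
```

The GET request leaves the store untouched, while the POST removes the article. Requiring POST won't stop a determined attacker, so deletes should still sit behind authentication.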

Google says that the new form-filling spiders will only be crawling certain sites, though it doesn’t offer any details about which sites it will hit.

We’ll have to wait a while to see how well this experiment works, but if it succeeds, it could open up a whole new wealth of information.

[via Slashdot]
