Google Webmaster Central Blog - Official news on crawling and indexing sites for the Google index


A proposal for making AJAX crawlable
Wednesday, October 07, 2009 at 10:51 AM
Webmaster level: Advanced

Today we're excited to propose a new standard for making AJAX-based websites crawlable. This will benefit webmasters and users by making content from rich and interactive AJAX-based websites universally accessible through search results on any search engine that chooses to take part. We believe that making this content available for crawling and indexing could significantly improve the web.

While AJAX-based websites are popular with users, search engines traditionally have not been able to access any of their content. The last time we checked, almost 70% of the websites we know about use JavaScript in some form or another. Of course, most of that JavaScript is not AJAX, but the better search engines can crawl and index AJAX, the more developers can add rich features to their websites and still show up in search engines.

Some of the goals that we wanted to achieve with this proposal were:

* Minimal changes are required as the website grows

* Users and search engines see the same content (no cloaking)

* Search engines can send users directly to the AJAX URL (not to a static copy)

* Site owners have a way of verifying that their AJAX website is rendered correctly and thus that the crawler has access to all the content



Here's how search engines would crawl and index AJAX in our initial proposal:

* Slightly modify the URL fragments for stateful AJAX pages
Stateful AJAX pages display the same content whenever accessed directly. These are pages that could be referred to in search results. Instead of a URL like http://example.com/page?query#state we would like to propose adding a token to make it possible to recognize these URLs: http://example.com/page?query#[FRAGMENTTOKEN]state . Based on a review of current URLs on the web, we propose using "!" (an exclamation point) as the token for this. The proposed URL that could be shown in search results would then be: http://example.com/page?query#!state.

* Use a headless browser that outputs an HTML snapshot on your web server
The headless browser is used to access the AJAX page and generates HTML code based on the final state in the browser. Only specially tagged URLs are passed to the headless browser for processing. By doing this on the server side, the website owner is in control of the HTML code that is generated and can easily verify that all JavaScript is executed correctly. An example of such a browser is HtmlUnit, an open-source "GUI-less browser for Java programs".

* Allow search engine crawlers to access these URLs by escaping the state
As URL fragments are never sent with requests to servers, it's necessary to slightly modify the URL used to access the page. At the same time, this tells the server to use the headless browser to generate HTML code instead of returning a page with JavaScript. Other existing URLs, such as those used by users, would be processed normally, bypassing the headless browser. We propose escaping the state information and adding it to the query parameters with a token. Using the previous example, one such URL would be http://example.com/page?query&[QUERYTOKEN]=state . Based on our analysis of current URLs on the web, we propose using "_escaped_fragment_" as the token. The proposed URL would then become http://example.com/page?query&_escaped_fragment_=state . A rough server-side sketch of this dispatch appears after this list.

* Show the original URL to users in the search results
To improve the user experience, it makes sense to refer users directly to the AJAX-based pages. This can be achieved by showing the original URL (such as http://example.com/page?query#!state from our example above) in the search results. Search engines can check that the indexable text returned to the crawler is the same as, or a subset of, the text that is returned to users.
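
To make the dispatch between crawler and user requests more concrete, here is a minimal server-side sketch. It is not part of the proposal itself: it simply checks for the proposed _escaped_fragment_ query parameter and, if present, serves an HTML snapshot instead of the regular AJAX page. Node.js is used purely for illustration, and renderSnapshot is a hypothetical hook that would drive a headless browser such as HtmlUnit.

```javascript
// Minimal sketch of the proposed server-side dispatch (illustrative only).
// Assumptions: Node.js, a hypothetical renderSnapshot() hook that would drive a
// headless browser (e.g. HtmlUnit) to produce static HTML for a given state.
var http = require('http');
var url = require('url');

function renderSnapshot(state, callback) {
  // Hypothetical: run the AJAX page for "state" in a headless browser and
  // hand the resulting static HTML to the callback.
  callback('<html><body>Snapshot for state: ' + state + '</body></html>');
}

http.createServer(function (req, res) {
  var query = url.parse(req.url, true).query;
  var state = query['_escaped_fragment_'];

  if (state !== undefined) {
    // Crawler request: the fragment state arrived as an escaped query parameter,
    // so return the pre-rendered HTML snapshot for that state.
    renderSnapshot(state, function (html) {
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end(html);
    });
  } else {
    // Normal user request: serve the regular AJAX page, JavaScript and all.
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end('<html><body><script src="/app.js"></script></body></html>');
  }
}).listen(8080);
```

Only requests carrying the escaped fragment ever reach the headless browser, which keeps normal user traffic on the fast path.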




[Diagram of the proposed crawling scheme. Graphic by Katharina Probst]

In summary, a stateful URL such as http://example.com/dictionary.html#AJAX would be made available to both crawlers and users as http://example.com/dictionary.html#!AJAX , which could be crawled as http://example.com/dictionary.html?_escaped_fragment_=AJAX , and which in turn would be shown to users and accessed as http://example.com/dictionary.html#!AJAX .
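
As a companion to the summary above, here is a small JavaScript sketch of the two URL mappings, assuming the proposed "!" fragment token and "_escaped_fragment_" query token. It handles only the simple case where the escaped fragment is the last query parameter.

```javascript
// Sketch of the proposed URL mappings (simple cases only; illustrative, not normative).
function toCrawlableUrl(ajaxUrl) {
  // http://example.com/page?query#!state -> http://example.com/page?query&_escaped_fragment_=state
  var i = ajaxUrl.indexOf('#!');
  if (i === -1) return ajaxUrl;                        // not a stateful AJAX URL in the proposed form
  var base = ajaxUrl.substring(0, i);
  var state = ajaxUrl.substring(i + 2);
  var separator = base.indexOf('?') === -1 ? '?' : '&';
  return base + separator + '_escaped_fragment_=' + encodeURIComponent(state);
}

function toUserUrl(crawledUrl) {
  // http://example.com/page?query&_escaped_fragment_=state -> http://example.com/page?query#!state
  var marker = '_escaped_fragment_=';
  var i = crawledUrl.indexOf(marker);
  if (i === -1) return crawledUrl;
  var state = decodeURIComponent(crawledUrl.substring(i + marker.length));
  var base = crawledUrl.substring(0, i - 1);           // drop the "?" or "&" that precedes the token
  return base + '#!' + state;
}

// Using the dictionary example above:
// toCrawlableUrl('http://example.com/dictionary.html#!AJAX')
//   -> 'http://example.com/dictionary.html?_escaped_fragment_=AJAX'
// toUserUrl('http://example.com/dictionary.html?_escaped_fragment_=AJAX')
//   -> 'http://example.com/dictionary.html#!AJAX'
```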



We're currently working on a proposal and a prototype implementation. Feedback is very welcome — please add your comments below or in our Webmaster Help Forum. Thank you for your interest in making the AJAX-based web accessible and useful through search engines!

Proposal by Katharina Probst, Bruce Johnson, Arup Mukherjee, Erik van der Poel and Li Xiao, Google
Blog post by John Mueller, Webmaster Trends Analyst, Google Zürich



Labels: crawling and indexing

Reunifying duplicate content on your website
Tuesday, October 06, 2009 at 3:14 PM
Handling duplicate content within your own website can be a big challenge. Websites grow; features get added, changed and removed; content comes—content goes. Over time, many websites collect systematic cruft in the form of multiple URLs that return the same contents. Having duplicate content on your website is generally not problematic, though it can make it harder for search engines to crawl and index the content. Also, PageRank and similar information found via incoming links can get diffused across pages we aren't currently recognizing as duplicates, potentially making your preferred version of the page rank lower in Google.

Steps for dealing with duplicate content within your website

1. Recognize duplicate content on your website.
The first and most important step is to recognize duplicate content on your website. A simple way to do this is to take a unique text snippet from a page and to search for it, limiting the results to pages from your own website by using a site: query in Google. Multiple results for the same content show duplication that you can investigate.

2. Determine your preferred URLs.
Before fixing duplicate content issues, you'll have to determine your preferred URL structure. Which URL would you prefer to use for that piece of content?

3. Be consistent within your website.
Once you've chosen your preferred URLs, make sure to use them in all possible locations within your website (including in your Sitemap file).

4. Apply 301 permanent redirects where necessary and possible.
If you can, redirect duplicate URLs to your preferred URLs using a 301 response code. This helps users and search engines find your preferred URLs should they visit the duplicate URLs. If your site is available on several domain names, pick one and use the 301 redirect appropriately from the others, making sure to forward to the right specific page, not just the root of the domain. If you support both www and non-www host names, pick one, use the preferred domain setting in Webmaster Tools, and redirect appropriately. A rough sketch of such a redirect appears after this list.

5. Implement the rel="canonical" link element on your pages where you can.
Where 301 redirects are not possible, the rel="canonical" link element can give us a better understanding of your site and of your preferred URLs. The use of this link element is also supported by major search engines such as Ask.com, Bing and Yahoo!.

6. Use the URL parameter handling tool in Google Webmaster Tools where possible.
If some or all of your website's duplicate content comes from URLs with query parameters, this tool can help you to notify us of important and irrelevant parameters within your URLs. More information about this tool can be found in our announcement blog post.
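
For step 4, here is a rough sketch of a host-level 301 redirect, assuming (purely for illustration) that www.example.com is the preferred host and that the server is written with Node.js; in practice you would more likely configure this in your web server or framework.

```javascript
// Rough sketch of step 4: 301-redirect requests on non-preferred host names to the
// same path on the preferred host. Host name and port are assumptions for the example.
var http = require('http');

var PREFERRED_HOST = 'www.example.com';

http.createServer(function (req, res) {
  var host = req.headers.host;
  if (host && host !== PREFERRED_HOST) {
    // Forward to the same specific page on the preferred host, not just the domain root.
    res.writeHead(301, { 'Location': 'http://' + PREFERRED_HOST + req.url });
    res.end();
    return;
  }
  // ...serve the page normally for the preferred host...
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Preferred URL: http://' + PREFERRED_HOST + req.url);
}).listen(8080);
```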


What about the robots.txt file?

One item which is missing from this list is disallowing crawling of duplicate content with your robots.txt file. We now recommend not blocking access to duplicate content on your website, whether with a robots.txt file or other methods. Instead, use the rel="canonical" link element, the URL parameter handling tool, or 301 redirects. If access to duplicate content is entirely blocked, search engines effectively have to treat those URLs as separate, unique pages since they cannot know that they're actually just different URLs for the same content. A better solution is to allow them to be crawled, but clearly mark them as duplicate using one of our recommended methods. If you allow us to crawl these URLs, Googlebot will learn rules to identify duplicates just by looking at the URL and should largely avoid unnecessary recrawls in any case. In cases where duplicate content still leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools.

We hope these methods will help you to master the duplicate content on your website! Information about duplicate content in general can also be found in our Help Center. Should you have any questions, feel free to join the discussion in our Webmaster Help Forum.

Posted by John Mueller, Webmaster Trends Analyst, Google Zürich



Labels: crawling and indexing

New parameter handling tool helps with duplicate content issues
Monday, October 05, 2009 at 12:33 PM
Duplicate content has been a hot topic among webmasters and on our blog for over three years. One of our first posts on the subject came out in December of '06, and our most recent post was last week. Over the past three years, we've been providing tools and tips to help webmasters control which URLs we crawl and index, including a) use of 301 redirects, b) the www vs. non-www preferred domain setting, c) the change of address option, and d) rel="canonical".

We're happy to announce another feature to assist with managing duplicate content: parameter handling. Parameter handling allows you to view which parameters Google believes should be ignored or not ignored at crawl time, and to override our suggestions if necessary.


Let's take our old example of a site selling Swedish fish. Imagine that your preferred version of the URL and its content looks like this:
http://www.example.com/product.php?item=swedish-fish

However, you may also serve the same content on different URLs depending on how the user navigates around your site, or your content management system may embed parameters such as sessionid:
http://www.example.com/product.php?item=swedish-fish&category=gummy-candy
http://www.example.com/product.php?item=swedish-fish&trackingid=1234&sessionid=5678

With the "Parameter Handling" setting, you can now provide suggestions to our crawler to ignore the parameters category, trackingid, and sessionid. If we take your suggestion into account, the net result will be a more efficient crawl of your site, and fewer duplicate URLs.

Since we launched the feature, here are some popular questions that have come up:

Are the suggestions provided a hint or a directive?
Your suggestions are considered hints. We'll do our best to take them into account; however, there may be cases where the provided suggestions would do more harm than good for a site.

When do I use parameter handling vs rel="canonical"?
rel="canonical" is a great tool to manage duplicate content issues, and has had huge adoption. The differences between the two options are:

* rel="canonical" has to be put on each page, whereas parameter handling is set at the host level
* rel="canonical" is respected by many search engines, whereas parameter handling suggestions are only provided to Google

Use whichever option works best for you; it's fine to use both if you want to be very thorough.

As always, your feedback on our new feature is appreciated.

Posted by Tanya Gupta and Ningning Zhu, Software Engineers



Google Friend Connect: No more FTP... just get started!
Friday, October 02, 2009 at 10:52 AM
Until today, you had to upload a file to your website to activate Google Friend Connect features and gadgets. Today, we're dramatically simplifying the Friend Connect setup process. To get started with Friend Connect features, all you have to do is submit your website's name and URL after logging into www.google.com/friendconnect.

To learn more about the recent updates to Google Friend Connect, check out our post on the Google Social Web Blog.

Posted by Mussie Shore, Product Manager, Google Friend Connect



Changes to website verification in Webmaster Tools
Thursday, October 01, 2009 at 10:02 AM
If you use Webmaster Tools, you're probably familiar with verifying ownership of your sites. Simply add a specific meta tag or file to your site, click a button, and you're a verified owner. We've recently made a few small improvements to the process that we think will make it easier and more reliable for you.

The first change is an improvement to the meta tag verification method. In the past, your verification meta tag was partially based on the email address of your Google Account. That meant that if you changed the email address in your account settings, your meta tags would also change (and you'd become unverified for any sites you had used the old tag on). We've created a new version of the verification meta tag which is unrelated to your email address. Once you verify with a new meta tag, you'll never become unverified by changing your email address.

We've also revamped the way we do verification by HTML file. Previously, if your website returned an HTTP status code other than 404 for non-existent URLs, you would be unable to use the file verification method. A properly configured web server will return 404 for non-existent URLs, but it turns out that a lot of sites have problems with this requirement. We've simplified the file verification process to eliminate the checks for non-existent URLs. Now, you just download the HTML file we provide and upload it to your site without modification. We'll check the contents of the file, and if they're correct, you're done.
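
As a rough sketch of the kind of check described, the snippet below fetches a verification file from a site and compares its contents against an expected value. The file name, token, and expected contents are placeholders; use the actual file that Webmaster Tools provides.

```javascript
// Rough sketch of a verification-file check (placeholders only; not Google's actual
// verification code). Fetch the file and confirm it is served with the expected contents.
var http = require('http');

function checkVerificationFile(host, fileName, expectedContents, callback) {
  http.get({ host: host, path: '/' + fileName }, function (res) {
    var body = '';
    res.setEncoding('utf8');
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      callback(res.statusCode === 200 && body.trim() === expectedContents.trim());
    });
  }).on('error', function () {
    callback(false);                                   // site unreachable or file missing
  });
}

// Hypothetical usage with placeholder values:
// checkVerificationFile('www.example.com', 'google1234abcd.html',
//     '(contents of the downloaded verification file)',
//     function (ok) { console.log(ok ? 'verified' : 'not verified'); });
```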



We hope these changes will make verification a little bit more pleasant. If you've already verified using the old methods, don't worry! Your existing verifications will continue to work. These changes only affect new verifications.

Some websites and software have features that help you verify ownership by adding the meta tag or file for you. They may need to be updated to work with the new methods. For example, Google Sites doesn't currently handle the new meta tag verification method correctly. We're aware of that problem and are working to fix it as soon as we can. If you discover other services that have similar problems, please work with their maintainer to resolve the issue. We're sorry if this causes any inconvenience.

This is just the first of several improvements we're working on for website verification. To give you a heads up, in a future update, we'll begin showing the email addresses of all verified owners of a given site to the other verified owners of that site. We think this will make it much easier to manage sites with multiple verified owners. However, if you're using an email address you wouldn't want the other owners of your site to see, now might be a good time to change it!

Posted by Sean Harding, Software Engineer



Translate your website with Google: Expand your audience globally
Wednesday, September 30, 2009 at 2:54 PM
(This has been cross-posted from the Official Google Blog)

How long would it take to translate all the world's web content into 50 languages? Even if all of the translators in the world worked around the clock, with the current growth rate of content being created online and the sheer amount of data on the web, it would take hundreds of years to make even a small dent.

Today, we're happy to announce a new website translator gadget powered by Google Translate that enables you to make your site's content available in 51 languages. Now, when people visit your page, if their language (as determined by their browser settings) is different than the language of your page, they'll be prompted to automatically translate the page into their own language. If the visitor's language is the same as the language of your page, no translation banner will appear.
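
The snippet below is not the gadget itself; it is just a sketch of the language check the gadget performs, comparing the visitor's browser language with the page's declared language and offering translation only when they differ. It assumes the page declares its language via the lang attribute on the html element.

```javascript
// Sketch of the language-mismatch check described above (not the actual gadget code).
function shouldOfferTranslation() {
  // Assumes the page declares its language; falls back to English here for illustration.
  var pageLang = (document.documentElement.lang || 'en').split('-')[0].toLowerCase();
  var userLang = (navigator.language || navigator.userLanguage || '').split('-')[0].toLowerCase();
  return userLang !== '' && userLang !== pageLang;
}

if (shouldOfferTranslation()) {
  // This is where the real gadget would display its translation banner.
  console.log('Offer to translate this page into the visitor\'s language.');
}
```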


After the visitor clicks the Translate button, the automatic translation is shown directly on your page.


It's easy to install — all you have to do is cut and paste a short snippet into your webpage to increase the global reach of your blog or website.


Automatic translation is convenient and helps people get a quick gist of the page. However, it's not a perfect substitute for the art of professional translation. Today happens to be International Translation Day, and we'd like to take the opportunity to celebrate the contributions of translators all over the world. These translators play an essential role in enabling global communication, and with the rapid growth and ease of access to digital content, the need for them is greater than ever. We hope that professional translators, along with translation tools such as Google Translator Toolkit and this Translate gadget, will continue to help make the world's content more accessible to everyone.

Posted by Jeff Chin, Product Manager, Google Translate



Labels: products and services

Using named anchors to identify sections on your pages
Friday, September 25, 2009 at 12:29 PM
We just announced a couple of new features on the Official Google Blog that enable users to get to the information they want faster. Both features provide additional links in the result block, which allow users to jump directly to parts of a larger page. This is useful when a user has a specific interest in mind that is almost entirely covered in a single section of a page. Now they can navigate directly to the relevant section instead of scrolling through the page looking for their information.

We generate these deep links completely algorithmically, based on page structure, so they could be displayed for any site (and of course money isn't involved in any way, so you can't pay to get these links). There are a few things you can do to increase the chances that they might appear on your pages. First, ensure that long, multi-topic pages on your site are well-structured and broken into distinct logical sections. Second, ensure that each section has an associated anchor with a descriptive name (for example, not just "Section 2.1"), and that your page includes a "table of contents" which links to the individual anchors. The new in-snippet links only appear for relevant queries, so you won't see them in the results all the time, only when we think that a link to a section would be highly useful for a particular query.
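
As a sketch of the pattern described above, the snippet below builds a "table of contents" from descriptively named anchors. In practice you would normally include the anchors and the table of contents directly in your static HTML so that crawlers can see them; the heading element and the "toc" container id here are assumptions for the example.

```javascript
// Sketch: build a table of contents that links to named section anchors.
// Assumes sections are marked up as <h2 id="descriptive-name"> headings and that
// an empty <div id="toc"></div> exists near the top of the page.
function buildTableOfContents(containerId) {
  var toc = document.getElementById(containerId);
  var headings = document.getElementsByTagName('h2');
  var list = document.createElement('ul');
  for (var i = 0; i < headings.length; i++) {
    if (!headings[i].id) continue;                     // only link to named anchors
    var link = document.createElement('a');
    link.href = '#' + headings[i].id;                  // e.g. "#return-policy", not "#section-2-1"
    link.appendChild(document.createTextNode(headings[i].textContent || headings[i].innerText));
    var item = document.createElement('li');
    item.appendChild(link);
    list.appendChild(item);
  }
  if (toc) toc.appendChild(list);
}

// Call once the document has loaded, e.g. buildTableOfContents('toc');
```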

Posted by Raj Krishnan, Snippets Team


