Friday, April 12, 2013

SEO AJAX / HTML5 App _escaped_fragment_ best practices

Escaped Fragment URI Analysis

Due to a little bit of miscommunication between developers, and some code being released late to production, we regrettably ended up with three diverging URLs across our content body, canonical URLs and sitemap.  The spiders crawled all three for a period of two weeks, and we had the opportunity to go back and see what happened.

Sample size: 240,914 URLs crawled over 14 days (2013/03/30 - 2013/04/12). Here are the respective dates and number of URLs crawled:

2013/03/30 - 16,375
2013/03/31 - 18,452
2013/04/01 - 17,865
2013/04/02 - 19,431
2013/04/03 - 15,443
2013/04/04 - 4,832
2013/04/05 - 21,053
2013/04/06 - 24,316
2013/04/07 - 27,477
2013/04/08 - 23,862
2013/04/09 - 22,050
2013/04/10 - 15,488
2013/04/11 - 6,396
2013/04/12 - 7,874


We will break down the number of URLs spidered, and also discuss the encoding of the _escaped_fragment_, which is outlined by Google but also appears to have been adopted by Bing, Yandex, Yahoo, Facebook and others.


Canonical URL Results (82%)
Example: http://www.domain.com/something#!v=1

URLs: 5,858
Yandex (199.21.99.97) & Facebook (69.171.229.116) will fetch:
http://www.domain.com/something?_escaped_fragment_=v%3D1

URLs: 192,774
Google & Bing do not escape the v=1 after the #!, so they will request:
http://www.domain.com/something?_escaped_fragment_=v=1
(this is because they, incorrectly, assume we will escape the canonical URL in the URI)
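
To make the difference between the two request forms concrete, here is a minimal Python sketch that builds both from a #! URL. The function name escaped_fragment_url and the escape flag are hypothetical names made up for this illustration; this is not what any crawler actually runs, just a reproduction of the two URL shapes we saw in the logs.

```python
from urllib.parse import quote


def escaped_fragment_url(hashbang_url: str, escape: bool = True) -> str:
    """Build the _escaped_fragment_ request URL for a #! URL.

    escape=True percent-encodes the whole fragment (the form fetched by
    Yandex and Facebook above); escape=False passes the fragment through
    untouched (the form Google and Bing requested for the canonical URL).
    """
    base, sep, fragment = hashbang_url.partition("#!")
    if not sep:
        # Not a hashbang URL, nothing to rewrite.
        return hashbang_url
    value = quote(fragment, safe="") if escape else fragment
    # Append to an existing query string if the base already has one.
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}_escaped_fragment_={value}"


url = "http://www.domain.com/something#!v=1"
print(escaped_fragment_url(url))                # escaped form: ...=v%3D1
print(escaped_fragment_url(url, escape=False))  # raw form:     ...=v=1
```

A server that wants to serve snapshots for both groups of crawlers has to accept either form for the same page.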

Sitemap URL (0.53%)
Example: http://www.domain.com/something#!sitemap
Requested URL: http://www.domain.com/something?_escaped_fragment_=sitemap

1,284 links crawled

NOTE: we found this exceptionally low. The sitemap had thousands of files and is clearly not being used effectively (when compared to content & body): of all URLs crawled, fewer than 1% were retrieved from the sitemap.  Google (and other search engines) clearly prefer organic content.

Content URL (17%)
Example: http://www.domain.com/something#!pagetype?key1=value1&key2=value2
Requested URL: http://www.domain.com/something?_escaped_fragment_=pagetype?key1=value1&key2=value2
40,998 total links crawled
40,913 by Google
85 by everybody else
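
As a sanity check on the percentages, the three buckets above can be added up and compared against the sample size. This is a small Python sketch using only the counts reported in this post; the variable names are made up for the illustration.

```python
# Crawl counts per URL type, as reported in this post.
counts = {
    "canonical": 5858 + 192774,  # escaped + unescaped canonical requests
    "sitemap": 1284,
    "content": 40998,
}

total = sum(counts.values())

# Share of each URL type as a percentage of all crawled URLs.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} URLs ({pct:.2f}%)")
print(f"total: {total}")
```

The total comes out to the 240,914 URLs of the sample, so the three buckets account for every crawled URL.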

In Canonical and Content URLs, Google encodes *SOME* special characters before requesting the _escaped_fragment_, but it did not work how we expected.  Specifically, GoogleBot leaves characters such as ., ? and = alone, and encodes other characters including the ampersand (seriously wtf!). For example:
http://www.domain.com/something#!pagetype?key1=value 1&key2=value-2
would be requested as:
http://www.domain.com/something?_escaped_fragment_=pagetype?key1=value%201%26key2=value%2D2
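
Since different crawlers encode the fragment differently, the server side is easier to keep sane if it decodes whatever arrives instead of expecting one exact form. Below is a minimal Python sketch of that idea, assuming the raw request URL is available as a string. The function name restore_hashbang is hypothetical, and it deliberately avoids a regular query-string parser because an unencoded & or = inside the fragment would otherwise be split apart.

```python
from urllib.parse import urlsplit, unquote

MARKER = "_escaped_fragment_="


def restore_hashbang(requested_url: str) -> str:
    """Map an _escaped_fragment_ request back to the original #! URL.

    Works for fully escaped, partially escaped and unescaped fragments,
    because everything after the marker is percent-decoded in one pass.
    """
    parts = urlsplit(requested_url)
    query = parts.query
    idx = query.find(MARKER)
    if idx == -1:
        # Not a crawler snapshot request, leave it alone.
        return requested_url
    # Everything after the marker belongs to the fragment.
    fragment = unquote(query[idx + len(MARKER):])
    # Keep any ordinary query parameters that came before the marker.
    prefix = query[:idx].rstrip("&")
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if prefix:
        base += "?" + prefix
    return f"{base}#!{fragment}"


print(restore_hashbang(
    "http://www.domain.com/something?_escaped_fragment_=v%3D1"))
print(restore_hashbang(
    "http://www.domain.com/something?_escaped_fragment_=v=1"))
```

Both calls map back to the same #! URL, which is the point: the escaped and the unescaped request should end up serving the same page.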


Also of interest was the fact that GoogleBot's IP address was _clearly_ and undeniably accessing API functions which return JSON (more on this later).

