Posts filed under ‘Misc’

Are AdSense publishers being favored with more frequent indexing?

Today I was going to address some of the comments Stu Drew left about how he managed to get a high ranking for his blog entry on private-label rights articles, but I’m going to defer that to a later time. If you’re interested in that topic, let me point you to an article I’ve written about the so-called “Google Sandbox” that should address some of the questions: Redcowl Bluesingksy: Why the Google Sandbox Doesn’t Exist.

I want to talk some more about Google’s indexing of AdSense pages. In case you hadn’t heard, Googler Matt Cutts confirmed that the AdSense crawler is feeding pages into Google’s new “BigDaddy” search indexes. This confirms what others had noticed about what the AdSense crawler (usually referred to as the “mediabot”) is doing. Or does it?

As always, there are different ways to look at what’s happening. We know that pages crawled by the mediabot are now making their way into the Google search index. What we don’t know, however, is whether those pages are being pushed or pulled into the index. Let me explain.

Let’s think of the innards of the Google search engine as a bunch of black boxes. (Disclaimer: I have no special knowledge of how things actually work internally.) For our purposes, we’re only concerned with three of those boxes:

  1. The manager maintains a list of URLs and decides when each one needs to be indexed
  2. The crawler (this is the Googlebot) goes out and fetches pages for indexing
  3. The indexer takes crawled pages and indexes and ranks them using proprietary algorithms

At some point, the manager decides that a given URL needs to be recrawled. It decides this based on age, Google Sitemaps, PageRank, whatever. No one disputes that different sites get crawled at different frequencies, and the manager is the one making those decisions. So it tells the crawler to fetch the page. The fetch may not happen right away, but once it’s done the crawler tells the manager the page has been fetched, and the manager then passes the page to the indexer for processing.
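
To make the flow concrete, here is a minimal sketch of those three boxes in Python. Every name in it is hypothetical; it only mirrors the flow described above, not anything Google actually runs.

    import time

    class Crawler:
        """The Googlebot of the sketch: fetches page contents on request."""
        def fetch(self, url):
            # A real crawler would issue an HTTP GET here.
            return f"<html>contents of {url}</html>"

    class Indexer:
        """Takes crawled pages and (pretend-)indexes them."""
        def index(self, url, page):
            print(f"indexing {url} ({len(page)} bytes)")

    class Manager:
        """Keeps the URL list and decides when each one needs a recrawl."""
        def __init__(self, crawler, indexer):
            self.crawler = crawler
            self.indexer = indexer
            self.last_crawled = {}  # url -> timestamp of the last crawl

        def needs_recrawl(self, url, max_age=86400):
            # Age-based here; the real decision also weighs PageRank, Sitemaps, etc.
            return time.time() - self.last_crawled.get(url, 0) > max_age

        def run(self, urls):
            for url in urls:
                if self.needs_recrawl(url):
                    page = self.crawler.fetch(url)     # manager asks, crawler fetches
                    self.last_crawled[url] = time.time()
                    self.indexer.index(url, page)      # page handed to the indexer

    manager = Manager(Crawler(), Indexer())
    manager.run(["http://www.example.com/a", "http://www.example.com/b"])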

Now throw the AdSense crawler into the mix and see what happens. The case that concerns the SEO community is the one where the mediabot pushes its pages directly to the indexer, bypassing the manager’s controls. In that scenario, changes to AdSense pages can potentially be noticed much more quickly than they would be through the normal crawling process, giving those pages an unfair advantage. In this “push” model, the AdSense crawler effectively acts as a secondary manager.

The “pull” model, on the other hand, only affects the crawler. When the manager asks the crawler to get the contents of a given URL, the crawler first checks with the mediabot to see if the latter has crawled the page recently, where “recently” can be any reasonable length of time, say 24 hours. If it has, the crawler just returns a copy of what the mediabot saw instead of going out to fetch the page contents again. The manager is still in control in this scenario — only it decides when a page is to be crawled.
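
Continuing that hypothetical sketch, the difference between the two models might look something like this: in the “push” model the mediabot hands its copy straight to the indexer, while in the “pull” model the regular crawler merely reuses the mediabot’s recent fetch, and only after the manager has decided a crawl is due. The cache class, the function names, and the 24-hour window are all assumptions for illustration.

    import time

    class MediabotCache:
        """Pages the AdSense crawler (mediabot) has fetched recently (hypothetical)."""
        def __init__(self):
            self.pages = {}  # url -> (timestamp, contents)

        def store(self, url, contents):
            self.pages[url] = (time.time(), contents)

        def recent(self, url, max_age=86400):
            # "Recently" is taken to be 24 hours here, as in the text.
            entry = self.pages.get(url)
            if entry and time.time() - entry[0] < max_age:
                return entry[1]
            return None

    def push_model(cache, index, url, contents):
        # The mediabot bypasses the manager and pushes straight to the indexer.
        cache.store(url, contents)
        index(url, contents)

    def pull_model_fetch(cache, url, http_get):
        # Called only after the manager has decided the URL is due for a crawl.
        cached = cache.recent(url)
        if cached is not None:
            return cached            # reuse the mediabot's copy, skip the fetch
        return http_get(url)         # otherwise fetch the page ourselves

    cache = MediabotCache()
    push_model(cache, index=lambda u, page: print("indexed", u),
               url="http://example.com/", contents="<html>v2</html>")
    page = pull_model_fetch(cache, "http://example.com/",
                            http_get=lambda u: f"fetched {u}")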

What I’ve been assuming is that Google is using the pull model, not the push model. Others are assuming the reverse (and the worst), hence the controversy. We need someone from Google to clarify this issue for us…

Originally from An AdSense Blog: Make Easy Money with Google on April 19, 2006, 11:11am



June 21, 2006 at 6:15 am

How gzip encoding reduces bandwidth

Yesterday, Matt Cutts posted more details about the caching that Google’s crawlers are now doing, further clarifying the whole AdSense push vs. pull issue. One of the things he mentioned was how webmasters can turn on “gzip encoding” to save even more bandwidth. Since not everyone reading this is a webmaster, I thought I’d explain what he meant in more detail.

HTTP Headers

As you know, the HTTP protocol is what a web browser uses to communicate with a web server. The browser (a type of web client or user agent) always initiates the conversation by sending the server a request for a URL. In other words, if you type http://www.memwg.com/blog/adsense into your browser to read this blog, the browser sends a request (technically, a “GET” request) to the server at www.memwg.com for the content located at the path /blog/adsense.

However, a bunch of other information gets sent along with the request: the type of browser being used, the user’s preferred languages, the underlying operating system type, what kind of image formats are accepted, etc. (See Masquerading Your Browser for information on how to alter or hide some of this information.) This information is attached to the request as a set of headers, basically name-value pairs of data. You can use my free HTTP header viewer tool to see what headers your browser is sending right now.
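
As a rough illustration, here is that same request made explicitly from Python’s http.client; the header values are typical stand-ins, not what any particular browser actually sends.

    import http.client

    # Hypothetical header values; a real browser fills these in itself.
    conn = http.client.HTTPConnection("www.memwg.com")
    conn.request("GET", "/blog/adsense", headers={
        "Host": "www.memwg.com",
        "User-Agent": "ExampleBrowser/1.0 (Windows NT 5.1)",  # browser type and OS
        "Accept": "text/html,image/png,image/gif",            # accepted formats
        "Accept-Language": "en-us,en",                         # preferred languages
    })
    response = conn.getresponse()
    print(response.status, response.reason)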

Content Encoding

Normally, any data requested by the client is sent by the web server byte-for-byte down the pipe. If you request a web page that is 10,320 bytes long, the web server sends the entire 10,320 bytes to the client. In other words, the data is sent in its “raw” or “natural” form.

One of the headers that a client can send is the Accept-Encoding header, which tells the web server that the client can receive compressed data as an alternative. If the server so chooses, it selects one of the encodings that the client supports (the client sends a list of supported encodings) and compresses the data with the selected encoding algorithm. Instead of sending a 10,320 byte document in the example above, it might end up sending a 4,567 byte long document — a significant savings. (The amount of compression depends on the algorithm being used and the data being compressed. Typically, HTML pages become much smaller.)

When the server encodes data like this, it’s the client’s job to decode it on the other end back into its raw form. The server actually sends headers back to the client as part of the response, and one of those, the Content-Encoding header, indicates which algorithm it used for the encoding. The client can then decode the data by selecting the appropriate algorithm.
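
Here is a small Python sketch of the server side of that negotiation, using the standard gzip module; the header handling is deliberately simplistic and the sample body is made up.

    import gzip

    def encode_response(body: bytes, accept_encoding: str):
        """Return (headers, body), gzip-compressing the body if the client allows it."""
        headers = {"Content-Type": "text/html"}
        if "gzip" in accept_encoding.lower():
            body = gzip.compress(body)
            headers["Content-Encoding"] = "gzip"  # tell the client how to decode
        headers["Content-Length"] = str(len(body))
        return headers, body

    raw = b"<html>" + b"hello world " * 800 + b"</html>"
    headers, out = encode_response(raw, "gzip, deflate")
    print(len(raw), "->", len(out), "bytes;", headers.get("Content-Encoding", "no encoding"))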

GZIP Encoding

On Unix/Linux machines, the gzip application is used to compress and decompress data. But the term “gzip” or “GZIP” is also used as shorthand for the compression/decompression algorithm used by the gzip application. So when you hear someone refer to “gzip encoding”, they’re talking about data that is encoded by the same algorithm used by the gzip application.

A web browser that understands gzip encoding sends an Accept-Encoding header that looks like this:

Accept-Encoding: gzip

The web server encodes the data using the gzip algorithm and sends back the appropriate Content-Encoding header:

Content-Encoding: gzip

The browser then uses the gzip decoding algorithm to return the data to its normal, uncompressed form.
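
The client side of the exchange can be sketched in a few lines of Python; example.com stands in for any gzip-capable server, and a real browser does the equivalent internally.

    import gzip
    import urllib.request

    req = urllib.request.Request("http://example.com/",
                                 headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)  # back to its raw, uncompressed form
    print(len(body), "bytes after decoding")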

Why GZIP Encoding Helps

The idea behind gzip encoding is to reduce the amount of data being transferred over the network. In the example above, the size of the document was reduced by more than half. Not only does the data transfer more quickly, you also get charged less for sending it — in general, the less bandwidth you use, the less you pay.

There are downsides to gzip encoding, though. Any data compression takes time and processing cycles, so a heavily used web server may find itself slowed down even more if gzip encoding is enabled. And not all data types compress well — images and other already-compressed files often end up bigger — so the server shouldn’t automatically compress everything, even if the client requests it. And some older clients have bugs in their decoding algorithms.
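
A quick experiment illustrates the point about data types: repetitive HTML-like text shrinks dramatically, while data that is already compressed (random bytes stand in here for something like a JPEG) can actually grow slightly.

    import gzip
    import os

    html = b"<p>This paragraph repeats itself over and over. </p>" * 300
    image_like = os.urandom(len(html))  # random bytes: a stand-in for a JPEG

    for label, data in [("html", html), ("image-like", image_like)]:
        packed = gzip.compress(data)
        print(f"{label}: {len(data)} -> {len(packed)} bytes")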

Note that gzip encoding is not limited to web browsers; web crawlers can use it as well. Browsers and crawlers look the same to a web server; they just send different headers. Matt indicated that Google has now enabled gzip encoding in all of its crawlers. So if you’re finding that crawlers are hitting your site heavily and using up your precious bandwidth, make sure gzip encoding is enabled in your web server — it could make a big difference.

Originally from An AdSense Blog: Make Easy Money with Google on April 24, 2006, 10:31am


June 21, 2006 at 6:14 am

