A very useful discussion on whether noindex, nofollow, and robots.txt'ed pages pass PR
2007-10-10 23:22

Interview Transcript

Eric Enge: Let's talk about different kinds of link encoding that people do, such as links that go through JavaScript or some sort of redirect to link to someone, yet the link actually does represent an endorsement. Can you say anything about the scenarios in which the link is actually still recognized as a link?

Matt Cutts: A direct link is always the simplest, so if you can manage to do a direct link that's always very helpful. There was an interesting proposal recently by somebody who works on Firefox or for Mozilla, I think, which was the idea of a ping attribute, where the link can still be direct, but the ping could be used for tracking purposes. So, something like that could certainly be promising, because it lets you keep the direct nature of a link while still sending a signal to someone. In general, Google does a relatively good job of following 301s and 302s, and even Meta Refreshes and JavaScript. Typically, what we don't do is follow a chain of redirects that goes through a URL that is itself forbidden by robots.txt.

Eric Enge: Right.

Matt Cutts: I think in many cases we can calculate the proper or appropriate amount of PageRank, or Link Juice, or whatever you want to call it, that should flow through such links.

Eric Enge: Right. So, you do try to track that and provide credit.

Matt Cutts: Yes.

Eric Enge: Right. Let's talk a bit about the various uses of NoIndex, NoFollow, and Robots.txt. They all have their own little differences. Let's review these with respect to three things: (1) whether it stops the passing of link juice; (2) whether or not the page is still crawled; and (3) whether or not it keeps the affected page out of the index.

Matt Cutts: I will start with robots.txt, because that's the fundamental method of putting up an electronic no trespassing sign that people have used since 1996. Robots.txt is interesting, because you can easily tell any search engine to not crawl a particular directory, or even a page, and many search engines support variants such as wildcards, so you can say don't crawl *.gif, and we won't crawl any GIFs for our image crawl.
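Editor's note: the wildcard idea can be illustrated with a minimal Python sketch. It assumes the commonly described extension syntax in which `*` matches any run of characters and a trailing `$` anchors the end of the URL path. The helper names `rule_to_regex` and `is_blocked` are hypothetical, and this is not how any search engine actually implements matching.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt-style path rule into a regular expression.

    Assumed syntax: '*' matches any sequence of characters and a
    trailing '$' anchors the rule to the end of the path.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape every literal piece and join the pieces with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))

def is_blocked(path: str, disallow_rules: list) -> bool:
    """Return True if any Disallow rule matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

# Hypothetical rule: block every path that ends in .gif
rules = ["/*.gif$"]
print(is_blocked("/images/cat.gif", rules))
print(is_blocked("/index.html", rules))
```

Rules without a wildcard fall back to plain prefix matching, which is the behavior of the original robots.txt convention.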

We even have additional standards such as Sitemap Support, so you can say here's a link to where my Sitemap can be found. I believe the only robots.txt extension in common use that Google doesn't support is crawl-delay. And the reason that Google doesn't support crawl-delay is that way too many people accidentally mess it up. For example, they set crawl-delay to a hundred thousand, and that means you get to crawl one page every other day or something like that.

We have even seen people who set a crawl-delay such that we'd only be allowed to crawl one page per month. What we have done instead is provide throttling ability within Webmaster Central, but crawl-delay is the inverse; it's saying crawl me once every "n" seconds. In fact, what you really want is host-load, which lets you define how many Googlebots are allowed to crawl your site at once. So, a host-load of two would mean two Googlebots are allowed to be crawling the site at once.
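Editor's note: the arithmetic behind the crawl-delay complaint is easy to check. The sketch below assumes crawl-delay means "wait n seconds between requests" for a single crawler, as described above. The function name `pages_per_day` is hypothetical. A delay of 100,000 seconds is a little over one day, so it allows less than one page per day, which is in the same ballpark as the figure quoted in the interview.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def pages_per_day(crawl_delay_seconds: float) -> float:
    """Upper bound on pages one crawler can fetch per day when it waits
    crawl_delay_seconds between requests (assumed interpretation)."""
    if crawl_delay_seconds <= 0:
        raise ValueError("crawl delay must be positive")
    return SECONDS_PER_DAY / crawl_delay_seconds

# Compare a sensible delay, the 100,000-second case, and a 30-day delay.
for delay in (10, 100_000, 30 * SECONDS_PER_DAY):
    print(f"crawl-delay {delay} s -> at most {pages_per_day(delay):.3f} pages/day")
```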

Now, robots.txt says you are not allowed to crawl a page, and Google therefore does not crawl pages that are forbidden in robots.txt. However, they can accrue PageRank, and they can be returned in our search results.
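Editor's note: a minimal sketch of the "no trespassing sign" check, using Python's standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the user agent name `ExampleBot` are made up for illustration. The check only decides whether a URL may be fetched; as noted above, it says nothing about whether the URL can accrue PageRank or appear in results.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids one directory for every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the parsed robots.txt allows user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

print(may_crawl("http://example.com/private/report.html"))
print(may_crawl("http://example.com/public/index.html"))
```

To the editor's knowledge the standard-library parser does plain prefix matching, so it does not model the wildcard extensions mentioned earlier.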

In the early days, lots of very popular websites didn't want to be crawled at all. For example, eBay and the New York Times did not allow any search engine, or at least not Google, to crawl any pages from them. The Library of Congress had various sections that said you are not allowed to crawl with a search engine. And so, when someone came to Google and typed in eBay, and we hadn't crawled eBay, and we couldn't return eBay, we looked kind of suboptimal. So, the compromise that we decided to come up with was that we wouldn't crawl you because of robots.txt, but we could return the URL reference that we saw.

Eric Enge: Based on the links from other sites to those pages.

Matt Cutts: Exactly. So, we would return the un-crawled reference to eBay.

Eric Enge: The classic way that shows up is that you just list the URL, with no description, and that would be the entry that you see in the index, right?

Matt Cutts: Exactly. The funny thing is that we could sometimes rely on the ODP description (Editor: also known as DMOZ). And so, even without crawling, we could return a reference that looked so good that people thought we crawled it, and so that caused a little bit of early confusion. So, robots.txt was one of the longest-standing standards. Whereas for Google, NoIndex means we won't even show it in our search results.

So, with robots.txt, for good reasons we've shown the reference even if we can't crawl it, whereas if we crawl a page and find a Meta tag that says NoIndex, we won't even return that page. For better or for worse that's the decision that we've made. I believe Yahoo and Microsoft might handle NoIndex slightly differently, which is a little unfortunate, but everybody gets to choose how they want to handle different tags.

Eric Enge: Can a NoIndex page accumulate PageRank?

Matt Cutts: A NoIndex page can accumulate PageRank, because the links are still followed outwards from a NoIndex page.

Eric Enge: So, it can accumulate and pass PageRank.

Matt Cutts: Right, and it will still accumulate PageRank, but it won't be showing in our Index. So, I wouldn't make a NoIndex page that itself is a dead end. You can make a NoIndex page that has links to lots of other pages.

For example you might want to have a master Sitemap page and for whatever reason NoIndex that, but then have links to all your sub Sitemaps.

Eric Enge: Another example is if you have pages on a site with content that, from a user point of view, you recognize is valuable to have, but you feel it is too duplicative of content on another page on the site.

That page might still get links, but you don't want it in the Index and you want the crawler to follow the paths into the rest of the site.

Matt Cutts: That's right. Another good example is, maybe you have a login page, and everybody ends up linking to that login page. That provides very little content value, so you could NoIndex that page, but then the outgoing links would still have PageRank.

Now, if you want to you can also add a NoFollow metatag, and that will say don't show this page at all in Google's Index, and don't follow any outgoing links, and no PageRank flows from that page. We really think of these things as trying to provide as many opportunities as possible to sculpt where you want your PageRank to flow, or where you want Googlebot to spend more time and attention.
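Editor's note: the effect described here can be illustrated with a toy model. The sketch below uses the simplified textbook PageRank iteration with a damping factor of 0.85, not Google's production algorithm. The four-page graph and the page names are hypothetical. A page with no followed outgoing links is treated as a dead end whose score is spread evenly over all pages, which is one common modeling assumption.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration on a dict: page -> list of targets."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Score held by dead-end pages is redistributed uniformly (assumption).
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1 - damping) / n + damping * dangling / n for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

# Case 1: the login page is NoIndex only, so its outgoing links still count.
followed = {"home": ["login"], "login": ["a", "b"], "a": ["home"], "b": ["home"]}
# Case 2: the login page also carries NoFollow, so its outgoing links are dropped.
nofollowed = dict(followed, login=[])

rank_followed = pagerank(followed)
rank_nofollowed = pagerank(nofollowed)
print("page a, login links followed:", round(rank_followed["a"], 4))
print("page a, login page NoFollow: ", round(rank_nofollowed["a"], 4))
```

In this toy graph the login page keeps a substantial score in both cases, but page a scores lower once the login page's links stop being followed, which matches the advice above not to turn a heavily linked page into a dead end.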

Eric Enge: Does the NoFollow metatag imply a NoIndex on a page?

Matt Cutts: No. The NoIndex and NoFollow metatags are independent. The NoIndex metatag, for Google at least, means don't show this page in Google's index. The NoFollow metatag means don't follow the outgoing links on this entire page.
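Editor's note: the independence of the two directives can be shown with a short sketch built on Python's standard `html.parser` module. It reads a page's robots meta tag and reports two separate flags. The class name `RobotsMetaParser`, the function `robots_flags`, and the flag names are hypothetical, and the interpretation of the flags follows the Google behavior described in this interview.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                token.strip().lower() for token in content.split(",") if token.strip()
            )

def robots_flags(html: str) -> dict:
    """Return two independent flags: show the page, and follow its links."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "show_in_index": "noindex" not in parser.directives,
        "follow_links": "nofollow" not in parser.directives,
    }

print(robots_flags('<meta name="robots" content="noindex">'))
print(robots_flags('<meta name="robots" content="nofollow">'))
print(robots_flags('<meta name="robots" content="noindex, nofollow">'))
```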

Eric Enge: How about page A links to page B, and page A has a NoFollow metatag, or the link to page B has a NoFollow on the link. Will page B still be crawled?

Matt Cutts: It won't be crawled because of the links found on page A. But if some other page on the web links to page B, then we might discover page B via those other links.

Eric Enge: Right. So there are two levels of NoFollow. There is the attribute on a link, and then there is the metatag, right?

Matt Cutts: Exactly.

Eric Enge: What we've been doing is working with clients and telling them to take pages like their about us page, and their contact us page, and link to them from the Homepage with a NoFollow attribute, and then link to them using NoFollow from every other page. It's just a way of lowering the amount of link juice they get. These types of pages are usually the highest PageRank pages on the site, and they are not doing anything for you in terms of search traffic.

Matt Cutts: Absolutely. So, we really conceive of NoFollow as a pretty general mechanism. The name, NoFollow, is meant to mirror the fact that it's also a metatag. As a metatag NoFollow means don't crawl any links from this entire page.

NoFollow as an individual link attribute means don't follow this particular link, and so it really just extends that granularity down to the link level.
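Editor's note: the two levels of granularity can be sketched as a link extractor that honors both the page-level metatag and the link-level `rel="nofollow"` attribute. It uses Python's standard `html.parser` module. The names `LinkCollector` and `followable_links` and the sample markup are hypothetical. This is an illustration of the rule as stated here, not a description of Googlebot's code.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Record every link plus whether NoFollow applies at page or link level."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.links = []  # list of (href, has_rel_nofollow) pairs

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            tokens = {t.strip().lower() for t in (attr.get("content") or "").split(",")}
            if "nofollow" in tokens:
                self.page_nofollow = True
        elif tag == "a" and attr.get("href"):
            rel_tokens = (attr.get("rel") or "").lower().split()
            self.links.append((attr["href"], "nofollow" in rel_tokens))

def followable_links(html: str) -> list:
    """Return the hrefs a crawler honoring both kinds of NoFollow would follow."""
    collector = LinkCollector()
    collector.feed(html)
    if collector.page_nofollow:
        return []  # the metatag switches off every link on the page
    return [href for href, nofollow in collector.links if not nofollow]

page = '<a href="/products">Products</a> <a rel="nofollow" href="/about">About us</a>'
print(followable_links(page))
```

As the interview notes, a link skipped here can still be discovered through followed links on other pages.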

We did an interview with Rand Fishkin over at SEOmoz where we talked about the fact that NoFollow was a perfectly acceptable tool to use in addition to robots.txt. NoIndex and NoFollow as a metatag can change how Googlebot crawls your site. It's important to realize that typically these things are more of a second order effect. What matters the most is to have a great site and to make sure that people know about it, but, once you have a certain amount of PageRank, these tools let you choose how to develop PageRank amongst your pages.

Eric Enge: Right. Another example scenario might be if you have a site and discover that you have a massive duplicate content problem. A lot of people discover that because something bad happened. They want to act very promptly, so they might NoIndex those pages, because that will get it out of the index removing the duplicate content. Then, after it's out of the index, you can either just leave in the NoIndex, or you can go back to robots.txt to prevent the pages from being crawled. Does that make sense in terms of thinking about it?

Matt Cutts: That's at the level where I'd encourage people to try experiments and see what works best for them, because we do provide a lot of ways to remove content. There's robots.txt.

Eric Enge: Sure. You can also use the URL removal tool.

Matt Cutts: The URL removal tool is another way to do it. Typically, what I would probably recommend most people do, instead of going the NoIndex route, is to make sure that all their links point to the version of the page that they think is the most important. So, if they have got two copies, you can look at the back links within our Webmaster Central, or use Yahoo, or any other tools to explore it, and say what are the back links to this particular page, why would this page be showing up as a duplicate of this other page? All the back links that are on your own page are very easy to switch over to the preferred page. So, that's a very short term thing that you can do, and that only usually takes a few days to go into effect. Of course, if it's some really deep URL, they could certainly try the experiment with NoIndex. I would probably lean toward using optimum routing of links as the first line of defense, and then if that doesn't solve it, look at or consider using NoIndex.

Eric Enge: Let's talk about non-link based algorithms. What are some of the things that you can use as signals that aren't links to help with relevance and search quality? Also, can you give any indication about such signals that you are already using?

Matt Cutts: I would certainly say that the links are the primary way that we look at things now in terms of reputation. The trouble with something like other ways of measuring reputation is that the data might be sparse. Imagine for example that you decided to look at all the people that are in various yellow page directories, across the web, for the list of their address, or stuff like that. The problem is, even a relatively savvy business with multiple locations might not think to list all their business addresses.

A lot of these signals that we look at to determine quality or to help to determine reputation can be noisy. I would convey Google's basic position as that we are open to any signals that could potentially improve quality. If someone walked up to me and said, the phase of the moon correlates very well with the site being high quality, I wouldn't rule it out, I wouldn't take it off the table, I would do the analysis and look at it.

Eric Enge: And, there would be SEOs out there trying to steer the course of the moon.

Matt Cutts: It's funny, because if you remember Webmaster World used to track updates on the Google Dance, and they had a chart, because it was roughly on a 30 day schedule. When a full moon came around people started to look for the Google Dance to happen.

In any event, the trouble is any potential signal could be sparse, or could be noisy, and so you have to be very careful about considering signal quality.

Eric Enge: Right. So, an example of a noisy signal might be the number of Gadgets installed from a particular site onto people's iGoogle homepage.

Matt Cutts: I could certainly imagine someone trying to spam that signal, creating a bunch of accounts, and then installing a bunch of their own Gadgets or something like that. I am sad to say you do have to step into that adversarial analysis phase, where you ask "okay, how would someone abuse this?" anytime you are thinking about some new network signal.

Eric Enge: Or bounce rate is another thing that you could look at. For example, someone does a search and went to a site, and then they are almost immediately back at the Google search results page clicking on a different link, or doing a very similar search. You could use that as a signal potentially.

Matt Cutts: In theory. We typically don't confirm or deny whether we'd use any given signal. It is a tough problem, because something that works really well in one language might not work as well in another language.

Eric Enge: Right. One of the problems with bounce rate is that the web is moving so much more towards "just give them the answer now." For example, if you have a Gadget, you want the answer in the Gadget. If you use subscribed links, you want the answer in the subscribed links. When you get someone to your site, there is something to be said for giving them the answer they are looking for immediately, and they might see it and immediately leave (and you get the branding / relationship benefit of that).

In this case, it's actually a positive quality signal rather than a negative quality signal.

Matt Cutts: Right. You could take it even further and help people get the answer directly from a snippet on the search engine results page, so that they don't click on the link at all. There are also a lot of weird corner cases you have to consider anytime you are thinking about a new way to try to measure quality.

Eric Enge: Right, indeed. What about the toolbar data, and Google analytics data?

Matt Cutts: Well, I have made a promise that my Webspam team wouldn't go to the Google Analytics group and get their data and use it. Search quality or other parts of Google might use it, but certainly my group does not. I have talked before about how data from the Google toolbar could be pretty noisy as well. You can see an example of how noisy this is by installing Alexa. If you do, you see a definite skew towards Webmaster sites. I know that my site does not get as much traffic as many other sites, and it might register higher on Alexa because of this bias.

Eric Enge: Right. A site owner could start prompting people to install the Google toolbar whenever they come to their site.

Matt Cutts: Right. Are you sure you don't want to install a Google toolbar, Alexa, and why not throw in Compete and Quantcast? I am sure Webmasters are a little savvier about that than the vast majority of sites. So, it's interesting to see that there is usually a Webmaster bias or SEO bias with many of these usage-based tools.

(Continued in the next post)
