public interface: Spidering Internal Pages

A url within an anchor tag in html may be absolute or relative. Absolute links look like <a href='http://iconfu.com'>iconfu ...</a> - they start with a protocol. Relative links look like <a href='bar.html'>bar info ...</a>. When you link to bar.html from http://example.com/pages/foo.html, the browser constructs the full reference and requests http://example.com/pages/bar.html.

So far, so good.

Relative links may also be of the form <a href='?browse=arrow'>arrow icons</a>. A browser requesting this link from http://iconfu.com/tags/list/0.html will construct this url: http://iconfu.com/tags/list/0.html?browse=arrow

This can be convenient when the code or script that handles the requested page is separate from the code or script that handles the request parameters. This doesn't happen often, but when it does, it's useful to be able to construct the url without needing to know the originating page. For example, a login handler might be implemented as a filter before the page is rendered, so the login request would simply be ?username=foo&password=bar ... this gets expanded by the browser into http://example.com/pages/foo.html?username=foo&password=bar. On the server, your login filter handles the login parameters, and your example/page script handles the rest of the url.

The bad news is that some spidering implementations handle this incorrectly (google's works fine). Instead of requesting http://iconfu.com/tags/list/0.html?browse=arrow from the earlier example, they request http://iconfu.com/tags/list?browse=arrow - they chop out "0.html". My code doesn't like this, and returns an error. Dumb MF spider implementations.

So that was that. Well, here's another bit of news: about 95% of visitors who come to iconfu through search, come from google. There are two ways to explain this: (1) google is the world's dominant search engine, who uses yahoo/live/ask.com anyway; (2) the clever people behind google analytics use some clever reporting techniques to show that google is the world's dominant search engine so why bother with the others.

We can eliminate (2) because as you know googlers Do No Evil. But today, in a flash of insight, I realised (3) perhaps those other search engines are sending me no visitors because they think my site is full of bugs and holes and 500 Internal Server Errors.

I'll fix that today and I'll let you know if I get a little more love from those unloved search engines. And then you can add "be careful with relative urls containing only a query string" to your SEO toolkit.

public interface

26 March 2009

Spidering Internal Pages

No comments:

Post a Comment

see also

tags

Archive

Creative Commons