Yet Another Duplicate Content Vulnerability Hits WordPress, Movable Type Blogs (Part 1)

By Greg | August 2, 2007

This isn’t the same tired old WordPress ‘vulnerability’ you’ve already heard so much about (i.e., the fact that certain WordPress permalink choices and theme designs will cough up the same article content via multiple URLs). No, this one is actually a bug in both WordPress and Movable Type, a bug which causes these blogging platforms to deliver the same content via a potentially infinite number of different URLs. If you’re not already using the fix provided here, your blog is at risk of showing massive quantities of duplicate content to search engines — enough to dilute the relevance of your real content with a flood of identical copies.

The Structure of This Article

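Please note that this article comes in two parts: this first part describes the vulnerability, and the second part, Fixing the WordPress and Movable Type Duplicate Content Bug, explains how to fix it.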

What is the New Duplicate Content Vulnerability?

Here’s how to confirm whether a blog is affected by the bug…

Visit any reasonably current WordPress or Movable Type blog that uses cruft-free permalinks (i.e., permalinks that look similar to what you see in your browser’s address bar right now, rather than ending with something like index.php?p=123). Click on a link for some individual post, and after the page has finished loading, click in your browser’s address bar and append something like /12345 to the end of the URL; with just a few exceptions, if you load that new URL, a blog affected by the bug will display exactly the same content as it did at the previous URL.

So, for example, if you visit an affected blog at a URL which looks like this:

http://example.com/blog/interesting-post/

You can also visit a URL that looks like any of these and see the very same content:

http://example.com/blog/interesting-post/12/

http://example.com/blog/interesting-post/12345/

http://example.com/blog/interesting-post/314159265359/

The exceptions? If you’re looking at a WordPress blog that displays one post across several pages, the duplicated content will always be the last page in the post. And there is at least one WordPress plugin which patches the display of post content in such a way that you’ll only see a normally formatted page with the post title, but without the post body. Finally, you won’t see the behaviour I’ve described on this blog, because all my blogs have already been fixed to guard against the vulnerability.

With WordPress, the extra bit you append to the URL needs to be digits only, or an error will (correctly) be returned. With Movable Type, it appears that any old combination of letters, numbers, and many other characters can be appended, and you’ll keep seeing the same content. I stopped testing Movable Type after finding that all of the following characters could be included in the junk part of the URL: -+!$=&*\().^_’<>?~`§.
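For anyone who would rather script the manual check described above, here is a minimal Python sketch. The names probe_urls and looks_affected are hypothetical helpers invented for illustration, and the sketch assumes you fetch each page yourself and pass in its status code and body. Comparing bodies for exact equality is also an assumption: a real page may differ slightly between requests, so a looser comparison might be needed in practice.

```python
from urllib.parse import urlsplit, urlunsplit


def probe_urls(permalink, suffixes=("12", "12345", "314159265359")):
    """Return the permalink with each junk numeric segment appended.

    The default suffixes mirror the example URLs in the article.
    """
    parts = urlsplit(permalink)
    base_path = parts.path.rstrip("/")
    return [
        urlunsplit(
            (parts.scheme, parts.netloc, f"{base_path}/{suffix}/",
             parts.query, parts.fragment)
        )
        for suffix in suffixes
    ]


def looks_affected(original, probe):
    """Compare two (status_code, body) pairs.

    A blog looks affected when the junk URL answers 200 with the same
    body as the real permalink, instead of answering with an error.
    """
    original_status, original_body = original
    probe_status, probe_body = probe
    return (
        original_status == 200
        and probe_status == 200
        and probe_body == original_body
    )
```

Calling probe_urls("http://example.com/blog/interesting-post/") yields the three example URLs listed earlier. Fetching them is left out on purpose, so the sketch makes no claims about any particular HTTP library.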

Why is this a problem? It’s a problem because Google and other major search engines frown pretty badly on sites which republish the same content multiple times: duplicating content tends to be spammer territory, and search engines don’t like it one bit. Show them a high ratio of duplicates to originals, and you’re just asking for trouble. If your blog is affected by this bug, the fact that it will return multiple copies of the same content via potentially infinitely many distinct URLs puts your real content at risk of being lost in a vast flood of duplicates. All that’s required is for someone to alert a search engine bot to the existence of your massively duplicated content (by linking to you, by suggesting the URL directly, etc.).

My limited testing suggests that this bug manifests itself in all recent versions of Movable Type, all recent versions of WordPress, and all versions of WordPress Multi-User. It also affects the Movable Type-powered hosted blog service at Vox.com (but not, apparently, the one at TypePad.com), and it affects the WordPress-powered hosted blog service at WordPress.com.

With around 50 million pages of content at WordPress.com alone, that’s a pretty big chunk of potentially duplicated content.

How Does This Differ From All the Other WordPress Duplicate Content Issues We’ve Heard So Much About?

This differs from the more widely discussed WordPress duplicate content issues in two ways:

Number 1 above is self-explanatory, but number 2 might be less obvious. The original WordPress duplicate content issues which have received so much attention over the last year can be attributed to the fact that a single post might be accessed via multiple categories and via multiple different archives (both category-based and calendar-based). The only people affected by these duplicate content issues were blog owners who chose to include category names in their permalink structure and then categorised single posts under multiple categories, and/or those who relied on themes that displayed entire posts rather than excerpts on category, archive, and home pages. Themes which only display an entire post at the permalink for that post were not affected, and permalink settings without category names likewise were not affected. (Actually, this is not strictly true: the way WordPress handles pingback and trackback URLs affects everybody using the platform. Fortunately, this is easily fixed via robots.txt.)

So in an important sense, none of these previous duplicate content issues had much of anything to do with WordPress: sure, WordPress made it possible for blog owners to make choices that would expose them to duplicate content issues, just like cars make it possible for car owners to drive carelessly and run into things. But it certainly didn’t require them to do so. The possibility of duplicate content in these cases is an architectural feature of the software’s design (and its support for multiple categories, custom permalinks, and a flexible theme system).

By contrast, the vulnerability I am describing here is not a feature, it is a bug: WordPress and Movable Type are returning content at URLs that should not point to anything. I.e., unless a post really has 314159265359 pages, it is simply wrong to return anything but an error when someone asks for page number 314159265359. Technically speaking, the software should be returning a 404 status code and an error page — not a 200 status code and a page full of duplicate content.
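The rule in the previous paragraph can be stated as a small Python sketch. This is not code from WordPress or Movable Type; status_for_page is a hypothetical name, and the sketch assumes pages are numbered from 1 and that the number of pages in the post is already known.

```python
def status_for_page(requested_page, page_count):
    """Return the HTTP status a strict implementation would send.

    Assumption: valid page numbers run from 1 to page_count inclusive.
    Anything outside that range does not exist, so it gets a 404.
    """
    if 1 <= requested_page <= page_count:
        return 200
    return 404
```

Under these assumptions, a request for page 314159265359 of a three-page post gets a 404, while a request for page 3 gets a 200.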

How Did the Problem Arise?

I can’t really speculate on how this problem came to appear in Movable Type, as I’m not a Movable Type user; I don’t have access either to the software’s source code or to a test installation. All I know is what I’ve been able to gather from publicly available blogs running Movable Type.

However, I can tell you how it came to be a part of WordPress.

Around one year ago, a report was filed on the public bug tracker for WordPress, a report which described a problem with the software returning a mangled article page whenever someone tried to visit a post via a URL specifying a page number higher than the number of pages in the post. The developers quickly issued a fix, which was simply to check whether the requested page number was higher than the number of available pages for that post, and return the content from the highest available page number if that check was true. But the fix raised another obvious problem: it created duplicate content. Lots of it. Although the duplicate content problem was pointed out almost immediately and the bug report reopened, the developers set the concerns aside — suggesting that 1) duplicate content vulnerabilities already existed in WordPress anyway (the implication being, I guess, why fix this one?) and 2) it was too hard to fix the problem for technical reasons. The original bug reporter protested, but the developers closed the discussion again and simply incorporated the original fix into a subsequent release of WordPress.
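To see why that fix produces duplicates, here is a rough Python model of the behaviour as described above. It is an illustration only, not the actual WordPress code; clamp_page and urls_by_served_page are hypothetical names, and the URL layout (page number appended to the permalink) follows the examples earlier in this article.

```python
from collections import defaultdict


def clamp_page(requested_page, page_count):
    """Model of the described fix: serve the highest available page
    whenever the requested page number is too high."""
    return min(requested_page, page_count)


def urls_by_served_page(permalink, requested_pages, page_count):
    """Group request URLs by the page of content they end up serving."""
    served = defaultdict(list)
    for requested in requested_pages:
        page = clamp_page(requested, page_count)
        served[page].append(f"{permalink}{requested}/")
    return dict(served)
```

For a two-page post, requests for pages 2, 12, and 12345 all land in the same group, which is exactly the duplicate content problem: every page number above the real maximum becomes another URL for the last page.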

I’ve already described above why the comparison between this bug and previous duplicate content vulnerabilities offers a poor rationale for not fixing this bug (namely, it is a bug, not a feature). The second rationale for not fixing the bug — that it is too hard to fix the problem for technical reasons — carries more weight. Technically speaking, by the time WordPress has discovered that a non-existent page is being requested, according to the developers, it is too late to issue an error code: headers have already been sent.

What is the Status of This Bug ‘In the Wild’?

Essentially every blog I have tested which fits the profile I originally described (i.e., running recent versions of WordPress, WordPress Multi-User, or Movable Type) has proven vulnerable to this duplicate content problem. And while at least one WordPress plugin changes how the bug manifests itself — by returning a normal page without the actual text of the post — WordPress is still returning that post-free page via arbitrarily many different URLs.

It is not clear to me how one would go about discovering whether this bug has in fact resulted in large volumes of duplicate content being logged at vulnerable blogs by any of the major search engines. While it is easy enough to extract a listing from Google showing pages flagged as ‘supplemental’, it is less straightforward to narrow that search down to URLs specifically fitting the pattern required here.

However, I can say that big names in the SEO (search engine optimisation) community don’t seem to have noticed the problem yet, because their own blogs remain vulnerable to the bug. I would have imagined that if the SEO experts already understood the problem, the first thing they would have done would have been to implement protection for their own blogs. Their next step would probably have been to start exploiting it; since they haven’t done the first, I’m guessing they haven’t done the second.

The Movable Type blog of Aaron Wall at seobook.com, for example, is still vulnerable as of this writing. So too are blackhat SEO blogs running WordPress such as seoblackhat.com and shoemoney.com. Interestingly, I did find one blackhat SEO blog running WordPress that returned 404 errors in response to the type of queries I’ve described here. However, that same site also returned 404 errors in response to certain other correct queries, so it’s unclear whether that particular blackhatter might just have some significant software problem in his installation that coincidentally offers protection.

As for other sites I’ve checked, I mentioned previously that the Movable Type hosted service at Vox.com is affected (nearly 2 million pages), as is the blog for Movable Type publisher Six Apart. The flagship WordPress.com (50 million pages) and other WordPress MU sites like the Harvard Law blogs (half a million pages) are vulnerable, as are other high-profile WordPress sites like Techcrunch (1 million pages).

By the time you read this article, hopefully some of those sites will have been fixed.

Six Apart has recently been fully informed of the problem. The WordPress developers have also been fully informed about it, and have been for around a year. Hopefully, both platforms will soon benefit from new (in the case of Six Apart) or renewed (in the case of WordPress) attention to the bug. Given the sheer volume of traffic which flows to and from WordPress and Movable Type powered areas of the so-called ‘blogosphere’ (does anyone else find that term just about as annoying as ‘Web 2.0’?), a bug which can compromise the integrity of search engine coverage of these sites carries the potential to tweak the flow of traffic on the internet in interesting and unpredictable ways.

UPDATE: I’ve just found that some sites which do not use cruft-free permalinks at all (e.g., digg.com) are also affected by this bug. A quick test shows that the Digg blog, running a quite old version of WordPress and without fancy permalinks, is vulnerable.

Fixing the WordPress and Movable Type Duplicate Content Bug

For more information on how to fix the duplicate content vulnerability in WordPress and Movable Type, at least until the respective software developers issue more elegant fixes, please see the second part of this article.

7 Responses to “Yet Another Duplicate Content Vulnerability Hits WordPress, Movable Type Blogs (Part 1)”

  1. WordPress, Movable Type Hit by New Duplicate Content Bugs Says:
    August 2nd, 2007 at 4:27 pm

    [...] over at Where Else to Put It? is a new duplicate content vulnerability that is affecting both WordPress and Movable Type blogs. What is duplicate content, and why should [...]

  2. Donncha O Caoimh Says:
    August 9th, 2007 at 11:47 am

    Dumb question – where are the links to these pages with numbers appended to the urls? If there aren’t any links then Google won’t index them and no problem surely?

  3. Greg Says:
    August 9th, 2007 at 11:58 am

    Hi Donncha,

    Hey, many thanks for stopping by!

    (For any readers who might not be familiar with the real brains behind WordPress, Donncha is the lead developer for the multi-user version of the software.)

    Nope, it’s not a dumb question at all. The answer is that in an ideal world, there wouldn’t be any links with extraneous page numbers appended to the URL. But unfortunately, all it takes is for a person who wants to compromise the search engine visibility of a website to start publishing such links anywhere that a search engine can find them and crawl them. Anywhere will do: some junky forum somewhere, a site of their own, a direct submission to Google, a bit of blog comment spam, etc.

    The point is that the links to duplicate content don’t have to be important, or authoritative, or even ever seen by a human being: all it takes is for a search engine bot to see them, and the damage is done.

    All the best,
    Greg

  4. Wordpress Duplicate Content Vulnerability Says:
    August 11th, 2007 at 6:11 am

    [...] Mulhauser brought into my attention a Duplicate Content Vulnerability present in Wordpress and Movable [...]

  5. Wordpress Duplicate Content Vulnerability | WebZaurus Says:
    October 8th, 2008 at 2:32 pm

    [...] Mulhauser brought into my attention a Duplicate Content Vulnerability present in Wordpress and Movable [...]

  6. Ehab Says:
    January 16th, 2009 at 10:58 pm

    I do not support the idea that an attacker would go such long miles to try and hurt any website.

    Most SEO aware webmasters place noindex to archives and category pages, etc.

    Nice find though -)

  7. Matt Says:
    December 22nd, 2009 at 4:36 pm

    Hi guys, I think this is really a useful blog to hang out and to get some knowledgable content.
