Yet Another Duplicate Content Vulnerability Hits WordPress, Movable Type Blogs (Part 2)
By Greg | August 2, 2007
In this second article about the bug in both WordPress and Movable Type which causes these blogging platforms to deliver the same content via a potentially infinite number of different URLs, I describe a temporary fix to guard your blog against the risk of showing massive quantities of duplicate content to search engines.
The Structure of This Article
Please note that this article comes in two parts:
- Part 1: “Yet Another Duplicate Content Vulnerability Hits WordPress, Movable Type Blogs (Part 1)”, and
- Part 2: “Yet Another Duplicate Content Vulnerability Hits WordPress, Movable Type Blogs (Part 2)”
How to Fix the Problem
First let’s have a look at WordPress, then WordPress Multi-User, and then turn to Movable Type.
Fixing the WordPress Duplicate Content Vulnerability
For the technical reason described in the first part of this article, the fix I’m suggesting here is not part of WordPress itself; instead, it belongs with the server’s underlying mod_rewrite rules, which make trickery like cruft-free permalinks possible in the first place. I have also deliberately aimed the fix at the server level rather than at WordPress’s own internal URL rewriting, because several plugins already mess with the WordPress rewrite architecture, and I would recommend against patching those rules any further in case of some unforeseen conflict with someone else’s patches. By placing these mod_rewrite rules directly in the .htaccess file at the root level of the blog, we can ensure they are executed before WordPress even gets started. This is also significantly less resource-intensive than letting WordPress attempt the job itself: the server issues the redirect before WordPress ever sees the request, and before any content is delivered to the user’s browser.
I believe the following rules, placed into the .htaccess file at the root of the WordPress installation, will do the trick for most setups (but see the important caveats below). They need to go immediately before the rules which WordPress itself places in the .htaccess file:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} !.*(/page/[0-9]*/?)$
RewriteCond %{REQUEST_URI} !^/200[0-9]/?$
RewriteCond %{REQUEST_URI} !^/200[0-9]/[01][0-9]/?$
RewriteCond %{REQUEST_URI} !^/200[0-9]/[01][0-9]/[0-3][0-9]/?$
RewriteRule (.*)(/[0-9]+/?)$ $1/ [R=301,L]
</IfModule>
Here’s how it works…
The first two lines just switch on the Apache rewrite engine, if it’s available (if it isn’t, this fix won’t work, but then you won’t be using fancy permalinks anyway):
<IfModule mod_rewrite.c>
RewriteEngine On
The next line is going to make sure we ignore any URL that ends with /page/ and some digits; this ensures that we don’t mess with built-in WordPress paging for archives:
RewriteCond %{REQUEST_URI} !.*(/page/[0-9]*/?)$
What is happening here is that we’re checking the URL and telling Apache to apply the rule which eventually follows only if the URL does not end with /page/, followed by any digits, followed possibly by a trailing slash. (The parentheses in this line are purely for clarity and do not perform a function.)
The next line is going to make sure we ignore any URL that begins with a year in the first decade of this century and then ends; this ensures that we don’t mess with year-based archives:
RewriteCond %{REQUEST_URI} !^/200[0-9]/?$
What is happening here is that we’re checking the URL again and telling Apache to apply the rule which eventually follows only if the URL is not made up solely of 200, followed by a single digit, followed possibly by a trailing slash.
The next two lines use a similar method to ensure we don’t mess with a monthly or daily archive:
RewriteCond %{REQUEST_URI} !^/200[0-9]/[01][0-9]/?$
RewriteCond %{REQUEST_URI} !^/200[0-9]/[01][0-9]/[0-3][0-9]/?$
These lines check the URL and tell Apache to apply the rewrite rule only if the URL is not made up of either a year followed by a month, or a year followed by a month followed by a day of the month.
The penultimate line does the real work via a rewrite rule, while the final line closes the <IfModule> block that we started with:
RewriteRule (.*)(/[0-9]+/?)$ $1/ [R=301,L]
</IfModule>
This rule tells Apache to take a request for any URL that ends with just some numbers and possibly a trailing slash and redirect it via a permanent redirect to the same URL but without those extra numbers. The $1 is a back reference, which refers to whatever was matched in the first set of parentheses; in effect, it inserts this first part of the URL and throws out whatever was matched in the second set of parentheses, which is the extraneous set of numbers at the end.
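To make the walkthrough above concrete, here is a minimal Python sketch that mimics the four conditions and the rule using ordinary regular expressions. It is only a model of the behaviour, not how Apache itself is implemented: the function name redirect_target and the sample URLs are made up for illustration, and it assumes the blog lives at the document root, so that the rule sees the path without its leading slash.

```python
import re

# Patterns copied from the four rewrite conditions. Each is negated with "!"
# in the .htaccess file, so a match here means "leave this URL alone".
EXCLUDE = [
    r".*(/page/[0-9]*/?)$",
    r"^/200[0-9]/?$",
    r"^/200[0-9]/[01][0-9]/?$",
    r"^/200[0-9]/[01][0-9]/[0-3][0-9]/?$",
]
RULE = re.compile(r"(.*)(/[0-9]+/?)$")


def redirect_target(request_uri):
    """Return the path one pass of the rule would redirect to, or None."""
    if any(re.search(pattern, request_uri) for pattern in EXCLUDE):
        return None
    # Assumption: in a root-level .htaccess file the RewriteRule pattern is
    # matched against the path without its leading slash.
    path = request_uri[1:] if request_uri.startswith("/") else request_uri
    match = RULE.search(path)
    if match is None:
        return None
    # $1 is group 1; group 2 (the trailing digits) is thrown away.
    return "/" + match.group(1) + "/"


# Hypothetical sample URLs: a clean permalink, the same permalink with
# extra digits appended, a monthly archive, and a paged category archive.
for uri in ["/2007/08/02/my-post/", "/2007/08/02/my-post/123/",
            "/2007/08/", "/category/news/page/2/"]:
    print(uri, "->", redirect_target(uri))
```

Running candidate URLs from your own blog through a model like this before touching the live .htaccess file is a cheap way to spot URLs that would be redirected unexpectedly.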
If you are running WordPress using name-based virtual hosting (ask your system administrator), and the www domain preference you have set for your blog (i.e., with or without an initial ‘www’) does not match the ServerName in your httpd.conf file, this rewrite rule may cause an additional 302 redirect, as Apache redirects first to a URL using whichever ‘www’ preference matches ServerName, and WordPress then redirects a second time to the other domain. You can avoid this in one of two ways:
- Update your httpd.conf file so ServerName matches your preference, and ServerAlias matches the alternative, or
- Change the rewrite rule to an external redirect, by inserting http://yourdomain.com/ or http://www.yourdomain.com/ in front of the $1.
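One way to check whether the extra redirect described above is actually happening is to follow the redirects by hand and look at each hop. The following Python sketch does that with the standard library only; the function names head_request and redirect_chain are my own, the host names in the usage comment are placeholders, and the pluggable fetch argument exists purely so the logic can be exercised without a live server.

```python
import http.client
from urllib.parse import urljoin, urlsplit


def head_request(url):
    """Send one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def redirect_chain(url, fetch=head_request, max_hops=5):
    """Follow redirects manually, returning [(url, status), ...] per hop."""
    hops = []
    for _ in range(max_hops):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in (301, 302, 303, 307, 308) or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return hops


# Example use against a placeholder host (not executed here):
# for hop_url, status in redirect_chain("http://yourdomain.com/my-post/2/"):
#     print(status, hop_url)
```

A single 301 followed by a 200 is the result you are after; a 301 followed by a second redirect suggests the ServerName mismatch described above.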
Important Caveats
The rewrite rules given above assume a permalink structure which does not involve anything extraneous at the start of the URL (e.g., the old tradition of inserting ‘archives’ before everything) or elsewhere within the path (e.g., ‘month01’ instead of ‘01’ for the path to a post published in January); if you do use a permalink scheme that has such extraneous crud, make sure to modify the rewrite conditions appropriately so they’re not looking for something they will never match.
In addition, if you use paged posts — either via a plugin or via WordPress’s built-in paging function (via <!--nextpage-->) — in a way that does not involve /page/ before the page number, this fix will not work for you. Instead, it will strip off your page numbers and redirect all requests for them back to the first page of your post. (I’m not sure why WordPress uses one type of URL structure for pagination in archives and a second type of URL structure for pagination in posts.)
Again, if you use any type of post paging without /page/, do not use this fix. (If you do use paging in this way, at least you will have a smaller number of post pages which are vulnerable to the bug, relatively speaking, since only the last page of paged posts will ever be duplicated as a result of this bug.)
Finally, if you ever use numbers-only slugs for your posts (or if you ever allow one to be assigned automatically by WordPress for an untitled post), then this fix may prove problematic for you. If you have a small number of such post slugs, it may be worth the effort to change them.
Oh, and this fix also assumes all your blog posts have occurred within the first decade of this century: it will stop working properly in 2010, and it won’t protect posts before 2000. This is easy enough to fix, if you have the need.
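Both the permalink-prefix caveat and the decade caveat come down to editing the three date-archive conditions. As a sketch under stated assumptions, the Python helper below builds those three patterns from a prefix and a year fragment so that variants can be tried against sample URLs before they go into RewriteCond lines. The helper archive_patterns, the ‘/archives’ prefix, and the widened year fragment (19|20)[0-9][0-9] are illustrative choices of mine, not part of the original rules.

```python
import re


def archive_patterns(prefix="", year=r"200[0-9]"):
    """Return the year, month and day archive patterns as regex strings.

    prefix is literal text that precedes the year (e.g. "/archives");
    year is the regex fragment that matches a four-digit year.
    Both are hypothetical knobs added for this illustration.
    """
    base = "^" + re.escape(prefix) + "/" + year
    return [
        base + "/?$",
        base + "/[01][0-9]/?$",
        base + "/[01][0-9]/[0-3][0-9]/?$",
    ]


def is_archive(request_uri, patterns):
    """True if any pattern matches, i.e. the URL would be left alone."""
    return any(re.search(pattern, request_uri) for pattern in patterns)


original = archive_patterns()
widened = archive_patterns(year=r"(19|20)[0-9][0-9]")
prefixed = archive_patterns(prefix="/archives")

for uri in ["/2007/08/", "/2010/03/", "/1999/", "/archives/2007/08/02/"]:
    print(uri, is_archive(uri, original), is_archive(uri, widened),
          is_archive(uri, prefixed))
```

Each string returned by the helper would take the place of the pattern after the ! in the corresponding RewriteCond line.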
Most importantly of all: When making any change to your server’s .htaccess files, make sure you understand exactly what every single change is doing, and test extensively. If anything fails to work after making the change, revert the file immediately and re-evaluate.
Fixing the WordPress Multi-User Duplicate Content Vulnerability
If you’re running WordPress Multi-User, the rewrite scheme described above won’t work as it is. A slightly modified version will, I believe, work for the most common setups:
RewriteCond %{REQUEST_URI} !.*(/page/[0-9]*/?)$
RewriteCond %{REQUEST_URI} !^/200[0-9]/?$
RewriteCond %{REQUEST_URI} !^/200[0-9]/[01][0-9]/?$
RewriteCond %{REQUEST_URI} !^/200[0-9]/[01][0-9]/[0-3][0-9]/?$
RewriteRule (.*)(/[0-9]+/?)$ http://%{HTTP_HOST}/$1/ [R=301,L]
Again, these need to go in the .htaccess file immediately before WPMU’s own rewrite rules. What differs here is that the host name must be explicitly fetched and inserted for the redirect to work correctly. (This assumes you are using vhosts; if you are running WPMU with subdirectories, you’ll need something closer to the original example given above.)
Naturally, this only works for WordPress MU if none of your users have specified a wild and crazy permalink scheme that involves extra gubbins at the start of the URL, or extra text anywhere else within the path. If you’re fortunate enough not to have any zany permalink schemes yet, then you may wish to prevent them from appearing in future by removing the custom permalink option from options-permalink.php, offering only the more common permalink schemes which you can match with your rewrite conditions. (Nothing stops you from offering a huge number of different schemes to your users; provided you know all the possibilities in advance, you can modify your rewrite conditions. The problem comes when you don’t know in advance what they might look like.) If you run a very large site, where it’s not so easy to verify permalink choices, or if your users have already chosen zany schemes, I’m afraid you’re on your own.
Most importantly of all: When making any change to your server’s .htaccess files, make sure you understand exactly what every single change is doing, and test extensively. If anything fails to work after making the change, revert the file immediately and re-evaluate.
Fixing the Movable Type Duplicate Content Vulnerability
OK, this I can only offer in good faith, but without personal testing of any kind. As I said above, I do not use Movable Type, and I do not have access to a Movable Type installation for testing purposes. This fix is an educated guess, nothing more.
In my limited experience, it looks like most Movable Type blogs use two different URL structures, depending on whether we are looking at an individual post or a date-based archive — for example, it is common to see /post/ preceding an individual post, and /posts/ preceding a date-based archive. In addition, some use .html extensions or even .shtml extensions for individual posts or for archive pages. There is so much variation here that what I’m going to suggest is something that will accommodate quite a bit of uncertainty, but at the cost of still allowing through a certain number of duplicate URLs, and without offering full protection for some permalink schemes.
Here we go, some tentative rules intended for the .htaccess file of the root directory of the blog:
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{REQUEST_URI} !.*/200[0-9]/?$
RewriteCond %{REQUEST_URI} !.*/200[0-9]/[01][0-9]/?$
RewriteCond %{REQUEST_URI} !.*/200[0-9]/[01][0-9]/[0-3][0-9]/?$
RewriteRule (.*\.s?html)(/.+/?)$ $1/ [R=301,L]
</IfModule>
The first three conditions are intended to exclude date-based archives from consideration, much as we did in the WordPress fix; however, this time we’re not going to insist that the years come at the start of the URL (because there seems to be so much variation in what Movable Type bloggers put before the years). This enables us to accommodate uncertainty about what might appear there, but it also means that our rewrite conditions will ignore situations where the extra junk appended to a permalink turns out to match a year in the first decade of the century (i.e., numbers like ‘2000’, or ‘2001’, or ‘2002’). If you know what is going to appear before the year in archives (the obvious case being that it’s your own blog), then just replace the .* at the start of the rewrite conditions with ^/whatever, where ‘whatever’ is whatever appears there.
The rewrite rule itself is where the work gets done:
RewriteRule (.*\.s?html)(/.+/?)$ $1/ [R=301,L]
What is intended here is that we look for any sequence that ends in either .html or .shtml, and we throw away anything that comes after it. (Movable Type blogs seem to go for permalink structures that pretend to be static .html or .shtml files.) You might wonder: if that’s all we were going to do anyway, why exclude all those date-based archives? That’s because of just one Movable Type-powered site I encountered with a .shtml extension other than at the very end of a date-based archive. I doubt this is very common, and to me it seems a bit screwy, but just in case… (If you know this will not be the case, then just that one rewrite rule, without the preceding rewrite conditions to exclude date-based archives, may work just fine for you.) Note that this rule throws out anything that comes after the .html or .shtml, because Movable Type tolerates all kinds of junk in the URL suffix.
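Since these Movable Type rules are untested, it may help to model them before trying them on a live server. The Python sketch below mimics the three conditions and the rule with ordinary regular expressions, and lets the leading .* of the conditions be swapped for a known prefix. The names archive_excludes and redirect_target, the /post/ and /posts/ sample paths, and the assumption that the blog sits at the document root are all mine.

```python
import re

RULE = re.compile(r"(.*\.s?html)(/.+/?)$")


def archive_excludes(prefix=".*"):
    """Build the three date-archive patterns; prefix replaces the leading .*"""
    return [
        prefix + r"/200[0-9]/?$",
        prefix + r"/200[0-9]/[01][0-9]/?$",
        prefix + r"/200[0-9]/[01][0-9]/[0-3][0-9]/?$",
    ]


def redirect_target(request_uri, excludes=None):
    """Return the path the rule would redirect to, or None for no redirect."""
    excludes = archive_excludes() if excludes is None else excludes
    if any(re.search(pattern, request_uri) for pattern in excludes):
        return None
    # Assumption: in a root-level .htaccess file the RewriteRule pattern is
    # matched against the path without its leading slash.
    path = request_uri[1:] if request_uri.startswith("/") else request_uri
    match = RULE.search(path)
    if match is None:
        return None
    # As written, the substitution $1/ keeps a slash after the extension.
    return "/" + match.group(1) + "/"


loose = archive_excludes()            # conditions as given above
strict = archive_excludes("^/posts")  # hypothetical known archive prefix

for uri in ["/post/my-entry.html/junk", "/post/my-entry.html/2007",
            "/post/my-entry.shtml", "/posts/2007/08/"]:
    print(uri, redirect_target(uri, loose), redirect_target(uri, strict))
```

The second sample URL shows the trade-off described above: junk that happens to look like a year slips past the loose conditions, but is caught once the conditions are anchored to a known prefix. Note too that the redirect target, as the rule is written, still carries a trailing slash after .html or .shtml.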
Important Caveats
Of course, the rewrite rule given above works only if your posts end with either .html or .shtml! If they do not end that way but instead end looking like a directory name (like most WordPress blogs do, or like typical Movable Type date-based archives do), then you can replace the rewrite rule with the rule given for WordPress blogs instead — but in that case, you definitely must keep the three rewrite conditions intended to exclude date-based archives.
Most importantly of all: When making any change to your server’s .htaccess files, make sure you understand exactly what every single change is doing, and test extensively. If anything fails to work after making the change, revert the file immediately and re-evaluate.
More Information on the WordPress and Movable Type Duplicate Content Bug
For more information on identifying and verifying the duplicate content vulnerability in WordPress and Movable Type, plus background references both on the bug and on understanding the fix, please see the first part of this article.
16 Comments »
August 8th, 2007 at 9:29 am
[...] Reported over at WhereElseToPutIt.com is news that Matt Cutts is publishing duplicate content on his WordPress blog — plus an explanation of the bug (which also affects Movable Type, by the way) and how to fix it. [...]
August 13th, 2007 at 11:52 pm
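I use the following code, as part of a larger custom SEO plugin, to ensure that all posts are only reached from one URL. This has some limitations (you can’t have paged posts), but it removes most of the duplicate content errors and works for me.
// Hooked on template_redirect rather than wp_head: wp_head fires after page
// output has begun, when headers can no longer be sent reliably.
add_action('template_redirect', 'seo_permalink');
function seo_permalink() {
    global $post;
    // The URL actually requested (assumes plain HTTP).
    $cur_url = 'http://' . $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'];
    // Canonical URL for single posts and pages; anything else is left alone.
    $permalink = '';
    if (is_single()) {
        $permalink = get_permalink($post->ID);
    } elseif (is_page()) {
        $permalink = get_page_link($post->ID);
    }
    if (!$permalink) {
        return;
    }
    // Permanently redirect any non-canonical URL to the permalink.
    if ($cur_url != $permalink) {
        header('HTTP/1.1 301 Moved Permanently');
        header('Status: 301 Moved Permanently');
        header("Location: $permalink");
        exit(0);
    }
}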
August 18th, 2007 at 5:36 pm
Hi Greg,
Your post doesn’t talk about how to apply the fix to a URL structure that follows the /%category%/%postname%/ structure in WordPress.
The htaccess rules seem to be for WordPress blogs that are using the /year/month/day/post/ structure.
I applied those rules to my htaccess and it doesn’t seem to work because my blog is using /category/postname/
August 20th, 2007 at 5:30 pm
Aaron –> Yep, that looks like it does the trick for you as well, with your server immediately returning a 301 in response to queries with extra page-style junk at the end. Unfortunately, you still lose paging capability. If only there were a fix that preserved paging!
Darrin –> Hmm, you have me stumped! I use this method just fine on one of my other blogs that has /%category%/%postname%/ permalinks. Have you been able to locate the source of whatever problem it was that you experienced?
In theory, the mod_rewrite rules I described here ought to work for pretty much any permalink structure. The discussion of /year/month/day etc. is not about permalinks, but rather about excluding date-based archives from the mod_rewrite rules. In other words, the ending part of the mod_rewrite rules works to strip off extra bumf at the end of the URL, while the preceding conditions just ensure we don’t do that when we’re looking at a date-based archive. Make sense?
All the best,
Greg
October 13th, 2007 at 1:33 am
hi, i would very much like to use this in my htaccess, but i am unsure if i can use it on my site with my url structure?
i put the code into my htaccess with no problems but then took it out after reading your caveats.
could you please let me know if i can use this code
here’s my site url structure
http://www.example.com/post-name/
October 13th, 2007 at 9:15 am
Hi Martin,
At first glance, I would have thought the rules should work fine for you, but don’t take my word for it: test extensively to make sure your site works as it should!
All the best,
Greg
October 13th, 2007 at 2:39 pm
thanks Greg, how should i insert the code? should it be at the very start of my htaccess?
my current htaccess looks likes this
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www.example.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R,L]
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
Options -Indexes
DirectoryIndex index.php index.html
Order Deny,Allow
Deny from all
October 13th, 2007 at 8:21 pm
Hi Martin,
Yes, you want the new mod_rewrite rules to fire before WP’s own rules.
One other thought: it might be worth double-checking your “Deny from all” line, as usually this would be wrapped inside something indicating what it is you want to deny access to (e.g., specific files, like .htaccess).
Best of luck!
I’m out of the office right now having some time off, so I’m afraid I’ll be scarce for the next week or so.
All the best,
Greg
October 14th, 2007 at 1:26 pm
Hi Greg, just wanted to report that i added your code to my htaccess and everything seems to be working fine, no problems.
the previous code i had inserted caused me a cgi error when saving a post, but yours works perfectly.
thanks very much for your help!
p.s. the deny all line was something i saw suggested on another site, so im not sure how to change it. But i will look into it if i can find the site again.
cheers
Martin
December 15th, 2007 at 1:37 am
[...] It’s true, check it out for yourself. Thankfully, there’s an excellent article at Where Else To Put It on editing your htaccess file in order to eliminate this potential nightmare. Just imagine if a [...]
January 30th, 2008 at 4:02 am
You are a mod_rewrite wiz-kid. Thanks for the tip!! I didn’t use your rules to avoid the infinite URLs problem but I was looking for a way to get WPMU to ignore a certain directory. It worked.. thanks!!
January 30th, 2008 at 10:08 am
You’re very welcome!
All the best,
Greg
August 7th, 2008 at 8:14 am
Hello! I hope you can help me on this…
My permalink structure is /%category%/%postname%/.
I assign one category to every post. Recently, I have changed a category name. The links work OK, but if you access the old link (www.domain.com/old-category-name/post-name) it still works.
Webmaster tools tells me that I have duplicate content… Technically, the page with the old category name shouldn’t exist…
What can I do?
Thank you,
Nicholas
August 7th, 2008 at 11:13 am
Hi Nicholas,
Hmm, it’s hard to tell from the little bit that you’ve described…
Note that it is possible (but not normally desirable) to wind up with two different categories or tags which have the same name but different slugs; that’s one possibility which could account for what you’re seeing.
If nothing else, you could always brute force it to redirect with 1) some mod_rewrite rules or 2) complete deletion and recreation of the post in question.
All the best,
Greg
October 13th, 2008 at 12:25 pm
Hello Greg,
Thanks for your reply. I’ve realized that the best permalink structure you can choose is /%postname%/. If you later change the category names, no problem should occur.
Regards,
Nicholas
December 24th, 2008 at 9:48 pm
[...] prefer you read this post to know more. The post also gives info about fixing this duplicate content issue for Movable Type [...]