
During the SEO migration of my Drupal 8 website, I noticed duplicate and out-of-date URLs turning up in regular queries on Google Search. As someone who expects nothing less than 100% consistency across all advertised links, I felt there had to be a way to fix this.
My first thought was to convert my Drupal install to a multi-site setup, serving the same content on each site, the only difference being whether it was served over HTTP or HTTPS. That idea wouldn't work, however, as both sites would serve crawlers the same robots.txt file from the common codebase.
There had to be a simpler way ... and there is.
.htaccess
The solution is to modify Drupal's default .htaccess file so that it also takes into account any requests for 'robots.txt'. By default, Drupal's .htaccess file is protocol agnostic when robots.txt is requested. This needs to change so that website migrations across protocols (in full or in part) can be handled more cleanly at the site level.
The amendments I made to .htaccess are as follows:
<IfModule mod_rewrite.c>
RewriteEngine on
# If an http query is made for robots.txt, send the output from robots_http.txt
RewriteCond %{REQUEST_URI} ^/robots\.txt
RewriteCond %{HTTPS} off
RewriteRule "^robots\.txt$" "robots_http.txt" [PT,L]
# If an https query is made for robots.txt, send the output from robots_https.txt
RewriteCond %{REQUEST_URI} ^/robots\.txt
RewriteCond %{HTTPS} on
RewriteRule "^robots\.txt$" "robots_https.txt" [PT,L]
# Force a redirect to HTTPS if the query is not for robots.txt
RewriteCond %{HTTPS} off
RewriteCond %{HTTP:X-Forwarded-Proto} http
RewriteCond %{REQUEST_URI} !^/robots\.txt
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]
# ... remainder of Drupal's default .htaccess rules
</IfModule>
From there, copy 'robots.txt' to both 'robots_http.txt' and 'robots_https.txt', and adjust each copy as desired, as shown below. You can rename or delete 'robots.txt' afterwards, as any requests for 'robots.txt' will be answered by the two new files.
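As a minimal sketch, run the following from the Drupal web root (the exact edits you then make to each copy will depend on your own migration):
# Copy Drupal's robots.txt into the two protocol-specific files,
# then edit each one as desired (e.g. different sitemap or Disallow rules)
cp robots.txt robots_http.txt
cp robots.txt robots_https.txt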
Testing
To confirm that you are indeed being served two different robots files, use the 'curl' command (rather than a browser) to query robots.txt on your website directly. A browser may serve a cached copy of the file, giving you a false reading. Once you are receiving the expected robots.txt content over both HTTP and HTTPS, submit both versions to Google for indexing.
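For example (example.com is a placeholder; substitute your own domain):
# Fetch the robots file over each protocol and confirm the contents differ
curl http://example.com/robots.txt
curl https://example.com/robots.txt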
I hope this helps someone who's experiencing SEO issues with their website.
About the author
Tom Thorp is an IT consultant living in Miami on Queensland's Gold Coast. With more than 30 years in the IT industry, he has extensive experience. The IT services he provides to clients include website development and hosting, database administration, server administration (Windows, Linux, Apple), PBX hosting and administration, and helpdesk support (end-user and technical).
If you like any of my content, consider a donation via crypto by clicking on one of the payment methods.