Separate robots.txt files for http and https content using single codebase

Submitted by Tom Thorp on Wednesday, August 15, 2018 - 22:55
Modified on Friday, July 5, 2019 - 21:29
Search Engine Optimization
During the SEO migration of my Drupal 8 website, I kept finding duplicate and out-of-date URLs when running regular queries on Google's search engine. As someone who expects nothing less than 100% consistency across all advertised links, I felt there had to be a way to fix this issue. 
My first thought was to change my Drupal install to a multi-site, serving the same content on each site, the only difference being whether it was served over http or https. But that idea wouldn't work: both sites would share a common codebase, and hence the same robots.txt as the source that robots consult when crawling for web content. 
There had to be a simpler way ...  and there is. 


The solution is to modify Drupal's default .htaccess file so that it handles queries for 'robots.txt' according to the protocol used. By default, Drupal's .htaccess file is protocol-agnostic when a query is made for robots.txt. This needs to change so that website migrations across protocols (in full or in part) can be streamlined at the site level. 
The amendments I made to .htaccess are as follows: 
<IfModule mod_rewrite.c>
  RewriteEngine on

  # If http query to robots.txt, send output from robots_http.txt
  RewriteCond %{REQUEST_URI} ^/robots\.txt$
  RewriteCond %{HTTPS} off
  RewriteRule "^robots\.txt$" "robots_http.txt" [PT,L]

  # If https query to robots.txt, send output from robots_https.txt
  RewriteCond %{REQUEST_URI} ^/robots\.txt$
  RewriteCond %{HTTPS} on
  RewriteRule "^robots\.txt$" "robots_https.txt" [PT,L]

  # Force redirect to HTTPS unless the query is for robots.txt. The pattern also
  # excludes robots_http.txt and robots_https.txt, because in .htaccess context
  # the rules can run again on the internally rewritten path.
  RewriteCond %{HTTPS} off
  # Anchored so that a forwarded value of "https" does not match.
  RewriteCond %{HTTP:X-Forwarded-Proto} ^http$
  RewriteCond %{REQUEST_URI} !^/robots(_https?)?\.txt$
  RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

  ... etc
From there, copy 'robots.txt' to 'robots_http.txt' and 'robots_https.txt', and edit each copy as desired. You can rename or delete 'robots.txt' afterwards, as any queries to 'robots.txt' will be answered from the two new files.
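
The copy step can be scripted. The following is only a minimal sketch: the function name split_robots is made up for illustration, and it assumes that robots.txt sits in the directory you pass in (your Drupal web root).

```shell
# split_robots is a hypothetical helper name, not part of Drupal.
# Argument 1: the directory holding robots.txt (defaults to the current one).
split_robots() {
  docroot="${1:-.}"
  # Create one copy per protocol; each can then be edited independently.
  cp "$docroot/robots.txt" "$docroot/robots_http.txt" &&
  cp "$docroot/robots.txt" "$docroot/robots_https.txt"
}
```

Run it from the web root as 'split_robots', or pass the web root path as the first argument. Note that it overwrites any existing robots_http.txt and robots_https.txt.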


To test that you are indeed querying the two different robots files, use the 'curl' command (rather than a browser) to query the robots.txt file on your website directly. A browser may show you a cached copy, which yields false readings. Once you are receiving the expected robots.txt file over both http and https, submit both the http version and the https version to Google Search for indexing. 
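
As a rough sketch of the curl check described above, the helper below fetches robots.txt over both protocols and prints each response under a header line. The function name check_robots and the host example.com are placeholders, not anything from my setup.

```shell
# check_robots is a hypothetical helper; pass your own host name.
check_robots() {
  host="$1"
  for scheme in http https; do
    echo "== $scheme://$host/robots.txt =="
    # -s hides the progress meter, -S still prints errors. -L is left out on
    # purpose, so a redirect to the other protocol is not silently followed.
    curl -sS "$scheme://$host/robots.txt"
  done
}
```

Calling 'check_robots example.com' should show the contents of robots_http.txt under the first header and robots_https.txt under the second, assuming the rewrite rules are active. If the http request prints nothing, it was probably answered with a redirect; 'curl -sI' on the same URL shows the status line and headers.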
I hope this helps someone who's experiencing SEO issues with their website. 

About the author

Tom Thorp
Tom Thorp is an IT Consultant living in Miami on Queensland's Gold Coast. With 30+ years working in the IT industry, Tom's experience is a broad canvas. The IT services Tom provides to his clients include:
Website development and hosting
Database Administration
Server Administration (Windows, Linux, Apple)
PABX Hosting and Administration
Helpdesk Support (end-user & technical).