For a project I’ve been working on, I wanted to have my Sidekiq worker (which is part of an RSS crawler) discover the favicon for a web site and cache it for later display. It was fun figuring out a way to do this, so I just had to share.
A Brief History of Favicons
Favicons, or “shortcut icons,” can be defined in multiple ways. Like all too many things in web design, browsers handle them in slightly different and mildly incompatible ways, meaning there’s plenty of redundancy. Favicons came to be when Microsoft added them to Internet Explorer 5 in 1999, implementing a feature where the browser would check the server for a file named favicon.ico and display it in certain parts of the UI. The following year, the W3C published a standard method for defining a favicon. Rather than simply having the browser look for a file in the root directory, an HTML document should specify a file in the header with a <link> tag, just like with stylesheets.
Fast forward to the present, and you have a bit of screwiness.
- All major web browsers check for the link tag first, and fall back to favicon.ico if it’s not found.
- You can define multiple icons in the HTML header. You can have ICO/PNG/GIF formats, as well as different sizes.
- Some browsers support larger 32×32 favicons, while others will only use the 16×16 ones. Chrome for Mac prefers the 32×32 ones, and scales them down to 16×16 on Macs without Retina displays.
- Big Bad Internet Explorer only supports ICO files for favicons, not PNGs.
The most compatible way to set up your favicon is to define both 32×32 and 16×16 icons in your header, using the PNG format, and make a 16×16 ICO formatted one to name “favicon.ico” and drop into your web root. Browsers that play nicely will use the PNG ones in whatever dimensions they prefer, and IE will fall back to the ICO file.
Writing the Class
Now that the history lesson is out of the way, you can see why there’s a little bit of a challenge here. Depending on how badly you want to find and display that icon, you may have to write logic for the different methods. For this tutorial, I will focus on two: the simplest, which is looking to see if there’s a favicon.ico, and a basic implementation of checking for a link tag defining a shortcut icon.
Before we do anything else, we need to install a few dependencies: the httparty and nokogiri gems. Either add them to your Gemfile and do a bundle install, or use the gem install command to install them manually.
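If you go the Gemfile route, a minimal one might look like the sketch below. It assumes the only third-party gems are HTTParty and Nokogiri and leaves versions unpinned. The base64 library comes with Ruby, although on Ruby 3.4 and later it is a bundled gem, so a Bundler-managed project may need an entry for it too.

```ruby
# Gemfile (a minimal sketch; no versions pinned)
source "https://rubygems.org"

gem "httparty"
gem "nokogiri"
# Uncomment on Ruby 3.4+ if Bundler complains that base64 is missing:
# gem "base64"
```

The manual equivalent would be running gem install httparty nokogiri.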
Now require the necessary libraries at the top of a new Ruby file and we can get going.
    require "httparty"
    require "nokogiri"
    require "base64"
We can define a class to make a nice, clean interface for this to keep it modular and easier to reuse. As you can see below, I’ve made a Favicon class and added some accessors for instance variables, as well as an initialize method that assigns the parameter it receives to the @host instance variable before calling the method we will be defining next.
    require "httparty"
    require "nokogiri"
    require "base64"

    class Favicon
      attr_reader :host
      attr_reader :uri
      attr_reader :base64

      def initialize(host)
        @host = host
        check_for_ico_file
      end
    end
We’ll be implementing the simplest part first. The check_for_ico_file method will send an HTTP GET request to /favicon.ico on the server specified in @host and check to see if a file exists. (The server will send a 200 OK response if it does, and a 404 Not Found error otherwise.) If it does, the URL will be saved to an instance variable and the icon file’s contents will be base64 encoded before being saved to an instance variable as well.
The HTTParty gem is great for this, since it reduces a basic HTTP request like this one to a single method call.
    # Check /favicon.ico
    def check_for_ico_file
      uri = URI::HTTP.build({:host => @host, :path => '/favicon.ico'}).to_s
      res = HTTParty.get(uri)
      if res.code == 200
        @base64 = Base64.encode64(res.body)
        @uri = uri
      end
    end
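One thing the method above does not cover is that it only ever builds a plain http:// URL. HTTParty normally follows redirects, so that is often fine, but if you would rather try HTTPS first, something like this sketch could build the candidate URLs. The favicon_candidates helper is my own invention for illustration and is not part of the class.

```ruby
require "uri"

# Hypothetical helper (not part of the original Favicon class):
# builds the /favicon.ico URL for a host under both schemes, HTTPS first.
def favicon_candidates(host)
  [URI::HTTPS, URI::HTTP].map do |scheme|
    scheme.build(host: host, path: "/favicon.ico").to_s
  end
end

# Print the candidate URLs for a placeholder host.
puts favicon_candidates("example.com")
```

You could then loop over the candidates and keep the first one that returns a 200.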
If you want, you could go ahead and instantiate the class to try out what we have so far. If you pass it the domain name of a site that uses the /favicon.ico convention, the object should find it without issue.
    favicon = Favicon.new("arstechnica.com")
    puts favicon.uri    # Outputs http://arstechnica.com/favicon.ico
    puts favicon.base64 # Outputs a bunch of base64-encoded gibberish. More on this later
    puts
    puts favicon.host   # Outputs arstechnica.com
Now let’s handle link tags! The process for that is a little bit more in-depth. First we need to request a web page from the server, such as the index page, and parse it for tags that resemble <link rel="shortcut icon" href="..." />. Then we have to evaluate the contents of href to make sure it’s an absolute URL, and prepend the domain name if it is not. After that, we can finally make a request to get the icon itself and save it.
Still with me? Excellent, now here’s the code to do that. I’ll comment it a little more thoroughly, since it looks messier at a glance.
    # Check "shortcut icon" tag
    def check_for_html_tag
      # Load the index page with HTTParty and pass its body to Nokogiri for parsing
      uri = URI::HTTP.build({:host => @host, :path => '/'}).to_s
      res = HTTParty.get(uri)
      doc = Nokogiri::HTML(res.body)

      # Use an XPath expression to tell Nokogiri what to look for.
      doc.xpath('//link[@rel="shortcut icon"]').each do |tag|
        # Skip any tag that has no "href" attribute
        next if tag['href'].nil?

        # This is the contents of the "href" attribute, which we pass to Ruby's URI module for analysis
        taguri = URI(tag['href'])
        if taguri.host.to_s.empty?
          # There is no domain name in taguri. It's a relative URI!
          # So we have to join it with the index URL we built at the beginning of the method
          iconuri = URI.join(uri, taguri).to_s
        else
          # There is a domain name in taguri, so we're good
          iconuri = taguri.to_s
        end

        # Grab the icon and set the instance variables
        res = HTTParty.get(iconuri)
        if res.code == 200
          @base64 = Base64.encode64(res.body)
          @uri = iconuri
        end
      end
    end
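A side note on the relative-versus-absolute check: URI.join from Ruby's standard library already leaves absolute URLs alone and resolves relative ones against the base, so the branching could be collapsed into one call. It also copes with protocol-relative values such as //static.example.net/favicon.ico, which have a host but no scheme. Here is a small sketch of that idea. The resolve_icon_href helper and the example.com URLs are made up for illustration.

```ruby
require "uri"

# Hypothetical helper (not part of the original class): resolves the href
# from a <link> tag against the URL of the page it was found on.
# Returns nil when the href cannot be parsed as a URI.
def resolve_icon_href(page_url, href)
  URI.join(page_url, href.strip).to_s
rescue URI::InvalidURIError
  nil
end

page = "http://example.com/"

# A root-relative path is resolved against the page's host.
puts resolve_icon_href(page, "/images/favicon.png")
# An absolute URL passes through unchanged.
puts resolve_icon_href(page, "http://cdn.example.org/icon.ico")
# A protocol-relative URL picks up the page's scheme.
puts resolve_icon_href(page, "//static.example.net/favicon.ico")
```

Returning nil for unparseable values lets the caller skip a malformed tag instead of crashing the whole crawl.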
Now there’s one more thing to do before we’re done. The initialize method needs to be tweaked so it calls our newest method:
    def initialize(host)
      @host = host
      check_for_ico_file
      check_for_html_tag
    end
Now the class will check for the favicon.ico file first, then the HTML tag. If the HTML tag is present, it will take precedence, because the second method overwrites the instance variables set by the first.
Available as a Gist! For your convenience, the results of this tutorial are available as a GitHub Gist.
Using the Class
Now all you have to do is include the class with a require statement, and grab favicons.
    # Assumes the class was saved as favicon.rb next to this script
    require_relative "favicon"

    favicon = Favicon.new("arstechnica.com")
    puts favicon.uri    # Outputs http://static.arstechnica.net/favicon.ico
    puts favicon.base64 # Outputs a bunch of base64-encoded gibberish. More on this later
    puts
    puts favicon.host   # Outputs arstechnica.com
Now…what of that “base64-encoded gibberish?” It’s the perfect format for a little trick called Data URIs, which you can read all about over at CSS-Tricks. If you cache that base64 string somewhere, probably in a database, you can output it like so:
<img width="16" height="16" alt="favicon" src="" />
It will display like any other image, but won’t use an additional HTTP request, because the image data is already embedded on the page. This makes it perfect for a list of web sites with icons beside them. Instead of kicking off several HTTP requests for individual tiny images, you just embed them right in the page.
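To make the Data URI part concrete, here is a rough sketch of building the src value in Ruby. The helper names are my own, and the image/x-icon MIME type is an assumption that fits .ico files; a PNG icon would want image/png instead. I use Base64.strict_encode64 here because Base64.encode64 inserts a line break every 60 characters, and leaving those out keeps the attribute on a single line.

```ruby
require "base64"

# Hypothetical helper: wraps raw image bytes in a data URI.
# The default MIME type is an assumption suited to .ico files.
def favicon_data_uri(bytes, mime_type = "image/x-icon")
  "data:#{mime_type};base64,#{Base64.strict_encode64(bytes)}"
end

# Hypothetical helper: renders an <img> tag with the icon embedded in it.
def favicon_img_tag(bytes, mime_type = "image/x-icon")
  src = favicon_data_uri(bytes, mime_type)
  %(<img width="16" height="16" alt="favicon" src="#{src}" />)
end

# Demo with placeholder bytes standing in for real icon data.
puts favicon_img_tag("abc")
```

If you have already cached the output of Base64.encode64, calling delete("\n") on the string before embedding it should give the same result.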
If you’re unfortunate enough that you must support antique versions of Internet Explorer (version seven or prior) then you can’t use Data URIs, as they were not supported. However, all is not lost. You could conceivably adapt the class and have it write the image data to files on the server instead of base64-encoding them.
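As a starting point for that adaptation, here is a minimal sketch of writing the raw response body to disk. The save_favicon helper, the file-naming scheme, and the .ico extension are all assumptions of mine, not something the class above does.

```ruby
require "fileutils"

# Hypothetical helper: writes raw icon bytes to <dir>/<host>.ico
# and returns the path of the file it wrote.
def save_favicon(host, bytes, dir)
  FileUtils.mkdir_p(dir)
  # Replace anything that is not a letter, digit, dot, or hyphen
  # so the host name is safe to use as a file name.
  filename = host.gsub(/[^A-Za-z0-9.\-]/, "_") + ".ico"
  path = File.join(dir, filename)
  # binwrite avoids any newline or encoding translation of the image data.
  File.binwrite(path, bytes)
  path
end
```

Inside the class, you would call something like this with res.body in place of the Base64.encode64 step, then point your img tags at the saved file.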