Finding a Website’s Favicon with Ruby

For a project I’ve been working on, I wanted to have my Sidekiq worker (which is part of an RSS crawler) discover the favicon for a web site and cache it for later display. It was fun figuring out a way to do this, so I just had to share.

A Brief History of Favicons

Favicons, or “shortcut icons,” can be defined in multiple ways. Like all too many things in web design, browsers handle them in slightly different and mildly incompatible ways, meaning there’s plenty of redundancy. Favicons came to be when Microsoft added them to Internet Explorer 5 in 1999, implementing a feature where the browser would check the server for a file named favicon.ico and display it in certain parts of the UI. The following year, the W3C published a standard method for defining a favicon. Rather than simply having the browser look for a file in the root directory, an HTML document should specify a file in the header with a <link> tag, just like with stylesheets.

Fast forward to the present, and you have a bit of screwiness.

  • All major web browsers check for the link tag first, and fall back to favicon.ico if it’s not found.
  • You can define multiple icons in the HTML header. You can have ICO/PNG/GIF formats, as well as different sizes.
  • Some browsers support larger 32×32 favicons, while others will only use the 16×16 ones. Chrome for Mac prefers the 32×32 ones, and scales them down to 16×16 on Macs without Retina displays.
  • Big Bad Internet Explorer only supports ICO files for favicons, not PNGs.

The most compatible way to set up your favicon is to define both 32×32 and 16×16 icons in your header, using the PNG format, and also save a 16×16 ICO version named “favicon.ico” in your web root. Browsers that play nicely will use the PNG ones in whatever dimensions they prefer, and IE will fall back to the ICO file.

Writing the Class

Now that the history lesson is out of the way, you can see why there’s a little bit of a challenge here. Depending on how badly you want to find and display that icon, you may have to write logic for the different methods. For this tutorial, I will focus on two: the simplest, which is looking to see if there’s a favicon.ico, and a basic implementation of checking for a link tag defining a shortcut icon.

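Before we do anything else, we need to install a few dependencies: the httparty and nokogiri gems. (base64 comes with Ruby, although from Ruby 3.4 onward it is a bundled gem that you may also need to list in your Gemfile.) Either add them to your Gemfile and do a bundle install, or use the gem install command to install them manually.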

Now require the necessary libraries at the top of a new Ruby file and we can get going.

require "httparty"
require "nokogiri"
require "base64"

We can define a class to make a nice, clean interface for this to keep it modular and easier to reuse. As you can see below, I’ve made a Favicon class and added some accessors for instance variables, as well as an initialize method that assigns the parameter it receives to the @host instance variable before calling the method we will be defining next.

require "httparty"
require "nokogiri"
require "base64"

class Favicon
  attr_reader :host, :uri, :base64

  def initialize(host)
    @host = host
    check_for_ico_file
  end
end

We’ll be implementing the simplest part first. The check_for_ico_file method will send an HTTP GET request to /favicon.ico on the server specified in @host and check to see if a file exists. (The server will send a 200 OK response if it does, and a 404 Not Found error otherwise.) If it does, the URL will be saved to an instance variable and the icon file’s contents will be base64 encoded before being saved to an instance variable as well.

The HTTParty gem is great for this, since it drastically simplifies simple HTTP requests like this.

# Check /favicon.ico
def check_for_ico_file
  uri = URI::HTTP.build({:host => @host, :path => '/favicon.ico'}).to_s
  res = HTTParty.get(uri)
  if res.code == 200
    @base64 = Base64.encode64(res.body)
    @uri = uri
  end
end
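
One caveat: a 200 response doesn’t guarantee that an image came back, since some servers answer every unknown path with an HTML page and a 200 status. Here is a minimal sketch of a sanity check, assuming you only care about ICO, PNG, and GIF files. The looks_like_icon? method is a hypothetical helper for illustration, not part of the class above.

```ruby
# Hypothetical helper (not part of the Favicon class): checks whether a
# response body starts with the signature bytes of an ICO, PNG, or GIF file.
def looks_like_icon?(body)
  signatures = [
    "\x00\x00\x01\x00".b,   # ICO: reserved word 0, then image type 1
    "\x89PNG\r\n\x1A\n".b,  # PNG: fixed 8-byte signature
    "GIF87a".b,             # GIF, 1987 version
    "GIF89a".b              # GIF, 1989 version
  ]

  data = body.to_s.b        # compare raw bytes, whatever the string's encoding
  signatures.any? { |signature| data.start_with?(signature) }
end

puts looks_like_icon?("\x00\x00\x01\x00\x01\x00".b)
#Outputs true

puts looks_like_icon?("<!DOCTYPE html><html></html>")
#Outputs false
```

Inside check_for_ico_file, the condition could then become res.code == 200 && looks_like_icon?(res.body).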

If you want, you could go ahead and instantiate the class to try out what we have so far. If you pass it the domain name of a site that uses the /favicon.ico convention, the object should find it without issue.

favicon = Favicon.new("arstechnica.com")

puts favicon.uri
#Outputs http://arstechnica.com/favicon.ico

puts favicon.base64
#Outputs a bunch of base64-encoded gibberish. More on this later

puts favicon.host
#Outputs arstechnica.com

Now let’s handle link tags! The process for that is a little bit more in-depth. First we need to request a web page from the server, such as the index page, and parse it for tags that resemble <link rel="shortcut icon" href="..." />. Then we have to evaluate the contents of href to make sure it’s an absolute URL, and prepend the domain name if it is not. After that, we can finally make a request to get the icon itself and save it.

Still with me? Excellent, now here’s the code to do that. I’ll comment it a little more thoroughly, since it looks messier at a glance.

# Check "shortcut icon" tag
def check_for_html_tag

  # Load the index page with HTTParty and pass the contents to Nokogiri for parsing
  uri = URI::HTTP.build({:host => @host, :path => '/'}).to_s
  res = HTTParty.get(uri)
  doc = Nokogiri::HTML(res.body)

  # Use an xpath expression to tell Nokogiri what to look for.
  doc.xpath('//link[@rel="shortcut icon"]').each do |tag|

    # This is the contents of the "href" attribute, which we pass to Ruby's URI module for analysis
    taguri = URI(tag['href'])

    if taguri.absolute?
      # taguri has a scheme (http or https), so it's already a full URL
      iconuri = taguri.to_s
    else
      # taguri is relative, or protocol-relative (starting with "//")
      # So we have to join it with the index URL we built at the beginning of the method
      iconuri = URI.join(uri, taguri).to_s
    end

    # Grab the icon and set the instance variables
    res = HTTParty.get(iconuri)
    if res.code == 200
      @base64 = Base64.encode64(res.body)
      @uri = iconuri
    end
    
  end

end
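
If the relative-URL handling is hard to picture, here is the joining step on its own, using only Ruby’s standard library. The example.com addresses are made up for illustration, and resolve_icon_href is a hypothetical helper rather than something the class defines.

```ruby
require "uri"

# Hypothetical helper: resolve an href found in a page against that page's URL.
def resolve_icon_href(page_url, href)
  URI.join(page_url, href).to_s
end

page = "http://example.com/"

puts resolve_icon_href(page, "/images/favicon.png")
#Outputs http://example.com/images/favicon.png

puts resolve_icon_href(page, "favicon.ico")
#Outputs http://example.com/favicon.ico

puts resolve_icon_href(page, "//cdn.example.com/icon.ico")
#Outputs http://cdn.example.com/icon.ico

puts resolve_icon_href(page, "https://cdn.example.com/icon.ico")
#Outputs https://cdn.example.com/icon.ico
```

Because URI.join returns an absolute href unchanged, the same call covers both branches of the check in check_for_html_tag.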

Now there’s one more thing to do before we’re done. The initialize method needs to be tweaked so it calls our newest method:

def initialize(host)
  @host = host
  check_for_ico_file
  check_for_html_tag
end

Now the class will check for the favicon.ico file first, then the HTML tag. If the HTML tag is present, it will take precedence.
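
Keep in mind that the XPath expression in check_for_html_tag only matches a rel attribute that is exactly "shortcut icon". Plenty of pages write rel="icon" on its own or use different capitalization, since rel is a space-separated, case-insensitive list of keywords. Here is a rough sketch of a more forgiving test. The icon_rel? method is a hypothetical helper for illustration, not part of the class.

```ruby
# Hypothetical helper: true when a rel attribute value contains the "icon" keyword.
def icon_rel?(rel)
  rel.to_s.downcase.split.include?("icon")
end

puts icon_rel?("shortcut icon")
#Outputs true

puts icon_rel?("Icon")
#Outputs true

puts icon_rel?("stylesheet")
#Outputs false
```

You could then loop over doc.xpath('//link[@rel]') and skip any tag where icon_rel?(tag['rel']) is false.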

Available as a Gist! For your convenience, the results of this tutorial are available as a GitHub Gist.

Using the Class

Now all you have to do is include the class with a require statement, and grab favicons.

require_relative "favicon"

favicon = Favicon.new("arstechnica.com")

puts favicon.uri
#Outputs http://static.arstechnica.net/favicon.ico

puts favicon.base64
#Outputs a bunch of base64-encoded gibberish. More on this later

puts favicon.host
#Outputs arstechnica.com

Now…what of that “base64-encoded gibberish?” It’s the perfect format for a little trick called Data URIs, which you can read all about over at CSS-Tricks. If you cache that base64 string somewhere, probably in a database, you can output it like so:

<img width="16" height="16" alt="favicon" src="data:image/x-icon;base64,BASE64_STRING_HERE" />

It will display like any other image, but won’t use an additional HTTP request, because the image data is already embedded on the page. This makes it perfect for a list of web sites with icons beside them. Instead of kicking off several HTTP requests for individual tiny images, you just embed them right in the page.
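
Building that src value from the cached string takes very little Ruby. This is a sketch under a couple of assumptions: the MIME type defaults to image/x-icon (a PNG icon would need image/png instead), and favicon_data_uri is a hypothetical helper. It strips whitespace because Base64.encode64, which the class uses, inserts a line feed after every 60 encoded characters.

```ruby
require "base64"

# Hypothetical helper: turn a cached base64 string into a data URI.
# The default MIME type is an assumption; pass "image/png" for PNG icons.
def favicon_data_uri(encoded, mime_type = "image/x-icon")
  "data:#{mime_type};base64,#{encoded.gsub(/\s+/, '')}"
end

encoded = Base64.encode64("abc")
puts favicon_data_uri(encoded)
#Outputs data:image/x-icon;base64,YWJj
```

The returned string goes straight into the src attribute of the img tag.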

If you’re unfortunate enough that you must support antique versions of Internet Explorer (version seven or prior) then you can’t use Data URIs, as they were not supported. However, all is not lost. You could conceivably adapt the class and have it write the image data to files on the server instead of base64-encoding them.
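
As a starting point for that adaptation, here is a minimal sketch that decodes the cached string and writes it to disk. The write_favicon_file name, the public/favicons directory, and the .ico extension are all assumptions made for illustration; adjust them to match your app.

```ruby
require "base64"
require "fileutils"

# Hypothetical helper: decode a cached base64 string and write it to a file.
# The default directory and the ".ico" extension are assumptions.
def write_favicon_file(host, encoded, dir = "public/favicons")
  FileUtils.mkdir_p(dir)                          # create the folder if needed
  path = File.join(dir, "#{host}.ico")
  File.binwrite(path, Base64.decode64(encoded))   # write raw bytes, not text
  path
end
```

Calling write_favicon_file(favicon.host, favicon.base64) returns the path of the new file, which you can then point an ordinary img tag at.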