Character encoding 101

As a software engineer, it is to all intents and purposes a requirement that one has at least a basic understanding of (and appreciation for) character encoding. On that basis, before we go any further I must implore you to read Joel Spolsky's article on Unicode.

I have recently been working on two projects that have encountered various issues associated with character encoding. I was naively oblivious to exactly what I needed to do in each case (as regards character encoding), so I went about learning more (and writing about it here).

Use Case

Recently I have been developing a social food photography application for web and mobile.

It was extremely important that the website and application were able to display characters from every different alphabet.

The intention is that the application will be utilised across the globe, and photographs will be submitted of local foods. As such, one can safely assume that the captions, tags, and location names associated with various photos may well utilise different character sets.

The application interacts with our backend through our API.

Our requirements are thus:

  • Characters from any alphabet are displayed correctly on the web
  • Characters from any language can be transmitted across the wire

As outlined in Joel's article, the requirement for displaying a particular character set on the web is that you declare the encoding, for example by setting the appropriate meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

For our API responses we simply specify the content type as application/json instead, with the charset set to UTF-8.

Within our iOS application we utilise the Alamofire networking library. As can be seen in this pull request, the library is built to respect the character encoding declared by the server. That is to say that specifying our charset as UTF-8 in our response allows Alamofire to appropriately decode the response and return it in the correct format to your callback.
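To make the idea of "respecting the charset returned by the server" concrete, here is a minimal sketch in Python. It is not Alamofire's implementation, just an illustration of the same principle: read the charset parameter from the Content-Type header and use it to decode the raw bytes. The function names charset_from_content_type and decode_body are my own, and falling back to UTF-8 when no charset is declared is an assumption.

```python
from email.message import Message


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns failobj if absent.
    return msg.get_content_charset(failobj=default)


def decode_body(body, content_type):
    """Decode raw response bytes using the charset the server declared."""
    return body.decode(charset_from_content_type(content_type))


caption = "Crème brûlée / 寿司"
print(decode_body(caption.encode("utf-8"), "application/json; charset=utf-8"))
```

If the server declares the wrong charset (or the client ignores it), the same bytes decode to different, garbled text, which is exactly the class of bug this header exists to prevent.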

Interestingly, as per the JSON specification, "JSON text SHALL be encoded in Unicode. The default encoding is UTF-8."
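As a rough illustration of what "JSON encoded as UTF-8" means on the wire, the following Python sketch serialises a payload to UTF-8 bytes and reads it back. The helper names and the sample photo dictionary are hypothetical and only exist for this example; they are not part of our API.

```python
import json


def encode_json_utf8(payload):
    """Serialise a payload to JSON text, then to UTF-8 bytes for the wire."""
    # ensure_ascii=False keeps non-ASCII characters as-is instead of \uXXXX escapes.
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


def decode_json_utf8(body):
    """Decode UTF-8 bytes from the wire and parse them as JSON."""
    return json.loads(body.decode("utf-8"))


# Hypothetical photo metadata containing characters from several scripts.
photo = {"caption": "寿司", "location": "København"}
wire = encode_json_utf8(photo)
print(decode_json_utf8(wire) == photo)
```

Either form (raw UTF-8 characters or \uXXXX escapes) is valid JSON; the raw form is simply more compact for non-Latin text.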

Domain Bites

Another project on which I have been working is Domain Bites - an application for keeping up to date with news pertaining to the domain name industry.

Whilst the majority of articles displayed through the app are written in English (using English character sets), it is not safe to assume that that will always be the case.

In addition to this, the content being displayed is being pulled from the RSS feeds of various industry blogs. Given this, we cannot guarantee how article data will be transmitted through the feed.

One issue that I encountered with this project was that a number of feeds were returning data containing HTML entity codes. For example &#8217; would be returned instead of an apostrophe. This obviously affects readability.

As outlined here, these codes are numeric character references: the number is a Unicode code point (see Joel's article), written in decimal as in &#8217; or in hexadecimal with an x prefix as in &#x2019;. In some cases entity names were also being returned, such as &nbsp; (these names are often used as they have the benefit of being somewhat more memorable).

** If you are interested, these codes are often utilised to make things explicit. For example, utilising the &nbsp; entity makes it patently obvious that a (non-breaking) space should be displayed. Some characters (for example < and >) need to be output in their entity format (&lt; and &gt;) because < and > have special meanings within HTML. You can view a full list of character entity references in the HTML5 specification here.

As these character references reference a particular Unicode code point, they are easily convertible to their Unicode counterpart. PHP for example provides the html_entity_decode function which does this for you.
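For readers who do not use PHP, the same conversion can be sketched in Python with the standard library's html.unescape. This is an equivalent illustration rather than the code used in Domain Bites, and the decode_entities wrapper name is my own.

```python
import html


def decode_entities(text):
    """Convert HTML character references (numeric and named) to Unicode characters."""
    return html.unescape(text)


# A decimal reference, as found in the feeds.
print(decode_entities("It&#8217;s a domain"))
# The hexadecimal form of the same code point, U+2019.
print(decode_entities("It&#x2019;s a domain"))
# Named entities are handled too.
print(decode_entities("&lt;b&gt;bold&lt;/b&gt;"))
```

Note that decoding &lt; and &gt; produces real angle brackets, so decoded text needs escaping again if it is later written back into HTML.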

Given that I store the pulled articles in my database I thought it interesting to also note that one can store raw data within most modern databases. With this particular project, I decode the received data utilising html_entity_decode prior to saving it in my (MySQL) database. The table in question utilises the UTF-8 character encoding and as such there is no reason not to store UTF-8 characters in their raw format. You'll save a negligible amount of space too!

As answered here, you can verify the encoding of a table in your (MySQL) database using the following. The Collation column of the output indicates the character set in use:

SHOW TABLE STATUS WHERE Name LIKE 'table_123';
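To put a number on that "negligible amount of space", here is a small Python sketch comparing the stored size of an entity-encoded string with its decoded form. The utf8_size helper and the sample headline are made up for illustration, and the sizes are of the UTF-8 bytes only, ignoring any per-row overhead the database adds.

```python
import html


def utf8_size(text):
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Hypothetical headline as it might arrive from a feed.
raw = "Today&#8217;s domain news"
decoded = html.unescape(raw)

# The reference &#8217; is 7 ASCII bytes; the character U+2019 is 3 bytes in UTF-8.
print(utf8_size("&#8217;"), utf8_size("\u2019"))
print(utf8_size(raw) - utf8_size(decoded))
```

The saving is a few bytes per reference, so the real benefit of storing decoded text is readability and simpler searching rather than disk space.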

Interesting other stuff..

Whilst writing this it crossed my mind that UTF-8 seems to be ideal for.. everything. Why would anyone utilise a different character encoding?

Well.. it definitely wasn't a stupid thought - it seems there is even a manifesto outlining exactly why UTF-8 should be utilised everywhere, and why it should be the default encoding type.

Wikipedia does outline a few disadvantages, but I am not convinced.

So..

As you can see, character encoding issues are not exactly tough to resolve. You do however need to be aware of them, especially if you rely on data from external sources, the format of which you cannot guarantee.

Had I not considered character encoding I would have limited the potential market for the product to a paltry 850 million people (source), and would have risked slightly annoying people trying to keep up to date with domain name news.

Given that it is so simple, you should definitely give consideration to it. It's kind of interesting too :)


Thomas Clowes


I am a 28 year old software engineer from the United Kingdom. During the day I build multi platform applications. In my spare time I eat food and run marathons. Sometimes I write angry tweets.