An introduction to font fallback. Primarily aimed at people working on Google Fonts. The reader is assumed to have basic familiarity with Font 101.
Imagine we offer email, chat, or any other service where user-entered text can be displayed. Users will probably expect more than basic Latin to work. We should strive to display any valid sequence of Unicode codepoints correctly.
We typically get a clue to the language from at least one of app/web developer, browser, or operating system.
So, given (language, codepoint sequence), we need to be able to produce something sufficient to render the text.
We take that to mean breaking the input text into one or more runs of text that are to be drawn with a specific font.
The goal of this document is to explain enough of the rudiments of this problem to build a toy solution.
Raph’s talk on Android Typography (YouTube) is well worth a watch for context.
It is implausible for a single font, limited to 65,535 glyphs, to support all the world’s languages. We’re going to need a bunch of fonts. If we have multiple fonts we’ll also need rules for choosing which one should be used to render a given unit of text. Let’s call (fonts, rules) a font configuration.
We can now specify our problem a bit more concretely:
Input: (font configuration, language, codepoint sequence)
Output: sequence of (font, codepoint sequence)
Building an entirely new set of fonts for most or all of Unicode is a big job. Thankfully Android is open source and has both a configuration (fonts.xml) and a set of open source fonts.
We could just use Android’s entire text stack but that wouldn’t leave us anything to play with!
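To get a feel for what a font configuration looks like in code, here is a minimal sketch that loads families from XML. The SAMPLE string is a hand-written, simplified excerpt in the style of fonts.xml, not a copy of the real file, and load_families is a hypothetical helper name. The real file has more attributes and elements than this sketch reads.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified excerpt in the style of Android's fonts.xml.
# This is an assumption for illustration; the real file is much larger.
SAMPLE = """
<familyset>
  <family name="sans-serif">
    <font weight="400" style="normal">Roboto-Regular.ttf</font>
  </family>
  <family lang="und-Arab">
    <font weight="400" style="normal">NotoNaskhArabic-Regular.ttf</font>
  </family>
  <family lang="ja">
    <font weight="400" style="normal">NotoSansCJK-Regular.ttc</font>
  </family>
</familyset>
"""


def load_families(xml_text):
    """Return the families in document order; order matters for fallback."""
    families = []
    for family in ET.fromstring(xml_text).iter("family"):
        families.append({
            "name": family.get("name"),  # None for unnamed fallback families
            "lang": family.get("lang"),  # None when no language is declared
            "fonts": [(font.text or "").strip() for font in family.iter("font")],
        })
    return families


for entry in load_families(SAMPLE):
    print(entry)
```

Keeping the families in a list rather than a dictionary preserves their order, which we will need when choosing between fonts.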
It is tempting to think of codepoint as meaning a user perceived character. Unfortunately this isn’t at all true:
Ìṣọ̀lá
is 5 user perceived characters, 6 codepoints
ọ̀
is two codepoints: LATIN SMALL LETTER O WITH DOT BELOW followed by COMBINING GRAVE ACCENT.
Our results will be much better if we ensure that each user-perceived character comes entirely from a single font. “User-perceived character” is clumsy, so let’s use “grapheme” (“The smallest meaningful contrastive unit in a writing system.”, Oxford) or “grapheme cluster.”
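The codepoint count is easy to check from Python using only the standard library. This small sketch spells the name with explicit escapes, assuming the encoding described above (a precomposed o with dot below plus a separate combining grave), so the result does not depend on how an editor normalizes the text.

```python
import unicodedata

# The name written with explicit escapes so the encoding is unambiguous.
name = "\u00cc\u1e63\u1ecd\u0300l\u00e1"

# len() counts codepoints, not user-perceived characters.
print(len(name))

# List each codepoint with its Unicode character name.
for ch in name:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```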
That means we want to iterate over the grapheme clusters and pick the best font for each cluster. It’s easy to loop over the codepoints in a string. Looping over graphemes is harder. Thankfully Unicode has a detailed description of how to approach this in Annex #29 “Unicode Text Segmentation” (tr29). Even better, International Components for Unicode (ICU, http://icu-project.org/) provides an implementation.
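To show the difference between looping over codepoints and looping over clusters, here is a rough standard-library sketch. It is not an implementation of tr29: it only glues combining marks (Unicode general category M*) onto the preceding character, which is enough for the example above but wrong for many scripts, emoji sequences, and Hangul. The function name approximate_clusters is hypothetical; for real work use ICU.

```python
import unicodedata


def approximate_clusters(text):
    """Split text into rough grapheme clusters.

    Assumption: a cluster is a base character plus any following
    combining marks. This ignores most of the rules in tr29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


name = "\u00cc\u1e63\u1ecd\u0300l\u00e1"
clusters = approximate_clusters(name)
print(len(name), "codepoints,", len(clusters), "clusters")
```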
We now have enough to start thinking about implementing a fallback system. Pick a programming language and write a toy implementation! A few tips and reminders:
Read fonts.xml carefully: the prioritization of fonts (first match by lang, then by order) is critical.
ICU provides BreakIterator, making it easy to loop over grapheme clusters in our input text.
# PyICU can be grumpy about installation; this worked for me on Mac using Homebrew in a py3 venv
brew install icu4c
export PATH="/usr/local/opt/icu4c/bin:$PATH"
pip install pyicu
That should be enough to implement what we wanted at the beginning:
Input: (font configuration, language, codepoint sequence)
Output: sequence of (font, codepoint sequence)
Note that this is over-simplified but perhaps enough to give some feel for the problem.
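Putting the pieces together, one possible toy implementation is sketched below. Everything in it is an assumption made for illustration: the font names and codepoint ranges in TOY_CONFIG are invented (a real version would read coverage from the font files listed in fonts.xml), language matching is a plain string comparison, and clusters are approximated by attaching combining marks to the preceding character instead of using ICU.

```python
import unicodedata
from itertools import groupby

# Invented configuration: font names and coverage ranges are hypothetical.
TOY_CONFIG = [
    {"font": "ToyLatin", "lang": None,
     "ranges": [(0x0020, 0x024F), (0x0300, 0x036F), (0x1E00, 0x1EFF)]},
    {"font": "ToyJapanese", "lang": "ja",
     "ranges": [(0x3040, 0x30FF), (0x4E00, 0x9FFF)]},
    {"font": "ToyChinese", "lang": "zh",
     "ranges": [(0x4E00, 0x9FFF)]},
]


def clusters(text):
    """Rough grapheme clusters: base character plus combining marks."""
    out = []
    for ch in text:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def covers(family, cluster):
    """True if the family covers every codepoint in the cluster."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in family["ranges"])
               for ch in cluster)


def pick_font(config, lang, cluster):
    """First match by lang, then by order; None if nothing covers it."""
    by_lang = [f for f in config if f["lang"] == lang]
    rest = [f for f in config if f["lang"] != lang]
    for family in by_lang + rest:
        if covers(family, cluster):
            return family["font"]
    return None


def fallback_runs(config, lang, text):
    """Return a list of (font, text) runs; adjacent clusters that
    picked the same font are merged into one run."""
    picks = [(pick_font(config, lang, c), c) for c in clusters(text)]
    return [(font, "".join(c for _, c in group))
            for font, group in groupby(picks, key=lambda p: p[0])]


text = "\u00cc\u1e63\u1ecd\u0300l\u00e1 \u6f22\u5b57"
for lang in ("zh", "ja", "en"):
    print(lang, fallback_runs(TOY_CONFIG, lang, text))
```

Both toy CJK fonts claim the same ideographs, so the language decides between them; when the language matches neither, order in the configuration decides. Whole clusters are tested against a font, which keeps a base character and its combining marks in the same run.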