i18n: in theory

Internationalisation is the process of preparing your application so it can be delivered in another language or territory. In the context of software development, this refers to the process of translating the phrases you use to communicate in your application or service.

We deliberately exclude regulatory and compliance requirements from this: I'll assume that, if you are reading this, you are permitted to deliver your service in another jurisdiction.

Background

We are working in a TypeScript codebase, with React on the front end - standard stuff. We opted to make a single codebase deliver in multiple languages rather than developing an application per language. Our application compiles to a single website bundle, which runs in the browser.

Technical approach

The general approach will be to implement a function translate() which receives an identifier for a piece of text, and returns that piece of text in the correct language.

We store text in a dictionary file, dictionary.json:

{
	"en": {
		"hello": "World",
		"login": {
			"greeting": "Welcome to our website",
			"button": "Sign in"
		}
	}
}

To use the translate() function, we follow this pattern in our source code:

<button>{translate('login.button')}</button>

In addition to this, in the next chapter, we will also demonstrate:

Our dictionary file will be bundled with the application when it is delivered in the browser.

Where to begin

To follow this approach, you will need:

  1. A list of all phrases in your application
  2. A name for each phrase in your list
  3. A mechanism for identifying what language a user would prefer to use
  4. A function which takes the phrase-name and returns text in the correct language

A list of all phrases

Compiling this list can be aided by scripts, but the effort ultimately depends on the complexity of your application. Your goal should be to identify all places in which user-facing text is stored, and create a list of these locations. To do this, you might consider:

  1. Static analysis: configure ESLint to surface all string literals
  2. Regex: use search terms like (?<!=)".+" to find string literals
  3. AST analysis: use ts-morph to parse string literals programmatically
  4. Flushing: once most strings have been found, the remaining strings can be found by replacing all known strings with a known test value. Any location that does not show this test value has not been wired into the system.
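
As an illustration of the flushing idea, here is a minimal sketch that replaces every phrase in a nested dictionary with a test value. The names flushDictionary, Dictionary and the marker string are hypothetical, chosen for this example, and the dictionary shape is assumed to match the dictionary.json shown earlier:

```typescript
// A nested dictionary: leaves are phrases, branches are namespaces.
type Dictionary = { [key: string]: string | Dictionary }

// Return a copy of the dictionary with every phrase replaced by `marker`.
// The original dictionary is left untouched.
function flushDictionary(dictionary: Dictionary, marker: string): Dictionary {
	const flushed: Dictionary = {}
	for (const [key, value] of Object.entries(dictionary)) {
		flushed[key] = typeof value === 'string' ? marker : flushDictionary(value, marker)
	}
	return flushed
}

const sample: Dictionary = {
	hello: 'World',
	login: { greeting: 'Welcome to our website', button: 'Sign in' },
}

// Every phrase in flushedSample is now the marker string.
const flushedSample = flushDictionary(sample, '[[TRANSLATED]]')
```

Loading the flushed dictionary in place of the real one should make any text that does not read [[TRANSLATED]] stand out on screen as a string that still needs wiring in.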

When producing early iterations of this list, the name you give to the token does not matter as much as it will later in the process. Therefore, producing a list with some duplication or error is acceptable, so long as you do this programmatically. It would be useful to have a tool or script which can:

  1. Check each file in your repository for a best guess at "user facing text"
  2. Uniquely identify each of these strings
  3. Print a list of these identifier:phrase pairs to a file
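
A rough sketch of such a script, using the regex approach, might look like the following. It is a best guess rather than a complete scanner: it only finds double-quoted literals on a single line, it skips anything directly preceded by = (mirroring the lookbehind in the search term above), and it uses file:line:column as a provisional identifier. The names Candidate, findCandidates and formatCandidates are made up for this example:

```typescript
// One candidate phrase with a provisional, location-based identifier.
interface Candidate {
	id: string
	phrase: string
}

function findCandidates(fileName: string, source: string): Candidate[] {
	const candidates: Candidate[] = []
	source.split('\n').forEach((line, lineIndex) => {
		// Double-quoted literals only; a fresh regex per line keeps lastIndex local.
		const literal = /"([^"\\]*)"/g
		let match: RegExpExecArray | null
		while ((match = literal.exec(line)) !== null) {
			const phrase = match[1] ?? ''
			// Skip attribute values such as className="primary" and strings with no letters.
			const isAttribute = line.charAt(match.index - 1) === '='
			if (isAttribute || !/[A-Za-z]/.test(phrase)) continue
			candidates.push({ id: `${fileName}:${lineIndex + 1}:${match.index + 1}`, phrase })
		}
	})
	return candidates
}

// Render identifier:phrase pairs, one per line, ready to write to a file.
function formatCandidates(candidates: Candidate[]): string {
	return candidates.map(({ id, phrase }) => `${id}: ${JSON.stringify(phrase)}`).join('\n')
}

const sampleSource = [
	'<button className="primary">{"Sign in"}</button>',
	'<p>{"Welcome to our website"}</p>',
].join('\n')

const report = formatCandidates(findCandidates('LoginForm.tsx', sampleSource))
```

Single-quoted strings, template literals and bare JSX text are not covered here, which is one reason an AST-based pass is likely to be more reliable on a real codebase.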

One approach could be to parse your application as an Abstract Syntax Tree and ask your parser to identify each occurrence of a string literal. The identifier can be the location of the string in the application tree:

App.HomeComponent.LoginForm.LoginButton: "Login"
App.HomeComponent.LoginForm.TextBox: "Click the button to login"

This name does not have to be especially human readable in the first iteration, but should at least be understood by developers working on the codebase.

One issue in the above snippet is that we could envisage adding another TextBox to this LoginForm - how do we differentiate between these two instances of TextBox?

Additionally, if you have two components in different parts of your syntax tree which share the same string literal, you will encode this string literal twice in your list. This may be acceptable, as the difference in context might result in a different translation.

However, you will need to do additional work to deduplicate the dictionary if this is not acceptable. One approach might be to hoist all shared translations to a common root in the dictionary file. However, this has drawbacks: now some of your tokens do not expressly follow the naming convention used by the others.
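
Before deciding whether to hoist anything, it may help to see which phrases are actually shared. The sketch below groups identifiers by phrase and keeps only the phrases that appear more than once. The function name findDuplicatePhrases and the App.Header.LoginLink token are hypothetical, added for illustration:

```typescript
// Flat identifier -> phrase list, as produced by a first-pass scan.
type PhraseList = Record<string, string>

function findDuplicatePhrases(phrases: PhraseList): Map<string, string[]> {
	const idsByPhrase = new Map<string, string[]>()
	for (const [id, phrase] of Object.entries(phrases)) {
		const ids = idsByPhrase.get(phrase) ?? []
		ids.push(id)
		idsByPhrase.set(phrase, ids)
	}
	// Keep only phrases that appear under more than one identifier.
	const duplicates = new Map<string, string[]>()
	idsByPhrase.forEach((ids, phrase) => {
		if (ids.length > 1) duplicates.set(phrase, ids)
	})
	return duplicates
}

const scanned: PhraseList = {
	'App.HomeComponent.LoginForm.LoginButton': 'Login',
	'App.HomeComponent.LoginForm.TextBox': 'Click the button to login',
	'App.Header.LoginLink': 'Login', // hypothetical second use of the same phrase
}

// Maps 'Login' to the two identifiers that share it.
const duplicatePhrases = findDuplicatePhrases(scanned)
```

Each group can then be reviewed by hand: keep both tokens where the context differs enough to change the translation, and hoist only where it does not.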

When supplying this list to your translator, it might be useful to have more context than TextBox - especially when there are multiple instances of a child component.

A name for each phrase in your list

To translate this list, you will need to supply it to a translator. To help them translate joyfully, it is useful for your translator to understand the context in which each phrase is used.

A useful way to think of this name is as a post code. You want to convey some information about where and what this phrase is, just by looking at the token name:

App.HomeComponent.LoginForm.LoginButton: "Login"
App.HomeComponent.LoginForm.TextBox: "Click the button to login"

As mentioned in the previous section, this naming scheme ("follow the markup structure") can run into issues when you have duplicate child components. In addition, if you have a large codebase or deeply nested markup, you may end up with token names that are extremely long.

Some approaches that may be useful in dealing with this are:

A mechanism for identifying what language a user would prefer to use

The browser can offer some insight into which language a user prefers. Additionally, you can expose this as a user-level setting and store it against the user's preferences. Returning this on user login, or in another API response early in your application's lifecycle, means you can reliably use it to choose a display language.

When determining which language a user would like to see, you can fall back from the most reliable source to the least reliable, according to your specifications. In our case, it made sense to store this against the user's organisation. We prefer this as the source of truth, and if it doesn't exist we ask the browser, using the navigator.languages property:

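function getLanguage (userID: string): readonly string[] {
	// Prefer the language codes configured on this user's organisation
	const organisation = getUserOrganisation(userID)
	// If none are configured, fall back to the browser's ordered preference list
	return organisation?.features?.usableLanguageCodes ?? navigator.languages
}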
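
Both sources yield a list of language codes, so one more step is needed to pick a single language that the dictionary actually contains. A minimal sketch, assuming a hypothetical pickLanguage helper and assuming that a regional code such as en-GB may fall back to its base language en:

```typescript
// Pick the first preferred language that the dictionary supports.
// `preferred` is ordered from most to least wanted; `fallback` is used when nothing matches.
function pickLanguage(
	preferred: readonly string[],
	supported: readonly string[],
	fallback: string,
): string {
	for (const tag of preferred) {
		if (supported.includes(tag)) return tag
		// "en-GB" falls back to its base language "en" when only that is available.
		const base = tag.split('-')[0] ?? tag
		if (supported.includes(base)) return base
	}
	return fallback
}

const supportedLanguages = ['en', 'fr']

const forCanadianFrench = pickLanguage(['fr-CA', 'en-GB'], supportedLanguages, 'en') // 'fr'
const forGerman = pickLanguage(['de'], supportedLanguages, 'en') // 'en', via the fallback
```

Whether the regional fallback is appropriate depends on your languages: for some pairs the regional variants differ enough that you may prefer to treat them as separate dictionary entries.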

A function which takes the phrase-name and returns text in the correct language

In its most basic form, this function does the following:

  1. Receives a token name.
  2. Uses this token name to search the dictionary for a phrase.
  3. Returns the phrase associated with this token.
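
The three steps above can be sketched as follows. This is one possible shape rather than a definitive implementation: the dictionary is inlined instead of imported from dictionary.json, the language parameter and its 'en' default are assumptions, and returning the token itself when no phrase is found is a choice made for this example:

```typescript
// A nested dictionary: leaves are phrases, branches are namespaces.
type Dictionary = { [key: string]: string | Dictionary }

// Inlined for the example; in practice this would be loaded from dictionary.json.
const dictionary: Dictionary = {
	en: {
		hello: 'World',
		login: { greeting: 'Welcome to our website', button: 'Sign in' },
	},
}

function translate(token: string, language: string = 'en'): string {
	// Start at the root of the requested language, then walk one path segment at a time.
	let node: string | Dictionary | undefined = dictionary[language]
	for (const part of token.split('.')) {
		// Missing language, missing key, or a phrase reached too early: give up.
		if (node === undefined || typeof node === 'string') return token
		node = node[part]
	}
	// Only a string leaf counts as a phrase; a namespace such as 'login' does not.
	return typeof node === 'string' ? node : token
}

const buttonLabel = translate('login.button') // 'Sign in'
const missing = translate('login.unknown') // 'login.unknown'
```

Returning the token on a miss keeps the interface usable and makes gaps visible on screen; throwing an error or falling back to a default language are equally reasonable alternatives.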

According to your needs, you may have additional requirements:

In addition to this, you can supply some syntactic sugar to make the usage of this function more manageable by using a hook:

const { t } = useTranslate('prefix.goes.here')

This is useful because you can take the shared portions of a token name for a set of tokens in a file, and supply that shared root to the translate function.

Then, each additional usage of a specific token has the prefix applied to it. This saves on some overhead, allowing you to do the following:

// following on from the above example

<Home>{t('tokenName')}</Home>

instead of:

<Home>{t('prefix.goes.here.tokenName')}</Home>
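
The prefixing itself needs very little code. The sketch below is a plain function shaped like a hook, with a stand-in translate() and a flat phrase table so that it runs on its own; a real hook would probably also read the current language from React context or state, which is left out here. The phrase table and the app.home tokens are hypothetical:

```typescript
// Stand-in phrase table and lookup, so this sketch is self-contained.
const phrases: Record<string, string> = {
	'app.home.greeting': 'Welcome!',
	'app.home.signOut': 'Sign out',
}

function translate(token: string): string {
	return phrases[token] ?? token
}

// Returns a `t` function that prepends the shared prefix to every token name.
function useTranslate(prefix: string): { t: (name: string) => string } {
	return { t: (name) => translate(`${prefix}.${name}`) }
}

const { t } = useTranslate('app.home')

const greeting = t('greeting') // looks up 'app.home.greeting'
const signOut = t('signOut') // looks up 'app.home.signOut'
```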

What next?

When you have the requisite parts in place, you can begin the following refactor:

function App () {
	return <Home>Welcome!</Home>
}

becomes:

function App () {
	// Hooks must be called inside the component body
	const { t } = useTranslate('app.home')
	return <Home>{t('greeting')}</Home>
}

Every place where you have raw text in a file now needs to call your translation function.

This refactor is mechanically intensive, as you will have to edit every file in which you have user-facing text. We discuss doing this in the next chapter.