More people than you might think have to deal with data in multiple languages. Work in Scotland? English and Gaelic are on your agenda. You’ve built a database to support English — and then it struggles when someone has an accent in their name. Transliterations from other scripts are often not consistent.
Here is one group’s workshopped list of the key issues people need to consider.
Technical considerations
Encoding standards
Unicode / UTF-8: you can say every character is a number, but some characters are more than a single letter, such as accented characters and ligatures. The problem is even more pronounced in Arabic and Chinese. It really matters if you want to make text consistently searchable.
These are complicated, cursed and political. There’s an American evangelical organisation that gets to decide on language standards…
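To make the searchability point concrete, here is a minimal Python sketch using the standard library's `unicodedata` module. The helper name `search_key` is hypothetical, and stripping accents is an assumption that suits loose name matching in some languages but is lossy and wrong for others.

```python
import unicodedata

def search_key(text: str) -> str:
    """Build a loose comparison key (hypothetical helper, lossy by design)."""
    # NFKD splits accented letters into base letter + combining mark,
    # and expands compatibility characters such as ligatures.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (accents) that NFKD separated out.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive, language-neutral lower().
    return stripped.casefold()

composed = "Zo\u00eb"      # "ë" stored as one code point
decomposed = "Zoe\u0308"   # "e" followed by a combining diaeresis

print(len(composed), len(decomposed))   # 3 4
print(composed == decomposed)           # False: they look identical but differ
print(search_key(composed) == search_key(decomposed))  # True
print(search_key("\ufb01sh"))           # the "fi" ligature becomes two letters: fish
```

If the accents need to be preserved, normalising both sides to the same form (for example `unicodedata.normalize("NFC", text)`) is the gentler option: it makes the two spellings of "Zoë" compare equal without discarding anything.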
Text direction
Right to left, left to right, top to bottom, and more…
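As a rough illustration, Unicode assigns every character a bidirectional class, and a common heuristic is to take the direction of the first "strong" character in a string. The sketch below assumes that heuristic is good enough; the function name `first_strong_direction` is hypothetical. It says nothing about top-to-bottom text, which is a layout decision rather than a character property.

```python
import unicodedata

# Bidirectional classes for strong right-to-left characters:
# "R" covers scripts such as Hebrew, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}

def first_strong_direction(text: str) -> str:
    """Guess base direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    # Digits, spaces and punctuation carry no strong direction.
    return "neutral"

print(first_strong_direction("hello"))                      # ltr
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd"))   # Hebrew "shalom": rtl
print(first_strong_direction("123"))                        # neutral
```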
Orthography
Same language, different alphabet
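Serbian is one example: the same word can be written in Cyrillic or Latin letters. A crude way to tell which alphabet a string uses is to look at the first word of each character's Unicode name. This is a sketch under that assumption, not a proper script detector, and `dominant_script` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter

def dominant_script(text: str) -> str:
    """Guess the main script from Unicode character names (rough heuristic)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Names look like "CYRILLIC SMALL LETTER DE" or "LATIN SMALL LETTER D".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

cyrillic = "\u0411\u0435\u043e\u0433\u0440\u0430\u0434"  # "Belgrade" in Serbian Cyrillic
latin = "Beograd"                                        # the same word in Serbian Latin

print(dominant_script(cyrillic))  # CYRILLIC
print(dominant_script(latin))     # LATIN
```

Knowing the script is only the first step: a search for one spelling will still miss the other unless the data is also transliterated or indexed in both forms.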
Cursive text
It can be hard for humans to read easily — let alone machines.
Human factors
Language structure
- What needs to be captured
- Names
- Constructed languages (from Esperanto to Elvish to Klingon…)
Some languages are so different from one another that it’s challenging to make direct translations. It’s difficult to find matching elements of a sentence. The verb might not be what gives you a sense of when something is happening in time. Mandarin, for example, does not mark tense on its verbs and generally does not mark nouns as singular or plural.
Audio tones
The way people say things can have an impact on meaning.
Code-switching
Changing the way you speak, your vocabulary and so on, to fit in with a group you perceive as more dominant in a given social situation.
Language shifts
Language changes over time, and the meaning of words drifts.
Closed language practices
Some communities use a language to make sure out-group people don’t understand them, and will change elements that are discovered.
Variant distinctions
Do you fancy eggplant or aubergine for dinner? It depends on where you are…
Slang
Informal usage, which again tends to shift over time.
Sociolects
Variants by social group, as opposed to dialects, which are variants by location.
Non-verbal/non-written language
Lingua Franca Mix
Mixing words from different languages to create distinct language variants.
What is a language?
For some people, it’s an ISO 639 code; for others, it’s something people speak. It depends on where and how you need to draw a line.