Managing multilingual datasets: what you need to consider

More people than you might think have to deal with data in multiple languages. Work in Scotland? English and Gaelic are on your agenda. You’ve built a database to support English, and then it struggles when someone has an accent in their name. Transliterations from other scripts are often inconsistent, so the same name can end up with several spellings.

Here is one group’s workshopped list of the key issues people need to consider.

Technical considerations

Encoding standards

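Unicode / UTF-8: you can say every character is a number, but what a reader sees as one letter is not always one number. An accented letter can be stored as a single code point or as a base letter plus a combining accent, and a ligature packs several letters into one code point. It’s even more pronounced in Arabic and Chinese. It really matters if you want to make text consistently searchable.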
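To make the searchability point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The `search_key` helper is a hypothetical name, not something from the workshop, and stripping accents is an assumption that suits name matching in some languages and is plainly wrong in others, where the accent makes it a different letter.

```python
import unicodedata

def search_key(text: str) -> str:
    """Build a lossy comparison key (hypothetical helper, for matching only)."""
    # NFKD splits accented letters into base letter + combining mark and
    # expands compatibility characters such as the "fi" ligature (U+FB01).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks. This throws information away, so keep the
    # original text and store the key alongside it.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger lower() intended for caseless comparison.
    return stripped.casefold()

composed = "Zo\u00eb"      # "ë" stored as one code point
decomposed = "Zoe\u0308"   # "e" followed by a combining diaeresis

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(search_key(composed) == search_key("zoe"))             # True
print(search_key("\ufb01le"))                                # file
```

Normalising to one form (NFC is a common choice) when data comes in is usually safe. Folding accents away is a search-time decision that should be made per language.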

These standards are complicated, cursed and political. There’s an American evangelical organisation that gets to decide on language standards…

Text direction

Right to left, left to right, top to bottom, and more…
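If you need to record direction alongside the text, one rough approach is to look at the first strongly directional character. The sketch below assumes that heuristic is good enough for your data. `base_direction` is a made-up helper, and it says nothing about vertical scripts or mixed-direction text, which need a full bidirectional algorithm.

```python
import unicodedata

# Bidirectional classes for strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}

def base_direction(text: str) -> str:
    """Guess direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return "rtl"
        if bidi == "L":
            return "ltr"
    # Only digits, punctuation or whitespace, or an empty string.
    return "neutral"

print(base_direction("Hello"))                           # ltr
print(base_direction("\u05e9\u05dc\u05d5\u05dd"))        # rtl (Hebrew)
print(base_direction("\u0645\u0631\u062d\u0628\u0627"))  # rtl (Arabic)
print(base_direction("123"))                             # neutral
```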

Orthography

Same language, different alphabet
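Serbian, for instance, is written in both Cyrillic and Latin script, so a language label alone does not tell you which alphabet a record uses. Python's standard library does not expose the Unicode Script property, so the sketch below falls back on the first word of each character's Unicode name. That is an assumption that works for common scripts but is only a rough stand-in, and `guess_script` is a hypothetical helper.

```python
import unicodedata
from collections import Counter

def guess_script(text: str) -> str:
    """Guess the dominant script from Unicode character names (heuristic)."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"

print(guess_script("Beograd"))                                     # LATIN
print(guess_script("\u0411\u0435\u043e\u0433\u0440\u0430\u0434"))  # CYRILLIC
```

Storing the detected script next to the language label makes it possible to search or transliterate each variant deliberately rather than by accident.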

Cursive text

It can be hard for humans to read easily — let alone machines.

Human factors

Language structure

  • What needs to be captured
  • Names
  • Constructed languages (from Esperanto to Elvish to Klingon…)

Some languages are so different from one another that it’s challenging to make direct translations. It’s difficult to find matching elements of a sentence. The verb might not be what gives you a sense of when something is happening in time. Mandarin, for example, does not inflect verbs for tense and generally does not mark nouns as singular or plural; time and number come from context and extra words.

Audio tones

The way people say things can have an impact on meaning, and that can be lost when speech is written down.

Code-switching

Changing the way you speak, your vocabulary and so on, to fit in with a group you perceive as more dominant in a social situation.

Language shifts

Language changes over time, and the meaning of words drifts.

Closed language practices

Some communities use a language to make sure out-group people don’t understand them, and will change elements that are discovered.

Variant distinctions

Do you fancy eggplant or aubergine for dinner? It depends on where you are…

Slang

Informal usage, which again tends to shift over time.

Sociolects

Variants by social group, rather than dialects, which are variants by location.

Non-verbal/non-written language

Lingua Franca Mix

Mixing words from different languages to create distinct language variants.

What is a language?

For some people, it’s an ISO 639 code; for others, it’s something people speak. It depends on where and how you need to draw the line.
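If you do draw the line with codes, a common convention is a tag that combines an ISO 639 language code with optional script and region subtags, such as `gd`, `en-GB` or `sr-Cyrl-RS`. The sketch below is a deliberately simplified reader for tags of that shape. `LanguageTag` and `parse_tag` are hypothetical names, and it ignores variants, extensions and legacy tags, so treat it as an illustration rather than a validating parser.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LanguageTag:
    language: str                 # ISO 639 code, e.g. "gd" for Scottish Gaelic
    script: Optional[str] = None  # four-letter script code, e.g. "Cyrl"
    region: Optional[str] = None  # two-letter or three-digit region, e.g. "GB"

def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language[-script][-region] tag (simplified sketch)."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        if script is None and len(part) == 4 and part.isalpha():
            script = part.title()
        elif region is None and (
            (len(part) == 2 and part.isalpha())
            or (len(part) == 3 and part.isdigit())
        ):
            region = part.upper()
    return LanguageTag(language, script, region)

print(parse_tag("gd"))          # LanguageTag(language='gd', script=None, region=None)
print(parse_tag("en-GB"))       # LanguageTag(language='en', script=None, region='GB')
print(parse_tag("sr-Cyrl-RS"))  # LanguageTag(language='sr', script='Cyrl', region='RS')
```

Keeping language, script and region as separate fields lets a dataset answer both "all Serbian text" and "all Cyrillic text" without re-parsing strings.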

