Tag Archives: standards

Managing multilingual datasets: what you need to consider

More people than you might think have to deal with data in multiple languages. Work in Scotland? English and Gaelic are on your agenda. You’ve built a database to support English — and then it struggles when someone has an accent in their name. Transliterations from other scripts are often not consistent.

Here is one group’s workshopped list of the key issues people need to consider.

Technical considerations

Encoding standards

Unicode / UTF-8: in principle, every character is a number, but some characters are more than a single letter, such as accented letters and ligatures. This is even more pronounced in Arabic and Chinese. It really matters if you want to make text consistently searchable.

These are complicated, cursed and political. There’s an American evangelical organisation that gets to decide on language standards…
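To make the searchability point concrete, here is a minimal Python sketch using the standard library's `unicodedata` module. It shows that the same visible name can be stored as two different sequences of code points, and that normalising before comparing brings them back together. The `search_key` helper is a hypothetical name invented for this illustration, and stripping accents is a lossy choice that will be wrong for some languages, so treat it as one possible approach rather than a recommendation.

```python
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: build a loose, accent-insensitive search key."""
    # NFKD splits accented letters into base letter + combining mark,
    # and expands compatibility characters such as the "fi" ligature.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (lossy: "Zoë" and "Zoe" become the same key).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more thorough lower() for caseless matching.
    return stripped.casefold()


precomposed = "Zo\u00eb"   # "Zoë" using the single code point for ë
combining = "Zoe\u0308"    # "Zoë" as e + combining diaeresis

# The two strings look identical on screen but compare unequal.
looks_same_but_differs = precomposed != combining

# NFC normalisation makes them compare equal without losing the accent.
equal_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", combining)
)
```

Normalising to one form (NFC is a common choice) when data enters the system keeps exact matches reliable; a looser key like the one above is better kept in a separate search column so the original spelling is never overwritten.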

Text direction

Right to left, left to right, top to bottom, and more…
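As a rough illustration of how software can tell which way a piece of text runs, the sketch below looks at the Unicode bidirectional class of each character and takes the first "strong" one as the direction. The function name `guess_direction` is made up for this example, the heuristic only distinguishes left-to-right from right-to-left, and it says nothing about vertical scripts or mixed-direction text.

```python
import unicodedata


def guess_direction(text: str) -> str:
    """Hypothetical helper: guess 'ltr' or 'rtl' from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):   # Hebrew-style or Arabic-style letters
            return "rtl"
        if bidi_class == "L":           # Latin, Cyrillic, Greek, CJK and so on
            return "ltr"
    # Only digits, spaces or punctuation: fall back to left-to-right (an assumption).
    return "ltr"
```

Digits and punctuation are skipped because they take their direction from the surrounding text, so a string such as a house number followed by an Arabic street name is still reported as right-to-left.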

Orthography

Same language, different alphabet

Cursive text

It can be hard for humans to read easily — let alone machines.

Human factors

Language structure

  • What needs to be captured
  • Names
  • Constructed languages (from Esperanto to Elvish to Klingon…)

Some languages are so different from one another that it’s challenging to make direct translations. It’s difficult to find matching elements of a sentence. The verb might not be what gives you a sense of when something is happening in time. Mandarin has no grammatical tense and does not generally mark singular or plural.

Audio tones

The way people say things can have an impact on meaning.

Code-switching

Changing the way you speak, your vocabulary and so on, to fit in with a group you perceive as more dominant in a given social situation.

Language shifts

Language changes over time, and the meaning of words drifts.

Closed language practices

Some communities use a language to make sure out-group people don’t understand them, and will change elements that are discovered.

Variant distinctions

Do you fancy eggplant or aubergine for dinner? It depends on where you are…

Slang

Informal usage, which again tends to shift over time.

Sociolects

Variants by social group, rather than dialects, which are variants by location.

Non-verbal/non-written language

Lingua Franca Mix

Mixing words from different languages to create distinct language variants.

What is a language?

For some people, it’s an ISO 639 code; for others, it’s something people speak. It depends on where and how you need to draw a line.
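If you do draw the line at codes, one possible way to hold the same value in several languages is to key each version by its language code. The sketch below is an assumption-laden illustration, not a standard: the class name `MultilingualField`, its `get` method and the fall-back order are all invented here. The codes themselves are real ISO 639-1 codes (`en` for English, `gd` for Scottish Gaelic).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MultilingualField:
    """Hypothetical container: one value per language code."""
    values: Dict[str, str] = field(default_factory=dict)

    def get(self, tag: str, fallback: str = "en") -> Optional[str]:
        # Try the exact tag ("en-GB"), then its language part ("en"),
        # then the fallback language chosen by the dataset owner.
        for candidate in (tag, tag.split("-")[0], fallback):
            if candidate in self.values:
                return self.values[candidate]
        return None


place = MultilingualField({"en": "Edinburgh", "gd": "Dùn Èideann"})

gaelic_name = place.get("gd")       # exact match
british_name = place.get("en-GB")   # falls back to the "en" entry
welsh_name = place.get("cy")        # no Welsh entry, so the fallback is used
```

Keeping the code alongside each value, rather than one column per language, means a new language can be added later without changing the structure of the dataset.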


What open data standards do we need?

Terence Eden from the Government Digital Service had one of the most reacted-to pitches at Open Data Camp 4. Surely, he suggested to the more than 100 attendees packed into Cardiff’s Pierhead, data should always be released as PDF?

Open Data Standards

Of course, this was a joke. At the session on ‘what open data standards do we need’, he said he had insisted that government departments release data in Open Document Format.

This wasn’t openness for openness’ sake, he added. It was because he didn’t think it was reasonable for open data users to be expected to buy licences for expensive, proprietary database and software products where good, open and free alternatives existed.

Better Highways with Open Data

Highways open data issues capture

Highways may look like the perfect area for open data initiatives. There is lots of data about highways assets; there is public demand for new services, such as websites or apps through which people can report potholes; and councils have incentives to get involved.

As Teresa Jolly, the leader of a session on highways, pointed out, councils need to start making better use of their data, because people are saying:

We have all these new demands on us, and we have no money. How can we start talking to our communities about meeting their real needs without breaking the bank?