Multi-language support on embedded platforms


pozz 8 years ago

[This message is posted to comp.arch.embedded and comp.lang.c]

For reference, consider an embedded platform based on an MCU with integrated Flash, for example a Cortex-Mx device. Here I consider only Western languages (left-to-right, European characters: English, French, German, Spanish and so on).

The main problem is the translation of strings, maybe 10-100 strings.

I know a little about the gettext package, which can't be used on these embedded platforms. However, I like gettext's approach.

print_to_display(x, y, "Hello world!");

is simply changed to:

#include ...
print_to_display(x, y, _("Hello world!"));

In this way, the code stays as readable as it was before multi-language support was introduced. If a member of a structure needs a string, it is a char * as usual.

The solution I have found for embedded platforms is to use an array of arrays of strings: one index for the string and one index for the language.

enum lang_t { ENGLISH, ITALIAN, LANG_N };
enum string_t { STR_HELLO_WORLD, STR_HOW_ARE_YOU, STRING_N };

const char *strings[STRING_N][LANG_N] = {
    { // STR_HELLO_WORLD
        "Hello world!", "Ciao mondo"
    },
    { // STR_HOW_ARE_YOU
        "How are you?", "Come stai?"
    },
};

static enum lang_t lang = ENGLISH;

const char *_(int string_idx)
{
    return strings[string_idx][lang];
}

void set_language(enum lang_t new_language)
{
    lang = new_language;
}

I don't much like this approach, for two reasons. First, the line:

print_to_display(x, y, _(STR_HELLO_WORLD));

is much less readable than

print_to_display(x, y, _("Hello world!"));

Second, I need to change the type of some members/variables from char * to int:

struct mystruct {
    int title;    // Instead of char *title
    ...
};

Another approach I'm considering is to embed all the translations in the string, using a separator character that can't appear in normal strings.

print_to_display(x, y, _("Hello world!|Ciao Mondo!"));

The _() function will search for the translated string based on the current language. If it can't find one, it could return the first translation (English).

This approach has some disadvantages. It's difficult to exclude one language from the build. With more than a couple of languages, the strings become very long. The order of the translations (first English, then Italian, ...) matters, and you have to remember it for every string.
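A minimal sketch of the separator idea, assuming '|' never appears in a normal string. The name tr_packed and the caller-supplied buffer are made up for illustration. The buffer exposes one more cost of this scheme: the selected field is not NUL-terminated in place, so it has to be copied before it can be passed on as a C string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum lang_t { ENGLISH, ITALIAN, LANG_N };

static enum lang_t lang = ENGLISH;

void set_language(enum lang_t new_language) { lang = new_language; }

/* Hypothetical helper: copy the field for the current language out of
 * "English|Italian|..." into buf and return buf. Falls back to the first
 * field (English) when the string has too few fields. */
const char *tr_packed(const char *packed, char *buf, size_t bufsize)
{
    const char *field = packed;
    size_t len;
    int i;

    assert(packed != NULL && buf != NULL && bufsize > 0);

    /* Skip one '|' per language index to reach the wanted field. */
    for (i = 0; i < (int)lang; i++) {
        const char *sep = strchr(field, '|');
        if (sep == NULL) {      /* translation missing */
            field = packed;     /* fall back to English */
            break;
        }
        field = sep + 1;
    }

    len = strcspn(field, "|");  /* field ends at '|' or at the NUL */
    if (len >= bufsize)
        len = bufsize - 1;      /* truncate rather than overflow */
    memcpy(buf, field, len);
    buf[len] = '\0';
    return buf;
}
```

A call would look like print_to_display(x, y, tr_packed("Hello world!|Ciao Mondo!", buf, sizeof buf)), with buf sized for the longest translation.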

What approach do you use?


David Brown 8 years ago

I think if you are looking for a pure C approach, and you want to keep it efficient, then using the enumerated type as an index is the best choice.

But rather than writing all the strings directly in C, I would keep track of them in a spreadsheet saved in tab delimited format, and use a little script to turn it into a C header file declaring the enum, and a C source file initialising the array. It just makes it easier to keep track of everything, and saves a great deal of effort when you need to get someone else to make the translation strings.
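For comparison, a single list of messages can also drive both the enum and the array without an external script, using an X-macro. This is a different technique from the spreadsheet-plus-script workflow described above, shown only as a sketch; STRING_TABLE, AS_ENUM, AS_ROW and tr are names invented here.

```c
#include <assert.h>
#include <string.h>

enum lang_t { ENGLISH, ITALIAN, LANG_N };

/* Single list of messages: identifier, English text, Italian text. */
#define STRING_TABLE(X) \
    X(STR_HELLO_WORLD, "Hello world!", "Ciao mondo") \
    X(STR_HOW_ARE_YOU, "How are you?", "Come stai?")

/* First expansion: the enum of string identifiers. */
#define AS_ENUM(id, en, it) id,
enum string_t { STRING_TABLE(AS_ENUM) STRING_N };

/* Second expansion: the table, indexed by the same identifiers. */
#define AS_ROW(id, en, it) [id] = { en, it },
static const char *const strings[STRING_N][LANG_N] = { STRING_TABLE(AS_ROW) };

static enum lang_t lang = ENGLISH;

void set_language(enum lang_t new_language) { lang = new_language; }

const char *tr(enum string_t id)
{
    assert((int)id >= 0 && (int)id < STRING_N);
    return strings[id][lang];
}
```

The enum and the table cannot drift apart, because both come from the same list. The price is that adding a language means touching every row and both helper macros, which is where a generated file from a spreadsheet is more comfortable.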


Paul 8 years ago

Whatever method you choose, there is one thing that has to be done procedurally and will get overlooked when a time-constrained bug fix occurs: part of the fix is to change a string to correct an error or change the feature, and whoever is updating it 'forgets', or is under time pressure for the release, and fails to update ALL the other translations.

There is no easy solution for that as that involves people.

I have seen schemes where the SAME string lives in two places fail in desktop applications.

If the string in the translation tables is NOT identical to the string in the code, the lookup fails.

e.g. Table contains "hello world"

Code contains "hello world\n"

This is also more likely to happen where the same string has been copy/pasted into two different parts of the code that print the same text. Then a correction is required, so someone diligently corrects the translation table and ONE place in the code where they see the problem, but does not realise there are OTHER instances of the same string.

Whilst index keys to strings (as constants) may be less readable, they can always have inline comments, and they save on storage space and on iterative long string compares, if execution speed is also a problem. They also cut down on typos and other accidental differences between strings.

Whatever you do, you need procedures to ensure all strings are translated to all languages, for every change of any string. The human part is the weak link.

....

Paul Carpenter | paul@pcserviceselectronics.co.uk PC Services Logic Gate Education Timing Diagram Font For those web sites you hate


Robert Wessel 8 years ago

We use a preprocessor approach as well, although not from a spreadsheet. An advantage of a preprocessor is that it makes it easy to slot in default messages (in other words, English) for items which have missing translations, and to build subsets of the supported languages and messages to keep space requirements down.

Paul mentioned the difficulty of keeping the different translations in sync; the preprocessor can help there too, if you put a version code (a timestamp, in our case) on each version of each message. Then the preprocessor can warn if the English message was updated without the timestamp being updated (hopefully after being reviewed!) on the Italian message.
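The default-message fallback described here is done at build time by the preprocessor. A run-time variant of the same idea, as a sketch only: missing translations are stored as NULL and the lookup substitutes the English text. The function name tr_or_default and the table contents are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum lang_t { ENGLISH, ITALIAN, LANG_N };
enum string_t { STR_HELLO_WORLD, STR_FILE_NOT_FOUND, STRING_N };

/* NULL marks a translation that has not been supplied yet. */
static const char *const strings[STRING_N][LANG_N] = {
    [STR_HELLO_WORLD]    = { "Hello world!",   "Ciao mondo" },
    [STR_FILE_NOT_FOUND] = { "File not found", NULL },
};

/* Return the requested translation, or the English text if it is missing. */
const char *tr_or_default(enum string_t id, enum lang_t lang)
{
    const char *s;

    assert((int)id >= 0 && (int)id < STRING_N);
    assert((int)lang >= 0 && (int)lang < LANG_N);

    s = strings[id][lang];
    return s != NULL ? s : strings[id][ENGLISH];
}
```

Doing the substitution in the build tool instead keeps the NULL check out of the firmware and lets the tool report which messages fell back.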


Ian Collins 8 years ago

If you store your spreadsheet in plain text (CSV), your version control system can keep track of changes for you!

Ian


Phil Hobbs 8 years ago

Stick to plain ASCII and expect the users to adjust. ;)

Cheers

Phil Hobbs

(Who doesn't build a lot of things that sell in the millions.)

Dr Philip C D Hobbs Principal Consultant ElectroOptical Innovations LLC Optics, Electro-optics, Photonics, Analog Electronics 160 North State Road #203 Briarcliff Manor NY 10510 hobbs at electrooptical dot net http://electrooptical.net


Robert Wessel 8 years ago

Well yes (and ours are text based), but I've yet to see an SCM that can tell you that someone updated the English version of message #14, but hasn't validated or updated the Italian one yet. You can certainly get a diff and manually see where changes have been made, but that still leaves you with a manual comparison to the translations, and no way of tracking that the validations have been done.


Keith Thompson 8 years ago

The "git blame" command tells you, for each line in a file, when that line was most recently modified. Other SCMs have similar tools. I imagine you could build some tools on top of that that could warn you, for example, that the English version of message #14 was updated yesterday but the Italian version hasn't been changed in the last year.

Keith Thompson (The_Other_Keith) kst-u@mib.org Working, but not speaking, for JetHead Development, Inc. "We must do something. This is something. Therefore, we must do this." -- Antony Jay and Jonathan Lynn, "Yes Minister"


Robert Wessel 8 years ago

A problem is when a change needs to be made to only some translations of a message. Let's say the English one was awkwardly worded, and thus modified, but (some of) the other translations don't need to be changed (although they should probably be reviewed). You need a way to track when a particular translation was last validated against the intended meaning of the message (OK, let's be blunt, the base English message), and against which version it was validated. So we have:

{
msg=FILENOTFOUND,v=3
EN="File not found",m=05-09-2017,v=3
IT="File non trovato",m=01-01-2015,v=3
}
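The check that such records make possible can be sketched as follows. This reads the record's fields one particular way, which is an assumption: the v= on the message is the current version of the base message, and the v= on each language is the version that translation was last validated against. The struct layout and function names are invented for the example; in practice this would run in the build-time tool rather than on the target.

```c
#include <assert.h>
#include <stddef.h>

enum lang_t { ENGLISH, ITALIAN, LANG_N };

/* One translation plus the base-message version it was validated against. */
struct translation {
    const char *text;
    unsigned validated_against;
};

struct message {
    const char *name;
    unsigned version;               /* bumped when the base meaning changes */
    struct translation tr[LANG_N];
};

/* A translation is stale if it is missing or was validated against an
 * older version of the message than the current one. */
int translation_is_stale(const struct message *m, enum lang_t lang)
{
    const struct translation *t;

    assert(m != NULL && (int)lang >= 0 && (int)lang < LANG_N);
    t = &m->tr[lang];
    return t->text == NULL || t->validated_against < m->version;
}

/* Count stale translations over a whole table, e.g. to fail a build. */
size_t count_stale(const struct message *msgs, size_t n)
{
    size_t i, stale = 0;
    int l;

    for (i = 0; i < n; i++)
        for (l = 0; l < LANG_N; l++)
            if (translation_is_stale(&msgs[i], (enum lang_t)l))
                stale++;
    return stale;
}
```

With the FILENOTFOUND record above, both languages carry v=3 against a message at v=3, so nothing is reported; rewording the English text and bumping the message to v=4 would flag the Italian entry until someone reviews it.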


Theo Markettos 8 years ago

I wonder if you could do something using the strings themselves as identifiers.

In other words _("Hello world") is a function called _() passed a const char *

The first thing the _() function does is look up that char * in a hash table to see if it's something we've seen before. If so, it returns a pointer to the translated string.

If not, it matches the string against a list of translations and inserts the pointer to the translation into the hash table.

The tradeoff is that it's more work at runtime. But essentially we only have to walk the string once per run, and then all we have to do is hash the pointer each time we use it. That's not zero overhead, but probably much less work than printf() is already doing (if you're using that). It's a bit more problematic if first-time walking the string might be too costly on some code paths.

gettext or another compiler technique could be used to scrape out the strings to build the translation table. You might be able to instrument that to raise an error at compile time when the extracted translations don't match the translations in the database.
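A rough sketch of the pointer-keyed cache described above, under several assumptions: the table holds one language only, the cache is a small open-addressing table with linear probing, and every string passed in has static storage (a literal), so its address is stable. All names (tr_cached, CACHE_SLOTS, slow_lookups) are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 64  /* power of two, well above the number of strings */

struct pair { const char *english; const char *translated; };

static const struct pair italian[] = {
    { "Hello world!", "Ciao mondo" },
    { "How are you?", "Come stai?" },
};

static struct { const char *key; const char *value; } cache[CACHE_SLOTS];
static unsigned slow_lookups;   /* number of full string searches so far */

/* Hash the pointer value itself, not the characters it points to. */
static size_t hash_ptr(const char *p)
{
    return (size_t)((((uintptr_t)p * 2654435761u) >> 16) & (CACHE_SLOTS - 1));
}

/* Slow path: walk the translation list comparing string contents. */
static const char *search_table(const char *s)
{
    size_t i;

    slow_lookups++;
    for (i = 0; i < sizeof italian / sizeof italian[0]; i++)
        if (strcmp(italian[i].english, s) == 0)
            return italian[i].translated;
    return s;   /* no translation: fall back to the source string */
}

const char *tr_cached(const char *s)
{
    size_t i, probes;

    assert(s != NULL);
    i = hash_ptr(s);
    for (probes = 0; probes < CACHE_SLOTS; probes++) {
        if (cache[i].key == s)              /* pointer seen before */
            return cache[i].value;
        if (cache[i].key == NULL) {         /* first use: search once */
            cache[i].key = s;
            cache[i].value = search_table(s);
            return cache[i].value;
        }
        i = (i + 1) & (CACHE_SLOTS - 1);    /* linear probing */
    }
    return search_table(s);     /* cache full: still correct, just slow */
}
```

Two things this sketch leaves out: the cache has to be cleared whenever the language changes, and the cache itself costs RAM (two pointers per slot), which may matter more than cycles on a small MCU.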

Theo


Allan Herriman 8 years ago

Some SCM tools allow the use of "hook scripts" - bits of code or programs that you can hook into various SCM actions or states. You would need to write the scripts yourself, but it ought to be possible to disallow a checkin on the English file if the other language files haven't been updated.

I did a similar thing with Tortoise SVN once: for some reason we had document source (e.g. Word) and PDF output both in the SCM system, and a hook script would check if one was being checked in without the other, and abort the checkin with a suitable message if one was missing.

Regards, Allan


Keith Thompson 8 years ago

[snip]

GNU gettext is free software, licensed under GPLv3. I wonder if you could grab a copy of it, remove any functionality you don't need, and end up with something small enough to work on your embedded system.

In a very quick look at the gettext sources, I see that the gettext-runtime/src subdirectory contains about 1300 lines of C code. If that's all that needs to run on the target system, you might even be able to use it without modification.

(Any licensing issues are left as an exercise.)

Keith Thompson (The_Other_Keith) kst-u@mib.org Working, but not speaking, for JetHead Development, Inc. "We must do something. This is something. Therefore, we must do this." -- Antony Jay and Jonathan Lynn, "Yes Minister"


Scott Lurndal 8 years ago

GPLv3 would likely preclude use of gettext in proprietary embedded code.

There is equivalent functionality in the *BSD variants that are licensed much more liberally.


pozz 8 years ago

This is a good suggestion. I'm no gettext expert, but I remember that it loads and searches for the right strings (based on the current language) at runtime, reading a binary file (.mo extension).

In my embedded platform I don't have a real filesystem so I can't access "files" at runtime.

Maybe I could add the mo files in the output binary file (the image of the Flash memory of the MCU) at exact locations and change gettext code to look at those fixed addresses instead of accessing files.
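As a sketch of what reading an .mo image straight from Flash could look like, without any gettext code: the function below assumes the image is visible as a byte array, was produced with the same byte order as the target, and follows the GNU MO layout documented in the gettext manual (magic number, string count, then two tables of length/offset pairs, with the original strings sorted). The name mo_lookup is made up; it is not a gettext function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MO_MAGIC 0x950412deu

/* Read a 32-bit word at a byte offset without assuming alignment. */
static uint32_t mo_u32(const uint8_t *blob, uint32_t offset)
{
    uint32_t v;

    memcpy(&v, blob + offset, sizeof v);
    return v;
}

/* Look msgid up in an MO image; return msgid itself if not found. */
const char *mo_lookup(const uint8_t *blob, const char *msgid)
{
    uint32_t n, orig_tab, trans_tab, lo, hi;

    assert(blob != NULL && msgid != NULL);

    if (mo_u32(blob, 0) != MO_MAGIC)
        return msgid;               /* not an MO image, or wrong byte order */

    n         = mo_u32(blob, 8);    /* number of strings */
    orig_tab  = mo_u32(blob, 12);   /* offset of the original-strings table */
    trans_tab = mo_u32(blob, 16);   /* offset of the translations table */

    /* Each table entry is {length, offset}. Originals are sorted, so a
     * binary search works. */
    lo = 0;
    hi = n;
    while (lo < hi) {
        uint32_t mid = lo + (hi - lo) / 2;
        const char *orig =
            (const char *)blob + mo_u32(blob, orig_tab + 8 * mid + 4);
        int cmp = strcmp(msgid, orig);

        if (cmp == 0)
            return (const char *)blob + mo_u32(blob, trans_tab + 8 * mid + 4);
        if (cmp < 0)
            hi = mid;
        else
            lo = mid + 1;
    }
    return msgid;   /* untranslated: fall back to the source string */
}
```

One image per language would be linked into the binary, and changing language would just switch which blob pointer is passed in. The sketch ignores plural forms, the optional hash table in the file, and any bounds checking against a corrupt image.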

Anyway thanks for the suggestions.

And this is another good point to study.


bartc 8 years ago

gettext looks like a very heavy-duty approach. (A lot of these third party solutions are. My experience was that a third party library that took care of 5% of the functionality of my application, would be several times bigger than my entire app.)

The method is basically this, assuming you have tables of messages for all languages in memory:

Take an English message M
Look it up in the English table, to get index N (with 100 messages, a linear search will do)
If N is in range, return the string from table[N] for language L
If M wasn't found, just return M, the English version.

So, probably a 10 or 20 line function.
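The four steps above might look like this, as a sketch with a made-up two-language table and a made-up function name (translate). It also shows why the English text ends up stored twice: once at the call site and once in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum lang_t { ENGLISH, ITALIAN, LANG_N };

/* One row per message; column ENGLISH doubles as the search key. */
static const char *const table[][LANG_N] = {
    { "Hello world!", "Ciao mondo" },
    { "How are you?", "Come stai?" },
};

#define NMESSAGES (sizeof table / sizeof table[0])

/* Find English message m by linear search and return it in language l.
 * If m is not in the table, return m itself (the English version). */
const char *translate(const char *m, enum lang_t l)
{
    size_t n;

    assert(m != NULL && (int)l >= 0 && (int)l < LANG_N);

    for (n = 0; n < NMESSAGES; n++)
        if (strcmp(table[n][ENGLISH], m) == 0)
            return table[n][l];
    return m;
}
```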

This would require two copies of each English message, one in the source, and one in a searchable table. And that needs maintenance.

You might be able to get around that by embedding a serial number in each English message:

puts(_("Please enter filename: !078"));

Here the "!078" is the number and is not displayed (or you can use {78} etc.; any scheme will do).

Now you just have to search the table for language L for a message with the same number. (You don't need to convert to an integer, just compare the last few characters.)

Of course, you need to return a string without the !078 etc. in it. For that purpose, it might be better to put this number at the start. Then you return a pointer to just past the number.

(See example below using such a scheme. This might give some ideas.)

You can use the number as an actual index, but the maintenance becomes harder.

There is still the problem of producing a list of English messages for translators to work from. But the format and ordering of that is not critical.

Another reason to forget using anyone else's library.

--------------------------------------------------------------

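// The header names were lost in the archive; these two are assumed.
#include <stdio.h>
#include <string.h>

char *italian[] = {
    "1!uno",
    "2!due",
    "3!tre",
    "4!quattro",
    "5!fine",
};
char *spanish[] = {
    "2!dos",
    "3!tres",
    "1!uno",
    "4!cuatro",
    "5!fin"
};
char *german[] = {
    "4!vier",       // ordering doesn't matter
    "5!Ende",
    "1!eins",
    "2!zwei",
    "3!drei"
};

//char **currlang = italian;
//char **currlang = german;
char **currlang = spanish;
//char **currlang = NULL;

int nmessages=sizeof(italian)/sizeof(italian[0]);

// Return a pointer just past the "N!" prefix, or M itself if there is none.
char* skipprefix(char* M){
    char *s=M;
    while (*s!='!' && *s!=0) ++s;
    if (*s==0) return M;
    return s+1;
}

char* lookup(char* M){
    char *s;
    int i,len;

    s=skipprefix(M);
    len=s-M;                                    // length of the "N!" prefix
    if (currlang==NULL || len==0) return s;     // English

    // The archived post is cut off after "for (i=0; i". The rest of this
    // function is a reconstruction, not the original text.
    for (i=0; i<nmessages; ++i)
        if (strncmp(currlang[i],M,len)==0)      // same "N!" prefix
            return skipprefix(currlang[i]);
    return s;                                   // not found: English
}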


David Brown 8 years ago

gettext /is/ a heavy-duty approach. It is designed to separate the program code and the translation texts, so that they can be written by different people, compiled separately, distributed separately, and (if desired) updated separately - because the binary and the translation files are all separate files. It is a very useful approach for many kinds of program - but too big and complex for what the OP wants, I believe.

That can be okay for a starting point, but it has a /big/ problem - you only get one entry for each original English language message. When you are translating messages, it is not uncommon to encounter different messages with the same text in the original language but different texts in the translations. In gettext, this is handled with message contexts (pgettext() and msgctxt entries) that are looked up together with the text.

Maintenance of the string numbers here is a hassle.

No, it is another reason to look at the licensing before using other libraries. People write libraries with the intention of letting other people use them - you just need to make sure the licensing is suitable.


bartc 8 years ago

The OP said there are 10-100 messages. Then any clashes (of the same English text with different meanings) can be handled manually.

But my scheme with reference numbers can fix that. It can be extended to annotate messages, giving a general method of disambiguating messages with multiple meanings.

No, the numbers can be anything, including any text, or can be annotations. But in this scheme, every message must have an annotation, and that will change the appearance of the message within the source.

How does that library deal with the issues of extracting the messages in a format that can be submitted to a translator (who might be in a different country), and what format are they sent back in, or submitted to the program?

What about when the program is revised, and messages are deleted, added or modified?

How does it deal with multiple instances of the same message that differ only in leading or trailing punctuation or capitalisation? Do multiple messages have to be provided?

What about the problem raised above of the same English words having a different meaning depending on context?

I looked at the docs for gettext and they run to 275 pages in PDF format; 378 pages in Word. How many messages did the OP want to deal with again?

(The scheme I outlined in my first post in the thread dealt with all this. And it totalled a few hundred lines of code. Actually I don't think I needed the translations at all; they could be loaded locally from a file, with an existing version of the application, so no intervention was needed.)

bartc


Richard Damon 8 years ago

If you really want to look up by string, then your _ function just needs to search the translation table for that value, and then return the desired translation, instead of being passed the index. The lookup will take a bit of time, but not that long given your number of strings, and if you sort the strings by the base translation, you could binary search to find it.
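A sketch of the sorted-table variant using the standard bsearch(). The table contents and the names entry, cmp_key and translate_sorted are assumptions for the example; the only real requirement is that the rows stay sorted by the English text in strcmp order, which a build script could enforce.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum lang_t { ENGLISH, ITALIAN, LANG_N };

struct entry { const char *text[LANG_N]; };

/* Must stay sorted by the English text (strcmp order). */
static const struct entry table[] = {
    { { "File not found", "File non trovato" } },
    { { "Hello world!",   "Ciao mondo" } },
    { { "How are you?",   "Come stai?" } },
};

/* bsearch() comparison: key is the English string being looked up. */
static int cmp_key(const void *key, const void *elem)
{
    const struct entry *e = elem;

    return strcmp((const char *)key, e->text[ENGLISH]);
}

/* Return m in language l, or m itself if it is not in the table. */
const char *translate_sorted(const char *m, enum lang_t l)
{
    const struct entry *e;

    assert(m != NULL && (int)l >= 0 && (int)l < LANG_N);

    e = bsearch(m, table, sizeof table / sizeof table[0],
                sizeof table[0], cmp_key);
    return e != NULL ? e->text[l] : m;
}
```

For a table of around 100 strings this needs about seven string comparisons per lookup instead of up to 100 with a linear search.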


David Brown 8 years ago

Indeed it will handle it. But it means you have to have numbers in the code, and match it up with numbers in the translation files. Once you start having that sort of thing, you lose the benefits of having a simple direct text in the code. So you might as well cut out that text in the code and put it in the messages file. And then you might as well use an enumerated type - then instead of arbitrary numbers with no connection to indexing and manual checking for collisions, you have a header file with the enumerated type defined, symbols with useful names (like "str_hello_world"), checking by the compiler for errors, automatic completion from within your IDE, and fast and simple lookup in the actual table.

Yes, it is a hassle - you have to be sure there are no conflicts, and you have to match them up in your translation file. That's easy for a small program, but scales poorly and cannot be checked by the compiler.

Do you mean gettext in particular, or some arbitrary library in general? The licensing issue was a general point.

gettext comes with tools to aid translating and maintaining the translation files. I'll let you look up the details - there is little point in having me copy-and-paste stuff off the web.

It is all handled by gettext. Whether you like or dislike the way it is handled, is up to you. Again, look up the details if you want.

Since gettext is quite big, and has a license unsuitable for most embedded software, it is unlikely to be the answer for the OP. It is a /heavyweight/ solution. It might help inspire ideas for the OP, but it is not a practical choice for him. It is, however, an excellent choice for many other projects and programs. And there are several other gettext-like libraries around, which might be a workable choice depending on what the OP likes.

The scheme you outlined is a possibility, and I'm sure the OP will consider it. It is not the way /I/ would do it (I mentioned that in my first reply in the thread), but it could work.


Mel Wilson 8 years ago

This may have been said already (who can tell, here), but the gettext convention _("Some string that has to be translated")

can make it easy for a preprocessor of your own to pick out these strings, manage a database of translations, and substitute the translations into individual builds for each language. The '_' can just be a syntactic marker -- doesn't need to be a callable function at all.
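As a small sketch of the marker-only idea: if '_' is defined as a macro that expands to its argument, the source compiles unchanged before any tool has run, and a per-language build would have the (hypothetical) tool rewrite the literal inside the marker. The function greeting and the variable title exist only to show the two places a marked string can appear.

```c
#include <assert.h>
#include <string.h>

/* Marker only: expands to the literal itself, so it costs nothing at run
 * time and also works in static initializers. */
#define _(s) s

/* A marked string in a static initializer. */
static const char *const title = _("Main menu");

/* A marked string in ordinary code. */
const char *greeting(void)
{
    return _("Hello world!");
}
```

With this arrangement each firmware image carries exactly one language, so there is no run-time language switch and no translation table on the target.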
