[split] language translations concatenation issues
#1
Hi, I'm not sure if GetSimple CMS translation is still active; I just wanted to chip in on something I noticed when translating my own backend.

And I want to note that I like how GetSimple is turning out. It really is simple and I very much like this software. That is the only reason I'm posting here. Please forgive me if my post feels like a rant; I do not intend it to be.


I used the Translate plugin and I'd say that it is quite a nice way to make it easy to translate. I myself didn't want to spend time digging around and learning github. (If there's a guide I'd try.) I do see the value in managing the translations using github, though.

The Translate plugin provides a better workflow than editing a PHP file. The worst thing about translating PHP files is that we spend more time moving the cursor than actually translating. With the Translate plugin, I translate, Tab, then translate the next string. For translators who often translate as fast as we type, this is extremely helpful. It's probably comparable to the benefits of auto-indent plus syntax highlighting combined, when a programmer writes code.

Now, regarding the strings in GetSimple CMS: I noticed that it concatenates strings.

This is what I consider the bane of software localization. I work in English-Chinese software localization and often run into hard-coded strings like this. In GetSimple, for example, page management has a TOTAL_PAGES string that is appended onto the actual total to form "10 total pages". Because the order is hard-coded, the Chinese translation has to do away with the meaning of "total" and just literally translate it as "pages" instead. (The natural translation for it should be "共 10 頁".) The problem here is that the code forces word order upon the strings, and that interferes with translation. (This is also why almost all open source software translated into Chinese has weird or unnatural translations.)

In the ideal world, all variables should be represented by placeholders so order can be freely changed, i.e. exactly how the date format strings are handled. This is the standard treatment for commercial handling through software like Trados. (Although I must admit even big names like Valve, Microsoft, Adobe, etc make this mistake all the time. Especially Valve was very bad at this, it was a pain to translate anything of theirs a few years back.)
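To make the difference concrete, here is a minimal PHP sketch. The language arrays and both helper functions are made up for illustration and are not how GetSimple actually stores or prints its strings; only the TOTAL_PAGES key and the "共 10 頁" translation come from the example above.

```php
<?php
// Hypothetical language arrays; the real language files may be structured differently.
$en = array('TOTAL_PAGES' => '%d total pages');
$zh = array('TOTAL_PAGES' => '共 %d 頁');

// Hard-coded order: the number always comes first, whatever the language.
function total_pages_concat($count, $label) {
    return $count . ' ' . $label;
}

// Placeholder: the translated string itself decides where the number goes.
function total_pages_format($count, array $lang) {
    return sprintf($lang['TOTAL_PAGES'], $count);
}

echo total_pages_concat(10, 'total pages'), "\n"; // 10 total pages
echo total_pages_format(10, $en), "\n";           // 10 total pages
echo total_pages_format(10, $zh), "\n";           // 共 10 頁
```

With the concatenating version there is simply no string a translator could supply that would produce "共 10 頁".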

But this presents another pitfall, and many companies fall into this second trap of writing strings like "posted by %s, %s" to form "posted by zfz, 5 minutes ago". This is more or less the same as concatenating ("posted by", $author, ", ", $time, " ago"), it's not really a placeholder if we can't switch their order around.

Removing this hard-coding also means that translations won't get reverted to funny stuff just because a new translator comes on board and starts editing things out of context into the "correct translations". I've had client reviewers do that, then push their "correct" version into production, then get laughed at by their customers.

Another example is the tooltip also in page management: "View Page: page-title". The colon is hard coded and that forces the translation to use "English punctuation". We use full-width colons.

Lastly, style matters. In Chinese, we simply don't use italics. Yes you see it all the time, but that is very bad practice that resulted from lack of alternatives. The concept of italics in Chinese requires a change of the script (font) itself. Most fonts just look bad, and especially the standard UI/web serif font becomes very hard to read at the standard sizes if you italicize it.

In fact, the GetSimple admin back-end is very difficult to read in traditional Chinese, many italic strings are virtually illegible. This is not entirely the fault of italics, though; Chinese characters simply are not suited to using the commonly seen small size that modern websites love to use.

A properly localized piece of software will need to have an accompanying style that fits the text. But of course this is easily taken care of by providing a CJK-friendly CSS as a plug-in. This has the additional benefit that we could adjust the layouts to cater for slightly longer strings in certain places like table headings.

It would be great if GetSimple could at least remove all hard-coding from its strings. There's nothing worse than having to translate half a string; it goes beyond having to guess the meaning: it makes for bad translations.
#2
i18n advice is much appreciated. I don't think translations were a core design goal; they were added to support our international community, which keeps getting larger, and they were implemented as easily as possible.

One example is the use of sprintf syntax with no plural support.
That said, sprintf is very versatile and is nothing like concatenation; I think you are mistaken about how it works, or unaware of how variable substitution can be used. See argument swapping: http://us2.php.net/sprintf
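A minimal sketch of argument swapping, using made-up strings rather than anything from the actual language files. The `%1$s` / `%2$s` position specifiers are standard PHP sprintf syntax, and they work even when both arguments are of the same type:

```php
<?php
// Hypothetical strings for illustration; single quotes keep PHP from
// treating "$s" as a variable.
$en = 'Posted by %1$s, %2$s';
$zh = '%2$s由%1$s發表';

$author     = 'zfz';
$timeago_en = '5 minutes ago';
$timeago_zh = '5 分鐘前';

// The arguments are passed in the same order both times;
// only the translated string decides where each one lands.
echo sprintf($en, $author, $timeago_en), "\n"; // Posted by zfz, 5 minutes ago
echo sprintf($zh, $author, $timeago_zh), "\n"; // 5 分鐘前由zfz發表
```

This only helps if the placeholders are part of the string the translator actually sees.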

I am not familiar with the translate plugin, but threw the github together so people could diff and edit in the browser.

Post specifics of anything you find and it will be fixed.
NEW: SA Admin Toolbar Plugin | View All My Plugins
- Shawn A aka Tablatronix
#3
Yep, I know printf but it is not exposed to the translators.

The actual string in the php file which translators look at, is "total pages", not "%d total pages", and that means it doesn't matter what goes on behind the scenes, to the translator, we have no control of "where the %d goes".

As for argument swapping, can it be used to swap two arguments of the same type? Because that is really what would cause problems.

For specifics, I'll underline the key parts (i.e. the "missing parts") in each of the following strings:
  • PHPVER_ERROR: the string ends abruptly, there should be a variable for us to place, i.e. "greater is required, you have %s"
  • EMAIL_USERNAME: "Your username is %s"
  • EMAIL_PASSWORD: "Your new password is %s"
  • NOTE_USERNAME and NOTE_PASSWORD: shouldn't they be in one full sentence? Use <br/> in the string if a forced line break is required
  • EDITPAGE_TITLE: "Edit Page: %s"
  • VIEWPAGE_TITLE: "View Page: %s"
  • DELETEPAGE_TITLE: "Delete Page: %s"
  • TOTAL_PAGES: "%d total pages"
  • TOTAL_FILES: "%d total files &amp; folders"
  • EDITING_FILE: "Editing File: %s", i.e. the colon, space and actual filename need to be part of the string
  • In site health check page, all the "Installed -OK" strings need to be one full string, not concatenated from "Installed" and "OK". Because "OK" is translated differently depending on whether it is used on a button or used as a status flag. If you need to accommodate for other status like "Installed - error", then that itself should be one full string, too.
  • POWERED_BY: "Powered by %s"
  • TOTAL_BACKUPS: "%d total backups".
  • WELCOME: "Welcome %s"
  • PLUGIN_INSTALLED: "%d plugins installed"
And then there were two empty strings that showed up in the Translate plugin: AUTOSAVE_FAILED and TRANSLITERATION. I'm not sure what to do with these.

One specific string that deserves separate note is the PAGE_BACKUPS string. This string is used as the title to that page, as well as the note for how many backups there are. As explained previously, the Chinese version of these two usages will be different, hence the English version also must have two different strings for this. PAGE_BACKUPS="Page Backups", and TOTAL_PAGE_BACKUPS="%d page backups" for the total count.

The above are just the things I picked up from skimming through the en_US.php file. There will be others I missed, because I haven't seen all of them in action and so couldn't tell that the string provided wasn't complete. The best way would probably be for me to go through every line of code and check exactly how each piece of output is created. That'll take some time.
#4
Looks like a great start, of course these are non backward compatible changes so we'll have to figure out how to change them without breaking every lang file.

An easier way to find them is to remove the translations and turn off fallback; the untranslated tokens will then all show up with {} around them in the UI.
Also, I know of some that I come across now and then when coding; they just need to be annotated and identified.

The code is hard to manage right now, as everything is some combination of inline PHP in HTML. Finding output strings is a pain because there are echoes all over.
- Shawn A aka Tablatronix
#5
Oh, and we also have to identify any locale-dependent output, such as counts and dates, that will need locales applied; that will also be added in 3.4.
- Shawn A aka Tablatronix
#6
Just a few comments regarding the Translate plugin and its advantages:
  • it scans the plugin's or GetSimple core code to retrieve all necessary translations - great, if you are adding text placeholders but only want to assign texts later.
  • if a text placeholder is found in the language file, but not in the source code, it's marked (grayed out), but can still be translated.
  • it shows all (installed) languages and their completion percentages.
  • you normally translate from one language to another language - you don't need to understand what e.g. PAGE_NOTEXIST means, but can translate "The requested page does not exist" to your language. You can even translate, if you don't know English (well enough), as long as there exists a language you know.
  • you can filter the list to only show missing translations
  • you can save and immediately view the result by navigating to the respective admin page.

Currently I have a version that can download/upload from/to Transifex.net (as long as there is a project getsimple_<pluginname> and you are authorized for it). I'm not sure if this is (easily) possible for Github, too.
I18N, I18N Search, I18N Gallery, I18N Special Pages - essential plugins for multi-language sites.
#7
interesting, I will check it out.
- Shawn A aka Tablatronix
#8
Hmmm. I just realized there were a lot of translations in the Extend sections. In that case indeed it'll break a lot of things.

If you mainly want to cater to the non-CJK European/American type of audience, perhaps it's not worth the effort to do such a change, as often those languages have very similar word order.

I would appreciate it if you could drop a note here or maybe PM me if you decide to go down that road, though, as I'll have to find an alternative solution otherwise.

Perhaps it's just best to keep it in mind for your next big overhaul in future, when it's alright to break backwards compatibility.
#9
Perhaps we could have some kind of versioning or wrapper for these changes that would, for a time, do some kind of backward conversion. It does seem like a lot of effort to change them all to support two different forms of output string.

Ideally we would just hold this off until a major release (which is allowed to break stuff).
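For what it's worth, the wrapper idea could look roughly like the sketch below. `i18n_count()` is a hypothetical helper, not an existing GetSimple function, and it assumes the only difference between old and new strings is whether a `%d` placeholder is present.

```php
<?php
// Hypothetical helper: formats a count string from either a new-style
// translation ("%d total pages") or a legacy one ("total pages").
function i18n_count($template, $count) {
    // New-style strings contain a %d (or positional %1$d) placeholder.
    if (preg_match('/%(\d+\$)?d/', $template)) {
        return sprintf($template, $count);
    }
    // Legacy strings have no placeholder, so fall back to the old concatenation.
    return $count . ' ' . $template;
}

echo i18n_count('%d total pages', 10), "\n"; // 10 total pages
echo i18n_count('total pages', 10), "\n";    // 10 total pages
echo i18n_count('共 %d 頁', 10), "\n";        // 共 10 頁
```

Old language files would keep working unchanged, while updated ones get control over where the number goes.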
- Shawn A aka Tablatronix
#10
https://github.com/GetSimpleCMS/GetSimpleCMS/issues/762
- Shawn A aka Tablatronix
#11
I split this out into a new thread
- Shawn A aka Tablatronix
#12
Yeah, I think that'll be best. No point breaking what's already more or less working for existing users.

Having two sets of strings would also confuse people just getting on board. I know I would be. Those strings are bound to float around the Internet for a long while, and I don't think that's a good thing in the long term.

Perhaps it's possible to start by just drawing up some short guidelines for i18n/l10n, not only for the core but also for themes and plugins. A "do this and your stuff will be easy to translate" kind of thing.

But I'm getting ahead of myself. I'm just glad that you're considering re-organizing the strings for a future major release. :)
#13
(2014-02-12, 09:03:40)zfz Wrote: And then there were two empty strings that showed up in the Translate plugin: AUTOSAVE_FAILED and TRANSLITERATION. I'm not sure what to do with these.

AUTOSAVE_FAILED seems to be missing in language files (including en_US).
TRANSLITERATION is optional, it's for slug transliteration: instead of a string, it should be an array of equivalences between non-English characters and their latinised version, e.g. 'ñ' => 'n'. Some language files may have it, and there are also several plugins for specific languages (Romance languages, Cyrillic, Czech/Slovak, ...)
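As a rough illustration of how such an array can be applied, here is a sketch using PHP's `strtr()`. The table contents and the `slugify()` helper are made up for this example; the core's actual slug code may work differently.

```php
<?php
// Hypothetical transliteration table; a real language file would define its own.
$transliteration = array('ñ' => 'n', 'á' => 'a', 'é' => 'e', 'ü' => 'u');

// Hypothetical slug helper, for illustration only.
function slugify($title, array $table) {
    $text = strtr($title, $table);                    // replace mapped characters
    $text = strtolower($text);                        // lower-case the ASCII letters
    $text = preg_replace('/[^a-z0-9]+/', '-', $text); // everything else becomes a dash
    return trim($text, '-');
}

echo slugify('Año Nuevo en München', $transliteration), "\n"; // ano-nuevo-en-munchen
```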
#14
So here are some quick guidelines to keep in mind when making a piece of software.

1. Every string must be meaningful on its own

This necessitates using placeholders. For example, having separate strings for "Page" and "of" is very bad. On the other hand, having one string of "Page {$pagenum} of {$totalpagenum}" is very good.

2. Never force word order

This means the placeholders must be named and can be freely switched around. For example, this is very bad: having "%s by %s, %s" where the three string variables correspond to $action ("Updated" or "Created" or "Posted"), $author, $timeago ("just now" or "yesterday" or "last week").

What should be done is to have 3 strings like this (translation provided):
  • "Updated by {$author}, {$timeago}" -> "{$timeago}由{$author}更新"
  • "Created by {$author}, {$timeago}" -> "{$timeago}由{$author}建立"
  • "Posted by {$author}, {$timeago}" -> "{$timeago}由{$author}發表"
Using just "%s, %s" would cause a problem here, resulting in awkward translations that conform to English syntax, i.e. forcing English syntax onto Chinese.
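Named placeholders like these are easy to support. Below is a minimal sketch; `i18n_format()` is a hypothetical helper written for this post, not something GetSimple provides, and the `{$name}` token syntax is just the notation used in the examples above.

```php
<?php
// Hypothetical helper: replaces {$name} tokens in a translated string.
function i18n_format($template, array $values) {
    $pairs = array();
    foreach ($values as $name => $value) {
        // Build the literal token, e.g. {$author}
        $pairs['{$' . $name . '}'] = $value;
    }
    // strtr() replaces all tokens in one pass, wherever they appear.
    return strtr($template, $pairs);
}

// Single quotes keep PHP from interpolating {$author} itself.
$en = 'Updated by {$author}, {$timeago}';
$zh = '{$timeago}由{$author}更新';

echo i18n_format($en, array('author' => 'zfz', 'timeago' => 'yesterday')), "\n"; // Updated by zfz, yesterday
echo i18n_format($zh, array('author' => 'zfz', 'timeago' => '昨天')), "\n";       // 昨天由zfz更新
```

Because the tokens are named, the translator can see what each one stands for and put it wherever the target language needs it.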

3. Never reuse anything

This is more or less a natural implication of the previous two points. Another example is trying to reuse the string "by".
  • {$posttitle} + "by" + {$author} -> {$posttitle} + "作者:" + {$author}
  • "by" + {$sortcriteria} -> "依" + {$sortcriteria}
As can be seen, the string "by" by itself has no meaning, and most languages do not have 1-to-1 relationships between words. Reusing parts of a string like that will make translators hate you, sometimes also making your software look stupid.

4. String ID should not be just the string content

The classic example is: STRING_NEXT = "Next".

Next what? Next page? Next step? In Chinese the translation for "Next" will include the measure word, and Chinese measure words are very different to English.

It would be better to use STRING_NEXTPAGE = "Next".

And yes, this again ties in with the previous points because now it would be weird to use STRING_NEXTPAGE for presenting a next step. So you would have two strings that seem redundant in English:
  • STRING_NEXTPAGE = "Next" -> STRING_NEXTPAGE = "下一頁"
  • STRING_NEXTSTEP = "Next" -> STRING_NEXTSTEP = "下一步"
You see, they mean different things, so they should be different strings, even though in English you would use the same resulting string to represent both.
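In code, that just means two keys that happen to share an English value. A small sketch follows; the arrays and the `i18n_lookup()` helper are hypothetical, and showing a missing ID in braces is only an assumption about how a missing string might be displayed.

```php
<?php
// Hypothetical language arrays: the same English text, but distinct IDs.
$en = array('STRING_NEXTPAGE' => 'Next', 'STRING_NEXTSTEP' => 'Next');
$zh = array('STRING_NEXTPAGE' => '下一頁', 'STRING_NEXTSTEP' => '下一步');

// Hypothetical lookup with an English fallback.
function i18n_lookup($id, array $lang, array $fallback) {
    if (isset($lang[$id])) {
        return $lang[$id];
    }
    if (isset($fallback[$id])) {
        return $fallback[$id];
    }
    return '{' . $id . '}'; // make missing strings easy to spot
}

echo i18n_lookup('STRING_NEXTPAGE', $zh, $en), "\n"; // 下一頁
echo i18n_lookup('STRING_NEXTSTEP', $zh, $en), "\n"; // 下一步
echo i18n_lookup('STRING_NEXTSTEP', array(), $en), "\n"; // Next
```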

A good rule of thumb is to use the string ID to describe what that string does. Instead of STR_UPDATEDBY="Updated by {$author}", use STR_BACKUPNOTE_UPDATE="Updated by {$author}".

This way the translator can easily see that this is not just a string that says "Updated by $author", but one that will be used as the note in the list of backup data, as opposed to a small note under each post. Again, the suitable translations will differ between those cases, so knowing the context makes it possible to write a more fitting, natural translation.

==

Having said all this, I also have to say that they're just guidelines, not rules set in stone. It's impossible to expect anyone to build a system that can cater for absolutely every language in the world. In fact, my current job, which came from a well-known global brand, has strings that make every single mistake I have described.

I've been in software localization for more than ten years, working as a freelancer, and have handled stuff for Microsoft, Adobe, Sun, Samsung, Sony Ericsson, Asus, Blackberry etc. via my agent. They all make these mistakes, all of them, all the time. In general, programmers don't care about localization even though they know they want to launch the product globally. They just don't care. Most translators are used to it anyway.

So don't worry too much about it. It won't be the end of the world if you ignore these guidelines. Following the guidelines allows your software to have that "polish", if the translator is good. In the end you may decide, for example, not to use named placeholders but just the built-in variable substitution of sprintf. You lose some leeway in the translation but make the system more efficient or more reliable. That's perfectly reasonable.

I hope I haven't been too much of a pain for you all. Thanks for reading. :)