What's new in Virtaal 0.6.1
I missed writing a post about Virtaal 0.6.0, so I'll cover both the major release and the 0.6.1 bugfix release together. For those not in the know, Virtaal is the Computer Aided Translation (CAT) tool that we've been developing as part of the ANLoc project.
Our aim with Virtaal continues to be a simple, clean interface that still presents powerful features to translators. We seem to be doing the right thing, judging by the following comments from a recent review of Virtaal: "It’s clean interface and ease of use are the best virtues of this application. ... there [are] NO extra buttons, and the layout looks like a side-by-side sheet presentation. Beautiful. It also allows access to machine translation services such as Google, Moses and Opentran. Other features include highlighted diffs between the translation memory suggestions, a don’t-touch-your-mouse approach, and much more."
<!--break-->
So what did we add to 0.6.* version of Virtaal? Let's have a look.
The most notable change in Virtaal 0.6.0 is the new welcome area. In earlier versions of Virtaal, new users were faced with the question "What now?" as they opened the tool and saw a blank screen. Since Virtaal has a very clean interface, there are no hints about what the application does: no unused panes for TM or glossary entries. We realised that we could actually use this space to enhance usability and help newcomers. In true Virtaal fashion we avoided adding a splash screen or a tip-of-the-day dialogue. Instead we developed the following welcome screen, and we hope that you like it.
The welcome screen is not meant to be just a pretty face; we want it to be really useful for the translator. As you can see, it gives easy access to previous translations, guides and other help, so both the seasoned translator and the newbie are helped quickly.
We hope to add other features to the welcome screen in future versions, hopefully emerging as a dashboard of sorts where we can show the state of work and activities currently in progress.
New and improved Machine Translation plugins
We added support for Microsoft Translator (or Bing Translator, as it is sometimes called). You may recall that we did a special release of this plugin on Windows to allow translators to translate into Haitian Creole at the time of the Haitian earthquake. A recent study comparing the Google, Yahoo Babelfish and Microsoft Bing MT services suggests that for short texts Microsoft and Yahoo may offer better results.
We've supported Apertium, the FOSS rule-based Machine Translation engine, for a long time now. Apertium recently created a new service API that mostly mimics Google's MT API, and we've adapted the Virtaal plugin to use it. While most other MT engines are statistical, Apertium uses a rule-based approach, so for the languages that Apertium supports it might offer better MT suggestions than statistical MT services.
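For the curious, here is roughly what a call to that service looks like from Python. Treat this as an illustrative sketch: the endpoint, parameter names and response shape below follow our reading of the Google-style API and may change, so check the current Apertium documentation before relying on them.

```python
import json
import urllib.request
from urllib.parse import urlencode

def apertium_translate(text, source, target):
    """Sketch of a request to the Apertium web service.

    The URL and parameters mimic Google's old AJAX Language API;
    both are assumptions about the service as described above.
    """
    query = urlencode({"q": text, "langpair": "%s|%s" % (source, target)})
    with urllib.request.urlopen(
            "http://api.apertium.org/json/translate?" + query) as response:
        data = json.loads(response.read().decode("utf-8"))
    # A successful response carries the translation in responseData,
    # just as Google's API did.
    if data.get("responseStatus") == 200:
        return data["responseData"]["translatedText"]
    return None

print(apertium_translate("house", "en", "es"))
```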
Improved format support
Virtaal uses the Translate Toolkit to support various localisation formats. With this release we integrate support for OmegaT glossary files: you can now edit these directly in Virtaal instead of in a spreadsheet or word processor. We hope this leads to more reuse of terminology.
An XLIFF file can provide alt-trans entries, and Virtaal will now display these in the suggestion dropdown. In the screenshot below you can see a suggested translation, provided by the user 'admin', as the first entry.
When you work in Pootle with XLIFF files, you will now be able to review suggestions offline. XLIFF files supplied to you might also contain alt-trans entries with MT and TM suggestions; these can now also be seen while you translate.
In case you've forgotten: Virtaal can edit Qt Linguist .ts files, so you can translate pretty much any FOSS application in Virtaal. With this release we fixed some bugs relating to plural support in newer TS files, so we should be able to handle any file currently in the wild.
New languages, improved language features and language related bugs
A translation tool that isn't itself translated? Not Virtaal! We're proud to see a growing number of people contributing translations to Virtaal. We've added Bulgarian, Icelandic and Thai, and of course many other translations have been updated. Virtaal is now translated into 40 languages.
Virtaal running with Translate Toolkit versions newer than 1.7.0 can detect your target language from the 'Language-Team' header entry in your PO files, so your language pair selection will almost always be correct from the start.
We now have better interaction with the Voikko backend of Enchant, and improved autocorrect data for Polish (yes, we do autocorrect using OpenOffice.org data files). We've also added a workaround for GNOME bug 569581 (Windows US international layout, Afrikaans 'n).
Accessibility
We worked hard in this release to make sure that Virtaal works well in high contrast modes to assist people with visual disabilities.
The following before and after pictures show the changes in a High Contrast Inverse theme. While the changes look small, it's worth realising that the tool was previously unusable for anyone who needs an inverse colour scheme to use a computer. You will notice that the text input area now renders properly as light on dark. You can't see it here, but we also made sure that the placeable colours, placeable highlighting and terminology colours all work in inverse.
Other improvements
- Virtaal has a very good system for handling placeables. We've now made it possible to select placeables from the plural forms in the source, and to cycle through the placeables back to selecting the whole source text once you've moved through them all.
- Support for proxy servers - Virtaal just didn't work in university labs; hopefully this provides enough support for most cases.
- Reduced flickering in the editing area - stepping through large units in Virtaal produced too much flicker; now we are gentle on the eye.
- Use the most frequent word as the autocomplete suggestion - we weren't always giving you the best autocomplete suggestion; now we do.
- Better handling of errors in the Open-Tran service - Open-Tran.eu has been down quite a lot recently and we were getting XMLRPC errors; these are now all caught.
You can read the release notes for other minor bugs that were fixed in 0.6.0 and 0.6.1.
Localisation: How we guess the target translation language in Virtaal
In Virtaal, our desktop Computer Aided Translation (CAT) tool, we have a number of usability goals. One of them is to limit the configuration required to use the tool. Most of us think nothing of setting the target translation language in our CAT tool when asked. But we've always asked the question: can't the CAT tool work this out itself?
In this post I'll talk about how we've been able to correctly determine the target language for about 87% of the localisation files on a typical Linux system.
<!--break-->
Most translators, who work in one language and one direction, are probably wondering why this is an issue. But anyone who translates in both directions, translates a number of languages, or manages a number of translation teams will understand just how important this feature is: when they open a file, their language settings change and are simply correct.
The feature allows the CAT tool to configure itself without any intervention from the translator, apart from the simple act of opening a file for translation. But even a single-language translator benefits from this feature, since translators often examine other translations to see how someone else translated the source text. In that case Virtaal's settings change for the quick lookup and change back when the real translation begins, all without the translator doing anything.
I personally review a number of translated languages. I like using Virtaal because it simply reconfigures itself to the target language when I open a file. Mostly I don't even need to check that the selected target language is correct: my Machine Translation, Translation Memory, terminology and spell checking are automatically enabled for the correct target language.
A little history and some background information
We've been building this language guessing into Virtaal for some time now; our aim is to do the right thing with minimal user input. On first run, Virtaal tries to determine the target language by examining the environment, which mostly means looking at your locale. This was our first effort to get the language right.
The Translate Toolkit, on which Virtaal is built, allows us to determine the source and target languages of a number of file formats (TMX, XLIFF, Qt). Thus, once we load a file, we can look at the file metadata to determine the language pair. But this doesn't work for PO files, since there is no target language information in the header.
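To make this concrete, here is a minimal sketch using the Translate Toolkit's storage factory. The file name is hypothetical; the accessors simply return nothing when the format carries no language metadata, which is exactly the PO gap described above.

```python
from translate.storage import factory

# factory.getobject picks the right storage class based on the file
# extension (XLIFF, TMX, Qt .ts, PO, ...).
store = factory.getobject("translation.xlf")

# These read the language metadata that the format carries. For a PO
# file without language information in the header they tell us
# nothing, which is exactly the gap described above.
print(store.getsourcelanguage())
print(store.gettargetlanguage())
```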
The missing target language information in Gettext was why I proposed that we add a language header to Gettext PO files. Fortunately this idea was accepted upstream and it has been implemented in Gettext. However, we're still waiting for this new version of Gettext to be released and once released we'll still need to wait quite some time for it to gain wide adoption.
So while we waited for Gettext 0.18 to be released, we implemented ngram matching as another way to guess the target language. This works quite well, but it needs a language model for each language we want to guess. Ngrams remain useful in Virtaal: we add the ngram-guessed language to the language pair chooser, so even if the target language is incorrectly indicated, the ngram-guessed pair will still appear in the chooser list.
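The technique is simple enough to sketch: build a profile of the most frequent character trigrams for each language, then rank candidate languages by the classic Cavnar and Trenkle "out of place" distance between the text's profile and each model. The toy code below illustrates the idea; it is not Virtaal's actual implementation, and real models need far more training text.

```python
from collections import Counter

def trigram_profile(text, size=300):
    """Map each of the most frequent character trigrams to its rank."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {gram: rank for rank, (gram, _) in enumerate(grams.most_common(size))}

def out_of_place(profile, model, penalty=300):
    """Cavnar/Trenkle distance: how far each trigram's rank has moved."""
    return sum(abs(rank - model.get(gram, penalty))
               for gram, rank in profile.items())

def guess_language(text, models):
    """Return the language whose model is closest to the text's profile."""
    profile = trigram_profile(text)
    return min(models, key=lambda lang: out_of_place(profile, models[lang]))

# Toy models built from tiny samples; real models need far more text.
models = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog " * 50),
    "af": trigram_profile("die vinnige bruin jakkals spring oor die lui hond " * 50),
}
print(guess_language("the translation of this file", models))  # en
```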
Realising that we couldn't wait for Gettext 0.18 to be released and then filter down into distributions over the next one to two years, we looked at other ways to more reliably determine the target language from information in the file header.
Language-Team header analysis
We looked at analysing the Gettext 'Language-Team' header entry to help determine the target language. For this analysis our script msgunfmt'ed the 15,000+ MO files on my Fedora 12 installation, creating a long list of potential Language-Team headers that we then ran through our guesser. We added information and improved the guesser as we identified patterns in the extracted headers.
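A rough sketch of that kind of harvesting looks like this; the locale path and the header regex are illustrative assumptions.

```python
import glob
import re
import subprocess

teams = set()
for mo in glob.glob("/usr/share/locale/*/LC_MESSAGES/*.mo"):
    # msgunfmt turns a compiled MO file back into PO text.
    po_text = subprocess.run(["msgunfmt", mo], capture_output=True,
                             text=True).stdout
    # Header entries look like: "Language-Team: French <fr@li.org>\n"
    match = re.search(r'"Language-Team: ([^\\"]*)', po_text)
    if match:
        teams.add(match.group(1))

for team in sorted(teams):
    print(team)
```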
In the analysis we found the following:
- A Language-Team of English is almost always a false positive, e.g. "Kannada <en@li.org>": an English email address for the Kannada team is unlikely.
- Small languages almost always get this header wrong, e.g. a Hawaiian translation with the header "English <en@translate.freefriends.org>".
- Some umbrella translation projects don't distinguish between the languages they are translating. This mostly affects Indic languages, e.g. "<info.gist@cdac.in>" is used for a number of Indic translations.
- Some projects use generic contact information. Examples include wxWidgets, Novell, Compiz and openSUSE. Technically there is nothing wrong with this, and we can work around it if the actual target language is mentioned, but often it isn't.
- Even with these issues, we can safely guess 87% of the target languages from the headers with minimal false positives.
In the cases where we can't guess the language, we're almost always dealing with missing or default header information, English headers, or personal email addresses that we've excluded.
Here are some of the details of our analysis:
- Analysed 15,244 MO files.
- Could not classify 7.5% (1,133).
- Incorrect language classification for 5.5% (848) of the files. Many of these are cases where translators indicated regional variants, e.g. de vs de_DE, af vs af_ZA, bn vs bn_IN, or different encodings, e.g. sr vs sr@latin.
- Only 1.8% (287) are true misclassifications. Most of these are due to incorrect language information in the headers, which probably says more about the reliability of the data than it highlights any real problem.
So, combining this data, we can safely and correctly guess 87% of the language teams based simply on the team header. We expect 5.5% to be incorrect or to miss the regional and encoding information, and we can't guess 7.5% of the headers.
Even though we'll guess some target languages incorrectly, the translator can still set the target language within Virtaal. This lets them correct any bad classification and also ensures that, when saved, the file uses the correct Gettext 'Language' header, so we won't need to guess the language again.
How does our guesser use the Language-Team header to guess the target language?
Our analysis of existing headers helped us build the actual Language-Team guesser. We guess the target language as follows (a simplified sketch follows the list):
- First, before we even try to analyse Language-Team, we look for the Language header, then for headers used by Poedit. These headers are likely to be correct, since users set them specifically to indicate their target language. If we don't find those headers, we move on to the Language-Team analysis.
- Our first step with the Language-Team header is to check a number of regular expressions for common language team email addresses. Thus "<fr@li.org>" is easily identified as French. Using regexes also future-proofs the guesser, letting us detect teams that emerge later.
- Then we check snippets of contact information, which are almost always email addresses and sometimes URLs. These are essentially team contacts that can't be detected with our regular expressions.
- Lastly, we use snippets of language names, both in English and in the target language, e.g. Dutch and Nederlands.
- If all of that fails, we give up guessing.
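Here is the promised simplified sketch, covering only the Language-Team steps (the Language and Poedit header checks come before it). The real guesser lives in team.py with far larger pattern and snippet tables; the regex and sample snippets below are tiny illustrative stand-ins.

```python
import re

# A common team mailing list shape, e.g. "French <fr@li.org>"; the
# real guesser uses many more patterns than this one.
TEAM_EMAIL_RE = re.compile(r"<([a-z]{2,3})@li\.org>")

# Tiny stand-ins for LANG_TEAM_CONTACT_SNIPPETS and
# LANG_TEAM_LANGUAGE_SNIPPETS in the real team.py.
CONTACT_SNIPPETS = {"vertaling@example.org": "nl"}  # hypothetical contact
LANGUAGE_SNIPPETS = {"dutch": "nl", "nederlands": "nl"}

def guess_from_team(header):
    """Guess a target language code from a Language-Team header value."""
    # 1. Regular expressions for common team email addresses.
    match = TEAM_EMAIL_RE.search(header)
    if match and match.group(1) != "en":  # English is usually a false positive
        return match.group(1)
    # 2. Snippets of known contact information (addresses and URLs).
    for snippet, code in CONTACT_SNIPPETS.items():
        if snippet in header:
            return code
    # 3. Language names in English and in the language itself.
    lowered = header.lower()
    for name, code in LANGUAGE_SNIPPETS.items():
        if name in lowered:
            return code
    # 4. Give up.
    return None

print(guess_from_team("French <fr@li.org>"))                  # fr
print(guess_from_team("Nederlands <vertaling@example.org>"))  # nl
```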
You can see this in action if you run a recent version of Virtaal with Translate Toolkit 1.7.0 (which was released on 2010-05-12). Windows users will need to wait for a new release of Virtaal (>v0.6.0).
How can you help?
We think we've got most of the data sorted out; if you can help us reduce the 5.5% misclassification and the 7.5% unclassifiable entries, that would be great.
If you are a translator, please have a look at our team.py file and check that your team's email address (see LANG_TEAM_CONTACT_SNIPPETS) and your language name, variants or other defining information (see LANG_TEAM_LANGUAGE_SNIPPETS) are listed.
But probably the easiest and best way you can help is to use a good localisation tool, such as Virtaal, Pootle or Poedit, that captures the target language information in the header. The next best thing is to use very standard contact information for your team so that it's easy to guess your language.
Continuous integration, can it work for software localisation?
At Translate.org.za we want to keep delivering the best FOSS localisation tools. To do that we've started using Continuous Integration (CI) in the development of Pootle, Virtaal and the Translate Toolkit. We're using a tool called Hudson to manage our CI process.
Since the tools that we develop are all focused on localisation we thought, "Wouldn't it be great if we could use CI to continuously check our translations?". I hope that you will start to use some of our scripts, or your own, to ensure that localisation is part of your CI build process.
<!--break-->
The problem
Since we build localisation tools we pride ourselves on doing localisation well. But even we've made a few mistakes along the way, mistakes like:
- Shipping broken translation files. There is nothing quite as frustrating as sending out an application that breaks because of a typo in the translation of a variable. The cost of fixing the issue and releasing a bugfix build is just too much for a small development team. We want to focus on cool new features; we'd rather not fix a bug that we could have caught with CI.
- Text missing from the translation files. We work with string freezes and try hard not to change things while in freeze. So nothing hurts as much as discovering that a feature you added many months ago is not actually present in the new translation files, and realising that you are about to release a feature that will only be in English. Now you must break the string freeze and get the new files to translators, with a lot of communication overhead; for translators it means updating translations they have just completed, and they might not have the time. These are simple steps, but they carry a lot of overhead, and with so many people involved there is real potential for further errors. So we want to make sure that when we enter string freeze, everything we want translated is ready for translation. We'd rather not break string freeze simply because we forgot to add a file to POTFILES.in.
- Broken XML file building. Since we use intltool, we build some files (mimetype XML and .desktop files) from our translations. We don't need to run this step very often; so infrequently, in fact, that we might only run it as we prepare a release. We'd like to catch errors in the building of these files when they occur, not just before the release.
We want to apply CI to our localisations because we're not machines: we want to be able to forget about localisation issues while we work towards a release. We want to know that our code is always ready for localisation and that our localisations are always 100% technically correct. We don't want surprises, and we want to fix errors when they occur.
We've managed to achieve this.
As you can see above, we have a Hudson job called validate-translations that runs a number of localisation-related build steps.
The solution to catching technical localisation errors
We run intltool as part of the build process, both to catch files that aren't being extracted for localisation and to build the mimetype and .desktop files from the translations. That part was easy. The harder part was making sure that the translations being committed are correct; for that we built a more elaborate script around the Gettext tools.
Hudson can monitor errors reported in the JUnit XML format. Our solution was a simple bash script that exercises Gettext's msgfmt command and outputs the results as a JUnit XML file. The script is simple: for each PO file it finds, it runs msgfmt -cv and captures any errors so that we can easily fix them when we review the results.
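The script we actually use is a small piece of bash; for readers who prefer to see the idea spelled out, here is a Python sketch of the same approach. The report file name and the exact JUnit layout are assumptions.

```python
import glob
import subprocess
from xml.sax.saxutils import escape

testcases = []
for po in sorted(glob.glob("po/*.po")):
    # msgfmt -cv runs the full syntax and format-string checks.
    result = subprocess.run(["msgfmt", "-cv", "-o", "/dev/null", po],
                            capture_output=True, text=True)
    if result.returncode == 0:
        testcases.append('  <testcase name="%s"/>' % po)
    else:
        # Keep msgfmt's error output so the failure is diagnosable
        # straight from the Hudson report.
        testcases.append('  <testcase name="%s">\n    <failure>%s</failure>\n'
                         '  </testcase>' % (po, escape(result.stderr)))

with open("msgfmt-results.xml", "w") as report:
    report.write('<testsuite name="msgfmt">\n%s\n</testsuite>\n'
                 % "\n".join(testcases))
```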
Feel free to use the JUnit XML script for PO files within your own Hudson jobs.
Since starting this CI process we've seen good results.
As you can see above, we fixed 20 msgfmt errors over just three builds. More importantly, we can now safely modify our code and know that our CI will catch any localisation issues.
So what is next for CI and localisation?
At the moment we simply catch msgfmt errors; we will be looking to add the following:
- PO file snippet - it would be easier to find and fix the errors we detect if we had the snippet of PO that caused each error. Currently we only have the line number and must first find that line in the PO file before we can even see what is causing the error. With the snippet we could make the full diagnosis while reviewing the Hudson test failure report.
- pofilter checks - the Translate Toolkit has a number of checks (47, in fact) that catch technical localisation errors. We'd like to generate XML test result files that show those errors. pofilter's output is very useful for human review, but we'll need to create a method to mark false positives that we wish to ignore in future test runs.
- pocount - we want to count the translation status of a group of PO files. You might wonder why. The reason is that many projects only ship translations that meet some level of completeness; for Virtaal, our Computer Aided Translation tool, we set that threshold at 75% complete. With pocount we should be able to automate this so that a test fails when a translation falls below the threshold. By comparing the files that meet the threshold with the files listed in a LINGUAS file (the file that lists all shipped translations), we can raise an error when a new file needs to be added to LINGUAS to ensure it's shipped, and likewise when an existing translation falls below the threshold and needs to be removed from the list of shipped localisations. Then there is no risk of shipping incomplete translations or of forgetting to ship a new translation. A sketch of such a check follows below.
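None of this exists yet, but the completeness check is easy to sketch. The code below counts translated units directly with the toolkit's PO storage rather than calling pocount; the 75% threshold and the po/LINGUAS layout are the assumptions described above.

```python
import glob
import os
from translate.storage import pofile

THRESHOLD = 0.75  # Virtaal's shipping threshold

with open("po/LINGUAS") as linguas:
    shipped = set(linguas.read().split())

for path in sorted(glob.glob("po/*.po")):
    lang = os.path.splitext(os.path.basename(path))[0]
    store = pofile.pofile.parsefile(path)
    units = [u for u in store.units if not u.isheader() and not u.isobsolete()]
    if not units:
        continue
    complete = sum(1 for u in units if u.istranslated()) / float(len(units))
    if complete >= THRESHOLD and lang not in shipped:
        print("FAIL: %s is %.0f%% complete but not in LINGUAS"
              % (lang, complete * 100))
    elif complete < THRESHOLD and lang in shipped:
        print("FAIL: %s fell to %.0f%% but is still in LINGUAS"
              % (lang, complete * 100))
```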
I'll try to post new blog entries when we add some of these new features or scripts to our own build process.