Some questions from new developers
I am a developer and have just added a new project. I have read the FAQ and have some questions I would like to ask.
- Can we use uppercase letters, punctuation, or even non-English text in message IDs?
- Due to lack of experience, I found that some messages need their IDs changed. Is there a good way to change an ID directly without changing the translation? (FAQ#Is it possible to rename message keys?)
- How do I remove wrong messages? Or, as mentioned in FAQ#Do I have to do anything special when deleting messages?, do I just remove them from the project repository?
- What happens if a translation in the project repository conflicts with a translation on translatewiki, when they are both modified and different?
- If I create a new message directly on translatewiki, will it also be exported to the project repository?
- Is there a way to observe all changes to a project?
Thank you for your patience in reading.
I can already say that you must avoid characters not allowed in resource names in URLs (notably ?#) or that could break the parsing of URLs or wiki pagenames (notably leading/trailing or duplicated / or whitespace, or segments like /./ or /../ or []{}).
But "anchor-encoded" IDs are generally safe (note that they may use punctuation such as .-). Beware of underscores, which are treated like spaces in MediaWiki pagenames. Also please avoid equals signs = in message IDs; otherwise it is difficult to reference these messages (for example with {{Msg-mw|id}} in doc pages or in talks), and doing so sometimes requires additional escaping (which may not work as expected in external tools or editors, or in import/export tools).
Ideally, these identifiers should be usable directly as anchors and as valid titles for MediaWiki pagenames (but be careful about namespaces, so also avoid the : if it follows some letters, words or numbers without any other punctuation), and should also be safely usable as JavaScript or CSS identifiers without needing additional re-encoding. This means using a subset of printable ASCII (yes, ASCII punctuation can cause problems in various processing tools or UI elements), or otherwise non-Latin characters that are valid inline plain text and correctly encoded in Unicode with UTF-8 (preferably in NFC form).
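As a rough illustration of the constraints above, here is a minimal Python sketch of a checker. The function name `id_problems` and the exact rule set are my own assumptions distilled from this reply; this is not an official translatewiki.net or MediaWiki validator.

```python
import re
import unicodedata

# Hypothetical rule set distilled from the advice above (an assumption,
# not an official list): characters that break URLs, wiki links, or templates.
FORBIDDEN_CHARS = set("?#[]{}=")


def id_problems(message_id: str) -> list:
    """Return a list of human-readable problems found in a message ID."""
    problems = []
    bad = sorted(FORBIDDEN_CHARS & set(message_id))
    if bad:
        problems.append("forbidden characters: " + "".join(bad))
    if any(ch.isspace() for ch in message_id):
        problems.append("contains whitespace")
    if "_" in message_id:
        problems.append("underscore is treated like a space in MediaWiki pagenames")
    if message_id.startswith("/") or message_id.endswith("/") or "//" in message_id:
        problems.append("leading, trailing, or duplicated slash")
    if any(segment in (".", "..") for segment in message_id.split("/")):
        problems.append("path segment '.' or '..'")
    # Letters or digits directly followed by ':' may be read as a namespace.
    if re.match(r"^\w+:", message_id):
        problems.append("prefix before ':' may be read as a namespace")
    if unicodedata.normalize("NFC", message_id) != message_id:
        problems.append("not in NFC form")
    return problems
```

An ID such as `login-button.title` passes with no problems, while `a=b`, `Help:intro`, or `a/../b` are each flagged.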
You should also avoid using overlong IDs. Usually, translation units limit their length to a reasonable size (about 70 characters), then replace whitespace with hyphens, then drop duplicate punctuation, and if needed can append a computed hash (encoded in hexadecimal or Base64) of the initial unfiltered string, to avoid conflicting IDs. A basic shell script can do that computation automatically for you:
- normalisation (including normalised newlines: CR vs. LF vs. CR+LF) and basic filtering (trailing whitespace at the end of lines, or trailing newlines at the end of the text, should not be present in translation units, though leading whitespace may be) to get the basic text to be translated.
- computing the 1st part of the ID from the basic text (filtering undesired sequences with some regexp, reducing whitespace, and finally truncating to the maximum length).
- computing the 2nd part of the ID, also from the basic text (computing a CRC or SHA hash, and representing it in hexadecimal or Base64 as a substring of about 8 characters); this hash must be appended to the ID after a separator (as if it were an additional word), and only if the 1st part generated differs from the basic text.
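The three steps above can be sketched as follows. This is written in Python rather than shell for readability, and the names `normalize_text` and `make_id`, the ASCII-only filter, and the choice of SHA-256 are assumptions for illustration only; adapt them to your project's conventions.

```python
import hashlib
import re

MAX_LEN = 70  # "about 70 characters", as suggested above


def normalize_text(text: str) -> str:
    """Step 1: unify newlines and strip trailing whitespace."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip() for line in text.split("\n")]
    # Leading whitespace is kept; trailing newlines are dropped.
    return "\n".join(lines).rstrip("\n")


def make_id(text: str) -> str:
    """Steps 2 and 3: readable first part, plus a short hash if needed."""
    base = normalize_text(text)
    # Assumed filter: every run of non-alphanumeric ASCII becomes one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", base).strip("-")
    slug = slug[:MAX_LEN].rstrip("-")
    if slug == base:
        # The text was already a valid ID, so no hash is appended.
        return slug
    digest = hashlib.sha256(base.encode("utf-8")).hexdigest()[:8]
    return f"{slug}-{digest}" if slug else digest
```

With this sketch, `make_id("Save")` stays `Save`, while `make_id("Save file")` becomes `Save-file-` followed by 8 hexadecimal digits. Because the hash is computed from the normalised text, inputs that differ only in newline style or trailing whitespace map to the same ID.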
Your translation module can then create translation units and should preserve the generated IDs in the pack (e.g. as special comment lines in a .po source file suitable for GetText, or as XML attributes, or as index keys in JSON tables).
You can choose whichever ID scheme is most convenient for your project and allows simple mapping between different requirements (e.g. you can treat dashes, spaces and underscores as equivalent in IDs, which gives you flexibility when integrating these IDs into environments that have different requirements for the identifiers they support).
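For example, treating dashes, spaces and underscores as equivalent could look like the sketch below. The helper names are hypothetical, and the hyphen as canonical separator is just one possible choice.

```python
import re


def canonical_key(message_id: str) -> str:
    """Collapse runs of spaces, underscores, and hyphens into one hyphen."""
    return re.sub(r"[ _-]+", "-", message_id.strip(" _-"))


def as_code_name(message_id: str) -> str:
    """Variant for environments that only accept underscores."""
    return canonical_key(message_id).replace("-", "_")
```

Here `canonical_key("save_file")`, `canonical_key("save file")` and `canonical_key("save-file")` all give `save-file`, so the same message can be looked up whichever separator a given environment forces on you.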
If you look at existing supported projects in this wiki, there are different naming schemes; they are not all written to be processed by MediaWiki or deployed into wiki pages: you'll find Python, Java, C/C++, PHP, Ruby, Lua... And MediaWiki is not the only markup language that the target projects use (you can also find Markdown, for example).
It's up to you, inside your project, to develop the simple scripts that will convert the formats and generate the traceable IDs that you and this TWN site will use to cooperate. These scripts will extract the needed text from your source, compute IDs, update your sources (if needed) to insert these IDs as calls to your own internal i18n support function, manage the history of the generated source bundles, convert them to a format supported by TWN and import them, regularly export the translated pack from TWN, and import it back into your project repository. It's also up to you to define who will run these scripts and who will be allowed to merge data back into your repository to build your updated app.
Thanks for the comment. I will write my own widget to handle the translation of the messages.
Anyway, I'm still curious if I can use something like Chinese as the message id, after all the non-English wiki also uses non-English as the page title.
Also, is there a way to observe all the changes in a project?
Chinese IDs are possible, notably if your project uses Chinese as its primary development language. I've not seen many projects using them with MediaWiki, but I have with other open projects.
Technically nothing would prohibit using Arabic as well, or Hebrew (except that Bidi reordering creates additional difficulties if these identifiers are not "isolated" or surrounded by enclosing pairs of punctuation like parentheses or brackets. Even with the MediaWiki syntax, the use of brackets/braces for enclosing links to page names, transcluding templates, or calling parser functions causes problems when the link includes a pipe and a display label in another language/script that would itself need to be isolated: we frequently see contributors having difficulty typing them, because of the default Bidi reordering and the absence by default of any isolation for proper input order).
With Chinese identifiers, there's no Bidi problem. The problem is that translators often have to be able to refer to messages easily, and can't easily type them unless they have a Chinese IME and know how to use it. On the other hand, almost everyone can type ASCII on their keyboard, but then the translation interface must always provide a way to display and copy these identifiers for reference (in MediaWiki these identifiers should be suitable as wiki pagenames).
I suggest you look at the Unicode specifications and JavaScript specifications for "internationalized identifiers" and which characters are suitable. Unicode (and CLDR) define character properties.
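As a quick way to explore these properties, Python's standard library exposes some of them. The sketch below uses `str.isidentifier()`, which follows Python's own identifier rules (based on the Unicode XID_Start/XID_Continue properties); JavaScript's rules are similar but not identical, so treat this only as a first approximation. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch: str) -> str:
    """Show the code point, Unicode name, and general category of a character."""
    name = unicodedata.name(ch, "?")
    return f"U+{ord(ch):04X} {name} category={unicodedata.category(ch)}"


def is_unicode_identifier(candidate: str) -> bool:
    """True if the string is a valid identifier under Python's Unicode rules."""
    return candidate.isidentifier()
```

For instance, `is_unicode_identifier("消息")` is true while `is_unicode_identifier("save-file")` is false, because the hyphen belongs to a punctuation category (`Pd`) that is not allowed in identifiers.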
Things are simpler if identifiers are limited to characters that everyone can type on their keyboard, and if they start and end with characters of strong directionality (i.e. not punctuation or symbols with weak direction, such as quotation marks or hyphens/dashes/underscores, which may still appear inside identifiers). C/C++/Java identifiers allow underscores anywhere, ignoring this constraint and causing problems for users of RTL scripts. Other languages also allow significant leading or trailing punctuation or symbols like $ or %, and these cause problems when embedding the identifiers in other contexts, forcing the use of escaping systems that depend on each environment of use. If possible, identifiers should work without needing these escaping mechanisms, and the namespace in which they are defined and used should also provide some useful equivalences, e.g. between underscores and hyphens used in the middle of identifiers, plus some alternate punctuation for namespace separation. Identifiers or composite identifiers should ideally respect such constraints and avoid characters that require a complex IME; otherwise the project using them will stay largely confined to a smaller set of contributors or translators who can work with them. Unfortunately this limits most projects to simpler alphabets, possibly augmented with basic diacritics, and requires some discipline in the use of separators in the middle of identifiers.
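The "strong directionality at both ends" rule can be checked with the Unicode bidirectional class of the first and last characters. This is a minimal sketch with a hypothetical function name; note that digits have a weak class (`EN`), so under this strict reading an ID ending in a digit is rejected too.

```python
import unicodedata

# Bidi classes with strong direction: left-to-right, right-to-left,
# and Arabic letter.
STRONG_CLASSES = {"L", "R", "AL"}


def has_strong_edges(identifier: str) -> bool:
    """True if the first and last characters both have strong direction."""
    if not identifier:
        return False
    first = unicodedata.bidirectional(identifier[0])
    last = unicodedata.bidirectional(identifier[-1])
    return first in STRONG_CLASSES and last in STRONG_CLASSES
```

Under this check `save-file` is fine (the hyphen is in the middle), while `_private`, `$total` and `rate%` are not, which matches the leading/trailing symbols described above.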
You can have other views, but given that you are asking this here, I think you should consider the needs of real internationalisation with users of any language: do you want to maximize the translatability of your project, or to have it translated only into languages for which there is good enough interaction and sufficient cooperation with Chinese users?
Maybe your project could also support alternate identifiers (defined aliases) written in alternate scripts, by using an internal registry of these equivalences. For now, however, translatewiki.net has no way to be aware of these equivalences and recognize aliases, even if there is limited support through redirects for wiki page names to which these identifier aliases may be mapped.
Thank you for the explanation.
Anyway, for search purposes, it is best to use English. However, sometimes I may need to change a message ID if there is an error. There seems to be a function in translatewiki to automatically search for similar phrases, so maybe changing the message ID won't cause too many problems.
Yes, there is such a function; however, contact the site admins for details. I've not used it, but I've seen that there are some known issues and ways to do it correctly. Note that for now, when a message gets renamed or when a message group is split into two separate ones, the moved messages may need to be reviewed again, causing additional work for translators. I've seen a few Phabricator tasks about this.