Remove 'sh' (Serbo-Croatian) language ?

This is a follow-up to an IRC discussion I had with Nike about a year ago; I'm posting it here since I got no feedback on IRC despite several pings.

TranslateWiki.net maintains a MantisBT translation file strings_sh.txt (added back in 2019). There are several issues with it:

The file name does not follow our naming convention (should be strings_serbo_croatian.txt)
Language code sh is deprecated according to https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
The language is not defined in MantisBT configuration ($g_language_choices_arr, $g_language_auto_map)
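
For illustration only, a check along these lines could flag such stray files. This is a rough sketch: the file-name pattern and the sample set of configured language names are assumptions made up for the example, not read from the actual MantisBT configuration.

```python
import re

# Hypothetical subset of the names configured in $g_language_choices_arr.
CONFIGURED_LANGUAGES = {"english", "french", "german", "serbian", "croatian"}

# Assumed naming convention: strings_<language_name>.txt
FILE_PATTERN = re.compile(r"^strings_(?P<name>[a-z_]+)\.txt$")


def unconfigured_language_files(filenames, configured=CONFIGURED_LANGUAGES):
    """Return translation files whose language name is not configured."""
    stray = []
    for filename in filenames:
        match = FILE_PATTERN.match(filename)
        # Files that do not look like translation files are ignored.
        if match and match.group("name") not in configured:
            stray.append(filename)
    return sorted(stray)


files = ["strings_english.txt", "strings_sh.txt", "strings_french.txt"]
print(unconfigured_language_files(files))  # ['strings_sh.txt']
```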

Reference: original MantisBT issue including IRC chat transcript

-- Damien‎

bump...

-- Damien‎

sh works. sc is an alternative. Sardinian cannot be sc.

Obsuser (talk)‎

You are wrong: "sc" is the standard code for Sardinian in ISO 639-1, BCP 47, Wikimedia wikis and MediaWiki translations (do not assume it is usable for Serbo-Croatian).

"sh" has been removed from ISO 639-1 only (but "hbs" has not been removed from ISO 639-3!), and not from BCP 47, where "hbs" was mapped to be the same as "sh" and which more or less still treats it as an alias of "sh-Latn" (with the Latin script implied by default). It has been kept in ISO 639-3 as a macrolanguage (containing "hr", "cnr" aliased by default to "cnr-Latn", and "bs" aliased by default to "bs-Latn"). It is special compared to other macrolanguages because of its implied default script, but "sh-Cyrl" is also valid (and comprises "bs-Cyrl", "cnr-Cyrl", and "sr" aliased to "sr-Cyrl").

The old decision, taken in ISO 639-1 only, is very unfortunate, given that "hbs" has been kept in ISO 639-2 and ISO 639-3 (including in the later revision where it was assigned the "scope" of a "macrolanguage"). The code may have been retired from ISO 639-1 not for technical or translation purposes, but probably only for bibliographic use (yet many public libraries in the world have not removed that classification from their book archives, notably not for books published in the former Yugoslavia!). That decision was also made before ISO 639-3 was released to define the concept of "macrolanguages", and before the revision of BCP 47. At that time ISO 639-1 and ISO 639-2 were a mess: they were unstable, and most applications chose to ignore ISO 639 and developed their standards to reference BCP 47 and its related IANA databases for language tags instead. ISO 639-3 attempted a more comprehensive codification and had to fix some codes by defining their scope; BCP 47 was revised to include "grandfathered" tags and to ensure stability and backward compatibility. Today nobody wants to make any normative reference to ISO 639, except for bibliographic purposes with simplified classifications that are not usable at all for translations and technical applications. This is the case for all IETF, W3C and ITU standards, as well as other ISO/IEC/ECMA standards and many national standards bodies, even if they sometimes decided to preserve their own legacy codes and made specific requests to ISO and the IETF to maintain stability.

ISO 639-1 still has very bad codes like "bh" (which is not even a macrolanguage but a family, mapped not to ISO 639-2 or -3 but to ISO 639-5 as "bih"; note that ISO 639-5 is still very incomplete for classifying language families). Likewise, Wikimedia still has its own legacy codes that violate BCP 47 (but they are slowly being retired and replaced). Wikimedia privately uses "bh" in its wiki domain names, assuming it refers only to "bho" (Bhojpuri), one of the languages of that family, but it uses conforming ISO 639-3 codes for the other languages of that family, so that is not blocking any project.

"sh" (Serbo-Croatian) is still valid in BCP 47, and many linguists (as well as many native speakers) consider it a single macrolanguage comprising "hr" (Croatian), "bs" (Bosnian), "cnr" (Montenegrin), and "sr" (Serbian), independently of the "Latn" or "Cyrl" script they may use, even though the languages were separated. They are basically dialects/variants of each other with excellent mutual understanding, with just some preferred forms in each of them and minor orthographic differences between locations. The orthographies in the two scripts are mutually interchangeable, which is why Wikimedia wikis provide an automated transliterator for reading/writing them in either script, just as a matter of user preference. Very few words are in fact localized specifically between these 4 languages, and none between the two scripts.

(Note that Wikimedia still uses some incorrect "sr-ec" and "sr-el" legacy codes instead of the standard "sr-Cyrl" and "sr-Latn" codes. This is only for its domain names and interwikis, not for HTML language tagging, which uses standard BCP 47 codes. Some properties in Wikidata also still use these legacy codes, but they are deprecated and should also be replaced by BCP 47 codes; this does not apply to sitelinks, whose usage in domain names (for wikis) and in interwiki prefixes does NOT violate the HTML standard. Wikimedia still has to clean up its local use of "nrm" instead of "nrf" for Norman, which severely conflicts with ISO 639-3 and BCP 47, as it blocks any attempt to translate to "Narom", a completely unrelated South-East Asian language.)

The best working standard for encoding languages used in translations is BCP 47 (i.e. RFC 5646 for its latest release, and its related IANA database). Let's forget ISO 639-1/2 completely! It will remain in the limbo of some public libraries with their old classification system, but many have converted their catalogs to use BCP 47 instead for language identification, plus possibly ISO 639-5 for a very weak classification of language families in book collections. If they need more precise classifications today, they can use BCP 47 "private-use" codes; they can also use ISO 15924, including script variants for written documents and artworks, even if these variants are unified in Unicode.

Finally, note that for translations we don't care at all about ISO 639 (and its many past deficiencies), only about BCP 47 (for which ISO 639 is only a partial and unstable source). This is not just for this wiki, MediaWiki or Wikimedia: it is a standard used everywhere on the web (part of HTML, for example, as well as almost all i18n libraries and the programming languages using them). Many things have disappeared or changed unilaterally in ISO 639, or have been rejected for use in BCP 47, which is a much more usable and more precise standard, where stability of language identification was part of the design and is preserved as much as possible. Some BCP 47 tags or subtags may become insufficient and may need to be requalified in newer documents, but any translation made with a valid BCP 47 tag will remain valid after any later update, except if there was a severe error and the tags are exceptionally marked as "discouraged" in the IANA database, which may or may not suggest a preferred replacement, if one is the most likely for the most common use cases.

ISO 639 was only defined for broad use by librarians, for classification and searches in their catalogs, or for managing copyrights in large categories according to their current practices for data exchange. Later, ISO 639 was partly updated to add "technical use", motivated by an attempt at compatibility with BCP 47 (but only with an old version, and it was never updated afterwards). ISO 639 later broke that technical compatibility multiple times. The BCP 47 data sources for adding entries to the IANA database have been publicly documented in multiple RFCs with rationales. That is not the case for ISO 639, whose decisions are closed and limited by copyright issues, with no details about how they were vetted; so ISO 639 is not a "best practice", only something endorsed by a few national ISO vetters for use in their national catalogs, who actually don't care at all about using precise and distinctive terminology.

Don't refer to ISO 639: it's not a normative reference, just an informal informative reference! Note also that various countries have stopped supporting ISO 639 (which they never approved themselves in the ISO TC) in their public catalogs and media libraries and have adopted BCP 47 instead; the same is true for many publishers, media creators and vendors.

Verdy p (talk)‎

Nike (or anyone else), could you please either

delete the sh language, or if it can't be removed, then
rename the MantisBT language file to strings_serbo_croatian.txt

so that I can (finally) close this issue? It's been pending for nearly 2 years since we initially spoke...

Thanks in advance !

-- Damien‎

sh has been removed from supported languages of MantisBT.

Nike (talk)‎

And the "sh" language should not be deleted (it is still valid in BCP 47 even if it has been retired ONLY from ISO 639-1) but ideally moved to "hbs" (still valid in ISO 639-2 and ISO 639-3, also valid in BCP 47). Wikimedia wikis, however, don't need to be renamed (in BCP 47, both "sh" and "hbs" are equivalent).

Other non-Wikimedia projects may opt not to support that language, but it has to remain for Wikimedia until Wikimedia chooses to deprecate it as a selectable user language (in the ULS), even if the "sh.*" wikis are kept (and the "sh:" interwiki code also remains valid to point to these wikis, just like, possibly, the "hbs:" interwiki prefix, which should be the new standard/canonical form).

For translations of labels in Wikidata items, "sh" should probably be deprecated in favor of "hbs", though both codes (in both scripts) may remain in use only as a common fallback, when individual codes for Serbian (both Latin and Cyrillic scripts), Bosnian (both scripts), Croatian and Montenegrin are not used.
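
To make the fallback idea concrete, here is a minimal sketch. The function names, the order of the macrolanguage codes ("hbs" before "sh") and the sample labels are my own assumptions for illustration; this is not how Wikidata actually resolves labels.

```python
# Individual languages covered by the Serbo-Croatian macrolanguage.
MEMBER_LANGUAGES = {"sr", "bs", "hr", "cnr"}

# Assumed order: the proposed canonical code first, then the older one.
MACRO_FALLBACKS = ("hbs", "sh")


def fallback_chain(requested):
    """List the codes to try, keeping any script subtag that was requested."""
    parts = requested.split("-")
    primary, rest = parts[0].lower(), parts[1:]
    chain = [requested]
    if primary in MEMBER_LANGUAGES:
        for macro in MACRO_FALLBACKS:
            chain.append("-".join([macro] + rest))
    return chain


def pick_label(labels, requested):
    """Return the first label found along the fallback chain, or None."""
    for code in fallback_chain(requested):
        if code in labels:
            return labels[code]
    return None


print(fallback_chain("sr-Cyrl"))  # ['sr-Cyrl', 'hbs-Cyrl', 'sh-Cyrl']
print(pick_label({"sh": "Beograd"}, "hr"))  # Beograd
```

The macrolanguage label is only reached when no label exists for the individual code, which is the "common fallback" role described above.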

Likewise, their legacy variant subtags (with "-ec" and "-el" instead of "-Cyrl" and "-Latn") should remain only for Wikimedia wiki domain names and interwikis. All other projects (as well as support for lang="*" attributes in HTML, xml:lang="*" pseudo-attributes in XML or SVG, or "lang(*)" selectors in CSS) should always use the standard subtags from ISO 15924. And in my opinion, even on this wiki, all translations should be done using the standard subtags. Wikimedia wiki domain names and interwikis, however, should support the aliases using standard tags.

Those legacy codes cause problems and unnecessarily complex maintenance: everything should work everywhere using only standard BCP 47 codes.
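
As a rough sketch of what normalizing to standard codes could look like, the table below only covers the legacy codes mentioned in this thread. The function name and the table are illustrative assumptions, not an existing MediaWiki API.

```python
# Legacy Wikimedia codes mentioned in this thread and their BCP 47 forms.
# Whole codes are listed explicitly rather than rewriting "-ec"/"-el"
# suffixes, because "EC" is also a valid region subtag (e.g. "es-EC").
LEGACY_TO_BCP47 = {
    "sr-ec": "sr-Cyrl",
    "sr-el": "sr-Latn",
    "nrm": "nrf",  # Norman; "nrm" means Narom in ISO 639-3
}


def to_bcp47(code):
    """Map a known legacy code to its BCP 47 form; leave others untouched."""
    return LEGACY_TO_BCP47.get(code.lower(), code)


print(to_bcp47("sr-ec"))  # sr-Cyrl
print(to_bcp47("es-EC"))  # es-EC
```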

Aliased codes are not exceptional and not a problem: Wikimedia has aliases, for example, between the legacy "zh-classical" code and "lzh", or between "zh-yue" and "yue" (it just takes a long time to decide which alias will be the standard/canonical form). These aliases can persist for a long time, and the migration to standard BCP 47 can be planned over long periods and implemented step by step, making sure that nothing is broken long after the transition (it may take up to a decade; there is no urgency to drop these).

The same cannot be said about serious conformance issues like "nrm", which should be fixed as a priority and properly announced, involving the community so that it can do the work needed across many scattered pages. The first step is to create all needed aliases from "nrm" to "nrf" and migrate the internal database names. Then the community can work on converting wiki pages and Wikidata items. Here we can then rename all "nrm" resources to use "nrf" instead. Then we need a transition period lasting no more than 1 year (enough time to perform full scans of existing data and plan all the corrections needed in pages, templates, Lua modules, external Wikimedia tools and bots) before we drop the aliases on "nrm", keeping only "nrf". The use of "nrm" should be monitored with tracking logs as much as possible during that period, until these logs become almost empty. Logs should try to identify the internal or external source of these legacy usages; tracking external sources can only be managed by Wikimedia admins if it requires parsing server access logs, but Wikimedia admins can run a parser that analyzes them and detects what can be publicly reported, so that the community can then work on these external sources, if they are accessible or can be contacted. The one-year delay should allow most external search engines to fix their indexes, notably for the domain names used in URLs.
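
The monitoring step could be sketched roughly as follows. The class, its method names and the "source" labels are hypothetical and only illustrate the idea of counting legacy usages per source until the log is nearly empty; real tracking would depend on what Wikimedia admins can actually log.

```python
from collections import Counter


class AliasTracker:
    """Resolve legacy codes through aliases and count each legacy use."""

    def __init__(self, aliases):
        self.aliases = dict(aliases)
        self.legacy_hits = Counter()  # keyed by (legacy code, source)

    def resolve(self, code, source="unknown"):
        target = self.aliases.get(code)
        if target is None:
            return code  # not a legacy alias, nothing to record
        self.legacy_hits[(code, source)] += 1
        return target

    def hits_for(self, code):
        """Total legacy uses of one code across all sources."""
        return sum(n for (c, _), n in self.legacy_hits.items() if c == code)

    def can_drop(self, code, threshold=0):
        """True when the recorded legacy use is at or below the threshold."""
        return self.hits_for(code) <= threshold

    def reset_period(self):
        """Start a new monitoring period with empty counters."""
        self.legacy_hits.clear()


tracker = AliasTracker({"nrm": "nrf"})
print(tracker.resolve("nrm", source="template"))  # nrf
print(tracker.can_drop("nrm"))  # False
```

Keeping the source in the counter key is what lets the community see where the remaining "nrm" usages come from before the alias is dropped.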

After that period, "nrm" is closed and deleted, and a new wiki for Narom can be requested and created. Once such a Narom wiki is created, the initial homepage should contain a notice saying that it is no longer about Norman, which can be found elsewhere. Similar banners may also be added in categories or in new "conflicting" pages created in Narom (such page creations in Narom will likely be very slow, and page name conflicts will not occur on many pages, as the Norman and Narom languages are very different, but they will occur, for example, if both languages use the same proper names or borrow the same English terms). This way there will be no surprise for the rare users who may have followed links from old pages or from other old sites or documents. Those banners may then be removed after about 3 years (but there will be some community information pages about the wiki where we can trace its history and indicate what was migrated and where. Such public history is important to keep for extremely long periods, just as ISO itself maintains a public history).

Verdy p (talk)‎

sh has been removed from supported languages of MantisBT.

Thanks for this !

I noticed today that the strings_sh.txt file is still present in the repository, meaning whatever was done is apparently not cascading as part of the translation updates.

Can I simply and safely remove the file in the repository ?

-- Damien‎

Can I simply and safely remove the file in the repository ?

Ping...

-- Damien‎

Sorry, I missed the notification for this. Yes you can safely remove it.

Nike (talk)‎

Thanks!

-- Damien‎

FYI: sh is NOT deprecated in BCP 47. sh is only deprecated in ISO 639-1, but BCP 47 only uses sh instead of ISO 639-3 hbs, and we use BCP 47 instead of ISO 639-1 on translatewiki.

Winston Sung (talk)‎