wiki-archive/twiki/data/UBIF/ExtendLanguageWithNeutralAn...

331 lines
15 KiB
Plaintext

head 1.11;
access;
symbols;
locks; strict;
comment @# @;
1.11
date 2009.11.20.02.02.03; author LeeBelbin; state Exp;
branches;
next 1.10;
1.10
date 2007.03.06.17.30.00; author TWikiGuest; state Exp;
branches;
next 1.9;
1.9
date 2005.04.14.15.13.21; author JavierTorre; state Exp;
branches;
next 1.8;
1.8
date 2005.04.14.11.18.14; author GregorHagedorn; state Exp;
branches;
next 1.7;
1.7
date 2004.11.09.22.42.47; author GregorHagedorn; state Exp;
branches;
next 1.6;
1.6
date 2004.10.27.14.48.05; author GregorHagedorn; state Exp;
branches;
next 1.5;
1.5
date 2004.10.26.14.01.33; author GregorHagedorn; state Exp;
branches;
next 1.4;
1.4
date 2004.07.15.18.02.00; author GregorHagedorn; state Exp;
branches;
next 1.3;
1.3
date 2004.06.11.10.15.50; author GregorHagedorn; state Exp;
branches;
next 1.2;
1.2
date 2004.06.10.06.39.51; author GregorHagedorn; state Exp;
branches;
next 1.1;
1.1
date 2004.05.24.11.25.24; author GregorHagedorn; state Exp;
branches;
next ;
desc
@none
@
1.11
log
@none
@
text
@%META:TOPICINFO{author="LeeBelbin" date="1258682523" format="1.1" version="1.11"}%
%META:TOPICPARENT{name="UBIF.SchemaDiscussion"}%
---+!! %TOPIC%
Previous versions of this topic discussed whether it would be appropriate to add special values to the xml language type to deal with situations where language is not appropriate at all (content is entirely language-neutral) or unknown. This was now resolved by using the extension mechanism defined in the xml language type itself!
http://www.ietf.org/rfc/rfc1766.txt specifies that "x" is a reserved value for private use and that secondary codes of "x" will not be registered. Indeed, both xmlspy 2004 and 2005 accept a value like "x-unknown" as a valid value for the type xs:language. I have therefore removed the previous discussion and changed the UBIF schema to always use xs:language.
I propose that, by a UBIF convention, applications may desire to understand the following language codes:
* language=*"x-unknown"*: the language could not be defined, but the language attribute is required
* language=*"x-neutral"*: the content is language-neutral
* language=*"x-mixed"*: the content is a mixture of multiple languages<br/>(but in most cases a "dominant" language should be indicated, e.g. when other languages appear only in cited text)
-- Gregor Hagedorn - 9 Nov. 2004
The above is probably be obsolete. According to http://en.wikipedia.org/wiki/ISO_639, three special codes already exist in ISO 639-2:
* mul = Multiple languages / several languages used (defined in the normative text)
* und = Undetermined, language or languages cannot be identified (defined in the normative text)
* mis = Miscellaneous languages (defined in the list of codes).
There seem to be no equivalent to language-neutral text, such as scientific organism names. Perhaps these should be registered as their own code?
-- Main.GregorHagedorn - 14 Apr 2005
---
Javier de la Torre points out
In a best practices document (http://www.ietf.org/rfc/rfc3066.txt) the following statements can be found:<br/>
5 "You SHOULD NOT use the UND (Undetermined) code unless the protocol in use forces you to give a value for the language tag, even if the language is unknown. Omitting the tag is preferred."<br/>
6. "You SHOULD NOT use the MUL (Multiple) tag if the protocol allows you to use multiple languages, as is the case for the Content-Language: header.",<br/> where "SHOULD NOT" is defined as: "This phrase, or the phrase "NOT RECOMMENDED" mean that there may exist valid reasons in particular circumstances when the particular behavior is acceptable or even useful, but the full implications should be understood and the case carefully weighed before implementing any behavior described with this label."
I vote for making a language attribute required in the TDWG/GBIF data schemata in certain places, even if some providers have to choose the undefined value. I see our situation as "valid reasons in particular circumstances"...
In GBIF we have a decidedly international agenda. This is currently building and therefore has not yet adequate enforcement mechanisms for practices. My personal estimate of realistic current practices is that for a while only required information will be given, and if a language attribute is not required it will be ignored -- even though most data providers could inform about language. The lack of language information may prevent the possibility to provide internationalized services, and the lack of services hinders non-schema social enforcement of providing language information (a hen or egg problem).
In my experience the vast majority of html documents does not give any language information, even though it is obvious that almost all could and most should. The only chance there is to hope for the chinese to NOT use language tags as well, so people in English speaking countries will want to have those documents they can read tagged with a language code. Please add your vote as well.
-- Main.GregorHagedorn - 14 Apr 2005
---
I also vote for making it mandatory. I fully agree with Gregor on the idea that things that are not requireed are just simple ignored.
With the actual configuration tool for the !BioCASe wrapper it is very easy to provide this information and I think it will not bother people to supply it because they know it.
And if this help in the internationalization process asking the people to do it is worth.
For some time I did not like the idea of forcing people to include metadata like this in their mappings, but I think it was just because of the lack of easy tools to provide it. Before it was rather complicate to provide this, you had to include it in your database and then map it. Now when it is so easy I fully agree on improving the quality of the documents the people create.
Now I am a fan of mandatory metadata elements, like languages, measurement scale declaration, recordbasis, etc... We have to make the people clear that what is obvious for their single database in networks with so many sources we need to be much more explicit.
-- Main.JavierTorre - 14 Apr 2005
%META:TOPICMOVED{by="GregorHagedorn" date="1089916489" from="SDD.ExtendLanguageWithNeutralAndUnknown" to="UBIF.ExtendLanguageWithNeutralAndUnknown"}%
@
1.10
log
@Added topic name via script
@
text
@d1 2
a4 2
%META:TOPICINFO{author="JavierTorre" date="1113491601" format="1.0" version="1.9"}%
%META:TOPICPARENT{name="UBIF.SchemaDiscussion"}%
d10 3
a12 3
* language=*"x-unknown"*: the language could not be defined, but the language attribute is required
* language=*"x-neutral"*: the content is language-neutral
* language=*"x-mixed"*: the content is a mixture of multiple languages<br/>(but in most cases a "dominant" language should be indicated, e.g. when other languages appear only in cited text)
d17 3
a19 3
* mul = Multiple languages / several languages used (defined in the normative text)
* und = Undetermined, language or languages cannot be identified (defined in the normative text)
* mis = Miscellaneous languages (defined in the list of codes).
@
1.9
log
@none
@
text
@d1 2
@
1.8
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="GregorHagedorn" date="1113477494" format="1.0" version="1.8"}%
d39 12
@
1.7
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="GregorHagedorn" date="1100040167" format="1.0" version="1.7"}%
d3 36
a38 10
Previous versions of this topic discussed whether it would be appropriate to add special values to the xml language type to deal with situations where language is not appropriate at all (content is entirely language-neutral) or unknown. This was now resolved by using the extension mechanism defined in the xml language type itself!
http://www.ietf.org/rfc/rfc1766.txt specifies that "x" is a reserved value for private use and that secondary codes of "x" will not be registered. Indeed, both xmlspy 2004 and 2005 accept a value like "x-unknown" as a valid value for the type xs:language. I have therefore removed the previous discussion and changed the UBIF schema to always use xs:language.
I propose that, by convention, applications may desire to understand the following language codes:
* language=*"x-unknown"*: the language could not be defined, but the language attribute is required
* language=*"x-neutral"*: the content is language-neutral
* language=*"x-mixed"*: the content is a mixture of multiple languages<br/>(but in most cases a "dominant" language should be indicated, e.g. when other languages appear only in cited text)
-- Gregor Hagedorn - 9 Nov. 2004
@
1.6
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="GregorHagedorn" date="1098888485" format="1.0" version="1.6"}%
d3 1
a3 1
I am currently having quite a bit of a headache with languages in UBIF...
d5 1
a5 1
SDD attempts to support multiple representations within a language, especially for different expertise levels (school children to scientists) and defines audiences for this purpose. Most likely this solution is limited to SDD and can not be expected to become a common foundation of all TDWG/GBIF interchange standards. It is a relatively unique problem of descriptive data that highly depend on adequate language representations for a given audience.
d7 4
a10 1
Limiting the expressiveness of UBIF to language is not a problem for SDD. However, language we need (for geographical place names, <nop>MediaResource title label, labels(titles of publications, agent names, class/taxon names, etc.). The latter may be language independent (e.g. scientific names provided latin rank labels are used) but may also be highly language dependent common names (like "mushrooms"). Similarly, publications may be considered relatively language neutral in roman-languages - in languages using other character sets, presumably the entire reference would have to be tranliterated to cyrillic, chinese, japanese etc. Finally, while I expect someone to look at a DELTA dataset when converting it to SDD to determine the language of the entire thing, this may not be practical for all proxy data. The language may be unknown/undetermined. So the options for the <nop>SDD.ProxyBaseType.Label are:
d12 1
a12 30
a) make the Language attribute optional, allowing it to be missing. In some respects this is the simplest solution. However, it prevents identity constraints (the label of objects should be unique within each language and across the dataset) and I believe it makes XPath expressions (Label/Representation["en"]/Text) significantly more complex and dependent on case logic (-- is this true?)
b) Create an extension to xs:language that also supports codes for language-neutral and undetermined language. Proposal (simple type, union of xs:language and a string enumeration):
<verbatim>
<xs:simpleType name="Language">
<xs:annotation>
<xs:documentation>Union of xs:language with '-' for language-neutral (e.g. scientific names) and '?' for unknown</xs:documentation>
</xs:annotation>
<xs:union memberTypes="xs:language">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value="-"/>
<xs:enumeration value="?"/>
</xs:restriction>
</xs:simpleType>
</xs:union>
</xs:simpleType>
</verbatim>
Having to extend the commonly used xml:language type makes me feel slightly uneasy. Perhaps the inapplicable is in fact not necessary - the only use case I have are scientific taxon names, and we could agree to consider them as latin. But the Unknown as an explicit code in my feeling would greatly simplify the handling of data. I also see it as a social contract: Please at least try to see whether the entire dataset is monolingual, so even where the dataset does not contain explicit language information, communicate the implicit information to the consumers, rather than ignoring the problem altogether.
Can anybody give advice?
-- Gregor Hagedorn - 24 May/26 Oct. 2004
---
http://www.ietf.org/rfc/rfc1766.txt says that x is a reserved value for private use. Perhaps x is more appropriate to overload with the "unknown" semantics?
-- Gregor Hagedorn - 27 Oct. 2004
@
1.5
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="GregorHagedorn" date="1098799293" format="1.0" version="1.5"}%
d35 4
@
1.4
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="GregorHagedorn" date="1089914520" format="1.0" version="1.4"}%
d3 1
a3 1
I am currently having quite a bit a headache with languages...
d5 1
a5 1
1. SDD attempts to support multiple representations within a language, especially for different expertise levels (school children to scientists) and defines audiences for this purpose. Most likely this solution is limited to SDD and can not be expected to become a common foundation of all TDWG/GBIF interchange standards. It is a relatively unique problem of descriptive data that highly depend on adequate language representations for a given audience.
d7 1
a7 1
2. Limiting the expressiveness in the common infrastructure to language is no problem. However, in the proxy data we need multiple languages for geographical place names, <nop>MediaResource title label, and we may need them for labels of publications, agent names, class/taxon names. The latter may be language independent (e.g. scientific names with latin rank labels) but may also be highly language dependent common names (like "mushrooms"). Similarly, publications may be relatively language neutral (perhaps only for roman-languages - in languages using other character sets, presumably the entire reference would have to be tranliterated to cyrillic, chinese, japanese etc.). Finally, while I expect someone to look at a DELTA dataset when converting it to SDD to determine the language of the entire thing, this may not be practical for all proxy data. The language may be unknown/undetermined.
d9 1
a9 1
3. So the options for the <nop>SDD.ProxyBaseType.Label are:
d11 1
a11 3
* make the Language attribute optional, allowing it to be missing. In some respects this is the simplest solution. However, it prevents identity constraints (no language used twice) and I believe it makes XPath expressions (Label/Representation["en"]/Text) significantly more complex and dependent on case logic -- is this true?
* Create an extension to xs:language that also supports codes for language-neutral and undetermined language. Proposal (simple type, union of xs:language and a string enumeration):
d28 2
d32 1
a32 1
-- Gregor Hagedorn - 24 May 2004
@
1.3
log
@none
@
text
@d1 2
a2 2
%META:TOPICINFO{author="GregorHagedorn" date="1086948950" format="1.0" version="1.3"}%
%META:TOPICPARENT{name="UnifiedBioInfoFramework"}%
d35 1
@
1.2
log
@none
@
text
@d1 2
a2 2
%META:TOPICINFO{author="GregorHagedorn" date="1086849591" format="1.0" version="1.2"}%
%META:TOPICPARENT{name="OverarchingPatternsForTdwgSchemata"}%
@
1.1
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="GregorHagedorn" date="1085397924" format="1.0" version="1.1"}%
d13 1
a13 3
* Create and extension to xs:language that also supports codes for language-neutral and undetermined language. See below for a proposal (simple type, union of xs:language and a string enumeration).
@