wiki-archive/twiki/data/SDD/SupportForMultipleAudiences...

%META:TOPICINFO{author="GarryJolleyRogers" date="1259118878" format="1.1" version="1.15"}%
%META:TOPICPARENT{name="ClosedTopicSchemaDiscussionSDD09"}%
---+!! %TOPIC%

Trevor asked about the - perhaps political reasons - to support multiple languages and audiences. I fully understand the urge to avoid a significant complication of the information model that is not required in ones own English agenda. Any complication may reduce the chances of successfully finishing a software project.

However, to many of us multiply audiences are quite important, and I will try to explain why this is so. The international language for science is English, and most relevant publications are today published and read in English. English-speakers today have a privilege to work with the same language in basic science as in applied science and public communications.

In contrast, non-english speakers must continuously bridge a language gap to communicate with their technical personal, with applied sciences (which usually use the local language) and with the public. This used to be a special problem of bilingual countries like Canada, but today we have, for example, projects in Germany that require bilingual German/English data and web interfaces as a funding requirement. You can get a away with doing part in both languages, and leave part in either English or German.

Also, I believe it is incorrect to assume this is not currently done. I know of two <nop>DeltaAccess projects that are multilingual, it was possible in the old DELTA where English/Chinese data sets exist (though in two character definition files which manually had to be kept in synchronicity), and Bob Morris is doing it for Spanish/English and expert/schoolchildren. {from KRT: Lucid also has a multilingual interface with language files in english, german, portuguese, spanish, bahasa indonesia, mandarin and others. We all seem to be finding that multilinguality is important.}

For these reasons, SDD does provide mechanisms to develop data sets for multiple audiences. These can be expressed in a single SDD document. There is no requirement in SDD that translations must be complete, or that more than one translation for a concept has to exist. A multilingual dataset may grow over time, and certainly needs time to mature and to learn about appropriateness of translated terminology. It may often be quite acceptable to leave a few detailed comments in English for a long time, while only translating the "backbone" concepts.

I would encourage any project that plans to develop a new application for descriptive data to consider, whether it can afford support for multiple languages. I believe even projects in English-speaking countries will often encounter multiple audiences when they collaborate with other countries in biodiversity studies. Kevin correctly remarks on the email list that multiple audiences are also relevant to "broaden the application of taxonomy beyond taxonomists (surely a requirement if the taxonomic crisis is to be resolved)". In fact, I believe at least in mycology the differences between schools are just as big as between languages. I see no special problem supporting languages if you want to support different research schools in one language.

Also note that the dominant language of the World is Chinese, not English! Wouldn't it be helpful to be able to read the descriptions of the thousands of new species described each year in Chinese in English?

-- Gregor Hagedorn - 18 Mar 2004

---

I certainly agree with all of these points, and alternative/dual language representations will clearly be useful or necessary for particular user groups. In particular I have no problem with multiple representations for non-expert user groups (who might be members of the public, or workers in other fields, e. g. ecology, who just want an indication of what is recorded, but do not require highly accurate translations).

However, from our experiences, to assume that such dual representations would ever be accurate enough for detailed scientific work is risky. I would guess that the original language of a description would still be considered more accurate (even if completely incomrehensible to non-native speakers) - so would there be any way to flag which representation is 'original', and which representations are translations. Perhaps the quality of a translation could be scored too - - a technical translation being worth more than for example a user friendly translation to help a non-native speaker. Perhaps such a scoring overlaps with the concept of expertise level. For example a technical description might be  translated (dumbed down...) for a less expert audience.

I suspect current working practice for taxonomists is to ignore existing descriptions in a foreign language, and redescribe a specimen for purposes of a new work. This is what generally happens even where a same language description exists. If a taxonomist doesn't trust/understand a description written in his own language, he is even less likely to trust a translated description.

-- Main.TrevorPaterson - 18 Mar 2004

---

You raise a very important point in terms of trust. However, it is a misunderstanding that any part of the description is ever being translated (with the exception of Note inside State or <nop>StatisticalMeasure). Our model is to provide representations for a single concept, and the concept is being translated in terminology, not a descriptions.

As such the coded descriptions are not considered prose in a language, but data that are potentially language independent. I fully agree that this will never work perfectly and requires significant debugging until it works satisfactorily. However, in my experience a scientist will and can never trust 100% any description. Even if the terminology is well defined, taxonomists will only look up the definition if they think they do not know the meaning. If I believe I understand the concept and width of a descriptive term, I will use it (select it from pick-list etc.) without stopping to carefully review the definition first.

Similarly, definitions have a strong tendency to evolve over time. Even though what I say above may sound unscientific, I believe it is in fact a result of doing science rather than mathematics. We do discover more about the organisms, we continue to learn new details and concepts, and these concepts have a strong influence on the older concepts, which as a consequence evolve. Any attempt to create an agreed terminology for all regions and all times I believe to be futile. This is why SDD has not a world-wide terminology model, but a project paradigm that may be constrained by geography, language, time, and ones Ph.D. committee.

(Note: What you cannot see anywhere yet is that these project concepts are expected to be associated or mapped to each other. This is part of our plan, but we have not yet made any progress on it, because we believe it can wait until a version 2 (should we ever finish version 1).)

Back to the trust. I agree that it may indeed be valuable to detect in which language, and with which language representation (i.e. not only the audience code, but the full text) a descriptive statement was scored (or perhaps later changed). So far we never discussed this.

I could think of an additional text field attempting to maintain a score-time representation. In the case of scoring a character, this may contain character + state label in the currently selected audience representation. It may even go further, in that an application presenting a concept hierarchy rather than a flat character list could store the labels in the tree path plus the state (plus modifiers, etc.).

Is anybody in favor of such an addition?

-- Gregor Hagedorn - 18 Mar 2004

---

I think Trevor's point about flagging the "native" representation of the character or state label or NLD is important - it's like a book that, when translated, always says it's been translated from such-and-such a language rather than making out that it was written afresh in the new language. For historical and accuracy purposes this is important.

Gregor I don't understand your proposed solution though and it seems unnecessarily clunky. Surely all we need is a flag on the audiences definition to say "this one is native". Of course, that will only work when a document is first authored with only one audience, then subsequently reauthored with another - that will often be the case with language, but not ncessarily with <nop>ExpertiseLevel. So perhaps Language and <nop>ExpertiseLevel need to be separated instead of being fused into an audience definition. Or perhaps a new attribute <nop>NativeLanguage in <nop>UBIF.ContentMetadata.

-- Main.KevinThiele - 22 Mar 2004
---

I disagree with you Kevin. Your scenario of only one authoring language and multiple translations is likely in small projects that are prepared and published like a book, but I believe not in multinational projects like <nop>FishBase, the lichen project LIAS, etc. Multiple people will work on descriptions. One solution could be that we could make it a "rule", that a single description may always only be edited in a single language. If you want to add/change, you would have to clone it to another one, with your authorship and that may then bear your preferred language. The problem with that is that the relation between descriptions so far are unresolved. Adding information is fine, but contradicting information has been discussed at probably each SDD meeting - without any progress.

I agree on "clunkyness" of storing original text, but I see another problem: imaging you code "234723478" in a description, and the terminology resolves this state code to "moderately to strongly hairy". Later you decide that this was a bad idea, and you either create two new categories: "moderately hairy" and "strongly hairy", or you even drop the degree of hairiness and delegate this to modifiers, you have a very similar break of trust than in the language case.

That is, the break of trust is not a result of multiple languages, but a result of coding symbols in the descriptions for which the natural language expressions may differ in the terminology - either depending on translation activities, or depending on schema evolution.

I believe most of us agree that terminology is never correct up-front. It should be carefully designed esp. in large projects, but it is bound to evolve. I don't know any project where it did not. We do try to go through all descriptions verifying changes, but sometimes that is difficult. Also, it is a smart idea to solve the terminology evolution in cases where it does matter by renaming original state to "(use is deprecated:) moderately to strongly hairy", add the other states, and then go through all descriptions and update them. However, I believe that such timeconsuming practices are not enforceable.

My proposal adds a "trust layer" (what may we call it?) to this latter problem as well - but it is clunky. I wish I could see a better one. Another option is to export all data very frequently to CVS systems, that was proposed in Paris. I doubt that it will work, it will be analytically extremely complicated to find certain types of changes among all the changes present in large databases (i.e. running into several 1000 descriptions and changed daily, not just the "mistrusted" change you are after.

If the problem of evolving terminology would not matter so much, I think storing the audience and allowing only a single audience per description would be fine.

-- Gregor Hagedorn - 25 Mar 2004
---

Kevin Thiele's suggestion to single out an audience for this, or any other, purpose, probably doesn't survive data integration applications. How would conflicts be resolved? Possibly this is a problem also even if the special designation is at the level of any object that can have an audience, though in that case the integrating application will probably need to make some resolution choices in any case, and these seem likely to make the issue moot. For example, if two data sources have character values for the same character but with different states, the choice of which audience label is singled out as distinguished is implied by that choice.

Sooner or later, somebody needs to be motivated to do translations of labels for any labeled things TDWG may choose to make official definitions, e.g. for some standard statistical measures.

-- Main.BobMorris - 27 Mar 2004
---

This is just a note that the original language/trust issue is still unresolved, as well as any contradiction mechanism in SDD 0.91 as it stands after the Berlin meeting!

-- Gregor Hagedorn - 26 May 2004