wiki-archive/twiki/data/tmp/SDD/ConvertingLIF2SDD.txt

---+!! %TOPIC%

%META:TOPICINFO{author="GregorHagedorn" date="1172749018" format="1.1" version="1.8"}%
%META:TOPICPARENT{name="UBioXidDevelopment"}%
This topic originated from the discussions in UBioXidDevelopment about how to convert Lucid LIF data to SDD. The main points from the discussion are summarized here. <strong style="color:green">Please feel free to add/correct the text below directly rathen than, as usual, add your comments at the end. Simply add yourself to the list of contributors!</strong>

---

<h3>Description matrix</h3>

The matrix information of the Lucid-style .LIF file should go into "<nop>CodedDescriptions" section of SDD. Example of the matrix section of our .LIF to atlantic tunas:

<verbatim>
[..Main Data (txs)..]
6100000100001100100010000011111
6010000010001010010010000101110
6001000100001001010010000131110
6001000100010100001001000101100
6001000001001010100000101000000
6000100100001010000100100101000
6000010100001010001000310101100
6000001000101010000100100101000
</verbatim>

This matrix scores taxa by rows and states by columns. Lucid scores taxon under the following scheme:%BR%
0 - absent%BR%
1 - present%BR%
2 - unknown%BR%
3 - rare%BR%
4 - commonly misinterpreted%BR%
5 - rarely misinterpreted%BR%
6 - metric data


Note: score 4 and 5 were originally discussed here as: "4 - present but may be misinterpreted as absent; 5 - absent but may be misinterpreted as present". We could not get any official definition for LIF, but the one that seems most likely is given above.


<strong>Quantitative/numerical data</strong>

In LIF, values are to be used only for metric data (Lucid-type 6). In Lucid, metric data is not entered as a single value, but instead there are parameters for extreme upper, normal upper, normal lower, and extreme lower values. So, for one metric state/taxon pair, there are four metric values to be entered.

SDD support a much larger number of statistical measures, including variance or sample size for true sampling data. A recommended mapping of Lucid parameters to SDD statistical measure codes follows:
   * "extreme upper" = SDD: Max
   * "normal upper" = SDD: <nop>UMethUpper
   * "normal lower" = SDD: <nop>UMethLower
   * "extreme lower" = SDD: Min

The "unknown method" in the name refers to the fact that for Lucid data the kind of interval is not known. It may be a statistical range (Interquartile, mean plus/minus standard deviation, 95% confidence interval for mean, etc.) or it may be a human estimate of what is considered a "normal" range. For the latter, SDD provides the codes "ObserverEstUpper/Lower".

To use a quantitative character in SDD, an element with name "QuantitativeCharacter" must be added to Terminology/Characters (this defines a character id value, a label; but not the measures). To add character data to the description, add an "Quantitative" element to <nop>CodedDescription/SummaryData, containing a ref to the Character "id" attribute. This may look like:
<verbatim>
      <QuantitativeCharacter id="2">
        <Representation><Label>Leaf length</Label></Representation>
        <MeasurementUnit><Label role="abbrev">m</Label></MeasurementUnit>
        <Default>
          <MeasurementUnitPrefix>milli</MeasurementUnitPrefix>
        </Default>
      </QuantitativeCharacter>
          ...
          <Quantitative ref="2">
            <Measure type="Min" value="2.3"/>
            <Measure type="UMethLower" value="3.2"/>
            <Measure type="UMethUpper" value="7.1"/>
            <Measure type="Max" value="7.9"/>
            <MeasurementUnit><InternationalAbbreviation>cm</InternationalAbbreviation></MeasurementUnit>
          </Quantitative>
</verbatim>

Note that the measurement unit can either be expressed at the character definition, or at the character data. A unit in data overrides the unit in definition, but the unit in definition applies to all data not defin<69>ng a deviating unit.


<strong>Categorical data</strong>

SDD uses a character based scheme and only positive statements about values or categories are made. A character can be viewed as an analysis variable in statistical data analysis, which, however, may contain a list or set of values. For most purposes of identification processes this is conveniently expressed in a taxon x state matrix like in Lucid. However, many things can not or only poorly be expressed in that way (for example an image relating to a entire character, an entire character being inapplicable, or a text note on a character that has been studied).

In SDD, a character therefore either has a status value (unknown, inapplicable etc.) or, if categorical, it has a list of states scored as present. The absence of a state is implicit, provided that the character has been scored at all. If the character is not listed at all, this is roughly equivalent to the status "unknown". This design is resilient against defining new characters while data are already scored. Lucid LIF statements about state being absent are redundant in SDD and 0 needs no representation.

To use a categorical character in SDD, an element with name "CategoricalCharacter" must be added to Terminology/Characters (this defines a character id value, a label, and the states). To add character data to the description, add an "Categorical" element to <nop>CodedDescription/SummaryData, containing a ref to the Character "id" attribute. This may look like:
<verbatim>
      <CategoricalCharacter id="1">
        <Representation><Label> Leaf complexity</Label></Representation>
        <States>
          <StateDefinition id="1">
            <Representation><Label>Simple</Label></Representation>
          </StateDefinition>
          <StateDefinition id="2">
            <Representation><Label>Compound</Label></Representation>
          </StateDefinition>
        </States>
      </CategoricalCharacter>
       ...
       Then used in descriptions like:
      <Categorical ref="1">
        <State ref="1"/>
      </Categorical>
</verbatim>


<strong>Categorical data with modifiers</strong>

Several Lucid scores (2-5) are expressed in SDD in the form of modifiers. We need:

   * A frequency modifier "rare" to be defined in Terminology/Modifiers which would be added for states scored as 3 (see SDD example file Modifier id="22"). Example for the application of this in a description:
<verbatim>
         <CodedDescription id="1"><Header><ClassName ref="1"/></Header>
         <SummaryData>
           <Categorical ref="2"><State ref="3"><Modifier ref="22"/></State><State ref="4"/></Categorical>
           <Categorical ref="3"><State ref="1"/></Categorical>
         ...
</verbatim>

   * An expression that a state value is unknown. In SDD this requires a distinction:
      * In Lucid LIF all states of a character are unknown and thus the character value itself is unknown. The SDD feature <nop>CodingStatus (status of the character) defines whether a property of an object is unknown (see Terminoloy/General/<nop>CodingStatusValues, e. g. in the SDD example files: <nop>CodingStatus id="2" debugkey="Unknown"). SDD does distinguish between various levels of unknown, especially whether this is not yet coded, or whether it is logically impossible to code (= not applicable), or whether information has been researched, but not readily available. These distinctions are irrelevant when using data for identification, but important when building large datasets, especially collaboratively.
      * In Lucid some states are coded and others marked as unknown. This may either occur if the supplier of information has studied a feature, but knows that some conditions are rare and realizes that her or his sampling was insufficient, or may be the result of a poorly defined character (in which the states are not truly related, and a state frequency distribution would have no meaning). The way to express the mixed situation in SDD is to say it is "perhaps this state". Example: the Lucid statement "flower elliptic, unknown whether ovatate" is in SDD interpreted as  "flower elliptic, perhaps ovatate". Thus we first need to define in Terminology/Modifiers a Certainty modifier for perhaps: (see Modifier id="41" in the SDD_tech.xml example file). Example for the application of this in a description:
<verbatim>
         <CodedDescription id="1"><Header><ClassName ref="1"/></Header>
         <SummaryData>
           <Categorical ref="2"><State ref="1"><Modifier ref="41"/></State><State ref="3"/><State ref="4"/></Categorical>
           <Categorical ref="2"><State ref="5"/></Categorical>
         ...
</verbatim>
      * Note that the above looks more complicated than the Frequency example, since Certainty modifiers apply to all states in a Categorical data element. The element itself may, however, be repeated. This may seem an unnecessary complication. It was chosen because Certainty modifiers apply in principle to all character types, including quantitative, measured color ranges, or future character types like molecular patterns.

   * A certainty modifier with the special flag Specification/IsTrueByMisinterpretation set to true (see Modifier id="42"). This is to be used for Lucid value 5. Example:
<verbatim>
   <CodedDescription id="101"><Header><ClassName ref="1"/></Header>
     <SummaryData>
       <Character ref="1">
         <State ref="1"><Modifier ref="42"/></State>
       </Character>
</verbatim>
   * There is no equivalent to Lucid value 4 in SDD. The fact that something is present, but can be erroneously overlooked is considered so general that is is not part of data coding, but instead should be part of the reasoning of the query or identification method. Thus treat 4 simply as Lucid-score 1. -- <em>(If anybody disagrees on this, and can provide examples or scenarios, it would be an urgent issue to fix!)</em>
   * Appropriate statistical measures for the metric data. (See the discussion above!)

---

<h3>Images</h3>

Note that descriptions may be associated with images or formatted documents (or even video/audio). You can add a <nop>MediaResource ref in the description (we consider illustrations of the entire taxon part of description, not part of the taxon name). <nop>MediaResources/MediaResource can occur in the <nop>CodedDescription itself (after the Header) or in a specific character (if the image only applies to this). After defining a <nop>MediaResource in <nop>MediaResources:
<verbatim>
TO BE UPDATED
</verbatim>
you can refer to such resources for the entire description like:
<verbatim>
<CodedDescription id="101">
  <Scope><TaxonName ref="1"/></Scope>
  ###DIRECTLY NO LONGER POSSIBLE IN SDD. MAY BE IN REPRESENTATION, OR
  <MediaResource ref="123"/><MediaResource ref="124"/>
  ...
</verbatim>

-- Main.GregorHagedorn, Main.PatrickLeary - 7 Jun 2004 to 18. August 2004

---

I have just recently created a working version of a LIF (Lucid Interchange Format) to SDD converter. It is a work in progress that shall be updated as the SDD standards change. It can be found at http://uio.mbl.edu/SDD/converter.php . The user can upload a LIF from their computer to convert, they can enter a URL (web address) of a LIF file somewhere on the web, or they can choose one of the sample LIF files that we provide to test the conversion process. Be aware that we are not taxonomists, so the keys that we provide are not very detailed and may not use all the proper taxonomic vocabulary.

We would be very interested to test a complete and thorough key made by a specialist, as that is the audience we are appealing to. If you have a key you would like to let us use to test our software, or even use as a demonstration key, you can e-mail me at pleary@mbl.edu.

I would love to hear your comments and suggestions for changes to this converter, and I hope you can find it useful.

-- Main.PatrickLeary - 10 Jun 2004