<TD width="70"><A href="http://www.tdwg.org"><IMG src="../../../DarwinCore_files/TDWGlogo_Twiki.gif" width="150" height="70" alt="Biodiversity Information Standards (TDWG) logo"></A></TD>
<TD>Tim Robertson (GBIF) &lt;trobertson@gbif.org&gt;, John Wieczorek (MVZ) &lt;tuco@berkeley.edu&gt;, Markus Döring (GBIF) &lt;mdoering@gbif.org&gt;, Renato De Giovanni (CRIA) &lt;renato@cria.org.br&gt;, Dave Vieglais (KUNHM) &lt;vieglais@ku.edu&gt;</TD>
<TD>This document is governed by the standard legal, copyright, licensing provisions and disclaimers issued by the Taxonomic Databases Working Group.</TD>
This document provides guidelines for describing Darwin Core data residing in <em>fielded text</em> files (e.g. comma-separated values or tab-delimited files)
by means of an XML metafile.<br/>
<img src="images/usage.png"></img><br/>
</P>
<h3>1.1 XML versus <EM>Fielded Text</EM></h3>
<p>
Many resources exist on the web describing the advantages of XML (<a href="http://en.wikipedia.org/wiki/XML">http://en.wikipedia.org/wiki/XML</a>) over less structured content such as <em>fielded text</em>.
These guidelines <b>do not</b> promote the use of <EM>Fielded Text</EM> over XML for data files, but rather provide recommendations for how to handle such data files when necessary.
<br/>
Two such scenarios might be:
<ul>
<li>The transfer of large numbers of Darwin Core <i>simple</i> records from one database to another.
Databases are typically very efficient at producing and consuming such output, for example <em>tab-delimited</em> files.</li>
<li>The description of legacy data existing in a <em>fielded text</em> format, such that it might be automatically understood and loaded into another system.
It could be that this system would then re-serve the data in another format such as XML.</li>
</ul>
</p>
<h3>1.2 Existing Solution</h3>
<p>
Proposed standards exist for similar XML metafiles to describe <EM>fielded text</EM> files, such as the <a href="http://www.fieldedtext.org/">FieldedText</a> standard. The FieldedText standard aims to describe any
<EM>fielded text</EM> file, including all possible permutations of content. While beneficial to the publisher, this flexibility presents significant challenges to the consumer, who must support the diverse options that may exist.
</p>
<h3>1.3 Example Metafile Content</h3>
A simple comma separated values data file of the following form:
The metafile schema is available at <a href="../../../text/tdwg_dwc_text.xsd">tdwg_dwc_text.xsd</a>.
</p>
<h3>4.1 The &lt;archive&gt; element</h3>
<p>
<table class="border">
<thead>
<caption>Attributes</caption>
<th>Attribute</th>
<th>Description</th>
<th>Required</th>
<th>Default</th>
</thead>
<tbody>
<tr>
<td class=""><em>fileRoot</em></td>
<td>Contains a fully qualified Uniform Resource Locator (URL) defining the root location of the data files being described. The location must be publicly accessible.
Valid examples of the format include <i>http://data.gbif.org/collections/</i>, <i>ftp://ftp.gbif.org/public/</i> and <i>http://data.gbif.org/webservices/export?id=</i>. This value will be concatenated
with the location of the <a href="#fileTag-location">&lt;file&gt;</a> and therefore should contain any necessary trailing characters such as / ? etc.</td>
<td>An &lt;archive&gt; will contain one or more <a href="#fileTag">&lt;file&gt;</a> elements, each representing an individual file being described.</td>
</tr>
</tbody>
</table>
</p>
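<p>As an illustration of the concatenation rule for <em>fileRoot</em>, the following minimal Python sketch shows how a consumer might build the URL of a data file. The function name <code>resolve_file_url</code> and the identifier <code>12345</code> are hypothetical and used only for this example; the <em>fileRoot</em> values and the location <code>dwc-data.txt</code> are taken from the attribute descriptions in this section.</p>

```python
def resolve_file_url(file_root: str, location: str) -> str:
    """Build a data file URL by appending the file location to fileRoot unchanged."""
    # Plain string concatenation: fileRoot is expected to already carry any
    # trailing characters it needs, such as "/" or "?id=".
    return file_root + location


# fileRoot values are examples from the attribute table; "12345" is made up.
print(resolve_file_url("http://data.gbif.org/collections/", "dwc-data.txt"))
print(resolve_file_url("http://data.gbif.org/webservices/export?id=", "12345"))
```

<p>Plain concatenation is used deliberately in this sketch. A general URL resolver would treat the location as a relative reference and could discard a trailing query such as <code>?id=</code>, which is not what the description above asks for.</p>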
<h3><a name="fileTag">4.2 The &lt;file&gt; element</a></h3>
<td>Specifies the location of the file relative to the fileRoot - e.g. dwc-data.txt</td>
<td>✓</td>
<td/>
</tr>
<tr>
<td class=""><em>fieldsTerminatedBy</em></td>
<td>Specifies the delimiter between fields. Typical values might be "," or "\t" for CSV or Tab files respectively.</td>
<td/>
<td>\t</td>
</tr>
<tr>
<td class=""><em>linesTerminatedBy</em></td>
<td>Specifies the row separator character.</td>
<td/>
<td>\n</td>
</tr>
<tr>
<td class=""><em>compression</em></td>
<td>Specifies the compression used for the file. May be omitted or specified as one of:
<dl>
<dt>GZIP</dt>
<dd>Data file is compressed as GZIP</dd>
<dt>ZIP</dt>
<dd>Data file is compressed as ZIP (e.g. using PKZIP, WinZip, StuffIt etc.)</dd>
</dl>
</td>
<td/>
<td/>
</tr>
<tr>
<td class=""><em>encoding</em></td>
<td>Specifies the encoding for the data file. One of:
<dl>
<dt>UTF-8</dt>
<dd>8-bit Unicode Transformation Format</dd>
<dt>UTF-16</dt>
<dd>16-bit Unicode Transformation Format</dd>
<dt>ISO-8859-1</dt>
<dd>Commonly known as Latin-1 and a common default of Microsoft Windows based operating systems</dd>
<dt>windows-1252</dt>
<dd>Commonly known as WinLatin and a common default of legacy versions of Microsoft Windows based operating systems</dd>
</dl>
</td>
<td/>
<td>ISO-8859-1</td>
</tr>
<tr>
<td class=""><em>ignoreHeaderLines</em></td>
<td>Specifies the number of lines to ignore from the beginning of the file. This can be used to skip column headings or preamble comments, for example.</td>
<td/>
<td>0</td>
</tr>
<tr>
<td class=""><em>rowType</em></td>
<td>
A Uniform Resource Identifier (URI) for the term identifying the class of data represented by each row.
See <a href="../../index.htm">Darwin Core Terms</a> definitions. Additional classes may be referenced by URI and defined outside the Darwin Core specification.
For convenience the classes defined by Darwin Core are listed below:
<td>When verbatim dates are used, this field can be used to indicate the format represented. It is recommended to use the date, dateTime and time field types wherever possible, but where verbatim dates are required, a format may be specified here.
This should be considered a 'hint' for consumers. It is recommended that consumers support at least the combinations of DD, MM and YYYY with the separators / and -. Examples are given:
<td>A &lt;file&gt; will contain one or more <a href="#fieldTag">&lt;field&gt;</a> elements, each representing a 'column' in the row</td>
</tr>
</tbody>
</table>
</p>
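<p>To show how the &lt;file&gt; attributes fit together, the following Python sketch reads the bytes of a data file according to <em>compression</em>, <em>encoding</em>, <em>linesTerminatedBy</em>, <em>ignoreHeaderLines</em> and <em>fieldsTerminatedBy</em>, falling back to the defaults listed in the table. It is a sketch under stated assumptions, not a reference implementation: the names <code>read_rows</code>, <code>unescape</code> and <code>DEFAULTS</code> are hypothetical, attribute values are assumed to hold escapes such as <code>\t</code> as literal text, a ZIP archive is assumed to contain a single data file, and field quoting is not handled.</p>

```python
import gzip
import io
import zipfile

# Defaults as listed in the attribute table, written as they would appear
# in the metafile (assumption: escapes are stored as literal text).
DEFAULTS = {
    "fieldsTerminatedBy": "\\t",
    "linesTerminatedBy": "\\n",
    "encoding": "ISO-8859-1",
    "ignoreHeaderLines": "0",
}


def unescape(value: str) -> str:
    """Turn the literal escapes \\t, \\n and \\r into the characters they name."""
    for literal, char in (("\\t", "\t"), ("\\n", "\n"), ("\\r", "\r")):
        value = value.replace(literal, char)
    return value


def read_rows(raw: bytes, attrs: dict) -> list:
    """Split the bytes of a data file into rows of field values."""
    cfg = {**DEFAULTS, **attrs}
    compression = cfg.get("compression")
    if compression == "GZIP":
        raw = gzip.decompress(raw)
    elif compression == "ZIP":
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            # Assumption: the archive holds exactly one data file.
            raw = archive.read(archive.namelist()[0])
    text = raw.decode(cfg["encoding"])
    lines = text.split(unescape(cfg["linesTerminatedBy"]))
    if lines and lines[-1] == "":
        lines.pop()  # drop the empty piece after a trailing line terminator
    lines = lines[int(cfg["ignoreHeaderLines"]):]
    separator = unescape(cfg["fieldsTerminatedBy"])
    return [line.split(separator) for line in lines]


sample = b"id,scientificName\n1,Puma concolor\n2,Felis catus\n"
print(read_rows(sample, {"fieldsTerminatedBy": ",", "ignoreHeaderLines": "1"}))
```

<p>Keeping the defaults in one place means a metafile that omits an attribute and one that states the default explicitly are read identically.</p>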
<h3><a name="fieldTag">4.2 The &lt;field&gt; element</a></h3>
<p>
<table class="border">
<thead>
<caption>Attributes</caption>
<th>Attribute</th>
<th>Description</th>
<th>Required</th>
<th>Default</th>
</thead>
<tbody>
<tr>
<td class=""><em>index</em></td>
<td>Specifies the zero-based index of the column within the row: the first column is 0, the second column is 1, and so on.
If no index is specified, the term and default attributes may be used to define a constant value for all rows</td>
<td/>
<td/>
</tr>
<tr>
<td class=""><em>term</em></td>
<td>A Uniform Resource Identifier (URI) for the term identifying the property of data represented by this field.
For example, a scientific name would be http://rs.tdwg.org/dwc/terms/ScientificName.
Terms outside of the Darwin Core specification may be used, such as those from the Dublin Core Metadata Initiative.
</td>
<td>✓</td>
<td/>
</tr>
<tr>
<td class=""><em>type</em></td>
<td>Specifies the type of the content represented in the column. The following values are supported.
<dl>
<dt>string</dt>
<dd>Represents a sequence of characters, and should be used where no other type is appropriate</dd>
<dt>integer</dt>
<dd>Represents a whole numeric value (e.g. 123)</dd>
<dt>decimal</dt>
<dd>Represents a decimal value (e.g. 10.34). The decimal point must be represented by the character "."; otherwise the field must be declared as a string type</dd>
<dt>dateTime</dt>
<dd>Represents the combination of a date and time, in the format [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]. Valid values include 2001-10-26T21:32:52, 2001-10-26T21:32:52+02:00, 2001-10-26T19:32:52Z, 2001-10-26T19:32:52+00:00, -2001-10-26T21:32:52, and 2001-10-26T21:32:52.12679. Where this format cannot be used, the string type must be declared</dd>
<dt>date</dt>
<dd>Represents a date in the format [-]CCYY-MM-DD[Z|(+|-)hh:mm]. Valid values include 2001-10-26, 2001-10-26+02:00, 2001-10-26Z, 2001-10-26+00:00, -2001-10-26, and -20000-04-01. Where this format cannot be used, the string type must be declared</dd>
<dt>time</dt>
<dd>Represents a time in the format hh:mm:ss[Z|(+|-)hh:mm]. Valid values include 21:32:52, 21:32:52+02:00, 19:32:52Z, 19:32:52+00:00, and 21:32:52.12679. Where this format cannot be used, the string type must be declared</dd>
</dl>
TODO: See guidelines for type specification</td>
<td/>
<td>string</td>
</tr>
<tr>
<td class=""><em>format</em></td>
<td>TODO - finish decision on format</td>
<td/>
<td/>
</tr>
<tr>
<td class=""><em>default</em></td>
<td>Optionally specifies a default value to use when a row does not supply one. If no index is supplied, this can be used to define a constant applicable to all rows.</td>
&lt;!-- field definitions omitted for example --&gt;
&lt;/file&gt;
&lt;/archive&gt;
</pre>
<h4>5.1.3 Multiple related data files</h4>
When the content of one data file relates to another data file, a relationship can be expressed in the metafile using the &lt;relationships&gt; element.
In database terminology, this is equivalent to defining a foreign key constraint from one table to another.
However, whereas a database can enforce this relationship, <em>fielded text</em> files cannot. The following guidelines are recommended:<br/>
<ul>
<li>The fields on either end of a relationship must be of the same type (e.g. xs:integer)</li>
<li>To indicate that a row is not related, its value must be left empty. Placeholders such as 0, -1, \N or NULL must not be used for this purpose</li>
<li>The data provider must ensure that data has integrity - that the target of a relationship does indeed exist</li>
</ul>
Therefore care must be taken by the data provider that the relationship expressed is indeed valid, and that the data integrity is not broken.
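<p>A data provider could test the integrity guidelines above before publishing with a check along the following lines. This is a minimal Python sketch, not a prescribed method: the function name <code>find_broken_links</code> and the sample rows are hypothetical, and rows are assumed to have already been split into lists of string values.</p>

```python
def find_broken_links(child_rows, fk_index, parent_rows, key_index):
    """Return the child rows whose relationship value matches no parent row."""
    parent_keys = {row[key_index] for row in parent_rows}
    broken = []
    for row in child_rows:
        value = row[fk_index]
        if value == "":
            continue  # an empty value means "not related" and is allowed
        if value not in parent_keys:
            broken.append(row)
    return broken


# Made-up sample data: column 1 of each specimen row points at column 0 of taxa.
taxa = [["1", "Puma concolor"], ["2", "Felis catus"]]
specimens = [["a", "1"], ["b", ""], ["c", "NULL"], ["d", "7"]]
print(find_broken_links(specimens, 1, taxa, 0))
```

<p>In this sample, row <code>b</code> passes because its value is empty, while rows <code>c</code> and <code>d</code> are reported: <code>NULL</code> is compared as an ordinary value rather than as a marker for "no relationship", and <code>7</code> has no matching target.</p>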
It goes beyond the scope of these guidelines to specify how a consumer must deal with related data. However, the following procedure is recommended for a database import:
<ul>
<li>Create tables for each described file with no constraints</li>
<li>Import file content into temporary tables</li>
<li>Check data integrity by testing the expressed join</li>
<li>Copy data into tables enforcing the relationship, or add constraint to newly created tables</li>