
Darwin Core Text Guidelines

Title: Darwin Core Text Guidelines
Date Issued: 2009-02-12
Abstract: Guidelines for describing Darwin Core data held in fielded text files by means of an XML metafile.
Contributors: John Wieczorek (MVZ), Markus Döring (GBIF), Renato De Giovanni (CRIA), Tim Robertson (GBIF), Dave Vieglais (KUNHM), Stan Blum (CAS)
Legal: This document is governed by the standard legal, copyright, licensing provisions and disclaimers issued by the Taxonomic Databases Working Group.
Part of TDWG Standard: ***URL to DwC Standard*** goes here
Creator: Darwin Core Task Group
Identifier: http://rs.tdwg.org/dwc/terms/xsd/guide/2009-02-12/
Latest Version: http://rs.tdwg.org/dwc/terms/xsd/guide/
Replaces: Not applicable
Replaced By: Not applicable
Translations: http://rs.tdwg.org/dwc/translations/
Document Status: This is a TDWG Request for Comment.

Table of Contents

1. Introduction
2. References
3. Terminology
4. Metafile content description
5. General implementation guidelines
6. Database exporting examples
7. Guidelines for consumers

1. Introduction

This document provides guidelines for describing Darwin Core data held in fielded text files (e.g. comma-separated values or tab-delimited files) by means of an XML metafile.

1.1 XML versus Fielded Text

Many resources exist on the web describing the advantages of XML (http://en.wikipedia.org/wiki/XML) over less structured content such as fielded text. These guidelines do not promote the use of Fielded Text over XML for data files, but rather provide recommendations for how to handle such data files when necessary.
2 such scenarios might be

1.2 Existing Solution

Proposed standards exist for similar XML metafiles to describe fielded text files, such as the FieldedText standard. The FieldedText standard aims to offer description of any fielded text file including all possible permutations of content. While beneficial to the publisher, this flexibility provides significant challenges to the consumer due to the diverse options that may exist.

1.3 Example Metafile Content

A simple comma-separated values data file of the following form:
ID,ScientificName,IndividualCount
123,"Cryptantha gypsophila Reveal & C.R. Broome",12
124,"Buxbaumia piperi",2 
can be described with the following illustrative Darwin Core metafile (namespaces omitted for brevity):
<archive fileRoot="http://data.gbif.org/download/">
  <file 
    rowType="http://rs.tdwg.org/dwc/text/DarwinRecord"
    location="specimens.csv"
    ignoreHeaderLines="1">
      <field index="0" term="http://rs.tdwg.org/dwc/terms/CatalogNumber" type="xs:integer"/>
      <field index="1" term="http://rs.tdwg.org/dwc/terms/ScientificName" type="xs:string"/>
      <field index="2" term="http://rs.tdwg.org/dwc/terms/IndividualCount" type="xs:integer"/>
      <!-- A constant value has no index, but applies to all rows -->
      <field term="http://rs.tdwg.org/dwc/terms/DatasetID" type="xs:string" default="urn:lsid:tim.lsid.tdwg.org:collections:1"/>
  </file>
</archive>
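
As a rough illustration of how a consumer might apply such a metafile, the following Python sketch parses the example above and maps each data row to its term URIs. The helper name read_records is hypothetical, the metafile and data are inlined so that the sketch is self-contained, and values are left as strings rather than converted to the declared types.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Metafile and data copied from the example above (namespaces omitted).
METAFILE = """<archive fileRoot="http://data.gbif.org/download/">
  <file rowType="http://rs.tdwg.org/dwc/text/DarwinRecord"
        location="specimens.csv" ignoreHeaderLines="1">
    <field index="0" term="http://rs.tdwg.org/dwc/terms/CatalogNumber" type="xs:integer"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/ScientificName" type="xs:string"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/IndividualCount" type="xs:integer"/>
    <field term="http://rs.tdwg.org/dwc/terms/DatasetID" type="xs:string"
           default="urn:lsid:tim.lsid.tdwg.org:collections:1"/>
  </file>
</archive>"""

DATA = '''ID,ScientificName,IndividualCount
123,"Cryptantha gypsophila Reveal & C.R. Broome",12
124,"Buxbaumia piperi",2
'''


def read_records(metafile_xml, data_text):
    """Return one dict per data row, keyed by term URI (hypothetical helper)."""
    file_el = ET.fromstring(metafile_xml).find("file")
    skip = int(file_el.get("ignoreHeaderLines", "0"))
    fields = file_el.findall("field")
    rows = list(csv.reader(io.StringIO(data_text)))[skip:]
    records = []
    for row in rows:
        record = {}
        for field in fields:
            index = field.get("index")
            if index is None:
                # No index: the default acts as a constant for every row.
                record[field.get("term")] = field.get("default")
            else:
                record[field.get("term")] = row[int(index)]
        records.append(record)
    return records


records = read_records(METAFILE, DATA)
for record in records:
    print(record)
```

Each resulting record carries the three indexed columns plus the constant DatasetID value.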

2. References

[DCTERMS] http://dublincore.org/documents/dcmi-terms/ Dublin Core Metadata terms.
[FIELDEDTEXT] http://www.fieldedtext.org/ Fielded Text proposed standard.
[HISTORY] http://rs.tdwg.org/dwc/terms/history/ Complete historical reference to Darwin Core terms.
[NAMESPACEPOLICY] http://rs.tdwg.org/dwc/terms/namespace/ Policy governing Darwin Core terms.
[TERMS] http://rs.tdwg.org/dwc/terms/ Quick reference to recommended Darwin Core terms.
[TEXTSCHEMA] http://rs.tdwg.org/dwc/terms/xsd/tdwg_dwc_text.xsd Simple Darwin Core Text schema.
[VERSIONS] http://rs.tdwg.org/dwc/terms/history/versions/ Reference for mapping historical Darwin Core terms to the current recommended terms.
[XML] http://www.w3.org/XML/ Reference site for the Extensible Markup Language (XML).

3. Terminology

Fielded Text
Fielded Text refers to a format that structures a flat text file into rows and columns; examples include comma-separated values (CSV) and tab-delimited files (Tab files).

4. Metafile content description

The metafile schema is available at tdwg_dwc_text.xsd.

4.1 The <archive> element

Attributes
Attribute Description Required Default
fileRoot Contains a qualified Uniform Resource Locator (URL) defining the root location of the data files being described. The location must be publicly accessible. Valid examples of the format include http://data.gbif.org/collections/, ftp://ftp.gbif.org/public/ and http://data.gbif.org/webservices/export?id=. This value will be concatenated with the location of the <file> and should therefore contain any necessary trailing characters such as / or ?.
Elements
Element Description
<file> An <archive> will contain one or more <file> elements, each representing an individual file being described.
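
The concatenation rule for fileRoot can be sketched in Python as below. The helper name data_file_url and the location value "42" are hypothetical. The sketch also shows why plain concatenation differs from standard relative URL resolution, which would discard the query string of a root such as http://data.gbif.org/webservices/export?id=.

```python
from urllib.parse import urljoin


def data_file_url(file_root, location):
    """Plain string concatenation of fileRoot and location (hypothetical helper)."""
    return file_root + location


# A directory-style root must already end with "/".
print(data_file_url("http://data.gbif.org/collections/", "dwc-data.txt"))

# A web service root keeps its query string under concatenation ...
root = "http://data.gbif.org/webservices/export?id="
print(data_file_url(root, "42"))

# ... whereas relative URL resolution replaces the last path segment
# and drops the query, which is not what the guideline describes.
print(urljoin(root, "42"))
```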

4.2 The <file> element

Attributes
Attribute Description Required Default
location Specifies the location of the file relative to the fileRoot - e.g. dwc-data.txt
fieldsTerminatedBy Specifies the delimiter between fields. Typical values might be "," or "\t" for CSV or Tab files respectively. Default: \t
linesTerminatedBy Specifies the row separator character. Default: \n
compression Specifies the compression used for the file. May be omitted or specified as one of:
GZIP
Data file is compressed as GZIP
ZIP
Data file is compressed as ZIP (E.g. using PKZIP, WinZip, StuffIt etc)
encoding Specifies the encoding for the data file. One of:
UTF-8
8-bit Unicode Transformation Format
UTF-16
16-bit Unicode Transformation Format
ISO-8859-1
Commonly known as Latin-1 and a common default of Microsoft Windows based operating systems
windows-1252
Commonly known as WinLatin and a common default of legacy versions of Microsoft Windows based operating systems
Default: ISO-8859-1
ignoreHeaderLines Specifies the number of lines to ignore from the beginning of the file. This can be used, for example, to skip column headings or preamble comments. Default: 0
rowType A Uniform Resource Identifier (URI) for the term identifying the class of data represented by each row. See Darwin Core Terms definitions. Additional classes may be referenced by URI and defined outside the Darwin Core specification. For convenience the classes defined by Darwin Core are listed below:
Simple Darwin Core
http://rs.tdwg.org/dwc/terms/text/DarwinRecord
Dataset
http://rs.tdwg.org/dwc/terms/Dataset
Sample
http://rs.tdwg.org/dwc/terms/Sample
SamplingEvent
http://rs.tdwg.org/dwc/terms/SamplingEvent
SamplingLocation
http://rs.tdwg.org/dwc/terms/SamplingLocation
Identification
http://rs.tdwg.org/dwc/terms/Identification
Taxon
http://rs.tdwg.org/dwc/terms/Taxon
RelatedResource
http://rs.tdwg.org/dwc/terms/RelatedResource
SampleAttribute
http://rs.tdwg.org/dwc/terms/SampleAttribute
EventAttribute
http://rs.tdwg.org/dwc/terms/EventAttribute
dateFormat When verbatim dates are used, this attribute can be used to indicate the format in which they are represented. It is recommended to use the date, dateTime and time field types wherever possible, but where verbatim dates are required, a format may be specified here. This should be considered a 'hint' for consumers. It is recommended that consumers support, as a minimum, the combinations of DD, MM and YYYY with the separators / and -. Examples are given:
DDMMYYYY
E.g. for dates in format 21121978
DD-MM-YYYY
E.g. for dates in format 21-12-1978
MMDDYYYY
E.g. for dates in format 12211978
MM-DD-YYYY
E.g. for dates in format 12-21-1978
YYYYMMDD
E.g. for dates in format 19781221
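
One way a consumer might act on the dateFormat hint is sketched below in Python. It assumes that only the tokens DD, MM and YYYY appear in the hint and that any other character is a literal separator. The function names are hypothetical.

```python
from datetime import datetime

# Tokens used in the dateFormat examples, mapped to strptime directives.
_TOKENS = (("YYYY", "%Y"), ("MM", "%m"), ("DD", "%d"))


def to_strptime_pattern(date_format):
    """Translate a hint such as DD-MM-YYYY into a strptime pattern."""
    pattern = date_format
    for token, directive in _TOKENS:
        pattern = pattern.replace(token, directive)
    return pattern


def parse_verbatim_date(value, date_format):
    """Parse a verbatim date string according to the dateFormat hint."""
    return datetime.strptime(value, to_strptime_pattern(date_format)).date()


print(parse_verbatim_date("21121978", "DDMMYYYY"))
print(parse_verbatim_date("12-21-1978", "MM-DD-YYYY"))
print(parse_verbatim_date("21/12/1978", "DD/MM/YYYY"))
```

Because the hint is advisory, a consumer would probably want to catch the ValueError raised for values that do not match and fall back to treating the field as a string.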
Elements
Element Description
<field> A <file> will contain one or more <field> elements, each representing a 'column' in the row
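
The <file> attributes described above might be applied by a consumer roughly as in the following Python sketch. The helper names are hypothetical, and several points are assumptions rather than part of these guidelines: the written forms \t, \n and \r in attribute values are taken to stand for the corresponding control characters, a ZIP archive is taken to hold a single data file, and quoted fields are assumed not to contain the line terminator.

```python
import csv
import gzip
import io
import zipfile

# Escape sequences as they might be written in metafile attributes (assumed set).
_ESCAPES = {"\\t": "\t", "\\n": "\n", "\\r": "\r"}


def unescape(value):
    """Turn a written escape such as backslash-t into the actual character."""
    for written, char in _ESCAPES.items():
        value = value.replace(written, char)
    return value


def read_rows(raw, fields_terminated_by="\\t", lines_terminated_by="\\n",
              encoding="ISO-8859-1", compression=None, ignore_header_lines=0):
    """Decode the raw bytes of a data file into rows of strings (hypothetical helper)."""
    if compression == "GZIP":
        raw = gzip.decompress(raw)
    elif compression == "ZIP":
        # Assumption: the archive holds a single data file.
        with zipfile.ZipFile(io.BytesIO(raw)) as archive:
            raw = archive.read(archive.namelist()[0])
    text = raw.decode(encoding)
    # Empty lines (such as the one after a trailing terminator) are dropped.
    lines = [line for line in text.split(unescape(lines_terminated_by)) if line]
    reader = csv.reader(lines[ignore_header_lines:],
                        delimiter=unescape(fields_terminated_by))
    return list(reader)


sample = gzip.compress("id\tname\n1\tBuxbaumia piperi\n".encode("UTF-8"))
rows = read_rows(sample, encoding="UTF-8", compression="GZIP", ignore_header_lines=1)
print(rows)
```

The keyword defaults mirror the defaults listed for fieldsTerminatedBy, linesTerminatedBy, encoding and ignoreHeaderLines.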

4.3 The <field> element

Attributes
Attribute Description Required Default
index Specifies the column index within the row. The first column is column 0, the second is column 1, and so on. If no column index is specified, the term and the default may be used to define a constant value for all rows.
term A Uniform Resource Identifier (URI) for the term identifying the property of data represented by this field. For example, a scientific name would be http://rs.tdwg.org/dwc/terms/ScientificName. Terms outside of the Darwin Core specification may be used, such as those from the Dublin Core Metadata Initiative.
type Specifies the type of the content represented in the column. The following values are supported.
string
Represents a sequence of characters, and should be used where no other type is appropriate
integer
Represents a whole numeric value (e.g. 123)
decimal
Represents a decimal value (e.g. 10.34). The decimal point must be represented by the character "."; otherwise the field must be declared as a string type
dateTime
Represents the combination of a date and time, in the format [-]CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]. Valid values include 2001-10-26T21:32:52, 2001-10-26T21:32:52+02:00, 2001-10-26T19:32:52Z, 2001-10-26T19:32:52+00:00, -2001-10-26T21:32:52, and 2001-10-26T21:32:52.12679. Where this format cannot be used, the string type must be declared
date
Represents a date in the format [-]CCYY-MM-DD[Z|(+|-)hh:mm]. Valid values include 2001-10-26, 2001-10-26+02:00, 2001-10-26Z, 2001-10-26+00:00, -2001-10-26, and -20000-04-01. Where this format cannot be used, the string type must be declared
time
Represents a time in the format hh:mm:ss[Z|(+|-)hh:mm]. Valid values include 21:32:52, 21:32:52+02:00, 19:32:52Z, 19:32:52+00:00, and 21:32:52.12679. Where this format cannot be used, the string type must be declared
TODO: See guidelines for type specification
Default: string
format TODO - finish decision on format
default Used to optionally specify a default value should there not be one supplied in any given row. If no index is supplied, this can be used to define a constant applicable to all rows.
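
The interplay of index, type and default might be handled by a consumer along the lines of the Python sketch below. The helper field_value is hypothetical. The converters rely on Python's ISO 8601 parsing, which covers only part of the lexical forms listed above (for example, negative years are not accepted), so this is an approximation rather than a complete implementation of the listed types.

```python
from datetime import date, datetime, time
from decimal import Decimal

# One converter per supported type name; each takes the raw string value.
CONVERTERS = {
    "string": str,
    "integer": int,
    "decimal": Decimal,
    "dateTime": datetime.fromisoformat,
    "date": date.fromisoformat,
    "time": time.fromisoformat,
}


def field_value(row, index=None, type_name="string", default=None):
    """Return the typed value of one field for one row (hypothetical helper)."""
    # Without an index the field is a constant defined by its default.
    raw = row[index] if index is not None and index < len(row) else ""
    if raw == "":
        raw = default
    if raw is None:
        return None
    name = type_name.split(":")[-1]  # accept "xs:integer" as well as "integer"
    return CONVERTERS[name](raw)


row = ["123", "Buxbaumia piperi", "", "10.34", "2001-10-26"]
print(field_value(row, 0, "xs:integer"))
print(field_value(row, 2, "integer", default="1"))
print(field_value(row, None, default="urn:lsid:tim.lsid.tdwg.org:collections:1"))
```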

5. General implementation guidelines

5.1 Single and multiple data files

5.1.1 Single data file

In its simplest usage, a single data file can be described. Specifically the file location, the row type and the field mapping are provided.

<!-- Namespaces omitted for example -->
<archive fileRoot="http://mydata.org/">
  <file rowType="http://rs.tdwg.org/dwc/terms/text/DarwinRecord" 
    location="specimens.txt">
    <field index="0" 
      term="http://rs.tdwg.org/dwc/terms/CatalogNumber" 
      type="xs:integer"/>
    <field index="1" 
      term="http://rs.tdwg.org/dwc/terms/ScientificName" 
      type="xs:string"/>
  </file>
</archive>

5.1.2 Multiple unrelated data files

Multiple files containing no inter-file relationships may be described with a single metafile. The files must reside at the same 'root' location. A typical example for this usage might be multiple dataset files each with a common format.

<!-- Namespaces omitted for example -->
<archive fileRoot="http://mydata.org/">
  <file rowType="http://rs.tdwg.org/dwc/text/DarwinRecord" 
    location="aves.txt">
    <!-- field definitions omitted for example -->
  </file>
  <file rowType="http://rs.tdwg.org/dwc/text/DarwinRecord" 
    location="lepidoptera.txt">
    <!-- field definitions omitted for example -->
  </file>
</archive>

5.1.3 Multiple related data files

When the content of one data file relates to another data file, a relationship can be expressed in the metafile using the <relationships> element. In database terminology, this is equivalent to defining a foreign key constraint from one table to another. However, whereas a database can enforce such a constraint, fielded text files cannot. The data provider must therefore take care that the relationship expressed is valid and that data integrity is not broken.

<!-- Namespaces omitted for example -->
<archive fileRoot="http://mydata.org/">
  <file rowType="http://rs.tdwg.org/dwc/terms/Sample" 
    location="specimens.txt">
    <field index="0" term="http://rs.tdwg.org/dwc/terms/CatalogNumber"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/IndividualCount"/>
  </file>
	
  <file rowType="http://rs.tdwg.org/dwc/terms/Identification" 
    location="identifications.txt">
    <field index="0" term="http://rs.tdwg.org/dwc/terms/IdentificationID"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/IdentifiedBy"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/CatalogNumber"/>
    <field index="3" term="http://rs.tdwg.org/dwc/terms/ScientificName"/>
  </file>
	
  <relationships>
    <relationship>
      <file location="specimens.txt" fieldIndex="0"/>
      <file location="identifications.txt" fieldIndex="2"/>
    </relationship>
  </relationships>	
</archive>

Note:
Although feasible, it is not recommended to express a relationship from one file to itself. This recommendation is made since no description of the relationship type may be expressed.
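
Since fielded text cannot enforce a relationship, a provider may wish to check it before publishing. The Python sketch below shows one possible check for the relationship in the example above. It assumes that specimens.txt is the referenced side and identifications.txt the referencing side, which the metafile itself does not state. The helper name and the sample rows are made up for illustration.

```python
def dangling_keys(parent_rows, parent_index, child_rows, child_index):
    """Return child key values that match no parent row, sorted (hypothetical helper)."""
    parent_keys = {row[parent_index] for row in parent_rows}
    child_keys = {row[child_index] for row in child_rows}
    return sorted(child_keys - parent_keys)


# Made-up rows: specimens.txt (CatalogNumber, IndividualCount)
specimens = [["123", "12"], ["124", "2"]]

# Made-up rows: identifications.txt
# (IdentificationID, IdentifiedBy, CatalogNumber, ScientificName)
identifications = [
    ["1", "A. Botanist", "123", "Cryptantha gypsophila"],
    ["2", "A. Botanist", "999", "Buxbaumia piperi"],
]

# fieldIndex 0 of specimens.txt is related to fieldIndex 2 of identifications.txt.
missing = dangling_keys(specimens, 0, identifications, 2)
print(missing)
```

A non-empty result indicates identifications that refer to a catalog number absent from the specimens file.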

5.2 Field Type Guidelines

Most terms should be typed as "string" with the exception of the following terms, which are listed with proposed types:

Non string term mappings
Term | Recommended Types | Comments
http://rs.tdwg.org/dwc/terms/DateIdentified | dateTime, date, string
http://rs.tdwg.org/dwc/terms/EarliestDateCollected | dateTime, date, string
http://rs.tdwg.org/dwc/terms/EventAttributeDeterminedDate | dateTime, date, string
http://rs.tdwg.org/dwc/terms/LatestDateCollected | dateTime, date, string
http://rs.tdwg.org/dwc/terms/SampleAttributeDeterminedDate | dateTime, date, string
http://rs.tdwg.org/dwc/terms/VerbatimCollectingDate | dateTime, date, string
http://rs.tdwg.org/dwc/terms/CoordinatePrecision | decimal, int, string
http://rs.tdwg.org/dwc/terms/CoordinateUncertaintyInMeters | decimal, int, string
http://rs.tdwg.org/dwc/terms/DistanceAboveSurfaceInMetersMaximum | decimal, int, string
http://rs.tdwg.org/dwc/terms/DistanceAboveSurfaceInMetersMinimum | decimal, int, string
http://rs.tdwg.org/dwc/terms/EventAttributeAccuracy | decimal, int, string
http://rs.tdwg.org/dwc/terms/EventAttributeValue | decimal, int, string
http://rs.tdwg.org/dwc/terms/MaximumDepthInMeters | decimal, int, string
http://rs.tdwg.org/dwc/terms/MaximumElevationInMeters | decimal, int, string
http://rs.tdwg.org/dwc/terms/MinimumDepthInMeters | decimal, int, string
http://rs.tdwg.org/dwc/terms/MinimumElevationInMeters | decimal, int, string
http://rs.tdwg.org/dwc/terms/SampleAttributeAccuracy | decimal, int, string
http://rs.tdwg.org/dwc/terms/SampleAttributeValue | decimal, int, string
http://rs.tdwg.org/dwc/terms/VerbatimDepth | decimal, int, string
http://rs.tdwg.org/dwc/terms/DecimalLatitude | decimal, string
http://rs.tdwg.org/dwc/terms/DecimalLongitude | decimal, string
http://rs.tdwg.org/dwc/terms/CatalogNumberNumeric | int
http://rs.tdwg.org/dwc/terms/DayOfMonth | int, string | using 1 as 1st of the month
http://rs.tdwg.org/dwc/terms/EndDayOfYear | int, string
http://rs.tdwg.org/dwc/terms/IndividualCount | int, string
http://rs.tdwg.org/dwc/terms/MonthOfYear | int, string | using 1 as January
http://rs.tdwg.org/dwc/terms/PointRadiusSpatialFit | int, string
http://rs.tdwg.org/dwc/terms/StartDayOfYear | int, string | using 1 as January 1st
http://rs.tdwg.org/dwc/terms/YearSampled | int, string | in the format CCYY e.g. 2001
http://rs.tdwg.org/dwc/terms/EndTimeOfDay | time, string
http://rs.tdwg.org/dwc/terms/StartTimeOfDay | time, string

6. Database exporting examples

6.1 MySQL

The SELECT ... INTO OUTFILE statement makes it easy to produce fielded text from MySQL.
The encoding of the resulting file depends on the server variables and collations in use, which might need to be modified before the operation. Note that MySQL represents NULL values as \N by default, so the IFNULL() function should be used to substitute an empty string, as in the following example.
SELECT 
  IFNULL(id, ''), IFNULL(scientific_name, ''), IFNULL(count,'') 
    INTO outfile '/tmp/dwc.txt' 
      FIELDS TERMINATED BY ',' 
      OPTIONALLY ENCLOSED BY '"' 
      LINES TERMINATED BY '\n' 
FROM 
  dwc;
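
Where an export was produced without IFNULL(), a consumer could normalize the \N markers when reading the file. The Python sketch below is one way to do so. It assumes the field and line options used in the statement above, and that the data contains no backslash escapes other than the \N marker. The helper name is hypothetical.

```python
import csv
import io

NULL_MARKER = "\\N"  # backslash followed by N, as written by MySQL for NULL


def read_mysql_export(text):
    """Read comma-delimited export text, mapping the NULL marker to ''."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    return [["" if cell == NULL_MARKER else cell for cell in row] for row in reader]


exported = '123,"Cryptantha gypsophila Reveal & C.R. Broome",12\n124,"Buxbaumia piperi",\\N\n'
rows = read_mysql_export(exported)
print(rows)
```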

7. Guidelines for consumers

It goes beyond the scope of these guidelines to specify how a consumer must deal with related data. However, the following procedure is recommended for a database import:

Copyright 2009 - Biodiversity Information Standards - TDWG

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 3.0 United States License.