---++ TRAVEL REPORT

This is my travel report from the GUID-1 meeting.


1. Submitted by: Dag T. F. Endresen

2. Country visited: USA

3. Dates of travel: 31/01/2006 to 06/02/2006

4. Related projects: Germplasm Clearing House Mechanism, Germplasm database interoperability in the context of GBIF, EURISCO and SINGER. (Generation Challenge Programme - SP4 - Task 2005-23 Web services and Task 2005-26 Central Registry)

5. Co-travellers:  none

6. Persons met or visited:
Some of the key persons attending the meeting: Donald Hobern (GBIF, Denmark), Ricardo Pereira (TDWG, Brazil), Roger Hyam (TDWG, UK), Benjamin Szekely (LSID, IBM, USA), George Garrity (DOI, Michigan State University, USA), Stan Blum (California Academy of Sciences, DADI, TDWG, USA), Gerald Guala (Plants, USDA), Jessie Kennedy (Napier University, Scotland), Sally Hinchcliffe (IPNI, Kew Botanical Garden, UK), Steven Perry (University of Kansas, DiGIR, USA), Roderic Page (University of Glasgow, Scotland)
[http://wiki.gbif.org/guidwiki/wikka.php?wakka=GUID1Participants]

7. Purpose of travel:
The Taxonomic Databases Working Group (TDWG) and the Global Biodiversity Information Facility (GBIF) first Workshop on Globally Unique Identifiers for Biodiversity Informatics (GUID-1) at the National Center for Evolutionary Synthesis (NESCent), Durham, NC, USA on Feb 1 to 3, 2006. The global nomenclature taxonomy authorities as well as a number of institutes with different types of collections of biological specimens were represented at the workshop.

A GUID framework is fundamental in facilitating distributed information systems and database interoperability. GUIDs are particularly useful in a cross community network with heterogeneous data models like the biodiversity domain. The most important purpose of the workshop was to choose a primary GUID (Global Unique Identifier) technology to be the recommended TDWG (Taxonomic Databases Working Group) GUID standard. The GUID technology is an important element of data interoperability particularly when joining distributed datasets to form a global, thematic or regional data index. Management of the dynamic GBIF index of all biodiversity data resources will benefit greatly from implementation of GUID technology. The GUID technology will also be an important element of both the next phase of EURISCO development, further SINGER development and a cornerstone in management of the Germplasm/Accession Clearing House Mechanism (under development at IPGRI).

8. Summary of travel:
Workshop outcomes:
(1) LSID (Life Science Identifiers) was chosen as the primary recommended TDWG technology for implementation of GUID technology during the workshop. LSID, DOI and Handle were discussed as the main GUID alternatives. (2) The primary metadata format was agreed to be RDF. Most metadata in the TDWG community today is shared as described as XML data defined from standard XML schemas (ABCD, DarwinCore, GCP_Passport, SPICE, SDD, etc.). (3) A distributed model of GUID authorities to issue and resolve the GUID was preferred to a centralized model. In a centralized model a central authority like for example GBIF could issue and resolve all the GUIDs for biodiversity databases in the life science domain. In a decentralized model, the individual data providers or community networks like the germplasm community (represented by e.g. IPGRI) would issue and resolve GUIDs. GBIF will only "link" to the resolving services from the distributed GUID authorities.

   The LSID format is
   URN:LSID:authority:namespace:object-id:revision
      
   Example
   URN:LSID:ncbi.nlm.nih.gov:GenBank.accession:NT_001063:2
      
   The revision is optional and could indicate different versions of the LSID object/entity.

LSID technology will be used to uniquely identify and resolve "passport" information (metadata) and objects/entities (data) in biodiversity databases. The LSID is a standardized naming schema for biological entities in the life science domain. The LSID technology include the assigning of new globally unique identifiers following the naming syntax, and a resolving service where a user can retrieve the digital entity itself, or information about the entity (some objects are physical real world objects and only the information about the entity can be retrieved in digital form).

The most important requirement is to never assign the same LSID to different objects/entities. It is recommended to limit the number of LSIDs assigned to the same object/entity. Multiple resolutions of LSIDs are supported. The same LSID can be resolved from several different LSID resolving authorities, but only one authority is the "home" authority. Other authorities with metadata on a LSID should register with the "home" authority for the particular LSID they wish to contribute metadata. Further details on the LSID implementation will be discussed during the spring and will be the topic for the planned follow-up GUID-2 workshop.

Benjamin Szekely (IBM, USA) presented the LSID technology to the workshop. LSID is short for Life Sciences Identifiers and is a specification from the Object Management Group (OMG). IBM is the main contributor to the development of the LSID specification. [http://lsid.sourceforge.net/]

George Garrity (Michigan State University, USA) presented the DOI technology. DOIs are widely used to identify and resolve published articles with more. Individual articles in the major scientific journals are identified by DOIs and can be resolved to the abstract and depending on copyrights and access rights of the client to the full text of the article. DOIs are developed in particular for identification of intellectual property and implements routines to manage copyright protection and access control. [http://www.doi.org/]

The Handle System was not formally presented, but in particular Roderic Page introduced the technology to the workshop. Handle is the super class of the DOIs. The DOIs are a sub-authority of the Handle System. Other implementations of Handle besides DOI exist (DVL, DSpace, cIDF, NDLP). [http://www.handle.net/]

Also the ARK (Archival Resource Key) technology was shortly considered, but not included in the discussed alternatives. [http://www.cdlib.org/inside/diglib/ark/]

The main criteria leading to the choice of LSID were the cost model of DOIs, the more dynamic model of LSID authorities, the open nature of the LSID protocol and better support for implementation on different platforms. The DOI system is built on micro payments. For each DOI issued the issuing authority will pay a few cents to the DOI foundation. Further the issuing authority also pay an annual fee to the DOI foundation in the magnitude of $US 35,000. These costs were considered unrealistic for most biodiversity data providers. GBIF could represent the complete biodiversity domain, but the workshop preferred a more distributed GUID authority model.

Some concerns were raised on the requirement of DNS in the resolution of LSIDs. Other resolution models are supported by the LSID specification, but not implemented yet. LSIDs have less opacity (DNS domain name used as part of the LSID) than DOI and Handle were also discussed, but not weighted in disfavor of LSID. A central body to supervise the assignment and resolving of the GUID is a better guarantee of uniqueness for never assigning the same GUID twice as well as a guarantee of the GUID being resolved. Each individual DOI is registered in the central DOI index. But several use cases of GUID for biodiversity objects/entities demanded a more flexible model. LSID authorities are more flexible to establish and depend only on modification of the DNS SRV record to define that a LSID service is provided from this Internet domain. 

Both the Handle System and DOI are proprietary technologies. The patent holders could choose to become more restrictive on the use of the technology in the future. The LSID technology is more open and responsibility for the further development of the LSID framework could be taken over by TDWG or other bodies if the present LSID developers reduce activity or move in a direction unsuitable to the biodiversity domain.


9. Recommendations/Actions to be taken:
Set up a LSID authority at IPGRI for the Germplasm/Accession Clearing House Mechanism.

A central LSID authority to issue and resolve LSID for the germplasm domain could be established and hosted from IPGRI. The same cautious distributed model as recommended from TDWG should preferably be maintained. It is not expected that the individual genebank institutes will jump on the opportunity to implement LSID authorities just because recommended to do so from GBIF/TDWG (and IPGRI?). The utility of the GUID is less obvious for a local information system. The utilities of LSIDs are more important for aggregators of dynamic distributed data like the EURISCO, SINGER, Germplasm/Accession CHM, and GBIF data portal. For a local information system a local unique identifier is sufficient and the extra work (and costs) to build, maintain and resolve a global unique identifier is hard to justify from the local funding. It is however possible to establish and curate a central authority to issue and resolve LSID to local germplasm data units. The central authority will need to curate the bounding between the LSID and the local germplasm data unit unless a commitment to maintain the bounding is made form the local/source data provider. For germplasm accessions there is already a commitment from the European genebanks providing data to the EURISCO catalogue to maintain unique resolution of the accession based on a composite key of accession number/catalogue number, holding institute code (WIEWS institute codes) and genus name. Other similar commitments of stable, unique and persistent bounding of metadata and germplasm objects could be made/identified and used in the resolution of the LSID to the germplasm units. The local genebank information systems should be encouraged to include and to provide the LSID as part of the unit level/passport data. The better option is if with time the local genebank systems will establish their own local LSID authorities.

The central germplasm LSID authority should be designed to allow for local genebank information systems to establish local LSID authority services. Two overall strategies can be followed. The LSIDs issued can be divided to virtual authorities that later can be moved to the local data providing institute. The LSID issued from the central authority (IPGRI) can remain at the central authority and in the case of a local LSID authority resolving the same, or probably a completely new LSID, the central LSID authority will simply resolve (redirect service of) the LSID to the local authority. With a robust model to refresh dynamic metadata from the source of the type planned for the Germplasm/Accession Clearing House Mechanism index the central LSID authority would perhaps remain sufficient for most genebank institutes.


10. Distribution outside IPGRI: No limitations from my side.