wiki-archive/twiki/data/Phylogenetics/LinkingTreesReport1.txt,v

head 1.9;
access;
symbols;
locks; strict;
comment @# @;
expand @o@;
1.9
date 2010.11.10.15.10.33; author ArlinStoltzfus; state Exp;
branches;
next 1.8;
1.8
date 2010.11.09.19.43.21; author ArlinStoltzfus; state Exp;
branches;
next 1.7;
1.7
date 2010.11.09.17.50.45; author ArlinStoltzfus; state Exp;
branches;
next 1.6;
1.6
date 2010.11.09.15.06.33; author ArlinStoltzfus; state Exp;
branches;
next 1.5;
1.5
date 2010.11.08.20.51.02; author ArlinStoltzfus; state Exp;
branches;
next 1.4;
1.4
date 2010.11.03.02.12.53; author DanRosauer; state Exp;
branches;
next 1.3;
1.3
date 2010.11.02.20.18.36; author DanRosauer; state Exp;
branches;
next 1.2;
1.2
date 2010.11.02.16.22.40; author DanRosauer; state Exp;
branches;
next 1.1;
1.1
date 2010.10.28.13.55.37; author ArlinStoltzfus; state Exp;
branches;
next ;
desc
@none
@
1.9
log
@none
@
text
@%META:TOPICINFO{author="ArlinStoltzfus" date="1289401833" format="1.1" reprev="1.9" version="1.9"}%
%META:TOPICPARENT{name="LinkingTrees2010"}%
to delete
@
1.8
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="ArlinStoltzfus" date="1289331801" format="1.1" reprev="1.8" version="1.8"}%
d3 1
a3 205
---+ Current Best Practices for Publishing Trees Electronically: Draft Report and Request for Comments
---++ Executive Summary
*(not done)*
Recent announcements from major funding agencies and from journals in evolution and systematics (Appendix 1) will require online publication of trees in reusable formats. Increasingly, phylogenies are seen not just as an endpoint – presenting the inferred relationships between a set of organisms or taxa – but as the starting point for a range of further research. This research will almost always involve linking with other data.
Phylogenetic trees are represented in a variety of electronic formats, but when published, appear most frequently as a graphical image. While making the tree accessible online in a standard format would be a major step forward, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for many researchers to obtain.
Most existing phylogeny information artefacts (e.g., files) meet none of these conditions. Integrating phylogenetic information into the global web of data will progress rapidly when it is:
* easy for users to put this information into their trees via appropriate software; and
* considered standard good practice, and beneficial to the creators of the trees, to include linkable information.
Thus, the goal of this report is to identify data formats, software and work procedures to deliver reusable, linkable trees. We hope to provoke discussion leading to workable and widely accepted solutions to this problem.
---+ Request for Comments
*(done)*
To ensure that the descriptions and recommendations here are accurate and relevant to the community of users, we are seeking feedback in several ways:
* we are targeting scientists (via scientific email lists) with a survey to assess current practices and needs ([[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ][draft survey]]).
* we provide a form for feedback on this page ([[#AddComments][below]])
* we accept long comments emailed to an.address [at] geebung.id.au [Add real address here]
---+ Draft Report: Current Best Practices for Publishing Trees Electronically
---++ Introduction
---+++ Rationale for archiving re-useable scientific data and metadata
(*done, but should be shortened*)
Re-use of accessible data is crucial to the progressive and self-policing nature of scientific inquiry. Thus, professional associations, publishers, and funding agencies recognize that ''availability of the data underlying published scientific findings is essential to a healthy scientific process'' (see Appendix 1). In the past, publication of a conventional scientific article was sufficient to satisfy most of the demand for accessible and re-usable data. This is no longer the case, due to technological advances that make it easy to produce massive amounts of data, and to aggregate and synthesize diverse types of data from all over the globe. These conditions drive scientific communities to develop data archives and information standards, along with the cyber-infrastructure to support their use.
This infrastructure has both post-publication and pre-publication benefits to researchers. Post-publication archiving of results increases the accessibility and exposure of a researcher's work. However, the same technologies and standards that facilitate effective archiving also enable researchers in the pre-publication stage to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework.
The premise of archiving phylogenies is that the scientific community will benefit from having access to the trees generated every year. The number of such trees is quite large. The analysis of citations by Kumar & Dudley suggested that the number of phylogeny publications in 2006 was 7000, and rapidly increasing. Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research.
---+++ Features expected to make archived trees re-useable
*(not done: see formal language support below)*
What would make these trees re-useable, allowing that there may be many different categories of re-use (replication, meta-analysis, aggregation, integration)? Some guidance is provided by the considerations outlined in the 2008 roadmap of the TDWG Technical Architecture Group (TAG). The TAG Roadmap emphasizes three things: globally unique identifiers (GUIDs), validatable formats (specifically, XML), and formal language support (ontologies).
---++++ 1. Standard, validatable formats
*(done)*
Currently, most trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a picture of the tree, rather than the tree as an informational entity. For an electronic file, one typically must write to the authors. For trees to be re-usable, they must be accessible in a standard format that makes the structure of the tree explicit. There are a variety of data formats that do this (see Appendices 1 and 3). Some of these formats can be validated, i.e., they are defined by a schema to which every valid instance conforms.
---++++ 2. Rich annotations ("metadata")
*(done)*
The importance of metadata ("annotations") can be illustrated with the following example of a tree in the most commonly encountered format, a "Newick" string with nested parentheses representing clades:
<pre>
((otu1:0.34, otu2:0.19):0.11, otu3:0.44);
</pre>
This tree cannot be re-used for any purpose. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree because we can't tell whether it's a species tree (relevant to speciation), or some other kind of tree (irrelevant). We might be able to determine this by reading the original publication to find out what the labels ("otu1", etc) mean, but no citation information is included with the tree. To interpret this tree, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable thing.
This example suggests some obvious ways to make a tree re-useable, namely to provide citation information, and when appropriate, to provide taxonomic links or other identifiers for the "OTUs" (Operational Taxonomic Units) at the tips of the tree. Another way to look at the problem of re-useability is to imagine that we have a database full of all published trees, richly annotated with the right kinds of metadata, and our challenge is to use this database to explore a basic research question, discover new relationships among types of data, or carry out a meta-analysis addressing a methodological issue. Types of data or metadata useful for such studies would include:
* authorship and citation data
* taxonomic links and species identifiers
* identifiers for a specimen or accession to which OTUs are linked
* links to data from which the tree was inferred
* geographic coordinates
* a description of the method by which the tree was inferred
---++++ 3. Formal language support.
*(not done)*
We need to describe the concept of formal language support to assign classes and relationships. GUIDs and ontologies.
---+++ Rationale for this assessment
*(done, but could be shorter)*
The ultimate goal of our effort here is to make trees more interoperable. We believe that if the forest of trees produced by researchers each year were computationally accessible, the scientific community would have a much greater capacity to validate and extend phylogeny-based research. The benefits of linked data have been discussed elsewhere (http://www.taxonconcept.org/taxonconcept-blog/2010/8/5/why-linked-open-data-makes-sense-for-biodiversity-informatic.html).
As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. This effort is timely for several reasons:
* While in the past, many scientists felt no incentive to share data, recent research has shown that making data available in public archives increases citations (ref: Piwowar, research remix), widely understood as an indicator of professional success;
* In early 2010, eight journals in evolution and systematics announced plans to implement a data-archiving policy (see Appendix 1);
* The only major electronic repository of trees, TreeBase, has recently completed a major upgrade of features, including its submission process; in 2009, a data archive called Dryad was launched and will accept various kinds of electronic files, including those with trees;
* NSF has recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1). Thus, scientists will be motivated by funding agencies to share data electronically;
* In recent years, phyloinformatics researchers have been developing supporting technologies to enable interoperability, including XML file formats (NeXML, PhyloXML), an ontology (CDAO) and a web-services standard (PhyloWS).
Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies are emerging to make it easier.
---++ Relevant standards
---+++ Policies
*(not done: need to finish the TDWG section below)*
*Some evolution-related journals.* In early 2010, the editorial boards of 8 journals (including Evolution, Molecular Biology and Evolution, American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Heredity, and Evolutionary Applications) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples would be Systematic Biology, Molecular Phylogenetics and Evolution, and so on). The policy (to be developed at a later date) would require "that data supporting the results in the paper should be archived in an appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". The policy does not make clear whether phylogenetic trees would be considered "data supporting the results in the paper" (which is oddly phrased-- shouldn't it refer to data supporting the _conclusions_ of the paper?). See Appendix 1 for details.
*NSF.* In the US, NSF is the major funder of evolutionary science. As described in Appendix 1, NSF guidelines call for proposals to include a "Data Management Plan" to describe how the proposal will conform to NSF policy on the dissemination and sharing of research results, including what types of data will be produced, "the standards to be used for data and metadata format and content", and plans "for preservation of access" to the data. The policy does not specify any particular standards, but merely calls on researchers to address this issue.
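*TDWG.* TDWG (Biodiversity Information Standards, formerly the Taxonomic Databases Working Group) is a standards organization. TDWG develops standards but, unlike NSF and publishers, has no carrots or sticks with which to encourage their adoption. Darwin Core and LSIDs are TDWG-approved standards.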
---+++ Formats
*(not done)* This should focus on what the formats can represent in terms of useful metadata. Mostly it's just a summary of what is in Appendix 2.
* Newick- only allows labels, no metadata
* NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation)
* NHX ([[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]])
* [[http://www.phyloxml.org phyloXML]]- find out how many of the attributes below can be represented
* [[http://www.nexml.org NeXML]] - write a description of how to represent LSID, GenBank accn, geo coordinates
---+++ Ontologies and data standards
*(not done)* Based on Appendix 1, this will summarize relevant bits of
* Dublin Core and PRISM, for representing citation data
* Darwin Core and its relevance to key types of metadata
* MIAPA (non-existent minimal reporting standard)
* CDAO (comparative data analysis ontology).
---++ Current practices
*(not done)* This is going to be a short section, because current practices are rudimentary.
* Archiving at journal web sites -- most trees archived this way?
* [[http://www.treebase.org TreeBase]], has a submission process (see Appendix 3)
* [[http://datadryad.org Dryad]], has been up for a year (does it have submissions with trees?)
---++ Gaps and recommendations
*(not done)* Because this is an initial draft of a report, this does not need to be very polished.
1. lack of community standard for accessions and species links
1. lack of resolvable LSIDs; lack of a validator to see if species refs are resolved
1. lack of cyberinfrastructure for format validation, format translation and business logic
1. uncertain relevance of "data" policies to trees
1. lack of MIAPA or analogous reporting standard
1. lack of formal language support (e.g., tree inferred_from_data matrix)
1. lack of educational resources, instructional guides
#AddComments
---++ Please add your comments
%COMMENT{type="above"}%
---+ Appendices
---++ Appendix 1. Relevant standards
---+++ Data sharing and archiving policies
*(not done)*
See the [[http://en.wikipedia.org/wiki/Data_sharing Data Sharing]] article on Wikipedia for references to data sharing policies in the US. Authors of scientific studies often are required (as a condition of funding or of publication) to make such results available to the research community without restriction.
---++++ Evolution Journals
*(done)*
The [[http://www.datadryad.org Dryad]] web site describes the Joint Data Archiving Policy as follows:
<blockquote>
&lt; &lt; Journal &gt; &gt; requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as &lt; &lt; list of approved archives here &gt; &gt;. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
</blockquote>
And lists the following partner journals (for links, go to the [[http://www.datadryad.org Dryad]] web site):
* Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data Archiving. American Naturalist. 175(2):145-146, doi:10.1086/650340
* Rieseberg, L., T. Vines, and N. Kane. 2010. Editorial and retrospective 2010. Molecular Ecology. 19(1):1-22, doi:10.1111/j.1365-294X.2009.04450.x
* Rausher, M. D., M. A. McPeek, A. J. Moore, L. Rieseberg, and M. C. Whitlock. 2010. Data Archiving. Evolution. doi:10.1111/j.1558-5646.2009.00940.x
* Moore, A. J., M. A. McPeek, M. D. Rausher, L. Rieseberg, and M. C. Whitlock. 2010. The need for archiving data in evolutionary biology. Journal of Evolutionary Biology 2010. doi:10.1111/j.1420-9101.2010.01937.x
* Uyenoyama, M. K. 2010. MBE editor's report. Molecular Biology and Evolution. 27(3):742-743. doi:10.1093/molbev/msp229
* Butlin, R. 2010. Data archiving. Heredity advance online publication. 28 April doi:10.1038/hdy.2010.43
* Tseng, M. and L. Bernatchez. 2010. Editorial: 2009 in review. Evolutionary Applications. 3(2):93-95, doi:10.1111/j.1752-4571.2010.00122.x
---++++ NSF
<blockquote>
Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled &#8220;Data Management Plan&#8221;. This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. See Grant Proposal Guide (GPG) Chapter II.C.2.j for full policy implementation.
</blockquote>
The policy may be found in the Award and Administration Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 section VI.D.4.b]]:
<blockquote>
b. Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.
</blockquote>
The Grant Proposal Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j Section II.C.2.j]], reads partially as follows:
<blockquote>
Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled &#8220;Data Management Plan&#8221;. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include:
<ul>
<li>the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
<li>the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
<li>policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
<li>policies and provisions for re-use, re-distribution, and the production of derivatives; and plans for archiving data, samples, and other research products, and for preservation of access to them.
</ul>
Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit, relevant to a proposal are available at: http://www.nsf.gov/bfa/dias/policy/dmp.jsp. If guidance specific to the program is not available, then the requirements established in this section apply.
</blockquote>
---+++ Dublin core
*(not done)*
There isn't a standard for encoding Dublin Core publication data in XML; in particular, there isn't an enclosing element (in NeXML it would be "meta"). DC isn't very well suited to journal articles, anyway. The best attempt I've seen (http://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/) goes like this:
<verbatim>
<mikesMadeUpNamespace:article xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:mikesMadeUpNamespace="whatever">
<dc:creator>Michael P. Taylor</dc:creator>
<dc:creator>Darren Naish</dc:creator>
<dcterms:issued>2007</dcterms:issued>
<dc:title>An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England.</dc:title>
<dcterms:isPartOf>urn:ISSN:0081-0239</dcterms:isPartOf>
<dc:publisher>Blackwell</dc:publisher>
<dc:type xxx="http://purl.org/dc/terms/DCMIType">Text</dc:type>
<dcterms:bibliographicCitation>Palaeontology 50(6), 1547-1564. (2007)</dcterms:bibliographicCitation>
<dc:identifier>info:doi:10.1111/j.1475-4983.2007.00728.x</dc:identifier>
</mikesMadeUpNamespace:article>
</verbatim>
---+++ MIAPA
*(done)*
Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics).
As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps.
---+++ TDWG and Darwin Core
From the [[http://rs.tdwg.org/dwc/terms/guides/xml/index.htm Darwin Core XML Guide]] (specify namespace with xmlns:dwc="http://rs.tdwg.org/dwc/terms/"):
<verbatim>
<dwc:Taxon>
<dwc:scientificName>Anthus hellmayri</dwc:scientificName>
<dwc:class>Aves</dwc:class>
<dwc:genus>Anthus</dwc:genus>
<dwc:specificEpithet>hellmayri</dwc:specificEpithet>
<dwc:occurrenceID>urn:catalog:AUDCLO:EBIRD:OBS64515331</dwc:occurrenceID>
</dwc:Taxon>
</verbatim>
---++ Appendix 2: toy data examples
---++ Appendix 3: archives
---++ Appendix 4: tools and tips
---++ Appendix 5: Survey and user feedback.
-- Main.ArlinStoltzfus - 28 Oct 2010
@
1.7
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="ArlinStoltzfus" date="1289325045" format="1.1" reprev="1.7" version="1.7"}%
d6 1
a6 1
*NOT DONE*
d8 1
a8 1
Recent announcements from major funding agencies and from journals in evolution and systematics (see Appendix 1) will require online publication of trees in reusable formats. Increasingly, phylogenies are seen not just as an endpoint &#8211; presenting the inferred relationships between a set of organisms or taxa &#8211; but as the starting point for a range of further research. This research will almost always involve linking with other data.
d19 5
a23 7
To ensure that the approach(es) proposed here are relevant to the community of users, we are seeking your views.
Please fill in the [[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ][LinkingTrees initial user survey (SOON, STILL DRAFTING NOW...)]]
We would also welcome any further comments, which can be:
* added [[#AddComments][below]] or
* emailed to an.address [at] geebung.id.au [Add real address here]
d28 1
d36 1
a36 1
d40 2
a41 2
It appears that, at present, the trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a picture of the tree, rather than the tree as an informational entity. For trees to be re-usable, they must be accessible in a standard format that makes the structure of the tree explicit. There are a variety of data formats that do this (see Appendix). Some of these formats can be validated, i.e., they are defined by a schema to which every valid instance conforms.
d44 1
a44 1
d49 1
a49 1
This tree cannot be re-used for any purpose. Even if our goal is to explore models of speciation, and we don't care which species are implicated, but wish only to measure whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree because we can't tell whether it's a species tree (relevant to speciation), or some other kind of tree (irrelevant). We might be able to determine this by reading the original publication to find out what the labels ("otu1", etc) refer to, but no citation information is included with the tree. To interpret this tree, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable thing.
d52 2
d55 1
d60 2
a61 2
GUIDs and ontologies.
d64 1
d78 2
d84 1
a84 1
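*TDWG.* TDWG (Biodiversity Information Standards, formerly the Taxonomic Databases Working Group) is a standards organization. TDWG develops standards but, unlike NSF and publishers, has no carrots or sticks with which to encourage their adoption. Darwin Core and LSIDs are TDWG-approved standards.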
d87 1
a87 1
This should focus on what the formats can represent in terms of useful metadata. Mostly it's just a summary of what is in the appendix.
d94 5
a98 1
This will summarize relevant bits of Dublin core, Darwin core, MIAPA (non-existent minimal reporting standard) and CDAO (comparative data analysis ontology).
d101 1
a101 1
This is going to be a short section, because current practices are rudimentary.
d104 1
a104 1
* [[http://www.treebase.org TreeBase]], has a submission process
d108 1
d126 2
a127 1
See the [[http://en.wikipedia.org/wiki/Data_sharing Data Sharing]] article on Wikipedia for references to data sharing policies in the US. Authors of scientific studies often are required (as a condition of funding or of publication) to make such results available to the research community without restriction.
d130 1
a130 1
d165 1
d182 1
d185 1
a185 1
As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps.
d200 5
d206 1
@
1.6
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="ArlinStoltzfus" date="1289315193" format="1.1" version="1.6"}%
d6 1
a6 4
This report seeks to identify best practices for publishing phylogenetic trees electronically. It aims to make trees as useful and reusable as possible, by making it easier to automatically link trees to other data.
Currently phylogenetic trees are published in a wide variety of formats, but most frequently the only publication is as an image in a journal paper. Trees are increasingly published in machine-readable formats, either as a supplementary data file with a journal paper, or in online repositories such as TreeBase. Even where trees are published in one of the standard data formats, however, the taxa or specimens are often not identified in a way that supports easy machine recognition and linking to other information.
d10 1
a10 16
Typical examples where linking would be valuable include:
* linking a node on a tree to geographic information about the taxon
* linking a node on a tree to ecological or morphological information about the taxon
* linking a node on a tree to a museum specimen or GenBank accession
* searching for trees which include a given taxon or specimen
The integrating variables with the most potential in the short term are:
1. taxon name
* well formatted binomial or trinomial taxon name
* LSID which resolves to a taxon concept
1. specimen identifier
* collection accession number
* LSID which resolves to a specimen record in a collection
1. sequence identifier
* GenBank accession
1. geographic coordinates
a26 1
d29 4
a32 4
---+++ Rationale for scientific data archiving
---+++ Relevance to phylogenetic trees
---+++ Features expected to make archived trees re-useable
The TDWG TAG strategy. GUIDs, XML and ontologies.
d34 1
a34 38
* the form of the data: Standard formats. Validatable.
* the types of attached metadata
* publication
* species
* specimen, accession
* geographic coordinates
* methods
* how the metadata are rendered
* GUIDs
* ontologies
---+++ the right kinds of metadata (tax, pub, geo,
---++ Relevant standards
---+++ Policies
---+++ GUIDs
---+++ Formats
* Newick- only allows labels, no metadata
* NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation)
* NHX ([[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]])
* [[http://www.phyloxml.org phyloXML]]- find out how many of the attributes below can be represented
* [[http://www.nexml.org NeXML]] - write a description of how to represent LSID, GenBank accn, geo coordinates
---+++ Ontologies and data standards
---++ Current practices
---+++ Archiving:
* Archiving at journal web sites
* [[http://www.treebase.org TreeBase]], has a submission process
* [[http://datadryad.org Dryad]] still need to investigate
---+++ Representing metadata
---++ Gaps and recommendations
1. lack of community standard for accessions and species links
1. lack of resolvable lsids; lack of a validator to see if species refs are resolved
1. lack of a validation service for most formats
1. lack of cyberinfrastructure for format translation and business logic
1. uncertain relevance of "data" policies to trees
1. lack of reporting standard
1. lack of language support (e.g., tree inferred_from_data matrix)
1. educational resources, instructional guides
d36 1
a36 4
---+ scrap from previous version
---++ Rationale and objectives
---+++ The rationale for archiving data for future re-use
Science is both ''progressive'' and ''self-policing''. The progressive nature of science means that new studies build on previous results, and scientists avoid duplication. The self-policing aspect of science depends on the potential that new studies built on a faulty foundation will fail, and more acutely, that attempts to repeat (replicate) a faulty study will fail, casting doubt on it. Thus, re-use of accessible data is crucial both to the progressive aspect of science and to its self-policing aspect.
d38 1
a38 1
In the past, these key conditions of science did not drive scientific communities to erect special data archives and to develop information standards. Conventional scientific publications were sufficient to satisfy demands for accessible and re-usable data. But this is no longer the case. Technological advances have made it easy to produce massive amounts of data, and to aggregate and synthesize massive amounts of data, creating a demand for large-scale data re-use.
d40 1
a40 1
Increasingly, professional associations, publishers, and funding agencies have recognized that ''availability of the data underlying published scientific findings is essential to a healthy scientific process'' (see the [[http://en.wikipedia.org/wiki/Data_sharing Data Sharing]] article on Wikipedia for references to data sharing policies in the US). Authors of scientific studies often are required (as a condition of funding or of publication) to make such results available to the research community without restriction.
d42 1
a42 2
---+++ The rationale for assessing practices for e-publishing trees
Hundreds of thousands, perhaps millions, of trees are generated each year in association with published research. Of these, a tiny fraction are published each year in association with journal articles.
d44 1
a44 1
The vast majority of these "published" trees appear as graphical images, which may be accessible as electronic image files, while a tiny fraction are archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format).
d46 1
a46 1
To a computer, the images and image files are informational dead-ends. The "Newick" tree strings expose the topology and branch lengths of the tree, but such trees typically are not adequate for data integration, re-use, and re-purposing. To understand why, consider the following example:
d48 1
a48 1
((my_arbitrary_name1:0.34, idiosyncratic_name:0.19):0.11, my_other_name:0.44);
d50 8
a57 1
What this tree means depends entirely on what the labels refer to, but the labels are arbitrary. To interpret this tree, to validate it, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable entity. Ideally it would refer to something with a globally unique ID (a "GUID"). In general, if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning.
d59 3
d64 1
a64 1
As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. This effort is timely for several reasons:
d66 1
a66 1
* In early 2010, some key journals in evolution and systematics announced plans to implement a data-archiving policy: to publish in these journals, researchers will need to start archiving their trees;
d68 1
a68 1
* NSF has recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals. Thus, scientists will be motivated by funding agencies to share data electronically;
d71 7
a77 1
Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies are emerging to make it easier.
d79 1
a79 6
Finally, although there are many indirect, community-wide benefits of sharing data, there are more direct and immediate benefits to the producer including:
* increased citation - if researchers can readily apply a tree to additional research questions, it is likely to generate additional citations for the tree's authors.
* easier to link one's own data - the same enhancements that prepare trees for reuse will also help researchers to link their own trees to related data for analysis.
and more general benefits:
* enables web applications, such as for phylogeographic visualisation, which link trees to spatial and other data.
* facilitates use of trees in big-data questions which become tractable when data linking can be automated.
d81 25
d112 75
d188 1
a188 1
-- Main.ArlinStoltzfus - 28 Oct 2010@
1.5
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="ArlinStoltzfus" date="1289249462" format="1.1" version="1.5"}%
d3 1
a3 1
---+ Best Practices for Publishing Trees Electronically: Initial Report and Request for Comments
d46 6
a51 2
---+ Best Practices for Publishing Trees Electronically
---++ Outline
d53 5
a57 7
1. Rationale and objectives of archiving for re-use
1. Key information to make the tree re-useable
* Accessions - may lead to species identifier (potential for conflicts)
* Taxonomic Links
1. a well formatted taxon name or taxon concept
1. binomial or separate field for each taxonomic level
1. LSID which resolves on a taxon names service
d59 34
a92 18
* reference to journal or other publication (important to note this explicitly, though it's covered by Dryad and TreeBASE)
1. Archives to upload to:
* [[http://www.treebase.org TreeBase]], has a submission process
* [[http://datadryad.org Dryad]] still need to investigate
* other places to expose a [[http://www.nexml.org NeXML]], [[http://www.phyloxml.org phyloXML]], or CDAO file (generic LOD store)
1. Formats
* Newick- only allows labels, no metadata
* NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation)
* NHX ([[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]])
* [[http://www.phyloxml.org phyloXML]]- find out how much of attributes below can be represented
* [[http://www.nexml.org NeXML]] - write a description of how to represent LSID, GenBank accn, geo coordinates
1. Gaps and recommendations
1. Semantic link between tree and characters (typically inferred_from) is not explicit
* [[http://www.treebase.org TreeBase]] supports this via method link from matrix to tree
1. lack of community standard for accessions and species links
1. lack of resolvable LSIDs; lack of a validator to check whether species references resolve
1. lack of a validation service for most formats
a129 22
---++ Key information to make the tree re-useable
The TDWG TAG strategy. GUIDs, XML and ontologies.
* the form of the data: Standard formats. Validatable.
* the types of attached metadata
* publication
* species
* specimen, accession
* geographic coordinates
* how the metadata are rendered
* GUIDs
* ontologies
---++ Archives to upload to
---++ Available technology
---++ Formats
---++ Gaps and recommendations
@
1.4
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="DanRosauer" date="1288750373" format="1.1" version="1.4"}%
d75 12
a86 2
---++ Rationale and objectives of archiving for re-use
Hundreds of thousands, perhaps millions, of trees are generated each year in association with published research. Of these, a tiny fraction are published each year in association with journal articles. The vast majority of these "published" trees appear as graphical images, which may be accessible as electronic image files, while a tiny fraction are archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format).
d92 1
a92 1
What this tree means depends entirely on what the labels refer to, but the labels are arbitrary. To interpret this tree, to validate it, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable entity. Ideally it would refer to something with a GUID. In general, if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning &#8212; and indeed, most trees that are archived lack clearly identifiable information allowing them to be linked with other data, except under the guidance of an expert communicating with the authors of the paper.
a111 1
d114 1
a114 1
Standard formats. Validatable.
d116 9
a124 3
Type of metadata
* publication information (dublin core)
*
@
1.3
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="DanRosauer" date="1288729116" format="1.1" reprev="1.3" version="1.3"}%
d75 1
a75 1
---+++ Rationale and objectives of archiving for re-use
d126 1
a126 1
-- Main.ArlinStoltzfus - 28 Oct 2010
@
1.2
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="DanRosauer" date="1288714960" format="1.1" reprev="1.2" version="1.2"}%
d43 2
a44 1
* emailed to an.address@@geebung.id.au Add real address here
d56 3
a58 3
* geo coordinates
* publication info (important to note this explicitly, though it's covered by Dryad and TreeBASE)
1. Archives to upload to:
d68 1
a68 1
1. gaps and recommendations
d75 2
a76 2
---+++ Rationale
Hundreds of thousands, perhaps millions, of trees are generated each year in association with published research. Of these, a tiny fraction are published each year in association with journal articles. The vast majority of these "published" trees appear as graphical images, which may be accessible as electronic image files, while a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format).
d82 1
a82 1
What this tree means depends entirely on what the labels refer to, but the labels are arbitrary. To interpret this tree, to validate it, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable entity. Ideally it would refer to something with a GUID. In general, if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning&#8212; and indeed, most trees that are archived lack clearly identifiable information allowing them to be linked with other data, except under the guidance of an expert communicating with the authors of the paper.
d103 1
a103 1
---++ Key factors promoting re-use [drop this section? dr]
d109 3
a111 1
*
d115 1
a115 5
---+++ Archives
---+++ Formats
---+++ Tools
@
1.1
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="ArlinStoltzfus" date="1288274137" format="1.1" version="1.1"}%
d16 1
a16 1
* linking a node on a tree to a museum specimen or GenBank accession
a18 2
We want to facilitate the creation and storage of trees built for data integration so that they can be more easily and automatically linked to other data.
d30 3
a32 1
Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data will progress rapidly when it is a) easy for users to put this information into their trees via appropriate software and b) considered standard good practice, and beneficial to the creators of the trees, to include linkable information.
d37 3
d41 3
a43 2
* survey link
* other modes of feedback
d121 5
d127 1
a127 1
-- Main.ArlinStoltzfus - 28 Oct 2010@