%META:TOPICINFO{author="HilmarLapp" date="1358873661" format="1.1" version="1.80"}%
%META:TOPICPARENT{name="WorkingMeeting2010"}%
A paper based on this technical report has now been published: Stoltzfus, Arlin, Brian O’Meara, Jamie Whitacre, Ross Mounce, Emily L Gillespie, Sudhir Kumar, Dan F Rosauer, and Rutger A Vos. 2012. “Sharing and Re-use of Phylogenetic Trees (and Associated Data) to Facilitate Synthesis.” BMC Research Notes 5 (1) (October 22): 574. doi:[[http://dx.doi.org/10.1186/1756-0500-5-574 10.1186/1756-0500-5-574]].
---+ DRAFT: Current Best Practices for Publishing Trees Electronically
*Authors*
* Arlin Stoltzfus, Biochemical Science Division, NIST, 100 Bureau Drive, Gaithersburg, MD, 20899
* Jamie Whitacre, Smithsonian Institution
* Dan Rosauer, Yale University
* Torsten Eriksson, Royal Swedish Academy of Sciences
%TOC%
---+ Summary
An assessment of best practices for publishing phylogenetic trees is timely given recent decisions by many journals to require the archiving of trees. However, even without that justification, several longer-term trends favor an increased emphasis on richly annotated, re-usable trees that can be linked to other data: the opportunities for phylogeny re-use are greater; new opportunities for aggregation and integration exist; and both specific and general technologies that make sharing and re-using trees easier have emerged. This report summarizes an as-yet-incomplete project to assess best practices and, in turn, to suggest solutions for meeting recommendations provided by the scientific community (? jsw) and for filling gaps in the current landscape.
The motivation for the report is that it will encourage the use, and further the development, of data management practices that will benefit scientists individually and collectively. Archiving of results post-publication seems to benefit the scientific community, and to benefit the individual archiving scientist in the form of increased recognition. A less speculative motivation is that developing capacity to manage richly annotated yet interoperable data _benefits scientists_ (individually and collectively) and _science_ by making it easier to carry out integrative, automated, or large-scale projects.
This report represents the first step in a larger analysis. Following release of this initial report, we will disseminate a survey broadly, carry out further analysis, and write a more comprehensive report. For the present purposes, we have conducted an initial general assessment of
* data archiving policies and reporting standards adopted by journals and funding agencies
* two electronic archives (TreeBASE and Dryad) suitable for storing phylogenies
* file formats commonly used for representing phylogenies (Newick, NEXUS, NHX, phyloXML and NeXML)
* available support for Life Science Identifiers (LSIDs) and other globally unique identifiers (GUIDs)
In other areas, we offer comments and call for more extensive analysis:
* language support for representing data and metadata
* current practices in the research community
* software tools to support archiving and re-use
In addition, we have studied the submission process of TreeBASE, and have evaluated the capacity of various file formats to represent specific kinds of metadata (annotations) deemed likely to increase the capacity for research results to be discovered, interpreted, linked (to other data) and re-used, including:
* publication data (authorship, citation)
* species names and other taxonomic identifiers
* methods used to infer a tree
* geographic coordinates
Our tentative analysis suggests the following:
* The infrastructure to support archiving of thousands of new phylogenetic trees is available
* The needs of archiving are not the same as those of publishing linkable, re-usable data
* No formalized reporting standard for a phylogenetic analysis currently exists
* The extent to which _data_ archiving policies require archiving of phylogenetic trees is unclear
* The potential for archiving richly annotated trees is limited by technology and standards
* The gap between needs and capacities is much greater for publishing re-usable trees than for simple archiving
Archiving phylogenetic trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad). However, the archival value of many trees will be limited without a shift in emphasis toward re-usability, along with technology and standards to support such a shift. While making trees archival is an important step forward for the phylogenetic community, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for most researchers to satisfy. Before interoperability of richly annotated trees can be achieved, the research community must commit to the use of globally unique identifiers (GUIDs) for informational and material entities, and develop the syntax and semantics to represent the metadata upon which the value of the data depend. The community may be ready to respond to renewed calls for a Minimal Information for a Phylogenetic Analysis (MIAPA) standard.
---+ Request for Comments
To ensure that the descriptions and recommendations here are accurate and relevant to the community of users, we are seeking feedback in several ways. As described in Appendix 4, we intend to target scientists with a survey to assess current practices and needs.
We also solicit feedback on this preliminary report (see [[#AddComments][below]]). We invite interested scientists to make comments and to join the effort required to complete this report.
---+ Draft Report: Current Best Practices for Publishing Trees Electronically
---++ Scope and rationale of this report
A major National Research Council (NRC) report on "A New Biology for the 21st Century" (2009; http://www.nap.edu/catalog.php?record_id=12764) suggests enormous potential for biological discovery based on aggregating and integrating data from diverse sources and from multiple disciplines. More specifically, recent commentaries (e.g., Sidlauskas, et al, 2010; Patterson, et al., 2010), suggest the possibility that archiving and re-use of phylogenetic trees and biodiversity data will soon take off at a pace not seen before. At the same time that funding agencies, publishers, and scientific culture are shifting in ways that incentivize sharing of data-- including phylogenetic trees-- new technologies and standards are emerging with the potential to make phylogenetic methods and results more interoperable (Sidlauskas, et al, 2010; Prosdocimi, et al., 2009). This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects.
What standards and technologies _will_ allow scientists to take advantage of phylogenies for "A New Biology"? The answer to this question is not certain. It is not clear whether emerging interoperability technologies will fulfill their promise. Nor is it clear whether, in the phylogenetics community, these technologies will be developed in an orchestrated manner, through stakeholder organizations (analogous to TDWG for biodiversity studies), or in a more anarchic or competitive way.
However, regardless of strategies for responding to current challenges, the first step is to understand those challenges. For this reason, at the TDWG 2010 meeting in Woods Hole, the TDWG phylogenetic standards interest group set in motion a project to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging the use and further development of practices that make trees interoperable. We aim to identify strengths and weaknesses in the current infrastructure, practices, and policies; to educate tree producers about the needs of tree users; and to educate users about the needs of tree producers. The scope of this project extends, in principle, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc.) where the archiving and re-use of trees is of interest to scientists.
As a step toward this goal, we have undertaken a preliminary assessment of current best practices for publishing phylogenetic trees. Specifically, in regard to the electronic archiving and re-use of trees, we have done a preliminary review of
* relevant institutional policies and reporting standards
* current practices
* data formats
* software tools
* ontologies and other forms of language support
This document reports the results of our preliminary assessment. To get broader feedback, recruit interested scientists, and gather information to finish the report, we have partnered with participants of the MIAPA-discuss (miapa-discuss@googlegroups.com) email list. This group has developed a survey that will be sent to thousands of scientists. After analyzing this feedback, we intend to expand this preliminary report into a manuscript for publication. We invite those willing to make a commitment of work to join in this project.
---++ Background: data archiving and re-use, how and why?
Re-use of published results is crucial to the progressive and self-correcting nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. By publishing, authors were obliged to share data (and materials), but publishers had little power to enforce such obligations. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding data, rather than sharing data, an attitude that remains common in some fields (ref: Piwowar).
The circumstances of publishing and data re-use have changed radically in the past few decades. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form. For instance, the record of a 3-dimensional protein structure contains roughly 10^4 coordinates (each a floating-point number) per domain. To facilitate archiving and re-use of such data, crystallographers collaborated in 1971 to launch the Protein Data Bank ([[http://www.pdb.org][PDB]]), which is still the world's premier archive for 3D protein structures. In 1982, just 5 years after the development of DNA sequencing methods, GenBank was launched to archive the DNA sequences that were crowding the pages of journals. In both cases, editorial boards of relevant scientific journals quickly decided to require simultaneous archiving (of 3D structures in PDB, of DNA sequences in GenBank), so that data would be accessible to all scientists upon publication.
Meanwhile, principled reasons to promote data sharing have exerted an increasing influence over institutions and institutional policies. Professional associations, publishers, and funding agencies recognize that availability of the data underlying published scientific findings is essential to a healthy scientific process (see Appendix 1). Funding agencies such as NSF and NIH increasingly recognize that work done on behalf of the public, especially if it is funded by taxpayers, should be accessible to the public without restriction.
Viewed as a large-scale dynamic, data sharing is a movement of information from producers to consumers, facilitated by informatics tools, and guided in various ways by institutional policies. Some of the policies noted above represent incentives or pressures on individual researchers to "push" data out into the world. However, there is simultaneously an increasing "pull" from the promise of large-scale studies that aggregate and re-purpose data. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists and stored in these archives. For this reason, no scientist would doubt the utility of these archives for scientific research.
How do these and other factors apply to the publishing of phylogenetic trees? Phylogenetic trees play two central roles in modern biology: organizing knowledge by lines of descent, and extending knowledge through comparative analysis. In either role, trees become useful only to the extent that the tree and its parts are attached to, or can be linked to, data and metadata.
Until recently, for most authors, publishing a phylogenetic tree meant publishing a picture of a tree in a journal article (figure)-- an informational dead-end. Thus, in the economy of phylogenetic data sharing, there are thousands of phylogeny producers (Kumar and Dudley, 2006), but there have been few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's.
However, conditions are changing in ways that favor archiving and re-use of trees:
* In a coordinated effort announced early in 2010, various journals in evolution and systematics have implemented data-archiving policies;
* In 2010, TreeBASE (Piel, et al., 2002) completed a substantial upgrade of features, including its submission process;
* In 2009, a new data archive, Dryad, began accepting data from ecological and evolutionary studies, including phylogenetic trees;
* NSF recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1);
* In recent years, the National Evolutionary Synthesis Center (NESCent) has invested in "phyloinformatics" efforts to enable interoperability, resulting in projects to develop an XML file format (NeXML), an ontology (CDAO), and a web-services standard (PhyloWS).
What about phylogeny consumers? The most significant challenge in enabling an information-sharing dynamic in phylogenetics may be to recognize and understand how and why phylogeny consumers would re-use a phylogeny product. Phylogenetic trees play a vital role in research. Anyone with experience in phylogenetic analysis quickly learns that our biologist colleagues want trees for various research purposes, and frequently ask for help in getting them. Thus, there is an enormous demand for phylogenetic knowledge. Yet, the product of an individual phylogeny producer, it seems, is unlikely to be re-used or re-purposed. The reason for this seems to be that the typical need is for a very narrowly defined phylogenetic product, with a specific set of OTUs and characters, including up-to-date information.
Although still limited in number, cases in which phylogenetic results are aggregated or re-used are playing an increasingly prominent role in scientific research. Examples of large-scale projects that rely on meta-analysis or integration include assembling a tree of life representing all known species, and identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework (this section needs specific references and examples along the lines of Burleigh, et al., 2010, and Sidlauskas, et al., 2010).
This project to assess strengths and weaknesses in current practices has the ultimate goal of enabling effective data-sharing that links phylogeny producers and phylogeny consumers. Logically, then, we must begin with some notion of what makes a tree re-usable. In considering what makes it likely for a tree to be re-used in study replication, meta-analysis, aggregation, or integration, we draw guidance from the 2008 roadmap of the TDWG Technical Architecture Group (TAG), and two recent commentaries, by Sidlauskas, et al (2010) on synthetic approaches to evolutionary analysis, and by Patterson, et al (2010) on the importance of names. Together, these resources suggest the importance of
* having unique names for biological things of interest (globally unique identifiers or GUIDs)
* exchanging information using validatable formats (e.g., XML)
* using a controlled set of terms and predicates, ideally defined in an ontology
* providing context with rich annotations ("metadata")
for the re-usability of phylogenetic results. Below, we briefly explain these features.
---++++ 1. Standard, Validatable Formats
Currently, most trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a flat picture of a phylogenetic tree (figure, above), rather than a machine-readable symbolic encoding of relationships. For trees to be re-usable, they must be accessible in a standard, computer-readable format that makes the structure of the tree explicit. The tree image above corresponds to the following Newick string:
((otu1:0.34, otu2:0.19):0.11, otu3:0.44);
Newick is the simplest of several data formats used to represent trees (see Appendices 1 and 3). Other formats, such as phyloXML and NeXML, are defined by schemata. Available tools allow any instance of a phyloXML file to be checked against the schema, to determine whether the file is properly formed. A file with mistakes in syntax (misspelled terms, missing punctuation, etc.) will be found invalid; but if a file is valid, then any software that fully supports the standard should be able to read it. Automated validation removes uncertainty, especially in regard to the causes of errors.
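As a minimal sketch of what automated validation looks like in practice (not a prescribed workflow, and assuming a locally saved copy of the phyloXML schema; the file names "phyloxml.xsd" and "example_tree.xml" are placeholders), a candidate file can be checked with a generic XML toolkit such as Python's lxml:
<verbatim>
from lxml import etree

# Placeholder file names: a locally saved copy of the phyloXML schema
# and a candidate file to be checked against it.
schema = etree.XMLSchema(etree.parse("phyloxml.xsd"))
candidate = etree.parse("example_tree.xml")

if schema.validate(candidate):
    print("example_tree.xml is a valid phyloXML instance")
else:
    # Misspelled elements, missing punctuation, and similar syntax errors
    # are reported here, which helps to pinpoint the cause of a failure.
    print(schema.error_log)
</verbatim>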
---++++ 2. Globally Unique Identifiers (GUIDs)
There are many uses for identifiers. To integrate data from diverse sources, we need some kind of integrating variable, such as a species name or a specimen number. For instance, to aggregate data on species occurrence, we need to know whether a report of a bird in location X and a second report of a bird in location Y refer to the same species of bird. If we wish to integrate data from all over the world using names for things, then the names should be unique across the whole world. By contrast, the tree example above invokes entities "otu1", "otu2" and "otu3". If we integrated data from all over the world using arbitrary local names like "otu1" and "otu2", we would make mistakes by aggregating information that does not belong together.
Using globally unique identifiers (GUIDs) ensures that when we refer to a thing, regardless of context, we know what it is. For instance, perhaps otu1 refers to the coyote. In that case, we can provide a Life Science Identifier or LSID (urn:lsid:ubio.org:namebank:2478093), which is a kind of GUID, and this LSID will make it possible for a researcher to associate the entity with information on _Canis latrans_ (coyote) available in resources such as the Encyclopedia of Life. If otu1 is a gene sequence, then an http URI for its NCBI accession can serve as a GUID, and this will make it possible for any subsequent researcher to associate "otu1" with the underlying sequence data. The system of Digital Object Identifiers (DOIs) allows publishers to assign GUIDs to publications. Thus, not just species or samples, but also information artefacts, should have unique identifiers. For instance, in TreeBASE, each tree, data matrix, and study receives an ID that is unique and stable, allowing TreeBASE to offer persistent GUIDs via http URIs.
---++++ 3. Rich Annotations ("metadata")
While the Newick tree above is in a standard format and could be archived as such, it remains an informational dead-end, because we do not know what it refers to or how it was derived. Even if our goal is only to explore models of speciation by measuring whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree, because we can't tell whether it is a species tree (relevant to speciation) or some other kind of tree (irrelevant).
So then, what kinds of annotations increase the re-usability of a tree? What are the integrating variables a tree consumer would use to integrate or aggregate trees? Imagine that we have a data-mining tool with access to all published trees, richly annotated, and that our challenge is to use this database to reveal prior work on a topic, to test a hypothesis, to discover new relationships, or to carry out a meta-analysis addressing a methodological issue. In this context, useful types of data or metadata would include:
* data (or GUIDs for data) from which the tree was inferred (e.g., if we wish to combine data into a supermatrix)
* authorship and citation data (e.g., if we wish to find all the studies by a particular author)
* taxonomic links and species identifiers for OTUs (e.g., to find all studies relevant to a taxonomic group)
* identifiers for a specimen or accession to which OTUs are linked (e.g., to find any studies with a particular gene)
* geographic coordinates (e.g., to integrate phylogeny data with other geographically linked data)
* a description of the method by which the tree was inferred (e.g., to enforce quality controls)
---++++ 4. Formal Language Support
Computable knowledge representation is largely a matter of relationships between entities that can be expressed as subject-predicate-object triples, e.g., "Bob has_friend Susan" or "Susan is_a female_person". By joining these two statements via the identity of Susan (which we could establish via a GUID for Susan), we can answer the question of whether Bob has any female friends, even though neither statement alone tells us this.
Without formal language support, it is unclear what such terms and predicates mean. For instance, if we are interested in the phylogenetic origin of primates and search the web for a _tree that has a monkey and a squirrel_, we will not find a phylogeny that includes a monkey and a squirrel as OTUs, but we will get other fascinating information, including information about 1) *squirrel monkeys*, which spend time in *trees*; 2) sounds made by *squirrels*, *monkeys* and other animals that live in *trees*; and 3) news of a scientific study showing that macaques (a kind of *monkey*) in *trees* get upset when flying *squirrels* sail over them. Typically, a domain expert (in knowledge representation, a field of application such as phylogenetics is called a *domain*) avoids such language problems by using a limited set of known data sources with a limited set of terms and predicates whose domain-specific meanings are understood by the expert. Thus, one way to support clear use of language is to have domain-specific vocabularies; a more robust form of support is to specify concepts in an ontology.
Representing the kinds of annotations that make phylogenies suitable for re-use-- citations, taxonomic links, provenance information, georeferences, methods descriptions-- requires language support for the relevant concepts. For instance, [[http://dublincore.org/documents/dces/][Dublin Core]] provides a metadata standard for documents, with terms for assigning authorship, title, and so on. The [[http://open-biomed.sourceforge.net/opmv/ns.html][Open Provenance Model Vocabulary Specification]] provides a term "wasDerivedFrom", such that, having derived "tree1" from "alignment1", we could annotate this relationship with the statement
tree1 [[http://purl.org/net/opmv/ns#wasDerivedFrom][http://purl.org/net/opmv/ns#wasDerivedFrom]] alignment1
An important aim of this project is to investigate what kinds of annotations can be supported by available vocabularies and, where possible, to make recommendations about which vocabularies are best for which annotations. Some types of annotations involve _domain-specific concepts_; e.g., if we wish to distinguish "unrooted tree" from "rooted tree" in a robust way, this must make reference to some externally defined concept.
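As a minimal sketch of how a statement like the "wasDerivedFrom" triple above can be expressed in machine-readable form (assuming the Python rdflib library; the http URIs for the tree and the alignment are hypothetical placeholders):
<verbatim>
from rdflib import Graph, Namespace, URIRef

# The Open Provenance Model Vocabulary namespace cited above.
OPMV = Namespace("http://purl.org/net/opmv/ns#")

# Hypothetical GUIDs (http URIs) for the two information artefacts.
tree1 = URIRef("http://example.org/phylo/tree1")
alignment1 = URIRef("http://example.org/phylo/alignment1")

g = Graph()
g.bind("opmv", OPMV)
# tree1 wasDerivedFrom alignment1
g.add((tree1, OPMV.wasDerivedFrom, alignment1))

print(g.serialize(format="turtle"))
</verbatim>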
---++ Analysis
---+++ Policies
*Evolution-related journals and their data policies.* In early 2010, the editorial boards of seven journals (_Evolution_, _Molecular Biology and Evolution_, _American Naturalist_, _Molecular Ecology_, _Journal of Evolutionary Biology_, _Heredity_, and _Evolutionary Applications_) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples include _Systematic Biology_, _Molecular Phylogenetics and Evolution_, and so on). The policies adopted by most of these journals as of January 2011 require data archiving in an "appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". However, some policies are more stringent than others. For example, _Evolution_ requires that "authors submit DNA sequence data to GenBank and phylogenetic data to TreeBase", and _American Naturalist_ stipulates that "authors. . . deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required." Other journals have a looser policy. _Molecular Ecology_ "expects that data supporting the results in the paper should be archived in an appropriate public archive such as GenBank, Gene Expression Omnibus, TreeBASE, Dryad, the Knowledge Network for Biocomplexity, your own institutional or funder repository, or as Supporting Information on the _Molecular Ecology_ web site." Furthermore, _Evolutionary Applications_ states that "only data underlying the main results in the paper need to be made available. In addition, sufficient information must be provided such that data can be readily suitable for re-analyses, meta-analyses, etc. . . . The preferred way to archive data is using public repositories. For types of data for which there is no public repository, authors can upload the relevant data as Supplementary Materials on the journal's website. Data submission to any of these repositories and the acceptance of the data by these repositories must occur *before* the manuscript goes to production." Appendix 1 provides detailed guidelines for submitting to TreeBASE and to Dryad.
*National Science Foundation (NSF).* In the US, NSF is the major funder of evolutionary science. As described in Appendix 1, NSF guidelines call for proposals to include a “Data Management Plan” to describe how the proposal will conform to NSF policy on the dissemination and sharing of research results, including what types of data will be produced, "the standards to be used for data and metadata format and content", and plans "for preservation of access" to the data. The policy does not specify any particular standards, but merely calls on researchers to address this issue.
---+++ Standards
*Life Science Identifiers (LSIDs)* represent a standard developed and approved by Biodiversity Information Standards (TDWG), an organization that promotes the wider and more effective dissemination of information about the world's heritage of biological organisms.
*MIAPA.* Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]).
The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting requirement yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[http://evoinfo.nescent.org/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps.
Open questions:
* Are there other standards that are applicable here?
---+++ Formats
Various formats are used for phylogenies. Here we review information on the following five formats:
* Newick - designed for trees only, with labels but no other data or metadata
* NEXUS (http://informatics.nescent.org/wiki/Supporting_NEXUS_Documentation) - a full-featured but dated format
* NHX ([[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]]) - an extension of Newick with limited uses
* [[http://www.phyloxml.org phyloXML]] - an economical and easy-to-use XML format tuned to molecular phylogenies
* [[http://www.nexml.org NeXML]] - a full-featured XML format that allows arbitrary annotations of diverse types of data
The Newick ("New Hampshire") format was developed informally in 1986 by a group of phylogenetic software developers (http://en.wikipedia.org/wiki/Newick_format). It was intended to represent trees only, not associated data or metadata.
NEXUS is a highly expressive data format that has been in use for nearly as long as Newick, and it is the preferred format for many phylogenetic inference programs such as PAUP* and MrBayes. The basic structure of a NEXUS file is a series of blocks, each containing commands. The most commonly used blocks are TAXA (a declared list of OTUs), CHARACTERS (a matrix of comparative data) and TREES (one or more phylogenetic trees for the OTUs). OTUs and characters can be referenced (from other blocks) by index numbers. Due to the lack of an ongoing development model, and ambiguities in the syntax, different interpretations of NEXUS have arisen within the phylogenetics community.
NHX (New Hampshire eXtended) format was developed by Christian Zmasek as an extension of Newick, to represent common annotations of nodes (e.g., duplication events) and to insert molecular sequences. However, the highly constrained syntax of NHX limits its usefulness.
In the past few years, four different XML formats have become available, though none is in widespread use. The main developer of NHX format, Christian Zmasek, went on to develop phyloXML (Han and Zmasek, 2009), a validatable format that represents a greater range of attributes than NHX.
PhyloXML has an economical schema tuned to the needs of molecular phylogeneticists. The BEAST package (Drummond, et al., 2007) has used an XML input format for several years, but it is not considered further here because it is not used to export trees (BEAST outputs trees in NEXUS format). Likewise, while it is possible to encode comparative data in terms of CDAO (the Comparative Data Analysis Ontology; Prosdocimi, et al., 2009) and serialize this as RDF-XML, this is not the recommended use of CDAO.
NeXML (http://www.nexml.org) is an XML format with a precisely defined schema, modeled after the structure of NEXUS. While the design of phyloXML takes a very direct approach to satisfying user needs, NeXML opts for greater generality at the expense of a much more complex schema. Its approach to metadata allows for arbitrary annotation of data objects using external vocabularies.
Features of Newick, NHX, NEXUS, phyloXML and NeXML are compared in the table below (a filled square indicates presence of a feature; an open circle indicates that there are significant limitations on that feature).
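As a rough sketch of how these formats are handled in practice (assuming Biopython's Bio.Phylo module, which reads and writes Newick, NEXUS, and phyloXML; the output file names are placeholders), the small example tree from the section on validatable formats can be converted between formats:
<verbatim>
from io import StringIO

from Bio import Phylo

# The small example tree from the section on validatable formats.
newick = "((otu1:0.34, otu2:0.19):0.11, otu3:0.44);"
tree = Phylo.read(StringIO(newick), "newick")

# Write the same tree in two of the other formats discussed here.
# NEXUS places the tree in a TREES block; phyloXML is schema-validatable.
Phylo.write(tree, "example_tree.nex", "nexus")
Phylo.write(tree, "example_tree.xml", "phyloxml")
</verbatim>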
---++++ Evolution-related journals
The joint data archiving policy adopted by the partner journals listed below is based on the following model text:
< < Journal > > requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as < < list of approved archives here > >. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
The following partner journals have announced the policy (for links, go to the [[http://www.datadryad.org Dryad]] web site):
* Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data archiving. American Naturalist 175(2):145-146. doi:10.1086/650340
* Rieseberg, L., T. Vines, and N. Kane. 2010. Editorial and retrospective 2010. Molecular Ecology 19(1):1-22. doi:10.1111/j.1365-294X.2009.04450.x
* Rausher, M. D., M. A. McPeek, A. J. Moore, L. Rieseberg, and M. C. Whitlock. 2010. Data archiving. Evolution. doi:10.1111/j.1558-5646.2009.00940.x
* Moore, A. J., M. A. McPeek, M. D. Rausher, L. Rieseberg, and M. C. Whitlock. 2010. The need for archiving data in evolutionary biology. Journal of Evolutionary Biology. doi:10.1111/j.1420-9101.2010.01937.x
* Uyenoyama, M. K. 2010. MBE editor's report. Molecular Biology and Evolution 27(3):742-743. doi:10.1093/molbev/msp229
* Butlin, R. 2010. Data archiving. Heredity, advance online publication, 28 April. doi:10.1038/hdy.2010.43
* Tseng, M. and L. Bernatchez. 2010. Editorial: 2009 in review. Evolutionary Applications 3(2):93-95. doi:10.1111/j.1752-4571.2010.00122.x
---++++ National Science Foundation (NSF)
Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. See Grant Proposal Guide (GPG) Chapter II.C.2.j for full policy implementation.
The policy may be found in the Award and Administration Guide (AAG), [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 section VI.D.4.b]]:
b. Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.
The Grant Proposal Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j Section II.C.2.j]], reads in part as follows:
Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include:
- the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
- the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
- policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
- policies and provisions for re-use, re-distribution, and the production of derivatives; and
- plans for archiving data, samples, and other research products, and for preservation of access to them.
Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit, relevant to a proposal are available at: http://www.nsf.gov/bfa/dias/policy/dmp.jsp. If guidance specific to the program is not available, then the requirements established in this section apply.
---+++ Dublin Core
*(not done)* There isn't a standard for encoding Dublin Core (DC) publication data in XML. In particular, there isn't an enclosing element (in NeXML it would be "meta"). DC isn't very well suited to journal articles, anyway. The best attempt I've seen (http://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/) goes like this:
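The example from the post linked above is not reproduced here. As a rough illustration only (not the approach referenced above, and assuming the Python rdflib library), publication metadata can instead be expressed with Dublin Core terms as subject-predicate-object triples; the DOI below is taken from the partner-journal list above:
<verbatim>
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

# A GUID for the publication: an http URI built from its DOI
# (doi:10.1086/650340, cited in the partner-journal list above).
article = URIRef("http://dx.doi.org/10.1086/650340")

g = Graph()
g.bind("dc", DC)
g.add((article, DC.title, Literal("Data Archiving")))
g.add((article, DC.creator, Literal("Whitlock, M. C.")))  # further dc:creator triples could list co-authors
g.add((article, DC.date, Literal("2010")))

print(g.serialize(format="turtle"))
</verbatim>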
---+++ Dryad
Dryad is a general-purpose repository: it does not impose constraints on how data are represented within the files that users submit. Best practices therefore need to come from elsewhere, such as journal policies, MIAPA, and community practice informed by awareness of how the data will be re-used by more specialized phylogenetic tools. Dryad recently introduced a "handshaking" feature for TreeBASE: users can elect to have a NEXUS file deposited in Dryad "pushed through" to TreeBASE to initiate the submission process there. So, for the special case of phylogenetic data in Dryad, we would encourage depositing the tree within a NEXUS file, together with whatever OTU metadata can fit within that file format. I dream of a future in which many different software tools will support the editing and output of metadata-rich phylogenies in NeXML, and in which TreeBASE can ingest those NeXML files; but we aren't there yet. If a user doesn't intend to use TreeBASE for whatever reason, then a Newick tree in one file and OTU metadata in a separate CSV file would be a reasonable low-tech solution, as long as the OTU identifiers are consistent between the files. A ReadMe file could also be used to provide study-level metadata.
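As a minimal sketch of the low-tech option just described (assuming the placeholder file names "tree.nwk" and "otu_metadata.csv", a CSV column named "otu_label", and Biopython's Bio.Phylo for reading the tree), a consistency check between the two files might look like this:
<verbatim>
import csv

from Bio import Phylo

# Placeholder file names: a Newick tree and a CSV table of per-OTU
# metadata whose "otu_label" column holds the same identifiers.
tree = Phylo.read("tree.nwk", "newick")
tree_labels = {leaf.name for leaf in tree.get_terminals()}

with open("otu_metadata.csv", newline="") as handle:
    csv_labels = {row["otu_label"] for row in csv.DictReader(handle)}

# Report identifiers that do not match between the two files.
print("In tree but not in CSV:", sorted(tree_labels - csv_labels))
print("In CSV but not in tree:", sorted(csv_labels - tree_labels))
</verbatim>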
---+++ TreeBASE
TreeBASE is a repository for trees that has been in operation since the late 1990s. In the past few years, the schema was redesigned, and there have been numerous upgrades to the user interface, including a sophisticated submission process and a web-services API to retrieve results via a URL.
---++++ Uploading a tree to [[http://www.treebase.org TreeBASE]]
The TreeBASE website provides detailed instructions for submitting data. We obtained further information in a teleconference (9/29/10) with Bill Piel, who described the process as follows:
1. Use Mesquite to prepare the document before uploading to [[http://www.treebase.org TreeBASE]].
   * Why? Because 1) TreeBASE and Mesquite use the same Java API for parsing NEXUS, and 2) this API is a relatively complete and robust implementation of the standard.
1. In Mesquite, it is best to combine the matrix and the tree in the same file, to ensure matching names.
1. Ensure that taxon names are written out in full as a binomial or trinomial.
   * If there are infraspecies, just write the triplet without ‘var’, ‘subsp’, etc.
   * What if there are multiple specimens for the same taxon? Each name must be unique, so make sure the specimen ID (or similar) is a suffix formatted with a leading capital or a number, so that TreeBASE won't treat it as a new taxon name.
1. After upload, click on the yellow taxon button, then click "validate taxon labels". TreeBASE tries to match up labels with existing taxon names; if no match is found, it checks uBio. If a name may be a homonym, the user will be asked to choose which taxon to link to (NCBI handles the homonyms). TreeBASE will link taxon names to a GenBank taxid if possible.
1. Create an analysis record to link the matrices to the trees.
1. Link to specimen IDs (e.g., a GenBank accession) by setting attributes of rows in the matrix:
   * After uploading the matrix, click "download rowsegment template". There is a list of row labels to populate, in which you can enter Darwin Core information about the specimen.
   * There is a bug here: if some rows are populated for a given column, all rows must be populated for that column, and an error results if any are left blank. To work around this, just put something there, such as a dash ("-").
   * You could apply this metadata to just a part of the alignment.
See below for notes on how much of this metadata is included in [[http://www.nexml.org NeXML]] output. Currently there is no way to attach metadata to the tree nodes individually.
---++++ Results of using the [[http://www.treebase.org TreeBASE]] submission process
To assess the TreeBASE submission process, we uploaded files with OTU labels that contain species names. These were recognized >90% of the time by TreeBASE once the "validate taxon labels" button was pressed (which prompts the question of why TreeBASE doesn't suggest these matches automatically and simply ask the user to confirm). One of us used the "row segment table" interface to annotate a submission with GenBank accessions.
One of us (AS) worked with Dr. Martin Wu to submit data from a large, recently published analysis of prokaryotic phylogeny (Wu, et al., 2010). The data consist of a 720-taxon tree, a 6309-column alignment, and metadata (citation data, analysis methods) added interactively during the submission process. Prior to submission, AS spent several hours generating matching labels so that the separate alignment and tree files (initially with non-matching names) could be combined in Mesquite or Bio::NEXUS; this is a common stumbling block in phylogenetics workflows. Dr. Wu spent an hour on the submission process itself, though this stretched out over several weeks while a syntax issue due to differing interpretations of NEXUS was resolved via email, with help from Dr. Piel (initially, we encoded names as 'Genus_species_strain', based on the equivalence of spaces and underscores in NEXUS names; however, protecting the underscores within a single-quoted phrase prevented them from being treated as spaces by the TreeBASE NEXUS parser). When this minor syntax issue was resolved, TreeBASE automatically matched all 720 OTU names to qualified species names. The report was submitted and now appears as TreeBASE study [[http://www.treebase.org/treebase-web/search/study/summary.html?id=10965 10965]]. Before submitting to TreeBASE, Dr. Wu had been contacted with requests for the data three times in the 11 months since the paper was published. Dr. Wu reports that making the submission to TreeBASE was "definitely worth it".
The following [[http://www.treebase.org TreeBASE]] screenshot (cropped) shows how a user may assign a uBio ID to an OTU (and it also shows that _TreeBASE correctly guesses the actual species_):