head 1.80; access; symbols; locks; strict; comment @# @; 1.80 date 2013.01.22.16.54.21; author HilmarLapp; state Exp; branches; next 1.79; 1.79 date 2011.07.31.22.42.51; author HilmarLapp; state Exp; branches; next 1.78; 1.78 date 2011.02.16.14.55.13; author ArlinStoltzfus; state Exp; branches; next 1.77; 1.77 date 2011.02.14.15.57.02; author ArlinStoltzfus; state Exp; branches; next 1.76; 1.76 date 2011.02.10.17.29.59; author ArlinStoltzfus; state Exp; branches; next 1.75; 1.75 date 2011.02.09.19.42.05; author ArlinStoltzfus; state Exp; branches; next 1.74; 1.74 date 2011.02.08.18.04.35; author ArlinStoltzfus; state Exp; branches; next 1.73; 1.73 date 2011.02.08.16.41.14; author ArlinStoltzfus; state Exp; branches; next 1.72; 1.72 date 2011.02.07.21.31.22; author ArlinStoltzfus; state Exp; branches; next 1.71; 1.71 date 2011.02.07.17.10.13; author ArlinStoltzfus; state Exp; branches; next 1.70; 1.70 date 2011.01.31.21.15.57; author ArlinStoltzfus; state Exp; branches; next 1.69; 1.69 date 2011.01.27.00.36.00; author JamieW; state Exp; branches; next 1.68; 1.68 date 2011.01.26.19.47.47; author JamieW; state Exp; branches; next 1.67; 1.67 date 2011.01.26.14.13.46; author JamieW; state Exp; branches; next 1.66; 1.66 date 2011.01.25.16.10.39; author ArlinStoltzfus; state Exp; branches; next 1.65; 1.65 date 2011.01.25.13.24.38; author ArlinStoltzfus; state Exp; branches; next 1.64; 1.64 date 2011.01.25.02.56.33; author JamieW; state Exp; branches; next 1.63; 1.63 date 2011.01.24.21.57.44; author ArlinStoltzfus; state Exp; branches; next 1.62; 1.62 date 2011.01.24.19.37.51; author JamieW; state Exp; branches; next 1.61; 1.61 date 2011.01.21.22.30.30; author ArlinStoltzfus; state Exp; branches; next 1.60; 1.60 date 2011.01.21.20.43.12; author ArlinStoltzfus; state Exp; branches; next 1.59; 1.59 date 2011.01.18.16.26.41; author ArlinStoltzfus; state Exp; branches; next 1.58; 1.58 date 2011.01.18.05.10.26; author JamieW; state Exp; branches; next 1.57; 1.57 date 
2011.01.18.04.56.35; author JamieW; state Exp; branches; next 1.56; 1.56 date 2011.01.18.01.51.51; author JamieW; state Exp; branches; next 1.55; 1.55 date 2011.01.14.22.01.21; author ArlinStoltzfus; state Exp; branches; next 1.54; 1.54 date 2011.01.13.21.18.06; author JamieW; state Exp; branches; next 1.53; 1.53 date 2011.01.13.14.46.19; author ArlinStoltzfus; state Exp; branches; next 1.52; 1.52 date 2011.01.12.18.14.54; author ArlinStoltzfus; state Exp; branches; next 1.51; 1.51 date 2011.01.11.17.23.39; author ArlinStoltzfus; state Exp; branches; next 1.50; 1.50 date 2011.01.10.20.29.16; author ArlinStoltzfus; state Exp; branches; next 1.49; 1.49 date 2010.11.24.18.37.47; author ArlinStoltzfus; state Exp; branches; next 1.48; 1.48 date 2010.11.24.15.47.24; author ArlinStoltzfus; state Exp; branches; next 1.47; 1.47 date 2010.11.19.18.02.02; author ArlinStoltzfus; state Exp; branches; next 1.46; 1.46 date 2010.11.18.20.36.34; author ArlinStoltzfus; state Exp; branches; next 1.45; 1.45 date 2010.11.18.16.21.47; author ArlinStoltzfus; state Exp; branches; next 1.44; 1.44 date 2010.11.16.18.52.36; author ArlinStoltzfus; state Exp; branches; next 1.43; 1.43 date 2010.11.15.14.30.45; author DanRosauer; state Exp; branches; next 1.42; 1.42 date 2010.11.15.13.58.19; author ArlinStoltzfus; state Exp; branches; next 1.41; 1.41 date 2010.11.12.14.33.55; author ArlinStoltzfus; state Exp; branches; next 1.40; 1.40 date 2010.11.11.21.37.09; author ArlinStoltzfus; state Exp; branches; next 1.39; 1.39 date 2010.11.11.17.10.57; author ArlinStoltzfus; state Exp; branches; next 1.38; 1.38 date 2010.11.11.14.56.31; author ArlinStoltzfus; state Exp; branches; next 1.37; 1.37 date 2010.11.10.21.03.51; author ArlinStoltzfus; state Exp; branches; next 1.36; 1.36 date 2010.11.10.18.37.53; author ArlinStoltzfus; state Exp; branches; next 1.35; 1.35 date 2010.11.10.16.00.22; author ArlinStoltzfus; state Exp; branches; next 1.34; 1.34 date 2010.11.09.17.27.04; author ArlinStoltzfus; state 
Exp; branches; next 1.33; 1.33 date 2010.11.09.15.22.55; author ArlinStoltzfus; state Exp; branches; next 1.32; 1.32 date 2010.10.30.14.51.52; author ArlinStoltzfus; state Exp; branches; next 1.31; 1.31 date 2010.10.28.13.51.59; author ArlinStoltzfus; state Exp; branches; next 1.30; 1.30 date 2010.10.27.21.23.47; author ArlinStoltzfus; state Exp; branches; next 1.29; 1.29 date 2010.10.25.20.28.52; author DanRosauer; state Exp; branches; next 1.28; 1.28 date 2010.10.25.19.18.47; author DanRosauer; state Exp; branches; next 1.27; 1.27 date 2010.10.25.15.46.22; author DanRosauer; state Exp; branches; next 1.26; 1.26 date 2010.10.25.13.59.52; author ArlinStoltzfus; state Exp; branches; next 1.25; 1.25 date 2010.10.22.20.43.54; author DanRosauer; state Exp; branches; next 1.24; 1.24 date 2010.10.21.14.10.15; author DanRosauer; state Exp; branches; next 1.23; 1.23 date 2010.10.19.16.07.47; author ArlinStoltzfus; state Exp; branches; next 1.22; 1.22 date 2010.10.18.20.49.11; author ArlinStoltzfus; state Exp; branches; next 1.21; 1.21 date 2010.10.18.16.17.16; author ArlinStoltzfus; state Exp; branches; next 1.20; 1.20 date 2010.10.18.14.42.53; author ArlinStoltzfus; state Exp; branches; next 1.19; 1.19 date 2010.10.14.19.39.10; author ArlinStoltzfus; state Exp; branches; next 1.18; 1.18 date 2010.10.13.13.44.39; author ArlinStoltzfus; state Exp; branches; next 1.17; 1.17 date 2010.10.07.18.55.04; author ArlinStoltzfus; state Exp; branches; next 1.16; 1.16 date 2010.10.07.13.40.13; author ArlinStoltzfus; state Exp; branches; next 1.15; 1.15 date 2010.10.06.21.05.39; author ArlinStoltzfus; state Exp; branches; next 1.14; 1.14 date 2010.10.06.17.51.47; author ArlinStoltzfus; state Exp; branches; next 1.13; 1.13 date 2010.10.05.21.04.31; author ArlinStoltzfus; state Exp; branches; next 1.12; 1.12 date 2010.10.05.16.53.59; author ArlinStoltzfus; state Exp; branches; next 1.11; 1.11 date 2010.10.04.20.35.54; author ArlinStoltzfus; state Exp; branches; next 1.10; 1.10 date 
2010.10.01.16.35.04; author ArlinStoltzfus; state Exp; branches; next 1.9; 1.9 date 2010.10.01.12.28.29; author ArlinStoltzfus; state Exp; branches; next 1.8; 1.8 date 2010.09.30.22.06.02; author ArlinStoltzfus; state Exp; branches; next 1.7; 1.7 date 2010.09.30.18.11.19; author ArlinStoltzfus; state Exp; branches; next 1.6; 1.6 date 2010.09.30.13.47.29; author ArlinStoltzfus; state Exp; branches; next 1.5; 1.5 date 2010.09.29.15.27.16; author DanRosauer; state Exp; branches; next 1.4; 1.4 date 2010.09.29.14.36.08; author ArlinStoltzfus; state Exp; branches; next 1.3; 1.3 date 2010.09.28.04.40.38; author DanRosauer; state Exp; branches; next 1.2; 1.2 date 2010.09.27.22.04.54; author HilmarLapp; state Exp; branches; next 1.1; 1.1 date 2010.08.31.14.15.29; author ArlinStoltzfus; state Exp; branches; next ; desc @none @ 1.80 log @none @ text @%META:TOPICINFO{author="HilmarLapp" date="1358873661" format="1.1" version="1.80"}% %META:TOPICPARENT{name="WorkingMeeting2010"}% A paper based on this technical report has now been published: Stoltzfus, Arlin, Brian O’Meara, Jamie Whitacre, Ross Mounce, Emily L Gillespie, Sudhir Kumar, Dan F Rosauer, and Rutger A Vos. 2012. “Sharing and Re-use of Phylogenetic Trees (and Associated Data) to Facilitate Synthesis.” BMC Research Notes 5 (1) (October 22): 574. doi:[[http://dx.doi.org/10.1186/1756-0500-5-574 10.1186/1756-0500-5-574]]. ---+ DRAFT: Current Best Practices for Publishing Trees Electronically *Authors* * Arlin Stoltzfus, Biochemical Science Division, NIST, 100 Bureau Drive, Gaithersburg, MD, 20899 * Jamie Whitacre, Smithsonian Institution * Dan Rosauer, Yale University * Torsten Eriksson, Royal Swedish Academy of Sciences %TOC% ---+ Summary An assessment of best practices for publishing phylogenetic trees is timely given recent decisions by many journals to require the archiving of trees. 
However, even without that justification, several longer-term trends favor an increased emphasis on richly annotated re-usable trees that can be linked to other data: the opportunities for phylogeny re-use are greater; new opportunities for aggregation and integration exist; and both specific and general technologies that make sharing and re-using trees easier have emerged.

This report summarizes an as-yet-incomplete project to perform an assessment of best practices and, in turn, suggest solutions for meeting recommendations provided by the scientific community and filling gaps in the current landscape. The motivation for the report is that it will encourage the use, and further the development, of data management practices that will benefit scientists individually and collectively. Archiving of results post-publication seems to benefit the scientific community, and to benefit the individual archiving scientist in the form of increased recognition. A less speculative motivation is that developing capacity to manage richly annotated yet interoperable data _benefits scientists_ (individually and collectively) and _science_ by making it easier to carry out integrative, automated, or large-scale projects.

This report represents the first step in a larger analysis. Following release of this initial report, we will disseminate a survey broadly, carry out further analysis, and write a more comprehensive report.
For the present purposes, we have conducted an initial general assessment of
   * data archiving policies and reporting standards adopted by journals and funding agencies
   * two electronic archives (TreeBASE and Dryad) suitable for storing phylogenies
   * file formats commonly used for representing phylogenies (Newick, NEXUS, NHX, phyloXML and NeXML)
   * available support for Life Science Identifiers (LSIDs) and other globally unique identifiers (GUIDs)

In other areas, we offer comments and call for more extensive analysis:
   * language support for representing data and metadata
   * current practices in the research community
   * software tools to support archiving and re-use

In addition, we have studied the submission process of TreeBASE, and have evaluated the capacity of various file formats to represent specific kinds of metadata (annotations) deemed likely to increase the capacity for research results to be discovered, interpreted, linked (to other data) and re-used, including:
   * publication data (authorship, citation)
   * species names and other taxonomic identifiers
   * methods used to infer a tree
   * geographic coordinates

Our tentative analysis suggests the following:
   * The infrastructure to support archiving of thousands of new phylogenetic trees is available
   * The needs of archiving are not the same as those of publishing linkable, re-usable data
   * No formalized reporting standard for a phylogenetic analysis currently exists
   * The extent to which _data_ archiving policies require archiving of phylogenetic trees is unclear
   * The potential for archiving richly annotated trees is limited by technology and standards
   * The gap between needs and capacities is much greater for publishing re-usable trees than for simple archiving

Archiving phylogenetic trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad).
However, the archival value of many trees will be limited without a shift in emphasis toward re-usability, along with technology and standards to support such a shift. While making trees archival is an important step forward for the phylogenetic community, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for most researchers to obtain. Before interoperability of richly annotated trees can be obtained, the research community must commit to the use of globally unique identifiers (GUIDs) for informational and material entities, and develop the syntax and semantics to represent the metadata upon which the value of the data depends. The community may be ready to respond to renewed calls for a Minimal Information for a Phylogenetic Analysis (MIAPA) standard.

---+ Request for Comments

To ensure that the descriptions and recommendations here are accurate and relevant to the community of users, we are seeking feedback in several ways. As described in Appendix 4, we intend to target scientists with a survey to assess current practices and needs. We also solicit feedback on this preliminary report (see [[#AddComments][below]]). We invite interested scientists to make comments and to join the effort required to complete this report.

---+ Draft Report: Current Best Practices for Publishing Trees Electronically

---++ Scope and rationale of this report

A major National Research Council (NRC) report on "A New Biology for the 21st Century" (2009; http://www.nap.edu/catalog.php?record_id=12764) suggests enormous potential for biological discovery based on aggregating and integrating data from diverse sources and from multiple disciplines. More specifically, recent commentaries (e.g., Sidlauskas, et al., 2010; Patterson, et al., 2010) suggest the possibility that archiving and re-use of phylogenetic trees and biodiversity data will soon take off at a pace not seen before.
At the same time that funding agencies, publishers, and scientific culture are shifting in ways that incentivize sharing of data-- including phylogenetic trees-- new technologies and standards are emerging with the potential to make phylogenetic methods and results more interoperable (Sidlauskas, et al., 2010; Prosdocimi, et al., 2009). This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects.

What standards and technologies _will_ allow scientists to take advantage of phylogenies for "A New Biology"? The answer to this question is not certain. It is not clear whether promising interoperability technologies will fulfill their promise. It is not clear whether, in the phylogenetics community, these technologies will be developed in an orchestrated manner, through stakeholder organizations (analogous to TDWG for biodiversity studies), or in a more anarchic or competitive way.

However, regardless of strategies for responding to current challenges, the first step is to understand those challenges. For this reason, at the TDWG 2010 meeting in Woods Hole, the TDWG phylogenetic standards interest group set in motion a project to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging the use and further development of practices that make trees interoperable. We aim to identify strengths and weaknesses in the current infrastructure, practices, and policies; to educate tree producers about the needs of tree users; and to educate users about the needs of tree producers. The scope of this project extends, in principle, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc.) where the archiving and re-use of trees is of interest to scientists.
As a step toward this goal, we have undertaken a preliminary assessment of current best practices for publishing phylogenetic trees. Specifically, in regard to the electronic archiving and re-use of trees, we have done a preliminary review of
   * relevant institutional policies and reporting standards
   * current practices
   * data formats
   * software tools
   * ontologies and other forms of language support

This document reports the results of our preliminary assessment. To get broader feedback, recruit interested scientists, and gather information to finish the report, we have partnered with participants of the MIAPA-discuss (miapa-discuss@@googlegroups.com) email list. This group has developed a survey that will be sent to thousands of scientists. After analyzing this feedback, we intend to expand this preliminary report into a manuscript for publication. We invite those willing to make a commitment of work to join in this project.

---++ Background: data archiving and re-use, how and why?

Re-use of published results is crucial to the progressive and self-correcting nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. By publishing, authors were obliged to share data (and materials), but publishers had little power to enforce such obligations. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding data, rather than sharing data, an attitude that remains common in some fields (ref: Piwowar).

The circumstances of publishing and data reuse have changed radically in the past few decades. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form.
For instance, the record of a 3-dimensional protein structure contains roughly 10^4 coordinates (each a floating-point number) per domain. To facilitate archiving and re-use of such data, crystallographers collaborated in 1971 to launch the Protein Data Bank ([[http://www.pdb.org][PDB]]), which is still the world's premier archive for 3D protein structures. In 1982, just 5 years after the discovery of DNA sequencing methods, GenBank was launched to archive the DNA sequences that were crowding the pages of journals. In both cases, editorial boards of relevant scientific journals quickly decided to require simultaneous archiving (of 3D structures in PDB; of DNA sequences in GenBank), so that data would be accessible to all scientists upon publication.

Meanwhile, principled reasons to promote data sharing have exerted an increasing influence over institutions and institutional policies. Professional associations, publishers, and funding agencies recognize that availability of the data underlying published scientific findings is essential to a healthy scientific process (see Appendix 1). Funding agencies such as NSF and NIH increasingly recognize that work done on behalf of the public, especially if it is funded by taxpayers, should be accessible to the public without restriction.

Viewed as a large-scale dynamic, data sharing is a movement of information from producers to consumers, facilitated by informatics tools, and guided in various ways by institutional policies. Some of the policies noted above represent incentives or pressures on individual researchers to "push" data out into the world. However, there is simultaneously an increasing "pull" from the promise of large-scale studies that aggregate and re-purpose data. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists and stored in these archives.
For this reason, no scientist would doubt the utility of these archives for scientific research.

How do these and other factors apply to the publishing of phylogenetic trees? Phylogenetic trees play two central roles in modern biology: organizing knowledge by lines of descent, and extending knowledge through comparative analysis. In either role, trees become useful only to the extent that the tree and its parts are attached to, or can be linked to, data and metadata. Until recently, for most authors, publishing a phylogenetic tree meant publishing a picture of a tree in a journal article (figure)-- an informational dead-end. Thus, in the economy of phylogenetic data sharing, there are thousands of phylogeny producers (Kumar and Dudley, 2006), but there have been few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990s.

However, conditions are changing in ways that favor archiving and re-use of trees:
   * In a coordinated effort announced early in 2010, various journals in evolution and systematics have implemented data-archiving policies;
   * In 2010, TreeBASE (Piel, et al., 2002) completed a substantial upgrade of features, including its submission process;
   * A new data archive, Dryad, began accepting data from ecological and evolutionary studies, including phylogenetic trees, in 2009;
   * NSF recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1);
   * In recent years, the National Evolutionary Synthesis Center (NESCent) has invested in "phyloinformatics" efforts to enable interoperability, resulting in projects to develop an XML file format (NeXML), an ontology (CDAO), and a web-services standard (PhyloWS).

What about phylogeny consumers?
The most significant challenge in enabling an information-sharing dynamic in phylogenetics may be to recognize and understand how and why phylogeny consumers would re-use a phylogeny product. Phylogenetic trees play a vital role in research. Anyone with experience in phylogenetic analysis quickly learns that our biologist colleagues want trees for various research purposes, and frequently ask for help in getting them. Thus, there is an enormous demand for phylogenetic knowledge. Yet, the product of an individual phylogeny producer, it seems, is unlikely to be re-used or re-purposed. The reason for this seems to be that the typical need is for a very narrowly defined phylogenetic product, with a specific set of OTUs and characters, including up-to-date information.

Limited cases in which phylogenetic results are aggregated or re-used are playing a more prominent role in scientific research. Examples of large-scale projects that rely on meta-analysis or integration include assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework (this section needs specific references and examples along the lines of Burleigh, et al., 2010, and Sidlauskas, et al., 2010).

This project to assess strengths and weaknesses in current practices has the ultimate goal of enabling effective data-sharing that links phylogeny producers and phylogeny consumers. Logically, then, we must begin with some notion of what makes a tree re-usable. In considering what makes it likely for a tree to be re-used in study replication, meta-analysis, aggregation, or integration, we draw guidance from the 2008 roadmap of the TDWG Technical Architecture Group (TAG), and two recent commentaries, by Sidlauskas, et al. (2010) on synthetic approaches to evolutionary analysis, and by Patterson, et al. (2010) on the importance of names.
Together, these resources suggest the importance of
   * having unique names for biological things of interest (globally unique identifiers or GUIDs)
   * exchanging information using validatable formats (e.g., XML)
   * using a controlled set of terms and predicates, ideally defined in an ontology
   * providing context with rich annotations ("metadata")
for the re-usability of phylogenetic results. Below, we briefly explain these features.

---++++ 1. Standard, Validatable Formats

Currently, most trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a flat picture of a phylogenetic tree (figure, above), rather than a machine-readable symbolic encoding of relationships. For trees to be re-usable, they must be accessible in a standard, computer-readable format that makes the structure of the tree explicit. The tree image above corresponds to the following Newick string:
((otu1:0.34, otu2:0.19):0.11, otu3:0.44);
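To illustrate what "machine-readable" means here, the following is a minimal sketch in Python of how software might turn the Newick string above into an explicit tree structure. It is not taken from any particular phylogenetics package: the function names =parse_newick= and =leaf_labels= are hypothetical, and the sketch assumes unquoted labels and optional branch lengths only (no quoted names or bracketed comments).

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Assumption for this sketch: labels are unquoted and there are no
    bracketed comments; only labels and branch lengths are handled.
    """
    s = "".join(text.split())  # remove all whitespace
    if not s.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def read_node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1  # consume '('
            children.append(read_node())
            while s[pos] == ",":
                pos += 1  # consume ','
                children.append(read_node())
            if s[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional label: everything up to the next structural character.
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        label = s[start:pos] or None
        # Optional branch length, introduced by ':'.
        length = None
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return {"label": label, "length": length, "children": children}

    root = read_node()
    if s[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_labels(node):
    """Return the labels of the terminal nodes (OTUs), left to right."""
    if not node["children"]:
        return [node["label"]]
    return [name for child in node["children"] for name in leaf_labels(child)]


tree = parse_newick("((otu1:0.34, otu2:0.19):0.11, otu3:0.44);")
print(leaf_labels(tree))
```

Once the tree is parsed, its topology and branch lengths can be queried directly, which is not possible when the tree exists only as a picture. Note that the parsed result still carries only the local labels and branch lengths.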
Newick is the simplest of several data formats that are used to represent trees (see Appendices 1 and 3). For instance, PhyloXML and NeXML are formats defined by schemata. Available tools allow any instance of a PhyloXML file to be checked against the schema, to determine whether the file is properly formed. A file with mistakes in syntax (misspelled terms, missing punctuation, etc.) will be found invalid. But if a file is valid, then any software that fully supports the standard should be able to read it. Automated validation removes uncertainty, especially in regard to the causes of errors.

---++++ 2. Globally Unique Identifiers (GUIDs)

There are many uses for identifiers. To integrate data from diverse sources, we need to have some kind of integrating variable such as a species name, a specimen number, etc. For instance, to aggregate data on species occurrence, we need to know if a report of a bird in location X and a second report of a bird in location Y refer to the same species of bird. If we wish to integrate data from all over the world using names for things, then the names should be unique over all the world. By contrast, the tree example above invokes entities "otu1", "otu2" and "otu3". If we integrated data from all over the world using arbitrary local names like "otu1" and "otu2", we would make mistakes by aggregating information that does not belong together.

Using globally unique identifiers or GUIDs ensures that when we refer to a thing, regardless of context, we know what it is. For instance, perhaps otu1 refers to the coyote. In that case, we can provide a Life Science Identifier or LSID (urn:lsid:ubio.org:namebank:2478093), which is a kind of GUID, and this LSID will make it possible for the researcher to associate the entity with information on _Canis latrans_ (coyote) available in resources such as the Encyclopedia of Life.
If otu1 is a gene sequence, then an http URI for its NCBI accession can serve as a GUID, and this will make it possible for any subsequent researcher to associate "otu1" with the underlying sequence data. The system of Digital Object Identifiers (DOIs) allows publishers to assign GUIDs to publications. Thus, not just species or samples, but information artefacts, should have unique identifiers. For instance, in TreeBASE, each tree, data matrix, and study receives an ID that is unique and stable, allowing TreeBASE to offer persistent GUIDs via http URIs.

---++++ 3. Rich Annotations ("metadata")

While the Newick tree above is in a standard format and could be archived in the Newick format, it remains an information dead-end, because we do not know what it refers to or how it was derived. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree because we can't tell whether it is a species tree (relevant to speciation), or some other kind of tree (irrelevant).

So then, what kind of annotations increase the re-usability of a tree? What are the integrating variables a tree consumer would use to integrate or aggregate trees? Imagine that we have a data-mining tool with access to all published trees, richly annotated. Our challenge is to use this database to reveal prior work on a topic, to test a hypothesis, to discover new relationships, or to carry out a meta-analysis addressing a methodological issue.
In this context, useful types of data or metadata would include:
   * data (or GUIDs for data) from which the tree was inferred (e.g., if we wish to combine data into a supermatrix)
   * authorship and citation data (e.g., if we wish to find all the studies by a particular author)
   * taxonomic links and species identifiers for OTUs (e.g., to find all studies relevant to a taxonomic group)
   * identifiers for a specimen or accession to which OTUs are linked (e.g., to find any studies with a particular gene)
   * geographic coordinates (e.g., to integrate phylogeny data with other geographically-linked data)
   * a description of the method by which the tree was inferred (e.g., to enforce quality controls)

---++++ 4. Formal Language Support

Computable knowledge representation is largely a matter of relationships between entities that can be expressed as subject-predicate-object triples, e.g., "Bob has_friend Susan" or "Susan is_a female_person". By joining these two statements via the identity of Susan (which we could establish via a GUID for Susan), we can answer the question of whether Bob has any female friends, even though neither statement alone tells us this.

Without formal language support, it is unclear what such terms and predicates mean. For instance, if we are interested in the phylogenetic origin of primates and search the web for a _tree that has a monkey and a squirrel_, we will not find a phylogeny that includes a monkey and a squirrel as OTUs, but we will get other fascinating information, including information about 1) *squirrel monkeys*, which spend time in *trees*, 2) sounds made by *squirrels*, *monkeys* and other animals that live in *trees*; and 3) news of a scientific study showing that macaques (a kind of *monkey*) in *trees* get upset when flying *squirrels* sail over them.
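The join over subject-predicate-object triples described at the start of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the statements about "Carlos" are hypothetical additions (not part of the example in the text) included so that the join has something to filter out, and plain strings stand in for the GUIDs that a real system would use to establish identity.

```python
# Each statement is a (subject, predicate, object) triple.
# The two "Carlos" triples are hypothetical, added for contrast.
triples = [
    ("Bob", "has_friend", "Susan"),
    ("Bob", "has_friend", "Carlos"),
    ("Susan", "is_a", "female_person"),
    ("Carlos", "is_a", "male_person"),
]


def objects(statements, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return {o for s, p, o in statements if s == subject and p == predicate}


def female_friends(statements, person):
    """Join has_friend and is_a statements on the identity of the friend."""
    return {
        friend
        for friend in objects(statements, person, "has_friend")
        if "female_person" in objects(statements, friend, "is_a")
    }


print(female_friends(triples, "Bob"))
```

The join works only because "Susan" in the first statement and "Susan" in the third are known to be the same entity, and because both sources agree on what has_friend and is_a mean. Those are the roles played by GUIDs and by shared vocabularies, respectively.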
Typically, a domain expert (in knowledge representation, a field of application such as phylogenetics is called a *domain*) avoids language problems by using a limited set of known data sources with a limited set of terms and predicates whose domain-specific meanings are understood by the expert. Thus, one way to support clear use of language is to have domain-specific vocabularies. A more robust form of support is to specify concepts in an ontology.

Representing the kinds of annotations that make phylogenies suitable for re-use-- citations, taxonomic links, provenance information, georeferences, methods descriptions-- requires language support for the relevant concepts. For instance, [[http://dublincore.org/documents/dces/][Dublin Core]] provides a metadata standard for documents, providing terms for assigning authorship, title, and so on. The [[http://open-biomed.sourceforge.net/opmv/ns.html][Open Provenance Model Vocabulary Specification]] provides a term "wasDerivedFrom", such that, having derived "tree1" from "alignment1", we could annotate this relationship with the statement

tree1 [[http://purl.org/net/opmv/ns#wasDerivedFrom][http://purl.org/net/opmv/ns#wasDerivedFrom]] alignment1

An important aim for this project is to investigate what kinds of annotations can be supported by available vocabularies and, where possible, to make recommendations about which vocabularies are best for which annotations. Some types of annotations involve _domain-specific concepts_, e.g., if we wish to distinguish "unrooted tree" from "rooted tree" in a robust way, this must make reference to some externally defined concept.
---++ Analysis

---+++ Policies

*Evolution-related journals and their data policies.* In early 2010, the editorial boards of eight journals (including _Evolution_, _Molecular Biology and Evolution_, _American Naturalist_, _Molecular Ecology_, _Journal of Evolutionary Biology_, _Heredity_, and _Evolutionary Applications_) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples include _Systematic Biology_, _Molecular Phylogenetics_, and so on). The policies adopted by most of these journals as of January 2011 require data archiving in an "appropriate public archive" to ensure that the data are "preserved and usable for decades in the future".

However, some policies are more stringent than others. For example, _Evolution_ requires that "authors submit DNA sequence data to GenBank and phylogenetic data to TreeBase" and _American Naturalist_ stipulates that "authors. . . deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required." Other journals have a looser policy. _Molecular Ecology_ "expects that data supporting the results in the paper should be archived in an appropriate public archive such as GenBank, Gene Expression Omnibus, TreeBASE, Dryad, the Knowledge Network for Biocomplexity, your own institutional or funder repository, or as Supporting Information on the _Molecular Ecology_ web site." Furthermore, _Evolutionary Applications_ states that "only data underlying the main results in the paper need to be made available, In addition, sufficient information must be provided such that data can be readily suitable for re-analyses, meta-analyses, etc. . . . The preferred way to archive data is using public repositories. For types of data for which there is no public repository, authors can upload the relevant data as Supplementary Materials on the journal's website.
Data submission to any of these repositories and the acceptance of the data by these repositories must occur *before* the manuscript goes to production. Appendix 1 provides detailed guidelines for submitting to TreeBASE and to Dryad.

*National Science Foundation (NSF).* In the US, NSF is the major funder of evolutionary science. As described in Appendix 1, NSF guidelines call for proposals to include a “Data Management Plan” to describe how the proposal will conform to NSF policy on the dissemination and sharing of research results, including what types of data will be produced, "the standards to be used for data and metadata format and content", and plans "for preservation of access" to the data. The policy does not specify any particular standards, but merely calls on researchers to address this issue.

---+++ Standards

*Life Science Identifiers (LSIDs)* represent a standard developed and approved by Biodiversity Information Standards (TDWG), an organization that promotes the wider and more effective dissemination of information about the world's heritage of biological organisms.

*MIAPA.* Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting requirement yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al.
suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[http://evoinfo.nescent.org/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps.

Open questions
   * are there other standards that are applicable here?

---+++ Formats

Various formats are used for phylogenies. Here we review information on the following 5 formats:
   * Newick - designed for trees only, with labels but no other data or metadata
   * NEXUS (http://informatics.nescent.org/wiki/Supporting_NEXUS_Documentation) - a full-featured but dated format
   * NHX ([[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]]) - an extension of Newick with limited uses
   * [[http://www.phyloxml.org phyloXML]] - an economical and easy-to-use XML format tuned to molecular phylogenies
   * [[http://www.nexml.org NeXML]] - a full-featured XML format that allows arbitrary annotations of diverse types of data

The Newick ("New Hampshire") format was developed informally in 1986 by a group of phylogenetic software developers (http://en.wikipedia.org/wiki/Newick_format). It was intended to represent trees only, not associated data or metadata.

NEXUS is a highly expressive data format that has been in use for nearly as long as Newick.
It is the preferred format for many phylogenetic inference programs such as PAUP* and MrBayes. The basic structure of a NEXUS file is a series of blocks, each containing commands. The most commonly used blocks are TAXA (a declared list of OTUs), CHARACTERS (a matrix of comparative data) and TREES (one or more phylogenetic trees for the OTUs). OTUs and characters can be referenced (from other blocks) by index numbers. Due to the lack of an ongoing development model, and ambiguities in the syntax, different interpretations of NEXUS have arisen within the phylogenetics community.

NHX (New Hampshire eXtended) format was developed by Christian Zmasek as an extension of Newick, to represent common annotations of nodes (e.g., duplication events), and to insert molecular sequences. However, the highly constrained syntax of NHX limits its usefulness.

In the past few years, four different XML formats have become available, though none is in widespread use. The main developer of NHX format, Christian Zmasek, went on to develop phyloXML (Han and Zmasek, 2009), a validatable format to represent a greater range of attributes than NHX. PhyloXML has an economical schema tuned to the needs of molecular phylogeneticists. The BEAST package (Drummond, et al., 2007) has used an XML input format for several years, but it is not considered further here because it is not used to export trees (BEAST outputs trees in NEXUS format). Likewise, while it is possible to encode comparative data in terms of CDAO (Comparative Data Analysis Ontology: Prosdocimi, et al., 2009), and serialize this as RDF-XML, this is not the recommended use of CDAO. NeXML (http://www.nexml.org) is an XML format with a precisely defined schema, modeled after the structure of NEXUS. While the design of phyloXML takes a very direct approach to satisfying user needs, NeXML opts for greater generality at the expense of a much more complex schema.
It has an approach to metadata that allows for arbitrary annotation of data objects using external vocabularies.

Features of Newick, NHX, NEXUS, phyloXML and NeXML are compared in the table below (a filled square indicates presence of the feature; an open circle indicates that there are significant limitations on this feature).

---+++ Archives

Researchers wishing to make a phylogenetic tree available in a public archive currently have two options, TreeBASE and Dryad.

The TreeBASE project is a specialized repository that focuses on supporting phylogenetic studies (Piel, et al., 2002). TreeBASE 2.0 (released in March, 2010) has a relational database back-end with a complex schema that allows it to accommodate not just phylogenies and character matrices, but metadata associated with a study, including authors, publications, and descriptions of methods. The submission process (see Appendix 3) is well documented, and allows users to associate OTUs with species names (NCBI or UBio names) and to add other types of metadata. Data are uploaded in NEXUS format. Other formats are not currently supported and support for studies with large numbers of trees is limited. Archived data are made available to users via a convenient web interface; however, the web interface does not provide full access to the schema. Last year, according to the web site (http://www.treebase.org/treebase-web/about.html), TreeBASE contained 6,500 trees in 2,500 publications (60,000 distinct taxa). This is a small fraction of trees published since TreeBASE began in the mid-1990s. Due to variable journal policies, TreeBASE is well known and extensively used in some sub-disciplines such as fungal systematics, but is relatively unknown and unused in others (see Appendix 1 for a list of 19 journals that require or recommend submission of trees to TreeBASE as a condition of publication).
The Dryad project was launched in 2009, to support archiving of data from ecological and evolutionary studies, including data that do not fit any specialized database. The data may include images, text files, spreadsheets, and some other types of files. According to Todd Vision of the Dryad project (see Appendix 3), "since Dryad is a general-purpose repository, it doesn't impose any constraints on how the data are represented within the files that users submit. The best practices need to come from elsewhere, such as journal policies, MIAPA . . . and community practice." Because of the diversity of data, there is no back-end schema for knowledge organization. Instead, any text in uploaded files is indexed so that relevant files can be identified and retrieved by users. The submission process, currently in beta testing, is carefully explained on the website. By submitting data, users make the data available for re-use via a Creative Commons license. The launch of Dryad was coordinated with an initiative urging journals and professional societies in the disciplines of ecology and evolution to adopt policies about data archiving and data sharing (see Appendix 1). According to the web site (http://www.datadryad.org), in January of 2011, Dryad contained 407 data packages and 1000 data files, published in 52 journals. Each package, apparently, corresponds to a publication.

Open questions
   * Does MorphoBank also archive data? What about TOLKIN? Any others?

---+++ Language Support (Ontologies and Other Vocabularies)

In recent years there has been an explosion of work on ontologies to provide the language support for sharing of knowledge in life sciences. Ontologies are one extreme on a continuum of vocabulary artefacts ranging from lists of informally defined tags, to hierarchies of classes (taxonomies), to ontologies that specify class hierarchies as well as relations and attributes.
In some areas of research, vocabulary artefacts play a key role, e.g., the Gene Ontology (GO), primarily a set of 3 taxonomies (molecular function, pathway, location), plays a key role in genome annotation. The field of comparative biology benefits from a long tradition of using organismal taxonomy, which is a hierarchy of approved names and synonyms. In regard to representing trees and associated data, a clear example of the use of a standard vocabulary is the use of IUPAC codes for nucleotides and amino acids (e.g., as explicitly defined in the NEXUS format specification of Maddison, et al. 1997).

As a general practical matter, NeXML and CDAO together provide an approach to metadata that is open-ended. NeXML is designed to take advantage of external vocabularies. So, if there is a way to say something by invoking the terms of an external vocabulary, it can be said within NeXML (and this is the advantage of the NeXML metadata approach). The Comparative Data Analysis Ontology (CDAO) provides language support for many aspects of comparative analysis, though it has not been tested extensively and remains experimental. The concept of a phylogenetic tree could be designated by reference to CDAO:Tree, which in turn defines a tree as a sub-class of "Network", and in relation to other concepts like "Branch" and "Node".

However, in this area, the role of ontologies and other vocabularies is mainly a matter of future possibilities, rather than current practice. A variety of relevant artefacts exist, some of them recently developed. For this preliminary report, we have identified some possible resources, but we are not sure how useful they will be. At this point, we mainly have questions.

*Representing authorship and citations*

Citing a journal publication is a key issue. However, there is also an issue of the authorship of an electronic document (e.g., tree file), which may be distinct from a publication.
Dublin Core (http://dublincore.org/documents/dces/), a metadata standard for documents, provides language for authorship, creation dates, copyright, licenses and so on. A major shortcoming is that it lumps journal, volume, issue and page into one concept (dc:citation), making it unsuitable for scientific literature. PRISM (http://www.prismstandard.org), which builds on Dublin Core, may provide a better alternative for referencing the scientific literature.

*Linking to taxonomic concepts*

LSIDs are a standard approved by TDWG. Sources such as [[http://www.ubio.org/][UBio]] provide LSIDs for taxa (taxon concepts). This solves one part of the problem of providing annotations that refer to LSIDs for species. The other part, which is less clear, is the issue of what predicates one should use to link entities in a tree file with a species source, or with another taxonomic concept. For instance, what is the proper relation between an internal node of a tree and a concept for a higher taxonomic category that corresponds to it?

*Provenance*

A key issue in annotating a phylogeny with character data is to indicate the source of data or specimens. For molecular sequences, a GenBank accession is appropriate. PhyloXML has a simple tag for that. In other cases, it is not clear which predicate to use, especially since an aligned molecular sequence may be derived by truncation from a GenBank source. In this case, the Open Provenance Model (mentioned earlier) has some general predicates such as wasDerivedFrom. A tree can be annotated as having been derived from an alignment, and this alignment can be annotated, in turn, as being derived from individual sequences with GenBank sources. Another case is that in which we wish to associate data with a specimen that has a museum accession. Does DarwinCore provide an appropriate source of predicates?

*Georeferences*

TDWG TaxonOccurrence (http://rs.tdwg.org/ontology/voc/TaxonOccurrence) seems to address this.
This could be incorporated directly into NeXML. In phyloXML, there are pre-assigned tags.

*Methods*

As indicated in the MIAPA paper (Leebens-Mack, et al. 2006) and in the TreeBASE submission protocol, researchers dealing with molecular data consider methods to be an important component of metadata. The Open Provenance Model, mentioned above, provides some generic concepts that would be useful. However, in spite of the potential of CDAO (Prosdocimi, et al., 2009), there seems to be a major gap between what is available and what is needed to annotate the complex multi-step user-assisted workflows used by a scientific researcher to generate a phylogeny product. A recent LIMS plug-in for the Geneious software allows for tracking workflows and alignment annotations (see http://software.mooreabiocode.org).

Open questions:
   * Are there other standards for citations more appropriate than DublinCore and PRISM?
   * Where do we get predicates for linking to LSIDs for taxonomic concepts? Does DarwinCore provide an appropriate source of predicates?

---+++ Tools

For the purposes of this report, we did not carry out an extensive analysis of available tools. Our initial impression is that there is a deficit of tools supporting archiving and reuse of phylogenetic data. Even in an atmosphere that incentivizes data sharing, scientists motivated to archive or re-use data may find it difficult to do so if appropriate tools are lacking.
Some of the obvious kinds of tools to support archiving of richly annotated data sets are:
   * format validators
   * translators that support conversions of data and metadata between formats
   * annotation tools that allow users to add metadata using controlled vocabularies

Supporting re-use, re-purposing, aggregation and integration of data requires the same kinds of tools, as well as tools for
   * visualizing diverse types of data together with their metadata
   * manipulating data and metadata (e.g., extracting subsets or subtrees) while maintaining integrity
   * databasing a collection of studies for further analysis of their data and metadata
   * comparing, measuring and manipulating sets of trees (and associated data and metadata)

Many phylogenetics users implement a customized interactive workflow that relies on diverse software tools. Because these tools may use different formats, the ability to convert among formats is important. However, convenient generalized tools are lacking (the NeXML manual provides a useful list of online servers and scripting approaches). Likewise, there are many tools for viewing trees, but it seems that only a few allow for viewing trees together with a matrix of data (e.g., Archaeopteryx, Mesquite, Nexplorer). Support for adding or viewing metadata is very limited. The Phenoscape project has a tool used for project-specific purposes of adding ontology-based phenotype annotations to comparative data (in NeXML format). The TreeBASE submission server is an example of a tool that allows users to annotate data sets by associating OTUs with species names, and by associating data rows with accessions. However, this tool only works in the context of a TreeBASE submission.

---+++ Current Practices

The analysis of citations by Kumar & Dudley (2007) suggests that the number of phylogeny publications in 2006 was 7000, and the rate of phylogeny publications is rapidly increasing.
Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research. Yet only a subset is actually made available through publications and electronic media.

A systematic analysis of current practices for archiving and re-use of trees has not been performed for this draft report. In early 2011 we intend to release a survey that will provide information on some aspects of current practices. The main questions to address are:
   1. To what extent are trees being archived today? What data and metadata are being archived?
      * Archiving at journal web sites may be the most common method
      * [[http://www.treebase.org TreeBASE]] is widely used in some sub-disciplines
      * [[http://datadryad.org Dryad]], Morphobank, others?
   1. What are the real or perceived barriers to archiving?
   1. To what extent are trees being re-used?
   1. What are the real or perceived barriers to re-use?
      * what makes a study suitable for re-use by a particular user? (which data and metadata must match user's criteria)
      * if the re-usable study exists, would it be found easily?
      * if the re-usable study can be found, could it be accessed and interpreted easily?

---++ Conclusions: gaps and recommendations

This report is tentative and incomplete. However, we think it will be useful to suggest some conclusions and recommendations, subject to revision as we continue this project and expand our understanding.

---+++ Archiving

*Archives with the capacity to store thousands of new phylogenetic trees are available*. TreeBASE and Dryad may serve as repositories to ensure that phylogenetic tree information associated with a publication is recoverable many years into the future. These repositories represent quite different approaches to archiving. Individual users may find one archive more suitable than the other.
*There is no common reporting standard governing the archiving of trees*. It is not clear what a phylogeny producer should include when archiving a tree. In the absence of a developed MIAPA, there is no community standard of the minimal information for a phylogenetic analysis report. Institutional policies are inconsistent, and lack specifics. TreeBASE requires that a tree be accompanied by a data matrix, publication information, and an explicit methodological link from tree to matrix.

*The extent to which current policies require archiving of trees is unclear*. The draft policy suggested recently by journals (Appendix 1) refers to "data supporting the results" of a publication. The significance of this is unclear due to an ambiguity in the term "data". Natural scientists traditionally use the term "data" in the _empirical_ sense, as a synonym for "facts", the empirical observations or measurements on which further analysis rests. Phylogenies are not observational data. Computer and information scientists use "data" (and in some sub-fields, even "facts") to refer to any kind of recorded information, regardless of its nature or derivation. The NIH data sharing policy (see note 7 of http://grants.nih.gov/grants/policy/nihgps/fnpart_ii.htm) makes clear that NIH uses the _informational_ sense, not the _empirical_ one. We recommend that institutions with data archiving policies be explicit about what they mean by "data".

Open questions:
   * to what extent will authors avoid archives, e.g., by choosing a publisher that does not require archiving?
   * how significant is the technical barrier posed by format translation (implied by the TreeBASE instructions)?

---+++ Publishing re-usable or link-able data

Currently the gap between needs and capacities is much greater for publishing re-usable trees than for the problem of archiving trees.

*The needs of archiving are not the same as those of publishing linkable, re-usable data*.
For instance, BEAST (Drummond, et al., 2007) users can support study replication by archiving, with their NEXUS output file, their BEAST XML input file, which includes the input data along with precise instructions for processing. This is a perfectly adequate solution for archiving, and it provides strong support for study replication. However, this approach to archiving does not go very far toward facilitating re-use, because the information in the BEAST XML file provides instructions that only BEAST can understand, and is not anchored by semantics defined in external vocabularies.

*A reporting standard for a phylogenetic analysis can only extend the re-usability of archived trees*. As noted above, there is no standard governing archiving. A minimal reporting standard for phylogenetic analysis has been suggested (see MIAPA, Appendix 1) but not drafted or approved. Establishing such a standard is a critical step in promoting re-usable trees. Developing a standard requires community organization as well as technological support (some guidance is provided by a [[http://evoinfo.nescent.org/MIAPA_WhitePaper MIAPA whitepaper]] from the NESCent EvoInfo working group).

*Other issues on which a conclusion might be possible in the final report*
   1. Lack of resolvable LSIDs; lack of a validator to see if species refs are resolved
   1. Lack of formal language support (e.g., tree inferred_from_data matrix)
   1. Lack of community standards for some types of metadata
   1. Lack of education and awareness of metadata standards
   1. Lack of tools (software support) for annotation

#AddComments
---++ Please add your comments

%COMMENT{type="above"}%

---++ Author contribution and other acknowledgments
   * We thank TDWG for its support for the 1-day workshop that launched this project in Woods Hole, MA.
   * DR and AS pitched the idea for the project to the TDWG phylogenetic standards interest group
   * Elena Herzog, TE, DR, AS, and JW participated in the TDWG workshop project October 2010 in Woods Hole, MA
   * DR, AS and JW wrote the report
   * Bill Piel and Todd Vision provided information on archive projects
   * Christian Zmasek and Rutger Vos provided information on file formats
   * Other people in the discussion thread? (jsw)

---++ References

Burleigh, J. G., M. S. Bansal, et al. 2010. Genome-Scale Phylogenetics: Inferring the Plant Tree of Life from 18,896 Gene Trees. Syst Biol.

Han, M. V., and C. M. Zmasek. 2009. phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 10:356.

Kumar, S., and J. Dudley. 2007. Bioinformatics software for biologists in the genomics era. Bioinformatics 23:1713-1717.

Lapp, H., S. Bala, J. P. Balhoff, A. Bouck, N. Goto, M. Holder, R. Hollan, A. Holloway, T. Katayama, P. O. Lewis, A. Mackey, B. I. Osborne, W. H. Piel, S. L. Kosakovsky Pond, A. Poon, W. G. Qiu, J. E. Stajich, A. Stoltzfus, T. Thierer, A. J. Vilella, R. Vos, C. M. Zmasek, D. Zwickl, and T. J. Vision. 2007. The 2006 NESCent Phyloinformatics Hackathon: A field report. Evolutionary Bioinformatics 3:357-366.

Patterson, D. J., J. Cooper, P. M. Kirk, R. L. Pyle, and D. P. Remsen. 2010. Names are key to the big new biology. Trends Ecol Evol 25:686-691.

Piel, W. H., M. J. Donoghue, and M. J. Sanderson. 2002. TreeBASE: a database of phylogenetic knowledge. Pp. 41-47. In: Shimura, J., K. L. Wilson, and D. Gordon, eds. To the interoperable "Catalog of Life" with partners Species 2000 Asia Oceanea. Research Report from the National Institute for Environmental Studies No. 171, Tsukuba, Japan.

Prosdocimi, F., B. Chisham, E. Pontelli, J. D. Thompson, and A. Stoltzfus. 2009. Initial Implementation of a Comparative Data Analysis Ontology. Evolutionary Bioinformatics 5:47-66.

Sidlauskas, B., G. Ganapathy, E. Hazkani-Covo, K. P. Jenkins, H. Lapp, L. W.
McCall, S. Price, R. Scherle, P. A. Spaeth, and D. M. Kidd. 2010. Linking big: the continuing promise of evolutionary synthesis. Evolution 64:871-880.

---+ Appendices

---++ Appendix 1. Relevant Standards

---+++ Data sharing and archiving policies

*(not done)* For general information on data sharing policies in the US, see the Wikipedia [[http://en.wikipedia.org/wiki/Data_sharing data sharing]] article, or NIH's [[http://grants.nih.gov/grants/policy/data_sharing/ data sharing web site]]. Authors of scientific studies often are required (as a condition of funding or of publication) to make data available to the research community without restriction. Add something on Scientific Data Management for Government Agencies working group. (jsw)

---++++ Evolution and Systematics Journals

*(done)* [[http://www.treebase.org TreeBASE]] has an [[http://www.treebase.org/treebase-web/journal.html online list]] of 19 journals that require or recommend submission of trees to TreeBASE as a condition of publication (_Evolution_, _Evolutionary Applications_, _Fungal Biology_, _Invertebrate Systematics_, _Mycologia_, _Mycological Progress_, _Mycological Research_, _Mycoscience_, _Mycosphere_, _Organisms, Diversity, and Evolution_, _Persoonia_, _Phytopathology_, _Plant Disease_, _Rhodora_, _Muelleria_, _Studies in Mycology_, _Systematic Biology_, _Systematic Botany_, _Tropical Bryology_).

The [[http://www.datadryad.org Dryad]] web site describes the Joint Data Archiving Policy as follows:
<< Journal >> requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as << list of approved archives here >>. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
It also lists the following partner journals (for links, go to the [[http://www.datadryad.org Dryad]] web site):
   * Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data Archiving. American Naturalist. 175(2):145-146, doi:10.1086/650340
   * Rieseberg, L., T. Vines, and N. Kane. 2010. Editorial and retrospective 2010. Molecular Ecology. 19(1):1-22, doi:10.1111/j.1365-294X.2009.04450.x
   * Rausher, M. D., M. A. McPeek, A. J. Moore, L. Rieseberg, and M. C. Whitlock. 2010. Data Archiving. Evolution. doi:10.1111/j.1558-5646.2009.00940.x
   * Moore, A. J., M. A. McPeek, M. D. Rausher, L. Rieseberg, and M. C. Whitlock. 2010. The need for archiving data in evolutionary biology. Journal of Evolutionary Biology 2010. doi:10.1111/j.1420-9101.2010.01937.x
   * Uyenoyama, M. K. 2010. MBE editor's report. Molecular Biology and Evolution. 27(3):742-743. doi:10.1093/molbev/msp229
   * Butlin, R. 2010. Data archiving. Heredity advance online publication. 28 April doi:10.1038/hdy.2010.43
   * Tseng, M. and L. Bernatchez. 2010. Editorial: 2009 in review. Evolutionary Applications. 3(2):93-95, doi:10.1111/j.1752-4571.2010.00122.x

---++++ National Science Foundation (NSF)
Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. See Grant Proposal Guide (GPG) Chapter II.C.2.j for full policy implementation.
The policy may be found in the Award and Administration Guide (AAG), [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 section VI.D.4.b]]:
b. Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.
The Grant Proposal Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j Section II.C.2.j]], reads partially as follows:
Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include: Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit, relevant to a proposal are available at: http://www.nsf.gov/bfa/dias/policy/dmp.jsp. If guidance specific to the program is not available, then the requirements established in this section apply.
---+++ Dublin Core

*(not done)* There isn't a standard for encoding Dublin Core (DC) publication data in XML. In particular, there isn't an enclosing element. In NeXML it would be "meta". DC isn't very well suited to journal articles, anyway. The best attempt I've seen (http://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/) goes like this:

Michael P. Taylor Darren Naish 2007 An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. urn:ISSN:0081-0239 Blackwell Text Palaeontology 50(6), 1547-1564. (2007) info:doi:10.1111/j.1475-4983.2007.00728.x

---+++ Darwin Core and TDWG

From the [[http://rs.tdwg.org/dwc/terms/guides/xml/index.htm Darwin Core XML Guide]] (specify namespace with xmlns:dwc="http://rs.tdwg.org/dwc/terms/"):

Anthus hellmayri Aves Anthus hellmayri urn:catalog:AUDCLO:EBIRD:OBS64515331

---++ Appendix 2: Sample data set rendered in different formats

To illustrate the representation capabilities (and limitations) of different formats we have developed a set of test files:
   * [[%ATTACHURL%/PF00034_4.nwk][PF00034_4.nwk]]: 4-taxon test case in Newick format
   * [[%ATTACHURL%/PF00034_4.nhx][PF00034_4.nhx]]: 4-taxon test case in NHX format
   * [[%ATTACHURL%/PF00034_4.nex][PF00034_4.nex]]: 4-taxon test case in NEXUS format
   * [[%ATTACHURL%/PF00034_4_phylo.xml][PF00034_4_phylo.xml]]: 4-taxon test case in [[http://www.phyloxml.org phyloXML]] format
   * [[%ATTACHURL%/PF00034_4_nexml.xml][PF00034_4_nexml.xml]]: 4-taxon test case in [[http://www.nexml.org NeXML]] format

Each file represents the same token set of data and (as allowed) metadata from cytochrome C sequences (PFAM family PF00034).
The tree for this gene family

((Mus_musculus_CAA25899.1:0.008307,Rattus_norvegicus_AAA21711.1:0.009662):0.024280[0.74], (Gallus_gallus_CAA25046.1:0.055226,Rattus_norvegicus_AAA41015.1:0.117358):0.040335[0.69]);

is a molecular gene tree, not a species tree, as shown by the two different OTUs from the same rat species. The data and metadata are drawn from the following table:

|*Label*|*Species*|*NCBI taxid*|*LSID*|*Accn*|*Sequence data*|
|Mus_musculus_CAA25899.1|Mus musculus|10090|http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2481174|CAA25899.1|MGDVEKGKKIFVQKCAQCHT|
|Rattus_norvegicus_AAA41015.1|Rattus norvegicus|10116|http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2481343|AAA41015.1|MGDAEAGKKIFIQKCAQCHT|
|Gallus_gallus_CAA25046.1|Gallus gallus|9031|http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:3854553|CAA25046.1|MGDIEKGKKIFVQKCSQCHT|
|Rattus_norvegicus_AAA21711.1|Rattus norvegicus|10116|http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2481343|AAA21711.1|MGDVEKGKKIFVQKCAQCHT|

---+++ Newick

*Capabilities*:
   * tree topology, branch lengths, OTU labels
   * labels on internal nodes, as in ((rat, mouse)rodent,(gorilla,human)primate)
   * multiple trees in one file (terminate with semi-colon and end-of-line)
   * by convention, bootstrap confidence values in square brackets

*Limitations*:
   * no formal syntax or semantics, only a conventional understanding
   * not extensible to allow character data, accessions, coordinates, or taxon ids (except as labels)

*File content*

((Mus_musculus_CAA25899.1:0.008307,Rattus_norvegicus_AAA21711.1:0.009662):0.024280[0.74], (Gallus_gallus_CAA25046.1:0.055226,Rattus_norvegicus_AAA41015.1:0.117358):0.040335[0.69]);

---+++ NHX (New Hampshire Extended)

*Capabilities* in addition to what Newick allows:
   * tag designated for species name (but parsers don't expect spaces)
   * tag designated for NCBI-style taxid (but not LSID-like identifier with
punctuation) * tag designated for accession (but not fully documented in format standard) * tag designated for sequence (but not fully documented in format standard) * unassigned tag designated for user-defined uses *Limitations*: * no formal syntax or semantics, only a limited format description * all tree-linked info must be embedded in the tree (no refs) * not actively or openly developed; deprecated by developer (C. Zmasek) in favor of phyloXML *File content* ((Mus_musculus_CAA25899.1:0.008307[&&NHX:S=Mus_musculus:AC=CAA25899.1:T=10090: XN=protein=NCBI=MGDVEKGKKIFVQKCAQCHT],Rattus_norvegicus_AAA21711.1: 0.009662[&&NHX:S=Rattus_norvegicus:T=10116:AC=AAA21711.1: XN=protein=NCBI=MGDVEKGKKIFVQKCAQCHT]):0.024280[&&NHX:B=0.74],(Gallus_gallus_CAA25046.1: 0.055226[&&NHX:S=Gallus_gallus:T=9031:AC=CAA25046.1:XN=protein=NCBI=MGDIEKGKKIFVQKCSQCHT], Rattus_norvegicus_AAA41015.1:0.117358[&&NHX:S=Rattus_norvegicus:T=10116:AC=AAA41015.1: XN=protein=NCBI=MGDAEAGKKIFIQKCAQCHT]):0.040335[&&NHX:B=0.69]); This view of NHX file PF00034_4.nhx was made with Archaeopteryx using the "species name" view option:
PF00034_4.jpg ---+++ NEXUS *Capabilities* in addition to what Newick allows: * trees can be named and assigned weights * extensive capacity to represent molecular or morphological character data * arbitrary notes can be assigned to OTUs, characters, states in NOTES block * extensible by means of user-defined blocks and commands * extensive format description (Maddison, et al., 1997) *Limitations*: * no formal syntax or semantics * no designated commands to denote species names, accessions or coordinates * conflicting interpretations in the user community *File content* with species names as comments (interspersed with taxlabels) and LSIDs embedded in the NOTES block: #NEXUS BEGIN TAXA; DIMENSIONS ntax=4; TAXLABELS Mus_musculus_CAA25899.1 [Mus musculus] Rattus_norvegicus_AAA21711.1 [Rattus norvegicus] Gallus_gallus_CAA25046.1 [Gallus gallus] Rattus_norvegicus_AAA41015.1 [Rattus norvegicus]; END; BEGIN CHARACTERS; DIMENSIONS nchar=20; FORMAT datatype=protein gap=- missing=?; MATRIX Mus_musculus_CAA25899.1 MGDVEKGKKIFVQKCAQCHT Rattus_norvegicus_AAA41015.1 MGDAEAGKKIFIQKCAQCHT Gallus_gallus_CAA25046.1 MGDIEKGKKIFVQKCSQCHT Rattus_norvegicus_AAA21711.1 MGDVEKGKKIFVQKCAQCHT; END; BEGIN TREES; TREE con_50_majrule = ((Mus_musculus_CAA25899.1:0.008307,Rattus_norvegicus_AAA21711.1:0.009662):0.024280[0.74],(Gallus_gallus_CAA25046.1:0.055226,Rattus_norvegicus_AAA41015.1:0.117358):0.040335[0.69]); END; BEGIN NOTES; text taxon=Mus_musculus_CAA25899.1 text='http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2481174' text taxon=Rattus_norvegicus_AAA41015.1 text='http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2481343' text taxon=Gallus_gallus_CAA25046.1 text='http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:3854553' text taxon=Rattus_norvegicus_AAA21711.1 text='http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:2481343' END; Nexplorer-generated view of NEXUS file PF00034_4.nex:
nexplorer_view.jpg ---+++ PhyloXML *Capabilities* in addition to what Newick allows: * published format description ([[http://www.biomedcentral.com/1471-2105/10/356/ Han and Zmasek]]) * formal syntax as defined in XSD schema * tags designated for accession numbers * tags for geographic coordinates * tags designated for species identifiers with a named authority * tags designated for sequence data * extensible via <property> tag reserved for user-defined properties *Limitations*: * schema focuses on molecular evolution use-cases, doesn't cover other character data * lack of idrefs leads to need to duplicate literals, prevents normalization The snippet below shows a terminal branch with explicitly tagged data and metadata 0.055226 http://www.ubio.org/authority/metadata.php?lsid=urn:lsid:ubio.org:namebank:3854553 Gallus gallus CAA25046.1 Gallus_gallus_CAA25046.1 MGDIEKGKKIFVQKCSQCHT As described (http://www.phyloxml.org/documentation/version_1.10/phyloxml.xsd.html#h-979596407), geographic coordinates also can be represented (example provided by Christian Zmasek): a clade with a distribution San Diego 32.880933 -117.217543 104 ---+++ NeXML *Capabilities* in addition to what Newick allows: * formal syntax as defined in XSD schema * designed with extensive capabilities for representing character data of various types * extensible via <meta> tag intended for external vocabularies such as * DarwinCore taxon concepts * DarwinCore geographic coordinates * Dublin Core publication data *Limitations*: * format description is not published The snippet below shows <meta>-tagged information associated with an OTU via external vocabularies If we want to put accessions in here, we have to decide what property to use. 
Other than that, it's not a problem: MGDVEKGKKIFVQKCAQCHT It's also possible to put in study identifiers and citation data, as in the following TreeBASE example (which you can retrieve via [[http://treebase.org/treebase-web/search/downloadAStudy.html?id=112&format=nexml its phyloWS URL]]): ---+++ CDAO RDF/XML not done ---++ Appendix 3: Archives ---+++ Dryad [[http://www.datadryad.org Dryad]] is a new project (established in 2009) to support data archiving for evolutionary research. The organizers of this project worked with publishers to generate agreement on the provisional data archiving policy noted above. The archive will accept text files and spreadsheet files in standard formats. Thus, users could submit a phylogenetic tree in any of the formats noted above. Whereas TreeBASE has a complex internal data model, with each submitted datum being assigned to some slot in the data model, Dryad will accept all sorts of textual information. To allow for query and retrieval, Dryad will index all of this information as text. ---++++ Uploading data to [[http://www.youtube.com/watch?v=RP33cl8tL28 Dryad]] ---++++ Results of using the [[http://www.datadryad.org Dryad]] submission process ---++++ Further information on Dryad Communicated by Todd Vision of the Dryad project:
Dryad is a general-purpose repository. It doesn't impose constraints on how data are represented within the files that users submit. The best practices need to come from elsewhere, such as journal policies, MIAPA, and community practice imposed by awareness of how the data will be reused by more specialized phylogenetic tools. Dryad just introduced a "handshaking" feature for TreeBASE. Users can elect to have a NEXUS file that is deposited to Dryad "pushed through" to TreeBASE to initiate the submission process there. So for the special case of phylogenetic data in Dryad, we would encourage having that Newick tree within a NEXUS file, together with the OTU metadata that can fit within that file format. I dream of a future in which lots of different software tools will support the editing and output of metadata-rich phylogenies in NeXML, and that TreeBASE can ingest those NeXML files. But we aren't there yet. If a user doesn't intend to use TreeBASE for whatever reason, then a Newick tree in one file and OTU metadata in a separate CSV file would be a reasonable low-tech solution, as long as the OTU identifiers were consistent between the files. A ReadMe file could also be used to provide study-level metadata.
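The low-tech option described above (a Newick tree in one file, OTU metadata in a separate CSV file) depends on the OTU identifiers being consistent between the two files. The following is a minimal sketch of such a consistency check in Python, not a tool used by Dryad or TreeBASE. The function names, the CSV column name "Label", and the inline sample data (taken from the table in Appendix 2) are assumptions made for illustration. The label extraction is deliberately naive: it strips square-bracket comments and does not handle quoted labels.

```python
import csv
import io
import re

# Sample data: tree and labels from the 4-taxon test case in Appendix 2.
NEWICK = (
    "((Mus_musculus_CAA25899.1:0.008307,Rattus_norvegicus_AAA21711.1:0.009662)"
    ":0.024280[0.74],(Gallus_gallus_CAA25046.1:0.055226,"
    "Rattus_norvegicus_AAA41015.1:0.117358):0.040335[0.69]);"
)

# Hypothetical metadata file; the column names are an assumption.
CSV_TEXT = """Label,Species,NCBI taxid
Mus_musculus_CAA25899.1,Mus musculus,10090
Rattus_norvegicus_AAA41015.1,Rattus norvegicus,10116
Gallus_gallus_CAA25046.1,Gallus gallus,9031
Rattus_norvegicus_AAA21711.1,Rattus norvegicus,10116
"""


def newick_leaf_labels(newick):
    """Return the set of leaf (OTU) labels in a Newick string.

    Naive approach: drop [bracketed] comments, then take every token that
    directly follows "(" or ",". Quoted labels are not supported.
    """
    stripped = re.sub(r"\[[^\]]*\]", "", newick)
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", stripped))


def csv_otu_labels(csv_text, column="Label"):
    """Return the set of OTU labels found in one column of a CSV table."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip() for row in reader}


def compare_otus(newick, csv_text, column="Label"):
    """Return (labels only in the tree, labels only in the CSV), both sorted."""
    tree = newick_leaf_labels(newick)
    table = csv_otu_labels(csv_text, column)
    return sorted(tree - table), sorted(table - tree)


if __name__ == "__main__":
    only_in_tree, only_in_csv = compare_otus(NEWICK, CSV_TEXT)
    print("OTUs missing from CSV:", only_in_tree)
    print("OTUs missing from tree:", only_in_csv)
```

Two empty lists indicate that every OTU in the tree has a metadata row and vice versa. Any label that appears in only one list points to a naming mismatch that should be resolved before deposit.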
---+++ TreeBASE TreeBASE is a repository for trees that has been in operation for many (how many? jsw) years. In the past few years, the schema was redesigned, and there have been numerous upgrades to the user interface, including a sophisticated submission process and a web services API to retrieve results via a URL. ---++++ Uploading a tree to [[http://www.treebase.org TreeBase]] The TreeBASE website provides detailed instructions for submitting data. We obtained further information in a teleconference (9/29/10) with Bill Piel, who described the process as follows: 1. Use Mesquite to prepare the document before uploading to [[http://www.treebase.org TreeBase]] * Why? Because 1) TreeBASE and Mesquite use the same Java API for parsing NEXUS; and 2) this API is a relatively complete and robust implementation of the standard 1. In Mesquite, it is best to combine matrix and tree in the same file to ensure matching names 1. Ensure taxon names are written out in full as binomial or trinomial * If there are infraspecies, just write the triplet without ‘var’, ‘subsp’ etc * What if there are multiple specimens for the same taxon? Each name must be unique, so make sure the specimen ID etc, is a suffix formatted with a leading capital or a number so [[http://www.treebase.org TreeBase]] won’t treat it as a new taxon name 1. After upload, click on yellow taxon button. Then click <validate taxon labels>. TreeBASE tries to match up labels with existing taxon names. If there is no match, it checks uBio. If a name may be a homonym, you will be asked to choose which taxon to link to. NCBI handles the homonyms. [[http://www.treebase.org TreeBase]] will link taxon names to a GenBank taxid if possible. 1. Create an analysis record to link the matrices to the trees. 1. Linking to specimen IDs (e.g., GenBank accession) is done by setting attributes of rows in the matrix: 1. After uploading matrix click <download rowsegment template>. There is a list of row labels to populate. 
You can enter Darwin Core information about the specimen. 1. There is a bug here: if some rows are populated for a given column, all rows must be populated for that column. There is an error if left blank. To work around this bug, just put something there, such as a dash ("-"). 1. You could apply this metadata to just a part of the alignment. See below for notes on how much of this metadata is included in [[http://www.nexml.org NeXML]] output. Currently there is no way to attach metadata to the tree nodes individually. ---++++ Results of using the [[http://www.treebase.org TreeBase]] submission process To assess the TreeBase submission process, we uploaded files with OTU labels that contain species names. These were recognized >90% of the time by TB once the "validate taxon labels" button is pressed (which prompts the question of why TB doesn't suggest these automatically and simply ask the user to confirm). One of us used the "row segment table" interface to annotate a submission with GenBank accessions. One of us (AS) worked with Dr. Martin Wu to submit data from a large, recently published analysis of prokaryotic phylogeny (Wu, et al., 2010). The data consist of a 720-taxon tree, a 6309-column alignment, and metadata (citation data, analysis methods) added interactively during the submission process. Prior to submission, AS spent several hours generating matching labels so that the separate alignment and tree files (initially with non-matching names) could be combined in Mesquite or Bio::NEXUS. This is a common stumbling block in phylogenetics workflows. Dr. Wu spent an hour on the submission process itself, though this stretched out over several weeks while a syntax issue due to differing interpretations of NEXUS was resolved via email, with help from Dr. 
Piel (initially, we encoded names as 'Genus_species_strain', based on the equivalence of spaces and underscores in NEXUS names; however, protecting the underscores within a single-quoted phrase prevented them from being treated as spaces by the TreeBASE NEXUS parser). When this minor syntax issue was resolved, TreeBASE automatically matched all 720 OTU names to qualified species names. The report was submitted and now appears as TreeBASE study [[http://www.treebase.org/treebase-web/search/study/summary.html?id=10965 10965]]. Before submitting to TreeBASE, Dr. Wu had been contacted with requests for the data 3 times in the 11 months since the paper was published. Dr. Wu reports that making the submission to TreeBASE was "definitely worth it". The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows how a user may assign a UBio Id to an OTU (and it also shows that _TreeBase correctly guesses the actual species_):
tb2_taxlabel_editor_screenshot.jpg The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows a taxon table with match-able names. The 3 lower rows show OTUs whose names were auto-matched already. Pressing the "Validate taxon labels" button will automatically apply the results of name-matching, which in this case gives the correct attributions:
tb_taxa_table_screenshot.jpg ---++++ Examples of metadata in NeXML output Not all of the metadata stored internally at TreeBase is exported in standard exchange formats. However, some of these metadata are exportable in NeXML. For example, look at this: http://purl.org/phylo/treebase/phylows/matrix/TB2:M5212?format=nexml which shows geographic coordinates and taxon identifiers, like this: Unfortunately the GenBank accession numbers are not yet included, pending a decision (by the TreeBASE and NeXML developers) on how to represent these. ---++ Appendix 4: Survey and user feedback. Release of the initial report will be coordinated with release of a survey ([[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ preliminary draft]]) developed by the MIAPA group. Planning for the survey is described here: http://www.evoio.org/wiki/MIAPA_Survey The survey is a Google spreadsheet with a web-form interface. Users respond to the form, and their responses are entered automatically into the spreadsheet. The MIAPA survey team will test, revise, and deploy the survey. The survey team will analyze the results of the survey and follow up on queries from respondents. 
-- Main.ArlinStoltzfus - 28 Oct 2010 %META:FILEATTACHMENT{name="tb2_taxlabel_editor_screenshot.jpg" attachment="tb2_taxlabel_editor_screenshot.jpg" attr="" comment="TreeBase2 screenshot (cropped) showing how to assign a UBio Id to an OTU" date="1285946754" path="tb2_taxlabel_editor_screenshot.jpg" size="56373" stream="tb2_taxlabel_editor_screenshot.jpg" user="Main.ArlinStoltzfus" version="2"}% %META:FILEATTACHMENT{name="tb_taxa_table_screenshot.jpg" attachment="tb_taxa_table_screenshot.jpg" attr="" comment="TreeBase2 screenshot (cropped) showing taxon table with match-able names" date="1285944819" path="tb_taxa_table_screenshot.jpg" size="147585" stream="tb_taxa_table_screenshot.jpg" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="nexplorer_view.jpg" attachment="nexplorer_view.jpg" attr="" comment="Nexplorer-generated view of NEXUS file PF00034_4.nex" date="1285944984" path="nexplorer_view.jpg" size="63077" stream="nexplorer_view.jpg" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="PF00034_4.jpg" attachment="PF00034_4.jpg" attr="" comment="Archaeopteryx-generated view of NHX file PF00034_4.nhx" date="1285945026" path="PF00034_4.jpg" size="8131" stream="PF00034_4.jpg" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="PF00034_4.nwk" attachment="PF00034_4.nwk" attr="" comment="4-taxon test case in Newick format" date="1286220953" path="PF00034_4.nwk" size="177" stream="PF00034_4.nwk" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="PF00034_4.nhx" attachment="PF00034_4.nhx" attr="" comment="4-taxon test case in NHX format" date="1286220969" path="PF00034_4.nhx" size="529" stream="PF00034_4.nhx" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="PF00034_4.nex" attachment="PF00034_4.nex" attr="" comment="4-taxon test case in NEXUS format" date="1286291359" path="PF00034_4.nex" size="1283" stream="PF00034_4.nex" user="Main.ArlinStoltzfus" version="2"}% 
%META:FILEATTACHMENT{name="PF00034_4_phylo.xml" attachment="PF00034_4_phylo.xml" attr="" comment="4-taxon test case in phyloxml format" date="1286221008" path="PF00034_4_phylo.xml" size="2517" stream="PF00034_4_phylo.xml" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="PF00034_4_nexml.xml" attachment="PF00034_4_nexml.xml" attr="" comment="4-taxon test case in nexml format" date="1286293534" path="PF00034_4_nexml.xml" size="10675" stream="PF00034_4_nexml.xml" user="Main.ArlinStoltzfus" version="2"}% %META:FILEATTACHMENT{name="Matrix.xls" attachment="Matrix.xls" attr="h" comment="" date="1295324981" path="Matrix.xls" size="26624" stream="Matrix.xls" user="Main.JamieW" version="1"}% %META:FILEATTACHMENT{name="Matrix_Capabilities.png" attachment="Matrix_Capabilities.png" attr="" comment="Capabilities Matrix" date="1295326233" path="Matrix_Capabilities.png" size="51901" stream="Matrix_Capabilities.png" user="Main.JamieW" version="1"}% %META:FILEATTACHMENT{name="Matrix_Limitations.png" attachment="Matrix_Limitations.png" attr="" comment="" date="1295327188" path="Matrix_Limitations.png" size="31607" stream="Matrix_Limitations.png" user="Main.JamieW" version="2"}% %META:FILEATTACHMENT{name="Capabilities_Limitations_Matrix.xls" attachment="Capabilities_Limitations_Matrix.xls" attr="" comment="" date="1295327342" path="Capabilities_Limitations_Matrix.xls" size="20992" stream="Capabilities_Limitations_Matrix.xls" user="Main.JamieW" version="1"}% %META:FILEATTACHMENT{name="example.jpg" attachment="example.jpg" attr="" comment="Image of an uninformative tree" date="1295634935" path="example.jpg" size="4435" stream="example.jpg" user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="phylo_file_format_comparison.gif" attachment="phylo_file_format_comparison.gif" attr="" comment="Image of table comparing file format features" date="1295901298" path="phylo_file_format_comparison.gif" size="43911" stream="phylo_file_format_comparison.gif" 
user="Main.ArlinStoltzfus" version="1"}% %META:FILEATTACHMENT{name="phylo_file_format_comparison.xls" attachment="phylo_file_format_comparison.xls" attr="" comment="Table comparing file format features (Excel)" date="1295901322" path="phylo_file_format_comparison.xls" size="17408" stream="phylo_file_format_comparison.xls" user="Main.ArlinStoltzfus" version="1"}% @ 1.79 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="HilmarLapp" date="1312152170" format="1.1" version="1.79"}% d3 2 @ 1.78 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297868113" format="1.1" version="1.78"}% d7 3 a9 3 * Jamie Whitacre, * Dan Rosauer, * Torsten Eriksson, d149 1 a149 1 As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps. d158 1 a158 1 * NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation) - a full featured but dated format d269 1 a269 1 *A reporting standard for a phylogenetic analysis can only extend the re-usability of archived trees*. As noted above, there is no standard governing archiving. A minimal reporting standard for phylogenetic analysis has been suggested (see MIAPA, Appendix 1) but not drafted or approved. Establishing such a standard is a critical step in promoting re-useable trees. 
To develop a standard requires community organization as well as technological support (some guidance is provided by a [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper MIAPA whitepaper]] from the NESCent EvoInfo working group). d658 1 a658 1 http://www.evoio.org/wg/evoio/MIAPA_Survey @ 1.77 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297699022" format="1.1" version="1.77"}% d656 1 a656 1 Once the initial report is released, a survey ([[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ preliminary draft]]) will be sent to scientists. This appendix is to summarize the plan for the survey, as well as the feedback from the survey, the "comment" box on this page, and any other comments received. d658 1 a658 1 The survey is a Google spreadsheet with a web-form interface. Users respond to the form, and their responses are entered automatically into the spreadsheet. d660 1 a660 30 A MIAPA survey team will be assembled to test, revise, and deploy the survey. The survey team will analyze the results of the survey and followup on queries from respondents. ---+++ Development and testing The survey was developed in November by Arlin and benefitted from feedback from members of the MIAPA-discuss email list. Bill Piel, Karen Cranston, and others were given edit permissions and added some questions. The survey has not been user-tested. ---+++ Deploying the survey The survey will be disseminated electronically, with an electronic announcement and a link to the survey. The target audience is actual or potential producers and scientific users of phylogenetic trees. We assume that the typical target is a scientist who produces phylogenies for a specific purpose linked to a scientific study. We are not attempting to cover educational uses. 
To reach the target audience, we plan to use the following: * email list servers * evoldir - thousands of evolutionary biologists * tdwg - hundreds of members of the biodiversity information standards organization * lists from past NESCent phyloinformatics activities (wg-evoinfo, wg-phyloinformatics, phylo-vocamp1) * project lists (nexml-discuss@@sf, cdao-discuss@@sf, phylows@@googlegroups.com * syst biol twitter feed (@@systbiol) * the tolweb curator list, via Katja Schulz (editor) * PIs on tree-of-life grants (Jim Leebens-Mack) * ecolog (Hilmar) * iPlant? (jsw) ---+++ Analyzing the survey The survey team will analyze the survey results. ---+++ Other feedback @ 1.76 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297358999" format="1.1" version="1.76"}% d5 6 @ 1.75 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297280525" format="1.1" version="1.75"}% d652 1 a652 1 ---+++ Survey target audience d654 1 a654 1 The survey is aimed at actual or potential producers and users of phylogenetic trees. We assume that the typical target is a scientist who produces phylogenies for a specific purpose linked to a scientific study. d656 18 a673 5 To reach these scientists with an electronic announcement and a link to the survey, we are targeting the following: * evoldir list * tdwg list * email lists from past NESCent phyloinformatics activities (wg-evoinfo, wg-phyloinformatics, phylo-vocamp1) * project lists (nexml-discuss@@sf, cdao-discuss@@sf, phylows@@googlegroups.com d675 1 a675 1 * tolweb curator list a678 2 * another target * another target d680 1 a680 1 ---+++ Survey feedback d682 1 d684 1 @ 1.74 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297188275" format="1.1" version="1.74"}% a6 4 ---++ To do (Feb 7, 2011) 1. Arlin will smooth out executive summary (just "Summary"?) 
@ 1.73 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297183274" format="1.1" reprev="1.73" version="1.73"}% a8 4 1. Arlin (and Jamie) insert some examples of re-use * studies that cite TreeBase * example of re-use of tree in Puerto Rican plot * add point about the biggest trees being hardest to produce, most likely to re-use d33 2 a34 2 Our analysis suggests the following conclusions and recommendations: * The infrastructure to support simple archiving of phylogenetic trees is largely in place already d105 1 a105 1 There are many uses for identifiers. To integrate data from diverse sources, we need to have some kind of integrating variable such as a species name, a specimen number, etc. If we wish to integrate data from all over the world using names for things, then the names should be unique over all the world. By contrast, the tree example above invokes entities "otu1", "otu2" and "otu3". If we integrated data from all over the world using names like "otu1" and "otu2", we would surely make mistakes by aggregating information that does not belong together. d114 1 a114 1 So then, what kind of information makes a tree re-useable? What are the integrating variables one would use to integrate or aggregate trees? Imagine that we have a data-mining tool with access to all published trees, richly annotated. Our challenge is to use this database to reveal prior work on a topic, to test a hypothesis, to discover new relationships, or to carry out a meta-analysis addressing a methodological issue. In this context, useful types of data or metadata would include: d123 1 a123 1 Computable knowledge representation is largely a matter of relationships between entities that can be expressed as subject-predicate-object triples, i.e., "Bob has_friend Susan" or "Susan is_a female_person". 
By joining these two statements via the identity of Susan, we can answer the question of whether Bob has any female friends, even though neither statement alone tells us this. To establish the relationship of identity we may rely on GUIDs (above). In the above case, we would need to know that the entity called "Susan" in the two statements is really the same thing. To aggregate data on species occurrence, we need to know if a report of a bird in location X and a second report of a bird in location Y refer to the same species of bird. d125 1 a125 1 While the concept of identity is easy to understand and universal, many other types of relationships require definition. To represent the kinds of annotations that make phylogenies suitable for re-use-- citations, taxonomic links, provenance information, georeferences, methods descriptions-- requires language support. For instance, Dublin core (ref) provides a metadata standard for documents, providing terms for assigning authorship, title, and so on. The Open Provenance Model Vocabulary Specification (http://open-biomed.sourceforge.net/opmv/ns.html) provides a term "wasDerivedFrom", such that, having derived "tree1" from "alignment1", we could annotate this relationship with the statement d127 5 a131 1 tree1 http://purl.org/net/opmv/ns#wasDerivedFrom alignment1 a138 1 d141 1 a141 1 *Biodiversity Information Standards (TDWG).* Biodiversity Information Standards (TDWG) is a not for profit scientific and educational association that is affiliated with the International Union of Biological Sciences. It was formerly known as the Taxonomic Database Working Group. d143 1 a143 4 TDWG was formed to establish international collaboration among biological database projects. TDWG promoted the wider and more effective dissemination of information about the World's heritage of biological organisms for the benefit of the world at large. 
Biodiversity Information Standards (TDWG) now focuses on the development of standards for the exchange of biological/biodiversity data. Its current mission is to: * Develop, adopt and promote standards and guidelines for the recording and exchange of data about organisms * Promote the use of standards through the most appropriate and effective means and * Act as a forum for discussion through holding meetings and through publications d145 1 a145 1 TDWG has and promotes standards, but it does not carry the regulatory weight or enforcement power like NSF and publishers. Darwin Core and LSIDs are TDWG-approved standards. Research organizations look to TDWG for standards and benchmarks. (update jsw) d147 1 a147 3 ---+++ Reporting Standard Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting requirement yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). d149 2 a150 1 As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). 
A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps. d167 1 a167 1 In the past few years, four different XML formats have become available, though none is in widespread use. The main developer of NHX format, Christian Zmasek, went on to develop phyloXML, a validatable format to represent a greater range of attributes than NHX. PhyloXML has an economical schema tuned to the needs of molecular phylogeneticists. The BEAST package has used an XML input format for several years, but it is not considered further here because it is not used to export trees (BEAST outputs trees in NEXUS format). Likewise, while it is possible to encode comparative data in terms of CDAO (Comparative Data Analysis Ontology), and serialize this as RDF-XML, this is not the recommended use of CDAO. NeXML is an XML format with a precisely defined schema, modeled after the structure of NEXUS. While the design of phyloXML takes a very direct approach to satisfying user needs, NeXML opts for greater generality at the expense of a much more complex schema. It has an approach to metadata that allows for arbitrary annotation of data objects using external vocabularies. d175 1 a175 1 Researchers wishing to make a phylogenetic tree available in a public archive currently have two options, TreeBASE and Dryad. Tolkin & morphobank? (jsw) d185 3 d192 1 a192 1 As a general practical matter, NeXML and CDAO together provide an approach to metadata that is open-ended. NeXML is designed in a way to take advantage of external vocabularies. 
So, if there is a way to say something by invoking the terms of an external vocabulary, it can be said within NeXML (and this is the advantage of the NeXML metadata approach). The Comparative Data Analysis Ontology (CDAO) provides language support for many aspects of comparative analysis, though it has not been tested extensively and remains experimental. The concept of a phylogenetic tree could be designated by reference to CDAO:Tree, which in turn defines a tree as a sub-class of "Network", and in relation to other concepts like "Branch" and "Node". Currently there is an experimental branch of CDAO that imports terms for a variety of computation concepts like "multiple sequence alignment program" as well as specific versions (e.g., "ClustalW"). d196 1 a196 1 *Representing authorship and citations* Citing a journal publication is a key issue. However, there is also an issue of the authorship of an electronic document (e.g., tree file), which may be distinct from a publication. Dublin core, a metadata standard for documents, provides language for authorship, creation dates, copyright, licenses and so on. However, it lumps journal-volume-issue-page into "citation". PRISM, which builds on Dublin core, may provide a better alternative for referencing the scientific literature. Are there other standards more appropriate than these? d198 1 a198 1 *Linking to taxonomic concepts* Sources such as UBio provide LSIDs for taxa (taxon concepts). This solves one part of the problem. The other part is the predicates that link OTU or comparative data to a species source. Does DarwinCore provide an appropriate source of predicates? Another application is to associate internal nodes of a tree with higher taxa. Again, its not clear what is the source of predicates for this. d200 1 a200 1 *Provenance* A key issue in annotating a phylogeny with character data is to indicate the source of data or specimens. For molecular sequences, we typically want a GenBank accession. 
PhyloXML has a tag for that, but in other cases, it's not clear which predicate to use, especially since an aligned molecular sequence may be derived by truncation from a GenBank source. In this case, the Open Provenance Model has some general predicates such as wasDerivedFrom. A tree can be annotated as having been derived from an alignment, and this alignment can be annotated, in turn, as being derived from individual sequences with GenBank sources. Another case is that in which we wish to associate data with a specimen that has a museum accession. Does DarwinCore provide an appropriate source of predicates? d202 1 a202 1 *Georeferences* TDWG TaxonOccurrence (http://rs.tdwg.org/ontology/voc/TaxonOccurrence) seems to address this. In phyloXML, there are pre-assigned tags, but it's not certain precisely what they mean. d204 1 a204 1 *Methods* As indicated in the MIAPA paper (Leebens-Mack et al.) and in the TreeBASE submission protocol, researchers dealing with molecular data consider methods to be an important component of metadata. The Open Provenance Model, mentioned above, provides some generic concepts that would be useful. However, in spite of the potential of CDAO mentioned above, there seems to be a major gap between what is available and what is needed to annotate the complex multi-step user-assisted workflows used by a scientific researcher to generate a phylogeny product. A recent LIMS plug-in for the Geneious software allows for tracking workflows and alignment annotations (see http://software.mooreabiocode.org). d206 4 d212 5 a216 2 Even if journal policies encourage archiving and standards that guide representation of data, scientists motivated to archive or re-use data may find it difficult to do so if appropriate tools are lacking.
Some of the obvious kinds of tools to support archiving of richly annotated data sets are: * format validators and translators that support data and metadata d221 1 a221 1 * manipulating data and metadata (e.g., extracting subsets or subtrees) while respecting rules of logic d225 1 a225 1 For the purposes of this report, we did not carry out an extensive analysis of available tools. Our initial impressions are that there is a deficit of tools supporting archiving and reuse of phylogenetic data. d227 1 a227 1 For instance, many phylogenetics users implement a customized interactive workflow that relies on diverse software tools. Because these tools may use different formats, the ability to convert among formats is important. However, convenient generalized tools are lacking (the NeXML manual provides a useful list of online servers and scripting approaches). The Geneious software program meets some of these needs (expand jsw). Likewise, there are many tools for viewing trees, but it seems that only a few allow for viewing trees together with a matrix of data (e.g., Archaeopteryx, Mesquite, Nexplore). Support for viewing metadata is very limited. The TreeBASE submission server is an example of a tool that allows users to annotate data sets by associating OTUs with species names, and by associating data rows with accessions. However, this tool only works in the context of a TreeBASE submission. d231 1 a231 1 The analysis of citations by Kumar & Dudley (2007) suggests that the number of phylogeny publications in 2006 was 7000, and the rate of phylogeny publications is rapidly increasing. Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research but only a subset is actually made available through publications and electronic media. d238 1 a238 3 * [[http://datadryad.org Dryad]] * Tolkin (? 
jsw) * Morphobank (? jsw) d247 5 a251 1 *(not done)* d253 1 a253 1 Needs intro paragraph (jsw). Therefore we address conclusions and recommendations below as they pertain to archiving and linking, separately or collectively. d255 1 a255 3 ---+++ Archiving and linking (lack of translation technology is a key issue here) The needs of archiving are not the same as those of publishing linkable, re-usable data. In a typical case of archiving, even the simplistic Newick format can be used to represent a phylogeny with metadata if there is a combination of 1) a Newick string with unique identifiers for each internal and external node; and 2) an entity-attribute-value table assigning attribute values to nodes, as in the table used in Appendix 2. While this combination may ensure that key information is available in a record, it does not make study replication any easier, and it does little to facilitate re-use, re-purposing, aggregation, or linking of the tree. BEAST users can support study replication by archiving their BEAST XML file, which includes the input data along with precise instructions for processing. This is a perfectly adequate solution for archiving. However, the information in the BEAST XML file provides instructions that only BEAST can understand, and is not anchored by semantics defined in external computable vocabularies. d257 3 a259 2 ---+++ Archiving *The infrastructure to support archiving of phylogenetic trees is largely in place (this is confusing jsw)*. (Data models and data formats are different animals jsw) TreeBASE and Dryad may serve as repositories to ensure that phylogenetic tree information associated with a publication is recoverable many years into the future. TreeBASE imports data into its own data model. In the case of Dryad, commonly used file formats are sufficient for archiving purposes. 
However, it is preferable to use formats that can be validated relative to a schema, because it is more likely that parsers for these formats will continue to be available in the long-term. a260 2 *While archiving trees is possible, the extent to which current policies require it is unclear*. (some new concepts not introduced previously jsw). The draft policy suggested recently by journals (Appendix 1) refers to "data supporting the results" of a publication. It is not clear what this is intended to cover. Natural scientists traditionally use the term "data" as a synonym for "facts", the empirical observations or measurements on which further analysis rests. By this empirical definition, phylogenies are not data. Computer and information scientists use "data" (and in some sub-fields, even "facts") to refer to any kind of recorded information, regardless of its nature or derivation. The NIH data sharing policy (see note 7 of [http://grants.nih.gov/grants/policy/nihgps/fnpart_ii.htm]) makes clear that NIH uses the informational definition of "data", not the empirical one. We recommend that institutions with data archiving policies be explicit about what they mean by "data". d263 1 a263 1 Currently the gap between needs and capacities is much greater for publishing re-usable trees than for the problem of archiving trees. One conclusion: d265 1 a265 1 *A reporting standard for a phylogenetic analysis will greatly extend the re-usability of archived trees*. A minimal reporting standard for phylogenetic analysis has been suggested (see MIAPA, Appendix 1) but not drafted or approved. Establishing such a standard is a critical step in promoting re-useable trees. To develop a standard requires community organization as well as technological support (some guidance is provided by a [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper MIAPA whitepaper]] from the NESCent EvoInfo working group). d267 3 a269 1 *Issues include:* d274 1 a274 1 1. 
Lack of software support for annotation d283 1 d295 2 d307 1 a307 4 Sidlauskas, B., G. Ganapathy, E. Hazkani-Covo, K. P. Jenkins, H. Lapp, L. W. McCall, S. Price, R. Scherle, P. A. Spaeth, and D. M. Kidd. 2010. Linking big: the continuing promise of evolutionary synthesis. Evolution 64:871-880. Zmasek, C. M. 2009. phyloXML: XML for evolutionary biology and comparative genomics. @ 1.72 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297114281" format="1.1" version="1.72"}% d76 1 a76 1 While these factors represent incentives or pressures to "push" data out into the world, there is simultaneously an increasing "pull" from the promise of large-scale studies that aggregate and re-purpose data. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists and stored in these archives. d81 3 a83 1 In the economy of data sharing, there are thousands of phylogeny producers (Kumar and Dudley, 2006), but few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's. Until recently, for most authors, publishing a phylogenetic tree meant publishing a picture of a tree in a journal article (figure)-- an informational dead-end. However, conditions are changing to favor archiving and re-use of trees: d90 3 a92 1 (now provide examples of phylogeny re-use, including specific examples with citations. Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework.) (send e-mail to dpt jsw) (what about source data? jsw) d94 5 a98 5 Thus, an analysis of best practices for publishing re-useable trees is timely. To address this issue, we must begin with some notion of what makes a tree re-usable. 
In considering what makes it likely for a tree to be re-used in study replication, meta-analysis, aggregation, or integration, we draw guidance from the 2008 roadmap of the TDWG Technical Architecture Group (TAG), and two recent commentaries, by Sidlauskas, et al (2010) on synthetic approaches to evolutionary analysis, and by Patterson, et al (2010) on the importance of names. Together, these resources suggest the importance of * globally unique identifiers (GUIDs) * validatable formats (especially XML) * formal language support (ontologies) * annotations ("metadata") d106 1 a106 1 Newick is the simplest of several data formats that are used to represent trees (see Appendices 1 and 3). Some of these formats can be validated, i.e., every valid instance of the type of file conforms to its abstract schema. (this is not clear- jsw) d109 1 a109 1 The example above invokes entities "otu1", "otu2" and "otu3", and implicitly defines a tree entity. To promote re-use and linking of data, it is essential to have globally unique identifiers or GUIDs for such entities so that they can be referenced unambiguously. Otherwise, when aggregating data from several trees, we might mistakenly combine "otu1" with another different entity also called "otu1". Using GUIDs ensures that when we refer to a thing, regardless of context, we know what it is. For instance, perhaps otu1 corresponds to the species ''_Canis latrans_'' (coyote). In that case, we can provide a Life Science Identifier or LSID (urn:lsid:ubio.org:namebank:2478093), which is a kind of GUID, and this LSID will make it possible for the researcher to associate the entity with information on ''_Canis latrans_'' available in resources such as the Encyclopedia of Life. If otu1 is a gene sequence, then an http URI for its NCBI accession can serve as a GUID, and this will make it possible for any subsequent researcher to associate "otu1" with the underlying sequence data. 
d111 1 a111 1 In TreeBase, each tree, data matrix, and study receives a GUID. NameBankIDs and NCBI taxid, and GenBank accession number for sequences. (jsw) d113 1 a113 2 1. How can users use GUIDs and LSIDs to make data more interoperable? 1. What are the opportunities for future use of GUIDs and LSIDs? d116 1 a116 1 While the Newick tree above is in a standard format and could be archived in the Newick format, it remains a relatively useless scientific tool because we do not know what it refers to or how it was derived. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree because we can't tell whether it is a species tree (relevant to speciation), or some other kind of tree (irrelevant). d118 2 a119 1 So then, what kind of information makes a tree re-useable? Imagine that we have an archive of all published trees, richly annotated, and that these trees have been loaded into a database that provides tools to thoroughly mine the data. Our challenge is to use this database to reveal prior work on a topic, to gather data to test a hypothesis, to discover new relationships among types of data, or to carry out a meta-analysis addressing a methodological issue. In this context, useful types of data or metadata would include: d121 1 a121 1 * taxonomic links and species identifiers for OTUs (e.g., to find all studies relevant to the group "rodents") a122 1 * links to data from which the tree was inferred (e.g., to combine data into a supermatrix) d286 2 @ 1.71 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1297098613" format="1.1" version="1.71"}% d7 1 a7 1 ---++ Revision Plan (Jan 10, 2011) a8 7 to do list 1. Jamie add section on GUIDs * note that treebase provides http-uri guids for submitted studies * what about Dryad-- does it provide guids for studies, files? 
* keep the focus narrow, on how users can use GUIDs *now* to make data more interoperable * explain context, stress tentative nature of report * align it with rest of report d15 2 a16 2 ---++ Executive Summary *(not done)* d54 3 a56 1 A major National Research Council (NRC) report on "A New Biology for the 21st Century" (2009; http://www.nap.edu/catalog.php?record_id=12764) suggests enormous potential for biological discovery based on aggregating and integrating data from diverse sources and from multiple disciplines. More specifically, recent commentaries (e.g., Sidlauskas et al., 2010; Patterson et al., 2010) suggest the possibility that archiving and re-use of phylogenetic trees and biodiversity data will soon take off at a pace not seen before. At the same time that funding agencies, publishers, and scientific culture are shifting in ways that incentivize sharing of data, including phylogenetic trees, new technologies and standards are emerging that make phylogenetic methods and results more interoperable. This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. d58 1 a58 1 What standards and technologies _will_ allow scientists to take advantage of phylogenies for "A New Biology"? We aren't entirely sure, but we assume that the first step in addressing this question is to understand where we are right now. For this reason, we are undertaking a project to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging the use and further development of practices that make trees interoperable. We aim to identify strengths and weaknesses in the current infrastructure, practices, and policies; to educate tree producers about the needs of tree users; and to educate users about the needs of tree producers.
The scope of this project extends, in principle, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc) where the archiving and re-use of trees is of interest to scientists. d67 1 a67 1 This document reports the results of our preliminary assessment. Immediately upon releasing this report, we will disseminate a survey to thousands of scientists in order to obtain feedback and assess current practices. After analyzing this feedback, we intend to expand this preliminary report into a manuscript for publication. We invite those willing to make a commitment of work to join in this project. d72 1 a72 1 The circumstances of publishing and data reuse have changed radically in the past few decades. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form. For instance, the record of a 3-dimensional protein structure contains roughly 10^4 coordinates (each a floating-point number) per domain. To facilitate archiving and re-use of such data, crystallographers collaborated in 1971 to launch the Protein Database (PDB), which is still the world's premier archive for 3D protein structures. In 1982, just 5 years after the discovery of DNA sequencing methods, GenBank was launched to archive the DNA sequences that were crowding the pages of journals. In both cases, editorial boards of relevant scientific journals quickly decided to require simultaneous archiving (of 3D structures in PDB; of DNA sequences in GenBank), so that data would be accessible to all scientists upon publication. d74 1 a74 1 Meanwhile, principled reasons to promote data sharing have exerted an increasing influence over institutions and institutional policies. Professional associations, publishers, and funding agencies recognize that "availability of the data underlying published scientific findings is essential to a healthy scientific process" (insert citation-jsw) (see Appendix 1). 
Funding agencies increasingly recognize that work done on behalf of the public, especially if it is funded by taxpayers, should be accessible to the public without restriction. d76 1 a76 1 While these factors represent incentives or pressures to "push" data out into the world, there is simultaneously an increasing "pull" from the promise of large-scale studies that aggregate and re-purpose data. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists. d81 2 a82 2 Until recently, for most authors, publishing a phylogenetic tree results in an informational dead-end: a picture of the tree embedded in a journal article (figure). In the economy of data sharing, there are thousands of phylogeny producers, but few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's. However, conditions are changing to favor archiving and re-use of trees: * In early 2010, eight journals in evolution and systematics announced plans to implement a data-archiving policy (see Appendix 1); d84 1 a84 1 * A new data archive, Dryad, began accepting data from ecological and evolutionary studies, including phylogenetic trees in 20XX? (jsw); @ 1.70 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1296508556" format="1.1" version="1.70"}% d3 2 a4 1 ---+ Current Best Practices for Publishing Trees Electronically: Preliminary Report and Request for Comments d6 1 a8 2 Jamie and Arlin plan to finish a draft report by January 24. The goal is to have a report of 5 to 10 pages (not counting appendices) that will accompany the survey. The survey is for researchers actively using phylogenies, so its ok to rely on technical language. a21 14 done: * AS will flesh out "Language support (ontologies and vocabularies)". 
* JW will write sections Analysis- Policies and Analysis- Standards (Computable Concepts) (Wednesday) * AS will write sections Analysis: Current Practices (Wednesday) * AS will add new sections according to outline (Monday) * AS will send draft of format comparison table from NeXML manuscript (Monday) * JW will create matrix of capabilities and limitations * AS will write Analysis: Archives and Analysis: Tools * JW will present the report and get feedback (Thursday) * AS and JW will talk on Thursday Jan 13 at 3:00 * AS add to gaps or tools section, ability to compare trees * AS *shorten* and rewrite rationale and background * Jamie update description of journal policies to reflect 2011 @ 1.69 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1296088560" format="1.1" reprev="1.69" version="1.69"}% d89 1 a89 1 In the life sciences, as in other fields, re-use of data (e.g., to verify or build on a published result) is crucial to the progressive and self-correcting nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. By publishing, authors made an implicit or explicit pledge to share data (and materials) upon request, but publishers had little power to enforce such promises. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding data, rather than sharing data, an attitude that remains common in some fields (ref: Piwowar). d91 1 a91 1 This is no longer the case in many fields, due to various factors involving supply and demand, technology, and institutional policies. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form. 
For instance, the record of a 3-dimensional protein structure contains roughly 10^4 coordinates (each a floating-point number) per domain. To facilitate archiving and re-use of such data, crystallographers collaborated in 1971 to launch the Protein Database (PDB), which is still the world's premier archive for 3D protein structures. In 1982, just 5 years after the discovery of DNA sequencing methods, GenBank was launched to archive the DNA sequences that were crowding the pages of journals. In both cases, editorial boards of relevant scientific journals quickly decided to require simultaneous archiving (of 3D structures in PDB; of DNA sequences in GenBank), so that data would be accessible to all scientists upon publication. d95 1 a95 1 Finally, the scientific potential for data aggregation and integration has led to a demand for data represented in a formal way that computers can understand. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists. In fact, recent research has shown that making data available in public archives increases citations (ref: Piwowar). d98 1 a98 1 Presumably, all of the same principles will apply to archiving and re-use of phylogenies. Until recently, however, for most authors, publishing a phylogenetic tree results in an informational dead-end: a picture of the tree embedded in a journal article (figure). This is understandable and acceptable to the extent that research projects produce trees as final end-products, with each tree being unique and unlikely to be re-usable. In other words, in the economy of data sharing, there are thousands of phylogeny producers, but few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's. 
d100 1 a100 1 However, conditions are changing to favor archiving and re-use of trees: @ 1.68 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1296071267" format="1.1" reprev="1.68" version="1.68"}% d38 1 a38 1 An assessment of best practices for publishing phylogenetic trees is timely given recent decisions by many journals to require the archiving of trees. However, even without that justification, several longer-term trends favor an increased emphasis on richly annotated re-usable trees that can be linked to other data: the opportunities for phylogeny re-use are greater; there are new opportunities for aggregation and integration; and both specific and general technologies have emerged that make sharing and re-using trees easier. This report summarizes an as-yet-incomplete project to perform this assessment and, in turn, suggest solutions for meeting recommendations and filling gaps in the current landscape. d40 1 a40 1 The motivation for the report is that it will encourage the use, and further the development, of data management practices that will benefit scientists individually and collectively. Archiving of results post-publication seems to benefit the scientific community, and to benefit the individual archiving scientist in the form of increased recognition. A less speculative motivation is that developing the capacity to manage richly annotated yet interoperable data benefits scientists (individually and collectively) by making it easier to carry out integrative, automated, or large-scale projects. d42 1 a42 1 This report represents the first step in a larger analysis. Following release of this initial report, we will disseminate a survey broadly, carry out further analysis, and write a more complete report. 
For the present purposes, we have conducted an initial general assessment of d47 1 a47 1 In several other areas, we offer comments and call for more extensive analysis: d52 1 a52 1 In addition, we have studied the submission process of TreeBASE, and evaluated the capacity of various file formats to represent specific kinds of metadata (annotations) deemed likely to increase the capacity for research results to be discovered, interpreted, linked (to other data) and re-used: d64 1 a64 1 * The gap between needs and capacities is much greater for the problem of publishing re-usable trees than for simple archiving d66 1 a66 1 Archiving of trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad). However, the archival value of many trees will be limited without a shift in emphasis toward re-useability, along with technology and standards to support such a shift. While making trees archival is an important step forward for the phylogenetic community, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for most researchers to obtain. Before interoperability of richly annotated trees can be obtained, the research community must commit to the use of globally unique identifiers (GUIDs) for informational and material entities, and develop the syntax and semantics to represent the metadata upon which the value of the data depend. The community may be ready to respond to renewed calls for a Minimal Information for a Phylogenetic Analysis (MIAPA) standard. d71 1 a71 1 We also solicit feedback on this preliminary report (see [[#AddComments][below]]). We invite interested scientists to make comments and to join the effort to carry out the work needed to complete this report. 
d75 1 a75 1 A major NRC report on "The New Biology" (2009; http://www.nap.edu/catalog.php?record_id=12764) suggests an enormous new potential for biological discovery based on the prospect of aggregating and integrating data from diverse sources and from multiple disciplines. More specifically, recent commentaries (e.g., Sidlauskas et al., 2010; Patterson et al., 2010) suggest the possibility that archiving and re-use of phylogenetic trees and biodiversity data will soon take off at a pace not seen before. At the same time that funding agencies, publishers, and scientific culture are shifting in ways that incentivize sharing of data, including phylogenetic trees, new technologies and standards are emerging that make phylogenetic methods and results more interoperable. This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. d77 1 a77 1 What standards and technologies will allow scientists to take advantage of phylogenies for "The New Biology"? We aren't entirely sure, but we assume that the first step in addressing this question is to understand where we are right now. For this reason, we are undertaking a project to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging the use and further development of practices that make trees interoperable. We aim to identify strengths and weaknesses in the current infrastructure, practices and policies, to educate tree producers about the needs of tree users, and to educate users about the needs of tree producers. The scope of this project extends, in principle, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc.) where the archiving and re-use of trees is of interest to scientists.
d86 2 a87 2 This document reports the results of our preliminary assessment. Immediately upon releasing this report, we will disseminate a survey to thousands of scientists, in order to obtain feedback and assess current practices. After analyzing this feedback, we intend to expand this preliminary report into a manuscript for publication. We invite those willing to make a commitment of work to join in this project. ---++ Background: the why and the how of data archiving and re-use d89 1 a89 1 In the life sciences, as in other fields, re-use of data (e.g., to verify or build on a published result) is crucial to the progressive and self-correcting nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. By publishing, authors made an implicit or explicit pledge to share data (and materials) upon request, but publishers had little power to enforce such promises. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding data, rather than sharing data, an attitude that remains common in some fields (ref: piwowar) (this is a little strong-jw). d93 1 a93 1 Meanwhile, principled reasons to promote data sharing have exerted an increasing influence over institutions and institutional policies. Professional associations, publishers, and funding agencies recognize that "availability of the data underlying published scientific findings is essential to a healthy scientific process" (insert citation-jw) (see Appendix 1). Funding agencies increasingly recognize that work done on behalf of the public, especially if it is funded by taxpayers, should be accessible to the public without restriction. 
d95 1 a95 1 Finally, the potential (what kind of potential?-jsw) for data aggregation and integration has led to a demand for data represented in a formal way that computers can understand. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists. Recent research has shown that making data available in public archives increases citations (ref: Piwowar). d98 1 a98 3 Presumably, all of the same principles will apply to archiving and re-use of phylogenies. For most authors, publishing a phylogenetic tree results in an informational dead-end: a picture of the tree embedded in a journal article (figure). This is understandable and acceptable to the extent that research projects produce trees as final end-products, with each tree being unique and unlikely to be re-usable. In other words, in the economy of data sharing, there are thousands of phylogeny producers, but few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's. (really?-jsw) d102 1 a102 1 * In 2002? (jsw), TreeBase (Piel, et al., 2002) completed a substantial upgrade of features, including its submission process; d104 2 a105 2 * NSF recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1); (to include trees? jsw) * In recent years, NESCent has invested in "phyloinformatics" efforts to enable interoperability, resulting in projects to develop an XML file format (NeXML), an ontology (CDAO) and a web-services standard (PhyloWS). d107 1 a107 1 (now provide examples of phylogeny re-use, including specific examples with citations. 
Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework.) (send e-mail to dpt jsw) d109 1 a109 1 Thus, an analysis of best practices for publishing re-useable trees is timely. To address this issue, we must begin with some notion of what makes a tree re-usable. In considering what makes it likely for a tree to be re-used in study replication, meta-analysis, aggregation, or integration, we draw guidance from the 2008 roadmap of the TDWG Technical Architecture Group (TAG), and two recent commentaries, by Sidlauskas, et al (2010) on synthetic approaches to evolutionary analysis, and by Patterson, et al (2010) on the importance of names. (what about source data? jsw) Together, these resources suggest the importance of d114 1 a114 3 for the re-useability of phylogenetic results. Below, we briefly explain these features. d117 1 a117 1 Currently, most trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a picture of a tree (figure, above), rather than a machine-readable symbolic encoding of relationships. For trees to be re-usable, they must be accessible in a standard format that makes the structure of the tree explicit. The tree image above corresponds to the following Newick string: d121 1 a121 1 Newick is the simplest of several data formats that are used to represent trees (see Appendices 1 and 3). Some of these formats can be validated, i.e., every valid instance of the type of file conforms to its abstract schema. d124 1 a124 1 The example above invokes entities "otu1", "otu2" and "otu3", and implicitly defines a tree entity. 
To promote re-use and linking of data, it is essential to have globally unique identifiers or GUIDs for such entities so that they can be referenced unambiguously. Otherwise, when aggregating data from several trees, we might mistakenly combine "otu1" with another different entity also called "otu1". Using GUIDs ensures that when we refer to a thing, regardless of context, we know what it is. For instance, perhaps otu1 corresponds to the species ''Canis latrans'' (coyote). In that case, we can provide a Life Science Identifier or LSID (urn:lsid:ubio.org:namebank:2478093), which is a kind of GUID, and this LSID will make it possible for the researcher to associate the entity with information on ''Canis latrans'' available in resources such as the Encyclopedia of Life. If otu1 is a gene sequence, then an http URI for its NCBI accession can serve as a GUID, and this will make it possible for any subsequent researcher to associate "otu1" with the underlying sequence data. d126 1 a126 1 In TreeBase, each tree, data matrix, and study receives a GUID. NameBankIDs and NCBI taxid, and GenBank accession number for sequences. d132 1 a132 1 While the Newick tree above is in a standard format and could be archived in this format, it remains relatively useless because we do not know what it refers to or how it was derived. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree because we can't tell whether it is a species tree (relevant to speciation), or some other kind of tree (irrelevant). d134 1 a134 1 What kind of information makes a tree re-useable? Imagine that we have an archive of all published trees, richly annotated, and that these trees have been loaded into a database that provides tools to thoroughly mine the data. 
Our challenge is to use this database to reveal prior work on a topic, to gather data to test a hypothesis, to discover new relationships among types of data, or to carry out a meta-analysis addressing a methodological issue. In this context, useful types of data or metadata would include: d179 1 a179 1 * [[http://www.phyloxml.org PhyloXML]]- an economical and easy-to-use XML format tuned to molecular phylogenies d188 1 a188 1 In the past few years, four different XML formats have become available, though none is in widespread use. The main developer of NHX format, Christian Zmasek, went on to develop PhyloXML, a validatable format to represent a greater range of attributes than NHX. PhyloXML has an economical schema tuned to the needs of molecular phylogeneticists. The BEAST package has used an XML input format for several years, but it is not considered further here because it is not used to export trees (BEAST outputs trees in NEXUS format). Likewise, while it is possible to encode comparative data in terms of CDAO (Comparative Data Analysis Ontology), and serialize this as RDF-XML, this is not the recommended use of CDAO. NeXML is an XML format with a precisely defined schema, modeled after the structure of NEXUS. While the design of PhyloXML takes a very direct approach to satisfying user needs, NeXML opts for greater generality at the expense of a much more complex schema. NeXML's approach to metadata allows for arbitrary annotation of data objects using external vocabularies. d190 1 a190 1 Features of Newick, NHX, NEXUS, PhyloXML and NeXML are compared in the table below (a filled square indicates presence of the feature; an open circle indicates that there are significant limitations on this feature). d220 1 a220 1 *Georeferences* TDWG TaxonOccurrence (http://rs.tdwg.org/ontology/voc/TaxonOccurrence) seems to address this. In PhyloXML, there are pre-assigned tags, but it is not certain precisely what they mean. 
d275 1 a275 1 Currently the gap between needs and capacities is much greater for the problem of publishing re-usable trees than for the problem of archiving trees. One conclusion: d401 1 a401 1 * [[%ATTACHURL%/PF00034_4_phylo.xml][PF00034_4_phylo.xml]]: 4-taxon test case in [[http://www.phyloxml.org PhyloXML]] format @ 1.67 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1296051226" format="1.1" reprev="1.67" version="1.67"}% d157 1 a157 3 *(not done: finish the TDWG section below)* *Evolution-related journals and their data policies.* In early 2010, the editorial boards of eight journals (Evolution, Molecular Biology and Evolution, American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Heredity, and Evolutionary Applications) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples include Systematic Biology, Molecular Phylogenetics, and so on). The policies adopted by most of these journals as of January 2011 require data archiving in an "appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". However, some policies are more stringent than others. For example, Evolution requires that "authors submit DNA sequence data to GenBank and phylogenetic data to TreeBase" and American Naturalist stipulates that "authors. . . deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required." Other journals have a looser policy. Molecular Ecology "expects that data supporting the results in the paper should be archived in an appropriate public archive such as GenBank, Gene Expression Omnibus, TreeBASE, Dryad, the Knowledge Network for Biocomplexity, your own institutional or funder repository, or as Supporting Information on the Molecular Ecology web site." 
Furthermore, Evolutionary Applications states that "only data underlying the main results in the paper need to be made available. In addition, sufficient information must be provided such that data can be readily suitable for re-analyses, meta-analyses, etc. . . . The preferred way to archive data is using public repositories. For types of data for which there is no public repository, authors can upload the relevant data as Supplementary Materials on the journal's website." Data submission to any of these repositories and the acceptance of the data by these repositories must occur *before* the manuscript goes to production. Appendix 1 provides detailed guidelines for submitting to TreeBASE and to Dryad. @ 1.66 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295971839" format="1.1" reprev="1.66" version="1.66"}% a15 1 1. Jamie update description of journal policies to reflect 2011 d34 1 d46 1 a46 1 * available support for LSIDs and other globally unique identifiers d157 1 a157 1 *(not done: need to finish the TDWG section below)* d159 1 a159 1 *Evolution-related journals and their data policies.* In early 2010, the editorial boards of eight journals (Evolution, Molecular Biology and Evolution, American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Heredity, and Evolutionary Applications) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples include Systematic Biology, Molecular Phylogenetics, and so on). The policy (to be developed at a later date) would require "that data supporting the results in the paper should be archived in an appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". 
The policy does not make clear whether phylogenetic trees would be considered "data supporting the results in the paper" (which is oddly phrased-- shouldn't it refer to data supporting the _conclusions_ of the paper?). See Appendix 1 for details. (update- jsw) a160 64 Their updated data policy statements as of January 24, 2011 are: *Evolution* "We require authors to submit DNA sequence data to GenBank and phylogenetic data to TreeBase." , *Molecular Biology and Evolution* (nothing found for trees): For NUCLEOTIDE SEQUENCES * Newly reported sequences must be deposited in the DDBJ/EMBL/GenBank database. Accession numbers must be included in the final version of the manuscript and cannot be added at the proof stage * Nucleotide sequences must be accurate. Measures employed to avoid error should be clearly stated. Determining the sequence on both strands is the standard for most studies. When only one strand is sequenced, this must be explicitly stated both in Materials and Methods and in the entry that is submitted to one of the sequence databases * All ambiguous positions in a nucleotide sequence must be indicated with the appropriate IUBMB single-letter code (see "terminology" above), rather than "resolved" by guesswork *American Naturalist* "The American Naturalist requires authors to deposit the data associated with accepted papers in a public archive. For gene sequence data and phylogenetic trees, deposition in GenBank or TreeBASE, respectively, is required. There are many possible archives that may suit a particular data set, including the Dryad repository for ecological and evolutionary biology data (http://datadryad.org). All accession numbers for GenBank, TreeBASE, and Dryad must be included in accepted manuscripts before they go to Production. Any impediments to data sharing should be brought to the attention of the editors at the time of submission." 
*Molecular Ecology* "Data Accessibility To enable readers to locate archived data from Molecular Ecology papers, as of January 2011 we will require that authors include a 'Data Accessibility' section after their references. This should list the data base and respective accession numbers for all data from the manuscript that has been made publicly available. For example: "Data Accessibility: -DNA sequences: Genbank accessions F234391-F234402; NCBI SRA: SRX0110215 -Final DNA sequence assembly: uploaded as online supporting information -Phylogenetic data: TreeBASE Study accession no. S9345 -Sample locations and microsatellite data: DRYAD entry doi:10.5521/dryad.1311" Please note that this section must be complete prior to the submission of the final version of your manuscript. Papers lacking this section will not be sent to Production." Journal Policies Policy on data archiving NOTE: the policy below comes into force for papers submitted after 1st January 2011, but we hope that authors will comply in the interim. DNA sequence data from either Sanger or next generation sequencing should continue to be archived in a public data base and the accession numbers included in the manuscript. Molecular Ecology expects that data supporting the results in the paper should be archived in an appropriate public archive, such as GenBank, Gene Expression Omnibus, TreeBASE, Dryad, the Knowledge Network for Biocomplexity, your own institutional or funder repository, or as Supporting Information on the Molecular Ecology web site. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. 
Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species. Authors are expected to archive the data supporting their results and conclusions, along with sufficient details so that a third party can interpret them correctly. As discussed by Whitlock et al. (2010),* this will likely "require a short additional text document, with details specifying the meaning of each column in the data set. The preparation of such shareable data sets will be easiest if these files are prepared as part of the data analysis phase of the preparation of the paper, rather than after acceptance of a manuscript." For additional guidelines on data deposition best practice, please visit http://datadryad.org/depositing *Whitlock MC, McPeek MA, Rausher MD, Rieseberg LR, Moore AJ (2010) Data archiving. American Naturalist, 175, 145-146. Policy on data analysis best practice Molecular Ecology expects that statistical and molecular tools used in submitted papers should meet a high standard of rigour. All analytical approaches have inherent limitations, and authors should therefore attempt to identify the limitations of their chosen approach and corroborate their interpretations when possible. *Journal of Evolutionary Biology* "Authors must use the appropriate database to deposit detailed information supplementing submitted papers, and quote the accession number in their manuscripts." *Heredity* Nothing available online *Evolutionary Applications* NEW: Data Archiving Statement After the Acknowledgements and before the Literature Cited, include one sentence stating where the raw data underlying the main results of the study will be archived (see Data Sharing/Archiving section below for more information). e.g.: Data for this study are available at Dryad. 
DOI: or, if you want to wait until a decision has been made on the manuscript: e.g.: Data for this study are available at: to be completed after manuscript is accepted for publication Data Sharing/Archiving In line with most major evolutionary journals, starting in January 2011, Evolutionary Applications will be requesting that the data underlying the main results in papers be made publicly available. Once papers are accepted, authors will be asked to provide evidence that the relevant data have been or will be archived before publication (if such information has not already been provided). Any impediments to data sharing should be brought to the attention of the editors at the time of submission. To reiterate, only data underlying the main results in the paper need to be made available. In addition, sufficient information must be provided such that data can be readily suitable for re-analyses, meta-analyses, etc. Authors will be required to include a short statement in their paper (see Data Archiving Statement above) indicating where the data for the paper can be obtained. The preferred way to archive data is using public repositories. For types of data for which there is no public repository, authors can upload the relevant data as Supplementary Materials on the journal's website. Sufficient information must be provided such that data can be readily suitable for re-analyses, meta-analyses, etc. Examples of public data repositories: DNA/RNA/Protein sequences/mircroarray data: Genbank/European Nucleotide Archive(ENA)/DDBJ, Protein DataBank, UniProt, NCBI trace and short-read archive, ENA's Sequence Read Archive, GEO or ArrayExpress. Other NCBI databases: e.g. Nucleotide, Protein, PopSet, SNP, dbSNP (accepts microsattelite data) Morphological data, phylogenetic trees, taxonomy, etc.: The Knowledge Network for Biocomplexity, TreeBASE, Dryad, Integrated Taxonomic Information System, Species 2000, and institutional databases. 
Many institutions also have data/information repositories, e.g. Circle (University of British Columbia), SMD (Stanford Microarray Database). d171 1 a171 1 TDWG has and promotes standards, but it does not carry the regulatory or enforcement power that NSF and publishers do. Darwin Core and LSIDs are TDWG-approved standards. Research organizations look to TDWG for standards and benchmarks. (update jsw) @ 1.65 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295961878" format="1.1" reprev="1.65" version="1.65"}% a15 2 1. Arlin flesh out "Language support (ontologies and vocabularies)". * avoid term "semantics". use "ontologies" ok d21 1 d24 1 d75 1 a75 1 A major NRC report on "The New Biology" (2009, get ref: http://www.nap.edu/catalog.php?record_id=12764) suggests an enormous new potential for biological discovery based on the prospect of aggregating and integrating data from diverse sources and from multiple disciplines. As suggested in recent commentaries (e.g., Sidlauskas, et al, 2010; Patterson, et al., 2010), several factors (see Background) suggest more specifically that archiving and re-use of phylogenetic trees and biodiversity data will soon take off at a pace not seen before. At the same time that funding agencies, publishers, and scientific culture are shifting in ways that incentivize sharing of data-- including phylogenetic trees--, new technologies and standards are emerging that make phylogenetic methods and results more interoperable. This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. d153 1 a153 1 An important aim for this project is to investigate what kinds of annotations can be supported by available vocabularies and, where possible, to make recommendations. 
Some types of annotations involve ''domain-specific concepts'', e.g., if we wish to distinguish "unrooted tree" from "rooted tree" in a robust way, this must make reference to some externally defined concept. d278 1 a278 4 *(not done)* Based on Appendix 1, this will summarize relevant bits of * Dublin core and prism, for representing citation data * Darwin core and its relevance to key types of metadata * CDAO (comparative data analysis ontology). d280 1 a280 1 The Comparative Data Analysis Ontology (CDAO) provides language support for many aspects of comparative analysis, though it has not been tested extensively and remains experimental. The concept of a phylogenetic tree could be designated by reference to CDAO:Tree, which in turn defines a tree as a sub-class of "Network", and in relation to other concepts like "Branch" and "Node". DarwinCore also provides language support for several key concepts useful in representing comparative data. d282 11 a292 1 A major gap (in what? jsw) involves language support for annotating workflows, i.e., for describing the sequence of operations used by a scientific researcher to generate a phylogeny product. Currently there is an experimental branch of CDAO that imports terms for a variety of computation concepts like "multiple sequence alignment program" as well as specific versions (e.g., "ClustalW"). In addition, a recent LIMS plug-in for the Geneious software allows for tracking workflows and alignment annotations (see http://software.mooreabiocode.org). @ 1.64 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1295924193" format="1.1" reprev="1.64" version="1.64"}% d254 1 a254 1 NEXUS is a highly expressive data format that has been in use for nearly as long as Newick. It is the output format for many phylogenetic inference programs. The published format description (Maddison, et al., 1997) is detailed but does not contain a formal grammar. 
The basic structure of a NEXUS file is a series of blocks, each containing commands. The most commonly used blocks are TAXA (a declared list of OTUs), CHARACTERS (a matrix of comparative data) and TREES (one or more phylogenetic trees for the OTUs), but there are also blocks for ASSUMPTIONS underlying an analysis, as well as other infrequently used blocks. OTUs and characters can be cross-referenced by index numbers within certain types of commands. Due to the lack of an ongoing development model, and ambiguities in the syntax, different interpretations of NEXUS have arisen within the phylogenetics community. Nevertheless, it seems to be the most commonly used format among phylogeny experts. d258 1 a258 3 In the past few years, four different XML formats have become available. None of them is in widespread use yet. The main developer of NHX format, Christian Zmasek, went on to develop PhyloXML, an XML format, which he recommends over NHX. PhyloXML provides a validatable and more robust syntax to represent a greater range of attributes than NHX. PhyloXML is an economical format tuned to the needs of molecular phylogeneticists, but is not suitable more broadly for those working on non-molecular characters. The BEAST package has used an XML format for several years. This format is used for input to BEAST, which produces NEXUS output. Users are recommended to archive the BEAST XML file, which contains precise instructions for carrying out an analysis, along with the NEXUS output file that contains phylogenetic trees. Because BEAST XML is not intended as an output format, we do not list it in the comparison table below. Likewise, while it is possible to encode comparative data in terms of CDAO (Comparative Data Analysis Ontology), and serialize this as RDF-XML, this is not the recommended use of CDAO. The most recently developed format, and the most promising, is NeXML. NeXML is an XML format with a precisely defined schema, modeled after the structure of NEXUS. 
While the design of PhyloXML takes a very direct approach to satisfying user needs, NeXML opts for greater generality at the expense of a much more complex schema. It provides the capacity for arbitrary annotation of all data objects, using external vocabularies. d260 1 a260 1 Features of Newick, NHX, NEXUS, PhyloXML and NeXML are compared in the table below. d262 1 a262 1 @ 1.63 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295906264" format="1.1" reprev="1.63" version="1.63"}% d130 5 d159 66 a224 1 *Evolution-related journals and their data policies.* In early 2010, the editorial boards of eight journals (Evolution, Molecular Biology and Evolution, American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Heredity, and Evolutionary Applications) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples include Systematic Biology, Molecular Phylogenetics, and so on). The policy (to be developed at a later date) would require "that data supporting the results in the paper should be archived in an appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". The policy does not make clear whether phylogenetic trees would be considered "data supporting the results in the paper" (which is oddly phrased-- shouldn't it refer to data supporting the _conclusions_ of the paper?). See Appendix 1 for details. (update- jsw) d274 1 a274 1 The Dryad project was launched in 2009, to support archiving of data from ecological and evolutionary studies, including data that do not fit any specialized database. The data may include images, text files, spreadsheets, and some other types of files. According to Todd Vision of the Dryad project (see Appendix 3), "since Dryad is a general-purpose repository, it doesn't impose any constraints on how the data are represented within the files that users submit. 
The best practices need to come from elsewhere, such as journal policies, MIAPA . . . and community practice." Because of the diversity of data, there is no back-end schema for knowledge organization. Instead, any text in uploaded files is indexed so that relevant files can be identified and retrieved by users. The submission process, currently in beta testing, is carefully explained. By submitting data, users make it available for re-use via a Creative Commons license. d654 7 a660 1 Further information on Dryad was communicated by Todd Vision of the Dryad project: d668 1 @ 1.62 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1295897871" format="1.1" reprev="1.62" version="1.62"}% d75 1 a75 1 Archiving and re-use of phylogenetic trees will soon take off at a pace not seen before. As explained below (Background), at the same time that funding agencies, publishers, and scientific culture shift toward incentivizing data sharing-- including phylogenetic trees, new technologies and standards are emerging that make phylogenetic methods and results more interoperable. This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. d77 1 a77 1 For this reason, we are undertaking a project to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging the use and further development of practices that make trees interoperable. We aim to identify strengths and weaknesses in the current infrastructure, practices and policies, to educate tree producers about the needs of tree users, and to educate users about the needs of tree producers. 
The scope of this project extends, in principle, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc) where the archiving and re-use of trees is of interest to scientists. d89 1 a89 1 In the life sciences, as in other fields, re-use of data (e.g., to verify or build on a published result) is crucial to the progressive and self-correcting nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. By publishing, authors made an implicit or explicit pledge to share data (and materials) upon request, but publishers had little power to enforce such promises. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding data, rather than sharing data (this is a little strong-jw). d142 1 a142 1 Computable knowledge representation is largely a matter of relationships between entities that can be expressed as subject-predicate-object triples, i.e., "Bob has_friend Susan" or "Susan is_a female_person". By joining these two statements via the identity of Susan, we can conclude that Bob has a friend that is a female person. Much of the formal reasoning behind data aggregation relies on the relationship of identity (this thing is the same as that thing), and the relationship of subsumption (this thing belongs to a certain class). For instance, to aggregate data on species occurrence, we need to know if a report of a bird in location X and a second report of a bird in location Y refer to the same species of bird. To establish the relationship of identity we may rely on GUIDs. d144 1 a144 1 Further language support is needed for the kinds of annotations listed above. 
For instance, if we wish to associate a tree with a publication, i.e., tree1 has_published_source citation1 we need an ID for tree1, a definition of the predicate has_published_source, and a way to represent citation1. ''Dublin core'' (jsw) and ''prism'' (jsw) are standards in common use for representing citation data. d146 3 a148 1 While the concept of citation is generic throughout academia, some types of annotations above involve ''domain-specific concepts'', i.e., concepts specific to biology, evolution, and even the sub-discipline of phylogenetic analysis. Examples would include annotations of data types or phylogenetic methods. To specify these concepts clearly, they must be defined in an external vocabulary or ontology. d174 1 a174 1 *(not done)* Various formats are used for phylogenies. This should be a summary of what is in Appendix 2, focusing on what the formats can represent in terms of useful metadata. d176 5 a180 9 1. who made it? 2. when was it made? 3. what can you do with it? (jsw) * Newick- only allows labels, no metadata * NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation) * NHX [[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]]) * [[http://www.phyloxml.org PhyloXML]]- find out how much of attributes below can be represented * [[http://www.nexml.org NeXML]] - write a description of how to represent LSIDs, GenBank accession numbers, geo coordinates d182 1 a182 1 The Newick ("New Hampshire") format is the simplest, oldest, and most commonly used format. d184 1 a184 1 NEXUS (Maddison, et al., 1997) is a highly expressive data format that has been in use for over 25 years. *further description* d186 1 a186 1 NHX (New Hampshire eXtended) format was developed as an extension of Newick, to represent common annotations of nodes. However, the highly constrained syntax of NHX limits its usefulness. 
d188 1 a188 1 The main developer of NHX format, Christian Zmasek, went on to develop PhyloXML, an XML format, which he recommends over NHX. PhyloXML provides a validatable and more robust syntax to represent a greater range of attributes than NHX. d190 3 a192 3 NeXML, like PhyloXML, is an XML format with a precisely defined schema. While the design of PhyloXML takes a very direct approach to satisfying user needs (particularly in genomics-related applications of phylogeny), NeXML opts for greater generality at the expense of a much more complex schema. The BEAST package has used an XML format for several years. This format is used for input to BEAST, which produces NEXUS output. Users are recommended to archive the BEAST XML file, which contains precise instructions for carrying out an analysis, along with the NEXUS output file that contains phylogenetic trees. d194 1 a194 5 Combine like-ideas in these tables (jsw) d688 2 @ 1.61 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295649030" format="1.1" reprev="1.61" version="1.61"}% d7 1 a7 1 Jamie and Arlin plan to finish a draft report by January 24. The goal is to have a report of 5 to 10 pages (not counting appendices) that will accompany the survey. The survey is for active researchers using phylogenies, so it's ok to rely on technical language. d40 1 a40 1 The motivation for the report is that it will encourage the use, and further the development, of data management practices that will benefit scientists individually and collectively. This motivation is somewhat speculative. Archiving of results post-publication seems to benefit the scientific community, and to benefit the individual archiving scientist in the form of increased recognition. A less speculative motivation is that developing the capacity to manage richly annotated yet interoperable data benefits scientists (individually and collectively) by making it easier to carry out integrative, automated, or large-scale projects. 
d61 1 a61 1 * No reporting standard for a phylogenetic analysis currently exists d66 1 a66 1 Archiving of trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad). However, the archival value of many trees will be limited without a shift in emphasis toward re-usability, along with technology and standards to support such a shift. While making trees archival is an important step forward for the phylogenetic community, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for most researchers to meet. Before interoperability of richly annotated trees can be achieved, the research community must commit to the use of globally unique identifiers for informational and material entities, and develop the syntax and semantics to represent the metadata upon which the value of the data depends. The community may be ready to respond to renewed calls for a Minimal Information for a Phylogenetic Analysis (MIAPA) standard. d75 1 a75 1 Archiving and re-use of phylogenetic trees will soon take off at a pace not seen before. As explained below (Background), at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies and standards are emerging that make phylogenetic methods and results more interoperable. This interoperability infrastructure benefits individual researchers by enabling them to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. d77 1 a77 1 For this reason, we are undertaking a project to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging the use and further development of practices that make trees interoperable.
We aim to identify strengths and weaknesses in current infrastructure, practices and policies, to educate producers about the needs of consumers, and to educate consumers about the needs of producers. The scope of this project extends, in principle, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc) where the archiving and re-use of trees is of interest to scientists. d89 1 a89 1 In the life sciences, as in other fields, re-use of data (e.g., to verify or build on a published result) is crucial to the progressive and self-policing nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. By publishing, authors made an implicit or explicit pledge to share data (and materials) upon request, but publishers had little power to enforce such promises. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding data, rather than sharing data. d91 1 a91 1 This is no longer the case in many fields, due to various factors involving supply and demand, technology, and institutional policies. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form. For instance, the record of a 3-dimensional protein structure contains roughly 10^4 coordinates (each a floating-point number) per domain. To facilitate archiving and re-use of such data, crystallographers collaborated in 1971 to launch the PDB, which is still the world's premier archive for 3D protein structures. In 1982, just 5 years after the discovery of DNA sequencing methods, GenBank was launched to archive the DNA sequences that were crowding the pages of journals. 
In both cases, editorial boards of relevant scientific journals quickly decided to require simultaneous archiving (of 3D structures in PDB; of DNA sequences in GenBank), so that data would be accessible to all scientists upon publication. d93 1 a93 1 Meanwhile, principled reasons to promote data sharing have exerted an increasing influence over institutional policies. Professional associations, publishers, and funding agencies recognize that "availability of the data underlying published scientific findings is essential to a healthy scientific process" (see Appendix 1). Funding agencies increasingly recognize that work done on behalf of the public, especially if it is funded by taxpayers, should be accessible to the public without restriction. d95 1 a95 1 Finally, the potential for data aggregation and integration has led to a demand for data represented in a formal way that computers can understand. The availability of data from PDB and GenBank, for example, has resulted in innumerable publications by scientists analyzing data generated by other scientists. Recent research has shown that making data available in public archives increases citations (ref: Piwowar). d98 1 a98 1 Presumably, all of the same principles will apply to archiving and re-use of phylogenies, though it is too soon to tell. For most authors, publishing a phylogenetic tree results in an informational dead-end: a picture of the tree embedded in a journal article (figure). This is understandable and acceptable to the extent that research projects produce trees as final end-products, with each tree being unique and unlikely to be re-usable. d100 1 a100 1 In other words, in the economy of data sharing, there are thousands of phylogeny producers, but few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's. 
d104 3 a106 2 * Recently, TreeBase (Piel, et al., 2002) completed a substantial upgrade of features, including its submission process; a new data archive, Dryad, began accepting data from ecological and evolutionary studies, including phylogenetic trees; * NSF has recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1); d109 1 a109 1 (now provide examples of phylogeny re-use, including specific examples with citations. Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework. ) d111 1 a111 1 Thus, an analysis of best practices for publishing re-useable trees is timely. To address this issue, we must begin with some notion of what makes a tree re-usable. In considering what makes it likely for a tree to be re-used in study replication, meta-analysis, aggregation, or integration, we draw guidance from the 2008 roadmap of the TDWG Technical Architecture Group (TAG), and two recent commentaries, by Sidlauskas, et al (2010) on synthetic approaches to evolutionary analysis, and by Patterson, et al (2010) on the importance of names. Together, these resources suggest the importance of d118 1 a118 1 Below, we explain these features briefly. d120 1 a120 1 ---++++ 1. Standard, validatable formats d127 1 a127 1 ---++++ 2. GUIDS d130 2 a131 2 ---++++ 3. Rich annotations ("metadata") While the Newick tree above is in a standard format and could be archived in this format, it remains relatively useless because we don't know what it refers to or how it was derived. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. 
bushy, we can't use this particular tree because we can't tell whether it's a species tree (relevant to speciation), or some other kind of tree (irrelevant). d133 1 a133 1 What kind of information makes a tree re-useable? Imagine that we have an archive of all published trees, richly annotated, and that these trees have been loaded into a database that provides tools to interrogate the data thoroughly. Our challenge is to use this database to reveal prior work on a topic, to gather data to test a hypothesis, to discover new relationships among types of data, or to carry out a meta-analysis addressing a methodological issue. In this context, useful types of data or metadata would include: d144 1 a144 1 Further language support is needed for the kinds of annotations listed above. For instance, if we wish to associate a tree with a publication, i.e., tree1 has_published_source citation1 we need an ID for tree1, a definition of the predicate has_published_source, and a way to represent citation1. ''Dublin core'' and ''prism'' are standards in common use for representing citation data. d152 1 a152 1 *Some evolution-related journals.* In early 2010, the editorial boards of eight journals (Evolution, Molecular Biology and Evolution, American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Heredity, and Evolutionary Applications) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples would be Systematic Biology, Molecular Phylogenetics, and so on). The policy (to be developed at a later date) would require "that data supporting the results in the paper should be archived in an appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". 
The policy does not make clear whether phylogenetic trees would be considered "data supporting the results in the paper" (which is oddly phrased-- shouldn't it refer to data supporting the _conclusions_ of the paper?). See Appendix 1 for details. d163 1 a163 1 TDWG is an organization that maintains standards, but it does not have the regulatory or enforcement power of NSF or of publishers. Darwin Core and LSIDs are TDWG-approved standards. Research organizations look to TDWG for standards and benchmarks. d165 1 a165 1 ---+++ Reporting standard d167 1 a167 1 Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). d172 6 a177 1 *(not done)* This should be a summary of what is in Appendix 2, focusing on what the formats can represent in terms of useful metadata. d184 1 a184 1 Various formats are used for phylogenies. The Newick ("New Hampshire") format is the simplest, oldest, and most commonly used format. NHX (New Hampshire eXtended) format was developed as an extension of Newick, to represent common annotations of nodes. However, the highly constrained syntax of NHX limits its usefulness.
The main developer of NHX format, Christian Zmasek, went on to develop PhyloXML, an XML format, which he recommends over NHX. PhyloXML provides a validatable and more robust syntax to represent a greater range of attributes than NHX. d188 4 d196 2 d204 1 a204 1 Researchers wishing to make a phylogenetic tree available in a public archive currently have two options, TreeBASE and Dryad. d206 1 a206 1 The TreeBASE project is a specialized repository that focuses on supporting phylogenetic studies (Piel, et al., 2002). TreeBASE 2.0 (released in March, 2010) has a relational database back-end with a complex schema that allows it to accommodate, not just phylogenies and character matrices, but metadata associated with a study, including authors, publications, and descriptions of methods. The submission process (see Appendix 3) is well documented, and allows users to associate OTUs with species names (NCBI or UBio names) and to add other types of metadata. Data are uploaded in NEXUS format. Other formats are not supported. Support for studies with large numbers of trees is limited. Archived data are made available to users via a convenient web interface. The web interface does not provide full access to the schema. d214 1 a214 1 ---+++ Language support (ontologies and other vocabularies) d221 1 a221 1 The concept of a phylogenetic tree could be designated by reference to CDAO:Tree, which in turn defines a tree as a sub-class of "Network", and in relation to other concepts like "Branch" and "Node". The Comparative Data Analysis Ontology (CDAO) provides language support for many aspects of comparative analysis, though it has not been tested extensively and remains experimental. DarwinCore also provides language support for several key concepts useful in representing comparative data. d223 1 a223 1 A major gap involves language support for annotating workflows, i.e., for describing the sequence of operations used by a scientific researcher to generate a phylogeny product. 
Currently there is an experimental branch of CDAO that imports terms for a variety of computation concepts like "multiple sequence alignment program" as well as specific versions (e.g., "ClustalW"). d227 2 a228 2 Even if there are journal policies that encourage archiving, and standards that guide representation of data, scientists motivated to archive or re-use data may find it difficult to do so if appropriate tools are lacking. Some of the obvious kinds of tools to support archiving of richly annotated data sets are: * format validators and translators for formats that support data and metadata d233 1 a233 1 * manipulating data and metadata (e.g., extracting subsets or subtrees) while respecting rules d239 1 a239 1 For instance, most phylogenetics users implement a customized interactive workflow that relies on diverse software tools. Because these tools may use different formats, the ability to convert among formats is important. However, convenient generalized tools are lacking (the NeXML manual provides a useful list of online servers and scripting approaches). Likewise, there are many tools for viewing trees, but it seems that only a few allow for viewing trees together with a matrix of data (e.g., Archaeopteryx, Mesquite, Nexplore). Support for viewing metadata is very limited. The TreeBASE submission server is an example of a tool that allows users to annotate data sets by associating OTUs with species names, and by associating data rows with accessions. However, this tool only works in the context of a TreeBASE submission. d241 1 a241 1 ---+++ Current practices d243 1 a243 1 The analysis of citations by Kumar & Dudley (2007) suggests that the number of phylogeny publications in 2006 was 7000, and rapidly increasing. Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research. 
d245 1 a245 1 We have not done a systematic analysis of current practices for archiving and re-use of trees. In early 2011 we intend to release a survey that will provide information on some aspects of current practices. The main questions to address are: d251 2 d263 1 a263 3 We begin by noting that the needs of archiving are not the same as those of publishing linkable, re-usable data. In a typical case of archiving, even the simplistic Newick format can be used to represent a phylogeny with metadata if there is a combination of 1) a Newick string with unique identifiers for each internal and external node; and 2) an entity-attribute-value table assigning attribute values to nodes, as in the table used in Appendix 2. While this combination may ensure that key information is available in a record, it does not make study replication any easier, and it does little to facilitate re-use, re-purposing, aggregation, or linking of the tree. BEAST users can support study replication by archiving their BEAST XML file, which includes the input data along with precise instructions for processing. This is a perfectly adequate solution for archiving. However, the information in the BEAST XML file provides instructions that only BEAST can understand, and is not anchored by semantics defined in external computable vocabularies. Therefore we address conclusions and recommendations below as they pertain to archiving and linking, separately or collectively. d267 1 d270 1 a270 1 *The infrastructure to support archiving of phylogenetic trees is largely in place*. TreeBASE and Dryad may serve as repositories to ensure that phylogenetic tree information associated with a publication is recoverable many years into the future. TreeBASE imports data into its own data model. In the case of Dryad, commonly used file formats are sufficient for archiving purposes. 
However, it is preferable to use formats that can be validated relative to a schema, because it is more likely that parsers for these formats will continue to be available in the long-term. d272 1 a272 1 *While archiving trees is possible, the extent to which current policies require it is unclear*. The draft policy suggested recently by journals (Appendix 1) refers to "data supporting the results" of a publication. It is not clear what this is intended to cover. Natural scientists traditionally use the term "data" as a synonym for "facts", the empirical observations or measurements on which further analysis rests. By this empirical definition, phylogenies are not data. Computer and information scientists use "data" (and in some sub-fields, even "facts") to refer to any kind of recorded information, regardless of its nature or derivation. The NIH data sharing policy (see note 7 of [http://grants.nih.gov/grants/policy/nihgps/fnpart_ii.htm]) makes clear that NIH uses the informational definition of "data", not the empirical one. We recommend that institutions with data archiving policies be explicit about what they mean by "data". d276 1 a276 1 Currently the gap between needs and capacities is much greater for the problem of publishing re-usable trees than for the problem of archiving trees. Some of the issues: d278 1 a278 1 *A reporting standard for a phylogenetic analysis will greatly extend the re-usability of archived trees*. A minimal reporting standard for phylogenetic analysis has been suggested (see MIAPA, Appendix 1) but never drafted or approved. Establishing such a standard would be a critical step in promoting re-useable trees. To develop a standard would require community organization as well as technological support (some guidance is provided by a [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper MIAPA whitepaper]] from the NESCent EvoInfo working group). d280 6 a285 6 *other issues* 1. 
lack of resolvable LSIDs; lack of a validator to see if species refs are resolved 1. lack of formal language support (e.g., tree inferred_from_data matrix) 1. lack of community standards for some types of metadata 1. lack of education about, and awareness of, metadata standards 1. lack of software support for annotation d295 1 a295 1 * Elena Herzog, TE, DR, AS, and JW participated in the TDWG workshop project Oct X in Woods Hole d299 1 d343 1 a343 1 ---++++ NSF d347 1 a347 1 The policy may be found in the Award and Administration Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 section VI.D.4.b]]: d366 1 a366 1 There isn't a standard for encoding Dublin Core (DC) publication data in XML. In particular, there isn't an enclosing element. In NeXML it would be "meta". DC isn't very well suited to journal articles, anyway. The best attempt I've seen (http://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/) goes like this: d396 2 a397 1 We have developed a set of test files: a404 2 to illustrate the representation capabilities (and limitations) of different formats. d592 1 a592 1 Since Dryad is a general-purpose repository, it doesn't impose any constraints on how the data are represented within the files that users submit. The best practices need to come from elsewhere, such as journal policies, MIAPA (which would be nice to give some attention to in that paper), and community practice imposed by awareness of how the data will be reused by more specialized phylogenetic tools. d594 1 a594 1 In case you aren't aware, Dryad just introduced a "handshaking" feature for TreeBASE. Users can elect to have a NEXUS file that is deposited to Dryad "pushed through" to TreeBASE to initiate the submission process there. So for the special case of phylogenetic data in Dryad, we would encourage having that Newick tree within a NEXUS file, together with the OTU metadata that can fit within that file format.
I dream of a future in which lots of different software tools will support the editing and output of metadata-rich phylogenies in NeXML, and in which TreeBASE can ingest those NeXML files. But we aren't there yet. d599 1 a599 1 TreeBASE is a repository for trees that has been in operation for many years. In the past few years, the schema was redesigned, and there have been numerous upgrades to the user interface, including a sophisticated submission process and a web services API to retrieve results via a URL. d671 1 @ 1.60 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295642592" format="1.1" reprev="1.60" version="1.60"}% d14 1 a14 2 1. Arlin *shorten* and rewrite rationale and background * explain context, stress tentative nature of report d16 1 a16 2 1. Arlin flesh out "computable concepts" section * change "computable concepts" to "Language support (ontologies and vocabularies)". d19 1 a19 2 1. Arlin "reporting standards" are a different category, but could be lumped with "policies" 1. Arlin insert some examples of re-use d34 1 d73 1 a73 1 ---++ Rationale and scope for this report d75 1 a75 1 Archiving and re-use of phylogenetic trees will soon take off at a pace not seen before. At the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies are emerging to make re-use of phylogenetic results easier. This infrastructure has both post-publication and pre-publication benefits. The pre-publication benefit of sharing-and-interoperability technology is that it enables researchers to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects.
d77 1 a77 1 The goal of this project is to assess the current state of the field in regard to archiving and re-use of phylogenetic trees, with the ultimate goal of encouraging practices that make trees more interoperable. d79 2 a80 2 This preliminary report is an initial step toward that goal. The scope of this report will extend, ideally, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc) where the archiving and re-use of trees is of interest to scientists. In regard to the electronic archiving and re-use of trees, we aim to assess * relevant institutional policies d82 2 a83 1 * available standards, including file formats a84 1 * tools that facilitate archiving and re-use d86 2 a87 1 We aim to identify strengths and weaknesses in current infrastructure and policies, to educate producers about the needs of consumers, and to educate consumers about the needs of producers. d89 1 a89 4 Upon releasing this preliminary report, we will disseminate a survey to thousands of scientists, in order to obtain feedback and assess current practices. After analyzing the results of this process, we intend to expand this preliminary report into a manuscript for publication. We invite others (who are willing to make a commitment of work) to join in this project. ---++ Background and Rationale ---+++ Archiving and re-use of data Re-use of accessible data, e.g., to verify a published result, is crucial to the progressive and self-policing nature of scientific inquiry. In the distant past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. Scientific publishers often had a policy that publication represents a pledge by the author to share materials or data, but had little power to enforce such promises. In practice, authors determined which data were released, to whom, and in what form, often assuming that their own interests were served best by hoarding, rather than sharing, data. 
d91 1 a91 1 This is no longer the case in many fields, due to various factors involving supply and demand, technology, and institutional policies. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form. For instance, a crystallographically determined 3D protein structure contains on the order of 10^4 coordinates (each a floating-point number) per domain. Thus, crystallographers collaborated in 1971 to launch the PDB, which is still the world's premier archive for 3D protein structures. In 1982, just 5 years after the discovery of DNA sequencing methods, GenBank was launched to archive the DNA sequences that were crowding the pages of journals. In both cases, editorial boards of journals quickly decided to make simultaneous archiving (of 3D structures in PDB, and of DNA sequences in GenBank) a condition of publication, so that the data would be accessible to all scientists. a96 1 ---+++ Archiving and re-use of phylogenetic trees d98 3 a100 1 For most authors, publishing a phylogenetic tree results in an informational dead-end: a picture of the tree embedded in a journal article. This is understandable and acceptable to the extent that research projects produce trees as final end-products, with each tree being unique and unlikely to be re-usable. In other words, in the economy of data sharing, there are thousands of phylogeny producers, but few phylogeny consumers, in spite of an archive called "TreeBASE" (Piel, et al., 2002) that has enabled phylogeny re-use since the late 1990's. 
d104 3 a106 5 * Recently, TreeBase completed a substantial upgrade of features, including its submission process; a new data archive, Dryad, began accepting data from ecological and evolutionary studies, including phylogenetic trees; * NSF has recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1). Thus, scientists will be motivated by funding agencies to share data electronically; * In recent years, phyloinformatics researchers have been developing supporting technologies to enable interoperability, including XML file formats (NeXML, PhyloXML), an ontology (CDAO) and a web-services standard (PhyloWS). (now provide examples of phylogeny re-use, including specific examples with citations. Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework. ) d108 1 a108 1 Thus, an analysis of best practices for publishing re-useable trees is timely. d110 1 a110 1 To analyze the state of the field in regard to publishing re-useable trees, we must begin with some conception of what makes a tree re-usable. In considering what makes it likely for a tree to be re-used in study replication, meta-analysis, aggregation, or integration, we draw guidance from the 2008 roadmap of the TDWG TAG (Technical Architecture Group), and two recent commentaries, by Sidlauskas, et al ("Linking Big") on synthetic approaches to evolutionary analysis, and by Patterson, et al on "Names are key to the big new biology". 
Together, these resources suggest the importance of d117 2 d120 1 a120 1 Currently, most trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a picture of a tree, rather than the tree as an informational entity. For trees to be re-usable, they must be accessible in a standard format that makes the structure of the tree explicit. The tree image above corresponds to the following Newick file: d124 1 a124 1 There are a variety of data formats that are used to represent trees (see Appendices 1 and 3). Some of these formats can be validated, i.e., every valid instance of the type of file conforms to its abstract schema. d127 1 a127 1 The example above invokes entities "otu1", "otu2" and "otu3", and implicitly defines a tree entity. To promote re-use and linking of data, it is essential to have globally unique identifiers or GUIDs for such entities. Otherwise, when aggregating data from several trees, we might mistakenly combine "otu1" with another different entity also called "otu1". Using GUIDs ensures that when we refer to a thing, regardless of context, we know what it is. For instance, perhaps otu1 corresponds to the species xxxx. In that case, there is a specific Life Science Identifier or LSID, which is a kind of GUID, and this will make it possible for the researcher to associate otu1 with information on that species available in resources such as the Encyclopedia of Life. If otu1 is a gene sequence, then an http URI for its NCBI accession can serve as a GUID, and this will make it possible for any subsequent researcher to associate "otu1" with the underlying sequence data. d130 1 a130 1 The Newick tree above cannot be re-used because its citation information is not included with the tree and the labels are not intuitively readable. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. 
bushy, we can't use this particular tree because we can't tell whether it's a species tree (relevant to speciation), or some other kind of tree (irrelevant). This example suggests some obvious ways to make a tree re-useable, namely to provide citation information, and when appropriate, to provide taxonomic links or other identifiers for the "OTUs" (Operational Taxonomic Units) at the tips of the tree. d132 7 a138 7 Another way to look at the problem of re-useability is to imagine that we have a database full of all published trees, richly annotated with the right kinds of metadata, and our challenge is to use this database to reveal prior work on a topic, to gather data to test a hypothesis, to discover new relationships among types of data, or to carry out a meta-analysis addressing a methodological issue. Types of data or metadata useful for such studies would include: * authorship and citation data * taxonomic links and species identifiers for OTUs * identifiers for a specimen or accession to which OTUs are linked * links to data from which the tree was inferred * geographic coordinates * a description of the method by which the tree was inferred d141 1 a141 5 Computable knowledge representation is largely a matter of relationships between entities that can be expressed as subject-predicate-object triples, i.e., "Bob has_friend Susan" or "Susan is_a female_person". By joining these two statements via the identity of Susan, we can conclude that Bob has a friend that is a female person. Much of the formal reasoning behind data aggregation relies on just two simple relationships: * the relationship of identity (this thing is the same as that thing) * the relationship of subsumption (this thing belongs to a certain class). For instance, to aggregate data on species occurrence, we need to know if a report of a bird in location X and a second report of a bird in location Y refer to the same species of bird. 
To establish the relationship of identity we rely mainly on GUIDs. d143 1 a143 1 (need to continue here, cover domain-specific concepts. ) d145 1 @ 1.59 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295368001" format="1.1" reprev="1.59" version="1.59"}% d3 1 a3 1 ---+ Current Best Practices for Publishing Trees Electronically: Draft Report and Request for Comments a25 1 1. Arlin add to gaps or tools section, ability to compare trees d36 1 d75 12 a86 4 ---++ Background and Rationale ---+++ Rationale for archiving re-useable data and metadata (*contains generalizations without adequate refs*) Re-use of accessible data is crucial to the progressive and self-policing nature of scientific inquiry. Professional associations, publishers, and funding agencies recognize that "availability of the data underlying published scientific findings is essential to a healthy scientific process" (see Appendix 1). In the past, publication of a conventional scientific article was deemed sufficient to satisfy the demand for accessible and re-usable data. This is no longer the case, due to various factors that affect either the supply or the demand for scientific data. As the result of new technologies that generate massive amounts of data, many scientific reports depend on data too voluminous to publish in printed form. At the same time, the potential for data aggregation and integration leads to demand for data represented in a formal way that computers can understand. d88 1 a88 1 Such conditions drive stakeholders to develop data archives and information standards, along with the cyber-infrastructure to support their use. This process has been going on for some time ( describe indications that these things are already happening: NCEAS; NESCent; EvoInfo; interop projects ). d90 4 a93 1 This infrastructure has both post-publication and pre-publication benefits to researchers. 
Post-publication archiving of results increases the accessibility and exposure of a researcher's work (ref: Piwowar). However, the same technologies and standards that facilitate effective archiving also enable researchers in the pre-publication stage to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework. d95 1 a95 1 (now provide examples of phylogeny re-use) d97 1 a97 1 ---+++ Rationale for this report d99 1 a99 1 In the past, publishing a phylogenetic tree typically resulted in an informational dead-end: a picture of the tree embedded in a journal article. This was understandable and acceptable to the extent that, in the past, research projects produced trees as final end-products, with each tree being unique and unlikely to be re-usable. In the past economy of data sharing, there were phylogeny producers, but few phylogeny consumers. However, conditions are changing to favor publishing re-usable trees, especially ones that can be linked into a global web of data. d101 3 a103 1 While the proximate goal of this project is simply to assess current practices, the ultimate goal is to encourage archiving practices that make trees more interoperable. A premise of this report is that most of the benefits of interoperability are still to come, and that the time is ripe to stimulate further advances. More trees are generated every year. More and more journals that serve the phylogenetics community are asking authors to archive data. 
Because standards of quality and software sophistication have gone up enormously, a typical user finds the task of phylogeny inference daunting, i.e., it's more difficult to be a producer, especially of a high-quality product. Yet, because more trees are being made, it's more likely that a tree to serve a user's need already exists, if it can be discovered and obtained in a re-usable form. In addition, trees increasingly serve as inputs for larger projects to aggregate data, compare methods, or integrate phylogenies with other kinds of data. At the same time that new ways of consuming trees have emerged, standards and technologies are emerging to make it easier to share and re-use trees. d105 1 a105 2 As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to identify strengths and weaknesses, to educate producers about the needs of consumers, and consumers about the needs of producers. The scope of this report will extend, ideally, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc.) where the archiving and re-use of trees is of interest to scientists. Such an assessment is timely for several reasons: * While in the past, many scientists felt no incentive to share data, recent research has shown that making data available in public archives increases citations (ref: Piwowar, research remix), widely understood as an indicator of professional success; d107 1 a107 1 * The only major electronic repository of trees, TreeBase, has recently completed a major upgrade of features, including its submission process; in 2009, a data archive called Dryad was launched and will accept various kinds of electronic files, including those with trees; d111 1 a111 3 Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies are emerging to make it easier.
d113 1 a113 1 d115 6 a120 3 ---+++ Features expected to make archived trees re-useable *(not done: see formal language support below)* What would make these trees re-useable, allowing that there may be many different categories of re-use (replication, meta-analysis, aggregation, integration)? Some guidance is provided by the considerations outlined in the 2008 roadmap of the TDWG Technical Architecture group. The TAG Roadmap emphasizes three things: globally unique identifiers (GUIDs), validatable formats (specifically, XML), and formal language support (ontologies). To this list, we add the idea that annotations ("metadata") are key to the re-useability of phylogenetic results, as suggested in (ref: Linking Big). d123 1 a123 17 *(done)* Currently, most trees that appear in the published literature are accessible only in the form of an embedded graphical image, i.e., the published item is literally a picture of the tree, rather than the tree as an informational entity. For an electronic file, one typically must write to the authors. For trees to be re-usable, they must be accessible in a standard format that makes the structure of the tree explicit. There are a variety of data formats that do this (see Appendices 1 and 3). Some of these formats can be validated, i.e., they are defined by a schema to which every valid instance conforms. ---++++ 2. Formal Language Support Computable knowledge representation is largely a matter of relationships between entities that can be expressed as triples, e.g., "Bob has_friend Susan" or "Susan is_a female_person". By joining these two statements via the identity of Susan, we can conclude that Bob has a friend that is a female person.
Ontologies that formalize knowledge relations can get very complicated, but the reasoning used in data aggregation relies overwhelmingly on just two simple relationships: * the relationship of identity (this thing is the same as that thing) * the relationship of subsumption (this thing belongs to a certain class). For instance, to aggregate data on species occurrence, we need to know when a species seen in location X and in location Y is the same species. ---++++ 3. GUIDS (this section needs to continue with a brief discussion of GUIDs) ---++++ 4. Rich annotations ("metadata") *(done)* The importance of metadata ("annotations") can be illustrated with the following example of a tree in the most commonly encountered format, a "Newick" string with nested parentheses representing clades: d127 4 a130 1 The Newick tree cannot be re-used for any purpose because its citation information is not included with the tree and the labels are not intuitively readable. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. bushy, we can't use this particular tree because we can't tell whether it's a species tree (relevant to speciation), or some other kind of tree (irrelevant). This example suggests some obvious ways to make a tree re-useable, namely to provide citation information, and when appropriate, to provide taxonomic links or other identifiers for the "OTUs" (Operational Taxonomic Units) at the tips of the tree. We might be able to determine this by reading the original publication to find out what the labels ("otu1", etc) mean, but no citation information is included with the tree. To interpret this tree, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable thing. 
d132 4 a135 1 Another way to look at the problem of re-useability is to imagine that we have a database full of all published trees, richly annotated with the right kinds of metadata, and our challenge is to use this database to explore a basic research question, discover new relationships among types of data, or carry out a meta-analysis addressing a methodological issue. Types of data or metadata useful for such studies would include: d137 1 a137 1 * taxonomic links and species identifiers d143 10 d170 7 a176 1 ---+++ Formats d192 4 d208 1 a208 1 ---+++ Computable concepts (formerly "Standards") a212 1 * MIAPA (non-existent minimal reporting standard) d229 1 d297 4 d303 7 a386 7 ---+++ Minimal Information for a Phylogenetic Analysis (MIAPA) *(done)* Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). 
A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps. d685 1 @ 1.58 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1295327426" format="1.1" reprev="1.58" version="1.58"}% d7 1 a7 1 Jamie and Arlin plan to finish a draft report by mid-January. The goal is to have a report of 5 to 10 pages (not counting appendices) that will accompany the survey. The survey is for active researchers using phylogenies, so it's OK to rely on technical language. d9 20 a28 1 to do: a30 2 done: d40 1 a40 1 An assessment of best practices for publishing phylogenetic trees, whether for the strict purpose of archiving, or for the more general purposes of pushing richly annotated trees out into a worldwide web of interoperable data, is greatly needed. Such an assessment is timely given recent decisions by several journals to require the archiving of trees. However, even without that justification, several longer-term trends that favor publishing re-usable trees exist, especially re-usable trees that can be linked to other data: the opportunities for phylogeny re-use are greater; there are new opportunities for aggregation and integration; and both specific and general technologies have emerged that make sharing and re-using trees easier. This report summarizes an as-yet-incomplete project to perform this assessment and, in turn, suggest solutions for meeting recommendations and filling gaps in the current landscape.
d42 1 a42 1 The motivation for the report is that it will encourage the use, and further the development, of practices that will benefit scientists individually and collectively. This motivation is somewhat speculative. Archiving of results post-publication seems to benefit the scientific community, and to benefit the individual archiving scientist in the form of increased recognition. A less speculative motivation is that developing the capacity to manage richly annotated yet interoperable data benefits scientists (individually and collectively) by making it easier to carry out integrative, automated, or large-scale projects. d44 9 a52 5 For the purposes of this report, we have conducted an initial (and in some ways, incomplete) assessment of * relevant archiving policies of journals and funding agencies * file formats commonly used for representing phylogenies * relevant data and metadata standards * electronic archives suitable for storing phylogenies d54 1 a54 1 We considered the capacity to represent various kinds of metadata (annotations) crucial for ensuring that research results can be discovered, interpreted, linked (to other data) and re-used: d68 1 a68 3 Archiving of trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad). However, the archival value of many trees will be limited without a reporting standard specifying what types of metadata (annotations) should accompany a phylogeny report. An early call for a Minimal Information for a Phylogenetic Analysis (MIAPA) standard apparently led nowhere. While making trees archival is an important step forward for the phylogenetic community, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for most researchers to obtain. 
Before interoperability of richly annotated trees can be obtained, the research community must decide on the syntax and semantics to represent the metadata upon which the value of the data depend. d71 3 a73 5 *(done)* To ensure that the descriptions and recommendations here are accurate and relevant to the community of users, we are seeking feedback in several ways 1. we are targeting scientists (see Appendix 5 for strategy) with a survey to assess current practices and needs ([[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ][draft survey]]). 2. we provide a form for feedback on this page ([[#AddComments][below]]) d76 1 a76 1 ---+++ Rationale for archiving re-useable scientific data and metadata d78 3 a80 1 Re-use of accessible data is crucial to the progressive and self-policing nature of scientific inquiry. Thus, professional associations, publishers, and funding agencies recognize that "availability of the data underlying published scientific findings is essential to a healthy scientific process" (see Appendix 1). In the past, publication of a conventional scientific article was sufficient to satisfy most of the demand for accessible and re-usable data. This is no longer the case. Technological advances have made it easy to produce massive amounts of data, and to aggregate and synthesize diverse types of data from all over the globe. Many scientific reports depend on data too voluminous to publish in printed form. These conditions drive scientific communities to develop data archives and information standards, along with the cyber-infrastructure to support their use. d84 1 a84 3 The premise of archiving phylogenies is that the scientific community will benefit from having access to the multitude of trees generated every year. The analysis of citations by Kumar & Dudley (2007) suggests that the number of phylogeny publications in 2006 was 7000, and rapidly increasing. 
Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research. In the past, publishing a phylogenetic tree typically resulted in an informational dead-end: a picture of the tree embedded in a journal article. However, conditions have changed to favor publishing re-usable trees, especially ones that can be linked into a global web of data. Many journals that serve the phylogenetics community already require archiving of data, or soon will. While more trees are published every year, standards of quality and software sophistication have gone up enormously, so that a typical user finds the task of phylogeny inference daunting. Yet, because more trees are being made, it's more likely that a tree to serve a user's need already exists, if it can be discovered and obtained in a re-usable form. In addition, trees increasingly serve as inputs for larger projects to aggregate data, compare methods, or integrate phylogenies with other kinds of data. At the same time, standards and technologies are emerging to make it easier to share and re-use trees. d86 1 a86 3 ---+++ Rationale for this assessment *(done, but could be shorter)* In the past, publishing a phylogenetic tree typically resulted in an informational dead-end: a picture of the tree embedded in a journal article. This was understandable and acceptable to the extent that, in the past, research projects produced trees as final end-products, with each tree being unique and unlikely to be re-usable. However, as explained in more detail below, conditions have changed. d88 1 a88 1 The ultimate goal of our effort here is to encourage archiving practices that make trees more interoperable.
Rather than a dusty paper archive, we imagine a computable web of linked phylogenetic resources that can be explored and interrogated, and ultimately exploited for discovery and hypothesis-testing. d90 1 a90 1 As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. The scope of this report will extend, ideally, to all areas (systematics, phylogenetics, paleobiology, diversity studies, etc) where the archiving and re-use of trees is of interest to scientists. d92 1 a92 1 Such an assessment is timely for several reasons: d107 1 a107 1 What would make these trees re-useable, allowing that there may be many different categories of re-use (replication, meta-analysis, aggregation, integration)? Some guidance is provided by the considerations outlined in the 2008 roadmap of the TDWG Technical Architecture group. The TAG Roadmap emphasis 3 things: globally unique identifiers (GUIDs), validatable formats (specifically, XML), and formal language support (ontologies). d113 1 a113 2 ---++++ 2. Computable concepts d120 2 d124 1 a124 1 ---++++ 3. 
Rich annotations ("metadata") d214 2 @ 1.57 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1295326595" format="1.1" reprev="1.57" version="1.57"}% a10 1 * JW will create matrix of capabilities and limitations (Tuesday) d16 1 d86 4 d638 2 a639 1 %META:FILEATTACHMENT{name="Matrix_Limitations.png" attachment="Matrix_Limitations.png" attr="" comment="" date="1295326566" path="Matrix_Limitations.png" size="99905" stream="Matrix_Limitations.png" user="Main.JamieW" version="1"}% @ 1.56 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1295315511" format="1.1" reprev="1.56" version="1.56"}% d632 3 @ 1.55 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1295042481" format="1.1" version="1.55"}% d10 1 a10 1 * JW will write sections Analysis: Policies and Analysis: Standards (Wednesday) d142 1 a142 1 * [[http://www.phyloxml.org phyloXML]]- find out how much of attributes below can be represented d165 1 a165 1 ---+++ Computable concepts a316 3 ---+++ Minimal Information for a Phylogenetic Analysis (MIAPA) *(done)* Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. 
suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). d318 1 a318 3 As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps. ---+++ TDWG and Darwin Core d331 7 @ 1.54 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="JamieW" date="1294953486" format="1.1" reprev="1.54" version="1.54"}% d60 2 a61 2 (*done, but should be shortened*) Re-use of accessible data is crucial to the progressive and self-policing nature of scientific inquiry. Thus, professional associations, publishers, and funding agencies recognize that "availability of the data underlying published scientific findings is essential to a healthy scientific process" (see Appendix 1). In the past, publication of a conventional scientific article was sufficient to satisfy most of the demand for accessible and re-usable data. This is no longer the case, due to technological advances that make it easy to produce massive amounts of data, and to aggregate and synthesize diverse types of data from all over the globe. These conditions drive scientific communities to develop data archives and information standards, along with the cyber-infrastructure to support their use. 
d63 1 a63 1 This infrastructure has both post-publication and pre-publication benefits to researchers. Post-publication archiving of results increases the accessibility and exposure of a researcher's work. However, the same technologies and standards that facilitate effective archiving also enable researchers in the pre-publication stage to link results to related data, to avoid duplication of effort, to collaborate more effectively, and to pursue large-scale, integrative projects. Examples of large-scale projects that rely on meta-analysis or integration are assembling a tree of life representing all known species, or identifying vulnerable species by combining occurrence data, climate data, and phylogeny in a geographic framework. d65 1 a65 1 The premise of archiving phylogenies is that the scientific community will benefit from having access to the multitude of trees generated every year. The analysis of citations by Kumar & Dudley suggested that the number of phylogeny publications in 2006 was 7000, and rapidly increasing. Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research. d67 1 a67 3 In the past, publishing a phylogenetic tree typically resulted in an informational dead-end: a picture of the tree embedded in a journal article. This is understandable to the extent that, in the past, research projects produced trees as final end-products, with each tree being unique and unlikely to be re-usable. However, conditions have changed to favor publishing re-usable trees, especially ones that can be linked into a global web of data. Many journals that serve the phylogenetics community already require archiving of data, or soon will. 
While more trees are published every year, standards of quality and software sophistication have gone up enormously, so that a typical user finds the task of phylogeny inference daunting. Yet, because more trees are being made, it's more likely that a tree to serve a user's need already exists, if it can be discovered and obtained in a re-usable form. In addition, trees increasingly serve as inputs for larger projects to aggregate data, compare methods, or integrate phylogenies with other kinds of data. At the same time, standards and technologies are emerging to make it easier to share and re-use trees. d71 7 a77 1 The ultimate goal of our effort here is to make trees more interoperable. As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. This effort is timely for several reasons: d94 7 a100 1 ---++++1.5 GUIDs d102 3 a104 1 ---++++ 2. Rich annotations ("metadata") a119 7 ---++++ 3. Formal language support *(not done)* Formal language support is what makes clear (to a computer) the meaning of a token, or a string of tokens. This is done by associating it with a concept, that is then associated with other concepts by means of a class hierarchy and relations. In this context, the concept of a phylogenetic tree could be designated by reference to CDAO:Tree, which in turn defines a tree as a sub-class of "Network", and in relation to other concepts like "Branch" and "Node". The Comparative Data Analysis Ontology (CDAO) provides language support for many aspects of comparative analysis, though it has not been tested extensively and remains experimental. DarwinCore also provides language support for several key concepts useful in representing comparative data. A major gap involves language support for annotating workflows, i.e., for describing the sequence of operations used by a scientific researcher to generate a phylogeny product.
Currently there is an experimental branch of CDAO that imports terms for a variety of computation concepts like "multiple sequence alignment program" as well as specific versions (e.g., "ClustalW"). d150 2 d165 1 a165 3 ---+++ Standards File formats a166 1 Ontologies and data standards d173 3 d210 6 a215 1 *The needs of archiving are not the same as those of publishing linkable, re-usable data.* In a typical case of archiving, even the simplistic Newick format can be used to represent a phylogeny with metadata if there is a combination of 1) a Newick string with unique identifiers for each internal and external node; and 2) an entity-attribute-value table assigning attribute values to nodes, as in the table used in Appendix 2. While this combination may ensure that key information is available in a record, it does not go very far toward facilitating re-use, re-purposing, aggregation, or linking of the tree. Ensuring that research results persist informationally does not ensure that they will be re-used effectively. a216 1 Therefore we address archiving and linking separately below. d220 1 a220 1 *While archiving trees is possible, the extent to which current policies require it is unclear*. The draft policy suggested recently by journals (Appendix 1) refers to "data supporting the results" of a publication. It is not clear what this is intended to cover. Natural scientists traditionally use the term "data" as a synonym for "facts", the empirical observations or measurements on which further analysis rests. By this definition, phylogenies are not data. Computer and information scientists use "data" (and in some sub-fields, even "facts") to refer to any kind of recorded information, regardless of its nature or derivation. The NIH data sharing policy (see note 7 of [http://grants.nih.gov/grants/policy/nihgps/fnpart_ii.htm]) makes clear that NIH uses the informational definition of "data", not the empirical one. 
We recommend that institutions with data archiving policies be explicit about what they mean by "data". a221 2 *Currently there is no reporting standard for a phylogenetic analysis*. A minimal reporting standard for phylogenetic analysis has been suggested (see MIAPA, Appendix 1) but never drafted or approved. Establishing such a standard would be a critical step in promoting re-useable trees. To develop a standard would require community organization as well as technological support (some guidance is provided by a [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper MIAPA whitepaper]] from the NESCent EvoInfo working group). d225 4 d250 2 d337 1 a337 1 A set of test files may be used to illustrate the representation capabilities (and limitations) of different formats. The files are available for download and analysis here: d345 3 a347 1 The sample files represent data for a set of 4 cytochrome C sequences (from PFAM family PF00034), using the tree d352 1 a352 1 and the following associated data: a357 1 Note that this is not a species tree, but a gene tree, and that there are two different sequences from the same rat species. d382 1 a382 1 * deprecated in favor of phyloXML, which covers the same ground with validatable syntax d397 1 a397 1 ---+++ NEXUS ( d448 2 d482 1 a482 1 * designed with extensive capabilities for representing character data d552 1 a552 1 1. After upload, click on yellow taxon button. Then click . Tree base tries to match up labels with existing taxon names. If not, checks uBio. If name may be a homonym – will be asked to choose which taxon map link to. NCBI handles the homonyms. [[http://www.treebase.org TreeBase]] will link to taxon names to a GenBank taxid if possible. d555 1 a555 1 1. After uploading matrix click . There is a list of row labels to populate. You can enter Darwin Core information about the specimen. d570 1 a570 1 The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows a taxon table with match-able names. 
Pressing the "Validate taxon labels" button will automatically apply the results of name-matching, which in this case gives the correct attributions:
d595 1 a595 1 Unfortunately the Genbank Accession numbers are not yet included, pending a decision on how to represent these. @ 1.53 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1294929979" format="1.1" reprev="1.53" version="1.53"}% d11 1 a12 2 * JW will present the report and get feedback (Thursday) * AS and JW will talk on Thursday Jan 13 at 3:00 d18 2 d23 1 a23 1 This report summarizes an as-yet-incomplete project to assess best practices for publishing phylogenetic trees, whether for the strict purpose of archiving, or for the more general purposes of pushing richly annotated trees out into a worldwide web of interoperable data. Such an assessment is timely given recent decisions by several journals to require the archiving of trees. However, even without that justification, there are several longer-term trends that favor publishing re-usable trees, especially ones that can be linked to other data: the opportunities for phylogeny re-use are greater; there are new opportunities for aggregation and integration; and both specific and general technologies have emerged that make sharing and re-using trees easier. d40 4 a43 3 * The needs of archiving are not the same as those of publishing linkable, re-usable data. * The infrastructure to support simple archiving of phylogenetic trees is largely in place already. * The extent to which _data_ archiving policies require archiving of phylogenetic trees is unclear. a44 1 * Currently there is no reporting standard for a phylogenetic analysis d47 1 a47 1 Archiving of trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad). However, the archival value of many trees will be limited without a reporting standard specifying what types of metadata (annotations) should accompany a phylogeny report. An early call for a "MIAPA" standard apparently led nowhere. 
d54 2 a55 2 * we are targeting scientists (see Appendix 5 for strategy) with a survey to assess current practices and needs ([[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ][draft survey]]). * we provide a form for feedback on this page ([[#AddComments][below]]) d61 1 a61 1 Re-use of accessible data is crucial to the progressive and self-policing nature of scientific inquiry. Thus, professional associations, publishers, and funding agencies recognize that 'availability of the data underlying published scientific findings is essential to a healthy scientific process' (see Appendix 1). In the past, publication of a conventional scientific article was sufficient to satisfy most of the demand for accessible and re-usable data. This is no longer the case, due to technological advances that make it easy to produce massive amounts of data, and to aggregate and synthesize diverse types of data from all over the globe. These conditions drive scientific communities to develop data archives and information standards, along with the cyber-infrastructure to support their use. d65 1 a65 1 The premise of archiving phylogenies is that the scientific community will benefit from having access to the trees generated every year. The number of such trees is quite large. The analysis of citations by Kumar & Dudley suggested that the number of phylogeny publications in 2006 was 7000, and rapidly increasing. Experts in phylogenetic analysis typically generate hundreds or thousands of trees for every tree that is published. Thus, it is likely that, each year, many millions of trees are generated in association with published research. d71 11 d90 2 d98 1 a98 1 This tree cannot be re-used for any purpose. Even if our goal is to explore models of speciation, and we wish only to measure whether the topology of the tree is ladder-like vs. 
bushy, we can't use this particular tree because we can't tell whether it's a species tree (relevant to speciation), or some other kind of tree (irrelevant). We might be able to determine this by reading the original publication to find out what the labels ("otu1", etc) mean, but no citation information is included with the tree. To interpret this tree, or to integrate it with other information, we would need to link it with other information, but we can't, because it does not refer to any identifiable thing. d100 1 a100 1 This example suggests some obvious ways to make a tree re-useable, namely to provide citation information, and when appropriate, to provide taxonomic links or other identifiers for the "OTUs" (Operational Taxonomic Units) at the tips of the tree. Another way to look at the problem of re-useability is to imagine that we have a database full of all published trees, richly annotated with the right kinds of metadata, and our challenge is to use this database to explore a basic research question, discover new relationships among types of data, or carry out a meta-analysis addressing a methodological issue. Types of data or metadata useful for such studies would include: d108 1 a108 1 ---++++ 3. Formal language support. d113 1 a113 11 ---+++ Rationale for this assessment *(done, but could be shorter)* The ultimate goal of our effort here is to make trees more interoperable. As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. 
This effort is timely for several reasons: * While in the past, many scientists felt no incentive to share data, recent research has shown that making data available in public archives increases citations (ref: Piwowar, research remix), widely understood as an indicator of professional success; * In early 2010, eight journals in evolution and systematics announced plans to implement a data-archiving policy (see Appendix 1); * The only major electronic repository of trees, TreeBase, has recently completed a major upgrade of features, including its submission process; in 2009, a data archive called Dryad was launched and will accept various kind of electronic files, including those with trees; * NSF has recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals (see Appendix 1). Thus, scientists will be motivated by funding agencies to share data electronically; * In recent years, phyloinformatics researchers have been developing supporting technologies to enable interoperability, including XML file formats (NeXML, PhyloXML), an ontology (CDAO) and a web-services standard (PhyloWS). Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies are emerging to make it easier. d119 1 a119 1 *Some evolution-related journals.* In early 2010, the editorial boards of 8 journals (Evolution, Molecular Biology and Evolution, American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Heredity, and Evolutionary Applications) announced plans for a joint data archiving policy. This is a minority of the journals that regularly publish phylogenetic trees (other examples would be Systematic Biology, Molecular Phylogenetics, and so on). 
The policy (to be developed at a later date) would require "that data supporting the results in the paper should be archived in an appropriate public archive" to ensure that the data are "preserved and usable for decades in the future". The policy does not make clear whether phylogenetic trees would be considered "data supporting the results in the paper" (which is oddly phrased-- shouldn't it refer to data supporting the _conclusions_ of the paper?). See Appendix 1 for details. d121 1 a121 1 *NSF.* In the US, NSF is the major funder of evolutionary science. As described in Appendix 1, NSF guidelines call for proposals to include a “Data Management Plan” to describe how the proposal will conform to NSF policy on the dissemination and sharing of research results, including what types of data will be produced, "the standards to be used for data and metadata format and content", and plans "for preservation of access" to the data. The policy does not specify any particular standards, but merely calls on researchers to address this issue. d123 8 a130 1 *TDWG.* TDWG is an organization. TDWG has standards, but they do not have carrots or sticks like NSF and publishers. Darwin core and LSIDs are TDWG-approved standards. d133 1 a133 1 *(not done)* This should be a summary of what is in appendix 2, focusing on what the formats can represent in terms of useful metadata. d138 1 a138 1 * [[http://www.nexml.org NeXML]] - write a description of how to represent LSID, GenBank accn, geo coordinates d183 1 a183 1 For instance, most phylogenetics users implement a customized interactive workflow that relies on diverse software tools. Because these tools may use different formats, the ability to convert among formats is important. However, convenient generalized tools are lacking (the NeXML manual provides a useful list of online servers and scripting approaches). 
Likewise, there are many tools for viewing trees, but it seems that only a few allow for viewing trees together with a matrix of data (e.g., Archaeopteryx, Mesquite, Nexplorer). Support for viewing metadata is very limited. The TreeBASE submission server is an example of a tool that allows users to annotate data sets by associating OTUs with species names, and by associating data rows with accessions. However, this tool only works in the context of a TreeBASE submission. d211 1 a211 1 *Currently there is no reporting standard for a phylogenetic analysis*. A minimal reporting standard for phylogenetic analysis has been suggested (see MIAPA, appendix 1) but never drafted or approved. Establishing such a standard would be a critical step in promoting re-useable trees. To develop a standard would require community organization as well as technological support (some guidance is provided by a [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper MIAPA whitepaper]] from the NESCent EvoInfo working group). d216 1 a216 1 1. lack of resolvable lsids; lack of a validator to see if species refs are resolved d227 1 a227 1 ---++ Author contribution and other acknowledgements d240 1 a240 1 ---++ Appendix 1. Relevant standards d246 2 d285 1 a285 1 ---+++ Dublin core d287 1 a287 1 there isn't a standard for encoding dublin core publication data in XML. in particular, there isn't an enclosing element. In NeXML it would be "meta". DC isn't very well suited to journal articles, anyway. too bad. 
the best attempt I've seen (http://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/) goes like this: d302 1 a302 1 ---+++ MIAPA d321 1 a321 1 ---++ Appendix 2: Toy data set rendered in different formats d327 1 a327 1 * [[%ATTACHURL%/PF00034_4_phylo.xml][PF00034_4_phylo.xml]]: 4-taxon test case in [[http://www.phyloxml.org phyloXML]] format d330 1 a330 1 The toy files represent data for a set of 4 cytochrome C sequences (from PFAM family PF00034), using the tree d381 1 a381 1 ---+++ NEXUS @ 1.52 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1294856094" format="1.1" reprev="1.52" version="1.52"}% d22 1 a22 1 This report summarizes a project to asses best practices for publishing phylogenetic trees, whether for the strict purpose of archiving, or for the more general purposes of pushing richly annotated trees out into a worldwide web of interoperable data. Such an assessment is timely given recent decisions by several evolution journals to require the archiving of trees. However, there are several longer-term trends that favor publishing re-usable trees, especially ones that can be linked to other data: the opportunities for phylogeny re-use are greater, along with new opportunities for aggregation and integration; the costs of generating a new phylogeny are higher; and both specific and general technologies have emerged that make sharing and re-using trees easier. d24 1 a24 1 The motivation for the report is that it will encourage the use, and further the development, of practices that will benefit scientists individually and collectively. This motivation is somewhat speculative. At present, it seems likely-- though the evidence is relatively narrow-- that archiving of results post-publication benefits the scientific community, and that this benefit returns to the archiving scientist in the form of increased recognition. 
A less speculative motivation is that developing the capacity to manage richly annotated yet interoperable data benefits scientists (individually and collectively) by making it easier to carry out integrative, automated, or large-scale projects. d172 1 a172 1 For instance, most phylogenetics users implement a customized interactive workflow that relies on diverse software tools. Because these tools may use different formats, the ability to convert among formats is important. However, convenient generalized tools are lacking. Likewise, there are many tools for viewing trees, but it seems that only a few allow for viewing trees together with a matrix of data (e.g., Archaeopteryx, Mesquite, Nexplorer). Support for viewing metadata is very limited. The TreeBASE submission server is an example of a tool that allows users to annotate data sets by associating OTUs with species names, and by associating data rows with accessions. However, this tool only works in the context of a TreeBASE submission. a174 1 *(not done)* This is going to be a short section, because current practices are rudimentary. d176 13 a188 4 * Archiving at journal web sites -- most trees archived this way? * [[http://www.treebase.org TreeBase]], has a submission process (see Appendix 3) * [[http://datadryad.org Dryad]], has been up for a year (does it have submissions with trees?) @ 1.51 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1294766619" format="1.1" reprev="1.51" version="1.51"}% d11 1 a11 1 * AS will write sections Analysis: Archives, Tools and Current Practices (Wednesday) d18 1 d137 1 a137 1 Currently the research community makes use of two repositories for archiving phylogenetic trees, Dryad and TreeBASE. d139 7 a145 11 Dryad * Dryad project aims * Dryad scope * upload process * current usage TreeBASE * project aims * scope * upload process * current usage d161 13 d218 2 d526 1 a526 1 One of us (AS) worked with Dr. 
Martin Wu to submit data from a large, recently published analysis of prokaryotic phylogeny (Wu, et al., 2010). The data consist of a 720-taxon tree, a 6309-column alignment, and metadata (citation data, analysis methods) added interactively during the submission process. Prior to submission, AS spent several hours to generate matching labels so that the separate alignment and tree files (initially with non-matching names) could be combined in Mesquite or Bio::NEXUS. This is a common stumbling block in phylogenetics workflows. Dr. Wu spent an hour on the submission process itself, though this stretched out over several weeks while a syntax issue due to differing interpretations of NEXUS was resolved via email, with help from Dr. Piel (initially, we encoded names as 'Genus_species_strain', based on the equivalence of spaces and underscores in NEXUS names; however, protecting the underscores within a single-quoted phrase prevented them from being treated as spaces by the TreeBASE NEXUS parser). When this minor syntax issue was resolved, TreeBASE automatically matched all 720 OTU names to qualified species names. The report was submitted and now appears as TreeBASE study [[http://www.treebase.org/treebase-web/search/study/summary.html?id=10965 10965]]. Dr. Wu reports that making the submission to TreeBASE was "definitely worth it". He said he was contacted with requests for the data 3 times in the 11 months since the paper was published. d558 1 a558 11 ---++ Appendix 4: tools and tips *Format translation* Most phylogenetics users implement a customized interactive workflow that relies on diverse software tools. Because these tools may use different formats, the ability to convert among formats is important. Useful information on format translation, including some addresses of online servers, as well as instructions for using programming libraries in Perl and Python, can be found in the [http://www.nexml.org/manual NeXML manual]. 
In general, these tools offer translation of data and labels, but not metadata. *Visualization* Many tools for this. Most of them take Newick input and show only the tree, not character data or metadata. For viewing character data and metadata, Archaeopteryx, Mesquite and Nexplorer are recommended. *Databasing* Users who deal with many trees, from multiple projects, have a need for storing large numbers of trees, for querying and retrieving the trees based on their properties, and for revising and updating data. These are databasing issues. We are not aware of richly featured systems that users may implement to manage their trees. The "project" concept in Mesquite allows storage of complex data sets with multiple trees and matrices. *Manipulation (business logic)* Mesquite. Programming libraries ---++ Appendix 5: Survey and user feedback. @ 1.50 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1294691356" format="1.1" version="1.50"}% d5 14 d136 14 d153 1 d198 1 a198 1 ---++ Acknowledgements d200 5 a204 1 I'd like to see a PLOS-style statement of author contributions. @ 1.49 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1290623867" format="1.1" version="1.49"}% d42 1 a42 1 ---++ Introduction d96 1 a96 1 ---++ Relevant standards d120 6 a125 1 ---+++ Ontologies and data standards d132 4 a135 1 ---++ Current practices d169 6 @ 1.48 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1290613644" format="1.1" version="1.48"}% a54 2 d81 3 a83 1 Formal language support is what makes clear (to a computer) the meaning of a token, or a string of tokens. In common uses of natural language, much of the meaning of terms comes from context. A Drosophila researcher will frequently use the term "fly", and in context, listeners will understand that the term "fly" refers to a dipteran insect, and frequently to the particular insect species, Drosophila melanogaster. 
An airline pilot also will use the term "fly" frequently, but with an entirely different meaning. The way that this problem is solved in the knowledge representation (KR) world is to assign the term a unique id within a namespace, and define it within that namespace. In the "transportation terms" namespace, "fly" means one thing. In the "biology terms" namespace, it means another. To make it clear, a computer file would refer to transportation_terms:fly or biology_terms:fly. Going further, the file might identify the namespace by a URL source, thus transportation_terms = "http://www.example.org/transportation/1.0/transportation", where this URL points to a transportation ontology that defines "fly" and other terms. The specific term would be http://www.example.org/transportation/1.0/transportation#fly, or simply transportation_terms:fly. Because the token "fly" might have two distinct meanings even within a domain, it would be wise to refer to it by an identifier that is guaranteed to be unique, e.g., biology_terms:7384 might refer to "fly" in the sense of "diptera", and biology_terms:2234 might refer to "fly" in the sense of aerial mobility. a84 1 more @ 1.47 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1290189722" format="1.1" version="1.47"}% d468 1 a468 1 One of us (AS) worked with Dr. Martin Wu to submit data from a large, recently published analysis of prokaryotic phylogeny (Wu, et al., 2010). The data consist of a 720-taxon tree, a 6309-column alignment, and metadata (citation data, analysis methods) added interactively during the submission process. Prior to submission, AS spent several hours to generate matching labels so that the separate alignment and tree files (initially with non-matching names) could be combined in Mesquite or Bio::NEXUS. This is a common stumbling block in phylogenetics workflows. Dr. 
Wu spent an hour on the submission process itself, though this stretched out over several weeks while a syntax issue due to differing interpretations of NEXUS was resolved via email, with help from Dr. Piel (initially, we encoded names as 'Genus_species_strain', based on the equivalence of spaces and underscores in NEXUS names; however, protecting the underscores within a single-quoted phrase prevented them from being treated as spaces by the TreeBASE NEXUS parser). When this minor syntax issue was resolved, TreeBASE automatically matched all 720 OTU names to qualified species names. Dr. Wu reports that making the submission to TreeBASE was "definitely worth it". He said he was contacted with requests for the data 3 times in the 11 months since the paper was published. @ 1.46 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1290112594" format="1.1" version="1.46"}% d7 1 a7 1 In the past, publishing a phylogenetic tree has most often taken the form of publishing a picture of a tree (a graphic image) in a scientific paper. That publishing a tree should result in such an informational dead-end (from which further results cannot be computed) is understandable given that, in the past, trees were treated as the end-product of a small customized project, rather than the beginning of a larger project to aggregate phylogenies, evaluate methods, or integrate phylogenetic results with other kinds of data. However, the scientific environment is shifting in ways that favor publishing trees that can be linked into a global web of data. At the same time, standards and technologies are emerging to make this process easier. d9 1 a9 1 This report summarizes a project to assess current best practices for publishing phylogenetic trees, whether for the strict purpose of archiving, or for the more general purposes of pushing interoperable trees out into a worldwide web of interoperable data. 
The motivation for the report is that establishing effective practices in this area will benefit the scientific community as well as individual scientists. Archiving of results post-publication benefits the scientific community, and this benefit returns to the archiving scientist in the form of increased recognition; the capacity to manage richly annotated yet interoperable data benefits both the scientific community and the individual scientist, by making it easier to carry out integrative, automated, or large-scale projects. d11 1 a11 1 For the purposes of this report, we have conducted an initial (and in some ways, incomplete) review of d51 6 @ 1.45 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1290097307" format="1.1" reprev="1.45" version="1.45"}% d462 1 a462 1 One of us (AS) worked with Dr. Martin Wu to submit data from a large, recently published analysis of prokaryotic phylogeny (Wu, et al., 2010). The data consist of a 720-taxon tree, a 6309-column alignment, and metadata (citation data, analysis methods) added interactively during the submission process. Prior to submission, AS spent several hours to generate matching labels so that the separate alignment and tree files (initially with non-matching names) could be combined in Mesquite or Bio::NEXUS. This is a common stumbling block in phylogenetics workflows. Dr. Wu spent an hour on the submission process itself, though this stretched out over several weeks while a syntax issue due to differing interpretations of NEXUS was resolved via email, with help from Dr. Piel (initially, we encoded names as 'Genus_species_strain' on the grounds that the NEXUS standard allows spaces and underscores to be interchangeable in names; however, protecting the underscores within a single-quoted phrase prevented them from being treated as spaces by the TreeBASE NEXUS parser). When this minor syntax issue was resolved, TreeBASE automatically matched all 720 OTU names to qualified species names. 
Dr. Wu reports that making the submission to TreeBASE was "definitely worth it". @ 1.44 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289933556" format="1.1" version="1.44"}% d77 1 a77 1 We need to describe the concept of formal language support to assign classes and relationships. GUIDs and ontologies. d79 1 d427 1 a427 1 [http://www.datadryad.org Dryad] is a project to support data archiving for evolutionary research. The organizers of this project worked with publishers to generate agreement on the provisional data archiving policy noted above. The archive will accept text files and spreadsheet files in standard formats. Thus, users could submit a phylogenetic tree in any of the formats noted above. Whereas TreeBASE has a complex internal data model, with each submitted datum being assigned to some slot in the data model, Dryad will accept all sorts of textual information. To allow for query and retrieval, Dryad will index all of this information as text. d438 1 d462 1 a462 1 One of us (AS) worked with Dr. Martin Wu to submit a large, recently published analysis of prokaryotic phylogeny (Wu, et al., 2010). The reuseable data from the study consists of a 720-taxon tree, a 6309-column alignment, and metadata (citation data, analysis methods) added interactively during the submission process. Prior to submission, considerable effort was expended to generate a matching set of labels so that the separate alignment and tree files could be combined. The elapsed time for the submission process was several weeks (while email communication took place), but the actual time spent on submission was rather small. This may have been due partly to getting personal help from Dr. Piel. 
During the first attempt at submission, the species names were not recognized from the OTU names due to the way in which we had encoded OTU names (with both underscores-in-place-of-spaces and with single quotes: the NEXUS standard allows spaces and underscores to be interchangeable in names, but if underscores are protected with single-quotes, they are treated literally as part of the name). However, in the second attempt, all 720 OTU names were matched to qualified species names. @ 1.43 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1289831445" format="1.1" version="1.43"}% d460 2 @ 1.42 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289829499" format="1.1" version="1.42"}% a39 1 * we accept long comments emailed to an.address [at] geebung.id.au [Add real address here] @ 1.41 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289572435" format="1.1" version="1.41"}% d511 2 a512 1 * email lists from past NESCent phyloinformatics activities d516 1 @ 1.40 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289511429" format="1.1" version="1.40"}% d514 1 @ 1.39 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289495457" format="1.1" reprev="1.39" version="1.39"}% d38 1 a38 1 * we are targeting scientists (via scientific email lists) with a survey to assess current practices and needs ([[https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ][draft survey]]). d502 1 a502 1 Once the initial report is released, a survey ([https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ preliminary draft]) will be sent to scientists. This appendix is to summarize the feedback from the survey as well as from the "comment" box on this page, and any other comments received. 
d504 16 @ 1.38 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289487391" format="1.1" version="1.38"}% d46 1 a46 1 Re-use of accessible data is crucial to the progressive and self-policing nature of scientific inquiry. Thus, professional associations, publishers, and funding agencies recognize that ''availability of the data underlying published scientific findings is essential to a healthy scientific process'' (see Appendix 1). In the past, publication of a conventional scientific article was sufficient to satisfy most of the demand for accessible and re-usable data. This is no longer the case, due to technological advances that make it easy to produce massive amounts of data, and to aggregate and synthesize diverse types of data from all over the globe. These conditions drive scientific communities to develop data archives and information standards, along with the cyber-infrastructure to support their use. d138 1 a138 1 *While archiving trees is possible, the extent to which current policies require it is unclear*. The draft policy suggested recently by journals (Appendix 1) refers to "data supporting the results" of a publication. Natural scientists use the term "data" differently than computer scientists. The conclusions of a paper may depend on an inferred phylogeny. The encoded structure and properties of the phylogeny clearly represent _information_. However, they do not represent _data_ in the sense usually used by scientists. Institutions with archiving policies may wish to clarify how the policy applies to phylogenies. d140 1 a140 1 *Currently there is no reporting standard for a phylogenetic analysis*. A minimal reporting standard for phylogenetic analysis has been suggested but never drafted or approved. d161 1 a161 1 (see the [http://en.wikipedia.org/wiki/Data_sharing Data Sharing] article on wikipedia for references to data sharing policies in the US). 
Authors of scientific studies often are required (as a condition of funding or of publication) to make such results available to the research community without restriction. d163 1 a163 1 ---++++ Evolution Journals d165 2 @ 1.37 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289423031" format="1.1" version="1.37"}% d7 1 a7 1 In the past, publishing a phylogenetic tree has most often taken the form of publishing a picture of a tree (a graphic image) in a scientific paper. This is understandable given that, in the past, trees were treated as the end-product of a small customized project, rather than the beginning of a larger project to aggregate phylogenies, evaluate methods, or integrate phylogenetic results with other kinds of data. However, the scientific culture-- including funding agencies and publishers-- is shifting in ways that favor publishing trees that can be linked into a global web of data. At the same time, standards and technologies are emerging to make this process easier. d9 1 a9 1 This report summarizes a project to assess current best practices for publishing phylogenetic trees, whether for the strict purpose of archiving, or for the more general purposes of achieving data interoperability. The motivation for the report is that establishing effective practices in this area will benefit the scientific community as well as individual scientists. Archiving of results post-publication benefits the scientific community, and this benefit returns to the archiving scientist in the form of increased recognition; the capacity to manage richly annotated yet interoperable data benefits both the scientific community and the individual scientist, by making it easier to carry out integrative, automated, or large-scale projects. 
d17 1 a17 1 We considered the capacity to represent various kinds of metadata (annotations) crucial for ensuring that research results can be discovered, interpreted, linked and re-used: @ 1.36 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289414273" format="1.1" reprev="1.36" version="1.36"}% d7 1 a7 1 Scientists have been storing and visualizing electronic representations of phylogenetic trees for decades. Typically these trees are treated as the endpoint of a project, not the beginning of a larger project to aggregate phylogenetic data, compare phylogenetic methods, or integrate phylogenetic results with other kinds of data. In the past, publishing a tree has most often taken the form of publishing a scientific paper that includes a graphic image of a tree. d9 1 a9 1 However, the scientific culture-- including funding agencies and publishers-- is shifting in ways that favor publishing phylogenetic trees that can be linked into a global web of data. At the same time, standards and technologies are emerging to make this process easier. Archiving of post-publication data benefits the scientific community, and this benefit returns to the submitting scientist in the form of increased exposure. The interoperability of pre-publication and post-publication data benefits the scientific community and the individual scientist, by making it easier to carry out integrative, automated, or large-scale projects. d11 1 a11 1 The aim of this report is to assess current best practices for publishing phylogenetic trees, whether for the strict purpose of archiving, or for the more general purposes of data interoperability. For this report, we have conducted an initial (and in some ways, incomplete) review of d17 15 a31 1 We find that archiving of trees is technically feasible given current formats, and using currently available archives (TreeBASE and Dryad). 
However, the archival value of many trees will be limited without a reporting standard specifying what types of metadata (annotations) should accompany a phylogeny report. d33 1 a33 1 While making trees archival is an important step forward for the phylogenetic community, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for most researchers to obtain. *more explanation needed here* @ 1.35 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289404822" format="1.1" reprev="1.35" version="1.35"}% d7 1 d9 1 a9 1 Recent announcements from major funding agencies and from journals in evolution and systematics (Appendix 1) will require online publication of trees in reusable formats. Increasingly, phylogenies are seen not just as an endpoint – presenting the inferred relationships between a set of organisms or taxa – but as the starting point for a range of further research. This research will almost always involve linking with other data. d11 5 a15 1 Phylogenetic trees are represented in a variety of electronic formats, but when published, appear most frequently as a graphical image. While making the tree accessible online in a standard format would be a major step forward, re-usability of trees depends on several other conditions that, for the foreseeable future, will be difficult for many researchers to obtain. d17 1 a17 3 Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data will progress rapidly when it is: * easy for users to put this information into their trees via appropriate software; and * considered standard good practice, and beneficial to the creators of the trees, to include linkable information. d19 1 a19 1 Thus, the goal of this report is to identify data formats, software and work procedures to deliver reusable, linkable trees. 
We hope to provoke discussion leading to workable and widely accepted solutions to this problem. d68 1 a68 3 The ultimate goal of our effort here is to make trees more interoperable. We believe that if the forest of trees produced by researchers each year were computationally accessible, the scientific community would have a much greater capacity to validate and extend phylogeny-based research. The benefits of linked data have been discussed elsewhere (http://www.taxonconcept.org/taxonconcept-blog/2010/8/5/why-linked-open-data-makes-sense-for-biodiversity-informatic.html). As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate phylogenetic users, and to identify strengths and weaknesses. This effort is timely for several reasons: @ 1.34 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289323624" format="1.1" version="1.34"}% d3 1 d5 2 d8 1 a8 1 ---+ TDWG2010 phylogenetic standards working group project: Linking Trees d10 1 a10 2 ---++ Project Overview At the TDWG meeting in Woods Hole, a small group (Dan, Arlin, Jamie, Torsten, Elena) decided to work out current best practices for publishing a tree electronically, via an archive such as [[http://www.treebase.org TreeBase]] or [[http://datadryad.org Dryad]], or via the semantic web (presumably using CDAO). After the meeting, this evolved into a more specific plan (see below) to produce an initial report, solicit feedback, and then generate a more extensive report for publication. 
d12 155 a166 46 ---+++ Contributors * Arlin Stoltzfus * Dan Rosauer * Torsten Eriksson * Bill Piel provided instructions for TreeBase and examples * Christian Zmasek provided feedback on phyloXML and NHX * Rutger Vos provided feedback on nexml encodings * Todd Vision provide some suggestions about using Dryad Others who have expressed interest in being kept in the loop * Nico Franz (interested in: ) * Jamie Whitacre (interested in: technical writing, survey analysis, publishing needs for large-scale phylogeny producers) ---+++ work plan This plan is continually under revision 1. Finish TDWG phylogenetics standards report (target: *October 29, 2010*) * decide on form and scope for TDWG report (can be narrower than eventual publication) * Write up rationale for these approaches focusing on integration (*done*) * Work up a toy case of the same 4 taxon tree in 5 different formats (*done*) * Explore options for embedded LSIDs and technology to resolve them _Dan_ * Investigate [[http://datadryad.org Dryad]] as an alternative archive _Dan_ 1. Release TDWG report, obtain feedback from developers & stakeholders (target: *November 15*) * develop and refine survey instrument (google docs) * evoldir and tdwg lists * syst biol, cipres (other projects, journals, communities?) * developers of relevant software & standards (Christian, Rutger, David & Wayne) * archive developers (Bill, Todd) 1. As needed, explore additional issues, including issues arising from stakeholders (*October*, *November*) * Novice workflows for users to get their data into standard formats * tools to annotate and edit files to make the information more linkable * tools that illustrate cool things that can be done with properly formed input files * TOLWEB as an alternative destination for published trees _Arlin_ 1. 
Write a manuscript for publication (target: *November 30*) * recruit additional authors as appropriate * estimate number of users and trees per year using citations or other means ---++ Preliminary Report The [[http://wiki.tdwg.org/twiki/bin/view/Phylogenetics/LinkingTreesReport1 Preliminary Report]] has been moved to a separate page. We should put an executive summary here, though. ---++ Appendix 2. Data archives ---+++ generic data stores to be done ---+++ Dryad Todd Vision writes d168 2 a169 1 Since Dryad is a general-purpose repository, it doesn't impose any constraints on how the data are represented within the files that users submit. The best practices need to come from elsewhere, such as journal policies, MIAPA (which would be nice to give some attention to in that paper), and community practice imposed by awareness of how the data will be reused by more specialized phylogenetic tools. d171 10 a180 3 In case you aren't aware, Dryad just introduced a "handshaking" feature for TreeBASE. Users can elect to have a NEXUS file that is deposited to Dryad "pushed through" to TreeBASE to initiate the submission process there. So for the special case of phylogenetic data in Dryad, we would encourage having that Newick tree within a NEXUS file, together with the OTU metadata that can fit within that file format. I dream of a future in which lots of different software tools will support the editing and output of metadata-rich phylogenies in NeXML, and that TreeBASE can ingest those NeXML files. But we aren't there yet. If a user doesn't intend to use TreeBASE for whatever reason, then a Newick tree in one file and OTU metadata in a separate CSV file would be a reasonable low-tech solution, as long as the OTU identifiers were consistent between the files. A ReadMe file could also be used to provide study-level metadata. a181 31 ---+++ TreeBASE ---++++ Uploading a tree to [[http://www.treebase.org TreeBase]] (notes from chat with Bill Piel 9/29/10) 1. 
Use Mesquite to prepare document before uploading to [[http://www.treebase.org TreeBase]] * Why? Because 1) TreeBASE and Mesquite use the same Java API for parsing NEXUS; and 2) this API is a relatively complete and robust implementation of the standard 1. In Mesquite, best to ensure matrix and tree are in the same file to ensure no mismatch 1. Ensure taxon names are written out in full as binomial or trinomial * If there are infraspecies, just write the triplet without ‘var’, ‘subsp’ etc * What if there are multiple specimens for the same taxon? Each name must be unique, so make sure the specimen ID etc, is a suffix formatted with a leading capital or a number so [[http://www.treebase.org TreeBase]] won’t treat it as a new taxon name 1. After upload, click on yellow taxon button. Then click . Tree base tries to match up labels with existing taxon names. If not, checks uBio. If name may be a homonym – will be asked to choose which taxon map link to. NCBI handles the homonyms. [[http://www.treebase.org TreeBase]] will link to taxon names to a GenBank taxid if possible. 1. Create an analysis record to link the matrices to the trees. 1. Linking to specimen IDs – eg genbank accession?This is done by setting attributes of rows in the matrix: 1. After uploading matrix click . There is a list of row labels to populate. You can enter Darwin Core information about the specimen. 1. There is a bug – if some rows are populated for a given column, all rows must be populated for that column. There is an error if left blank. So just put something there - for example a dash ‘-’ 1. You could apply this metadata to just a part of the alignment 1. Not sure that all of this metadata is included in [[http://www.nexml.org NeXML]] Can't attach metadata to the tree nodes— these data derive from the matrix and thus are included there. 
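The label and row-metadata rules in the notes above lend themselves to a quick check before upload. The following is a minimal Python sketch, not part of TreeBASE or Mesquite: the function names (=find_duplicate_labels=, =fill_blank_cells=) and the table layout (a list of dictionaries keyed by column name) are hypothetical, chosen only for illustration. It flags OTU labels that are not unique, and it fills blank cells with a dash in any column that is populated for some rows but not all, which is the workaround suggested above for the row-metadata bug.

```python
from collections import Counter


def _is_blank(value):
    """Treat None and whitespace-only strings as blank cells."""
    return value is None or str(value).strip() == ""


def find_duplicate_labels(labels):
    """Return OTU labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    duplicates = []
    for label in labels:
        if counts[label] > 1 and label not in duplicates:
            duplicates.append(label)
    return duplicates


def fill_blank_cells(rows, placeholder="-"):
    """Fill blanks in partially populated columns with a placeholder.

    rows: list of dicts mapping column name -> value (hypothetical layout).
    Columns that are blank for every row are left untouched.
    Returns new dicts; the input rows are not modified.
    """
    # Collect column names in first-seen order.
    columns = []
    for row in rows:
        for col in row:
            if col not in columns:
                columns.append(col)

    # A column needs filling only if at least one row has a value in it.
    populated = [
        col for col in columns
        if any(not _is_blank(row.get(col)) for row in rows)
    ]

    filled = []
    for row in rows:
        new_row = dict(row)
        for col in populated:
            if _is_blank(new_row.get(col)):
                new_row[col] = placeholder
        filled.append(new_row)
    return filled
```

A duplicate reported by =find_duplicate_labels= would be resolved by hand, for example by adding the specimen ID as a suffix as described above.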
---++++ Examples of metadata in [[http://www.treebase.org TreeBase]] (via Bill Piel) Accession numbers * Under the Matrices tab, enter 4953 and click "Matrix ID" * Click either the M4953 link or the image * Under the "Row Segments" column, you should see a "View" link -- click one of them Now you should see any attached metadata -- in this case it is a Genbank accession number that applies to the set of columns 1-992. You can do the same for the following matrices: * 831 = example of a matrix with a single set of row segments with Genbank accession numbers. * 5572 = example of a matrix with multiple row segments, and both Genbank and locality info. Unfortunately the lat/long data is not showing up even though I know the metadata are in there (sorry -- bugzilla) * 5212 = example of a matrix with a single set of row segments, with both Genbank numbers and locality metadata (Unfortunately the lat/long data is not showing up even though I know the metadata are in there) d183 34 a216 30 ---++++ Results of using the [[http://www.treebase.org TreeBase]] submission process Dan, Torsten and Arlin all made attempts to use the submission process. Dan and Arlin both uploaded files with OTU labels that contain species names, and these were recognized >90% of the time by tb once the "validate taxon labels" button is pressed (why doesn't tb suggest these automatically?). Arlin also used the "row segment table" interface to annotate a submission with GenBank accessions. The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows how a user may assign a UBio Id to an OTU (and it also shows that _TreeBase correctly guesses the actual species_):
tb2_taxlabel_editor_screenshot.jpg The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows a taxon table with match-able names. Pressing the "Validate taxon labels" button will automatically apply the results of name-matching, which in this case gives the correct attributions:
tb_taxa_table_screenshot.jpg ---++++ Examples of metadata in [[http://www.nexml.org NeXML]] from [[http://www.treebase.org TreeBase]] Looking at the [[http://www.nexml.org NeXML]], I see that Rutger has, indeed, coded in for delivering these metadata. For example, look at this: http://purl.org/phylo/treebase/phylows/matrix/TB2:M5212?format=nexml Instead of attaching lat/longs to some sort of row segment definition (e.g. the [[http://www.nexml.org NeXML]] equivalent of a charset) he has simply listed them among the OTU elements. e.g.: d219 1 a219 10 You can see in there a pair of lat/longs, an LSID to the uBio namebank record for the taxon, and a uniprot purl to the NCBI taxonomy. Unfortunately the Genbank Accession numbers are not yet included. I think Rutger has put exposure of some of these metadata on hold pending that the community of [[http://www.nexml.org NeXML]]-ers decides the proper way to do it. ---++++ how to expose these data on treebase Rutger Vos writes that "the way forward would be a two-step process: 1. the CQL query interface would need to be re-designed/expanded such that more predicates are recognized and supported for searching. Whether this would be on a predicate-by-predicate basis or something more generic remains to be seen. Hopefully the latter, but it's not immediately obvious to me how that would work. 1. a simple search box (a bit like the clever entrez search (Rod Page has been begging for this)) would need to be developed that knows how to construct any relevant CQL search queries and call them, returning all hits from the different search sections. I have some ideas for how to do this, but it wouldn't be trivial." ---++ Appendix 3: a set of toy files for illustration d239 1 a239 1 d407 44 a450 1 ---++ Appendix 4. Survey and User Feedback d452 1 a452 1 Survey has not been carried out, and is not planned until after the initial report is released. 
Preliminary draft is here: d454 1 a454 1 https://spreadsheets.google.com/viewform?formkey=dHhZa0xMQTJuR0ZCZWxoV2JSTG13b2c6MQ d456 1 a456 1 ---++ Appendix 5. Tools and tips a457 46 ---+++ Getting your data into interoperable formats The first section lists use-cases for format conversion. Each use-case provides a list of tools that can be used for this purpose. The tools are described in the second section. ---++++ Use Cases 1. Convert alignment (character data) format 1. Convert from various rare formats into FASTA, Phylip or ClustalW * see Online Format Translation Servers * see BioPerl 1. Convert from { FASTA, Phylip, ClustalW } into NEXUS * see Mesquite * see BioPerl 1. Convert tree format 1. Convert ToLWeb XML to NeXML 1. Convert Newick to NHX. No conversion is necessary. A Newick tree is the same as an NHX tree without annotations 1. Convert NHX to Newick. 1. Combine and alignment and a tree into one { NEXUS, NeXML or PhyloXML } file 1. Combine Newick tree and alignment { FASTA, Phylip, ClustalW } into NEXUS * see Mesquite * see Bio::NEXUS * see BioPerl + Bio::Phylo 1. Combine Newick tree and alignment { FASTA, Phylip, ClustalW } into NeXML * see Mesquite * see BioPerl + Bio::Phylo 1. Combine Newick tree and alignment { FASTA, Phylip, ClustalW } into PhyloXML * see BioPerl ---++++ Tools 1. Online Format Translation servers * one server - handles this format, that format * another server - handles this format, that format 1. BioPerl - a Perl programming toolbox for bioinformatics * installation instructions are at xxxx * converting alignment formats is described at xxxx * the NEXUS files produced by BioPerl are (deprecated) "DATA block" files, but are readable by other software 1. Mesquite - a graphical application for viewing, manipulating, and analyzing comparative data * web site and installation instructions are at xxxx * instructions for creating a combined NEXUS file are at xxxxx 1. Bio::NEXUS 1. 
BioPerl plus Bio::Phylo ---+++ Visualizing ---+++ Maintaining alternative naming schemes Dan says that he often creates 2 column tables with one column containing the name used in the nexus file, the other the name used for the same taxon or OTU in the spatial data. Arlin also encounters this kind of name-reconciliation problem. Bio::NEXUS provides tools to safely change the names in NEXUS files using a mapping provided in a simple 2-column input file. d459 11 a469 2 your_shell$ perl -MCPAN -e'install Bio::NEXUS' your_shell$ nextool.pl my_nexus_infile rename_otus my_name_mapping > my_nexus_outfile d472 14 a485 1 where the mapping file (my_name_mapping in the example above) just has lines, each with the old OTU name, followed by whitespace, followed by the new OTU name. d487 1 a487 1 [While a user generated mapping file provides a one-off solution, including an LSID or other GUID for each tree tip could provide a more general solution if the ID resolves to either a) a taxon in a recognised taxon repository such as ITIS or b) to a curated specimen whose taxonomy can be updatred to stay current. In either case the creator of the taxonomy specifies this link rather than leaving the user to interpret. - Dan R] @ 1.33 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1289316175" format="1.1" version="1.33"}% a49 73 ---++ Appendix 1. Relevant standards ---+++ Data sharing and archiving policies (see the [http://en.wikipedia.org/wiki/Data_sharing Data Sharing] article on wikipedia for references to data sharing policies in the US). Authors of scientific studies often are required (as a condition of funding or of publication) to make such results available to the research community without restriction. ---++++ Evolution Journals The [[http://www.datadryad.org Dryad]] web site describes the Joint Data Archiving Policy as follows:
< < Journal > > requires, as a condition for publication, that data supporting the results in the paper should be archived in an appropriate public archive, such as < < list of approved archives here > >. Data are important products of the scientific enterprise, and they should be preserved and usable for decades in the future. Authors may elect to have the data publicly available at time of publication, or, if the technology of the archive allows, may opt to embargo access to the data for a period up to a year after publication. Exceptions may be granted at the discretion of the editor, especially for sensitive information such as human subject data or the location of endangered species.
And lists the following partner journals (for links, go to the [[http://www.datadryad.org Dryad]] web site): * Whitlock, M. C., M. A. McPeek, M. D. Rausher, L. Rieseberg, and A. J. Moore. 2010. Data Archiving. American Naturalist. 175(2):145-146, doi:10.1086/650340 * Rieseberg, L., T. Vines, and N. Kane. 2010. Editorial and retrospective 2010. Molecular Ecology. 19(1):1-22, doi:10.1111/j.1365-294X.2009.04450.x * Rausher, M. D., M. A. McPeek, A. J. Moore, L. Rieseberg, and M. C. Whitlock. 2010. Data Archiving. Evolution. doi:10.1111/j.1558-5646.2009.00940.x * Moore, A. J., M. A. McPeek, M. D. Rausher, L. Rieseberg, and M. C. Whitlock. 2010. The need for archiving data in evolutionary biology. Journal of Evolutionary Biology 2010. doi:10.1111/j.1420-9101.2010.01937.x * Uyenoyama, M. K. 2010. MBE editor's report. Molecular Biology and Evolution. 27(3):742-743. doi:10.1093/molbev/msp229 * Butlin, R. 2010. Data archiving. Heredity advance online publication. 28 April doi:10.1038/hdy.2010.43 * Tseng, M. and L. Bernatchez. 2010. Editorial: 2009 in review. Evolutionary Applications. 3(2):93-95, doi:10.1111/j.1752-4571.2010.00122.x ---++++ NSF
Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. See Grant Proposal Guide (GPG) Chapter II.C.2.j for full policy implementation.
The policy may be found in the Award and Administration Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4 section VI.D.4.b]]:
b. Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.
The Grant Proposal Guide, [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j Section II.C.2.j]], reads partially as follows:
Plans for data management and sharing of the products of research. Proposals must include a supplementary document of no more than two pages labeled “Data Management Plan”. This supplement should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results (see AAG Chapter VI.D.4), and may include:
  • the types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project;
  • the standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies);
  • policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements;
  • policies and provisions for re-use, re-distribution, and the production of derivatives; and plans for archiving data, samples, and other research products, and for preservation of access to them.
Data management requirements and plans specific to the Directorate, Office, Division, Program, or other NSF unit, relevant to a proposal are available at: http://www.nsf.gov/bfa/dias/policy/dmp.jsp. If guidance specific to the program is not available, then the requirements established in this section apply.
---+++ Dublin core there isn't a standard for encoding dublin core publication data in XML. in particular, there isn't an enclosing element. In NeXML it would be "meta". DC isn't very well suited to journal articles, anyway. too bad. the best attempt I've seen (http://reprog.wordpress.com/2010/09/03/bibliographic-data-part-2-dublin-cores-dirty-little-secret/) goes like this: Michael P. Taylor Darren Naish 2007 An unusual new neosauropod dinosaur from the Lower Cretaceous Hastings Beds Group of East Sussex, England. urn:ISSN:0081-0239 Blackwell Text Palaeontology 50(6), 1547-1564. (2007) info:doi:10.1111/j.1475-4983.2007.00728.x ---+++ MIAPA Scientists with an interest in the archiving and re-use of phylogenetic data have called for (but not yet developed) a minimal reporting standard designated "Minimal Information for a Phylogenetic Analysis", or MIAPA ([[http://www.ncbi.nlm.nih.gov/pubmed/16901231 Leebens-Mack, et al. 2006]]). The vision of these scientists is that the research community would develop, and adhere to, a standard that imposes a minimal reporting burden yet ensures that the reported data can be interpreted and re-used. Such a standard might be adopted by journals, repositories, databases, workflow systems, granting organizations, and organizations that develop taxonomic nomenclature based on phylogenies. Leebens-Mack, et al. suggest that a study should report objectives, sequences, taxa, alignment method, alignment, phylogeny inference method, and phylogeny (this implies that MIAPA is intended only for molecular, as opposed to non-molecular, phylogenetics). As of 2010, no standard or draft has been developed (the [[http://mibbi.sourceforge.net/projects/MIAPA/ MIBBI repository for the MIAPA project]] is empty). A [[https://www.nescent.org/wg_evoinfo/MIAPA_WhitePaper NESCent whitepaper on MIAPA]] outlines how the project could be moved forward. 
As a proof-of-concept exercise (described with some screenshots [[https://www.nescent.org/wg_evoinfo/Supporting_MIAPA#Proof-of-concept_.28annotation_software.29 here]]), participants in NESCent's Evolutionary Informatics working group configured an existing annotation application to use a controlled vocabulary to describe a phylogenetic analysis as a series of steps. ---+++ TDWG and Darwin Core From the [[http://rs.tdwg.org/dwc/terms/guides/xml/index.htm Darwin Core XML Guide]] (specify namespace with xmlns:dwc="http://rs.tdwg.org/dwc/terms/"): Anthus hellmayri Aves Anthus hellmayri urn:catalog:AUDCLO:EBIRD:OBS64515331 @ 1.32 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1288450312" format="1.1" version="1.32"}% d53 2 @ 1.31 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1288273919" format="1.1" reprev="1.31" version="1.31"}% d48 1 a48 1 The [[PublishingTreesReport1 Preliminary Report]] has been moved to a separate page. We should put an executive summary here, though. @ 1.30 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1288214627" format="1.1" version="1.30"}% d46 1 a46 102 ---++ Preliminary Report on Current Best Practices for Publishing Trees Electronically ---+++ Abstract This report seeks to identify best practices for publishing phylogenetic trees electronically. It aims to make trees as useful and reusable as possible, by making it easier to automatically link trees to other data. Currently phylogenetic trees are published in a wide variety of formats, but most frequently the only publication is as an image in a journal paper. Trees are increasingly published in machine readable formats, either as a supplementary data file with a journal paper, or in online repositories such as TreeBase.
Even where trees are published in one of the standard data formats, however, the taxa or specimens are often not identified in a way which supports easy machine recognition and linking to other information. Policies recently announced by major funding agencies and by journals in evolution and systematics (see Appendix 1) will require online publication of trees in reusable formats. Increasingly, phylogenies are seen not just as an endpoint – presenting the inferred relationships between a set of organisms or taxa – but as the starting point for a range of further research. This research will almost always involve linking with other data. Typical examples where linking would be valuable include: * linking a node on a tree to geographic information about the taxon * linking a node on a tree to ecological or morphological information about the taxon * linking a node on a tree to a museum specimen or GenBank accession * searching for trees which include a given taxon or specimen We want to facilitate the creation and storage of trees built for data integration so that they can be more easily and automatically linked to other data. The integrating variables with the most potential in the short term are: 1. taxon name * well-formatted binomial or trinomial taxon name * LSID which resolves to a taxon concept 1. specimen identifier * collection accession number * LSID which resolves to a specimen record in a collection 1. sequence identifier * GenBank accession 1. geographic coordinates Most phylogeny information artefacts (e.g., files) currently in circulation have none of these. Integrating phylogenetic information into the global web of data will progress rapidly when it is a) easy for users to put this information into their trees via appropriate software and b) considered standard good practice, and beneficial to the creators of the trees, to include linkable information.
Thus, the goal of this report is to identify data formats, software and work procedures to deliver reusable, linkable trees. We hope to provoke discussion leading to workable and widely accepted solutions to this problem. ---+++ Outline 1. Rationale and objectives of archiving for re-use 1. Key information to make the tree re-usable * Accessions - may lead to species identifier (potential for conflicts) * Taxonomic Links 1. a well-formatted taxon name or taxon concept 1. binomial or separate field for each taxonomic level 1. LSID which resolves on a taxon names service * geo coordinates * publication info (important to note this explicitly, though it's covered by Dryad and TreeBASE) 1. Archives to upload to: * [[http://www.treebase.org TreeBase]], has a submission process * [[http://datadryad.org Dryad]] still need to investigate * other places to expose a [[http://www.nexml.org NeXML]], [[http://www.phyloxml.org phyloXML]], or CDAO file (generic LOD store) 1. Formats * Newick - only allows labels, no metadata * NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation) * NHX ([[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]]) * [[http://www.phyloxml.org phyloXML]] - find out how many of the attributes below can be represented * [[http://www.nexml.org NeXML]] - write a description of how to represent LSID, GenBank accn, geo coordinates 1. gaps and recommendations 1. Semantic link between tree and characters (typically inferred_from) is not explicit * [[http://www.treebase.org TreeBase]] supports this via method link from matrix to tree 1. lack of community standard for accessions and species links 1. lack of resolvable LSIDs; lack of a validator to see if species refs are resolved 1. lack of a validation service for most formats ---+++ Rationale Hundreds of thousands, perhaps millions, of trees are generated each year in association with published research. Of these, a tiny fraction is published each year in association with journal articles.
The vast majority of these "published" trees appear only as graphical images, which may be accessible as electronic image files, while a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format). To a computer, the images and image files are informational dead-ends. The "Newick" tree strings expose the topology and branch lengths of the tree, but such trees typically are not adequate for data integration, re-use, and re-purposing. To understand why, consider the following example:
((my_arbitrary_name1:0.34, idiosyncratic_name:0.19):0.11, my_other_name:0.44);
What this tree means depends entirely on what the labels refer to, but the labels are arbitrary. To interpret this tree, to validate it, or to integrate it with other information, we would need to link its labels to external data, but we cannot, because the labels do not refer to any identifiable entity. Ideally each label would refer to something with a GUID (globally unique identifier). In general, if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning. Indeed, most trees that are archived lack clearly identifiable information allowing them to be linked with other data, except under the guidance of an expert communicating with the authors of the paper. The ultimate goal of our effort here is to make trees more interoperable. We believe that if the forest of trees produced by researchers each year were computationally accessible, the scientific community would have a much greater capacity to validate and extend phylogeny-based research. The benefits of linked data have been discussed elsewhere (http://www.taxonconcept.org/taxonconcept-blog/2010/8/5/why-linked-open-data-makes-sense-for-biodiversity-informatic.html). As a step toward this goal, we aim to assess current approaches to publishing trees electronically, in order to educate users of phylogenetic trees and to identify strengths and weaknesses.
This effort is timely for several reasons: * While in the past many scientists felt no incentive to share data, recent research has shown that making data available in public archives increases citations (ref: Piwowar, research remix), widely understood as an indicator of professional success; * In early 2010, some key journals in evolution and systematics announced plans to implement a data-archiving policy: to publish in these journals, researchers will need to start archiving their trees; * The only major electronic repository of trees, TreeBase, has recently completed a major upgrade of features, including its submission process; in 2009, a data archive called Dryad was launched and will accept various kinds of electronic files, including those with trees; * NSF has recently increased its [[http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j requirements for data-sharing plans]] in grant proposals. Thus, scientists will be motivated by funding agencies to share data electronically; * In recent years, phyloinformatics researchers have been developing supporting technologies to enable interoperability, including XML file formats (NeXML, PhyloXML), an ontology (CDAO) and a web-services standard (PhyloWS). Thus, at the same time that funding agencies, publishers, and the scientific culture are shifting in ways that create incentives for sharing data, including phylogenetic trees, new technologies are emerging to make it easier. Finally, although there are many indirect, community-wide benefits of sharing data, there are also more direct and immediate benefits to the producer, including: * increased citation - if researchers can readily apply a tree to additional research questions, it is likely to generate additional citations for the tree's authors. * easier to link one's own data - the same enhancements which prepare trees for reuse will also help researchers to link their own trees to related data for analysis.
and more general benefits: * enables web applications, such as for phylogeographic visualisation, which link trees to spatial and other data. * facilitates use of trees in big-data questions which become tractable when data linking can be automated. ---+++ Key factors promoting re-use [drop this section? dr] Standard formats. Validatable. Type of metadata * publication information (dublin core) * ---+++ Available technology ---++++ Archives ---++++ Formats ---++++ Tools d48 1 a48 1 ---+++ Gaps and recommendations @ 1.29 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1288038532" format="1.1" version="1.29"}% d206 3 a208 1 doesn't exist, but its a nice idea @ 1.28 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1288034327" format="1.1" reprev="1.28" version="1.28"}% d311 1 a311 1 * [[%ATTACHURL%/PF00034_4.nex][PF00034_4.nex]]: 4-taxon test case in NEXUS forma @ 1.27 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1288021582" format="1.1" reprev="1.27" version="1.27"}% d125 6 a130 1 Finally (and we need some more work on this paragraph), while the rewards for sharing data are indirect, there are more direct and immediate benefits to the producer of being able to link *one's own data* and discover its connections. d132 2 a133 1 ---+++ Key factors promoting re-use @ 1.26 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1288015192" format="1.1" reprev="1.26" version="1.26"}% d48 1 a48 1 This paper seeks to develop best practices for publishing phylogenetic trees electronically, It aims to make trees as useful and reusable as possible, by making it easier to automatically link trees to other data. d50 1 a50 1 Currently trees are published in a wide variety of formats, and often the only publication is as an image in a journal paper. Trees are increasing published online, either as a supplementary data file with a journal paper, or in online repositories such as TreeBase. 
Even where trees are published in one of the standard data formats, however, the taxa or specimens are often not identified in a way which supports easy machine recognition and linking to other information. d52 1 a52 1 However, expectations for data sharing are changing, as reflected in a recent announcements from major funding agencies and from journals in evolution and systematics (see Appendix 1). Increasingly, phylogenies are seen not just as an endpoint – presenting the inferred relationships between a set of organisms or taxa – but as the starting point for a range of further research. This research will almost always involve linking with other data. d62 10 a71 1 The integrating variables with the most potential for the near future (next few years) are 1) species name (or other taxonomic identifier); 2) specimen or source identifier such as GenBank? accn; and 3) geographic coordinates. Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data isn't going to get very far until we make it easy for users to put this information into their trees. d73 3 a75 1 Thus, the goal of the working session is to tackle this problem, in terms of clarifying its dimensions and challenges, developing strategies based on available tools, and even developing new tools. An obvious starting point would be to focus on enabling ordinary phylogenetics & systematics users to use current standards (Newick, NHX, [[http://www.phyloxml.org phyloXML]], ...) to associate species names (possibly other tax ids) with phylogenies in their usual workflows. In some cases, its just a matter of knowing how to use the file format properly, possibly aided by better tools for data input. 
For users whose workflows rely on Newick, we would need a way to keep a separate mapping of OTU ids and tax ids, along with tools to interconvert or translate to one of the other formats (this could be as simple as an Excel spreadsheet or as complex as a web service that maintains your mapping and does the translation for you). d106 1 a106 1 Hundreds of thousands-- perhaps millions-- of trees are generated each year in association with published research. Of these, a tiny fraction, perhaps thousands (tens of thousands?), is published each year in association with journal articles. The vast majority of these "published" trees appear as graphical image— and may be accessible as an electronic image file, while a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format). @ 1.25 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1287780234" format="1.1" version="1.25"}% a5 1 We decided to work on problem 1. After the work day at TDWG2010, Dan and Arlin made a further plan to complete this report, then work toward a publication. Torsten also indicated his willingness to participate. d8 1 a8 1 At the TDWG meeting in Woods Hole, this group (Dan, Arlin, Jamie, Torsten, Elena) decided to work out current best practices for publishing a tree electronically, via an archive such as [[http://www.treebase.org TreeBase]] or [[http://datadryad.org Dryad]], or via the semantic web (presumably using CDAO). After the meeting, this evolved into a more specific plan (see below) to produce an initial report, solicit feedback, and then generate a more extensive report for publication. d52 1 a52 1 A range of journals [link here] and the National Science Foundation (U.S.A) will soon require publication and / or achiving of phylogenies with a view to future uses. 
Increasingly, phylogenies are seen not just as an endpoint – presenting the inferred relationships between a set of organisms or taxa – but as the starting point for a range of further research. This research will almost always involve linking with other data. d136 36 @ 1.24 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1287670215" format="1.1" version="1.24"}% d49 15 a63 1 We want to facilitate data integration so that phylogenies can be more easily and automatically linked to other data. The integrating variables with the most potential for the near future (next few years) are 1) species name (or other taxonomic identifier); 2) specimen or source identifier such as GenBank? accn; and 3) geographic coordinates. Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data isn't going to get very far until we make it easy for users to put this information into their trees. @ 1.23 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1287504466" format="1.1" version="1.23"}% d49 1 a49 1 We want to facilitate data integration, and the integrating variables with the most potential for the near future (next few years) are 1) species name (or other taxonomic identifier); 2) specimen or source identifier such as GenBank? accn; and 3) geographic coordinates. Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data isn't going to get very far until we make it easy for users to put this information into their trees. @ 1.22 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1287434951" format="1.1" reprev="1.22" version="1.22"}% d436 24 a459 1 ---+++ Getting data into formats d461 14 @ 1.21 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1287418636" format="1.1" version="1.21"}% d56 9 a64 1 1. 
Archives to upload to: a73 7 1. Key information to make the tree re-useable * Accessions - may lead to species identifier (potential for conflicts) * Taxonomic Links 1. a well formatted taxon name or taxon concept 1. binomial or separate field for each taxonomic level 1. LSID which resolves on a taxon names service * geo coordinates d103 7 a109 1 ---+++ Archives d111 1 a111 1 ---+++ Formats d113 5 a117 1 ---+++ Key factors promoting re-use d123 16 d404 21 @ 1.20 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1287412973" format="1.1" reprev="1.20" version="1.20"}% d380 1 a380 1 ---++ Appendix 4. Tools and tips d382 11 a392 1 ---+++Maintaining alternative naming schemes @ 1.19 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1287085150" format="1.1" reprev="1.19" version="1.19"}% a5 1 d8 2 a9 3 ---++ Problem 1. Best practices for publishing a tree electronically ---+++Synopsis Work out current best practices for publishing a tree electronically, via an archive such as [[http://www.treebase.org TreeBase]] or [[http://datadryad.org Dryad]], or via the semantic web (presumably using CDAO). At minimum, get some actual trees (+/- matrices) ready to submit. At best, draft a how-to guide for publication. This is **not** about best practices for inferring a tree. d11 1 a11 1 ---+++Contributors d21 25 a45 2 * Nico Franz * Jamie Whitacre d47 33 d81 1 d83 1 a83 3 Millions of trees are generated each year in association with published research. But, of these millions of trees, only a tiny fraction is published, typically in the form of a graphical image— a picture to look at. To a computer, such image files are informational dead-ends. Of the thousands (tens of thousands?) of trees published each year in association with journal articles, a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format). 
This exposes the topology and branch lengths of the tree, but such trees typically are not adequate for data integration, re-use, and re-purposing. To understand why, consider the following example: d94 2 a95 2 * The only major electronic repository of trees, TreeBase, has recently completed a major upgrade of features, including its submission process; at the same time, a more loosely structured archive called Dryad was launched and will accept various kind of electronic files, including those with trees; * NSF has recently increased its requirements for data-sharing plans in grant proposals. Thus, scientists will be motivated by funding agencies to share data electronically; d100 5 a104 1 ---+++ work plan (revised post-meeting by Arlin & Dan) d106 5 a110 33 1. general issues to keep in mind * Have a rationale that appeals to the reader * What options are available? * Consider novices and experts (what workflows would be good for novices?) * Make it clear we are not just promoting Mesquite or [[http://www.treebase.org TreeBase]] 1. 
Finish TDWG phylogenetics standards report (target: *October 15, 2010*) * decide on form and scope for TDWG report (can be narrower than eventual publication) * Write up rationale for these approaches focusing on integration _Dan_ * Why reuse trees (and matrices) rather than reusing the sequences * carrots and sticks * stick: journal policy (refs) * stick: NSF data sharing policy (http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/gpg_2.jsp#IIC2j) * carrot: data accessability -> reuse -> citations (Heather Piwowar research re journal citations) * carrot: data integration * Rewards for sharing and for facilitating re-use may not be compelling for data producers: a better case might be made by focusing on *data integration* and, particularly, on immediate benefits to the producer of being able to link *one's own data* and discover its connections * Work up a toy case of the same 4 taxon tree in 5 different formats _Arlin_ * get feedback from Christian and Rutger about this * find a case with a taxonomic change, so that linking to the concept prevents real ambiguity. * Explore options for embedded LSIDs and technology to resolve them _Dan_ * Investigate [[http://datadryad.org Dryad]] as an alternative archive _Dan_ 1. Release TDWG report, obtain feedback from developers & stakeholders (target: *October 31*) * evoldir and tdwg lists * syst biol, cipres (other projects, journals, communities?) * developers of relevant software & standards (Christian, Rutger, David & Wayne) * archive developers (Bill, Todd) 1. As needed, explore additional issues, including issues arising from stakeholders (*October*, *November*) * Novice workflows for users to get their data into standard formats * tools to annotate and edit files to make the information more linkable * tools that illustrate cool things that can be done with properly formed input files * TOLWEB as an alternative destination for published trees _Arlin_ 1. 
Write a manuscript for publication (target: *November 30*) * recruit additional authors as appropriate * estimate number of users and trees per year using citations or other means d112 2 a113 1 ---+++ Relevant TDWG standards d115 1 d128 61 a188 1 ---+++ A set of toy files for illustration d190 33 a222 1 A set of test files may be used to illustrate the representation capabilities (and limitations) of different formats. The toy files below are based on actual data for a set of 4 cytochrome C sequences (from PFAM family PF00034), using the tree a232 1 The test files are available for download and analysis here: d234 1 a234 7 * [[%ATTACHURL%/PF00034_4.nwk][PF00034_4.nwk]]: 4-taxon test case in Newick format * [[%ATTACHURL%/PF00034_4.nhx][PF00034_4.nhx]]: 4-taxon test case in NHX format * [[%ATTACHURL%/PF00034_4.nex][PF00034_4.nex]]: 4-taxon test case in NEXUS forma * [[%ATTACHURL%/PF00034_4_phylo.xml][PF00034_4_phylo.xml]]: 4-taxon test case in [[http://www.phyloxml.org phyloXML]] format * [[%ATTACHURL%/PF00034_4_nexml.xml][PF00034_4_nexml.xml]]: 4-taxon test case in [[http://www.nexml.org NeXML]] format ---++++ Newick d248 1 a248 1 ---++++ NHX (New Hampshire Extended) d257 1 d273 1 a273 1 ---++++ NEXUS d313 1 a313 1 ---++++ [[http://www.phyloxml.org phyloXML]] d318 1 a322 1 * no explicit tags for geographic coordinates d353 1 a353 1 ---++++ [[http://www.nexml.org NeXML]] d377 2 a378 130 ---+++ challenges with aims, scoping and emphasis * the aim is to assess the readiness of the phyloinformatics community . . . * for upcoming data archiving policies . . . 
* so that users can maximize the reusability of their results * so that publishers understand what they are asking of users * so that developers can target weak spots in current practices * some folks are going to say that the tree is useless (not re-useable) without extensive methods annotations * [[http://www.treebase.org TreeBase]] seems to be the only archive supporting key features, but we don't want this to be all about [[http://www.treebase.org TreeBase]] * we want to assess current best practices, but we don't want this to have the effect of locking them in permanently; instead we want to stimulate improvement ---+++ Outline of doc: "Current best practices for publishing a tree electronically" Here is an outline, but it may be best to do the writing in the google doc. 1. Introduction that reviews the data archiving policy of evolution journals 1. What Archives to upload to: * [[http://www.treebase.org TreeBase]], has a submission process * [[http://datadryad.org Dryad]] still need to investigate * other places to expose a [[http://www.nexml.org NeXML]], [[http://www.phyloxml.org phyloXML]], or CDAO file 1. What formats to use, and describe how it can be used, what information can be included * Newick- only allows labels, no metadata * NEXUS (https://www.nescent.org/wg_phyloinformatics/Supporting_NEXUS_Documentation) * NHX [[http://www.phylosoft.org/NHX/nhx.pdf PDF docs]]) * [[http://www.phyloxml.org phyloXML]]- find out how much of attributes below can be represented * [[http://www.nexml.org NeXML]] - write a description of how to represent LSID, GenBank accn, geo coordinates 1. Key information to make the tree re-useable * Accessions - may lead to species identifier (potential for conflicts) * Taxonomic Links 1. a well formatted taxon name or taxon concept 1. binomial or separate field for each taxonomic level 1. LSID which resolves on a taxon names service * geo coordinates 1. gaps and recommendations 1. 
Semantic link between tree and characters (typically inferred_from) is not explicit * [[http://www.treebase.org TreeBase]] supports this via method link from matrix to tree 1. lack of community standard for accessions and species links 1. lack of resolvable lsids; lack of a validator to see if species refs are resolved 1. lack validation service for most formats ---+++ What we have learned ---++++ Using Dryad Todd Vision writes
Since Dryad is a general-purpose repository, it doesn't impose any constraints on how the data are represented within the files that users submit. The best practices need to come from elsewhere, such as journal policies, MIAPA (which would be nice to give some attention to in that paper), and community practice imposed by awareness of how the data will be reused by more specialized phylogenetic tools. In case you aren't aware, Dryad just introduced a "handshaking" feature for TreeBASE. Users can elect to have a NEXUS file that is deposited to Dryad "pushed through" to TreeBASE to initiate the submission process there. So for the special case of phylogenetic data in Dryad, we would encourage having that Newick tree within a NEXUS file, together with the OTU metadata that can fit within that file format. I dream of a future in which lots of different software tools will support the editing and output of metadata-rich phylogenies in NeXML, and that TreeBASE can ingest those NeXML files. But we aren't there yet. If a user doesn't intend to use TreeBASE for whatever reason, then a Newick tree in one file and OTU metadata in a separate CSV file would be a reasonable low-tech solution, as long as the OTU identifiers were consistent between the files. A ReadMe file could also be used to provide study-level metadata.
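The low-tech option described above (a Newick tree in one file, OTU metadata in a separate CSV file) only works if the OTU identifiers agree between the two files. The following is a minimal sketch of such a consistency check, not a tool used by Dryad or TreeBASE. The function names, the =otu_id= column name and the example CSV columns are all hypothetical, and the label parser assumes simple unquoted Newick labels without bracketed comments.

```python
import csv
import io
import re


def newick_tip_labels(newick: str) -> set[str]:
    """Collect tip labels from a simple Newick string.

    Assumes unquoted labels and no bracketed comments. A tip label is the
    token that directly follows '(' or ','; internal node labels follow ')'
    and are therefore skipped.
    """
    return set(re.findall(r"[(,]\s*([^():,;\[\]\s]+)", newick))


def csv_otu_ids(csv_text: str, id_column: str = "otu_id") -> set[str]:
    """Collect OTU identifiers from one column of CSV text with a header row.

    The column name 'otu_id' is a hypothetical convention for this sketch.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_column].strip() for row in reader}


def compare_otus(newick: str, csv_text: str, id_column: str = "otu_id") -> dict:
    """Report tips lacking a metadata row and metadata rows lacking a tip."""
    tips = newick_tip_labels(newick)
    ids = csv_otu_ids(csv_text, id_column)
    return {
        "missing_metadata": sorted(tips - ids),
        "unused_metadata": sorted(ids - tips),
    }


if __name__ == "__main__":
    # Tip labels follow the 4-taxon cytochrome C toy case; the metadata
    # table deliberately omits one OTU to show a mismatch being reported.
    tree = (
        "((Mus_musculus_CAA25899.1:0.008307,Rattus_norvegicus_AAA21711.1:0.009662):0.024280,"
        "(Gallus_gallus_CAA25046.1:0.055226,Rattus_norvegicus_AAA41015.1:0.117358):0.040335);"
    )
    metadata = (
        "otu_id,species,genbank_accession\n"
        "Mus_musculus_CAA25899.1,Mus musculus,CAA25899.1\n"
        "Rattus_norvegicus_AAA21711.1,Rattus norvegicus,AAA21711.1\n"
        "Gallus_gallus_CAA25046.1,Gallus gallus,CAA25046.1\n"
    )
    print(compare_otus(tree, metadata))
```

Running such a check before deposit would flag any tip that has no metadata row, and any metadata row that matches no tip. Trees with quoted labels or embedded comments would need a real Newick parser instead of the regular expression used here.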
---++++ Uploading a tree to [[http://www.treebase.org TreeBase]] (notes from chat with Bill Piel 9/29/10)

   1. Use Mesquite to prepare the document before uploading to [[http://www.treebase.org TreeBase]]
      * Why? Because Mesquite ensures correct formatting. It also matches the Mesquite software used in the [[http://www.treebase.org TreeBase]] website itself.
   1. In Mesquite, it is best to keep the matrix and tree in the same file, to ensure there is no mismatch
   1. Ensure taxon names are written out in full as binomial or trinomial
      * If there are infraspecies, just write the triplet without ‘var’, ‘subsp’ etc.
      * What if there are multiple specimens for the same taxon? Each name must be unique, so make sure the specimen ID etc. is a suffix formatted with a leading capital or a number, so that [[http://www.treebase.org TreeBase]] won’t treat it as a new taxon name
   1. After upload, click on the yellow taxon button. Then click . TreeBase tries to match up labels with existing taxon names. If there is no match, it checks uBio. If the name may be a homonym, you will be asked to choose which taxon to link to. NCBI handles the homonyms. [[http://www.treebase.org TreeBase]] will link taxon names to a GenBank taxid if possible.
   1. Create an analysis record to link the matrices to the trees.
   1. Linking to specimen IDs, e.g. a GenBank accession? This is done by setting attributes of rows in the matrix:
      1. After uploading the matrix click . There is a list of row labels to populate. You can enter Darwin Core information about the specimen.
      1. There is a bug: if some rows are populated for a given column, all rows must be populated for that column. There is an error if any is left blank. So just put something there, for example a dash ‘-’
      1. You could apply this metadata to just a part of the alignment
      1. Not sure that all of this metadata is included in [[http://www.nexml.org NeXML]]

Metadata can't be attached to the tree nodes; these data derive from the matrix and thus are included there.
---++++ Examples of metadata in [[http://www.treebase.org TreeBase]] (via Bill Piel)

Accession numbers
   * Under the Matrices tab, enter 4953 and click "Matrix ID"
   * Click either the M4953 link or the image
   * Under the "Row Segments" column, you should see a "View" link -- click one of them

Now you should see any attached metadata -- in this case it is a Genbank accession number that applies to the set of columns 1-992. You can do the same for the following matrices:
   * 831 = example of a matrix with a single set of row segments with Genbank accession numbers.
   * 5572 = example of a matrix with multiple row segments, and both Genbank and locality info. Unfortunately the lat/long data is not showing up even though I know the metadata are in there (sorry -- bugzilla)
   * 5212 = example of a matrix with a single set of row segments, with both Genbank numbers and locality metadata (Unfortunately the lat/long data is not showing up even though I know the metadata are in there)

---++++ Results of using the [[http://www.treebase.org TreeBase]] submission process

Dan, Torsten and Arlin all made attempts to use the submission process. Dan and Arlin both uploaded files with OTU labels that contain species names, and these were recognized >90% of the time by tb once the "validate taxon labels" button is pressed (why doesn't tb suggest these automatically?). Arlin also used the "row segment table" interface to annotate a submission with GenBank accessions.

The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows how a user may assign a UBio Id to an OTU (and it also shows that _TreeBase correctly guesses the actual species_):
tb2_taxlabel_editor_screenshot.jpg The following [[http://www.treebase.org TreeBase]] screenshot (cropped) shows a taxon table with match-able names. Pressing the "Validate taxon labels" button will automatically apply the results of name-matching, which in this case gives the correct attributions:
tb_taxa_table_screenshot.jpg

---++++ Examples of metadata in [[http://www.nexml.org NeXML]] from [[http://www.treebase.org TreeBase]]

Looking at the [[http://www.nexml.org NeXML]], I see that Rutger has, indeed, coded in for delivering these metadata. For example, look at this:

http://purl.org/phylo/treebase/phylows/matrix/TB2:M5212?format=nexml

Instead of attaching lat/longs to some sort of row segment definition (e.g. the [[http://www.nexml.org NeXML]] equivalent of a charset) he has simply listed them among the OTU elements. e.g.:

You can see in there a pair of lat/longs, an LSID to the uBio namebank record for the taxon, and a uniprot purl to the NCBI taxonomy.

Unfortunately the Genbank Accession numbers are not yet included. I think Rutger has put exposure of some of these metadata on hold pending that the community of [[http://www.nexml.org NeXML]]-ers decides the proper way to do it.

---++++ how to expose these data on treebase

Rutger Vos writes that "the way forward would be a two-step process:
   1. the CQL query interface would need to be re-designed/expanded such that more predicates are recognized and supported for searching. Whether this would be on a predicate-by-predicate basis or something more generic remains to be seen. Hopefully the latter, but it's not immediately obvious to me how that would work.
   1. a simple search box (a bit like the clever entrez search (Rod Page has been begging for this)) would need to be developed that knows how to construct any relevant CQL search queries and call them, returning all hits from the different search sections. I have some ideas for how to do this, but it wouldn't be trivial."

---++ Problem 2. Using trees in data integration

The ability to link tree tips to other sources of data by taxon name or identifier is a threshold step for automating a whole range of analyses and displays. For example, the software I work on, Biodiverse, links phylogenies to species location data for visualisation and analysis.
It would be great to have it running automatically online, linking (for example) trees from [[http://www.treebase.org TreeBase]] to distributions from GBIF. To do this (and many other things), however, we need a better solution to the taxon matching problem you describe. (Dan Rosauer)

---+++ synopsis

We want to facilitate data integration, and the integrating variables with the most potential for the near future (next few years) are 1) species name (or other taxonomic identifier); 2) specimen or source identifier such as a GenBank accession; and 3) geographic coordinates. Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data isn't going to get very far until we make it easy for users to put this information into their trees.

Thus, the goal of the working session is to tackle this problem, in terms of clarifying its dimensions and challenges, developing strategies based on available tools, and even developing new tools. An obvious starting point would be to focus on enabling ordinary phylogenetics & systematics users to use current standards (Newick, NHX, [[http://www.phyloxml.org phyloXML]], ...) to associate species names (possibly other tax ids) with phylogenies in their usual workflows. In some cases, it's just a matter of knowing how to use the file format properly, possibly aided by better tools for data input. For users whose workflows rely on Newick, we would need a way to keep a separate mapping of OTU ids and tax ids, along with tools to interconvert or translate to one of the other formats (this could be as simple as an Excel spreadsheet or as complex as a web service that maintains your mapping and does the translation for you).
d380 1 a380 1 ---+++some specific challenges (sometimes with solutions) d382 1 a382 1 ---++++Maintaining alternative naming schemes @ 1.18 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286977479" format="1.1" version="1.18"}% d24 1 d28 1 a28 1 Millions of trees are generated each year in association with published research. If this forest of published trees were computationally accessible, the scientific community would have a greater capacity to validate past results of phylogenetic research, and to build and extend its reach. d30 14 a43 1 But this is not the case. Of these millions of trees, a tiny fraction is published. Of trees that are published, the vast majority are published only in the form of a graphical image interpretable only to a human reader. To a computer, such trees are informational dead-ends. d45 1 a45 16 Of the thousands (tens of thousands) of trees published each year in association with journal articles, a tiny fraction is archived in a computable electronic form, nearly always as a string with nested parentheses representing clades (the "Newick" format). Publishing the tree in the standard electronically encoded form of nested parentheses exposes the topology and branch lengths of the tree. However, this is only a small improvement, because if the nodes in the tree are not associated with identifiable information, the structure of the tree has no recoverable biological meaning— and indeed, most trees that are archived lack clearly identifiable information allowing them to be linked with other data, except under the guidance of an expert communicating with the authors of the paper. The rationale of our effort here is to make trees more interoperable. We believe that if the forest of trees produced by researchers each year were computationally accessible, the scientific community would benefit greatly. 
The current project aims to identify strengths and weakness in current approaches to publishing trees electronically. By "publish electronically" we mean "make accessible as information via the web". Most phylogenies published appear in journals. These journals may have online editions in which the tree appears as an image, as in [[http://www.pnas.org/content/97/10/5334.full this example]]. Because the tree is an image, the information in it requires a human to interpret. There are machine-readable representations, most commonly "Newick" (nested parentheses), that are used to produce such images. Publishing these representations would make the tree machine-readable, but it would not necessarily make it linkable. If the electronic tree file has only the tree structure with arbitrary labels at the tips, then one can't know how to link the tree to other data (without reading the published paper). * Requirement for journals * see the roadmap available from [[http://wiki.tdwg.org/TAG TDWG technical architecture group]] * NSF data-sharing plan * Resuse of trees for new projects = more citations * publish data, more citations (ref: Piwowar, research remix) * Simpler linking of trees to other data * benefits of linked data (http://www.taxonconcept.org/taxonconcept-blog/2010/8/5/why-linked-open-data-makes-sense-for-biodiversity-informatic.html) @ 1.17 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286477704" format="1.1" version="1.17"}% d29 1 a29 1 But this is not the case. Of these millions of trees, a tiny fraction is published in the form of a graphical images interpretable only to a human reader. To a computer, such trees are informational dead-ends. 
@ 1.16 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286458813" format="1.1" reprev="1.16" version="1.16"}% d84 14 @ 1.15 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286399139" format="1.1" reprev="1.15" version="1.15"}% d14 7 a20 6 * Arlin Stoltzfus * Dan Rosauer * Torsten Eriksson * Bill Piel provided instructions for TreeBase and examples * Christian Zmasek provided feedback on phyloXML and NHX * Rutger Vos provided feedback on nexml encodings d23 1 a23 1 * Nico Franz d41 1 d282 11 a292 1 1. validation. file is well formed, metadata points to something resolvable d294 4 a297 1 ---+++ What we have learned @ 1.14 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286387507" format="1.1" version="1.14"}% d13 11 d86 2 a87 2 ((Mus_musculus_CAA25899.1:0.008307,Rattus_norvegicus_AAA21711.1:0.009662)0.024280[0.74], (Gallus_gallus_CAA25046.1:0.055226,Rattus_norvegicus_AAA41015.1:0.117358)0.040335[0.69]); d100 1 a100 1 * [[%ATTACHURL%/PF00034_4_phylo.xml][PF00034_4_phylo.xml]]: 4-taxon test case in [[http://www.phyloxml.org PhyloXML]] format d126 1 d181 1 a181 1 ---++++ [[http://www.phyloxml.org PhyloXML]] d207 14 d238 7 d263 1 a263 1 * other places to expose a [[http://www.nexml.org NeXML]], [[http://www.phyloxml.org phyloxml]], or CDAO file d267 2 a268 2 * NHX (http://www.phylosoft.org/NHX/) (PDF doc at http://www.umanitoba.ca/afs/plant_science/psgendb/doc/atv/NHX.pdf) * [[http://www.phyloxml.org PhyloXML]]- find out how much of attributes below can be represented d349 4 d361 1 a361 1 Thus, the goal of the working session is to tackle this problem, in terms of clarifying its dimensions and challenges, developing strategies based on available tools, and even developing new tools. An obvious starting point would be to focus on enabling ordinary phylogenetics & systematics users to use current standards (Newick, NHX, [[http://www.phyloxml.org PhyloXML]], ...) 
to associate species names (possibly other tax ids) with phylogenies in their usual workflows. In some cases, its just a matter of knowing how to use the file format properly, possibly aided by better tools for data input. For users whose workflows rely on Newick, we would need a way to keep a separate mapping of OTU ids and tax ids, along with tools to interconvert or translate to one of the other formats (this could be as simple as an Excel spreadsheet or as complex as a web service that maintains your mapping and does the translation for you). @ 1.13 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286312671" format="1.1" reprev="1.13" version="1.13"}% d33 1 @ 1.12 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286297639" format="1.1" reprev="1.12" version="1.12"}% d41 3 a43 2 1. Finish TDWG document * Write up rationale for these approaches focussing on integration _Dan_ d56 1 a56 1 1. Solicit feedback from developers & stakeholders d61 8 a68 4 1. As needed, explore additional issues, including issues arising from stakeholders * Novice workflows for users to get their data online * TOLWEB as an alternative destination fir published trees _Arlin_ 1. Write a manuscript for publication a314 4 later, Bill @ 1.11 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1286224554" format="1.1" reprev="1.11" version="1.11"}% d11 1 a11 1 Work out current best practices for publishing a tree electronically, via an archive such as TreeBase? or Dryad, or via the semantic web (presumably using CDAO). At minimum, get some actual trees (+/- matrices) ready to submit. At best, draft a how-to guide for publication. This is **not** about best practices for inferring a tree. 
d40 1 a40 1 * Make it clear we are not just promoting Mesquite or Treebase d54 1 a54 1 * Investigate Dryad as an alternative archive _Dan_ d83 2 a84 2 * [[%ATTACHURL%/PF00034_4_phylo.xml][PF00034_4_phylo.xml]]: 4-taxon test case in phyloxml format * nexml file is not done yet d87 2 a88 2 Allows: * tree topology, branch lengths, labels d90 6 a95 5 Lacks: * character data * slot for taxon identifiers * slot for geo coordinates * slot for accession numbers d101 10 a110 7 Allows in addition to what Newick allows: * tag for species name (but parsers don't expect spaces) * tag for NCBI-style taxid (but not lsid) * tag for arbitrary node data (but not fully documented in format standard) Lacks: * File content d120 1 a120 1 Archaeopteryx-generated view of NHX file PF00034_4.nhx:
d124 11 a134 4 Allows: * Lacks: * d136 23 d163 12 a174 6 ---++++ PhyloXML Allows: * Lacks: * snippet showing a terminal branch with associated data and metadata d189 11 a199 5 ---++++ nexml Allows: * Lacks: * d201 4 d213 1 a213 1 * treebase seems to be the only archive supporting key features, but we don't want this to be all about treebase d222 3 a224 3 * TreeBase?, has a submission process * Dryad still need to investigate * other places to expose a nexml, phyloxml, or CDAO file d229 2 a230 2 * PhyloXML- find out how much of attributes below can be represented (www.phyloxml.org) * neXML (www.nexml.org) - write a description of how to represent LSID, GenBank accn, geo coordinates d240 1 a240 1 * treebase supports this via method link from matrix to tree d245 1 a245 1 ---++++ Uploading a tree to TreeBase (notes from chat with Bill Piel 9/29/10) d247 2 a248 2 1. Use Mesquite to prepare document before uploading to TreeBase * Why? Because Mesquite ensures correct formatting. Also it matches the Mequite software used in the Treebase website itself. d252 2 a253 2 * What if there are multiple specimens for the same taxon? Each name must be unique, so make sure the specimen ID etc, is a suffix formatted with a leading capital or a number so treebase won’t treat it as a new taxon name 1. After upload, click on yellow taxon button. Then click . Tree base tries to match up labels with existing taxon names. If not, checks uBio. If name may be a homonym – will be asked to choose which taxon map link to. NCBI handles the homonyms. TreeBase will link to taxon names to a GenBank taxid if possible. d259 1 a259 1 1. Not sure that all of this metadata is included in NeXML d263 1 a263 1 ---++++ Examples of metadata in treebase (via Bill Piel) d275 1 a275 1 ---++++ Results of using the treebase submission process d277 1 a277 1 Dan, Thorsten and Arlin all made attempts to use the submission process. 
Dan and Arlin both uploaded files with OTU labels that contain species names, and these were recognized >90% of the time by tb once the "validate taxon labels" button is pressed (why doesn't tb suggest these automatically?). Arlin also used the "row segment table" interface to annotate a submission with GenBank accessions. d279 1 a279 1 The following TreeBase2 screenshot (cropped) shows how a user may assign a UBio Id to an OTU:
d282 1 a282 1 The following TreeBase2 screenshot (cropped) shows a taxon table with match-able names. Pressing the "Validate taxon labels" button will automatically apply the results of name-matching, which in this case gives the correct attributions:
d285 1 d287 1 a287 3 ---++++ Examples of metadata in nexml from treebase Looking at the NeXML, I see that Rutger has, indeed, coded in the delivery of these metadata. For example, look at this: d291 1 a291 1 Instead of attaching lat/longs to some sort of row segment definition (e.g. the NeXML equivalent of a charset) he has simply listed them among the OTU elements. e.g.: d309 1 a309 1 Unfortunately the GenBank accession numbers are not yet included. I think Rutger has put exposure of some of these metadata on hold until the community of NeXML-ers decides the proper way to do it. d316 1 a316 1 The ability to link tree tips to other sources of data by taxon name or identifier is a threshold step for automating a whole range of analyses and displays. For example, the software I work on, Biodiverse, links phylogenies to species location data for visualisation and analysis. It would be great to have it running automatically online, linking (for example) trees from TreeBase to distributions from GBIF. To do this (and many other things), however, we need a better solution to the taxon-matching problem you describe. (Dan Rosauer) d322 1 a322 1 Thus, the goal of the working session is to tackle this problem, in terms of clarifying its dimensions and challenges, developing strategies based on available tools, and even developing new tools. An obvious starting point would be to focus on enabling ordinary phylogenetics & systematics users to use current standards (Newick, NHX, phyloxml, ...) to associate species names (possibly other tax ids) with phylogenies in their usual workflows. In some cases, it's just a matter of knowing how to use the file format properly, possibly aided by better tools for data input. 
For users whose workflows rely on Newick, we would need a way to keep a separate mapping of OTU ids and tax ids, along with tools to interconvert or translate to one of the other formats (this could be as simple as an Excel spreadsheet or as complex as a web service that maintains your mapping and does the translation for you). d346 1 a346 1 %META:FILEATTACHMENT{name="PF00034_4.nex" attachment="PF00034_4.nex" attr="" comment="4-taxon test case in NEXUS format" date="1286220985" path="PF00034_4.nex" size="677" stream="PF00034_4.nex" user="Main.ArlinStoltzfus" version="1"}% d348 1 @ 1.10 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1285950904" format="1.1" reprev="1.10" version="1.10"}% d7 1 a7 1 We decided to work on problem 1. d13 1 a13 1 ---+++ Rationale and work plan d15 1 a15 1 ---++++ why publish trees electronically? d17 10 a26 1 By "publish electronically" we mean "make accessible as information via the web". Most published phylogenies appear in journals. These journals may have online editions in which the tree appears as an image, as in [[http://www.pnas.org/content/97/10/5334.full this example]]. Because the tree is an image, the information in it requires a human to interpret. There are machine-readable representations, most commonly "Newick" (nested parentheses), that are used to produce such images. Publishing these representations would make the tree machine-readable, but it would not necessarily make it linkable. If the electronic tree file has only the tree structure with arbitrary labels at the tips, then one can't know how to link the tree to other data (without reading the published paper). d29 1 d34 30 a63 1 ---++++ work plan (revised post-meeting by Arlin & Dan) d65 3 a67 1 *the remainder of this section needs to be reformatted from Dan's notes below* d69 2 a70 28 What options are available? 
Investigating alternatives What can you do with linked trees – emphasising data integration We need to know a) What options are available? b) What workflows would be good for novices? c) Not just promoting Mesquite / Treebase d) What is the status of NeXML support for LSIDs and other identifiers? e) What about phyloXML? Why reuse trees rather than reusing the sequences Emphasise integration with other data sources rather than reuse Work up a simple 4-taxon example in the different data formats • Make one of the taxa one which has had taxonomic change, so that linking to the concept prevents real ambiguity. Tasks Initially as a TDWG doc a) Write up rationale for these approaches – focussing on integration – Dan (initially) b) Work up a toy case of the same 4-taxon tree in 5 different formats – Arlin a. Find out more about NHX + PhyloXML c) Explore options for embedded LSIDs and technology to resolve them - Dan d) Investigate Dryad as an alternative archive - Dan Then consultations Then as a paper e) Evidence on data accessibility -> reuse -> citations (Heather Piwowar) (only matters for journal) f) Novice workflows for users to get their data online a. Treebase b. Other systems too? g) not for TDWG) h) Investigate TOLWEB as an alternative destination for published trees – Arlin d72 13 a85 1 ---+++ A set of toy files for illustration d174 1 a174 1 * place to expose a nexml file d265 1 a265 1 ---++ Problem 1. Using trees in data integration d295 4 @ 1.9 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1285936109" format="1.1" reprev="1.9" version="1.9"}% d19 3 a21 2 * Requirement for journals * Reuse of trees for new projects = more citations d79 1 d89 2 d99 2 d202 7 d263 5 @ 1.8 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1285884362" format="1.1" reprev="1.8" version="1.8"}% d13 43 a55 6 ---+++Resources 1. servers or archives that accept trees 1. TreeBase upload interface 1. Dryad 1. 
information on commonly used file formats (NEXUS, nexml, phyloxml) 1. tree artefacts that we can use @ 1.7 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1285870279" format="1.1" reprev="1.7" version="1.7"}% d20 66 @ 1.6 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1285854449" format="1.1" reprev="1.6" version="1.6"}% d42 3 a44 3 * NHX * PhyloXML- find out how much of attributes below can be represented * neXML - write a description of how to represent LSID, GenBank accn, geo coordinates @ 1.5 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1285774036" format="1.1" reprev="1.5" version="1.5"}% d3 2 d7 1 a7 8 ---++ Problem 1. Using trees in data integration
The ability to link tree tips to other sources of data by taxon name or identifier is a threshold step for automating a whole range of analyses and displays. For example, the software I work on, Biodiverse, links phylogenies to species location data for visualisation and analysis. It would be great to have it running automatically online, linking (for example) trees from TreeBase to distributions from GBIF. To do this (and many other things), however, we need a better solution to the taxon-matching problem you describe. (Dan Rosauer)
---+++ synopsis We want to facilitate data integration, and the integrating variables with the most potential for the near future (next few years) are 1) species name (or other taxonomic identifier); 2) specimen or source identifier such as GenBank accn; and 3) geographic coordinates. Most phylogeny information artefacts (e.g., files) out there don't have any of these. Integrating phylogenetic information into the global web of data isn't going to get very far until we make it easy for users to put this information into their trees. Thus, the goal of the working session is to tackle this problem, in terms of clarifying its dimensions and challenges, developing strategies based on available tools, and even developing new tools. An obvious starting point would be to focus on enabling ordinary phylogenetics & systematics users to use current standards (Newick, NHX, phyloxml, ...) to associate species names (possibly other tax ids) with phylogenies in their usual workflows. In some cases, it's just a matter of knowing how to use the file format properly, possibly aided by better tools for data input. For users whose workflows rely on Newick, we would need a way to keep a separate mapping of OTU ids and tax ids, along with tools to interconvert or translate to one of the other formats (this could be as simple as an Excel spreadsheet or as complex as a web service that maintains your mapping and does the translation for you). d9 83 a91 1 ---+++ Who is interested in this and what you can bring (problems or solutions) d93 1 a93 1 $ Dan Rosauer : d95 1 d97 1 a97 1 ---+++ some specific challenges (sometimes with solutions) d99 1 a99 2 ---++++Maintaining alternative naming schemes Dan says that he often creates 2-column tables with one column containing the name used in the nexus file, the other the name used for the same taxon or OTU in the spatial data. Arlin also encounters this kind of name-reconciliation problem. 
a100 1 Bio::NEXUS provides tools to safely change the names in NEXUS files using a mapping provided in a simple 2-column input file. d102 14 a115 3 your_shell$ perl -MCPAN -e'install Bio::NEXUS' your_shell$ nextool.pl my_nexus_infile rename_otus my_name_mapping > my_nexus_outfile
d117 1 a117 1 where the mapping file (my_name_mapping in the example above) consists of lines, each with the old OTU name, followed by whitespace, followed by the new OTU name. d119 2 a120 1 [While a user-generated mapping file provides a one-off solution, including an LSID or other GUID for each tree tip could provide a more general solution if the ID resolves either to a) a taxon in a recognised taxon repository such as ITIS or b) a curated specimen whose taxonomy can be updated to stay current. In either case the creator of the taxonomy specifies this link rather than leaving the user to interpret. - Dan R] d122 1 a122 1 ---++ Problem 2. Best practices for publishing a tree electronically d124 1 a124 1 ---+++ Synopsis d126 1 a126 1 Work out best practices for publishing a tree electronically, via an archive such as TreeBase or Dryad, or via the semantic web (presumably using CDAO). At minimum, get some actual trees (+/- matrices) ready to submit. At best, draft a how-to guide for publication. d128 1 a128 1 ---+++ Resources d130 1 a130 8 * servers or archives that accept trees * [[http://www.treebase.org TreeBase]] upload interface * [[http://www.datadryad.org Dryad]] * information on commonly used file formats * NEXUS * nexml * phyloXML * tree artefacts that we can use d132 1 a132 1 ---+++ Best practices for publishing a tree electronically d134 1 a134 1 *Categories:* d136 1 a136 2 Accessions Taxonomic Links d138 5 a142 6 What Archives to upload to: TreeBase, Dryad What formats to use, and describe how it can be used, what information can be included Problems and gaps (are the standards we develop supported by the existing data formats and archives?) 
d144 1 a144 1 * Semantic link between a tree and a character block is not explicit d146 1 a146 1 -- Main.ArlinStoltzfus - 31 Aug 2010 @ 1.4 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1285770968" format="1.1" version="1.4"}% d50 18 a67 1 -- Main.ArlinStoltzfus - 31 Aug 2010@ 1.3 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="DanRosauer" date="1285648838" format="1.1" reprev="1.3" version="1.3"}% d3 3 a5 1 ---+Possible working session for TDWG2010: Linking Trees d9 1 a9 1 ---++synopsis d14 1 a14 1 ---++ Who is interested in this and what you can bring (problems or solutions) a15 1 $ Arlin : programming and phylogenetics experience, familiarity with CDAO, Bio::NEXUS d19 1 a19 1 ---++some specific challenges (sometimes with solutions) d21 1 a21 1 ---+++Maintaining alternative naming schemes d34 1 a34 2 ---+++Another challenge d36 1 a36 1 ---+++ Another challenge d38 1 d40 1 d42 9 a50 1 -- Main.ArlinStoltzfus - 31 Aug 2010 @ 1.2 log @none @ text @d1 1 a1 1 %META:TOPICINFO{author="HilmarLapp" date="1285625094" format="1.1" reprev="1.2" version="1.2"}% d31 2 d35 1 @ 1.1 log @none @ text @d1 2 a2 1 %META:TOPICINFO{author="ArlinStoltzfus" date="1283264129" format="1.1" reprev="1.1" version="1.1"}% d14 2 a15 2 $ Arlin : programming and phylogenetics experience, familiarity with CDAO, Bio::NEXUS $ another person : and so on @