%META:TOPICINFO{author="AnnieSimpson" date="1186199871" format="1.1" version="1.1"}% %META:TOPICPARENT{name="DevelopmentDocsGISIN"}% GISIN: Requirements

GISIN: Requirements Specification

Version: 2.2 -- RequirementsSpecificationGISIN

Last Update: 20th of May, 2007

1. Introduction
2. GISIN Web Site
3. Registry
4. Provider Toolkit
5. Consumer Toolkit
6. Protocol
Appendix A - Issues

1. Introduction

This document describes the requirements of the Global Invasive Species Information Network (GISIN). The goal of this system is to let users of the World Wide Web access the large amount of available data on invasive species far more easily than is possible today. Currently, accessing data on invasive species consists primarily of searching through Google, searching scientific literature to discover individuals who have data, or simple word-of-mouth. After the data is found, it can be time-consuming to filter it down to exactly what one is looking for and then convert it into a desired format.

The requirements are being set based on feedback from the user community. Features are musts unless followed by 'Need' or 'Want'. Needs and wants indicate features that will not be implemented unless time allows. Please see the associated definitions and introduction documents for background information.

1.1 Overall System Requirements

In general, the system must:

Use common terminology and resolve terminology conflicts
Be as simple to install and understand as is possible while still meeting the other requirements
Be able to be translated to other languages where needed
Be fast at requesting and receiving data
Support large data sets
Support all taxa
Operate in a client/server, request/response manner
Provide toolkits as required by complexity and user sophistication
Be platform independent
Be software independent
Be easy to implement by the user base
Be compatible with 90% of the providers' environments
Be as compatible with GBIF as possible within the other requirements
Be as compatible with TDWG architecture efforts as possible within the other requirements
Not have any recursive type definitions in the schemas
Explicitly type all elements with complex content, except pluralization containers

1.2 Scope

GISIN will support providers that have data in a relational or flat-file database as long as the database contains filtering capability (e.g. a “WHERE” clause in SQL or the equivalent).

GISIN will not support providers with data in text files, HTML files, or non-digital media. These providers should examine the GISIN web registry for a Data Common they can add their data to.

The system must allow the creation of:

A Registry with lists of available web sites and providers
Providers of data on invasive species
Commons that provide data from various sources
Consumers that obtain data from providers to provide summaries, higher-performance searching, maps, etc.
Portals that allow users to search across providers
Administration sites that search across providers and synthesize results to provide statistics

1.3 Components

To provide the capabilities described herein, the system will include:

Web site with all documentation, toolkits, and links to the registry
Registry that is machine and human searchable
Protocol specification for data exchange (including appropriate schemas)
Provider toolkit
Portal toolkit if needed

These components are detailed in the following sections.

1.4 Description of operation

The system operates in a client-server, request-response model. A client (a consumer) makes a request for information to a server (a provider). The server then returns a response based on the parameters within the request. The server may also return an error if appropriate. The type of data in the response may vary based on the type of request. Responses will typically be in Extensible Markup Language (XML) but may also include images in the future.
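As a minimal sketch of the request/response exchange described above, the Python below builds a request URL from filter parameters and parses a small XML response or error. The endpoint URL, the parameter names (Request, Taxon, Location), and the XML element names (Response, Record, Error) are hypothetical; the actual names are to be defined in the protocol specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical provider endpoint; real providers would be listed in the registry.
PROVIDER_URL = "http://provider.example.org/gisin"


def build_request(data_type, **filters):
    """Return a GET URL for a data type with the given filter parameters."""
    params = {"Request": data_type}
    params.update(filters)
    # Sort the parameters so the same request always yields the same URL.
    return PROVIDER_URL + "?" + urlencode(sorted(params.items()))


def parse_response(xml_text):
    """Return (records, error): a list of dicts, and None or the error text."""
    root = ET.fromstring(xml_text)
    if root.tag == "Error":
        return [], root.text
    records = [{child.tag: child.text for child in rec}
               for rec in root.findall("Record")]
    return records, None


# Illustrative response only; not taken from a real provider.
SAMPLE = """<Response>
  <Record><ScientificName>Tamarix ramosissima</ScientificName><BioStatus>Invasive</BioStatus></Record>
</Response>"""

url = build_request("Checklist", Taxon="Tamarix", Location="NZ")
records, error = parse_response(SAMPLE)
```

A consumer would send the URL with any HTTP client and pass the body of the reply to parse_response.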

1.5 Use Cases

The following is a summary set of use cases that must be supported by the system and its communication protocol. These use cases represent only a small subset of the system's functionality. The remainder of the requirements document defines the additional features.

1.5.1 Obtain a taxon list of species that have a particular bio-status in a location

Example: Which species are invasive to New Zealand?

1.5.2 Obtain a list of locations where a specified species has a particular bio-status

Example: In which countries is Tamarix considered invasive?

1.5.3 Obtain a list of checklists for a location and/or species

Example: Get the checklists for the genus Tamarix worldwide

1.5.4 Obtain a list of profile URLs for a location and/or species

Example: Get the list of URLs that contain profile information on Tamarix

1.5.5 Obtain a list of occurrences for a species in a particular location

Example: Get the occurrence locations for Tamarix in the United States

1.6 Types of data to be shared

1.6.1 First version

The following types of data will be shared in the first version of the system. Please see the definitions document for more details on these types.

Checklists: lists of taxa and their BioStatus in a selected location
Profile URLs: lists of URLs of available profiles
Occurrences: lists of known locations for an individual or population of a particular taxon identified on a specific date (observation, survey), selected by date, location, taxon, and/or BioStatus

Checklists are the highest priority for sharing followed by profile URLs, occurrences and then profiles.

1.6.2 Near future

The following data types are expected to be added in the near future:

Profiles: lists of profile information (fact sheets) on species
Project/Case Study information

1.6.3 Future expansion

The following data types are of interest but are not critical:

Experts
Images

1.7 Performance

Able to complete the following search within 1 second:

Obtain a taxon checklist from a local provider

Able to complete the following search within 1 minute:

Obtain a list of 1,000 occurrences including scientific name, date, and coordinates

1.8 Users

The users of GISIN fall into two large categories, end-users and implementers. The end-users are individuals using a web browser to find information on invasive species. Implementers are the people who are building or associated with the builders of the system including the members of GISIN.

The end-users of GISIN will be as varied a group of computer users as can be imagined. Anyone interested in invasive species, from young school children to experienced scientists and resource managers to politicians, may be accessing the system. This broad a user base will not have the technical expertise to understand the limitations of the available technology and will expect it to perform as other available web sites do. Examples of these web sites include Google, Yahoo, Wikipedia, and a large number of commercial web sites. This implies that the system must be very easy to use, flexible, and high-performing with very large data sets.

The implementers have been surveyed and were found to have a wide variety of needs and limited time to spend on technical issues (add link to survey results). This means the system must be as easy to implement as possible, well documented, and easy to monitor for quality and performance problems.

The different types of implementers (providers, consumers, data commons, and portals) will also have a variety of needs. Providers will have a variety of technical abilities and existing systems. The quantity and variety of data will vary from small data sets on one species to data sets with millions of entries for a large number of species. Consumers will include implementers caching data, producing maps, and summarizing data. This will require the system to perform as fast as possible and have flexibility in obtaining data. Data commons are a special data provider that will contain large quantities of data from a variety of end-users. Portals integrate multiple providers to aid users in searching across multiple providers. Requirements from various types of implementers are reflected in the remainder of this document.

For all providers and especially data commons, it is a must that the system maintains metadata on the source of the original data. Below are some examples of different types of users that the system is required to support.

1.8.1 End-users

Examples:

Annie: Find the list of current providers and email them with info on the next meeting!
Pest risk assessor: Find the list of locations where Tamarix is invasive in the US
High school class: Find the list of invasive species for their area
Resource manager: Add recent survey data on invasive species for a park
Pesticide company: Find a web site to add information on a new pesticide

1.8.2 Consumers

Examples:

National Institute of Invasive Species Science : Get occurrences of Tamarix for mapping
NISbase: Show multiple providers in the NISbase web site

1.8.3 Providers

The following are just a few examples of invasive species databases that desire to be online. See GISIN for the complete list.

Southwest Exotic Plant Mapping Program (SWEMP)
Invasive Plant Atlas of New England (IPANE)
Nonindigenous Aquatic Species (NAS)

1.9 Schedule

Below are the target dates for development of the system:

February 2007, Survey completed
March 2007, Requirement specification completed (1st draft)
April 2007, Protocol specification completed
May 2007, Test implementation of protocol available

1.10 Resources

GISIN is largely a grass-roots organization coordinated by GISIN and reliant upon the participation of a large number of individuals from a diverse group of organizations from across the world. The creation of the individual components is largely the responsibility of the organizations hosting the web servers.

1.10.1 Funding

Organizations have graciously provided support for individuals to attend meetings and review documents. Below is the only funding available specifically for GISIN development. The most significant problem is that there is no funding available for support and updates.

GBIF, 15K for development of the protocol, registry, and toolkit in 2007

1.10.2 People

The following roles need to be identified to complete and support the system:

Web Site manager: ?
GISIN Coordination: Annie
Task group convener: Jim
Testers: Volunteers
Documentation: Jim (potentially with subcontractors)
Outreach: ?

1.11 Quality

End-users will need to have a quality experience for GISIN to be successful. This means the web sites must be accessible, response times quick, results understandable, and data accurate.

1.11.1 Transmission stability

Transmission stability is affected by the Internet and the providers' hardware, software, and database. Complexity and size of the transfers will also affect stability. While the system should be as stable as any other web-service-based system, we are setting the following criteria.

Able to make 1000 transfers with less than 10% failure. No more than 10 retries to obtain all data.
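One way a consumer might apply these criteria is sketched below in Python. It assumes that a failed transfer raises IOError and that the retry limit applies per transfer; the document leaves both points open, so treat the function names and this reading of the criteria as assumptions.

```python
def fetch_with_retries(fetch, max_retries=10):
    """Call fetch() until it succeeds; return (result, failed_attempts).

    Re-raises the last error once more than max_retries attempts have failed.
    """
    failures = 0
    while True:
        try:
            return fetch(), failures
        except IOError:
            failures += 1
            if failures > max_retries:
                raise


def meets_stability_target(outcomes, max_failure_rate=0.10):
    """outcomes holds one boolean per transfer (True means success)."""
    failed = outcomes.count(False)
    return failed / len(outcomes) < max_failure_rate
```

A test harness could record one outcome per transfer over 1000 transfers and pass the list to meets_stability_target.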

1.11.2 Documentation readability

Documentation must be available on the world-wide-web and readable by individuals with appropriate background in web services. The documentation will initially only be available in English.

1.11.3 Data Quality

Below are the target tolerances for data within the system:

Over 90% occurrence locations correct
Over 90% taxonomic identification correct

1.11.4 Data Quantity

At introduction:

Checklists for 100 species
Profile URLs for 100 species
Over 1000 occurrences

Within 5 years of release:

Checklists for ? species
Profile URLs?
Occurrences ?
Profiles for 90% of the top 500 invaders world-wide

1.11.5 Data Relevance (fit for purpose)

Over 80% applies to invasive species

2. GISIN Web Site

A web site will be available with end-user and technical documentation and access to the registry. The web site will also contain a showcase for products created using the GISIN and tools for managing the system.

2.1 Documentation

End-user documentation will include an introduction to the GISIN, how to use the registry, and how to use the portal to find information on invasive species.

Technical documentation will include how to obtain and install the toolkits and specifications for the protocol. The documentation for the protocol and schema must be freely available and be very easy to create providers from. It should also be easy to create consumers and portals.

2.2 Registry

The web site will contain a registry, whose requirements are given in section 3.

2.3 Product Showcase

TBD

2.4 Management Tools

The management tools will be available through a password-protected section of the web site and will include:

Access to provider contact info
Development Wiki (available on the TDWG web site)

3. Registry

The registry will contain a list of providers, consumers, and portals with URLs for their web sites. For providers it will also include which types of data they contain and statistics on the number of species and areas of interest. The registry will follow an approach similar to DiGIR, where there are organizations and then within each organization there can be multiple data sets.

The registry must have a data sharing agreement and track who has agreed to it.

3.1 Data Maintained

The registry will include the following fields for each organization:

Short Name
Long Name
Logo
URL for more information
Contact Name
Contact Phone
Contact Email

For each web service within each organization we will have:

Type: Whether it is a CheckList, ProfileURL, Occurrence, or other web service
URL for the web service
Contact info (email and phone for when there are problems)
Hosting organization
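The two field lists above could be represented roughly as follows. This is only a sketch in Python: the class names, attribute names, and the helper function are assumptions chosen for illustration, not part of any agreed registry schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WebService:
    service_type: str          # "CheckList", "ProfileURL", "Occurrence", or other
    url: str                   # URL for the web service
    contact_email: str         # who to contact when there are problems
    contact_phone: str
    hosting_organization: str


@dataclass
class Organization:
    short_name: str
    long_name: str
    logo_url: str
    info_url: str              # URL for more information
    contact_name: str
    contact_phone: str
    contact_email: str
    services: List[WebService] = field(default_factory=list)


def services_of_type(organizations, service_type):
    """Return every registered web service of the given type."""
    return [svc for org in organizations for svc in org.services
            if svc.service_type == service_type]
```

Keeping the web services as a list inside each organization mirrors the DiGIR-like layout of organizations that each hold multiple data sets.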

3.2 Web Site

Below is a list of the features that are required for each of the database tables mentioned above.

Search/Browse for web services by:

Area of interest
Taxa of interest
Types of data contained (CheckLists, Profile URLs, Occurrences, etc.)
Languages (later for profile information)

4. Provider Toolkit

The provider toolkit will be available on a set of mirrored servers on the web and will make it easy for most providers to add their data to the system. The toolkit will contain:

An easy, web-based setup for providers (similar to DiGIR)
Examples for most providers
Documentation on:
- Installation
- How to modify the toolkit for specific database schemas
- How to register as a provider
- Best practices (don’t request lots of data repeatedly, harvest at night, etc.)
- Protocol
- How to adapt the examples for other languages

The documentation will only be available in English initially.

The toolkit has the following general requirements:

Easy to install with minimal technical ability for the most common system configurations
Flexible to be adapted to less common systems
Maps existing provider terminology to standard as needed
Can be queried for version
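The terminology-mapping requirement in the list above might be met with a simple lookup table that each provider edits during setup. The Python sketch below assumes such a table; the local terms and the standard terms shown are invented for illustration, and the real standard vocabulary is the one given in the definitions document.

```python
# Hypothetical mapping from one provider's local terms to standard terms.
# Both sides of this table are illustrative, not the agreed vocabulary.
TERM_MAP = {
    "exotic": "Introduced",
    "noxious weed": "Invasive",
    "indigenous": "Native",
}


def to_standard(local_term, mapping=TERM_MAP):
    """Translate a provider's local term to the standard term."""
    key = local_term.strip().lower()
    if key not in mapping:
        # Fail loudly rather than pass an unmapped term to consumers.
        raise KeyError("No standard term mapped for: " + local_term)
    return mapping[key]
```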

The bulk of the remainder of this section documents the characteristics of the systems that must be supported to allow our providers to implement the protocol.

4.1 Operating Systems

The toolkit will support the following operating systems:

MS-Windows Server: Must
MS-Windows XP: Want
Linux: Need
Mac OS X: Want
Unix: Need

4.2 Web Servers

The toolkit will support the following web servers:

MS Internet Information Server: Must
Apache/Tomcat: Must

4.3 Web Development Frameworks

The toolkit will be supported on the following web development frameworks:

PHP: Must
Active Server Pages (ASP): Must
Java Server Pages (JSP): Must
Cold Fusion (CFM): Want
Common Gateway Interface (cgi): Want

4.4 Supported programming languages

The toolkit will be supported on the following programming languages:

PHP: Must
Java: Must
VBScript: Must
JScript: Must?
ColdFusion: Want
Perl: Want
C++: Want

4.5 Supported Databases

MySQL: Must
PostgreSQL: Need
MS SQL Server: Must?
MS Access: Must?
Excel: Want

4.6 DiGIR Learnings

DiGIR is the most pervasive of the biological data exchange standards. The GISIN toolkit could be thought of as standing on the shoulders of the DiGIR toolkit and taking the next step in ease of use for implementers. This includes:

No complex (i.e. hierarchical) queries
No extra features (e.g. WMS)
Minimal code to adapt to non-supported databases
Easier to modify SQL query creation to adapt to more databases and schemas
More flexible row access (e.g. RowStart, RowCount without requiring partial scientific name)
Quality help for providers
No name spaces to prevent multiple version branches of the protocol
Easier to install
Easier to adapt to relational database schemas
Easier to understand
Does not require an additional framework to be added to the server
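To illustrate what easily modified SQL query creation and flexible row access could look like, the sketch below keeps the provider-specific part in a single column map and builds a parameterized query with a row start and row count. The table name, column names, and filter names are hypothetical, and the LIMIT/OFFSET syntax shown works in SQLite, MySQL, and PostgreSQL but would need adapting for other databases.

```python
import sqlite3

# Hypothetical mapping from protocol filter names to one provider's columns.
# Adapting to another schema should only require editing this table.
COLUMN_MAP = {"Taxon": "sci_name", "Location": "country_code"}


def build_query(table, filters, row_start=0, row_count=100):
    """Return (sql, params) for a parameterized SELECT with ANDed filters.

    The table name is assumed to come from provider configuration,
    never from the incoming request.
    """
    clauses, params = [], []
    for name, value in filters.items():
        clauses.append(COLUMN_MAP[name] + " = ?")
        params.append(value)
    sql = "SELECT * FROM " + table
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY id LIMIT ? OFFSET ?"
    params += [row_count, row_start]
    return sql, params


# Small in-memory database to exercise the query builder.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences "
             "(id INTEGER PRIMARY KEY, sci_name TEXT, country_code TEXT)")
conn.executemany(
    "INSERT INTO occurrences (sci_name, country_code) VALUES (?, ?)",
    [("Tamarix", "US"), ("Tamarix", "NZ"), ("Tamarix", "US"), ("Bromus", "US")],
)
sql, params = build_query("occurrences",
                          {"Taxon": "Tamarix", "Location": "US"}, row_start=1)
rows = conn.execute(sql, params).fetchall()
```

Passing the filter values as parameters rather than concatenating them into the SQL text keeps request values from being interpreted as SQL.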

4.7 Scope

The provider toolkit will need to allow providers the following scope:

Any Taxa
Checklists, Profile URLs and/or Occurrences
All BioStatuses

Can have the following limitations:

One country per installation
One language per installation

5. Consumer Toolkit

A consumer toolkit may be needed to make it easy to access information in the system, make requests, and parse responses. It is yet to be determined whether it is required.

6. Protocol

This section provides the requirements for the protocol used to communicate data on invasive species between computers. The protocol must meet the applicable requirements from the material above as well as the additional requirements in this section.

Protocol will have the following general requirements:

Independent of programming language
Allow for fast transfer of large quantities of data
Simple as possible to implement given the other requirements
Can be implemented with just the documentation
There will be a minimal amount of nesting, in other words hierarchical structures will be as flat as possible
The requests and responses should be human readable (i.e. XML is ok, binary is not)
Have a set of standard terminologies
The protocol will use non-extended ASCII unless transferring textual information
Compatible with common firewall settings
Compatible with TCP/IP protocol (Internet)

The last two items effectively require the protocol to operate using HTTP through port 80. This is the only method of communication that is typically available for web servers, as most other ports are dedicated or blocked by firewalls.

6.1 General

6.1.1 Language Support

As a global system, GISIN must allow users to provide and obtain textual information in various languages. However, most providers will not have information available in multiple languages. To allow providers to operate in their own language and to allow consumers to ingest and then provide translated versions of text, the following strategy will be used.

Text in the language native to the providers will be returned by default
Consumers can query for which languages are supported
Consumers can then query for text in a specific language

This issue only applies to language specific transfers which are discouraged in favor of “coded” transfers.
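The three-step language strategy above can be sketched as a small provider-side class. This is only an illustration in Python; the class name, method names, and the use of short language codes as keys are assumptions, since the document does not specify how languages are identified.

```python
class ProviderText:
    """Holds one piece of text in each language a provider supports."""

    def __init__(self, native_language, texts):
        self.native_language = native_language
        self.texts = texts  # dict: language code -> text

    def supported_languages(self):
        """Answer a consumer's query for the supported languages."""
        return sorted(self.texts)

    def get_text(self, language=None):
        """Return text in the requested language, or the native one by default."""
        if language is None:
            language = self.native_language
        if language not in self.texts:
            raise LookupError("Language not supported: " + language)
        return self.texts[language]
```

A consumer would first call supported_languages and only then request a specific language, falling back to the default text when its preferred language is absent.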

6.1.2 Taxa Identification

Taxa will be identified by standard Scientific Name (i.e. Kingdom, Genus, Species, Subspecies, Variety). Date and author may be added for specific taxa concepts.

The protocol will not support requesting taxa by common name.

6.1.3 Location names

Locations can be provided either by coordinates (points, polygons, and bounding boxes) or by textual “names”. Coordinates will be geographic coordinates in the WGS84 or HARN datums. Names will be ISO names where available, or names from other standards when those are available. If a language-specific name is used, it will be transferred with its language. If a name is specified for a location, its location must accompany it.
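As a rough illustration of filtering by a bounding box in geographic coordinates, the Python sketch below tests whether a point falls inside a box. The field names and the sample box are assumptions for illustration, and the check deliberately ignores boxes that cross the 180-degree meridian.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """Geographic bounding box in decimal degrees."""
    west: float
    south: float
    east: float
    north: float


def contains(box, lon, lat):
    """True when the point lies inside or on the edge of the box.

    Assumes west <= east, so boxes crossing the 180-degree meridian
    are not handled by this sketch.
    """
    return box.west <= lon <= box.east and box.south <= lat <= box.north


# Rough, illustrative box only; not an authoritative boundary.
SAMPLE_BOX = BoundingBox(west=-125.0, south=24.0, east=-66.0, north=50.0)
```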

6.2 Requests

All parameters in requests that filter the data will be ANDed together. In the future an OR may be provided for certain parameters by concatenating them with commas. Data searches requiring Boolean OR operators can be executed with multiple requests or by requesting a more general set of data and then filtering the data to obtain just the desired data.
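The ANDing of filters, and the emulation of OR by issuing several requests, can be sketched as follows. The record layout (dictionaries with an Id field) and the function names are hypothetical, and the request function is an in-memory stand-in for a real call to a provider.

```python
def matches(record, filters):
    """True when the record satisfies every filter (filters are ANDed)."""
    return all(record.get(name) == value for name, value in filters.items())


def request(records, filters):
    """Stand-in for one protocol request against a provider's records."""
    return [rec for rec in records if matches(rec, filters)]


def request_any(records, name, values, other_filters=None):
    """Emulate OR on one parameter: one request per value, merged by Id."""
    merged = {}
    for value in values:
        filters = dict(other_filters or {})
        filters[name] = value
        for rec in request(records, filters):
            merged[rec["Id"]] = rec  # the Id prevents duplicates across requests
    return [merged[key] for key in sorted(merged)]


# Illustrative records only.
RECORDS = [
    {"Id": 1, "Taxon": "Tamarix", "Location": "US"},
    {"Id": 2, "Taxon": "Tamarix", "Location": "NZ"},
    {"Id": 3, "Taxon": "Bromus", "Location": "US"},
    {"Id": 4, "Taxon": "Tamarix", "Location": "MX"},
]
```

For example, request_any(RECORDS, "Location", ["US", "NZ"], {"Taxon": "Tamarix"}) corresponds to two separate requests whose results are merged by identifier.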

For each of the supported data types, the protocol must allow requests to be made for:

Whether the provider contains data for a given taxon and/or location
How many records the provider has for a given taxon and/or location
A block of records for a given taxon and/or location

Blocks of records can be requested given a start number and a number of rows or an entire set of available records.
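A consumer could collect a full record set by requesting consecutive blocks, as in the Python sketch below. The fetch_block callable is supplied by the caller and is hypothetical, as is the assumption that row numbering starts at zero; the protocol specification would fix both.

```python
def fetch_all(fetch_block, block_size=100):
    """Collect every record by requesting consecutive blocks.

    fetch_block(row_start, row_count) is a caller-supplied function that
    returns a list of at most row_count records starting at row_start.
    """
    records = []
    row_start = 0
    while True:
        block = fetch_block(row_start, block_size)
        records.extend(block)
        if len(block) < block_size:
            # A short (or empty) block means the provider has no more rows.
            return records
        row_start += block_size
```

When the total is an exact multiple of the block size, the loop makes one extra request that returns an empty block; asking the provider for the record count first would avoid it.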

For consumers to minimize updates to their databases the records can be filtered based on CreationDate and LastModifiedDate. Each record must contain a unique identifier to determine when a record has been updated and to prevent duplication.
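The role of the unique identifier and LastModifiedDate in keeping a consumer's cache current can be sketched as below. The dictionary keys Id and LastModifiedDate and the function name are assumptions for illustration; dates are represented as Python date objects.

```python
from datetime import date


def merge_updates(cache, records):
    """Merge fetched records into a cache keyed by unique identifier.

    A record replaces the cached copy only when it is new or has a later
    LastModifiedDate. Returns the number of records added or updated.
    """
    changed = 0
    for rec in records:
        existing = cache.get(rec["Id"])
        if existing is None or rec["LastModifiedDate"] > existing["LastModifiedDate"]:
            cache[rec["Id"]] = rec
            changed += 1
    return changed


# Illustrative cache content only.
cache = {"occ-1": {"Id": "occ-1", "LastModifiedDate": date(2007, 1, 10)}}
```

A consumer would remember the date of its last harvest, request only records modified since then, and pass the result to merge_updates.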

Below are requirements within each data type.

6.2.1 Checklists

Filter by:

Location
Taxa
Date range for when the biostatus is valid

Please reference the BioStatus spreadsheet (or the current protocol specification) for the latest information on the BioStatus fields.

6.2.2 Profile URLs

Filter by:

Taxa

6.2.3 Occurrences

Filter by:

Location (code, name, or bounding box)
Taxa
Date range

6.3 Responses

Responses will be returned as XML unless images are requested. For images, the client will specify an image format for the response.

To show information within a portal each request should return metadata including a URL for additional information, a URL for a logo, and a human readable title.

The following sections document the tags that can be returned in a response for a given data type. This section documents which tags the protocol must allow for; which tags are required to be returned will be defined in the protocol specification.

Each element needs to contain a globally unique identifier (GUID).

6.3.1 Checklists

All fields from filters plus:

Date Range: Date when BioStatus is valid
Data Source (citation)

These are from the TDWG meeting; are they still required?

ArrivalDate
Notes

6.3.2 Profile URL

All fields from filters plus:

URL for the profile
URL for an appropriate image for portals to display near the profile listing

6.3.3 Occurrences

All fields from filters plus:

Date
Ancillary information:

Percent Cover
Weight
Height
Sex
LifeStage

Appendix A - Issues

A.1 How do we represent spatial accuracy?

- Added accuracy to coordinates

A.2 How to represent taxonomic identification accuracy?

- ?

A.3 Should we provide a portal interface that gives lists of profiles by species?

- Requires a search mechanism that returns list of URLs
- Search by: Taxon, Fields available, Keywords
- Google-like search engine for profiles

A.4 Is GISIN providing a portal or just a list of available portals?

- A portal

A.5 How to get folks to add data when it affects trade?

A.6 Will we include taxon concept IDs?

A.7 Will we use UDDI for the registry?

A.8 We do not have the resources to provide a DiGIR like installation for all 3 major languages.

- My proposal is to provide examples in all 3 languages and, since PHP is the most common and most portable language, provide an easy install in PHP.

A.9 How do we obtain globally unique identifiers?

A.10 How do we monitor quality and resolve problems with data?

-- Main.AnnieSimpson - 04 Aug 2007
