Treatment Data Access
- Access to Plazi Treatments
- Plazi API
- List of Plazi’s available DwC-Archives from GBIF API
- Appendix: Darwin Core Archive Content
- Further reading
- Downloads
- Support and Questions
- Version
Access to Plazi Treatments
What is a treatment?
The Plazi TreatmentBank deals with scientific, published, biosystematic literature. This literature documents and describes all of the world’s ca. 1.9 million known species in an estimated corpus of over 500 million published pages. The cited publications in Plazi are all available at the Biodiversity Literature Repository at Zenodo/CERN.
Treatments are well defined parts of articles that define the particular usage of a scientific name by an author at a given time (the publication)1. In other words, each scientific name has one or several treatments, depending on whether there exists only an original description of a species or whether there are subsequent re-descriptions. Similar to bibliographic references, treatments can be cited, and subsequent usages of names cite earlier treatments.
Treatments are a synthesis of the knowledge of a given species at a given time. They can be very rich in data, explicitly or implicitly, detailed or summarized, and include many references to external data sources, such as scientific names, collection codes, DNA-codes.
The data can be semantically enhanced and linked. As parts of publications, treatments need to be extracted. Most recently, treatments are tagged in electronic publications with the TaxPub extension of the National Library of Medicine’s Journal Article Tag Suite (JATS) 1. This allows automatic extraction. Still, the majority of the ca. 2000 journals and books publishing treatments use the PDF format at best. Plazi has tools to extract treatments, enhance the embedded data, and import it into its SRS- Treatment Search Portal for public online access.
The data, that is, treatments and observation data, can be viewed as HTML, XML, or RDF, or can be harvested with the protocols provided below. The data is provided for harvesting as Darwin Core Archives.
What is a DarwinCore Archive?
The Darwin Core Archive format is a simple and extensible schema for sharing biodiversity data, especially catalogue data based on the ratified Darwin Core terms and the Darwin Core text guidelines [4]. Darwin Core is a standard for describing sample data in the Biodiversity Informatics community. It has been developed by the Global Biodiversity Information Facility (GBIF). Darwin Core Archives use a table-based, “spreadsheet-style” format that is more comfortable and familiar to biologists. The format uses plain text files, but it is tied to processes that support consistency and stability.
Fig. Schematic representation of a Darwin Core Archive and its components 2
The GBIF GNA format consists of a set of files in which one or more files represent the ‘core’ taxonomic data, where a single row represents a single taxon reference. The DarwinCore Taxon class provides the majority of concepts supported in the format that enable taxonomic and nomenclatural semantics and syntax (classification, taxonomic and nomenclatural synonymy, status, etc.) to be expressed.
Other files represent “extensions” to this core table and allow additional data elements to be linked to a taxon in the core table with a many to one relationship. The overall topology of one or more of these extensions to the core table is referred to as a “star schema” and provides a compromise between an overly simple flat-file representation of data and more complex multi-related files. In addition to these files, an additional descriptor file named “meta.xml” serves as a key to the other files. Collectively, these files can be further zipped into a single compressed archive file for portability. This compressed file is known as a Darwin Core Archive (DwCA) file 2.
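As a rough illustration of how meta.xml ties the star schema together, the following Python sketch lists the core file and the extension files named in a descriptor. It assumes the descriptor follows the Darwin Core text guidelines (namespace http://rs.tdwg.org/dwc/text/, with core and extension elements that each contain files/location). The function name and the shortened sample descriptor are invented for illustration and are not taken from a Plazi archive.

```python
import xml.etree.ElementTree as ET

# Namespace of the Darwin Core text guidelines (assumed for meta.xml).
DWC_TEXT_NS = {"dwc": "http://rs.tdwg.org/dwc/text/"}


def describe_archive(meta_xml: str) -> dict:
    """Return the core file and the extension files named in a meta.xml string."""
    root = ET.fromstring(meta_xml)

    def locations(element):
        # Each core/extension element lists its data files under files/location.
        found = element.findall("dwc:files/dwc:location", DWC_TEXT_NS)
        return [loc.text.strip() for loc in found]

    core = root.find("dwc:core", DWC_TEXT_NS)
    extensions = root.findall("dwc:extension", DWC_TEXT_NS)
    return {
        "core": locations(core) if core is not None else [],
        "extensions": [name for ext in extensions for name in locations(ext)],
    }


# Hypothetical, heavily shortened descriptor used only to exercise the function.
SAMPLE_META = """<archive xmlns="http://rs.tdwg.org/dwc/text/">
  <core rowType="http://rs.tdwg.org/dwc/terms/Taxon">
    <files><location>taxa.txt</location></files>
    <id index="0"/>
  </core>
  <extension rowType="http://rs.tdwg.org/dwc/terms/Occurrence">
    <files><location>occurrences.txt</location></files>
    <coreid index="0"/>
  </extension>
</archive>"""

print(describe_archive(SAMPLE_META))
```

Each extension row points back to a core row through its coreid column, which is what makes the layout a star around the core table.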
The Darwin Core Archive used by Plazi
There is one archive per article stored in Plazi, containing the data from all the treatments in the article. Archives contain nine files:
- meta.xml: description of columns in data files
- eml.xml: archive meta data, i.e., bibliographic citation of article, etc.
- taxa.txt: the archive core file, containing one row per taxon in the nomenclature section of a treatment, thus one or multiple rows per treatment, with any after the first for each treatment handling synonymizations
- occurrences.txt: occurrence data, containing one row per materials citation, with an ID reference to taxa.txt
- description.txt: description data, containing one row per descriptive treatment section, with an ID reference to taxa.txt
- distribution.txt: general distribution data, one row per distribution statement, with an ID reference to taxa.txt
- media.txt: full text treatments with HTML markup with additional meta data like a bibliographic citation, one row per treatment, with an ID reference to taxa.txt
- references.txt: bibliographic references to individual treatments, one row per treatment, with an ID reference to taxa.txt
- vernaculars.txt: vernacular names of treatment taxa, currently empty, as we do not have or mark this kind of data
For a detailed description of the content of each file see Appendix: Darwin Core Archive Content
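A quick way to sanity-check a downloaded archive is to compare its members with the nine file names listed above. The sketch below does this with Python's standard zipfile module. It assumes the files sit at the top level of the ZIP, which is not stated in this document, and the helper names are illustrative only.

```python
import io
import zipfile

# File names as listed above for a Plazi Darwin Core Archive.
EXPECTED_FILES = {
    "meta.xml", "eml.xml", "taxa.txt", "occurrences.txt", "description.txt",
    "distribution.txt", "media.txt", "references.txt", "vernaculars.txt",
}


def missing_files(archive_bytes: bytes) -> set:
    """Return the expected file names that are absent from the ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        present = set(archive.namelist())
    return EXPECTED_FILES - present


def build_demo_archive(names) -> bytes:
    """Build an in-memory ZIP with empty members (for illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(names):
            archive.writestr(name, "")
    return buffer.getvalue()


# Demo: an archive that lacks vernaculars.txt.
demo = build_demo_archive(EXPECTED_FILES - {"vernaculars.txt"})
print(missing_files(demo))
```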
Treatment Data representation in Plazi
The treatment data is stored in the Treatment Search Portal in native, generic XML included in tagged original publications. The tagged elements are (a) additionally stored in dedicated index structures to support search and (b) extracted and exported in several formats, including DwCA.
A treatment document includes two main elements, the header including the metadata based on the Metadata Object Description Schema (MODS) and the body.
<tax:taxonx> <tax:taxonxHeader> <tax:taxonxBody>
The generic XML can be converted via XSLT into HTML, TaxonX XML (a schema developed to model biosystematics legacy literature), and RDF. The representations are available at the following addresses.
HTML:
http://treatment.plazi.org/id/31F96F41-E3E0-02BD-8898-5A4F3A20E45A
(this is also the persistent HTTP URI used as identifier for treatments)
Plain XML:
http://tb.plazi.org/GgServer/xslt/31F96F41E3E002BD88985A4F3A20E45A
TaxonX XML:
http://tb.plazi.org/GgServer/taxonx/31F96F41E3E002BD88985A4F3A20E45A
RDF:
http://tb.plazi.org/GgServer/rdf/31F96F41E3E002BD88985A4F3A20E45A
or
http://treatment.plazi.org/id/31F96F41-E3E0-02BD-8898-5A4F3A20E45A.rdf
The terms used in TaxonX and RDF are either imported from existing schemas (such as Darwin Core for observation records, MODS for bibliographic data) or are, if not available, defined in schemas (TaxonX) or ontologies (RDF: in development)
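The example addresses above differ only in the path segment and in whether the identifier is written with or without dashes. The following Python sketch derives all of them from the compact 32-digit form. The URL patterns are read off the examples in this section rather than from a formal specification, and the function names are hypothetical.

```python
import re


def dashed_uuid(compact: str) -> str:
    """Insert dashes into a 32-digit hex identifier (8-4-4-4-12 grouping)."""
    if not re.fullmatch(r"[0-9A-Fa-f]{32}", compact):
        raise ValueError(f"not a 32-digit hex identifier: {compact!r}")
    parts = [compact[:8], compact[8:12], compact[12:16], compact[16:20], compact[20:]]
    return "-".join(parts)


def treatment_urls(compact: str) -> dict:
    """Build the representation URLs for one treatment (patterns assumed from the examples)."""
    base = "http://tb.plazi.org/GgServer"
    persistent = f"http://treatment.plazi.org/id/{dashed_uuid(compact)}"
    return {
        "html": persistent,                    # also the persistent identifier
        "xml": f"{base}/xslt/{compact}",
        "taxonx": f"{base}/taxonx/{compact}",
        "rdf": f"{base}/rdf/{compact}",
        "rdf_persistent": persistent + ".rdf",
    }


for name, url in treatment_urls("31F96F41E3E002BD88985A4F3A20E45A").items():
    print(name, url)
```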
Plazi API
Treatment data is open access and can be accessed via HTTP GET as described in detail below. The treatment data is provided in HTML, various XML flavors, and RDF.
Obtaining a list of all the treatments available from Plazi
HTTP GET - RSS
http://tb.plazi.org/GgServer/xml.rss.xml
Response (RSS, in Atom XML, encoded in UTF-8)
Entries of interest
- channel/item/link: the link to the XML treatment
- channel/item/title: the taxon name and authority
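The two entries above can be pulled out of the feed with a few lines of Python. This is a minimal sketch that assumes a plain RSS 2.0 layout (rss/channel/item without XML namespaces on title and link); the real feed may differ. The sample feed and the taxon name in it are made up.

```python
import xml.etree.ElementTree as ET


def list_treatments(rss_xml: str) -> list:
    """Return (title, link) pairs for every channel/item in an RSS document."""
    root = ET.fromstring(rss_xml)
    entries = []
    for item in root.findall("channel/item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        entries.append((title, link))
    return entries


# Invented sample feed; the taxon name is a placeholder.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Examplea typica Author, 2000</title>
      <link>http://tb.plazi.org/GgServer/xml/8C4CE845A6DEE6FDFD1600A70D5BC71B</link>
    </item>
  </channel>
</rss>"""

for title, link in list_treatments(SAMPLE_FEED):
    print(title, link)
```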
Accessing a particular DwC-Archive
HTTP GET - ZIP Archive
tb.plazi.org/GgServer/dwca/<dataSetUUID>.zip
Replace <dataSetUUID> with any UUID from the GBIF-provided listing (see below). It is also possible to use the endpoint URL from that listing directly.
Example:
http://tb.plazi.org/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip
Response (ZIP Archive, containing XML and tab separated TXT files, all encoded in UTF-8)
Entries of interest:
- eml.xml: an XML file containing the meta data of the publication, in MODS format
- taxa.txt: a tab separated TXT file listing the taxa and treatments the DwC-Archive contains, plus higher taxonomy; the Identifier column takes the form <treatmentUUID>.taxon, and the treatment UUID can be used to access the treatment on the Plazi servers (see below)
- occurrences.txt: a tab separated TXT file containing occurrence data; the TaxonID column references the Identifier column in taxa.txt, the data column headers are DwC terms
- media.txt: a tab separated TXT file containing HTML versions of the treatments; the TaxonID column references the Identifier column in taxa.txt, the HTML treatments are located in the Description column
- references.txt: a tab separated TXT file containing bibliographic references to individual treatments
A detailed description of contents can be found here
http://github.com/plazi/Plazi-Communications/wiki/GBIF#darwin-core-archive
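To get from an archive to individual treatments, the treatment UUIDs have to be read out of taxa.txt. The sketch below does this for a tab separated string. It assumes that the identifier is in the first column and that the file starts with a header row; both are assumptions that should be checked against meta.xml for a real archive. The sample rows and taxon names are invented.

```python
import csv
import io


def treatment_uuid(identifier: str) -> str:
    """Strip the suffix from an identifier such as '<treatmentUUID>.taxon'."""
    return identifier.split(".", 1)[0]


def treatment_uuids(taxa_txt: str, id_column: int = 0, has_header: bool = True) -> list:
    """Collect the distinct treatment UUIDs referenced in a tab separated taxa file."""
    reader = csv.reader(io.StringIO(taxa_txt), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if has_header:
        rows = rows[1:]
    seen = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        uuid = treatment_uuid(row[id_column])
        if uuid not in seen:
            seen.append(uuid)
    return seen


# Invented sample: one treatment with a taxon row and a synonym row.
SAMPLE_TAXA = (
    "taxonID\tscientificName\n"
    "8C4CE845A6DEE6FDFD1600A70D5BC71B.taxon\tExamplea typica\n"
    "8C4CE845A6DEE6FDFD1600A70D5BC71B.syn\tExamplea synonymica\n"
)

print(treatment_uuids(SAMPLE_TAXA))
```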
Accessing a particular treatment on the Plazi servers
HTTP GET - Page displaying Treatment
tb.plazi.org/GgServer/html/<treatmentUUID>
Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives
Example:
http://tb.plazi.org/GgServer/html/8C4CE845A6DEE6FDFD1600A70D5BC71B
Response (HTML, encoded in UTF-8): a web page displaying the treatment
HTTP GET - generic XML
tb.plazi.org/GgServer/xml/<treatmentUUID>
Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives
Example:
http://tb.plazi.org/GgServer/xml/8C4CE845A6DEE6FDFD1600A70D5BC71B
Response (XML, encoded in UTF-8): the raw, generic XML version of the treatment, which all other representations are generated from
HTTP GET - TaxonX
tb.plazi.org/GgServer/taxonx/<treatmentUUID>
Replace <treatmentUUID> with the actual treatment UUID from the taxa.txt file found in DwC-Archives
Example:
http://tb.plazi.org/GgServer/taxonx/8C4CE845A6DEE6FDFD1600A70D5BC71B
Response (XML, encoded in UTF-8): a TaxonX XML version of the treatment
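The three requests above follow the same pattern, so a single helper can cover them. The following Python sketch builds the URL and fetches the response with the standard library. The function names, the timeout value, and the injectable opener parameter (which allows the function to be tried without network access) are choices made for this illustration, not part of the Plazi API.

```python
import urllib.request

BASE_URL = "http://tb.plazi.org/GgServer"
FORMATS = ("html", "xml", "taxonx")  # path segments documented above


def treatment_url(treatment_uuid: str, fmt: str = "xml") -> str:
    """Build the request URL for one treatment in the given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unknown format {fmt!r}, expected one of {FORMATS}")
    return f"{BASE_URL}/{fmt}/{treatment_uuid}"


def fetch_treatment(treatment_uuid: str, fmt: str = "xml",
                    opener=urllib.request.urlopen, timeout: float = 30) -> str:
    """Fetch a treatment via HTTP GET and decode it as UTF-8."""
    with opener(treatment_url(treatment_uuid, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(treatment_url("8C4CE845A6DEE6FDFD1600A70D5BC71B", "taxonx"))
```

Calling fetch_treatment("8C4CE845A6DEE6FDFD1600A70D5BC71B", "taxonx") would then issue the TaxonX request shown in the example above.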
List of Plazi’s available DwC-Archives from GBIF API
GBIF regularly harvests Plazi data and can be used as an alternative access point.
HTTP GET - JSON
api.gbif.org/v1/organization/7ce8aef0-9e92-11dc-8738-b8a03c50a862/publishedDataset?offset=<20k>
Replace <20k> with any multiple of 20 (including 0) to page through the list. It is also possible to use a limit other than 20, with the offset then being a multiple of that other limit.
Example (first 20 datasets):
Response (JSON)
{
"offset": 0,
"limit": 1,
"endOfRecords": false,
"count": 1129,
"results": [{
"key": "3e8b196b-c482-47f1-9574-772141310c40",
"installationKey": "7ce8aef1-9e92-11dc-8740-b8a03c50a999",
"publishingOrganizationKey": "7ce8aef0-9e92-11dc-8738-b8a03c50a862",
"external": false,
"numConstituents": 0,
"type": "CHECKLIST",
"title": "Revision of the ant genus Myrmoteras in the Malay Archipelago (Hymenoptera, Formicidae).",
"description": "UNAVAILABLE",
"language": "eng",
"homepage": "http://tb.plazi.org/GgServer/summary/23A1465DDF212F7DA589F41341B83FCC",
"citation": {
"text": "Plazi.org taxonomic treatments database: Revision of the ant genus Myrmoteras in the Malay Archipelago (Hymenoptera, Formicidae)."
},
"rights": "No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53 for further explanation.",
"lockedForAutoUpdate": false,
"createdBy": "plazi",
"modifiedBy": "crawler.gbif.org",
"created": "2014-06-28T12:55:54.089+0000",
"modified": "2014-11-25T13:29:20.716+0000",
"contacts": […],
"endpoints": [{
"key": 45389,
"type": "DWC_ARCHIVE",
"url": "http://plazi.cs.umb.edu/GgServer/dwca/23A1465DDF212F7DA589F41341B83FCC.zip",
"createdBy": "plazi",
"modifiedBy": "plazi",
"created": "2014-06-28T12:55:54.604+0000",
"modified": "2014-06-28T12:55:54.604+0000",
"machineTags": []
}],
"machineTags": [...],
"tags": [],
"identifiers": [{
"key": 23594,
"type": "UUID",
"identifier": "23A1465DDF212F7DA589F41341B83FCC",
"createdBy": "plazi",
"created": "2014-06-28T12:55:54.334+0000"
}],
"comments": [],
"bibliographicCitations": [],
"curatorialUnits": [],
"taxonomicCoverages": [],
"geographicCoverages": [],
"temporalCoverages": [],
"keywordCollections": [],
"countryCoverage": [],
"collections": [],
"dataDescriptions": []
}]
}
Entries of interest:
- endOfRecords: if false, increasing offset will return further datasets
- count: total number of available Plazi datasets
- results.endpoints.url: the URL of the DwC-Archive containing the data
- results.identifiers.identifier: the UUID of the dataset
- results.homepage: the URL of an HTML page listing the taxonomic treatments whose data is contained in the DwC-Archive
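Paging through the listing amounts to increasing the offset until endOfRecords is true. The Python sketch below collects the DwC-Archive URLs that way. It assumes that the query parameters are named offset and limit, matching the fields in the JSON response. The page fetcher is passed in as a function so that the loop can be tried with canned data; all function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

ORGANIZATION_KEY = "7ce8aef0-9e92-11dc-8738-b8a03c50a862"
LISTING_URL = f"http://api.gbif.org/v1/organization/{ORGANIZATION_KEY}/publishedDataset"


def fetch_page(offset: int, limit: int) -> dict:
    """Fetch one page of the dataset listing (parameter names are assumed)."""
    query = urllib.parse.urlencode({"offset": offset, "limit": limit})
    with urllib.request.urlopen(f"{LISTING_URL}?{query}", timeout=30) as response:
        return json.load(response)


def archive_urls(get_page=fetch_page, limit: int = 20):
    """Yield the DwC-Archive URL of every dataset, paging until endOfRecords."""
    offset = 0
    while True:
        page = get_page(offset, limit)
        for dataset in page.get("results", []):
            for endpoint in dataset.get("endpoints", []):
                if endpoint.get("type") == "DWC_ARCHIVE":
                    yield endpoint["url"]
        # Stop on the last page, or if a page unexpectedly comes back empty.
        if page.get("endOfRecords", True) or not page.get("results"):
            break
        offset += limit
```

A call such as list(archive_urls()) would contact the GBIF API; passing a custom get_page function avoids that during experiments.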
Appendix: Darwin Core Archive Content
taxa.txt
- http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + “.taxon” for taxon, treatment ID + “.syn” for new junior synonyms
- http://rs.tdwg.org/dwc/terms/namePublishedIn: reference string of original description
- http://rs.tdwg.org/dwc/terms/acceptedNameUsageID: blank, except for new junior synonyms
- http://rs.tdwg.org/dwc/terms/parentNameUsageID: blank
- http://rs.tdwg.org/dwc/terms/originalNameUsageID: blank
- http://rs.tdwg.org/dwc/terms/kingdom: taxon@kingdom
- http://rs.tdwg.org/dwc/terms/phylum: taxon@phylum
- http://rs.tdwg.org/dwc/terms/class: taxon@class
- http://rs.tdwg.org/dwc/terms/order: taxon@order
- http://rs.tdwg.org/dwc/terms/family: taxon@family
- http://rs.tdwg.org/dwc/terms/genus: taxon@genus
- http://rs.tdwg.org/dwc/terms/taxonRank: taxon@rank
- http://rs.tdwg.org/dwc/terms/scientificName: taxon name
- http://rs.tdwg.org/dwc/terms/taxonomicStatus: blank except for new junior synonyms, where “synonym”, “homotypicSynonym” if we have a syntype
- http://rs.tdwg.org/dwc/terms/nomenclaturalStatus: blank
- http://purl.org/dc/terms/references: HTTP URI of treatment
occurrences.txt
- http://rs.tdwg.org/dwc/terms/occurrenceID: treatment UUID + “.mc.” + materials citation ID
- http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + “.taxon”, referencing taxa.txt
- http://rs.tdwg.org/dwc/terms/catalogNumber: mc@specimenCode (explode to one record per specimen code if possible)
- http://rs.tdwg.org/dwc/terms/collectionCode: mc@collectionCode (explode to one record per collection code if possible)
- http://rs.tdwg.org/dwc/terms/institutionCode: blank
- http://rs.tdwg.org/dwc/terms/typeStatus: mc@typeStatus (blank if none given)
- http://rs.gbif.org/terms/1.0/verbatimLabel: mc text
- http://rs.tdwg.org/dwc/terms/sex: mc@sex (also other specimen types like “queen”, “worker”, etc.)
- http://rs.tdwg.org/dwc/terms/individualCount: mc@specimenCount (explode things like “5 workers, 2 females” to one record per typified specimen count if possible)
- http://rs.tdwg.org/dwc/terms/eventDate: mc@collectingDate
- http://rs.tdwg.org/dwc/terms/recordedBy: mc@collectorName
- http://rs.tdwg.org/dwc/terms/recordNumber: blank
- http://rs.tdwg.org/dwc/terms/decimalLatitude: mc@latitude
- http://rs.tdwg.org/dwc/terms/decimalLongitude: mc@longitude
- http://rs.tdwg.org/dwc/terms/minimumElevationInMeters: mc@elevation, or mc@elevationMin if given
- http://rs.tdwg.org/dwc/terms/maximumElevationInMeters: mc@elevationMax if given
- http://rs.tdwg.org/dwc/terms/country: mc@collectingCountry
- http://rs.tdwg.org/dwc/terms/stateProvince: mc@stateProvince or mc@collectingRegion
- http://rs.tdwg.org/dwc/terms/municipality: mc@collectingMunicipality
- http://rs.tdwg.org/dwc/terms/locality: mc@location
- http://purl.org/dc/terms/references: HTTP URI of treatment
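Because every occurrence row carries the taxonID of its core row, the occurrence records can be attached to their taxa with a simple grouping step. The sketch below shows this on records represented as Python dictionaries keyed by short term names (taxonID, occurrenceID); that representation and the sample values, including the materials citation IDs, are assumptions made for illustration.

```python
from collections import defaultdict


def split_occurrence_id(occurrence_id: str) -> tuple:
    """Split '<treatmentUUID>.mc.<materialsCitationID>' into its two parts."""
    treatment_uuid, marker, citation_id = occurrence_id.partition(".mc.")
    if not marker:
        raise ValueError(f"no '.mc.' separator in {occurrence_id!r}")
    return treatment_uuid, citation_id


def group_by_taxon(occurrences: list) -> dict:
    """Group occurrence records by the taxa.txt row they reference."""
    grouped = defaultdict(list)
    for record in occurrences:
        grouped[record["taxonID"]].append(record)
    return dict(grouped)


# Invented sample records; the materials citation IDs are placeholders.
UUID = "8C4CE845A6DEE6FDFD1600A70D5BC71B"
SAMPLE_OCCURRENCES = [
    {"occurrenceID": f"{UUID}.mc.A1", "taxonID": f"{UUID}.taxon", "country": "Malaysia"},
    {"occurrenceID": f"{UUID}.mc.A2", "taxonID": f"{UUID}.taxon", "country": "Indonesia"},
]

grouped = group_by_taxon(SAMPLE_OCCURRENCES)
for taxon_id, records in grouped.items():
    print(taxon_id, [split_occurrence_id(r["occurrenceID"])[1] for r in records])
```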
description.txt
- http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + “.taxon”, referencing taxa.txt
- http://purl.org/dc/terms/type: subSubSection@type
- http://purl.org/dc/terms/description: subSubSection text
- http://purl.org/dc/terms/language: blank (except if we have language detection (might be reusable from spell checker))
- http://purl.org/dc/terms/source: article citation
distribution.txt
- http://rs.tdwg.org/dwc/terms/locationID: treatment UUID + “.” + location UUID
- http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + “.taxon”, referencing taxa.txt
- http://rs.tdwg.org/dwc/terms/country: mc@collectingCountry
- http://rs.tdwg.org/dwc/terms/locality: mc@location
- http://rs.tdwg.org/dwc/terms/occurrenceStatus: mc@typeStatus
media.txt
- http://purl.org/dc/terms/identifier: treatment UUID + “.text”
- http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + “.taxon”, referencing taxa.txt
- http://purl.org/dc/terms/type: purl.org/dc/dcmitype/Text
- http://iptc.org/std/Iptc4xmpExt/1.0/xmlns/CVterm: “http://rs.tdwg.org/ontology/voc/SPMInfoItems#GeneralDescription”
- http://purl.org/dc/terms/format: text/html
- http://purl.org/dc/terms/title: taxon + author + year
- http://purl.org/dc/terms/description: treatment HTML
- http://rs.tdwg.org/dwc/terms/additionalInformationURL: treatment HTTP URI
- http://ns.adobe.com/xap/1.0/rights/UsageTerms: Public Domain
- http://purl.org/dc/terms/rights: No known copyright restrictions apply. See Agosti, D., Egloff, W., 2009. Taxonomic information exchange and copyright: the Plazi approach. BMC Research Notes 2009, 2:53 for further explanation.
- http://ns.adobe.com/xap/1.0/rights/Owner: blank
- http://purl.org/dc/terms/contributor: ((Pensoft|Zootaxa) via )?Plazi
- http://purl.org/dc/terms/creator: author list, semicolon separated
- http://purl.org/dc/terms/bibliographicCitation: bibliographic reference string
references.txt
- http://purl.org/dc/terms/identifier: treatment UUID + “.ref” for article (treatment) reference, cited treatment ID (from treatmentCitation@httpUri) + “.ref” for original description reference
- http://rs.tdwg.org/dwc/terms/taxonID: treatment ID + “.taxon”, referencing taxa.txt
- http://eol.org/schema/reference/publicationType: bibRef@type
- http://eol.org/schema/reference/full_reference: reference text
- http://eol.org/schema/reference/primaryTitle: bibRef@title
- http://purl.org/dc/terms/title: bibRef@journal or bibRef@volumeTitle
- http://purl.org/ontology/bibo/pages: blank
- http://purl.org/ontology/bibo/pageStart: treatment first page
- http://purl.org/ontology/bibo/pageEnd: treatment last page
- http://purl.org/ontology/bibo/journal: bibRef@journal
- http://purl.org/ontology/bibo/volume: bibRef@part
- http://purl.org/dc/terms/publisher: bibRef@publisher
- http://purl.org/ontology/bibo/authorList: bibRef@author, semicolon separated
- http://purl.org/ontology/bibo/editorList: bibRef@editor, semicolon separated
- http://purl.org/dc/terms/created: bibRef@year
- http://purl.org/dc/terms/language: blank
- http://purl.org/ontology/bibo/uri: bibRef@URL, if available
- http://purl.org/ontology/bibo/doi: bibRef@DOI, if available
vernaculars.txt
- http://rs.tdwg.org/dwc/terms/taxonID: treatment UUID + “.taxon”, referencing taxa.txt
- http://purl.org/dc/terms/language: en
- http://rs.tdwg.org/dwc/terms/vernacularName: vernacular name
Further reading
Downloads
Download the description as PDF
Support and Questions
For support and questions, please contact our support.
Version
20150223