Tethys R Client Tutorial
In this tutorial, we will walk through setting up the R Programming Language client for Tethys. The tutorial contains information on setting up the client and how to use it. Data are selected in the R client by specifying the type of data that are desired and specifying text in what we call Tethys Simple Query Language.
Remember that there is frequently more than one way to do the same thing. In addition to writing queries with the R client, you can also use the web client to design queries and save the results as XML to be read by R, or let the web client design a query in Tethys’s native query language XQuery and run that query from the R client. Generally, if you are programming in R it is easiest to use the R client’s data accessor methods (get*) as the client constructs dataframes from the returned XML.
Version 3.2 of the Tethys R client introduced parsing of the returned XML to create dataframes. A named list is now returned in which the name xml contains the XML data and the other names are dataframes with portions of the XML extracted into a tabular format. In previous versions of the R client, the data access functions returned XML and the user was responsible for extracting any tabular data.
Extracting dataframes requires knowledge about the XML that is being returned. When queries provide custom return parameters, there is the potential for the XML to be unsuitable for processing with the dataframe extraction routines contained in the Tethys R Client. If this happens, call the data access function with the parameter dataframes=FALSE and dataframes will not be extracted.
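For example, a minimal sketch (t here is the client instance created in the Setup section below, and the query reuses the Project selection introduced later in the tutorial):

# Skip dataframe extraction and keep only the raw XML
dep <- t$getDeployments('Project = "Aleut"', dataframes=FALSE)
# dep$xml holds the returned XML; no dataframes are created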
Setup
Start an instance of the R interpreter, either on its own or in an integrated development environment (IDE) of your choice (e.g., RStudio, Jupyter Notebook).
As the R client is a new addition to Tethys, we have not yet placed it in an R package. There are several packages that you will need to install if they are not already in your R environment.
# Install packages for communicating with a Tethys server
# and manipulating resultant XML
install.packages(c("xml2", "XML", "stringr", "jsonlite", "parsedate", "httr2", "dplyr"))
To use it, source the tethys.r file from your Tethys RClient folder.
For example, if you had installed Tethys to your home directory on a Windows machine, you might use:
# Update appropriately for your installation of Tethys
source("C:/Users/MyAccount/Documents/Tethys/RClient/tethys.r")
Connecting to a Tethys server
In order to connect, you need several pieces of information:
- The name or IP address of the Tethys server machine. If the server is running on the same machine as the client, you can use localhost.
- The port that the Tethys server is running on if the Tethys administrator changed it from the default (9779).
- Whether or not the server is using encrypted communication. Most Tethys users do not set up encrypted communication as it works best with a signed certificate and this requires additional steps to configure.
We will create an instance of the Tethys client and store the result in a variable (for brevity, we use t throughout this tutorial but suggest using a more meaningful name in your code). The Tethys class is an R reference class, which means that its methods are accessed with a dollar sign separator ($). Here we access the new method which creates an instance of the tethys class. You usually will not need more than a single instance.
Several examples:
> t <- tethys$new()
[1] "Using Tethys server: http://localhost:9779"  # defaults to unencrypted communication on the same machine

> t <- tethys$new("localhost")
[1] "Using Tethys server: http://localhost:9779"  # specify the machine name and default to insecure traffic on the default port

# Secure communication over the port used for secure web traffic.
# We use the keyword names in this example; they are not mandatory.
> t <- tethys$new(server="tethys.ucsd.edu", port=443, secure=TRUE)
[1] "Using Tethys server: https://tethys.ucsd.edu:443"
Once we have created an instance of the tethys class, we access methods by writing the instance variable, a $, and the desired method. The client can be created regardless of whether or not the Tethys server is available. To check whether the server can be reached, use the ping() method.
# The terminology comes from submarine sonar pings, we are essentially listening for a response.
> t$ping()
[1] TRUE  # Hooray.
# FALSE indicates a communication problem or that the server is not running.
The two most common causes for failure to communicate are:
- The server is not running and you will need to start it.
- The firewall rules are preventing communication, see the setup and administration section of the main Tethys manual for help on firewall issues.
Most examples in this tutorial assume that the demonstration database is running. You can either adapt the examples to your own data or make sure that Tethys is serving the demonstration database that is distributed with the software.
Data retrieval: Let’s get started…
We assume that you have made the tethys class accessible and created a new instance of the tethys class with the new method and verified that you can connect to the server as outlined in the previous sections. Remember that for convenience, we refer to this as t.
There are a number of data retrieval methods:
- getDeployments - characteristics of instrument deployments such as location, sample rate, etc.
- getCalibration - instrument calibration data
- getDetectionEffort - determine when and where we were looking for something, e.g., a species
- getDetections - retrieve data on what we found; depending on the type of effort, this may be presence/absence information, individual calls, or measurements of call characteristics.
- getLocalizationEffort - determine when and where efforts to determine position, bearing, tracks, etc., were performed
- getLocalizations - Retrieve the positions, bearings, tracks, etc.
These methods all take up to four arguments. The first argument is the only mandatory argument and the only one that is needed for now. It is called the query and it controls both the data that are selected and the data that are returned.
In order to know what we are going to query, we need to know what types of fields are present in the database. The database is organized by data type, with support for deployments, calibrations, detections, localizations, and ensembles of deployments (ensembles group instrument deployments so they can be treated as a virtual instrument). The rules for how each of these is organized are called a schema, and the schema method will open a schema description in a web browser. Try it now:
> t$schema()  # Don't know what's available? Run without an argument
Specify root or collection name of document type: Deployment, Calibration, Detection, Localization, Ensemble
> t$schema("Deployment")
Each deployment that is added to the database must conform to the rules provided by the table. The table provides a list of hierarchically nested names, a description of the minimum and maximum number of times a name may appear within a single deployment, the type of data that are expected, and a description of the field.
The hierarchy is represented as a slash (/) separated list of names. As an example, in the detections schema, /Detections/Effort/Start indicates that the detection effort started at a specific time. The advanced query view of the Tethys Web Client shows this hierarchy.
The Query
We can select data based on the values of fields. For example, we might wish to select all of the deployments associated with a specific project. In the demonstration database (demodb), there are several deployments associated with the Aleutian islands (Project Aleut). We could select these geospatially but we will use the Project field as it is a simpler query to write:
When using the R client, we recommend placing queries in single quotes. This lets us use double quotes for text values within the query as in the example below. The query language only recognizes values within double quotes as text.
# Text values in the query are enclosed in double quotes.
# We used = to check for equality. Other useful operators: <, <=, >, >=, !=
dep <- t$getDeployments('Project = "Aleut"')
# We also could have put the complete path shown in the schema:
dep <- t$getDeployments('Deployment/Project = "Aleut"')  # same results
This is an example of specifying a set of selection criteria; only records in the database that match the criteria will be selected. Our selection criteria can be more complex as we will see in the next section. In this example, because we did not specify which fields we wanted returned to us, the system provided a set of default fields to return that are generally useful. We can however add a return keyword after the selection criteria, and provide a comma separated list of fields that we wish returned. These fields replace the default set; they do not add to it.
In general, the query syntax looks like the following:
Query syntax:
condition1 and condition2 (...and conditionM) return field1, field2, ... (, fieldN)
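As a concrete sketch of this pattern (the Project and Site values come from the demonstration database shown later in this tutorial; the Site path is written out to avoid ambiguity):

# Two conditions joined by "and", followed by a comma separated return list
dep <- t$getDeployments('Project = "Aleut" and Deployment/Site = "BD" return Deployment/Id, Deployment/Site')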
If the field name in the return list is a record (a name that has subfields), then that entire record is returned. If we wanted to see all of the data associated with the deployment, we could use the top-level record which for deployments is Deployment:
# entire deployment record for each deployment
dep <- t$getDeployments('Project = "Aleut" return Deployment')
The dep variable will contain a named list with two entries:
- xml - Contains the XML returned by Tethys
- deployments - A table with information about the deployment. Tables cannot capture nested structure, so some entries may be incomplete. For example, NOAA SWFSC drifters (Project = “ADRIFT”) in the demonstration database have two channels. Only the first channel will be returned. To access the second channel, the XML would need to be processed.
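A quick way to inspect the result:

# Peek at the named list and the extracted dataframe
names(dep)            # "xml" and "deployments"
head(dep$deployments)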
Specifying selection criteria
We have already seen the equality check (=) in the sample query Project = “Aleut”. Comparison operators are as follows:
operator | interpretation |
---|---|
> | greater than |
>= | greater than or equal to |
< | less than |
<= | less than or equal to |
!= | not equal to |
= | equality |
Values are typically text enclosed in double quotes, or numerical values which are not quoted. Suppose we wanted to find instruments deployed between 117 and 120 degrees west. Longitudes are always stored in degrees east [0, 360) in Tethys, and we can use a tethys class utility method to do the conversion:
# Latitudes are in degrees north, longitudes in degrees east. normalizeDegrees will convert
# degrees north, south, east or west to the proper range.
> t$normalizeDegrees(117, 'w')     # can use first letter if desired
[1] 243
> t$normalizeDegrees(120, 'West')  # case independent
[1] 240
Let’s try querying and specify that we would like the DeploymentDetails returned:
dep <- t$getDeployments('Longitude < 243 and Longitude > 240 return DeploymentDetails')
That obviously didn’t work. The error message contained the following text (along with other information):
REST.Resources.AmbiguousElement: Unable to uniquely identify Longitude in the schema. Did you mean:
- Deployment/Data/Tracks/Track/Point/Longitude
- Deployment/DeploymentDetails/Longitude
- Deployment/RecoveryDetails/Longitude
We did not provide enough information to disambiguate the field we wanted. We can either use the whole path, Deployment/DeploymentDetails/Longitude, or just enough of it to distinguish it from the others: DeploymentDetails/Longitude. Note that any name that you use in your path must be complete; you may not use partial text such as "mentDetails/Longitude."
dep <- t$getDeployments('DeploymentDetails/Longitude < 243 and DeploymentDetails/Longitude > 240 return DeploymentDetails')
Once we fix the above error (peek above in “What happened?” if you need to see how to do it correctly), we should have a variable dep with two fields: dep$xml and dep$deployments. Let’s look at what’s in dep$xml. We will use R’s cat and substr functions to print out only the beginning of it.
# If we had used dataframes=FALSE in the query, the XML would
# not have been flattened:
# <Result>
# <Deployment>
# <DeploymentDetails>
# <Longitude> ... </Longitude>
# <Latitude> ... </Latitude>
# ...
# </DeploymentDetails>
# </Deployment>
# ...
# </Result>
#
# When t$getDeployments has dataframes=TRUE (default), the resultant XML
# may be transformed to make it easier to convert to a dataframe.
cat(substr(dep$xml, 1, 1180)) # first couple of records...
<Result>
<Deployment>
<DeploymentDetails.Longitude>240.83015</DeploymentDetails.Longitude>
<DeploymentDetails.Latitude>32.84215</DeploymentDetails.Latitude>
<DeploymentDetails.ElevationInstrument_m>-993</DeploymentDetails.ElevationInstrument_m>
<DeploymentDetails.DepthInstrument_m>993</DeploymentDetails.DepthInstrument_m>
<DeploymentDetails.TimeStamp>2012-03-24T11:32:00Z</DeploymentDetails.TimeStamp>
<DeploymentDetails.AudioTimeStamp>2012-03-25T00:00:00Z</DeploymentDetails.AudioTimeStamp>
<DeploymentDetails.Vessel>Outer Limits</DeploymentDetails.Vessel>
</Deployment>
<Deployment>
<DeploymentDetails.Longitude>240.01139999999998</DeploymentDetails.Longitude>
<DeploymentDetails.Latitude>34.14008333333334</DeploymentDetails.Latitude>
<DeploymentDetails.ElevationInstrument_m>-238</DeploymentDetails.ElevationInstrument_m>
<DeploymentDetails.DepthInstrument_m>238</DeploymentDetails.DepthInstrument_m>
<DeploymentDetails.TimeStamp>2008-07-24T16:16:08Z</DeploymentDetails.TimeStamp>
<DeploymentDetails.AudioTimeStamp>2008-07-24T19:00:00Z</DeploymentDetails.AudioTimeStamp>
</Deployment>
...
Variable dep$deployments is an example of a dataframe that was automatically extracted by getDeployments(). Most data access functions in the R Client will create dataframes for you, but we’ll learn more about how to select things before we worry about analyzing them.
Conditions and sequences
We can specify as many conditions as we want, but they must all be joined by the term and. This may seem a bit strange if you are used to querying databases, where it is common to represent choices using an or keyword. Choices for a single field can be represented by sequences. A sequence is a list of terms in parentheses separated by commas. Here is an example of selecting all deployments associated with Project Aleut or Project GofAK.
> aleut_or_alaska <- t$getDeployments('Project = ("Aleut", "GofAK") return Deployment/Id, Deployment/Project, Deployment/Site')
> names(aleut_or_alaska) # show names of list items
1] "xml" "deployments"
[> aleut_or_alaska$deployments
Id Project Site1 ALEUT01-KS Aleut KS
2 ALEUT02-BD Aleut BD
3 ALEUT03-BD Aleut BD
4 GOFAK_CA01 GofAK CA
5 GOFAK_CB01 GofAK CB
6 GOFAK_PT01 GofAK PT
...
31 GOFAK_CB_09 GofAK CB
32 GOFAK_CB_10 GofAK CB
33 GOFAK_KOA_01 GofAK KOA
34 GOFAK_CB_11 GofAK CB
The simple query language designed for Tethys does not permit or operations between different fields. In most cases, this is not a problem. Should you have an application that requires this, you will need to write the query in XQuery, which is described in the Tethys manual and is an advanced topic. Fortunately, you can generate an XQuery either with the web client or by adding plan=2 to your R Tethys query. Adding plan=2 will return the XQuery that corresponds to your query string, and is a good starting place for designing your own queries should you need more advanced selection criteria.
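A sketch of requesting the query plan is shown below; the exact way plan=2 is supplied (shown here as a named argument to the retrieval method) is an assumption, so check the R client manual for the precise form.

# Assumed form: plan=2 passed as a named argument to a retrieval method
plan <- t$getDeployments('Project = "Aleut"', plan=2)
# plan should contain the XQuery corresponding to the query string,
# which can be edited and run directly against the server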
Specifying taxa
By default, the R client expects specification of taxa as Latin species names. When producing results from a query, taxonomic ranks are also specified in Latin. Both the query input and output may be customized to one of the following types:
- tsn - A taxonomic serial number as specified by the Integrated Taxonomic Information System (ITIS). We add to ITIS by using -10 for other phenomena such as geophony. For phenomena generated by humans (anthrophony), use Homo sapiens (TSN 180092).
- Latin - a Latin taxonomic rank.
- Vernacular - A common name for a taxonomic rank stored in ITIS.
- Abbrev - A custom abbreviation list that has been stored to the Tethys server’s Abbreviations collection
Several functions allow one to set the taxonomic specification for the query input and the resultant output.
- t$setSpeciesInput(type, arg),
- t$setSpeciesOutput(type, arg), and
- t$setSpeciesInputOutput(type, arg)
These allow you to set the query input and output codings. Methods setSpeciesInput and setSpeciesOutput specify how queries search for taxa and how they are represented in the returned data respectively. Method t$setSpeciesInputOutput sets the input and output to the same coding system.
Type and arg for all of these methods must be as follows:
Type | Argument | Interpretation |
---|---|---|
tsn | | Ranks are encoded as TSNs, e.g., to select genus Kogia one would use SpeciesId = 180490 |
Latin | | Ranks use Latin names, e.g., genus SpeciesId = "Kogia" |
Vernacular | English, French, Spanish | Rank must correspond to a local language common name stored in ITIS. Not all ranks have vernacular entries and we default to Latin when one is not present |
Abbrev | abbreviation name | Abbreviations specified in a SpeciesAbbreviations map. For example, in the NOAA.NMFS.v1 abbreviation set, Kssp is used for Kogia |
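For example, a minimal sketch that switches both input and output to the NOAA.NMFS.v1 abbreviation map mentioned above:

# Use abbreviations for both query input and returned taxa
t$setSpeciesInputOutput("Abbrev", "NOAA.NMFS.v1")
# Queries can now use abbreviations, e.g., SpeciesId = "Kssp" for genus Kogia,
# and returned SpeciesId values will use the same abbreviations.
# Return to Latin names (the argument is unused for Latin, shown as "" by getSpeciesIO):
t$setSpeciesInputOutput("Latin", "")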
Method getSpeciesIO() returns the current input and output codings:
> t$getSpeciesIO()
    input inputarg   output outputarg
  "Latin"       ""  "Latin"        ""

> t$setSpeciesOutput("Vernacular", "Spanish")
> t$getSpeciesIO()
    input inputarg       output outputarg
  "Latin"       "" "Vernacular" "Spanish"
Species Abbreviations
New species abbreviation sets can be added by creating an XML document and adding it to the SpeciesAbbreviations collection. See the web client’s Data Import manual for details on importing data. The easiest way to create a new abbreviation map is to use the Tethys Web Client to access an existing abbreviation map, modify it, and submit the new map.
Method t$getSpeciesAbbreviations returns a dataframe with the names of all available abbreviations maps.
> t$getSpeciesAbbreviations()
Name
1 NOAA.NMFS.v1
2 NOAA.NMFS.v2
3 NOAA.NMFS.v3
4 NOAA.NMFS.v4
5 SIO.SWAL.v1
When called with an existing name, it returns a dataframe showing Latin names, their TSNs, and the abbreviation.
> t$getSpeciesAbbreviations("SIO.SWAL.v1")
completename completename.Group tsn coding
1 Balaenoptera borealis <NA> 180526 Bb
2 Balaenoptera brydei <NA> 612597 Be
3 Balaenoptera Balaenoptera brydei GofMX 180523 BeGofMX
4 Balaenoptera musculus <NA> 180528 Bm
5 Balaenoptera physalus <NA> 180527 Bp
6 Eschrichtius robustus <NA> 180521 Er
7 Eubalaena japonica <NA> 612591 Ej
...
41 Hyperoodontidae BW40 770799 BW40
42 Hyperoodontidae BW43 770799 BW43
43 Hyperoodontidae BW40 770799 BW50
44 Hyperoodontidae BWG 770799 BWG
45 Hyperoodontidae BWG 770799 BW53
46 Mesoplodon stejnegeri <NA> 180514 BW65
47 Hyperoodontidae BW70 770799 BW70
48 Hyperoodontidae <NA> 770799 UBW
49 Caniformia <NA> 552303 UP
50 Cetacea <NA> 180403 UMM
51 Homo sapiens <NA> 180092 Anthro
52 Actinopterygii <NA> 161061 UF
53 Other phenomena <NA> -10 Other
The above abbreviation table shows how the completename/@Group attribute (shown in the table as completename.Group) can be used to provide additional information about a taxonomic rank. In this case, it is used to specify echolocation characteristics from members of the family of beaked whales that we cannot yet associate to a specific species. Other uses for the @Group attribute include population characteristics such as stock structure.
Queries that produce dataframes
When the user has not specified dataframes=FALSE in a query method call, most Tethys Simple Query Language queries will produce dataframes in addition to the XML field xml. The fields associated with the return value of any deployment can be found using function names() on the result variable, and are summarized here.
getDeployments
- deployments - dataframe describing deployments. Only one channel of multichannel deployments will be displayed.
getCalibration
- calibrations - dataframe containing matching calibrations.
getDetectionEffort
- effort - dataframe indicating an identifier for each effort (Id) and the deployment or ensemble with which it was associated as well as the Start and End of the effort.
- kinds - dataframe indicating what we were looking for (taxonomic rank, granularity, call, etc.).
getDetections
- detections - dataframe with information about individual detections. Always includes Start, SpeciesId, and DetectionsGroup. DetectionsGroup is a number that indicates which detections came from the same effort. If getDetectionEffort is executed using the same criteria, the DetectionsGroup field provides a many to one match with the DetectionsGroup field in the effort and allows one to identify the source of any detection.
getLocalizationEffort
- effort - Information about the localization effort. Subtype and Name specify the coordinate reference system with other fields providing parameters related to it. LocalizationType indicates the type of information being localized, and Dimension indicates the maximal number of dimensions that may be produced (it is possible to produce information with fewer dimensions).
getLocalizations
- localizations - Dataframe with one row per localization entry. Fields vary depending on the type of localizations that have been produced.
- localization_groups - Specifies the localization group Id. localizations['LocalizeGroup'] contains indices into this table.
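As a brief illustration of working with these dataframes (a sketch reusing the blue whale detections queried elsewhere in this tutorial):

# Detections with default dataframe extraction (dataframes=TRUE)
det <- t$getDetections('SpeciesId = "Balaenoptera musculus"')
names(det)             # "xml" plus the dataframes described above, e.g., "detections"
head(det$detections)   # includes Start, SpeciesId, and DetectionsGroup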
Manipulating XML (Advanced topic)
Functions for extracting XML only need be used when the data access functions are not capable of extracting desired information. Many users should never need to manipulate XML directly. This section teaches some of the basic available tools should you need to do so.
Extracting data
The xml2 library is a good tool for extracting data and is documented in the R Tethys Client manual (this document is the R Tethys client tutorial). When possible, you may find it easier to convert the XML to a dataframe which is described in the next section of this tutorial.
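As a minimal sketch (assuming dep$xml holds the unflattened /Result/Deployment/DeploymentDetails structure produced by the dataframes=FALSE longitude query used below):

# Pull longitude values out of the raw XML with xml2
library(xml2)
doc <- read_xml(dep$xml)
nodes <- xml_find_all(doc, "/Result/Deployment/DeploymentDetails/Longitude")
longitudes <- as.numeric(xml_text(nodes))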
Creating dataframes
In most cases, you will not need to create dataframes from XML as the data retrieval functions do this for you unless you specify dataframes=FALSE. You need not read this section unless:
- You wish to extract data from the XML that has not been captured in the dataframes.
- You have modified the query to the point that the XML structure is different enough that the automated dataframe extraction routines can no longer parse the XML.
Let us return to our earlier longitude query. After we fixed the error in our initial try, we saw that all of the data are in record Result and that information associated with each deployment is grouped with the name Deployment. If we inspect the data that are returned, we see that the path /Result/Deployment/DeploymentDetails navigates us to a list of information about when we put our instruments in the field. Let’s put this in an R dataframe using the xml2df method:
dep <- t$getDeployments('DeploymentDetails/Longitude < 243 and DeploymentDetails/Longitude > 240 return DeploymentDetails', dataframes=FALSE)
# We write a forward slash (/) separated list of fields (a path)
# listing the names from the top of the document to the fields
# that we want to extract into a dataframe.
df <- t$xml2df(dep$xml, "/Result/Deployment/DeploymentDetails")
This will create dataframe df where each row has columns that correspond to what is in DeploymentDetails: Longitude, Latitude, ElevationInstrument_m, etc. Sometimes this is exactly what we need, and the contents seem reasonable, but there are two issues with this approach:
- If there are multiple levels of fields under DeploymentDetails, all subchildren are collapsed into a single column. For example, the field ResponsibleParty will contain Scripps Whale Acoustics Lab, but the schema lets other fields be present such as phone and address, and these will all be present in the ResponsibleParty field. We will address this later.
- Because we only returned the DeploymentDetails, we do not know the deployment Id, a field that uniquely identifies the record, nor other fields that might be useful such as the name of the site where we deployed.
Nested records
In the last section, we saw how to create an R dataframe. However, we queried a small section of the values that were returned and lost a fair amount of the information. Suppose we wanted to create a datatable that captured a deployment identifier, the project, site, and information about deployment and recovery. We might do something like this:
> dep <- t$getDeployments('return Id, Project, Deployment/Site, DeploymentDetails, RecoveryDetails', dataframes=FALSE)
> cat(dep$xml)
<Result>
<Deployment>
<Id>SOCAL46-H</Id>
<Project>SOCAL</Project>
<Site>H</Site>
<DeploymentDetails>
<Longitude>240.83015</Longitude>
<Latitude>32.84215</Latitude>
<ElevationInstrument_m>-993</ElevationInstrument_m>
<DepthInstrument_m>993</DepthInstrument_m>
<TimeStamp>2012-03-24T11:32:00Z</TimeStamp>
<AudioTimeStamp>2012-03-25T00:00:00Z</AudioTimeStamp>
<Vessel>Outer Limits</Vessel>
</DeploymentDetails>
<RecoveryDetails>
<Longitude>240.83015</Longitude>
<Latitude>32.84215</Latitude>
<TimeStamp>2012-07-21T12:25:15Z</TimeStamp>
<AudioTimeStamp>2012-07-21T12:25:15Z</AudioTimeStamp>
<Vessel>Horizon</Vessel>
<ResponsibleParty>
<organizationName>Scripps Marine Bioacoustics Collaborative</organizationName>
</ResponsibleParty>
</RecoveryDetails>
</Deployment>
<Deployment>
<Id>ALEUT01-KS</Id>
<Project>Aleut</Project>
<Site>KS</Site>
<DeploymentDetails>
...
</DeploymentDetails>
<RecoveryDetails>
...
</RecoveryDetails>
</Deployment>
...
</Result>
Our first instinct might be to try
f <- t$xml2df(dep$xml, "/Result/Deployment")
but as DeploymentDetails and RecoveryDetails are nested records, xml2df will squish all of their data into two fields DeploymentDetails and RecoveryDetails. This is not what we want.
Fortunately, we can use a stylesheet written in XSLT (a tool that can transform XML). This stylesheet will remove nesting so things look like a big table. A problem is that some of the field names are the same. For example, both DeploymentDetails and RecoveryDetails have a Longitude field. We will use a stylesheet to rename the fields to include the parent’s name, e.g., DeploymentDetails.Longitude and RecoveryDetails.Longitude. The stylesheet can be provided as an argument to getDeployments or any of the other query methods that return XML.
Most of us don’t know how to write XSLT stylesheets, but a stylesheet to flatten, or remove hierarchy, is pretty nifty for converting our data to a datatable. Consequently, the stylesheet argument takes one of two values:
- text containing a valid XSLT specification (see Jeni Tenison’s XSLT tutorials and books if you are interested in learning XSLT), or
- a named vector that specifies a desired operation and the path within the XML to the field from which we take each datatable row.
We will focus on the named vector that generates the XSLT stylesheet for us. The vector must have operation and path values, and the stylesheet to flatten a hierarchy also requires a parents value. For example: list(operation='flatten', path='/Result/Deployment', parents=2). This tells us that any fields nested deeper than Deployment will be collapsed, or flattened. The parents argument tells us how many names up the hierarchy should be included in the new name. The deepest field that is under Deployment in the XML above is DeploymentDetails/ResponsibleParty/organizationName. With parents=2, this is rewritten as a single field whose name consists of the field and two levels of parents with dots between them: DeploymentDetails.ResponsibleParty.organizationName. We could have used parents=1, but this would map the organizationName under DeploymentDetails and RecoveryDetails to the same name: ResponsibleParty.organizationName.
# Use a stylesheet to flatten the hierarchy
> dep <- t$getDeployments('return Id, Project, Deployment/Site, DeploymentDetails, RecoveryDetails', stylesheet=list(operation='flatten', path='/Result/Deployment', parents=2), dataframes=FALSE)
# Create a dataframe
> df <- t$xml2df(dep$xml, "/Result/Deployment")
> names(df) # show column names
1] "Id" "Project"
[3] "Site" "DeploymentDetails.Longitude"
[5] "DeploymentDetails.Latitude" "DeploymentDetails.ElevationInstrument_m"
[7] "DeploymentDetails.DepthInstrument_m" "DeploymentDetails.TimeStamp"
[9] "DeploymentDetails.AudioTimeStamp" "DeploymentDetails.Vessel"
[11] "RecoveryDetails.Longitude" "RecoveryDetails.Latitude"
[13] "RecoveryDetails.TimeStamp" "RecoveryDetails.AudioTimeStamp"
[15] "RecoveryDetails.Vessel" "RecoveryDetails.ResponsibleParty.organizationName"
[17] "DeploymentDetails.Elevation_m" "DeploymentDetails.ResponsibleParty.organizationName"
[19] "RecoveryDetails.Elevation_m" [
Sometimes there is more than one instance of a record. For instance, when querying detection effort, each detection effort can list effort for multiple species.
Repeated records
Records in the database may be repeated. Let us consider a query to ask when and where we looked for blue whale presence/absence in 60 min periods. Effort for specific taxa is defined in the Detections/Effort/Kind records. Specifically, Detections/Effort/Kind/SpeciesId contains the taxon we are searching for and Detections/Effort/Kind/Granularity indicates what type of resolution we have for the effort. Valid values are:
- binned - Detections indicate presence/absence within Granularity/@BinSize_min minute ranges (e.g., every 60 min would only create one detection record per hour). If you are producing binned detections, we recommend the Start and End be set to correspond to the bin period as this makes analysis easy. You will see in the sample database that some people choose to interpret this differently and provide the time associated with a specific call within the bin period.
- encounter - Detections report the Start and End times of an acoustic encounter (bout of call detections). When present, the optional Granularity/@EncounterGap_min specifies the minimum amount of time required between two calls before they are considered to be in different encounters.
- call - Detections represent individual calls.
- group - Detections represent a grouping of calls, such as song or a click train.
Note that specifying criteria for optional fields such as @BinSize_min or @EncounterGap_min will result in records without the optional value being skipped. Let’s query for the effort for binned blue whale calls with 60 min bins.
> det <- t$getDetectionEffort('Granularity = "binned" and @BinSize_min = 60 and SpeciesId = "Balaenoptera musculus"', dataframes=FALSE)
> cat(det$xml)
<Result>
<Detections>
<Id>SOCAL31M_LF_logs_jsb</Id>
<DeploymentId>SOCAL31-M</DeploymentId>
<Start>2009-01-13T06:00:00Z</Start>
<End>2009-03-08T11:41:26.496000Z</End>
<Kind>
<SpeciesId>Balaenoptera musculus</SpeciesId>
<Call>A</Call>
<Granularity BinSize_min="60">binned</Granularity>
</Kind>
<Kind>
<SpeciesId>Balaenoptera musculus</SpeciesId>
<Call>B</Call>
<Granularity BinSize_min="60">binned</Granularity>
</Kind>
<Kind>
<SpeciesId>Balaenoptera musculus</SpeciesId>
<Call>D</Call>
<Granularity BinSize_min="60">binned</Granularity>
</Kind>
</Detections>
<Detections>
...</Detections>
...</Result>
We could use xml2df to extract the effort for each species: t$xml2df(xml, "/Result/Detections/Kind"), but our table would only contain information about the taxon, call type, and granularity. Information about when we were looking and which deployment this was associated with would be lost. We can use another stylesheet function, copy_parents, to make copies of the parents of Kind into the Kind records. copy_parents takes a single argument, the path to the node that we want rewritten: /Result/Detections/Kind.
eff <- t$getDetectionEffort('Granularity = "binned" and @BinSize_min = 60 and SpeciesId = "Balaenoptera musculus"', stylesheet=list(operation="copy_parents", path="/Result/Detections/Kind"), dataframes=FALSE)
This will copy Id, DeploymentId, Start, and End into the Kind records and we can then produce a dataframe that has the information that we need.
effort <- t$xml2df(eff$xml, "/Result/Detections/Kind")
Finally, we might choose to get the detections themselves:
det <- t$getDetections('Granularity = "binned" and @BinSize_min = 60 and SpeciesId = "Balaenoptera musculus"', dataframes=FALSE)
# xml has a list of Detection records under /Result/Detections. Default return lists Start, End, Call and SpeciesId
# some fields may be absent, e.g. some detection methods do not list Call or End.
detections <- t$xml2df(det$xml, "/Result/Detections/Detection")
Stylesheet transformation recap
Using xml2df to create R dataframes allows working with Tethys data in a format that is familiar to R users. However, due to the nested nature of the data that are stored in XML, the data that are returned may not always be in a format that is amenable to capturing what is needed in a dataframe. XSLT stylesheets can be applied to any of the functions that retrieve data using the Tethys Simple Query Language.
These stylesheets provide a way to transform XML into something suitable for a dataframe. As most users do not wish to learn to write a stylesheet, named lists can be used to generate stylesheets automatically for commonly needed paradigms. Every stylesheet generator requires the following names:
- operation - specifies what to do
- path - Slash (/) separated list of field names leading to a target field
Currently, the defined operations are:
- enumerate_ancestor: Inserts new elements at the specified target path that number an ancestor. This allows us to track where something comes from in a nested structure. For example, detection effort can specify effort for multiple Kind elements indicating that we are looking for multiple taxa or calls. If the query result spans multiple sets of detections, extracting /Detections/Effort/Kind into a datatable would lose information about which detection set the kinds were associated with. By setting the path to "Kind" and the ancestor value to "Detections", we would add "DetectionsGroup" to each Kind with a number indicating that this was the n’th set of Detections. A sketch follows this list.
- flatten: Removes nesting structure by flattening all children of the target path. Names of fields may be transformed to include the names of parent fields. Adding the name parents specifies how many parent fields should be included in the new name. Flatten can also enumerate ancestors using the ancestor argument.
- copy_parents: For a given target, copy the immediate parent fields into the target. This is especially useful when a particular field is repeated and has parent information that needs to be retained in a dataframe. copy_parents is currently limited to copying only the nodes one level above the target, although this limitation may be removed in the future.
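For example, a sketch of the enumerate_ancestor generator described in the first bullet (the operation and option names follow that description; verify them against the R client manual):

# Number the parent Detections of each Kind so Kind rows can be traced
# back to the detection set they came from
eff <- t$getDetectionEffort('SpeciesId = "Balaenoptera musculus"',
                            stylesheet=list(operation="enumerate_ancestor",
                                            path="Kind", ancestor="Detections"),
                            dataframes=FALSE)
kinds <- t$xml2df(eff$xml, "/Result/Detections/Kind")
# kinds$DetectionsGroup indicates the n'th Detections record for each Kind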
Querying multiple types of data
For many of the queries, it is possible to include criteria from other types of data. When querying for localizations or detections, if a condition is not found in the localization or detection schemata, the condition is searched for in the deployment schema and matching records are joined together.
Here is an example where we look for Pacific white-sided dolphins in recordings that had sample rates above 180 kHz.
det <- t$getDetections('SpeciesId = "Lagenorhynchus obliquidens" and SampleRate_kHz > 180 return Deployment/Id, SampleRate_kHz, Detection', dataframes=FALSE)
The XML returned from this query is very similar to what we have seen in the past except that it includes things from both detections and deployments. Whenever records are joined like this, the query system will automatically insert a field called Record that pairs matching records together. In general, the XML will look something like this:
<Result>
<Record>
<Detections>
<DeploymentId>SOCAL38-M</DeploymentId>
<Detection>
... many fields omitted ...
<Start>2010-05-20T05:29:02.499999Z</Start>
<End>2010-05-20T06:14:22.500000Z</End>
<SpeciesId>Lagenorhynchus obliquidens</SpeciesId>
<Call>Clicks</Call>
</Detection>
... more <Detection> records ...
</Detections>
<Deployment>
<Id>SOCAL38-M</Id>
<Regimen>
<SampleRate_kHz>200</SampleRate_kHz>
</Regimen>
</Deployment>
</Record>
<Record>
<Detections> ... </Detections>
<Deployment> ... </Deployment>
</Record>
... other records ...
</Result>
Extraction of data is similar to other scenarios, with the caveat below.
If you are attempting to use stylesheets generated by enumerate_ancestors (or flatten with an ancestor argument), it is possible to get into trouble here. Normally, one might use the following:
list(operation="enumerate_ancestors", path="Detection", ancestor="Detections")
The issue here is that Detections is the first child of every Record, so the DetectionsGroup inserted in /Result/Record/Detection will always have a value of 1. This can be addressed by providing a vector of ancestors. The ancestor counting mechanism will use the position of the first element in the ancestor list that it finds in the ancestors of Detection, but will always generate an element name from the last element in the ancestor list.
Using ancestor=c("Record", "Detections") will create Detection/DetectionsGroup entries using the position of Record. If Record did not exist, we would use the position of Detections, so this will work regardless of whether or not a Record element was inserted.
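Putting this together as a sketch (reusing the joined detections/deployment query from this section):

# Enumerate ancestors with a vector so numbering follows Record when present
det <- t$getDetections('SpeciesId = "Lagenorhynchus obliquidens" and SampleRate_kHz > 180 return Deployment/Id, SampleRate_kHz, Detection',
                       stylesheet=list(operation="enumerate_ancestors",
                                       path="Detection",
                                       ancestor=c("Record", "Detections")),
                       dataframes=FALSE)
# Each Detection gains a DetectionsGroup element numbered by its enclosing Record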