Generating NAs using the Species List

Introduction

The aim of this document is to demonstrate how the Species List table (SL) of the RDBES can be used to complement the Sample table (SA) with NAs in cases where, e.g., a species was not meant to be looked for. This task is made easy by the function generateNAsUsingSL, available in the RDBEScore package.

Load the package

library(RDBEScore)

Load and validate example data

# read an example dataset and simplify it to 1 trip and 1 haul [dev note: this section needs to be reworked when data and filterRDBESDataObject are updated]
data(Pckg_survey_apistrat_H1)
myH1DataObject1 <- Pckg_survey_apistrat_H1
# keep only the species list that belongs to this example dataset
myH1DataObject1$SL<-myH1DataObject1$SL[grepl("Pckg_survey_apistrat_H1", myH1DataObject1$SL$SLspeclistName),]
#myH1DataObject1<-filterAndTidyRDBESDataObject(myH1DataObject1, fieldsToFilter="FOid",valuesToFilter=70849, killOrphans = TRUE)
myH1DataObject1<-filterRDBESDataObject(myH1DataObject1, fieldsToFilter="SSid",valuesToFilter=227694, killOrphans = TRUE)
# check it is a valid RDBESobject
validateRDBESDataObject(myH1DataObject1, checkDataTypes = TRUE)

A closer look at the example data and its characteristics

The example uses data in hierarchy 1. It contains a single trip with a single haul. For simplicity, we restrict our analysis to the tables SL, SS and SA, which are the ones handled by the function whose behaviour we want to demonstrate.

Examining a print of the Species List table (SL), one can conclude that the sampling targeted the landings of only one species, Nephrops norvegicus (aphiaId 107254).

myH1DataObject1[c("SL")]
#> $SL
#> Key: <SLid>
#>     SLid SLrecType  SLcou SLinst                             SLspeclistName
#>    <int>    <char> <char> <char>                                     <char>
#> 1: 47891        SL     ZW   4484 WGRDBES-EST_TEST_1_Pckg_survey_apistrat_H1
#>    SLyear SLcatchFrac SLcommTaxon SLsppCode
#>     <int>      <char>       <int>     <int>
#> 1:   1965         Lan      107254    107254

Examining a print of the Species Selection table (SS), one can confirm that only one fishing operation is present in the data (FOid 70849) and that landings were indeed sampled from it (for simplicity only a subset of columns is printed).

myH1DataObject1[[c("SS")]][,1:15]
#> Key: <SSid>
#>      SSid  LEid  FOid  TEid  FTid  SLid  OSid SSrecType SSseqNum
#>     <int> <int> <int> <int> <int> <int> <int>    <char>    <int>
#> 1: 227694    NA 70849    NA    NA 47891    NA        SS        1
#>    SSstratification SSstratumName SSclustering SSclusterName SSobsActTyp
#>              <char>        <char>       <char>        <char>      <char>
#> 1:                N             U            N             U        Sort
#>    SScatchFra
#>        <char>
#> 1:        Lan

Given the above, it is expected that if Nephrops norvegicus was sampled, it will appear in the RDBES Sample table (SA). One can confirm this by printing that table (for simplicity only a subset of columns is printed).

myH1DataObject1[[c("SA")]][,c(1:9,48:49)]
#> Key: <SAid>
#>      SAid   SSid  LEid SArecType SAseqNum SAparSequNum SAstratification
#>     <num>  <int> <int>    <char>    <num>        <num>           <char>
#> 1: 572813 227694    NA        SA        1           NA                N
#>    SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#>           <char>    <char>        <int>       <int>
#> 1:             U    107254          276         276

Generating NAs for species not looked for

Suppose we want to consult the data to produce an estimate for, e.g., cod (aphiaId 126436). That species was not targeted by the sampling programme and it is impossible to infer from the data whether or not it was present alongside Nephrops norvegicus during the sampling. The total weight measured (SAtotalWtMes) of cod should therefore be considered missing (NA).

The function generateNAsUsingSL adds a row with NA weights for that species (again for convenience, only a few columns of the SA table are printed).

myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1, targetAphiaId = c("126436"))
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> Key: <SAid>
#>      SAid   SSid  LEid SArecType SAseqNum SAparSequNum SAstratification
#>     <num>  <int> <int>    <char>    <num>        <num>           <char>
#> 1: 572813 227694    NA        SA    1.000           NA                N
#> 2: 572813 227694    NA        SA    1.001           NA                N
#>    SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#>           <char>    <char>        <int>       <int>
#> 1:             U    107254          276         276
#> 2:             U    126436           NA          NA

Note that the new rows have floating-point values for SAid and SAseqNum (we use sprintf to ensure the decimal places are displayed). This facilitates the ordering of the samples and prevents overlaps when different datasets are joined. In addition, a SAunitName is created for the new row; it builds on the SAid and makes the row more readily identifiable.

sprintf(myH1DataObject1updte[['SA']]$SAid, fmt = '%.3f')
#> [1] "572813.000" "572813.001"
sprintf(myH1DataObject1updte[['SA']]$SAseqNum, fmt = '%.3f')
#> [1] "1.000" "1.001"
print(myH1DataObject1updte[['SA']]$SAunitName)
#> [1] "1"                "NAgen_572813.001"
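The naming convention visible above can be used to pick out generated rows later on. The following is a minimal sketch on a toy data frame, not on a real RDBES object. It assumes that every generated row carries the "NAgen_" prefix in SAunitName and that the integer part of its SAid equals the SAid of the original sample, as is the case in the printout; neither assumption is confirmed here for other datasets.

```r
# Toy stand-in for a few SA columns (values copied from the printout above)
sa <- data.frame(
  SAid = c(572813, 572813.001),
  SAunitName = c("1", "NAgen_572813.001"),
  SAspeCode = c("107254", "126436"),
  stringsAsFactors = FALSE
)

# Assumption: generated rows are flagged by the "NAgen_" prefix
isGenerated <- startsWith(sa$SAunitName, "NAgen_")
generatedRows <- sa[isGenerated, ]

# Assumption: dropping the decimals recovers the SAid of the original sample
parentIds <- floor(generatedRows$SAid)
```

Here isGenerated flags only the second row and parentIds points back to sample 572813.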

Note that the argument targetAphiaId of generateNAsUsingSL also accepts a vector with several aphiaIds, which allows NAs to be generated for multiple species in one go. In the example below Pandalus borealis (aphiaId 107649) is added to the call.

myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1, targetAphiaId = c("126436","107649"))
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> Key: <SAid>
#>      SAid   SSid  LEid SArecType SAseqNum SAparSequNum SAstratification
#>     <num>  <int> <int>    <char>    <num>        <num>           <char>
#> 1: 572813 227694    NA        SA    1.000           NA                N
#> 2: 572813 227694    NA        SA    1.001           NA                N
#> 3: 572813 227694    NA        SA    1.002           NA                N
#>    SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#>           <char>    <char>        <int>       <int>
#> 1:             U    107254          276         276
#> 2:             U    126436           NA          NA
#> 3:             U    107649           NA          NA
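The behaviour shown in this section can be mimicked on toy data. The sketch below is only an illustration of the idea and is not the implementation used in RDBEScore: the helper addNaRows, the toy tables and the way the new SAid is built are all hypothetical. It appends one row with an NA weight for each requested species that is neither on the species list nor already in the samples.

```r
# Toy stand-ins for the SL and SA tables (not the RDBES format)
speciesList <- c("107254")
samples <- data.frame(
  SAid = 572813,
  SAspeCode = "107254",
  SAtotalWtMes = 276,
  stringsAsFactors = FALSE
)

# Hypothetical helper: append an NA row for each target species that is
# neither on the species list nor already present in the samples
addNaRows <- function(samples, speciesList, targets) {
  missingSpp <- setdiff(targets, c(speciesList, samples$SAspeCode))
  if (length(missingSpp) == 0) return(samples)
  newRows <- data.frame(
    # decimal suffix keeps the new ids next to the original sample
    SAid = max(samples$SAid) + seq_along(missingSpp) / 1000,
    SAspeCode = missingSpp,
    SAtotalWtMes = NA_real_,
    stringsAsFactors = FALSE
  )
  rbind(samples, newRows)
}

toyResult <- addNaRows(samples, speciesList, c("126436", "107649"))
```

In this toy version the original row keeps its weight, while the two added rows carry NA, mirroring the printout above.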

Dealing with diligent observers

In practice, diligent observers sometimes record more species than those expected. Such “excess” data are frequently useless from an estimation point of view (because the sampling is observer-dependent and therefore likely non-representative), but in some analyses (e.g., distribution of rare species) or summaries (e.g., totals of biomass sampled) it may be useful to preserve them in the data.

The choice between these two cases is made via the argument overwriteSampled of generateNAsUsingSL. By default (the estimation case) the argument is set to TRUE, which makes generateNAsUsingSL set the weights of these extra species to NA. Explicitly setting overwriteSampled=FALSE keeps the information collected.

To demonstrate this we make a small alteration to the example data, removing Nephrops norvegicus from the Species List. This creates a somewhat atypical situation (a haul where nothing was supposed to be looked for but Nephrops norvegicus was still registered) that is used here to keep the example simple.

# remove Nephrops norvegicus (the only row) from the Species List
myH1DataObject1$SL<-myH1DataObject1$SL[-1,]
validateRDBESDataObject(myH1DataObject1, checkDataTypes = TRUE)

Now we call generateNAsUsingSL for Nephrops norvegicus with its implicit default overwriteSampled=TRUE (the regular estimation case). The function sets the weights of that species to NA.

myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1, 
                                         targetAphiaId = c("107254"))
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> Key: <SAid>
#>      SAid   SSid  LEid SArecType SAseqNum SAparSequNum SAstratification
#>     <num>  <int> <int>    <char>    <num>        <num>           <char>
#> 1: 572813 227694    NA        SA        1           NA                N
#>    SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#>           <char>    <char>        <int>       <int>
#> 1:             U    107254           NA          NA

If, on the other hand, we are interested in keeping all available data, we set overwriteSampled=FALSE.

myH1DataObject1updte<-generateNAsUsingSL(myH1DataObject1, 
                                         targetAphiaId = c("107254"),
                                         overwriteSampled=FALSE)
myH1DataObject1updte$SA[,c(1:9,48:49)]
#> Key: <SAid>
#>      SAid   SSid  LEid SArecType SAseqNum SAparSequNum SAstratification
#>     <num>  <int> <int>    <char>    <num>        <num>           <char>
#> 1: 572813 227694    NA        SA        1           NA                N
#>    SAstratumName SAspeCode SAtotalWtMes SAsampWtMes
#>           <char>    <char>        <int>       <int>
#> 1:             U    107254          276         276
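The two outcomes above can be reproduced with a toy function. This is a minimal sketch under stated assumptions rather than the package's code: maskUnlisted and the toy table are hypothetical, and the only behaviour modelled is that weights of sampled species absent from the species list are set to NA unless overwriteSampled is FALSE.

```r
# Toy stand-in for a few SA columns (not the RDBES format)
toySamples <- data.frame(
  SAspeCode = "107254",
  SAtotalWtMes = 276,
  SAsampWtMes = 276,
  stringsAsFactors = FALSE
)

# Hypothetical helper mimicking the effect of overwriteSampled
maskUnlisted <- function(samples, speciesList, targets,
                         overwriteSampled = TRUE) {
  # sampled rows whose species was requested but is not on the species list
  unlisted <- samples$SAspeCode %in% setdiff(targets, speciesList)
  if (overwriteSampled) {
    samples$SAtotalWtMes[unlisted] <- NA
    samples$SAsampWtMes[unlisted] <- NA
  }
  samples
}

# Empty species list, as in the altered example data
masked <- maskUnlisted(toySamples, character(0), "107254")
kept <- maskUnlisted(toySamples, character(0), "107254",
                     overwriteSampled = FALSE)
```

With the default, the weights in masked become NA; with overwriteSampled = FALSE, kept retains the recorded weights.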