Estimating Population parameters from RDBESDataObjects

Introduction

The aim of this document is to illustrate how population parameters like total and mean can be estimated using the estimMC(…) function. And how a multi level or multi stratum estimation can be run to estimate the population total and mean using doEstimationForAllStrata(…).

library(RDBEScore)

Prerequisites

First we’ll load some example data from the RDBES and check it’s valid. It’s a good idea to check your RDBESDataObjects are valid after any manipulations you perform.

h1d <- H1Example
validateRDBESDataObject(h1d, verbose = FALSE)
h1d
#> Hierarchy 1 RDBESdataObject:
#>  DE: 8
#>  SD: 8
#>  VS: 1214 (SRSWOR,CENSUS,SRSWR: 2-135/4-1382)
#>  FT: 1430 (CENSUS,SRSWR: 1-3/1-100)
#>  FO: 1916 (CENSUS,SRSWR: 1-3/1-20)
#>  SS: 1916 (CENSUS,SRSWR: 1/1-4)
#>  SA: 1916 (CENSUS,SRSWR: 1/1-2)
#>  FM: 7290
#>  BV: 14580 (SRSWR: 2/2)
#>  VD: 311
#>  SL: 170

Single level Multiple Count Estimator

Let’s first estimate the last level values for a single FMid so that estimation is done for one top level record

FMidSel <- "4033243"
BV <- h1d$BV[h1d$BV$FMid == FMidSel,]

estimMC(as.numeric(BV$BVvalueMeas),
        BV$BVnumSamp,
        BV$BVnumTotal, 
        method=unique(BV$BVselectMeth))
#> $est.total
#> [1] 4
#> 
#> $est.mean
#> [1] 2
#> 
#> $est.algorithm
#> [1] "Generalized Horvitz-Thompson aka Mutiple-Count"
#> 
#> $var.total
#> [1] 0
#> 
#> $var.mean
#> [1] 0
#> 
#> $var.algorithm
#> [1] "Sen-Yates-Grundy"
#> 
#> $PI
#>      [,1] [,2]
#> [1,]  0.5  0.5
#> [2,]  0.5  0.5

This estimMC(…) function is actually the core of the estimation functions running on multiple levels as well. For implementation details see: Variance calculation functions using “Multiple count” estimator

Estimation workflow

Estimation actually is done on a RDBESEstObject that is generate from RDBESDataObject using createRDBESEstObject(…). The following example uses a dataset that resembles in many ways real data at ICES area 27.3.d.28.1 from the Estonian Baltic Trawling fleet for 2022 sprat (Sprattus sprattus).The dataset has both total landings and commercial sampling data.

H8ExampleEE1
#> Hierarchy 8 RDBESdataObject:
#>  DE: 1
#>  SD: 1
#>  TE: 11 (SRSWOR: 2-3/4)
#>  VS: 14 (SRSWR: 1-2/7-12)
#>  LE: 15 (SRSWOR: 1-2/1-2)
#>  SS: 15 (CENSUS: 2/2)
#>  SA: 15 (SRSWOR: 1/794-14268)
#>  BV: 3995 (CENSUS: 16-100/16-100)
#>  VD: 7
#>  SL: 2
#>  CL: 71
#>  CE: 132

The data has not gone through RDBES upload and download procedure and contains invalid data types but also may have some other validity problems. However the fields for estimation should be present and valid.

#to check data types
validateRDBESDataObject(H8ExampleEE1, checkDataTypes = TRUE, strict = FALSE)

#create the estimation object to estimate values on the SA table
estData <- createRDBESEstObject(H8ExampleEE1, 8, "SA" )

#the SAsampWtMes is in grammes let's estimate in tonnes
estData$SAsampWtMes <- estData$SAsampWtMes / (1000 * 1000)

results <- doEstimationForAllStrata(estData, "SAsampWtMes")

#let's compare the results from 1st quarter with actual catches from Q1
CL <- H8ExampleEE1$CL
Q1months <- c("January", "February", "March")
Q1results <- results[results$recType== "TE" &
                     results$stratumName %in% Q1months,]
cat("Total estimated SPR catch in Q1: ",
    round(sum(Q1results$est.total), 3),  "tonnes\n")

Total estimated SPR catch in Q1: 381.262 tonnes


#CLoffWeight is in Kg, converting to tonnes
cat("Total reported  SPR catch in Q1: ", 
    sum(CL[CL$CLquar ==1, "CLoffWeight"])/1000, "tonnes")

Total reported SPR catch in Q1: 396.21 tonnes

#let's add the actual totals
Q1results$actualTotals <- c(February=124.829,January=119.108, March=152.273)

So the total estimated catch (est.total) for the first quarter is near the total of the actual catch. The table below gives the estimates by month with standard errors (se.total). The (est.mean) and its standard error (SE, se.mean) are the estimated mean catch of sprat per week in the given month.

#there are 4 weeks and 2 are sampled in TE so in this case the total is: 
#mean * 4 
Q1results$est.mean * 4
#> [1] 209.27146 138.99464  32.99554

As you can see from the table for the first 2 months the estimates with se.total are in line with the estimate as for March somehow there is a big discrepancy.

columns <- c("recType","stratumName","actualTotals",
             "est.total", "se.total", "est.mean", "se.mean")
captionText <- paste("Estmation results for 1st quarter by months (stratums).",
                     "Total estimated catch (est.total) and standard error (se.total)",
                     " as well as estimated mean with se for vessels are given.",
                     "In addition actual total reported weights are given.")
knitr::kable(Q1results[columns], digits=3, caption=captionText)

Estmation results for 1st quarter by months (stratums). Total estimated catch (est.total) and standard error (se.total) as well as estimated mean with se for vessels are given. In addition actual total reported weights are given.
	recType	stratumName	actualTotals	est.total	se.total	est.mean	se.mean
56	TE	February	124.829	209.271	97.071	52.318	24.268
57	TE	January	119.108	138.995	41.709	34.749	10.427
58	TE	March	152.273	32.996	7.213	8.249	1.803

In the next sections we will use data from R packages survey and SDAResources that are converted into the RDBESDataObject to demonstrate the estimation procedure.

Estimation on Multiple Strata

From SDAResources we use agstrat which is data from a stratified random sample of size 300 from the 1992 U.S. Census of Agriculture agpop data. The datset is converted into RDBESDataObject Pckg_SDAResources_agstrat_H1 so that Table VS and SA contain the core information of used in Lohr examples 3.2 and 3.6. Table VS is stratified with VSstratumName set to region, and VSnumberSampled and VSnumberTotal set according to agstrat VSunitName is set to a combination of original county, state, region and agstrat row numbers. Table SA contains the variable measured acres92 in , and set to 1. Table DE, SD, FT and FO are for the most dummy tables inserted to meet RDBES model requirements to be aggregated during estimation tests. Values of mandatory fields have dummy values taken from an onboard programme, with exception of * - that is set to CENSUS - they should be ignored during estimation. BV, FM, CL, and CE are not provided. SL and VD are subset to the essential rows.

agstrat_H1 <- Pckg_SDAResources_agstrat_H1 #valid
agstrat_H1
#> Hierarchy 1 RDBESdataObject:
#>  DE: 1
#>  SD: 1
#>  VS: 300 (SRSWOR: 21-135/220-1382)
#>  FT: 300 (CENSUS: 1/1)
#>  FO: 300 (CENSUS: 1/1)
#>  SS: 300 (CENSUS: 1/1)
#>  SA: 300 (CENSUS: 1/1)
#>  FM: 0
#>  BV: 0
#>  VD: 311
#>  SL: 1


#create the estimation object to estimate values on the SA table
estObj <- createRDBESEstObject(agstrat_H1, 1, "SA")


res <- doEstimationForAllStrata(estObj, "SAsampWtMes")

# Get the estimated total and mean  for "SAsampWtMes" for the VS stratum "NC"
columns2Get <- c("est.total","est.mean", "se.total","se.mean")
round(unlist(res[res$recType == "VS" & res$stratumName == "NC" ,columns2Get]),1)
#>   est.total    est.mean    se.total     se.mean 
#> 316731379.7    300504.2  16977399.2     16107.6

How to interpret these results in the above example?

est.total - total SAsampWtMes sampled weight per vessel per year in the stratum
est.mean - mean SAsampWtMes per vessel per year in the stratum (i.e how much in total one ship on average contributed to the sampled total in this stratum)

You can explore the result to see the estimated total, mean etc. for other stratums.

Two Level Estimation on Single Stratum

From survey we use apiclus2 which is the Academic Performance Index computed for all California schools based on standardised testing of students. The data sets contain information for all schools with at least 100 students and for various probability samples of the data. The design is 2-stage cluster sampling with clusters of unequal sizes An SRS of 40 districts is selected (psus) from the 757 districts in the population and then up to 5 schools (min 1) were selected from each district (ssus) The datset is converted into RDBESDataObject Pckg_survey_apiclus2_H1 so that 1 DE row with DEstratumName == “Pckg_SDAResources_apiclus2_H1”, 1 child SD row, 40 child rows in VS (the 40 districts), VSnumberTotal is 757, VSnumberSampled is 40, 126 child rows in FT (the 126 schools finally observed), each associated to its cluster (dname), FTnumberTotal is the number of schools in district, FTnumberSAmpled is 1…5 schools sampled, tables FO, SS are just 1:1 links to the final data (in SA), in SA SAsampleWeightMeasured is enroll (NB! there are 4 NAs)

apiclus2_H1 <- Pckg_survey_apiclus2_H1 # valid H1
apiclus2_H1
#> Hierarchy 1 RDBESdataObject:
#>  DE: 1
#>  SD: 1
#>  VS: 40 (SRSWOR: 40/757)
#>  FT: 126 (SRSWOR: 1-5/1-72)
#>  FO: 126 (CENSUS: 10/10)
#>  SS: 126 (CENSUS: 1/1)
#>  SA: 126 (CENSUS: 1/1)
#>  FM: 0
#>  BV: 0
#>  VD: 311
#>  SL: 1


#create the estimation object to estimate values on the SA table
estObj <- createRDBESEstObject(apiclus2_H1, 1, "SA")
    
# Run estimation on total measured weight: SAtotalWtMes
res <- doEstimationForAllStrata(estObj, "SAtotalWtMes")

# Get the estimated total and mean  for "SAtotalWtMes" for the VS
columns2Get <- c("est.total","est.mean", "se.total","se.mean")
round(unlist(res[res$recType == "VS" ,columns2Get]),1)
#> est.total  est.mean  se.total   se.mean 
#> 2639272.9    3486.5  798295.7    1054.6

How to interpret these results in the above example?

est.total - total caught weight per vessel per year
est.mean - mean caught weight per vessel per year (i.e how much in total one ship on average contributed to the total catch per year)

The SE values are just the standard errors for the total and mean respectively.