---
title: "03a Estimating population parameters: unbiased estimator"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{03a Estimating population parameters: unbiased estimator}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```


## Introduction 

The aim of this document is to illustrate how population parameters like total and mean can be estimated using the **estimMC(...)** function. And how a multi level or multi stratum estimation can be run to estimate the population total and mean using  **doEstimationForAllStrata(...)**.

```{r}
library(RDBEScore)
```

## Prerequisites

First we'll load some example data from the RDBES and check it's valid.  It's a good idea to check your RDBESDataObjects are valid after any manipulations you perform.

```{r load}
h1d <- H1Example
validateRDBESDataObject(h1d, verbose = FALSE)
h1d
```


## Single level Multiple Count Estimator

Let's first estimate the last level values for a single FMid so that estimation is done for one top level record 
```{r}
FMidSel <- "4033243"
BV <- h1d$BV[h1d$BV$FMid == FMidSel,]

estimMC(as.numeric(BV$BVvalueMeas),
        BV$BVnumSamp,
        BV$BVnumTotal, 
        method=unique(BV$BVselectMeth))
```

This **estimMC(...)** function is actually the core of the estimation functions running on multiple levels as well. For implementation details see: [Variance calculation functions using "Multiple count" estimator](https://github.com/ices-tools-dev/RDBEScore/issues/13) 

## Estimation workflow

Estimation actually  is done on a *RDBESEstObject* that is generate from *RDBESDataObject* using **createRDBESEstObject(...)**. The following example uses a dataset that  resembles in many ways real data at ICES area 27.3.d.28.1 from the Estonian Baltic Trawling fleet for 2022 sprat (*Sprattus sprattus*).The dataset has both total landings and commercial sampling data.

```{r}
H8ExampleEE1
```
The data has not gone through RDBES upload and download procedure and  contains invalid data types but also may have some other validity problems. However the fields for estimation should be present and valid.
```{r eval=FALSE, echo=TRUE}
#to check data types
validateRDBESDataObject(H8ExampleEE1, checkDataTypes = TRUE, strict = FALSE)
```


```{r, results='asis'}
#create the estimation object to estimate values on the SA table
estData <- createRDBESEstObject(H8ExampleEE1, 8, "SA" )

#the SAsampWtMes is in grammes let's estimate in tonnes
estData$SAsampWtMes <- estData$SAsampWtMes / (1000 * 1000)

results <- doEstimationForAllStrata(estData, "SAsampWtMes")

#let's compare the results from 1st quarter with actual catches from Q1
CL <- H8ExampleEE1$CL
Q1months <- c("January", "February", "March")
Q1results <- results[results$recType== "TE" &
                     results$stratumName %in% Q1months,]
cat("Total estimated SPR catch in Q1: ",
    round(sum(Q1results$est.total), 3),  "tonnes\n")

#CLoffWeight is in Kg, converting to tonnes
cat("Total reported  SPR catch in Q1: ", 
    sum(CL[CL$CLquar ==1, "CLoffWeight"])/1000, "tonnes")

```

```{r}
#let's add the actual totals
Q1results$actualTotals <- c(February=124.829,January=119.108, March=152.273)
```


So the total estimated catch (*est.total*) for the first quarter is near the total of the actual catch. The table below gives the estimates by month with standard errors (*se.total*). The (*est.mean*) and its standard error (SE, *se.mean*) are the estimated mean catch of sprat per week in the given month.

```{r}
#there are 4 weeks and 2 are sampled in TE so in this case the total is: 
#mean * 4 
Q1results$est.mean * 4
```

As you can see from the table for the first 2 months the estimates with *se.total* are in line with the estimate as for March somehow there is a big discrepancy.
```{r}
columns <- c("recType","stratumName","actualTotals",
             "est.total", "se.total", "est.mean", "se.mean")
captionText <- paste("Estmation results for 1st quarter by months (stratums).",
                     "Total estimated catch (est.total) and standard error (se.total)",
                     " as well as estimated mean with se for vessels are given.",
                     "In addition actual total reported weights are given.")
knitr::kable(Q1results[columns], digits=3, caption=captionText)
```


In the next sections we will use data from R packages [survey](https://CRAN.R-project.org/package=survey) and [SDAResources](https://CRAN.R-project.org/package=SDAResources) that are converted into the *RDBESDataObject* to demonstrate the estimation procedure.


### Estimation on Multiple Strata

From [SDAResources](https://CRAN.R-project.org/package=SDAResources) we use *agstrat* which is  data from a stratified random sample of size 300 from the 1992 U.S. Census of Agriculture agpop data. The datset is converted into *RDBESDataObject* **Pckg_SDAResources_agstrat_H1** so that  Table VS and SA contain the core information of \code{data(agstrat)} used in Lohr examples 3.2 and 3.6. Table VS is stratified with VSstratumName set to *region*, and VSnumberSampled and VSnumberTotal set according to *agstrat* VSunitName is set to a combination of original *county*, *state*, *region* and *agstrat* row numbers.
Table SA contains the variable measured *acres92* in \code{SAtotalWeightMeasured}, \code{SAsampleWeightMeasured} and \code{SAconversionFactorMeasLive} set to 1.
Table DE, SD, FT and FO are for the most dummy tables inserted to meet RDBES model requirements to be aggregated during estimation tests. Values of mandatory fields have dummy values taken from an onboard programme, with exception of  *\code{selectionMethod} - that is set to CENSUS - they should be ignored during estimation. BV, FM, CL, and CE are not provided. SL and VD are subset to the essential rows.

```{r}
agstrat_H1 <- Pckg_SDAResources_agstrat_H1 #valid
agstrat_H1
```


```{r}

#create the estimation object to estimate values on the SA table
estObj <- createRDBESEstObject(agstrat_H1, 1, "SA")


res <- doEstimationForAllStrata(estObj, "SAsampWtMes")

# Get the estimated total and mean  for "SAsampWtMes" for the VS stratum "NC"
columns2Get <- c("est.total","est.mean", "se.total","se.mean")
round(unlist(res[res$recType == "VS" & res$stratumName == "NC" ,columns2Get]),1)
 
```

How to interpret these results in the above example?

 - *est.total* - total SAsampWtMes sampled weight per vessel per year in the stratum
 - *est.mean* - mean SAsampWtMes per vessel per year in the stratum (i.e  how much in total one ship on average contributed to the sampled total in this stratum)

You can explore the result to see the estimated total, mean etc. for other stratums. 


### Two Level Estimation on Single Stratum

From [survey](https://CRAN.R-project.org/package=survey) we use apiclus2 which is the Academic Performance Index  computed for all California schools based on standardised testing of students. The data sets contain information for all schools with at least 100 students and for various
 probability samples of the data. The design is 2-stage cluster sampling with clusters of unequal sizes  An SRS of 40 districts is selected (psus) from the 757 districts in the population and then  up to 5 schools (min 1) were selected from each district (ssus)
The datset is converted into *RDBESDataObject* **Pckg_survey_apiclus2_H1** so that  1 DE row with DEstratumName == "Pckg_SDAResources_apiclus2_H1", 1 child SD row, 40 child rows in VS (the 40 districts),  VSnumberTotal is 757,  VSnumberSampled is 40, 126 child rows in FT (the 126 schools finally observed), each associated to its cluster (dname), FTnumberTotal is the number of schools in district, FTnumberSAmpled is 1...5 schools sampled, tables FO, SS are just 1:1 links to the final data (in SA), in SA SAsampleWeightMeasured is enroll (NB! there are  4 NAs)

```{r}
apiclus2_H1 <- Pckg_survey_apiclus2_H1 # valid H1
apiclus2_H1
```

```{r}

#create the estimation object to estimate values on the SA table
estObj <- createRDBESEstObject(apiclus2_H1, 1, "SA")
    
# Run estimation on total measured weight: SAtotalWtMes
res <- doEstimationForAllStrata(estObj, "SAtotalWtMes")

# Get the estimated total and mean  for "SAtotalWtMes" for the VS
columns2Get <- c("est.total","est.mean", "se.total","se.mean")
round(unlist(res[res$recType == "VS" ,columns2Get]),1)
```

How to interpret these results in the above example?

 - *est.total* - total caught weight per vessel per year 
 - *est.mean* - mean caught weight per vessel per year (i.e  how much in total one ship on average contributed to the total catch per year)
 
 The **SE** values are just the standard errors for the total and mean respectively.


See also other package vignettes:

```{r,echo=F, results='asis', eval=T}
# note: vignettes are first compiled and only then html, R and Rmd are moved to folder doc
vignettes <- data.frame(Rmd=dir()[grepl(dir(), pat=".Rmd")])
# determines index of vignette being compiled
thisVignetteRmd<-paste0(gsub(":","",gsub(" ", "-",as.character(rmarkdown::metadata$title))),".Rmd")
# prints list with links
for (i in seq_along(vignettes$Rmd)) {
  if(tolower(vignettes$Rmd[i])!=tolower(thisVignetteRmd)){
  cat(sprintf("- [%s](%s)\n", gsub("-"," ",gsub(".Rmd","",vignettes$Rmd[i])), gsub(".Rmd",".html",vignettes$Rmd[i])))
}
}
```

#END