%\VignetteIndexEntry{Mandallaz' model-assisted small area estimators}
\documentclass[a4paper]{article}
\bibliographystyle{unsrt}
\usepackage[usenames,dvipsnames]{color}
\usepackage{listings}
\lstset{ %
basicstyle=\footnotesize
, commentstyle=\color{PineGreen}
, numberstyle=\tiny\color{Gray}
, rulecolor=\color{Black}
, keywordstyle=\color{Blue}
, stringstyle=\color{Sepia}
, showstringspaces=false
, language=R
}
\usepackage[utf8]{inputenc}
\title{Mandallaz' Model-Assisted Small Area Estimators}
\author{Andreas Dominik Cullmann}

\SweaveOpts{echo=false}
\SweaveOpts{eval=false}
\SweaveOpts{print=false}
\SweaveOpts{width=60}
\begin{document}
\maketitle
\providecommand{\rpack}[1]{package \texttt{#1}}
\providecommand{\rdata}[1]{\texttt{#1}}
\providecommand{\rcode}[1]{\texttt{#1}}
\section{Introduction}
Model-\emph{based} small area estimators (for example \cite{rao2003}, chapters 5ff.)
depend on model assumptions to hold. This dependency doesn't make them very attractive
for official statistics. 

Model-\emph{assisted} small area estimators do not depend on the model assumptions to hold,
albeit their variances will be higher if the model is inappropriate (see
\cite{saerndal1992}, chapter 6.7). The synthetic-regression estimator (SRE) (for example
\cite{rao2003}, chapter 4.2.2) is biased, and the variance of its biased-corrected version,
the generalized regression estimator (for example \cite{rao2003}, chapter 2.5),
``depends crucially on the [\ldots{}] residuals in the small area''
(\cite{mandallaz2013a2}, p. 444), which basically means that its variance will be
unacceptably high in many applications.

Daniel Mandallaz and others (
\cite{mandallaz2012a1}, 
\cite{mandallaz2013a2}, 
\cite{mandallaz2013b2}, 
\cite{mandallaz2013b1}, 
\cite{mandallaz2014c2} and
\cite{mandallaz2013c1} 
)
propose an unbiased extension of the SRE for two- and
three-phase sampling designs with or without clustering. The variance of the  extended 
SRE is, like that of the SRE, based on all residuals in the second (or third) phase 
and asymptotically equivalent to the SRE's variance (see \cite{mandallaz2013a2}, p.~444).

\section{Non-Exhaustive Auxiliary Information}
\subsection{Two-Phase Sampling}
<<eval=true>>==
library("maSAE")
@
Let us suppose we have a two-phase sampling design, and the sample data can be loaded
via 
<<echo=true, eval=true>>==
data("s2")
data("s1")
@
We now add sampling phase indicators to the data and join (from a database point of view
we do a union, but I'll keep calling it join) it into a single data.frame
\rcode{s12}:
<<echo=true, eval=true>>==
s12 <- bind_data(s1, s2)
@
We build a small area Object
<<echo=true, eval=true>>==
saeO <- saObj(data = s12, f = y ~x1 + x2 + x3 | g,
              s2 = "phase2")
@
and get the small area estimations for the non-exhaustive case as
<<echo=true, eval=true>>==
predict(saeO)
@

\subsection{Three-Phase Sampling}
Suddenly we stumble across data from a  third sampling phase, the null phase (\rdata{s0} has
all the predictors of \rdata{s2}, but we keep only one -- if we kept all of them, we'ld
be back to two-phase sampling with more observations), and we join all three phases 
into \rdata{s012}:
<<echo=true, eval=true>>==
data("s0")
s0$x1 <- s0$x3 <- NULL
s012 <- bind_data(s1, s2, s0)
@
from which we predict again: 
<<echo=true, eval=true>>==
predict(saObj(data = s012,  f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", s1 = "phase1"))
@
Note the drop in variance induced by the extensive null phase sampling.

\section{Partially Exhaustive Auxiliary Information}
Let us suppose we \emph{knew} the estimated small area means  of the fixed effect 
sampled in all three phases to be the true small area means:
<<echo=true, eval=true>>==
tm1 <- as.data.frame(tapply(s012$x2, s012$g, mean))
names(tm1)[1] <- c("x2"); tm1$g <- row.names(tm1)
predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", smallAreaMeans = tm1))
@
Again, the variance estimation is reduced, but not as markedly as before: due to
the extensive null phase the mean estimations had very small variances (which added to the
small area estimation variances).
\section{Exhaustive Auxiliary Information}
Of course we could also take our estimated small area means of all fixed effects 
from the first and second phase to be the true means:
<<echo=true, eval=true>>==
preds <- paste("x",1:3, sep="")
tm <- as.data.frame(rbind(colMeans(subset(s12, g == "a")[, preds]),
                          colMeans(subset(s12, g == "b")[, preds])
                          )
); tm$g=c("a", "b")
@
That would give us the smallest variance estimates:
<<echo=true, eval=true>>==
predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", smallAreaMeans = tm))
@
\section{Model-Based Small Area Estimation}
\textit{Wait! If I use the code in Appendix  \ref{RAO} to calculate the EBLUP for the basic unit
level model given by \cite{rao2003}, chapter 7.2, I get at least a much smaller mse1:}

<<echo=true, eval=true>>==
source("Rao.R")
library(nlme)
dat <- subset(s2, ! is.na(s2$g))
dat <- dat[with(dat, order(g)), TRUE]
aLmeObj <- lme(y ~x1 + x2 + x3, data = dat, random =  ~1 | g)
foo <- new(Class = "sae", lmeObj = aLmeObj, domain.df = tm)
sae(foo)
@

Of course you do, but you rely on the model's assumptions to hold. Have you checked them?

\textit{ No, I don't even know what they are. But even if the tests would not reject the
hypotheses of the assumptions being valid on, say, a .95 confidence level, I could never
be sure.}\\
That's why I wrote \rcode{maSAE}.\\
\textit{ I see, but wait again, what is \ldots}

\section{Cluster Sampling}
<<echo=true, eval=true>>==
 grep(".*clust", capture.output(str(s1)), value = TRUE)
@
\textit{\ldots this ``clustid'' in the data.frames?}\\
Dear, I forgot that sampling for my completely made up example data 
was done with a cluster design!
So all the above variances estimates are too optimistic. My fault.
For the sake of CPU time on CRAN, I'll leave the following to you:

<<echo=true, eval=false>>==
predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", cluster = "clustid"))
@
<<echo=true, eval=false>>==
predict(saObj(data = s012,  f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", s1 = "phase1",
              cluster = "clustid"))
@
<<echo=true, eval=false>>==
predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", smallAreaMeans = tm1,
              cluster = "clustid"))
@
<<echo=true, eval=false>>==
predict(saObj(data = s12, f = y ~x1 + x2 + x3 | g,
              s2 = "phase2", smallAreaMeans = tm,
              cluster = "clustid"))
@
\bibliography{bibliography}
\appendix
\section{Rao.R\label{RAO}}
\lstinputlisting{Rao.R}

\end{document}
