% \VignetteIndexEntry{Assigning alleles to isoloci in polysat}
% \VignetteDepends{ape, polysat}

\documentclass{article}

\usepackage{hyperref}

\title{Assigning alleles to isoloci in \textsc{polysat}}
\date{\today}
\author{Lindsay V. Clark \\ University of Illinois, Urbana-Champaign}

\hyphenation{processDatasetAllo}
\hyphenation{alleleCorrelations}
\hyphenation{testAlGroups}

\begin{document}
\SweaveOpts{concordance=TRUE}

\maketitle

\section{Introduction}

This tutorial accompanies the R package \textsc{polysat} versions 1.6 and later,
and demonstrates how to use functions in \textsc{polysat} for recoding data from 
allopolyploid or diploidized autopolyploid organisms such that each 
microsatellite marker is split into multiple isoloci.  The data can then be 
analyzed under the model of random, Mendelian segregation; for example an 
allotetraploid organism can be treated as diploid, giving the user access to
a much greater number of analyses both with \textsc{polysat} and with
other software.

If you are not sure whether your organism is fully polysomic or whether it has
two or more subgenomes that segregate independently from each other, keep reading.
In the ``Quality of clustering'' section I illustrate what the results look like
if a locus (Loc7 in the example dataset) segratates in a polysomic (autopolyploid) 
manner.

The methods described in this tutorial are designed for natural populations of 
one hundred or more individuals.  They will also work for certain types of 
mapping populations, such as an F2 population derived from two inbred 
grandparents.  They will be much less effective for small sample sizes 
($<50$ for tetraploids, $<200$ for higher ploidies), and for
datasets with multiple species or highly-diverged ecotypes.  The methods in 
this tutorial are also inappropriate for species that reproduce asexually.  If
you have duplicate genotypes in your dataset (either as a result of asexual reproduction,
vegatative spread, or sampling the same individual multiple times) you should remove 
them before starting.

It is assumed that the reader is already familiar with R and with \textsc{polysat}.
If you aren't, please spend some time with ``An Introduction to R'' and with the
``polysat version 1.7 Tutorial Manual''.  Additionally, there is a manuscript published
in \emph{Molecular Ecology Resources} at \url{http://dx.doi.org/10.1111/1755-0998.12639} that describes in detail the 
rationale and limitations of the allele assignment tools described in this manual.
You may also find the documentation for individual functions to be helpful
(\emph{e.g.} see \texttt{?alleleCorrelations}, \texttt{?processDatasetAllo},
and \texttt{?recodeAllopoly}).
I'm always happy to answer questions or help debug errors over email, although I appreciate it if you
first take the time to read the documentation carefully and check your work.

\section{Data hygiene}

Below I have some recommendations for generating and cleaning up datasets so that
they will have the greatest success with the allele assignment algorithm.

\subsection{Before you begin genotyping}

Choose your markers well.  If your research group has previously run 
microsatellite markers on your species of interest, choose markers that have
given clear, consistent, and easy-to-interpret patterns of amplification in the
past.  If a linkage map has been published for your species, use markers that 
are on the map; low quality markers would have given segregation distortion 
and would not have been mappable.  Lastly, dinucleotide repeat markers should 
be used with caution, as their high mutation rate increases the chance of 
homoplasy occuring, and their high degree of stutter can make amplification
patterns difficult to interpret in polyploids.  Markers with trinucleotide or 
larger repeats will give cleaner results, despite not having as many alleles.

Avoid multiplexing several markers in one PCR reaction.  Multiplexing increases
the probability of scoring error due to allelic dropout.  Run each marker in a
separate reaction, then, if desired for your electrophoresis method, pool the
reactions post-PCR.

Use the highest resolution electrophoresis method available to you.  Agarose
gels may lack the resolution to distinguish all alleles from each other.  If
using acrylamide gels, make an allelic ladder by pooling PCR products from 
several diverse individuals, then run the same ladder on each gel to ensure
consistency of scoring.  If using a capillary sequencer, always use the same
size standard, and be consistent in terms of which fluorescent dye goes with
which marker.

``Oh no!  I've already run all of my microsatellite markers, and I
  don't have time to re-do them!''  It's okay.  Everything above was just a
suggestion to improve the probability that the method described in this
vignette will work on your dataset.  The method may still work, and if it 
doesn't, you can consider whether failing to meet one of the above suggestions
could have caused the problem.

\subsection{Scoring your microsatellite alleles}

If you know someone in your lab or at your institution who has a lot of 
experience scoring microsatellites (and if you are not very experienced), get
them to sit down with you for a couple hours and demonstrate how they would 
score the markers in your dataset.  Understanding the additivity of overlapping
stutter and allele peaks is important, as is knowing how to distiguish true
alleles from PCR artifacts and dye blobs, and knowing how to check that the 
software interpreted the size standard correctly.

Don't trust any piece of software to score your markers.  Most of them were
optimized for diploid species, and even then they have a lot of problems.  After
the software (e.g. GeneMapper or 
\href{http://www.vgl.ucdavis.edu/informatics/strand.php/}{STRand}) has called the 
alleles, you need to manually inspect every genotype, and you will probably correct 
a lot of them.

Consistency is crucial.  One allele may give a pattern of multiple peaks, so
you need to decide which peak to score, and whether to round up or down to get
the size in nucleotides.  If using software that performs ``binning'',
go through and correct all of the allele calls before using them to make bins.
Using STRand, I like to take screenshots to indicate how I score each marker,
then I can easily look at them months later when I genotype additional 
individuals.

\subsection{Preliminary analysis of the data}

In this section I'll use a simulated dataset to demonstrate how to
clean up your data and split it into subpopulations if necessary.
The same simulated dataset will be used throughout the rest of this manual.

<<>>=
library(polysat)
data(AllopolyTutorialData)
summary(AllopolyTutorialData)
# make a copy of the dataset to modify
mydata <- AllopolyTutorialData
@

Note that datasets can be imported using any of the normal import functions
for \textsc{polysat} (like \texttt{read.GeneMapper}).  The \texttt{data} function 
here is used only because this is an example dataset installed with the package.

First, any questionable genotypes should be removed.  If an electropherogram
or banding pattern was unclear, it is best to replace that genotype with
missing data.  Any duplicate or highly similar genotypes that likely represent
the same individual (or a group of asexually derived individuals) should be 
removed, such that each genotype is only represented once in the dataset.
(The \texttt{assignClones} function might be useful for identifying duplicates if 
you haven't already done so.)  Since this is an allotetraploid, let's also 
make sure that
no genotypes have more than four alleles, and eliminate any that do.

<<>>=
# Calculate the length of each genotype vector (= the number of alleles) and
# construct a TRUE/FALSE matrix of whether that number is greater than four.
tooManyAlleles <- apply(Genotypes(mydata), c(1,2), 
                        function(x) length(x[[1]])) > 4
# Find position(s) in the matrix that are TRUE.
which(tooManyAlleles, arr.ind=TRUE) # 43rd sample, second locus
# Look at the identified genotype, then replace it with missing data.
Genotype(mydata, 43, 2)
Genotype(mydata, 43, 2) <- Missing(mydata)
Genotype(mydata, 43, 2)
@

Next, we'll want to look at population structure in the dataset.  We'll make
a square matrix of genotype dissimilarities using a simple band-sharing
metric, then make a neighbor-joining tree.

<<eval=FALSE>>=
mydist <- meandistance.matrix(mydata, distmetric=Lynch.distance,
                              progress=FALSE)
@

<<echo=FALSE>>=
load("vignettebuild/AllopolyTutorialDist.RData")
@

<<fig=TRUE>>=
require(ape)
mynj <- nj(mydist)
plot(mynj, type="unrooted")
@

You can see that individuals 301, 302, and 303 are highly dissimilar from
the rest.  This is what it looks like when some individuals are a 
different species.  Because of the fast rate at which microsatellites mutate,
allele assignments that we make in one species are very unlikely to
apply to another species.  We will remove these three individuals from
the dataset.

<<>>=
mydata <- deleteSamples(mydata, c("301","302","303"))
@

Now let's examine the rest of the dataset for population structure using
principal coordinates analysis.

<<fig=TRUE>>=
par(mfrow=c(2,1))
mypca <- cmdscale(mydist[Samples(mydata), Samples(mydata)])
plot(mypca[,1], mypca[,2])
hist(mypca[,1], breaks=30)
@

We can see a slightly bimodal distribution of individuals, indicating moderate
population structure.  Since population structure can interfere with preliminary 
allele assignment in the \texttt{alleleCorrelations} step, we will assign 
individuals to two populations that we can analyze separately.  This will also help
us to evaluate reproducibility of allele assignments.

<<>>=
pop1ind <- Samples(mydata)[mypca[,1] <= 0]
pop2ind <- Samples(mydata)[mypca[,1] > 0]
PopInfo(mydata)[pop1ind] <- 1
PopInfo(mydata)[pop2ind] <- 2
@

Of course, principal coordinates analysis is not the only method of assigning
individuals to populations.
If you have already analyzed your data with other software (for example, Structure)
before importing it into \textsc{polysat}, you could instead use population 
assignments from that software, and construct the \texttt{PopInfo} vector
accordingly to indicate population assignments.

\section{The \textsc{polysat} algorithm for allele assignment}

\subsection{General considerations and parameters}

To assign alleles to isoloci, you will primarily be using the function
\texttt{processDatasetAllo}.  This function internally calls two other functions
from \textsc{polysat}: \texttt{alleleCorrelations}
looks for negative correlations between alleles and uses those correlations to make
preliminary assignments, then \texttt{testAlGroups} adjusts those assignments if 
necessary after checking them against individual genotypes.  \texttt{testAlGroups}
has some parameters that affect its accuracy depending on the rate of meiotic error 
(pairing between homeologs or 
paralogs during meiosis), the presence of null alleles,
and homoplasy (different alleles with identical amplicon size) between isoloci.
Below are some recommendations for adjusting arguments from the defaults 
depending on particular issues in the dataset.
\vspace{5mm}

\begin{tabular}{| l | l |}
\hline
Many alleles ($>10$)             & Increase \texttt{R} (the number of allele swaps attempted) \\
Meiotic error                    & Increase \texttt{tolerance} \\
Null alleles                     & \texttt{null.weight = 0} \\
Null alleles at high frequency   & \texttt{swap = FALSE} \\
Homoplasy                        & \texttt{swap = FALSE} \\
\hline
\end{tabular}

\vspace{5mm}
Because you may not know whether the last four issues are present in your
dataset, I recomment trying several parameter sets.  The function 
\texttt{processDatasetAllo} tests several parameter combinations across
all loci in the dataset.

\subsection{Running the algorithm}

Now we will use \texttt{processDatasetAllo} to make allele assignments.
Because this is an allotetraploid we will use \texttt{n.subgen = 2}
(two subgenomes) and \texttt{SGploidy = 2} (each subgenome is diploid).
Because we divided our dataset into two subpopulations that appear to have
different allele frequencies, we will set \texttt{usePops = TRUE} to 
analyze the two populations separately according to \texttt{PopInfo(mydata)}.
We will leave the \texttt{parameters} argument at the default, which includes
one parameter set optimized for no null alleles and no homoplasy, one optimized
for homoplasy, and two optimized for null alleles.  This command will take a 
few minutes to process (especially since the data set includes deliberately
problematic loci, which take longer).  Here I have set \texttt{R} to 500,
which also increases processing time but makes it more likely that your results
will look similar or identical to mine.

<<eval=FALSE>>=
myassign <- processDatasetAllo(mydata, n.subgen = 2, SGploidy = 2, 
                               usePops = TRUE, R = 500)
@

<<echo=FALSE>>=
load("vignettebuild/AllopolyTutorialAssign.RData")
@

\subsection{Inspecting the results}

\subsubsection{Warnings about positive correlations}

We got a warning about Loc6 for both the populations.  If population structure were
a serious problem, we would have gotten warnings about most or all loci.  Let's take
a look at which alleles had positive correlations for Loc6, to see if there may 
have been a scoring problem.

The \texttt{myassign} object is a list, and it has an element called \texttt{AlCorrArray}
that contains the results of \texttt{alleleCorrelations} for each locus and population.
Each set of results is also a list.  The list element called \texttt{significant.pos} indicates any
significantly positive associations between alleles.  We will look at the Loc6 results
for both populations.

<<>>=
myassign$AlCorrArray[["Loc6", 1]]$significant.pos
myassign$AlCorrArray[["Loc6", 2]]$significant.pos
@

We see positive correlations between alleles 330 and 327, 339 and 336, and 348 
and 345.  Since these are trinucleotide repeats, it looks like some of the 
larger alleles (which would tend to have more stutter) had stutter peaks
mis-called as alleles.  If this were a real dataset, I would say to go back 
to the gels or elecropherograms and call the alleles more carefully.  Since
this is a simulated dataset, we will simply exclude Loc6 from further analysis.

<<>>=
mydata <- deleteLoci(mydata, loci="Loc6")
@

Occasionally, a locus with no population structure or scoring error will produce a
warning about positive correlations between alleles simply due to sampling error in the dataset.  
If you only get the warning
for one locus and population in your dataset, and it does not look like there was scoring error
(\emph{i.e.} the alleles that are positively correlated are not close in 
amplicon size), you can safely ignore the warning.

\subsubsection{Quality of clustering}

In your working directory, you should now find a file called ``alleleAssignmentPlots.pdf''
containing several plots to help you evaluate clustering quality.  We will recreate
some of those plots here and discuss their interpretation.

<<fig=TRUE>>=
par(mfrow = c(1,1))
plotSSAllo(myassign$AlCorrArray)
@

This plot gives an indication of clustering quality using the K-means method
(from \texttt{alleleCorrelations}) alone, and is intended to help identify 
problematic loci or populations.  The x-axis essentially shows how much of the
variation in \textit{P}-values for correlations between alleles could be 
explained by clustering alleles into groups.  The y-axis indicates how 
similar in size the groups of alleles are, \textit{e.g} two groups of five 
alleles each would give a high value, and one group with one allele and one
group with nine alleles would give a low value.  So, loci and populations
in the upper-right quadrant performed well in allele assignment, but 
others should be subject to scrutiny.

In the last section we identified Loc6 as a locus that we should exclude.  In this plot, for 
both populations, Loc6 has a relatively low value on the x-axis.  It is also marked in
italics to indicate that there were positive correlations between alleles, in case we
hadn't noticed this problem when the algorithm ran.

It looks like Loc7 may also have a problem, given that it got a low value 
on the y-axis in both populations.

We can also make heatmaps of the \textit{P}-values for correlations between alleles
to get a sense of how well the clustering went.  Let's compare a locus that 
seemed to work very well (Loc5) to one that we expect to cluster poorly (Loc6),
then compare them to Loc7.

<<fig=TRUE>>=
heatmap(myassign$AlCorrArray[["Loc5", "Pop1"]]$heatmap.dist,
        main = "Loc5, Pop1")
@

The big red blocks neatly divide the alleles into two groups.

<<fig=TRUE, height=3>>=
# A plot to show how the colors correspond to p-values in the
# heat map; you can repeat this for the other heat maps in this
# tutorial if you wish.
plot(x=seq(min(myassign$AlCorrArray[["Loc5", "Pop1"]]$heatmap.dist),
           max(myassign$AlCorrArray[["Loc5", "Pop1"]]$heatmap.dist), 
           length.out=12),
     y=rep(1,12), xlab="P-values", ylab="", bg=heat.colors(12), 
     pch=22, cex=3)
@
<<fig=TRUE>>=
heatmap(myassign$AlCorrArray[["Loc6", "Pop1"]]$heatmap.dist,
        main = "Loc6, Pop1")
@

For Loc6, it is not nearly as obvious how to divide the alleles into two groups.

<<fig=TRUE>>=
heatmap(myassign$AlCorrArray[["Loc7", "Pop1"]]$heatmap.dist,
        main = "Loc7, Pop1")
@
<<fig=TRUE>>=
heatmap(myassign$AlCorrArray[["Loc7", "Pop2"]]$heatmap.dist,
        main = "Loc7, Pop2")
@

For Loc7, the alleles also don't seem to go into two neat groups.  In fact, the 
dendrograms on both heatmaps first split off a single allele from the rest,
but that allele is different for Pop1 (125) versus Pop2 (150).
When I simulated Loc7, I actually simulated it as being a single tetrasomic locus instead of 
a pair of disomic isoloci.  If you have a large sample
size and no positive correlations between alleles, but the results
for all of your loci look like this, you can probably treat the data
as being autopolyploid (polysomic).  It is also quite possible for an organism
to have some disomic loci and some polysomic loci.  We will leave Loc7 in the 
dataset but won't attempt to do allele assignment or genotype recoding with it.

\subsection{Evaluating the parameter sets}

Now we will take a look at the results of \texttt{testAlGroups}.  As you may
recall, there are several parameters that can be adjusted for that function,
and we tried out the default four sets of parameters with 
\texttt{processDatasetAllo}.  One way to evaluate the quality of the assignments
produced by \texttt{testAlGroups} is by seeing how many alleles it decided
to make homoplasious, \emph{i.e.} how many alleles were assigned to multiple
isoloci.  Let's plot those proportions separately for the two populations.

<<fig=TRUE>>=
plotParamHeatmap(myassign$propHomoplasious, popname = "Pop1", 
                 main = "Proportion homoplasious loci:")
@

Lighter colors indicate fewer homoplasious alleles, which in turn indicates
better quality allele assignment.
We will ignore Loc6 and Loc7 since we already decided to exclude them.  In Pop1,
Loc4 worked best with the third and fourth parameter sets (allowing null alleles)
and Loc1 worked best with the second, third, or fourth parameter sets (allowing 
homoplasy or null alleles),
but the other loci worked equally well regardless of parameter set.

<<fig=TRUE>>=
plotParamHeatmap(myassign$propHomoplasious, popname = "Pop2", 
                 main = "Proportion homoplasious loci:")
@

For all loci in Pop2 except for Loc 1, we see the same proportion of homoplasious 
loci regardless of parameter set.

\texttt{processDatasetAllo} also runs the \texttt{mergeAlleleAssignments}
function to combine allele assignments across populations, within loci
and parameter sets.  We take a look at how many more homoplasious
alleles we get after merging the assignments (since the same allele may
have been assigned to different isoloci in different populations).

<<fig=TRUE>>=
plotParamHeatmap(myassign$propHomoplMerged, popname = "Merged across populations", 
                 main = "Proportion homoplasious loci:")
@

Now we see that Loc6 and Loc7 look a lot worse, since their allele 
assignments were different between populations.  For Loc1, the 
second parameter set (optimized for homoplasy) was clearly the best
in terms of minimizing homoplasious alleles.  For Loc2, the
first or third parameter set (using allele swapping) 
works best.  For Loc3, Loc4, and Loc5,
we have the same number of homoplasious alleles regardless of
parameter set.

We can also examine, for each locus and parameter set, what proportion of
genotypes would be replaced with missing data if we used that set of
allele assignments to recode the data.  Genotypes are recoded as missing
if homoplasy prevents them from being unambiguously determined.
\texttt{processDatasetAllo} runs \texttt{recodeAllopoly} in order to 
determine these missing data rates.  When it does so, it uses the
\texttt{allowAneuploidy=FALSE} option so that genotypes that have too
many alleles ($>$ \texttt{SGploidy}) for an isolocus will also be recoded
as missing.

<<fig=TRUE>>=
plotParamHeatmap(myassign$missRate, popname = "All Individuals", 
                 main = "Missing data after recoding:")
@

This plot mirrors the plot reporting the proportion of loci that are homoplasious,
since homoplasy tends to lead to genotypes that cannot be unambiguously recoded.

\texttt{processDatasetAllo} suggests a best set of allele assignments based
on missing data after recoding and the proportion of homoplasious alleles.
For each locus, it picks the parameter set that results in the lowest missing
data rate, then in the case of a tie the parameter set with the least 
amount of homoplasy (after merging across populations), then in the case of a
tie the lowest numbered parameter set.  So in this case, parameter set 2 would 
be chosen for Loc1, and parameter set 1 for all other loci.

Let's extract that list of optimal assignments from the results.  We will also
eliminate Loc6 and Loc7 from the list since we discarded Loc6 and don't want to
recode Loc7.

<<>>=
myBestAssign <- myassign$bestAssign
myBestAssign
myBestAssign <- myBestAssign[1:5]
@

We can see a large amount of homoplasy for Loc7 and some for Loc6, since the allele
assignment algorithm was not appropriate for those two loci.  We can also
see that Loc1 and Loc4 each have one homoplasious allele.  This accounts
for the much higher proportion of missing data when recoding Loc1 and Loc4 as 
opposed to Loc2, Loc3, and Loc5.

We can also extract from the results the proportion of missing data that is
introduced by recoding genotypes using the best allele assignments.

<<>>=
apply(myassign$missRate, 1, min)
@

Loc1 would have 52\% missing data after being recoded using the best set of 
allele assignments, Loc2 would have 0.7\% missing data, \emph{etc}.

It is possible to take a deeper look at the allele assignments that were
produced before the best set was chosen.  Let's look at Loc1, since it has
such a high missing data rate after recoding.

<<>>=
myassign$AlCorrArray[["Loc1", "Pop1"]]$Kmeans.groups
myassign$AlCorrArray[["Loc1", "Pop2"]]$Kmeans.groups
myassign$AlCorrArray[["Loc1", "Pop1"]]$UPGMA.groups
myassign$AlCorrArray[["Loc1", "Pop2"]]$UPGMA.groups
@

Both K-means clustering and UPGMA placed allele 236 in a different isolocus
depending on population.  This is the same allele that was ultimately 
assigned to be homoplasious in the ``best'' assignment set.

\texttt{testAlGroups} results for Loc1:

<<>>=
myassign$TAGarray[["Loc1", "Pop1", 1]]$assignments
myassign$TAGarray[["Loc1", "Pop2", 1]]$assignments
myassign$mergedAssignments[["Loc1", 1]]$assignments
myassign$TAGarray[["Loc1", "Pop1", 2]]$assignments
myassign$TAGarray[["Loc1", "Pop2", 2]]$assignments
myassign$mergedAssignments[["Loc1", 2]]$assignments
myassign$TAGarray[["Loc1", "Pop1", 3]]$assignments
myassign$TAGarray[["Loc1", "Pop2", 3]]$assignments
myassign$mergedAssignments[["Loc1", 3]]$assignments
myassign$TAGarray[["Loc1", "Pop1", 4]]$assignments
myassign$TAGarray[["Loc1", "Pop2", 4]]$assignments
myassign$mergedAssignments[["Loc1", 4]]$assignments
@

We can see that with parameter set 1, allele 236 was made homoplasious in Pop1 and
allele 227 was made homoplasious in both populations, resulting in both being homoplasious in the
merged set.  With parameter set 2, however, assignments were consistent
between the two populations.  Parameter set 3 produced allele assignments that
were very different from those produced by parameter sets 1 and 2.  The parameter set 4 
assignments matched the ``best'' assignment set for Pop2, but not Pop1.

\subsection{Re-analyzing individual loci}
For problematic loci, you may wish to try different population splits or 
parameter sets.  It is not necessary to re-run \texttt{processDatasetAllo}
on the entire dataset in order to accomplish this; we can run 
\texttt{alleleCorrelations} and \texttt{testAlGroups} on individual loci instead.

Continuing to examine Loc1, what assignments do we get from K-means clustering and 
UPGMA if we don't split the datset into two populations?

<<>>=
corrLoc1AllInd <- alleleCorrelations(mydata, locus = "Loc1", n.subgen = 2)
corrLoc1AllInd$Kmeans.groups
corrLoc1AllInd$UPGMA.groups
@

Analyzing both populations together gives us clustering identical to the 
previous clustering from Pop1.

We'll try \texttt{testAlGroups} with a parameter set identical to parameter set 2
from our previous analysis.  We'll also try increasing \texttt{tolerance} a little
bit to see if it eliminates homoplasy.  Increasing \texttt{tolerance} could be useful 
if Loc1 does not in fact segregate in a completely disomic manner.

<<>>=
TaLoc1.param2 <- testAlGroups(mydata, corrLoc1AllInd, SGploidy = 2,
                              null.weight = 0.5, tolerance = 0.05, 
                              swap = FALSE)
TaLoc1.param5 <- testAlGroups(mydata, corrLoc1AllInd, SGploidy = 2,
                              null.weight = 0.5, tolerance = 0.1, 
                              swap = FALSE)
TaLoc1.param2$assignments
TaLoc1.param5$assignments
@

In both cases we get the same allele assignments as before, so we will leave it as is.

Hypothetically though, if we liked the new assignments better and wanted to use them
to replace the old assignments, here is how it would be done:

<<>>=
myBestAssign[[1]] <- TaLoc1.param5
@

\subsection{Recoding the data}

Now that we've chosen sets of allele assignments to use and thrown away
loci that had problems, we can recode the dataset.

<<>>=
recodedData <- recodeAllopoly(mydata, myBestAssign)
summary(recodedData)
@

You'll notice that we have more loci now that each marker (except Loc7) has
been split into two isoloci.  We also have a lot of missing data, since
homoplasy can lead to uncertainty about what the true genotype is.  We had
homoplasy for Loc1 and Loc 4.

<<>>=
for(L in Loci(recodedData)){
  proportionmissing <- mean(isMissing(recodedData, loci=L))
  cat(paste(L,":",proportionmissing,"missing"),sep="\n")
}
@

You'll also notice that not the entire dataset is diploid.  That is because
there is some meiotic error in the dataset, and we used 
\texttt{allowAneuoploidy = TRUE} in \texttt{recodeAllopoly}.  Most
genotypes are diploid though.

<<>>=
table(Ploidies(recodedData))
@

We can manually add in the ploidy for Loc7, since it was not recoded.

<<>>=
Ploidies(recodedData)[,"Loc7"] <- 4
@

The recoded data may now be analyzed with any \textsc{polysat} function,
or exported to other software using any of the various \texttt{write}
functions in polysat.  For example, it is now appropriate to estimate
allele frequencies from the data and use those to estimate $G_{ST}$.

<<>>=
myfreq <- simpleFreq(recodedData)
myGst <- calcPopDiff(myfreq, metric = "Gst")
myGst
@

We can also export the data if we want to do analysis using other software.

<<>>=
write.GeneMapper(recodedData, file = "tutorialRecodedData.txt")
@

\section{The Catal\'{a}n method of allele assignment}

An alternative method of allele assignment available in \textsc{polysat}
is that by Catal\'{a}n \emph{et al.} (2006; 
\url{http://dx.doi.org/10.1534/genetics.105.042788}).  The Catal\'{a}n 
method does not allow for homoplasy, null alleles, or meiotic error,
but may perform equally to the \textsc{polysat} method in cases of
strong population structure, with less processing time.  Let's try it on our example dataset.

<<>>=
catResults <- list()
length(catResults) <- length(Loci(mydata))
names(catResults) <- Loci(mydata)

for(L in Loci(mydata)){
  cat(L, sep="\n")
  catResults[[L]] <- catalanAlleles(mydata, locus=L, verbose=TRUE)
}
@

Assignments are returned for Loc5 only.  For Loc2 and Loc3, the
algorithm found the correct allele assignments, but did not return them
since some genotypes were inconsistent with those assignments due to
meiotic error.

The results of \texttt{catalanAlleles} can be passed to \texttt{recodeAllopoly}
in the same way as the results of \texttt{testAlGroups}.

\section{Testing assignment accuracy using simulated datasets}

The simulated data in this tutorial, as well as the simulations for the 
manuscript, were created using the \texttt{simAllopoly} function.  If you
have a different ploidy, number of individuals, or
number of alleles from the datasets simulated in the manuscript, you can
run your own simulations to estimate the accuracy of allele assignment.
By default, the alleles are given names that start with A, B, \emph{etc.} to 
indicate to which isolocus they belong, so that it is easy to see 
whether the output of \texttt{testAlGroups} is correct.  See
\texttt{?simAllopoly} for more information.  The \texttt{\detokenize{tables_figs.R}}
file that is included as supplementary information for the manuscript can
serve as a guide for how to run a large number of simulations and test 
their accuracy.

\end{document}
