dive_phe2mash function v1 to test on several plant species

2021-03-30 18:33:06 -05:00
parent 9581596314
commit 027323acf6
22 changed files with 1964 additions and 339 deletions
--- a/man/div_gwas.Rd
+++ b/man/div_gwas.Rd
@@ -0,0 +1,34 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{div_gwas}
+\alias{div_gwas}
+\title{Wrapper for bigsnpr for GWAS}
+\usage{
+div_gwas(df, snp, type, svd, npcs)
+}
+\arguments{
+\item{df}{Dataframe of phenotypes where the first column is sample.ID}
+
+\item{snp}{Genomic information to include for wheat.}
+
+\item{type}{Character string. Type of univariate regression to run for GWAS.
+Options are "linear" or "logistic".}
+
+\item{svd}{Optional covariance matrix to include in the regression. You
+can generate these using \code{bigsnpr::snp_autoSVD()}.}
+
+\item{npcs}{Integer. Number of PCs to use for population structure correction.}
+}
+\value{
+The GWAS results for the last phenotype in the dataframe. Results for
+that phenotype, as well as the remaining phenotypes, are saved as RDS objects
+in the working directory.
+}
+\description{
+Given a dataframe of phenotypes associated with sample.IDs, this
+function is a wrapper around bigsnpr functions to conduct linear or
+logistic regression on wheat. The main advantages of this
+function over calling the bigsnpr functions directly are that it
+automatically removes individual genotypes with missing phenotypic data
+and that it can run GWAS on multiple phenotypes sequentially.
+}
--- a/man/div_lambda_GC.Rd
+++ b/man/div_lambda_GC.Rd
@@ -0,0 +1,48 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{div_lambda_GC}
+\alias{div_lambda_GC}
+\title{Return lambda_GC for different numbers of PCs for GWAS on Panicum virgatum.}
+\usage{
+div_lambda_GC(
+  df,
+  type = c("linear", "logistic"),
+  snp,
+  svd = NA,
+  ncores = 1,
+  npcs = c(0:10),
+  saveoutput = FALSE
+)
+}
+\arguments{
+\item{df}{Dataframe of phenotypes where the first column is sample.ID and each
+sample.ID occurs only once in the dataframe.}
+
+\item{type}{Character string. Type of univariate regression to run for GWAS.
+Options are "linear" or "logistic".}
+
+\item{snp}{A bigSNP object with sample.IDs that match the df.}
+
+\item{svd}{big_SVD object; Covariance matrix to include in the regression.
+Generate these using \code{bigsnpr::snp_autoSVD()}.}
+
+\item{ncores}{Number of cores to use. Default is one.}
+
+\item{npcs}{Integer vector of numbers of principal components to use.
+Defaults to c(0:10).}
+
+\item{saveoutput}{Logical. Should output be saved as a csv to the
+working directory?}
+}
+\value{
+A dataframe containing the lambda_GC values for each number of PCs specified.
+If saveoutput is TRUE, this is also saved as a .csv file in the working directory.
+}
+\description{
+Given a dataframe of phenotypes associated with sample.IDs and
+output from a PCA to control for population structure, this function will
+return a .csv file of the lambda_GC values for the GWAS upon inclusion
+of different numbers of PCs. This allows the user to choose a number of
+PCs that returns a lambda_GC close to 1, and thus ensure that they have
+done adequate correction for population structure.
+}
--- a/man/dive_phe2mash.Rd
+++ b/man/dive_phe2mash.Rd
@@ -0,0 +1,89 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{dive_phe2mash}
+\alias{dive_phe2mash}
+\title{Wrapper to run mash given a phenotype data frame}
+\usage{
+dive_phe2mash(
+  df,
+  snp,
+  type = "linear",
+  svd = NULL,
+  suffix = "",
+  outputdir = ".",
+  min.phe = 200,
+  save.plots = TRUE,
+  thr.r2 = 0.2,
+  thr.m = c("sum", "max"),
+  num.strong = 1000,
+  num.random = NA,
+  scale.phe = TRUE,
+  roll.size = 50,
+  U.ed = NA,
+  U.hyp = NA
+)
+}
+\arguments{
+\item{df}{Dataframe containing phenotypes for mash where the first column is
+'sample.ID', which should match values in the snp$fam$sample.ID column.}
+
+\item{snp}{A "bigSNP" object; load with \code{snp_attach()}.}
+
+\item{type}{Character string, or a character vector the length of the number
+of phenotypes. Type of univariate regression to run for GWAS.
+Options are "linear" or "logistic".}
+
+\item{svd}{A "big_SVD" object; Optional covariance matrix to use for
+population structure correction.}
+
+\item{suffix}{Optional character vector to give saved files a unique search string/name.}
+
+\item{outputdir}{Optional file path to save output files.}
+
+\item{min.phe}{Integer. Minimum number of individuals phenotyped in order to
+include that phenotype in GWAS. Default is 200. Use lower values with
+caution.}
+
+\item{save.plots}{Logical. Should Manhattan and QQ-plots be generated and
+saved to the working directory for univariate GWAS? Default is TRUE.}
+
+\item{thr.r2}{Value between 0 and 1. Threshold of r2 measure of linkage
+disequilibrium. Markers in higher LD than this will be subset using clumping.}
+
+\item{thr.m}{"sum" or "max". Type of threshold to use to clump values for
+mash inputs. "sum" sums the -log10pvalues for each phenotype and uses
+the maximum of this value as the threshold. "max" uses the maximum
+-log10pvalue for each SNP across all of the univariate GWAS.}
+
+\item{num.strong}{Integer. Number of SNPs used to derive data-driven covariance
+matrix patterns, using markers with strong effects on phenotypes.}
+
+\item{num.random}{Integer. Number of SNPs used to derive the correlation structure
+of the null tests, and the mash fit on the null tests.}
+
+\item{scale.phe}{Logical. Should effects for each phenotype be scaled to fall
+between -1 and 1? Default is TRUE.}
+
+\item{roll.size}{Integer. Used to create the svd for GWAS.}
+
+\item{U.ed}{Mash data-driven covariance matrices. Specify these as a list or a path
+to a file saved as an .rds. Creating these can be time-consuming, and
+generating these once and reusing them for multiple mash runs can save time.}
+
+\item{U.hyp}{Other covariance matrices for mash. Specify these as a list. These
+matrices must have dimensions that match the number of phenotypes where
+univariate GWAS ran successfully.}
+}
+\value{
+A mash object made up of all phenotypes where univariate GWAS ran
+successfully.
+}
+\description{
+Though a step-by-step workflow of GWAS, preparation of mash inputs, and mash
+allows you the most flexibility and opportunities to check your results
+for errors, once those sanity checks are complete, this function allows
+you to go from a phenotype data.frame of a few phenotypes you want to
+compare to a mash result. Some exception handling has been built into
+this function, but the user should stay cautious and skeptical of any
+results that seem 'too good to be true'.
+}
--- a/man/get_best_PC_df.Rd
+++ b/man/get_best_PC_df.Rd
@@ -0,0 +1,22 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{get_best_PC_df}
+\alias{get_best_PC_df}
+\title{Return the best number of PCs in terms of lambda_GC for Panicum virgatum
+or the CDBN.}
+\usage{
+get_best_PC_df(df)
+}
+\arguments{
+\item{df}{Dataframe of phenotypes where the first column is NumPCs and
+each subsequent column contains lambda_GC values for one phenotype.}
+}
+\value{
+A dataframe containing the best lambda_GC value and number of PCs
+for each phenotype in the data frame.
+}
+\description{
+Given a dataframe created using div_lambda_GC, this function
+returns the first lambda_GC less than 1.05, or the smallest lambda_GC,
+for each column in the dataframe.
+}
--- a/man/get_lambdagc.Rd
+++ b/man/get_lambdagc.Rd
@@ -0,0 +1,19 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{get_lambdagc}
+\alias{get_lambdagc}
+\title{Find lambda_GC value for non-NA p-values}
+\usage{
+get_lambdagc(ps, tol = 1e-08)
+}
+\arguments{
+\item{ps}{Numeric vector of p-values. May contain NAs.}
+
+\item{tol}{Numeric. Tolerance for optional Genomic Control coefficient.}
+}
+\value{
+A lambda GC value (some positive number, ideally ~1)
+}
+\description{
+Finds the lambda GC value for some vector of p-values.
+}
--- a/man/get_qqplot.Rd
+++ b/man/get_qqplot.Rd
@@ -0,0 +1,31 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{get_qqplot}
+\alias{get_qqplot}
+\title{Create a quantile-quantile plot with ggplot2.}
+\usage{
+get_qqplot(ps, ci = 0.95, lambdaGC = FALSE, tol = 1e-08)
+}
+\arguments{
+\item{ps}{Numeric vector of p-values.}
+
+\item{ci}{Numeric. Size of the confidence interval, 0.95 by default.}
+
+\item{lambdaGC}{Logical. Add the Genomic Control coefficient as subtitle to
+the plot?}
+
+\item{tol}{Numeric. Tolerance for optional Genomic Control coefficient.}
+}
+\value{
+A ggplot2 plot.
+}
+\description{
+Assumptions for this quantile-quantile plot:
+Expected P values are uniformly distributed.
+Confidence intervals assume independence between tests.
+We expect deviations past the confidence intervals if the tests are
+not independent.
+For example, in a genome-wide association study, the genotype at any
+position is correlated to nearby positions. Tests of nearby genotypes
+will result in similar test statistics.
+}
--- a/man/pipe.Rd
+++ b/man/pipe.Rd
@@ -0,0 +1,12 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/utils-pipe.R
+\name{\%>\%}
+\alias{\%>\%}
+\title{Pipe operator}
+\usage{
+lhs \%>\% rhs
+}
+\description{
+See \code{magrittr::\link[magrittr:pipe]{\%>\%}} for details.
+}
+\keyword{internal}
--- a/man/round2.Rd
+++ b/man/round2.Rd
@@ -0,0 +1,20 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{round2}
+\alias{round2}
+\title{Return a number rounded to some number of digits}
+\usage{
+round2(x, at)
+}
+\arguments{
+\item{x}{A number or vector of numbers}
+
+\item{at}{Numeric. Rounding factor or size of the bin to round to.}
+}
+\value{
+A number or vector of numbers
+}
+\description{
+Given some x, return the number rounded to some number of
+digits.
+}
--- a/man/round_xy.Rd
+++ b/man/round_xy.Rd
@@ -0,0 +1,31 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/wrapper.R
+\name{round_xy}
+\alias{round_xy}
+\title{Return a dataframe binned into 2-d bins by some x and y.}
+\usage{
+round_xy(x, y, cl = NA, cu = NA, roundby = 0.001)
+}
+\arguments{
+\item{x}{Numeric vector. The first vector for binning.}
+
+\item{y}{Numeric vector. The second vector for binning.}
+
+\item{cl}{Numeric vector. Optional confidence interval for the y vector,
+lower bound.}
+
+\item{cu}{Numeric vector. Optional confidence interval for the y vector,
+upper bound.}
+
+\item{roundby}{Numeric. The amount to round the x and y vectors by for 2d
+binning.}
+}
+\value{
+A dataframe containing the 2-d binned values for x and y, and their
+confidence intervals.
+}
+\description{
+Given vectors of x and y values (with some optional
+confidence intervals surrounding the y values), return only the unique
+values of x and y in some set of 2-d bins.
+}
--- a/man/tidyeval.Rd
+++ b/man/tidyeval.Rd
@@ -0,0 +1,51 @@
+% Generated by roxygen2: do not edit by hand
+% Please edit documentation in R/utils-tidy-eval.R
+\name{tidyeval}
+\alias{tidyeval}
+\alias{expr}
+\alias{enquo}
+\alias{enquos}
+\alias{sym}
+\alias{syms}
+\alias{.data}
+\alias{:=}
+\alias{as_name}
+\alias{as_label}
+\title{Tidy eval helpers}
+\description{
+\itemize{
+\item \code{\link[rlang]{sym}()} creates a symbol from a string and
+\code{\link[rlang:sym]{syms}()} creates a list of symbols from a
+character vector.
+\item \code{\link[rlang:nse-defuse]{enquo}()} and
+\code{\link[rlang:nse-defuse]{enquos}()} delay the execution of one or
+several function arguments. \code{enquo()} returns a single quoted
+expression, which is like a blueprint for the delayed computation.
+\code{enquos()} returns a list of such quoted expressions.
+\item \code{\link[rlang:nse-defuse]{expr}()} quotes a new expression \emph{locally}. It
+is mostly useful to build new expressions around arguments
+captured with \code{\link[=enquo]{enquo()}} or \code{\link[=enquos]{enquos()}}:
+\code{expr(mean(!!enquo(arg), na.rm = TRUE))}.
+\item \code{\link[rlang]{as_name}()} transforms a quoted variable name
+into a string. Supplying something else than a quoted variable
+name is an error.
+
+That's unlike \code{\link[rlang]{as_label}()} which also returns
+a single string but supports any kind of R object as input,
+including quoted function calls and vectors. Its purpose is to
+summarise that object into a single label. That label is often
+suitable as a default name.
+
+If you don't know what a quoted expression contains (for instance
+expressions captured with \code{enquo()} could be a variable
+name, a call to a function, or an unquoted constant), then use
+\code{as_label()}. If you know you have quoted a simple variable
+name, or would like to enforce this, use \code{as_name()}.
+}
+
+To learn more about tidy eval and how to use these tools, visit
+\url{https://tidyeval.tidyverse.org} and the
+\href{https://adv-r.hadley.nz/metaprogramming.html}{Metaprogramming
+section} of \href{https://adv-r.hadley.nz}{Advanced R}.
+}
+\keyword{internal}