Christian Benner

Command-line arguments | Input | Output | Examples

FINEMAP is a program for

1identifying causal SNPs
2estimating effect sizes of causal SNPs
3estimating the heritability contribution of causal SNPs

in genomic regions associated with complex traits and disease. FINEMAP is computationally efficient by using summary statistics from genome-wide association studies and robust by using a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of processing time of existing approaches. It is therefore the ideal tool for analyzing growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.

Download

(license)

finemap_v1.2_MacOSX.tgz (Mac OS X)
finemap_v1.2_x86_64.tgz (Unix)
(Updated 9-May-2018)
To receive email reminders about updates of FINEMAP, send an email to finemap@christianbenner.com.

Command-line arguments

--config	Evaluate a single causal configuration without performing shotgun stochastic search	Subprogram
--corr-config	The posterior probability of a causal configuration is set to zero if it includes a pair of SNPs with absolute correlation above this threshold	Default is 0.95 (with --sss)
--dataset	Option to specify delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1\|2)	All datasets are processed by default
--help	Command-line help
--in-files	Master file (see below)	With --sss/--config
--log	Option to write output to log files specified in column 'log' in the master file	No log files are written by default
--n-causal-snps	Maximum number of allowed causal SNPs	Default is 5
--n-configs-top	Number of top causal configurations to be saved	Default is 50000
--n-convergence	Number of iterations that the added probability mass is required to be below the specified threshold (--prob-tol) before the shotgun stochastic search is terminated	Default is 1000
--n-iterations	Maximum number of iterations before the shotgun stochastic search is terminated	Default is 100000
--prior-k	Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file	SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0	Prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself	Default is 0.0
--prior-std	Comma-separated list of prior standard deviations of effect sizes.	Default is 0.05
--prob-tol	Tolerance at which the added probability mass (over --n-convergence iterations) is considered small enough to terminate the shotgun stochastic search	Default is 0.001
--rsids	Comma-separated list of SNP identifiers corresponding with the rsid column in Z files (see below)	With --config
--sss	Fine-mapping with shotgun stochastic search	Subprogram

Input

(1) Master file

The master file is a semicolon-separated text file and contains no space. It contains the following column names and one dataset per line.

z column contains the names of Z files (input).
ld column contains the names of LD files (input).
snp column contains the names of SNP files (output).
config column contains the names of CONFIG files (output).
n_samples column contains the GWAS sample sizes.
k column contains the optional K files (input).
log column contains the optional LOG files (output).

File extensions must correspond with the column names in the header line!

A master file with two datasets could look as follows.

z;ld;snp;config;log;n_samples

dataset1.z;dataset1.ld;dataset1.snp;dataset1.config;dataset1.log;5363

dataset2.z;dataset2.ld;dataset2.snp;dataset2.config;dataset2.log;5363

(2) Z file

The dataset.z file is a space-delimited text file and contains the GWAS summary statistics one SNP per line. It contains exactly the column names in the following order.

rsid column contains the SNP identifiers. The identifier can be a rsID number or a combination of chromosome name and genomic position (e.g. XXX:yyy).
chromosome column contains the chromosome names. The chromosome names can be chosen freely, e.g. 'X', '0X' or 'chrX'.
position column contains the base pair positions
noneff_allele column contains the non-effect alleles.
eff_allele column contains the effect alleles. The effect allele is the allele that corresponds to the effect size parameter in GWAS software. SNPTEST uses 'allele_B' as the effect allele, whereas BOLT-LMM uses 'ALLELE1' as the effect allele.
maf column contains the minor allele frequencies.
beta column contains the estimated effect sizes.
se column contains the standard errors of effect sizes.

Columns beta and se are required for fine-mapping. Column maf is needed to output posterior effect size estimates on the allelic scale. All other columns are not required for computations and can be specified arbitrarily.

A dataset.z file with three SNPs could look as follows.

rsid chromosome position noneff_allele eff_allele maf beta se

rs1 10 1 T C 0.35 0.0050 0.0208

rs2 10 1 A G 0.04 0.0368 0.0761

rs3 10 1 G A 0.18 0.0228 0.0199

SNPs do not have to be ordered by genomic positions and can reside on different chromsomes. However, the order of SNPs in dataset.z must correspond to the order of SNPs in dataset.ld!

(3) LD file

The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation). The order of the SNPs in the dataset.ld must correspond to the order of variants in dataset.z.

A dataset.ld file with three SNPs could look as follows.

1.00 0.95 0.98

0.95 1.00 0.96

0.97 0.96 1.00

Ideally, the SNP correlation matrix is computed from the genotype data on the same samples from which the GWAS summary statistics orginate. Read here what could happen if SNP correlations from reference genotypes (e.g. 1000 Genomes Project) do not match well with the GWAS summary statistics.

(4) Optional K file

By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities p_k = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.

A dataset.k file allowing for three causal SNPs with p₁ = 0.6, p₂ = 0.3 and p₃ = 0.1 would look as follows.

0.6 0.3 0.1

We assume that the genomic region includes at least one causal SNP and thus p₀ = 0. A non-zero prior probability p₀ that there is no causal SNP in the genomic region can be specified with the command-line argument --prior-k0. This value is only used when computing posterior probabilities p_k|data = Pr(# of causal SNPs is k | data) but not during fine-mapping itself. We further assume that p_k = 0 for k = K +1,...,m, where m is the number of SNPs in the dataset.z file.

Output

(1) SNP file

The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP one per line.

index column contains the line numbers in which SNPs appear in the dataset.z file.
rsid column contains the SNP identifiers.
chromosome column contains the chromosome names.
position column contains the base pair positions
noneff_allele column contains the non-effect alleles.
eff_allele column contains the effect alleles. The effect allele is the allele that corresponds to the effect size parameter in GWAS software.
maf column contains the minor allele frequencies.
beta column contains the estimated effect sizes.
se column contains the standard errors of effect size estimates.
z column contains the z-scores.
prob column contains the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l th SNP is the posterior probability that this SNP is causal. It is computed by summing up the posterior probabilities of all causal configurations in the dataset.config file in which l th SNP is included.
log10bf column contains the log₁₀ Bayes factors. The Bayes factor quantifies the evidence that the l th SNP is causal with log₁₀ Bayes factors greater than 2 reporting considerable evidence.
mean column contains the marginalized shrinkage estimates of the posterior effect size mean. The marginalized shrinkage estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the dataset.config file assuming that the effect size of the l th SNP is zero if the SNP is absent from a causal configuration.
sd column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean.
mean_incl column contains the conditional estimates of the posterior effect size mean. The conditional estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in the dataset.config file in which it is included.
sd_incl column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean.

(2) CONFIG file

The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration one per line.

rank column contains the ranking.
config column contains the SNP identifiers.
prob column contains the posterior probabilities that configurations are the causal configuration.
log10bf column contains the log₁₀ Bayes factors. The Bayes factor quantifies the evidence for a causal configuration over the null configuration (no SNPs are causal).
odds column contains the odds of the top causal configurations.
h2 column contains the heritability contribution of SNPs.
h2_0.95CI column contains the 95% credible interval of the heritability contribution of SNPs.
mean column contains the joint posterior effect size means.
sd column contains the joint posterior effect size standard deviations.

(3) LOG file

The dataset.log file outputs additional information. It contains the following output.

Posterior probabilities Pr(# of causal SNPs is k | data) for k = 1,...,K, where K is the maximum number of allowed causal SNPs.
A log₁₀ Bayes factor to quantify the evidence of at least one causal SNP in the genomic region.
Model-averaged heritability and 95% credible interval to quantify the contribution from causal SNPs.

Fine-mapping example

Using genotype data with 50 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain z-scores. SNP correlations were computed from GWAS genotype data.

Fine-mapping the SNPs in genomic region 1 in the example folder is done follows.

./finemap_v1.2_MacOSX --sss --in-files example/data --dataset 1
./finemap_v1.2_x86_64 --sss --in-files example/data --dataset 1

Single causal configuration example

The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtain by specifying SNP identifiers as follows

./finemap_v1.2_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.2_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11

References

Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).

Acknowledgements

Matti Pirinen contributed to the design and implementation of FINEMAP.