FINEMAP-ing articles
- Refining fine-mapping: effect sizes and regional heritability. bioRxiv. (2018).
- Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
- FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).
FINEMAP is a program for
- (1) identifying causal SNPs
- (2) estimating effect sizes of causal SNPs
- (3) estimating the heritability contribution of causal SNPs
in genomic regions associated with complex traits and disease. FINEMAP is computationally efficient because it uses summary statistics from genome-wide association studies, and robust because it applies a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of the processing time of existing approaches. It is therefore well suited for analyzing the growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.
Download
(license)
- finemap_v1.4.2_MacOSX.tgz (Mac OS X)
- finemap_v1.4.2_x86_64.tgz (Unix)
- Updated 27-Apr-2023
- (1) Bugfixes
- finemap_v1.4.1_MacOSX.tgz (Mac OS X)
- finemap_v1.4.1_x86_64.tgz (Unix)
- Updated 04-Mar-2022
- (1) Minor bugfixes
- finemap_v1.4_MacOSX.tgz (Mac OS X)
- finemap_v1.4_x86_64.tgz (Unix)
- Updated 14-Feb-2020
- (1) Integration with LDstore2
- (2) Updated credible sets
- (3) Multi-threading and optimizations
- finemap_v1.3.1_MacOSX.tgz (Mac OS X)
- finemap_v1.3.1_x86_64.tgz (Unix)
- Documentation
- finemap_v1.3_MacOSX.tgz (Mac OS X)
- finemap_v1.3_x86_64.tgz (Unix)
- Documentation
- Mac OSX users: If you see dyld: Library not loaded: /usr/local/lib/libzstd.1.dylib, install Zstandard.
- finemap_v1.2_MacOSX.tgz (Mac OS X)
- finemap_v1.2_x86_64.tgz (Unix)
- Documentation
- Be aware that FINEMAP v1.1 cannot handle large effect size regions!
- finemap_v1.1_MacOSX.tgz (Mac OS X)
- finemap_v1.1_x86_64.tgz (Unix)
- Documentation
- To receive email reminders about updates of FINEMAP, send an email to finemap@christianbenner.com.
Command-line arguments
--cond | Fine-mapping with stepwise conditioning | Subprogram
--cond-pvalue | Option to set the p-value threshold for declaring genome-wide significance | Default is 5 × 10^-8
--config | Evaluate a single causal configuration without performing shotgun stochastic search | Subprogram
--corr-config | Option to set the posterior probability of a causal configuration to zero if it includes a pair of SNPs with absolute correlation above this threshold | Default is 0.95
--dataset | Option to specify a delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1|2) | All datasets are processed by default
--flip-beta | Option to read a column 'flip' in the Z file with binary indicators specifying if the direction of the estimated SNP effect sizes needs to be flipped to match SNP correlations | With --cond, --config and --sss
--force-n-samples | Option to allow correlations in a BCOR file to be computed on a set of samples with different size than GWAS sample size | With --cond, --config and --sss
--help | Command-line help |
--in-files | Master file (see below) | With --cond, --config and --sss
--log | Option to write output to log files specified in column 'log' in the master file | No log files are written by default
--n-causal-snps | Option to set the maximum number of allowed causal SNPs | Default is 5
--n-configs-top | Option to set the number of top causal configurations to be saved | Default is 50000
--n-conv-sss | Option to set the number of iterations that the added probability mass is required to be below the specified threshold (--prob-conv-sss-tol) before the shotgun stochastic search is terminated | Default is 100
--n-iter | Option to set the maximum number of iterations before the shotgun stochastic search is terminated | Default is 100000
--n-threads | Option to set the number of parallel threads | Default is 1
--prior-k | Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file | SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0 | Option to set the prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself | Default is 0.0
--prior-snps | Option to read a column 'prob' in the Z file with prior probabilities that a SNP is causal in order to define the prior probability for each causal configuration | With --sss
--prior-std | Option to specify a comma-separated list of prior standard deviations of effect sizes | Default is 0.05
--prob-conv-sss-tol | Option to set the tolerance at which the added probability mass (over --n-conv-sss iterations) is considered small enough to terminate the shotgun stochastic search | Default is 0.001
--prob-cred-set | Option to set the probability at which the credible interval includes a causal SNP | Default is 0.95
--pvalue-snps | Option to set a p-value threshold at which SNPs are included | Default is 1.0
--rsids | Option to specify a comma-separated list of SNP identifiers corresponding to the rsid column in Z files (see below) | With --config
--sss | Fine-mapping with shotgun stochastic search | Subprogram
--std-effects | Option to print mean and standard deviation of the posterior effect size distribution for standardized dosages | Default is allele dosage
Input
(1) Master file
The master file is a semicolon-separated text file and contains no spaces. It contains the following mandatory column names and one dataset per line.
z column contains the names of Z files (input)
ld column contains the names of LD files (input)
bcor column contains the names of BCOR files (input)
snp column contains the names of SNP files (output)
config column contains the names of CONFIG files (output)
cred column contains the names of CRED files (output)
n_samples column contains the GWAS sample sizes
k column contains the names of optional K files (optional input)
log column contains the names of optional LOG files (optional output)
File extensions must correspond with the column names in the header line!
The master file can contain columns ld and bcor simultaneously. For each dataset per line, entries need to be specified for precomputed SNP correlations in text format in column ld or in binary format in column bcor. If a line contains entries in both columns, then precomputed SNP correlations are used.
If SNP correlations specified in column bcor are computed on a set of samples with different size than specified in column n_samples then the --force-n-samples command-line argument can be used to override error checks.
A master file with two datasets using precomputed SNP correlations could look as follows.
z;ld;snp;config;cred;log;n_samples
dataset1.z;dataset1.ld;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363
dataset2.z;dataset2.ld;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363
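A master file in this layout can also be generated programmatically. The following is a minimal sketch, not part of FINEMAP: the helper name `build_master` and the dataset prefixes are illustrative assumptions, while the column names, the semicolon separator and the rule that file extensions match the column names follow the description above.

```python
# Illustrative helper (not part of FINEMAP): build master-file text for
# datasets that use precomputed SNP correlations in text format (column ld).
COLUMNS = ["z", "ld", "snp", "config", "cred", "log"]


def build_master(prefixes, n_samples):
    """Return master-file text with a header line and one dataset per line.

    prefixes  -- hypothetical file name prefixes, e.g. ["dataset1", "dataset2"]
    n_samples -- GWAS sample size of each dataset, in the same order
    """
    if len(prefixes) != len(n_samples):
        raise ValueError("need exactly one sample size per dataset")
    lines = [";".join(COLUMNS + ["n_samples"])]
    for prefix, n in zip(prefixes, n_samples):
        if " " in prefix:
            raise ValueError("the master file must not contain spaces")
        # File extensions must correspond with the column names in the header.
        files = [f"{prefix}.{col}" for col in COLUMNS]
        lines.append(";".join(files + [str(n)]))
    return "\n".join(lines) + "\n"


master_text = build_master(["dataset1", "dataset2"], [5363, 5363])
print(master_text, end="")
```

Writing `master_text` to a file then gives a master file that can be passed to --in-files.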
(2) Z file
The dataset.z file is a space-delimited text file and contains the GWAS summary statistics, one SNP per line. It contains the mandatory column names in the following order.
rsid column contains the SNP identifiers. The identifier can be a rsID number or a combination of chromosome name and genomic position (e.g. XXX:yyy)
chromosome column contains the chromosome names. The chromosome names can be chosen freely with precomputed SNP correlations (e.g. 'X', '0X' or 'chrX')
position column contains the base pair positions
allele1 column contains the "first" allele of the SNPs. In SNPTEST this corresponds to 'allele_A', whereas BOLT-LMM uses 'ALLELE1'
allele2 column contains the "second" allele of the SNPs. In SNPTEST this corresponds to 'allele_B', whereas BOLT-LMM uses 'ALLELE0'
maf column contains the minor allele frequencies
beta column contains the estimated effect sizes as given by GWAS software
se column contains the standard errors of effect sizes as given by GWAS software
flip optional column - see below
Columns beta and se are required for fine-mapping. Column maf is needed to output posterior effect size estimates on the allelic scale. All other columns are not required for computations and can be specified arbitrarily.
When using BCOR support, entries for each SNP in columns rsid, chromosome, position, allele1 and allele2 need to correspond with the information in BCOR files. The chromosome column may have to contain '0X' for X = 1,...,9, where X is the chromosome number.
It is recommended to compute all SNP correlations from allele counts of one of the alleles. In this case, estimated effect sizes and their standard errors from GWAS software can be used directly if the software always codes the same allele as the effect allele. This is the case in software SNPTEST (uses 'allele_B' as the effect allele) and BOLT-LMM (uses 'ALLELE1' as the effect allele). However, if the GWAS software codes the minor allele of the SNPs as the effect allele, then the direction of estimated effect sizes needs to be flipped to either the first or the second allele. This can be done by specifying the --flip-beta command-line argument and augmenting dataset.z by a flip column which contains 1 in a line if the direction of the estimated effect size of the SNP needs to be flipped and 0, otherwise.
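The flip rule can be sketched as follows. This is an illustration under stated assumptions, not FINEMAP code: the function name `flip_indicators` and its two inputs (the allele the GWAS software used as effect allele, and the allele whose counts were used to compute SNP correlations) are hypothetical names for information you would need to collect yourself.

```python
def flip_indicators(effect_alleles, counted_alleles):
    """Return 1 where the effect direction must be flipped, and 0 otherwise.

    effect_alleles  -- per SNP, the allele the GWAS software used as effect allele
    counted_alleles -- per SNP, the allele whose counts were used to compute
                       SNP correlations (same SNP order)
    """
    if len(effect_alleles) != len(counted_alleles):
        raise ValueError("both lists must describe the same SNPs")
    return [
        0 if effect.upper() == counted.upper() else 1
        for effect, counted in zip(effect_alleles, counted_alleles)
    ]


# Hypothetical example: the second SNP was analysed with respect to the
# other allele, so its effect direction has to be flipped.
effect = ["T", "G", "G"]
counted = ["T", "A", "G"]
flips = flip_indicators(effect, counted)
print(flips)
```

The resulting values would go into the flip column of dataset.z, to be read with --flip-beta.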
SNPs do not have to be ordered by genomic positions and can reside on different chromosomes. However, the order of SNPs in dataset.z must correspond to the order of SNPs in dataset.ld!
A dataset.z file with three SNPs could look as follows.
rsid chromosome position allele1 allele2 maf beta se
rs1 10 1 T C 0.35 0.0050 0.0208
rs2 10 1 A G 0.04 0.0368 0.0761
rs3 10 1 G A 0.18 0.0228 0.0199
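A Z file with the columns in the required order can be produced with a few lines of code. The sketch below is an illustration only: `format_z_file` is a hypothetical helper, and the sanity checks (all columns present, positive standard errors, no spaces inside fields) are our own precautions rather than FINEMAP's documented validation.

```python
# Mandatory column names of the Z file, in the order given above.
Z_COLUMNS = ["rsid", "chromosome", "position", "allele1", "allele2",
             "maf", "beta", "se"]


def format_z_file(snps):
    """Return Z-file text for a list of per-SNP dictionaries."""
    lines = [" ".join(Z_COLUMNS)]
    for snp in snps:
        missing = [col for col in Z_COLUMNS if col not in snp]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        if float(snp["se"]) <= 0:
            raise ValueError("standard errors must be positive")
        fields = [str(snp[col]) for col in Z_COLUMNS]
        if any(" " in field for field in fields):
            raise ValueError("fields must not contain spaces")
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"


# Values are given as strings here to keep their formatting unchanged.
snps = [
    {"rsid": "rs1", "chromosome": "10", "position": "1", "allele1": "T",
     "allele2": "C", "maf": "0.35", "beta": "0.0050", "se": "0.0208"},
    {"rsid": "rs2", "chromosome": "10", "position": "1", "allele1": "A",
     "allele2": "G", "maf": "0.04", "beta": "0.0368", "se": "0.0761"},
]
z_text = format_z_file(snps)
print(z_text, end="")
```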
(3) LD file
The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation).
Ideally, the SNP correlation matrix is computed from the genotype data on the same samples from which the GWAS summary statistics originate. Read here what could happen if SNP correlations from reference genotypes (e.g. 1000 Genomes Project) do not match well with the GWAS summary statistics.
With imputed biobank-scale genotype data, it is important to compute SNP correlations from the same genotype data used in GWAS software. Read here for an example highlighting the importance of computing SNP correlations from the same dosage data used in GWAS software. For example, if GWAS summary statistics are generated with BOLT-LMM using SNP dosages (e.g. when used with BGEN files), then SNP correlations need to be computed from the same SNP dosage data. The same applies to SNPTEST when using the -method expected option to deal with genotype uncertainty. If GWAS summary statistics are computed from SNP dosage data using BGEN files, we recommend using the LDstore2 software to compute SNP correlations and advise against converting genotype probabilities to best-guess genotypes in order to compute SNP correlations.
The order of the SNPs in the dataset.ld must correspond to the order of SNPs in dataset.z.
A dataset.ld file with three SNPs could look as follows.
1.00 0.95 0.98
0.95 1.00 0.96
0.98 0.96 1.00
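Because a Pearson correlation matrix is square and symmetric with ones on the diagonal, and because its SNP order must match dataset.z, a quick check before running FINEMAP can catch mismatched files. The sketch below is an assumption-laden illustration: `parse_ld` and `check_ld` are hypothetical helpers, and the tolerance is an arbitrary choice.

```python
def parse_ld(text):
    """Parse space-delimited LD text into a list of rows of floats."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def check_ld(matrix, n_snps, tol=1e-6):
    """Return a list of problems; an empty list means all checks passed.

    n_snps -- number of SNPs (data lines) in the corresponding Z file
    """
    if len(matrix) != n_snps or any(len(row) != n_snps for row in matrix):
        return [f"expected a {n_snps} x {n_snps} matrix"]
    problems = []
    for i in range(n_snps):
        if abs(matrix[i][i] - 1.0) > tol:
            problems.append(f"diagonal entry {i + 1} is not 1")
        for j in range(i + 1, n_snps):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append(
                    f"entries ({i + 1},{j + 1}) and ({j + 1},{i + 1}) differ")
            if abs(matrix[i][j]) > 1.0 + tol:
                problems.append(f"entry ({i + 1},{j + 1}) is outside [-1, 1]")
    return problems


symmetric = "1.00 0.95 0.98\n0.95 1.00 0.96\n0.98 0.96 1.00\n"
asymmetric = "1.00 0.95 0.98\n0.95 1.00 0.96\n0.90 0.96 1.00\n"
print(check_ld(parse_ld(symmetric), n_snps=3))
print(check_ld(parse_ld(asymmetric), n_snps=3))
```

These checks cannot detect SNPs listed in a different order in the two files; that still has to be guaranteed by how the files are produced.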
(4) BCOR file
See here for BCOR file format description.
(5) Optional K file
By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities pk = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.
We assume that the genomic region includes at least one causal SNP and thus p0 = 0. A non-zero prior probability p0 that there is no causal SNP in the genomic region can be specified with the command-line argument --prior-k0. This value is only used when computing posterior probabilities pk|data = Pr(# of causal SNPs is k | data) but not during fine-mapping itself. We further assume that pk = 0 for k = K + 1,...,m, where m is the number of SNPs in the dataset.z file.
A dataset.k file allowing for three causal SNPs with p1 = 0.6, p2 = 0.3 and p3 = 0.1 would look as follows.
0.6 0.3 0.1
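Since the entries of a K file are normalized to sum to one, any non-negative weights can be supplied. The sketch below shows what that normalization amounts to and how the file content could be written; `normalize_prior_k` and `format_k_file` are hypothetical helper names, not FINEMAP functions.

```python
def normalize_prior_k(weights):
    """Scale non-negative weights for k = 1,...,K so that they sum to one."""
    if not weights or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and non-empty")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in weights]


def format_k_file(weights):
    """Return the single space-delimited line of a K file."""
    return " ".join(str(w) for w in weights) + "\n"


# Unnormalized weights 6 : 3 : 1 correspond to p1 = 0.6, p2 = 0.3, p3 = 0.1.
prior_k = normalize_prior_k([6, 3, 1])
print(format_k_file(prior_k), end="")
```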
Output
(1) SNP file
The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP one per line.
index column contains the line numbers in which SNPs appear in the dataset.z file
rsid, chromosome, position, allele1 and allele2 columns are the SNP identifiers from the Z file
maf column contains the minor allele frequencies as given in the Z file
beta column contains the estimated effect sizes as given in the Z file
se column contains the standard errors of effect size estimates as given in the Z file
z column contains the z-scores
prob column contains the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l th SNP is the posterior probability that this SNP is causal.
log10bf column contains the log10 Bayes factors. The Bayes factor quantifies the evidence that the l th SNP is causal; log10 Bayes factors greater than 2 indicate considerable evidence.
mean column contains the marginalized shrinkage estimates of the posterior effect size mean for the same allele as in column beta. The marginalized shrinkage estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the dataset.config file assuming that the effect size of the l th SNP is zero if the SNP is absent from a causal configuration
sd column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean
mean_incl column contains the conditional estimates of the posterior effect size mean for the same allele as in column beta. The conditional estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in the dataset.config file in which it is included
sd_incl column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean
The PIPs in column prob are computed by summing up the posterior probabilities of all causal configurations in the dataset.config file in which the l th SNP is included. The PIPs sum to 1.0 if the maximum number of allowed causal SNPs is set to 1 with the --n-causal-snps command-line argument.
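The relation between configuration-level and SNP-level summaries can be sketched in a few lines. This is an illustration, not FINEMAP's implementation: `summarize_snps` is a hypothetical helper, the input numbers are made up, and we assume that the averages described for columns mean and mean_incl are weighted by the posterior probabilities of the causal configurations, which the text above does not state explicitly.

```python
def summarize_snps(configs):
    """Compute per-SNP PIP, marginalized mean and conditional mean.

    configs -- list of (means, prob) pairs, where `means` maps each SNP in a
               causal configuration to its posterior effect size mean and
               `prob` is the posterior probability of that configuration
    """
    pip = {}
    weighted_mean = {}
    for means, prob in configs:
        for rsid, mean in means.items():
            # PIP: sum of probabilities of configurations containing the SNP.
            pip[rsid] = pip.get(rsid, 0.0) + prob
            # Configurations without the SNP contribute an effect of zero.
            weighted_mean[rsid] = weighted_mean.get(rsid, 0.0) + prob * mean
    return {
        rsid: {
            "prob": pip[rsid],
            "mean": weighted_mean[rsid],
            # ASSUMPTION: conditional mean = weighted mean renormalized by PIP.
            "mean_incl": weighted_mean[rsid] / pip[rsid] if pip[rsid] > 0 else 0.0,
        }
        for rsid in pip
    }


# Made-up configurations whose posterior probabilities sum to one.
configs = [
    ({"rs1": 0.10}, 0.5),
    ({"rs2": 0.08}, 0.2),
    ({"rs1": 0.12, "rs2": 0.05}, 0.3),
]
summary = summarize_snps(configs)
for rsid, values in sorted(summary.items()):
    print(rsid, values)
```

If every configuration contains exactly one SNP, each configuration probability is counted once, so the PIPs add up to the total probability of the configurations.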
(2) CONFIG file
The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration one per line.
rank column contains the ranking
config column contains the SNP identifiers
prob column contains the posterior probabilities that configurations are the causal configuration
log10bf column contains the log10 Bayes factors. The Bayes factor quantifies the evidence for a causal configuration over the null configuration (no SNPs are causal)
odds column contains the odds of the top causal configurations
k column contains the number of SNPs of the top causal configurations
prob_norm_k column contains the posterior probabilities that configurations are the causal configuration normalized over the set of configurations with the same number of SNPs
h2 column contains the heritability contribution of SNPs
h2_0.95CI column contains the 95% credible interval of the heritability contribution of SNPs
mean column contains the joint posterior effect size means
sd column contains the joint posterior effect size standard deviations
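The normalization behind column prob_norm_k can be illustrated as follows. This is a sketch with made-up numbers, and `prob_norm_k` is a hypothetical helper that only mirrors the description above: each posterior probability is divided by the total probability of the configurations with the same number of SNPs.

```python
def prob_norm_k(configs):
    """Normalize configuration probabilities within each configuration size.

    configs -- list of (snp_ids, prob) pairs, one per causal configuration
    Returns one normalized probability per configuration, in input order.
    """
    totals = {}
    for snp_ids, prob in configs:
        k = len(snp_ids)
        totals[k] = totals.get(k, 0.0) + prob
    return [
        prob / totals[len(snp_ids)] if totals[len(snp_ids)] > 0 else 0.0
        for snp_ids, prob in configs
    ]


# Made-up example: two 1-SNP configurations and one 2-SNP configuration.
configs_k = [(("rs1",), 0.5), (("rs2",), 0.2), (("rs1", "rs2"), 0.3)]
normalized = prob_norm_k(configs_k)
print(normalized)
```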
(3) CRED file
The dataset.cred file is a space-delimited text file. It contains the 95% credible sets for each causal signal in the genomic region. For each credible set, the following posterior summaries are provided:
posterior probabilities for each SNP of being in the credible set
minimum, average and median absolute correlation between SNPs in the credible set
log10 Bayes factors quantifying the evidence that there is a causal signal in addition to the other causal signals
CRED files are generated for those cases of k causal SNPs in the genomic region that have largest posterior probability. For specific k, FINEMAP takes the k-SNP causal configuration with highest posterior probability and then asks, for the l th SNP in that set, which are the other candidates that could possibly replace that SNP in this causal configuration. The l th credible set shows the best candidate SNPs and their posterior probability of being in a k-SNP causal configuration that additionally contains k - 1 SNPs. Note that the k - 1 SNPs are chosen to have highest posterior probability in their credible set.
(4) LOG file
The dataset.log file outputs additional information. It contains the following output.
Posterior probabilities pk|data = Pr(# of causal SNPs in the genomic region is k | data) for k = 1,...,K, where K is the maximum number of allowed causal SNPs
Expected number of causal SNPs in the genomic region
A log10 Bayes factor to quantify the evidence of at least one causal SNP in the genomic region
Model-averaged heritability and 95% credible interval to quantify the contribution from causal SNPs
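The expected number of causal SNPs follows from the posterior probabilities pk|data as the sum of k × pk|data over k. A minimal sketch with made-up probabilities is shown below; `expected_n_causal` is a hypothetical helper and does not parse LOG files.

```python
def expected_n_causal(post_k, tol=1e-6):
    """Return the posterior expected number of causal SNPs.

    post_k -- dictionary mapping k to Pr(# of causal SNPs is k | data);
              k = 0 may be included when --prior-k0 is used
    """
    total = sum(post_k.values())
    if abs(total - 1.0) > tol:
        raise ValueError("posterior probabilities should sum to one")
    return sum(k * p for k, p in post_k.items())


# Made-up posterior probabilities for k = 1, 2, 3.
post_k = {1: 0.2, 2: 0.7, 3: 0.1}
expected_k = expected_n_causal(post_k)
print(round(expected_k, 3))
```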
Fine-mapping example
Using genotype data with 55 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain
Fine-mapping the SNPs in genomic region 1 in the example folder using shotgun stochastic search is done as follows.
./finemap_v1.4_MacOSX --sss --in-files example/data --dataset 1
./finemap_v1.4_x86_64 --sss --in-files example/data --dataset 1
Fine-mapping the SNPs in genomic region 2 in the example folder using stepwise conditional search is done as follows.
./finemap_v1.4_MacOSX --cond --in-files example/data --dataset 2
./finemap_v1.4_x86_64 --cond --in-files example/data --dataset 2
The stepwise conditioning procedure is similar to the implementation in GCTA COJO.
Single causal configuration example
The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtained by specifying SNP identifiers as follows.
./finemap_v1.4_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.4_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11
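When many regions are analysed, the command lines above can be assembled in a script. The sketch below only builds the argument list from options documented on this page; `finemap_command` is a hypothetical wrapper, and the binary path and master file name are taken from the examples rather than verified.

```python
import shlex

# The three subprograms listed under the command-line arguments.
SUBPROGRAMS = {"sss": "--sss", "cond": "--cond", "config": "--config"}


def finemap_command(binary, subprogram, master_file,
                    datasets=None, n_causal_snps=None, rsids=None):
    """Return a FINEMAP command line as a list of arguments."""
    if subprogram not in SUBPROGRAMS:
        raise ValueError(f"unknown subprogram: {subprogram}")
    cmd = [binary, SUBPROGRAMS[subprogram], "--in-files", master_file]
    if datasets:
        cmd += ["--dataset", ",".join(str(d) for d in datasets)]
    if n_causal_snps is not None:
        cmd += ["--n-causal-snps", str(n_causal_snps)]
    if rsids:
        # --rsids is documented for use with --config only.
        if subprogram != "config":
            raise ValueError("--rsids is only used with --config")
        cmd += ["--rsids", ",".join(rsids)]
    return cmd


cmd = finemap_command("./finemap_v1.4_x86_64", "sss", "example/data",
                      datasets=[1])
print(shlex.join(cmd))
```

The returned list could be handed to `subprocess.run(cmd, check=True)` to execute FINEMAP, provided the binary and the example files exist at those paths.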
References
Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).
Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).
Acknowledgements
Matti Pirinen contributed to the design and implementation of FINEMAP.