Christian Benner


FINEMAP-ing articles

-		Refining fine-mapping: effect sizes and regional heritability. bioRxiv. (2018).
-		Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).
-		FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

FINEMAP is a program for

1. identifying causal SNPs
2. estimating effect sizes of causal SNPs
3. estimating the heritability contribution of causal SNPs

in genomic regions associated with complex traits and diseases. FINEMAP is computationally efficient because it uses summary statistics from genome-wide association studies, and robust because it applies a shotgun stochastic search algorithm (Hans et al., 2007). It produces accurate results in a fraction of the processing time of existing approaches. It is therefore well suited for analyzing the growing amounts of data produced in genome-wide association studies and emerging sequencing or biobank projects.

Download

(license)

finemap_v1.4.2_MacOSX.tgz (Mac OS X)
finemap_v1.4.2_x86_64.tgz (Unix)
Updated 27-Apr-2023
(1) Bugfixes
finemap_v1.4.1_MacOSX.tgz (Mac OS X)
finemap_v1.4.1_x86_64.tgz (Unix)
Updated 04-Mar-2022
(1) Minor bugfixes
finemap_v1.4_MacOSX.tgz (Mac OS X)
finemap_v1.4_x86_64.tgz (Unix)
Updated 14-Feb-2020
(1) Integration with LDstore2
(2) Updated credible sets
(3) Multi-threading and optimizations
finemap_v1.3.1_MacOSX.tgz (Mac OS X)
finemap_v1.3.1_x86_64.tgz (Unix)
Documentation
finemap_v1.3_MacOSX.tgz (Mac OS X)
finemap_v1.3_x86_64.tgz (Unix)
Documentation
Mac OSX users: If you see dyld: Library not loaded: /usr/local/lib/libzstd.1.dylib, install Zstandard.
finemap_v1.2_MacOSX.tgz (Mac OS X)
finemap_v1.2_x86_64.tgz (Unix)
Documentation
Be aware that FINEMAP v1.1 cannot handle large effect size regions!
finemap_v1.1_MacOSX.tgz (Mac OS X)
finemap_v1.1_x86_64.tgz (Unix)
Documentation
To receive email reminders about updates of FINEMAP, send an email to finemap@christianbenner.com.

Command-line arguments

--cond	Fine-mapping with stepwise conditioning	Subprogram
--cond-pvalue	Option to set the p-value threshold for declaring genome-wide significance	Default is 5 × 10^-8
--config	Evaluate a single causal configuration without performing shotgun stochastic search	Subprogram
--corr-config	Option to set the posterior probability of a causal configuration to zero if it includes a pair of SNPs with absolute correlation above this threshold	Default is 0.95
--dataset	Option to specify a delimiter-separated list of datasets for fine-mapping as given in the master file (e.g. 1,2 or 1|2)	All datasets are processed by default
--flip-beta	Option to read a column 'flip' in the Z file with binary indicators specifying if the direction of the estimated SNP effect sizes needs to be flipped to match SNP correlations	With --cond, --config and --sss
--force-n-samples	Option to allow correlations in a BCOR file to be computed on a set of samples of a different size than the GWAS sample size	With --cond, --config and --sss
--help	Command-line help
--in-files	Master file (see below)	With --cond, --config and --sss
--log	Option to write output to log files specified in column 'log' in the master file	No log files are written by default
--n-causal-snps	Option to set the maximum number of allowed causal SNPs	Default is 5
--n-configs-top	Option to set the number of top causal configurations to be saved	Default is 50000
--n-conv-sss	Option to set the number of iterations that the added probability mass is required to be below the specified threshold (--prob-conv-sss-tol) before the shotgun stochastic search is terminated	Default is 100
--n-iter	Option to set the maximum number of iterations before the shotgun stochastic search is terminated	Default is 100000
--n-threads	Option to set the number of parallel threads	Default is 1
--prior-k	Option to use prior probabilities for the number of causal SNPs as specified in K files (see below) in the master file	SNPs are by default assumed to be causal with probability 1 / (# of SNPs in the genomic region)
--prior-k0	Option to set the prior probability that there is no causal SNP in the genomic region. Only used when computing posterior probabilities for the number of causal SNPs but not during fine-mapping itself	Default is 0.0
--prior-snps	Option to read a column 'prob' in the Z file with prior probabilities that a SNP is causal in order to define the prior probability for each causal configuration	With --sss
--prior-std	Option to specify a comma-separated list of prior standard deviations of effect sizes.	Default is 0.05
--prob-conv-sss-tol	Option to set the tolerance at which the added probability mass (over --n-conv-sss iterations) is considered small enough to terminate the shotgun stochastic search	Default is 0.001
--prob-cred-set	Option to set the probability at which the credible interval includes a causal SNP	Default is 0.95
--pvalue-snps	Option to set a p-value threshold at which SNPs are included	Default is 1.0
--rsids	Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in Z files (see below)	With --config
--sss	Fine-mapping with shotgun stochastic search	Subprogram
--std-effects	Option to print mean and standard deviation of the posterior effect size distribution for standardized dosages	Default is allele dosage
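
The interplay of --n-iter, --n-conv-sss and --prob-conv-sss-tol can be pictured with the short Python sketch below. It is only one plausible reading of the option descriptions above, not FINEMAP's actual implementation. The function name stopping_iteration and its input added_mass (the probability mass added in each iteration) are hypothetical.

```python
def stopping_iteration(added_mass, n_conv_sss=100, tol=0.001, n_iter=100000):
    """Return the iteration at which the search would stop, or None.

    Assumption: the search stops once the added probability mass has been
    below `tol` for `n_conv_sss` consecutive iterations, or once `n_iter`
    iterations have been performed, whichever comes first.
    """
    quiet = 0  # number of consecutive iterations with added mass below tol
    for iteration, mass in enumerate(added_mass, start=1):
        quiet = quiet + 1 if mass < tol else 0
        if quiet >= n_conv_sss or iteration >= n_iter:
            return iteration
    return None  # the sequence ended before any stopping rule applied


# Toy sequence: one informative iteration followed by three quiet ones.
stop = stopping_iteration([0.5, 0.0001, 0.0001, 0.0001], n_conv_sss=3)
```

With the defaults, at least 100 quiet iterations in a row would be needed before the 100000-iteration cap, under the stated assumption.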

Input

(1) Master file

The master file is a semicolon-separated text file and contains no space. It contains the following mandatory column names and one dataset per line.

z column contains the names of Z files (input)
ld column contains the names of LD files (input)
bcor column contains the names of BCOR files (input)
snp column contains the names of SNP files (output)
config column contains the names of CONFIG files (output)
cred column contains the names of CRED files (output)
n_samples column contains the GWAS sample sizes
k column contains the names of optional K files (optional input)
log column contains the names of optional LOG files (optional output)

File extensions must correspond with the column names in the header line!
The master file can contain columns ld and bcor simultaneously. For each dataset (one per line), entries need to be specified for precomputed SNP correlations in text format in column ld or in binary format in column bcor. If a line contains entries in both columns, then precomputed SNP correlations are used.
If SNP correlations specified in column bcor are computed on a set of samples with different size than specified in column n_samples then the --force-n-samples command-line argument can be used to override error checks.
A master file with two datasets using precomputed SNP correlations could look as follows.

z;ld;snp;config;cred;log;n_samples

dataset1.z;dataset1.ld;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363

dataset2.z;dataset2.ld;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363
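
As a minimal sketch of the rules above (semicolon-separated, no spaces, file extensions matching the column names), a master file can be checked in Python before running FINEMAP. The helper read_master and the set FILE_COLUMNS are hypothetical names introduced for illustration; FINEMAP performs its own checks.

```python
import csv
import io

# Example master file content copied from the text above.
MASTER_TEXT = """z;ld;snp;config;cred;log;n_samples
dataset1.z;dataset1.ld;dataset1.snp;dataset1.config;dataset1.cred;dataset1.log;5363
dataset2.z;dataset2.ld;dataset2.snp;dataset2.config;dataset2.cred;dataset2.log;5363
"""

# Columns whose entries are file names; n_samples holds a number instead.
FILE_COLUMNS = {"z", "ld", "bcor", "snp", "config", "cred", "k", "log"}


def read_master(text):
    """Parse a master file and check extensions against column names."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=";"))
    for line_no, row in enumerate(rows, start=1):
        for column, value in row.items():
            value = value or ""  # tolerate empty entries such as an unused ld column
            if " " in value:
                raise ValueError(f"dataset {line_no}: space in column {column}")
            if column in FILE_COLUMNS and value and not value.endswith("." + column):
                raise ValueError(f"dataset {line_no}: {value} does not end in .{column}")
        row["n_samples"] = int(row["n_samples"])
    return rows


datasets = read_master(MASTER_TEXT)
```

Each element of datasets is a dictionary keyed by column name, so datasets[0]["z"] gives the Z file of the first dataset.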

(2) Z file

The dataset.z file is a space-delimited text file and contains the GWAS summary statistics one SNP per line. It contains the mandatory column names in the following order.

rsid column contains the SNP identifiers. The identifier can be a rsID number or a combination of chromosome name and genomic position (e.g. XXX:yyy)
chromosome column contains the chromosome names. The chromosome names can be chosen freely with precomputed SNP correlations (e.g. 'X', '0X' or 'chrX')
position column contains the base pair positions
allele1 column contains the "first" allele of the SNPs. In SNPTEST this corresponds to 'allele_A', whereas BOLT-LMM uses 'ALLELE1'
allele2 column contains the "second" allele of the SNPs. In SNPTEST this corresponds to 'allele_B', whereas BOLT-LMM uses 'ALLELE0'
maf column contains the minor allele frequencies
beta column contains the estimated effect sizes as given by GWAS software
se column contains the standard errors of effect sizes as given by GWAS software
flip optional column - see below

Columns beta and se are required for fine-mapping. Column maf is needed to output posterior effect size estimates on the allelic scale. All other columns are not required for computations and can be specified arbitrarily.
When using BCOR support, entries for each SNP in columns rsid, chromosome, position, allele1 and allele2 need to correspond with the information in BCOR files. The chromosome column may have to contain '0X' for X = 1,...,9, where X is the chromosome number.
It is recommended to compute all SNP correlations from allele counts of one of the alleles. In this case, estimated effect sizes and their standard errors from GWAS software can be used directly if the software always codes the same allele as the effect allele. This is the case in software SNPTEST (uses 'allele_B' as the effect allele) and BOLT-LMM (uses 'ALLELE1' as the effect allele). However, if the GWAS software codes the minor allele of the SNPs as the effect allele, then the direction of estimated effect sizes needs to be flipped to either the first or the second allele. This can be done by specifying the --flip-beta command-line argument and augmenting dataset.z by a flip column which contains 1 in a line if the direction of the estimated effect size of the SNP needs to be flipped and 0 otherwise.
SNPs do not have to be ordered by genomic positions and can reside on different chromosomes. However, the order of SNPs in dataset.z must correspond to the order of SNPs in dataset.ld!
A dataset.z file with three SNPs could look as follows.

rsid chromosome position allele1 allele2 maf beta se

rs1 10 1 T C 0.35 0.0050 0.0208

rs2 10 1 A G 0.04 0.0368 0.0761

rs3 10 1 G A 0.18 0.0228 0.0199
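
The following Python sketch reads a Z file like the one above, optionally applies the flip column, and derives z-scores as beta divided by se. The function read_z is a hypothetical helper for illustration. That flipping amounts to negating beta is an assumption about what --flip-beta does, based on the description above.

```python
# Example Z file content copied from the text above.
Z_TEXT = """rsid chromosome position allele1 allele2 maf beta se
rs1 10 1 T C 0.35 0.0050 0.0208
rs2 10 1 A G 0.04 0.0368 0.0761
rs3 10 1 G A 0.18 0.0228 0.0199
"""


def read_z(text, flip_beta=False):
    """Parse a space-delimited Z file and compute a z-score per SNP."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        fields = dict(zip(header, line.split()))
        beta = float(fields["beta"])
        se = float(fields["se"])
        # Assumption: a flip indicator of 1 means the sign of beta is reversed.
        if flip_beta and fields.get("flip") == "1":
            beta = -beta
        records.append({"rsid": fields["rsid"], "beta": beta, "se": se, "z": beta / se})
    return records


snps = read_z(Z_TEXT)
```

The records keep the order of the file, which matters because the LD file must list SNPs in the same order.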

(3) LD file

The dataset.ld file is a space-delimited text file and contains the SNP correlation matrix (Pearson's correlation).

Ideally, the SNP correlation matrix is computed from the genotype data on the same samples from which the GWAS summary statistics originate. Read here what could happen if SNP correlations from reference genotypes (e.g. 1000 Genomes Project) do not match well with the GWAS summary statistics.
With imputed biobank-scale genotype data, it is important to compute SNP correlations from the same genotype data used in GWAS software. Read here for an example highlighting the importance of computing SNP correlations from the same dosage data used in GWAS software. For example, if GWAS summary statistics are generated with BOLT-LMM using SNP dosages (e.g. when used with BGEN files), then SNP correlations need to be computed from the same SNP dosage data. The same applies to SNPTEST when using the -method expected option to deal with genotype uncertainty. If GWAS summary statistics are computed from SNP dosage data using BGEN files, we recommend using the LDstore2 software to compute SNP correlations and advise against converting genotype probabilities to best-guess genotypes in order to compute SNP correlations.
The order of the SNPs in the dataset.ld must correspond to the order of SNPs in dataset.z.
A dataset.ld file with three SNPs could look as follows.

1.00 0.95 0.98

0.95 1.00 0.96

0.98 0.96 1.00
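
Because a Pearson correlation matrix is square, symmetric and has ones on the diagonal, an LD file can be sanity-checked before use. The sketch below is a hypothetical helper (read_ld and check_ld are not part of FINEMAP) and uses illustrative values.

```python
# Illustrative 3 x 3 correlation matrix in LD file layout.
LD_TEXT = """1.00 0.95 0.98
0.95 1.00 0.96
0.98 0.96 1.00
"""


def read_ld(text):
    """Parse a space-delimited LD file into a list of rows of floats."""
    return [[float(x) for x in line.split()] for line in text.strip().splitlines()]


def check_ld(matrix, n_snps, tol=1e-6):
    """Return a list of problems found; an empty list means the matrix looks fine."""
    if len(matrix) != n_snps or any(len(row) != n_snps for row in matrix):
        return ["matrix is not n_snps x n_snps"]
    problems = []
    for i in range(n_snps):
        if abs(matrix[i][i] - 1.0) > tol:
            problems.append(f"diagonal entry {i + 1} is not 1")
        for j in range(i + 1, n_snps):
            if abs(matrix[i][j] - matrix[j][i]) > tol:
                problems.append(f"entries ({i + 1},{j + 1}) and ({j + 1},{i + 1}) differ")
            if abs(matrix[i][j]) > 1.0 + tol:
                problems.append(f"entry ({i + 1},{j + 1}) is outside [-1, 1]")
    return problems


problems = check_ld(read_ld(LD_TEXT), n_snps=3)
```

Passing the number of SNPs in dataset.z as n_snps also catches a mismatch in size between the Z file and the LD file.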

(4) BCOR file

See here for BCOR file format description.

(5) Optional K file

By default, FINEMAP assumes that SNPs are causal with prior probability 1 / (# of SNPs in the genomic region). As an alternative, it is possible to specify prior probabilities for the number of causal SNPs in the genomic region by using a dataset.k file. This is a space-delimited text file and contains the prior probabilities p_k = Pr(# of causal SNPs is k) for k = 1,...,K, where K is the number of entries in the dataset.k file. The prior probabilities must be non-negative and will be normalized to sum to one.

We assume that the genomic region includes at least one causal SNP and thus p₀ = 0. A non-zero prior probability p₀ that there is no causal SNP in the genomic region can be specified with the command-line argument --prior-k0. This value is only used when computing posterior probabilities p_k|data = Pr(# of causal SNPs is k | data) but not during fine-mapping itself. We further assume that p_k = 0 for k = K + 1,...,m, where m is the number of SNPs in the dataset.z file.
A dataset.k file allowing for three causal SNPs with p₁ = 0.6, p₂ = 0.3 and p₃ = 0.1 would look as follows.

0.6 0.3 0.1
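
The normalization described above can be sketched in Python as follows. The function read_k is a hypothetical helper that mirrors the stated rules (non-negative entries, normalized to sum to one); it is not FINEMAP's own code.

```python
def read_k(text):
    """Parse a K file and return {k: p_k} with probabilities normalized to sum to one."""
    weights = [float(x) for x in text.split()]
    if not weights or any(w < 0 for w in weights):
        raise ValueError("K file entries must exist and be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one entry must be positive")
    # Entry number k (counting from 1) is the prior probability of k causal SNPs.
    return {k: w / total for k, w in enumerate(weights, start=1)}


priors = read_k("0.6 0.3 0.1")
unnormalized = read_k("6 3 1")  # the same prior, given as unnormalized weights
```

Both calls describe the same prior, since only the relative sizes of the entries matter after normalization.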

Output

(1) SNP file

The dataset.snp file is a space-delimited text file. It contains the GWAS summary statistics and model-averaged posterior summaries for each SNP one per line.

index column contains the line numbers in which SNPs appear in the dataset.z file
rsid, chromosome, position, allele1 and allele2 columns are the SNP identifiers from the Z file
maf column contains the minor allele frequencies as given in the Z file
beta column contains the estimated effect sizes as given in the Z file
se column contains the standard errors of effect size estimates as given in the Z file
z column contains the z-scores
prob column contains the marginal Posterior Inclusion Probabilities (PIP). The PIP for the l th SNP is the posterior probability that this SNP is causal.
log10bf column contains the log₁₀ Bayes factors. The Bayes factor quantifies the evidence that the l th SNP is causal with log₁₀ Bayes factors greater than 2 reporting considerable evidence
mean column contains the marginalized shrinkage estimates of the posterior effect size mean for the same allele as in column beta. The marginalized shrinkage estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from all causal configurations in the dataset.config file assuming that the effect size of the l th SNP is zero if the SNP is absent from a causal configuration
sd column contains the marginalized shrinkage estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the marginalized shrinkage estimates of the posterior effect size mean
mean_incl column contains the conditional estimates of the posterior effect size mean for the same allele as in column beta. The conditional estimate for the l th SNP is computed by averaging the posterior effect size means of this SNP from causal configurations in the dataset.config file in which it is included
sd_incl column contains the conditional estimates of the posterior effect size standard deviation. The estimates are computed in the same way as the conditional estimates of the posterior effect size mean

The PIPs in column prob are computed by summing up the posterior probabilities of all causal configurations in the dataset.config file in which the l th SNP is included. The PIPs sum to 1.0 if the maximum number of allowed causal SNPs is set to 1 with the --n-causal-snps command-line argument.
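
The model-averaged summaries above can be illustrated with a toy set of causal configurations. This is a sketch with made-up numbers. It assumes that "averaging" means weighting each configuration by its posterior probability, and the dictionary layout and the function posterior_summaries are hypothetical.

```python
from collections import defaultdict

# Toy configurations: SNPs, posterior probability and joint posterior effect size means.
toy_configs = [
    {"snps": ("rs1",), "prob": 0.5, "mean": (0.10,)},
    {"snps": ("rs1", "rs2"), "prob": 0.3, "mean": (0.08, 0.05)},
    {"snps": ("rs3",), "prob": 0.2, "mean": (0.04,)},
]


def posterior_summaries(configs):
    """Return (pip, mean, mean_incl) dictionaries keyed by SNP identifier."""
    pip = defaultdict(float)
    mean = defaultdict(float)
    for config in configs:
        for snp, effect in zip(config["snps"], config["mean"]):
            pip[snp] += config["prob"]            # sum over configurations containing the SNP
            mean[snp] += config["prob"] * effect  # effect counts as zero when the SNP is absent
    # Conditional estimate: average only over configurations that include the SNP.
    mean_incl = {snp: mean[snp] / pip[snp] for snp in pip}
    return dict(pip), dict(mean), mean_incl


pip, mean, mean_incl = posterior_summaries(toy_configs)
```

In this toy example rs1 appears in two configurations, so its PIP is the sum of their probabilities, and its conditional mean is larger in magnitude than its marginalized shrinkage mean.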

(2) CONFIG file

The dataset.config file is a space-delimited text file. It contains the posterior summaries for each causal configuration one per line.

rank column contains the ranking
config column contains the SNP identifiers
prob column contains the posterior probabilities that configurations are the causal configuration
log10bf column contains the log₁₀ Bayes factors. The Bayes factor quantifies the evidence for a causal configuration over the null configuration (no SNPs are causal)
odds column contains the odds of the top causal configurations
k column contains the number of SNPs of the top causal configurations
prob_norm_k column contains the posterior probabilities that configurations are the causal configuration normalized over the set of configurations with the same number of SNPs
h2 column contains the heritability contribution of SNPs
h2_0.95CI column contains the 95% credible interval of the heritability contribution of SNPs
mean column contains the joint posterior effect size means
sd column contains the joint posterior effect size standard deviations
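
The relation between columns prob, k and prob_norm_k can be sketched as below. The numbers are made up and the function add_prob_norm_k is a hypothetical helper. The sketch only illustrates normalizing posterior probabilities within groups of configurations of equal size.

```python
from collections import defaultdict

# Toy configurations with their posterior probabilities.
toy_configs = [
    {"snps": ("rs1",), "prob": 0.5},
    {"snps": ("rs1", "rs2"), "prob": 0.3},
    {"snps": ("rs3",), "prob": 0.2},
]


def add_prob_norm_k(configs):
    """Return copies of the configurations with k and prob_norm_k added."""
    totals = defaultdict(float)
    for config in configs:
        totals[len(config["snps"])] += config["prob"]  # probability mass per k
    return [
        dict(config, k=len(config["snps"]),
             prob_norm_k=config["prob"] / totals[len(config["snps"])])
        for config in configs
    ]


ranked = add_prob_norm_k(toy_configs)
```

Within each value of k the prob_norm_k entries sum to one, so they rank configurations of the same size against each other.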

(3) CRED file

The dataset.cred file is a space-delimited text file. It contains the 95% credible sets for each causal signal in the genomic region. For each credible set, the following posterior summaries are provided:

posterior probabilities for each SNP of being in the credible set
minimum, average and median absolute correlation between SNPs in the credible set
log10 Bayes factors quantifying the evidence that there is a causal signal in addition to the other causal signals

CRED files are generated for those cases of k causal SNPs in the genomic region that have largest posterior probability. For specific k, FINEMAP takes the k-SNP causal configuration with highest posterior probability and then asks, for the l th SNP in that set, which are the other candidates that could possibly replace that SNP in this causal configuration. The l th credible set shows the best candidate SNPs and their posterior probability of being in a k-SNP causal configuration that additionally contains k - 1 SNPs. Note that the k - 1 SNPs are chosen to have highest posterior probability in their credible set.

(4) LOG file

The dataset.log file outputs additional information. It contains the following output.

Posterior probabilities p_k|data = Pr(# of causal SNPs in the genomic region is k | data) for k = 1,...,K, where K is the maximum number of allowed causal SNPs
Expected number of causal SNPs in the genomic region
A log₁₀ Bayes factor to quantify the evidence of at least one causal SNP in the genomic region
Model-averaged heritability and 95% credible interval to quantify the contribution from causal SNPs
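
The expected number of causal SNPs follows from the posterior probabilities p_k|data as the sum of k times p_k|data. A minimal Python sketch with made-up probabilities is shown below; expected_n_causal is a hypothetical helper.

```python
def expected_n_causal(posterior_k):
    """Expected number of causal SNPs from {k: Pr(# of causal SNPs is k | data)}."""
    total = sum(posterior_k.values())
    if total <= 0:
        raise ValueError("posterior probabilities must have positive total mass")
    # Dividing by the total guards against rounding in the reported values.
    return sum(k * p for k, p in posterior_k.items()) / total


# Made-up posterior probabilities for k = 1, 2, 3.
expected = expected_n_causal({1: 0.2, 2: 0.7, 3: 0.1})
```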

Fine-mapping example

Using genotype data with 55 SNPs and 5363 individuals, a quantitative phenotype was simulated using a linear model with 2 causal SNPs. Single-SNP testing was performed to obtain z-scores. SNP correlations were computed from GWAS genotype data.

Fine-mapping the SNPs in genomic region 1 in the example folder using shotgun stochastic search is done as follows.

./finemap_v1.4_MacOSX --sss --in-files example/data --dataset 1

./finemap_v1.4_x86_64 --sss --in-files example/data --dataset 1

Fine-mapping the SNPs in genomic region 2 in the example folder using stepwise conditional search is done as follows.

./finemap_v1.4_MacOSX --cond --in-files example/data --dataset 2

./finemap_v1.4_x86_64 --cond --in-files example/data --dataset 2

The stepwise conditioning procedure is similar to the implementation in GCTA COJO.

Single causal configuration example

The same data as in the fine-mapping example above are used. Without having to perform shotgun stochastic search, information about a single causal configuration can be obtained by specifying SNP identifiers as follows.

./finemap_v1.4_MacOSX --config --in-files example/data --dataset 1 --rsids rs30,rs11
./finemap_v1.4_x86_64 --config --in-files example/data --dataset 1 --rsids rs30,rs11
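
When FINEMAP is called from a script, the command lines above can be assembled programmatically. The Python sketch below only builds the argument list from flags documented on this page. The function build_finemap_command is a hypothetical helper, and the binary path is an assumption about where the executable lives.

```python
def build_finemap_command(binary, subprogram, in_files, dataset=None, rsids=None):
    """Build an argument list for one of the subprograms sss, cond or config."""
    if subprogram not in ("sss", "cond", "config"):
        raise ValueError("subprogram must be 'sss', 'cond' or 'config'")
    if rsids and subprogram != "config":
        raise ValueError("--rsids is documented for use with --config only")
    command = [binary, "--" + subprogram, "--in-files", in_files]
    if dataset is not None:
        command += ["--dataset", str(dataset)]
    if rsids:
        command += ["--rsids", ",".join(rsids)]
    return command


# Reproduces the single causal configuration example above.
command = build_finemap_command(
    "./finemap_v1.4_x86_64", "config", "example/data", dataset=1, rsids=["rs30", "rs11"]
)
```

The resulting list can be passed to subprocess.run(command, check=True) from the Python standard library, provided the executable exists at the given path.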

References

Benner, C. et al. FINEMAP: Efficient variable selection using summary data from genome-wide association studies. Bioinformatics 32, 1493-1501 (2016).

Hans, D. et al. Shotgun stochastic search for "large p" regression. J Am Stat Assoc 102, 507-516 (2007).

Acknowledgements

Matti Pirinen contributed to the design and implementation of FINEMAP.


LDstore is a computationally efficient program for estimating and storing Linkage Disequilibrium (SNP correlations). It combines some of the best features from RAREMETALWORKER and PLINK by implementing parallel processing using OPENMP and storing of the SNP correlations with information about the SNPs in the same binary file for fast lookups. LDstore is therefore the ideal tool for sharing SNP correlations in large-scale meta-analyses of genome-wide association studies and for on-the-fly computing/querying within web portals.

LDstore is a program for

1. compressing sequencing data
2. converting genotype probabilities to dosage data
3. computing SNP correlations

Download

(license)

ldstore_v2.0_MacOSX.tgz (Mac OS X)
ldstore_v2.0_x86_64.tgz (Unix)
To receive e-mail reminders about updates of LDstore, send an email to ldstore@christianbenner.com.

Command-line arguments

--bcor-file	Option to specify a BCOR file	With --bcor-to-text
--bcor-to-text	Convert BCOR file to a text file	Subprogram
--bdose-version	Option to set the BDOSE file version (see below)	With --write-bdose
--compression	Option to specify the compression level (see below) of a BDOSE/BCOR file as 'ultra-low' (1 byte), 'low' (2 bytes), 'medium' (4 bytes) or 'high' (8 bytes)	Default is medium. With --write-bcor and --write-bdose
--dataset	Option to specify a delimiter-separated list of datasets as given in the master file (e.g. 1,2 or 1|2)	All datasets are processed by default
--in-files	Master file (see below)	With --bcor-to-text, --write-bcor, --write-bdose and --write-text
--ld-file	Option to specify a LD file	With --bcor-to-text
--memory	Option to limit the amount of memory in gigabytes that can be used during computation of SNP correlations	Default is 1Gb. With --read-bdose or --write-bdose when using either --write-bcor or --write-text
--n-threads	Option to set the number of parallel threads	Default is 1. With --write-bcor, --write-bdose or --write-text
--range	Option to specify a genomic range xx-yy to operate on when converting a BCOR file to a LD file where xx and yy are the start and end coordinates in base pairs	With --bcor-to-text
--read-bdose	Read dosage data from a BDOSE file	With --write-bcor or --write-text
--read-only-bgen	Read genotype probabilities from a BGEN file and store dosage data in memory	With --write-bcor or --write-text
--rsids	Option to specify a comma-separated list of SNP identifiers corresponding with the rsid column in a Z file (see below)	With --bcor-to-text or --write-text
--sample-miss	Option to set a missing data threshold between 0 and 1. If the missing data rate for a SNP is above the specified threshold, then the correlation of any SNP pair that includes this SNP is set to NA. If the missing data rate for a SNP is below the specified threshold, then missing data is mean-imputed	Default is 0.1. With --write-bcor or --write-text
--write-bcor	Write SNP correlations to a BCOR file	Subprogram
--write-bdose	Write dosage data to a BDOSE file	Subprogram and with --write-bcor or --write-text
--write-text	Write SNP correlations to a text file	Subprogram
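
The missing-data rule of --sample-miss can be pictured with the Python sketch below. It is an illustration, not LDstore's code. The function impute_or_mask is hypothetical, missing values are represented by None, and a missing data rate exactly equal to the threshold is treated as acceptable, which the description above leaves open.

```python
def impute_or_mask(dosages, sample_miss=0.1):
    """Mean-impute missing dosages, or return None if too many are missing.

    A return value of None stands for a SNP whose correlations would be NA.
    """
    observed = [d for d in dosages if d is not None]
    missing_rate = 1.0 - len(observed) / len(dosages)
    if missing_rate > sample_miss or not observed:
        return None
    mean = sum(observed) / len(observed)
    return [mean if d is None else d for d in dosages]


# One of four dosages is missing, i.e. a missing data rate of 0.25.
imputed = impute_or_mask([1.0, None, 2.0, 0.0], sample_miss=0.5)
masked = impute_or_mask([1.0, None, 2.0, 0.0], sample_miss=0.1)
```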

Input

(1) Master file

The master file is a semicolon-separated text file and contains no space. It contains the following mandatory column names and one dataset per line.

z column contains the names of Z files (input)
bgen column contains the names of BGEN files (input)
bgi column contains the names of BGI files (input)
bcor column contains the names of BCOR files when using --write-bcor (output)
ld column contains the names of LD files when using --bcor-to-text (output)
n_samples column contains the number of samples to include in any processing
bdose column contains the names of optional BDOSE files when using --write-bdose (optional output)
sample column contains the names of optional sample files when using --write-bdose and --bdose-version 1.1 (optional input)
incl column contains the names of optional sample inclusion files (optional input)

File extensions must correspond with the column names in the header line!
The master file can contain columns ld and bcor simultaneously.
Columns sample and incl are required when including fewer samples in any processing than are available in BGEN files.
The master file can be omitted when extracting SNP correlations from a BCOR file with --bcor-to-text by directly specifying --bcor-file and --ld-file.
A master file with two datasets for writing SNP correlations to a BCOR or text file could look as follows.

z;bgen;bgi;bcor;ld;n_samples

dataset1.z;dataset1.bgen;dataset1.bgen.bgi;dataset1.bcor;;5363

dataset2.z;dataset2.bgen;dataset2.bgen.bgi;;dataset2.ld;5363

(2) Z file

The dataset.z file is a space-delimited text file and contains meta information about the SNPs one SNP per line. It contains the mandatory column names in the following order.

rsid column contains the SNP identifiers. The identifier can be a rsID number or a combination of chromosome name and genomic position (e.g. XXX:yyy)
chromosome column contains the chromosome names
position column contains the base pair positions
allele1 column contains the "first" allele of the SNPs
allele2 column contains the "second" allele of the SNPs

Additional columns can be inserted after column allele2 such that a Z file from FINEMAP can be used.
Entries for each SNP in columns rsid, chromosome, position, allele1 and allele2 need to correspond with the information in BGEN files. The chromosome column may have to contain '0X' for X = 1,...,9, where X is the chromosome number.
SNPs do not have to be ordered by genomic positions and can reside on different chromosomes.
A dataset.z file with three SNPs could look as follows.

rsid chromosome position allele1 allele2

rs1 10 1 T C

rs2 10 1 A G

rs3 10 1 G A

(3) BGEN, BGI, SAMPLE and INCL file

These are Oxford file formats and are described here (BGEN), here (BGI) and here (SAMPLE). The dataset.incl file is a text file to specify inclusion of samples in any processing. It contains one sample ID per line.

LDstore expects diploid genotype probability data. When converting chromosome X data from PLINK files to BGEN files, male samples need to be recoded as females before conversion to force PLINK to output diploid data.

Output

(1) BCOR v1.1 file

BCOR v1.1 files are binary files that store SNP correlations together with information about the SNPs in the same file for fast lookups. BCOR v1.1 files can be used with FINEMAP v1.4 and also include correlations for more SNPs than will be fine-mapped.

The BCOR v1.1 file format is described here.

(2) BDOSE v1.0 file

BDOSE v1.0 files are binary files and meant for speeding up one-time computations of SNP correlations in a genomic region when memory is limited. LDstore converts genotype probabilities from a BGEN file to dosage data and writes that data in floating-point format to a BDOSE v1.0 file (possibly in parallel). I/O speedups are achieved by memory-mapping the BDOSE v1.0 file and memory limitations are satisfied by computing SNP correlations in a block-wise fashion.

The BDOSE v1.0 file format is described here.
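
The conversion from diploid genotype probabilities to dosages, and the correlation computed from them, can be sketched in plain Python. This illustrates the general idea only. Which allele LDstore counts and how it stores or standardizes dosages are not specified here, so counting the second allele is an assumption, and to_dosage and pearson are hypothetical helpers.

```python
import math


def to_dosage(p_aa, p_ab, p_bb):
    """Expected count of allele B given diploid genotype probabilities."""
    return p_ab + 2.0 * p_bb


def pearson(x, y):
    """Pearson's correlation between two equally long dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy genotype probabilities (AA, AB, BB) for four samples at two SNPs.
snp1 = [to_dosage(*p) for p in [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0)]]
snp2 = [to_dosage(*p) for p in [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]]
r = pearson(snp1, snp2)
```

Using expected allele counts rather than best-guess genotypes keeps the genotype uncertainty in the correlation estimate.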

(3) BDOSE v1.1 file

BDOSE v1.1 files are binary files and meant for compressing sequencing data and storing whole-chromosome dosage data. LDstore 1) converts genotype probabilities from a BGEN file to dosage data, 2) converts dosage data from floating-point format to integer format, 3) compresses dosage data in integer format according to the Zstandard compression algorithm, and 4) writes the compressed dosage data to a BDOSE v1.1 file (possibly in parallel). Memory limitations are satisfied by computing SNP correlations in a block-wise fashion.

The BDOSE v1.1 file format is described here.

(4) LD file

LD files are space-delimited text files that contain SNP correlation matrices. A LD file with three SNPs could look as follows.

1.00 0.95 0.98

0.95 1.00 0.96

0.98 0.96 1.00

Examples

BGEN to BDOSE v1.1 file conversion

Genotype data with 55 SNPs and 5363 individuals in BGEN format can be converted to dosage data in BDOSE v1.1 format as follows.

./ldstore_v2.0_x86_64 --in-files example/data --write-bdose --bdose-version 1.1

Dosage data in the BDOSE v1.1 file can be compressed by using

./ldstore_v2.0_x86_64 --in-files example/data --write-bdose --bdose-version 1.1 --compression low

Computation of SNP correlations

SNP correlations for the same data as in the example above can be computed and written to a BCOR v1.1 file. There are several options for 1) storing intermediate dosage data in memory, 2) writing dosage data first to a BDOSE file, or 3) reading from an existing BDOSE file.

./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --read-only-bgen
./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --write-bdose --bdose-version 1.0
./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --write-bdose --bdose-version 1.1
./ldstore_v2.0_x86_64 --in-files example/data --write-bcor --read-bdose

SNP correlations for a subset of SNPs can be computed and written to a LD file by either specifying a comma-delimited list of SNP identifiers or a text file with SNP identifiers one per line after --rsids. The same options for handling dosage data apply as in the example above.

./ldstore_v2.0_x86_64 --in-files example/data --write-text --read-only-bgen --rsids rs30,rs11
./ldstore_v2.0_x86_64 --in-files example/data --write-text --read-only-bgen --rsids rsids.txt

BCOR v1.1 to LD file conversion

SNP correlations in a BCOR v1.1 file can be extracted and written to a LD file as follows.

./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text
./ldstore_v2.0_x86_64 --bcor-to-text --bcor-file example/data.bcor --ld-file example/data.ld
./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text --rsids rs30,rs11
./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text --rsids rsids.txt
./ldstore_v2.0_x86_64 --in-files example/data --bcor-to-text --range 5-10
./ldstore_v2.0_x86_64 --bcor-to-text --bcor-file example/data.bcor --ld-file example/data.ld --rsids rs30,rs11
./ldstore_v2.0_x86_64 --bcor-to-text --bcor-file example/data.bcor --ld-file example/data.ld --rsids rsids.txt
./ldstore_v2.0_x86_64 --bcor-to-text --bcor-file example/data.bcor --ld-file example/data.ld --range 5-10

Python library

The LDstore python3 library contains functions for reading files from LDstore as well as limited functionalities for computing SNP correlations.

Installation

pip3 install https://files.pythonhosted.org/packages/a8/fd/f98ab7dea176f42cb61b80450b795ef19b329e8eb715b87b0d13c2a0854d/ldstore-0.1.9.tar.gz

BDOSE v1.0 example

>>> from ldstore.bdose import bdose

>>> myBdose = bdose( 'example/data_v1.0.bdose' )

>>> myBdose.getFname()
'example/data_v1.0.bdose'

>>> myBdose.getFsize()
2363184

>>> myBdose.getMeta().loc[ range( 5 ) ]

	rsid	position	chromosome	allele1	allele2
0	rs1	1.0	01	A	G
1	rs2	2.0	01	A	G
2	rs3	3.0	01	A	G
3	rs4	4.0	01	A	G
4	rs5	5.0	01	A	G

>>> myBdose.getMissingness()[ range( 5 ) ]
array([0, 0, 0, 0, 0], dtype=uint32)

>>> myBdose.getNumOfSNPs()
55

>>> myBdose.getNumOfSamples()
5363

>>> myBdose.getOffsets()[ range( 5 ) ]
array([ 3464, 46368, 89272, 132176, 175080], dtype=uint64)

>>> myBdose.readDosages( [ 29, 10 ] )[ 0, : ]
array([-1.50769106, -1.57917029])

>>> myBdose.readDosages( [] )[ 0, [ 29, 10 ] ]
array([-1.50769106, -1.57917029])

>>> myBdose.computeCorr( [ 29, 10 ] )

	0	1
0	1.000000	-0.082955
1	-0.082955	1.000000

>>> myBdose.computeCorr( [] ).loc[ 29, 10 ]
-0.0829552808503373

BDOSE v1.1 example

>>> from ldstore.bdose import bdose

>>> myBdose = bdose( 'example/data_v1.1.bdose' )

>>> myBdose.getFname()
'example/data_v1.1.bdose'

>>> myBdose.getFsize()
125054

>>> myBdose.getMeta().loc[ range( 5 ) ]

	rsid	position	chromosome	allele1	allele2
0	rs1	1.0	01	A	G
1	rs2	2.0	01	A	G
2	rs3	3.0	01	A	G
3	rs4	4.0	01	A	G
4	rs5	5.0	01	A	G

>>> myBdose.getNumOfSNPs()
55

>>> myBdose.getNumOfSamples()
5363

>>> myBdose.getOffsets()[ range( 5 ) ]
array([45066, 46430, 48122, 49810, 51508], dtype=uint64)

>>> myBdose.getSamples()[ 0 : 5 ]
['1', '2', '3', '4', '5']

>>> myBdose.computeMAF( [ 29, 10 ] )
array([0.13499907, 0.43921313])

>>> myBdose.computeMAF( [] )[ [ 29, 10 ] ]
array([0.13499907, 0.43921313])

>>> myBdose.computeFrqAllele1( [ 29, 10 ] )
array([0.13499907, 0.43921313])

>>> myBdose.computeFrqAllele1( [] )[ [ 29, 10 ] ]
array([0.13499907, 0.43921313])

>>> myBdose.computeFrqAllele2( [ 29, 10 ] )
array([0.86500093, 0.56078687])

>>> myBdose.computeFrqAllele2( [] )[ [ 29, 10 ] ]
array([0.86500093, 0.56078687])

>>> myBdose.readDosages( [ 29, 10 ] )[ 0, : ]
array([1., 0.])

>>> myBdose.readDosages( [] )[ 0, [ 29, 10 ] ]
array([1., 0.])

>>> myBdose.computeCorr( [ 29, 10 ] )

	0	1
0	1.000000	-0.082955
1	-0.082955	1.000000

>>> myBdose.computeCorr( [] ).loc[ 29, 10 ]
-0.0829552808503373

BCOR v1.1 example

>>> from ldstore.bcor import bcor

>>> myBcor = bcor( 'example/data.bcor' )

>>> myBcor.getFname()
'example/data.bcor'

>>> myBcor.getFsize()
7723

>>> myBcor.getMeta().loc[ range( 5 ) ]

	rsid	position	chromosome	allele1	allele2
0	rs1	1.0	01	A	G
1	rs2	2.0	01	A	G
2	rs3	3.0	01	A	G
3	rs4	4.0	01	A	G
4	rs5	5.0	01	A	G

>>> myBcor.getNumOfSNPs()
55

>>> myBcor.getNumOfSamples()
5363

>>> myBcor.readCorr( [ 29, 10 ] )[ 29, : ]
array([ 1. , -0.0829553])

>>> myBcor.readCorr( [] )[ 29, 10 ]
-0.08295530080795288

References

Benner, C. et al. Prospects of fine-mapping trait-associated genomic regions by using summary statistics from genome-wide association studies. Am. J. Hum. Genet. (2017).