Allelic Variation Explorer.
If you are studying how single nucleotide polymorphisms are clustered in genomic samples, then the Allelic Variation Explorer can help you visualize them.
ave-rest-service
The Allelic Variation Explorer (AVE) is a web application to visualize (clustered) single-nucleotide variants across genomes. The Allelic Variation Explorer rest service clusters genomic variants and lists the available datasets. Combined with the ave-app, it visualizes clustered genomic variants for a given genomic range in a genome browser.
This service is the back end for the ave-app front end. The front end runs in the user's web browser and communicates with the back end running on a web server. The front end is the user interface and the back end is the service serving the variant, annotation and genomic data. The front end and back end communicate with each other according to the Swagger specification in swagger.yml.
Architecture
The ave-rest-service and ave-app are wrapped up in a Docker image. The Docker image is used for deploying the Allelic Variation Explorer on a server.
The Allelic Variation Explorer consists of the following parts working together:
* a running ave rest service
* an extracted ave-app build archive.
* 2bit (genome sequence), bcf (variants) and bigbed (genes and feature annotations) data files, green in diagram
* a directory with full text indices for genes and features in Whoosh format, filled by data registration commands, red in diagram
* a meta database file, which contains the list of available datasets inside the application, filled by data registration commands, yellow in diagram
* a NGINX web server, for hosting app and data files and proxy-ing ave rest service behind a single port
* a Docker image combining all of the above, see ./Dockerfile for the instructions used to install all the parts
Deployment
A Docker image is available on Docker Hub.
Any change to the master branch of this repo or the ave-app will trigger an automatic build of the Docker image.
The Docker image contains no data; any data added will be lost when the Docker container is stopped/started. To get a deployment which persists its data, we will use directories on the server and mount these as volumes in the Docker container. It expects the following volumes:
- /data, location for 2bit, bcf and bigbed data files. Hosted as http://<aveserver>/data
- /whoosh, full text indices for genes and features
- /meta, directory in which ave meta database is stored
Run the service with:

```bash
# Use sub-directories in the current working directory to persist data
mkdir data whoosh meta
docker run -d \
  -v $PWD/data:/data -v $PWD/whoosh:/whoosh -v $PWD/meta:/meta \
  -p 80:80 \
  --name ave ave2/allelic-variation-explorer
```
The command above will run the web server on port 80 of the host machine.
After deployment the server is running, but contains no data; see the Data pre processing chapter for how to prepare data, followed by the Data registration chapter for how to add it.
Demo
A demo Docker image with a sample dataset is available at https://hub.docker.com/r/ave2/ave-demo/ .
Secure connection
The Docker container uses http, which is an unencrypted connection. To use a secure connection (https), a reverse proxy with a Let's Encrypt certificate can be put in front of the Docker container.
The Docker container must be run on a port other than 443.
Configure a web server like NGINX to serve https on port 443 and proxy all requests to the container. Use Certbot to generate the certificate pair and configure the web server.
See the example server configuration in the commented-out block in the ./nginx.conf file.
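A minimal sketch of such a reverse proxy, assuming the container was started with -p 8080:80 and that Certbot placed the certificates in its default paths (the domain name and paths are placeholders; the actual example to follow is the commented-out block in ./nginx.conf):

```nginx
server {
    listen 443 ssl;
    server_name ave.example.org;
    # certificate paths as produced by Certbot; adjust to your own domain
    ssl_certificate     /etc/letsencrypt/live/ave.example.org/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ave.example.org/privkey.pem;
    location / {
        # forward everything to the AVE container running on port 8080
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```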
Shutting down
The Docker container can be stopped using

```bash
docker rm -f ave
```
Update image
Make sure the Docker container is not running.
The Docker image can be updated using

```bash
docker pull ave2/allelic-variation-explorer
```
Data pre processing
Before data can be registered it has to be converted into the right formats. The sections below describe the pre-processing steps to convert common formats into the formats the application expects.
The tools used are available inside the Docker container or can be installed in an Anaconda environment.
To perform the pre-processing inside the Docker container, copy the raw files to the /data Docker volume and log in to the Docker container with docker exec -ti ave bash.
Genome sequence
The genome sequence must be in 2bit format.
If you have a genome sequence in FASTA format, where each chromosome is a sequence in the file, it can be converted to 2bit using:

```sh
faToTwoBit genome.fa genome.2bit
```
The sequence identifiers (chromosome/contig) should match the ones in the corresponding variants (bcf), genes (bigbed) and genomic feature annotation (bigbed) files.
Variants
Variants need to be provided in a single BCF-formatted file.
VCF files can be converted to a BCF file in the following way:
```sh
# vcf-merge requires bgzipped and tabix indexed VCF files so do that first
for f in *.vcf; do
  bgzip $f
  tabix -p vcf $f.gz
done
vcf-merge *.vcf.gz > variants.vcf
```
Sort by chromosome with VCFtools:

```sh
vcf-sort -c variants.vcf > variants.sorted.vcf
```

Compress with bgzip and index with tabix:

```sh
bgzip variants.sorted.vcf
tabix -p vcf variants.sorted.vcf.gz
```

Convert to BCF with bcftools:

```sh
bcftools view -O b variants.sorted.vcf.gz > variants.sorted.bcf
```

Index with bcftools:

```sh
bcftools index variants.sorted.bcf
```
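As noted in the Genome sequence section, the sequence identifiers in the BCF must match those in the 2bit file. A quick sketch to compare them (twoBitInfo and bcftools are available inside the Docker container; the sed pattern assumes standard ##contig header lines in the VCF):

```sh
# List sequence names from the 2bit file and from the BCF header, then compare;
# any output from comm means the names do not match between the two files.
twoBitInfo genome.2bit stdout | cut -f1 | sort -u > seq.names
bcftools view -h variants.sorted.bcf \
  | sed -n 's/^##contig=<ID=\([^,>]*\).*/\1/p' | sort -u > var.names
comm -3 seq.names var.names
```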
Genes
The genes (or transcripts) are rendered in a gene track.
Genes must be provided as a bigBed-formatted file. A bigBed file can be converted from a BED-formatted file.
The gene bed file is expected to have the following columns:
1. chrom, name of chromosome
2. chromStart, start of gene, zero-indexed
3. chromEnd, end of gene, zero-indexed
4. name, transcript identifier
5. score
6. strand
7. thickStart, location of start codon
8. thickEnd, location of stop codon
9. itemRgb
10. blockCount, number of exons
11. blockSizes
12. blockStarts
13. gene identifier
14. description of gene
So it should have exons and start/stop codons for one gene on a single line.
Some gene bed files are available at http://bioviz.org/quickload/
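Before converting, it may help to verify that every line really has the 14 tab-separated columns listed above; a small sketch (genes.bed is a placeholder for your own file):

```bash
# Print any line that does not have exactly 14 tab-separated columns
awk -F'\t' 'NF != 14 { print "line " NR ": " NF " columns" }' genes.bed
```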
To convert a gene bed file to bigbed use:

```bash
# Fetch chrom sizes
twoBitInfo genome.2bit chrom.sizes
# Download the bedToBigBed utility from UCSC; the version for Mac OS is
# available at http://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/bedToBigBed
chmod +x bedToBigBed
# The description field is too long for bedToBigBed so it must be trimmed
gunzip -c S_lycopersicum_May_2012.bed.gz | perl -n -e 'chomp;@F=split(/\t/);$F[13] = substr($F[13],0,255); print join("\t", @F),"\n";' > S_lycopersicum_May_2012.bed.trimmed
bedToBigBed -tab -type=bed12+2 S_lycopersicum_May_2012.bed.trimmed chrom.sizes S_lycopersicum_May_2012.bb
```
A BigBed file with more than 256 chromosomes will not render, see issue 37.
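Since chrom.sizes (produced by twoBitInfo above) has one line per sequence, the limit can be checked before building the bigBed; a small sketch:

```bash
# One line per chromosome/contig in chrom.sizes; more than 256 will not render
n=$(wc -l < chrom.sizes)
echo "$n sequences"
[ "$n" -le 256 ] || echo "warning: more than 256 sequences, track will not render"
```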
Genomic features annotations
Each feature file will render a feature track. The feature track name is the same as the file name.
Feature annotations must be provided as bigBed formatted files.
To convert a gff feature file to bigbed use:

```bash
# Download a gff file
wget ftp://ftp.solgenomics.net/tomato_genome/microarrays_mapping/A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.gff
# Sort gff
bedtools sort -i A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.gff > A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.sorted.gff
# Convert gff to bed
gff2bed < A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.sorted.gff > A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.bed
# Fetch chrom sizes
twoBitInfo genome.2bit chrom.sizes
# Convert bed to bigbed
bedToBigBed -tab -type=bed6+4 -as=gff3.as A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.bed chrom.sizes A-AFFY-87_AffyGeneChipTomatoGenome.probes_ITAG2.3genome_mapping.bb
```
The gff3.as file (available inside the Docker container as /app/gff3.as) is used to describe the columns in the bed file.
A BigBed file with more than 256 chromosomes will not render, see issue 37.
Data registration
After deployment the server is running, but contains no data. Data files must be registered so they show up in the web browser.
The following commands expect a running Docker container as described in the Deployment chapter.
Register genome
```bash
docker exec ave \
  avedata register \
  --species 'Solanum Lycopersicum' \
  --genome SL.2.40 \
  --datatype 2bit \
  /data/tomato/SL.2.40/genome.2bit
```
The last argument is the location of the genome 2bit file (/data/tomato/SL.2.40/genome.2bit in this example); it must be an absolute path which starts with /data/ and must be readable by anyone inside the Docker container.
The ave-app front end will use this path as the relative http(s) path to fetch the reference genome sequence in the selected region, and the AVE rest service will use it to determine the chromosome list and build haplotype sequences.
Register variants
```bash
docker exec ave \
  avedata register \
  --species 'Solanum Lycopersicum' \
  --genome SL.2.40 \
  --datatype variants \
  /data/tomato/SL.2.40/tomato_snps.bcf
```
The /data/tomato/SL.2.40/tomato_snps.bcf file must be readable by anyone inside the Docker container.
To perform clustering, a registered genome (2bit) file and the corresponding variant (bcf) file are required.
Register genes
One genome can have one gene track. The gene track shows exons, introns and untranslated regions.
```bash
docker exec ave \
  avedata register \
  --species 'Solanum Lycopersicum' \
  --genome SL.2.40 \
  --datatype genes \
  /data/tomato/SL.2.40/gene_models.bb
```
The last argument is the bigbed formatted file with genes (/data/tomato/SL.2.40/gene_models.bb in this example); it must be an absolute path which starts with /data/ and must be readable by anyone inside the Docker container.
Registration can take some time because a Whoosh full text index is built.
The ave-app front end will use this path as the relative http(s) path to fetch the genes in the selected region.
Register feature annotations
One genome can have one or more feature annotation files registered.
```bash
docker exec ave \
  avedata register \
  --species 'Solanum Lycopersicum' \
  --genome SL.2.40 \
  --datatype features \
  /data/tomato/SL.2.40/A-AFFY-87.bb
```
The last argument is the bigbed formatted file with feature annotations (/data/tomato/SL.2.40/A-AFFY-87.bb in this example); it must be an absolute path which starts with /data/ and must be readable by anyone inside the Docker container.
Registration can take some time because a Whoosh full text index is built.
The ave-app front end will use this path as the relative http(s) path to fetch the feature annotations in the selected region.
The basename of the file, in this case A-AFFY-87, will be used as the track label.
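The track label is simply the file name without directory and extension, as the standard basename utility shows:

```bash
# The registered file /data/tomato/SL.2.40/A-AFFY-87.bb gets this track label
basename /data/tomato/SL.2.40/A-AFFY-87.bb .bb
# → A-AFFY-87
```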
Deregister
To deregister one of the files use the deregister command.
For example, to deregister the /data/tomato/SL.2.40/A-AFFY-87.bb file use:

```bash
docker exec ave \
  avedata deregister \
  /data/tomato/SL.2.40/A-AFFY-87.bb
```
Develop
Below are instructions on how to get a development version of the service up and running.
Requirements:
Setup
First clone the repository:

```bash
git clone https://github.com/nlesc-ave/ave-rest-service.git
cd ave-rest-service
```
Create a new Anaconda environment with all the ave dependencies installed:

```bash
conda env create -f environment.yml
```

On OS X use enviroment.osx.yml instead of environment.yml.
Activate the environment:

```bash
source activate ave2
```
Install ave for development with:

```bash
python setup.py develop
```

If dependencies are changed in environment.yml, update the conda environment by running:

```bash
conda env update -f environment.yml
```
Configuration
The service needs a configuration file called settings.cfg.
The repo contains an example config file called settings.example.cfg.
Copy the example config file to settings.cfg and edit it.
Make sure the WHOOSH_BASE_DIR directory exists.
Run service
Change to the directory with the settings.cfg file.

```bash
gunicorn -w 4 --threads 2 -t 60 -b 127.0.0.1:8080 avedata.app:app
```
It will run the service on http://127.0.0.1:8080/ .
The api endpoint is at http://127.0.0.1:8080/api/ .
The Swagger UI is at http://127.0.0.1:8080/api/ui .
This will only run the Python web service; the hosting of the application and data files is explained in the Deployment chapter.
Run commands
The avedata command line tool expects to be run from the directory containing the settings.cfg file.
Available commands:
* deregister: Remove a filename or url from the database
* drop_db: Drops database
* init_db: Initializes database
* list: List of registered files/urls in the database
* register: Add file metadata information to the database
* run: Run as single threaded web service
Build Docker image
The Docker image of the latest commit of the master branch is automatically built on Docker Hub with the latest tag.
If you have local changes and want to test the Docker container locally, you can build the Docker image with

```bash
docker build -t ave2/allelic-variation-explorer .
```
The Docker image contains the latest version of ave-app.
If you want to run a different version of the app, replace the curl/tar command in ./Dockerfile with commands to put the app you want in the /var/www/html directory and build a Docker image.