About IMP 2.0

IMP-2.0 is the result of our efforts to facilitate the installation of the IMP codebase and dependencies. While the original version of IMP was based on Docker, the use of the Docker framework imposed challenges for running IMP in HPC environments. Hence, we decided to adjust it to use the conda framework instead. Below, you will find instructions on how to install IMP-2.0.

Due to the complexity and number of IMP's dependencies, the installation process will take some time. Most of it is automated, so you can go and grab a drink or go for a walk. However, some parts require your input.

N.B. We paid extra attention to replicating the Docker environment, i.e., the dependencies of IMP, in IMP-2.0. As such, we expect the results to be similar between IMP-1.x and IMP-2.0. Nevertheless, small differences might exist. In particular, some of the software uses random initialization, random allocation, or heuristics (e.g., megahit), which will inevitably lead to slight differences. We plan to inspect the degree of variability in more detail in the future, e.g., by comparing the Average Nucleotide Identity between Docker-based assemblies (IMP-1.x) and conda-based assemblies (IMP-2.0).

General details about the pipeline can be found in

N.B. The documentation is for IMP-1.x, i.e., the Docker-based version. Updated documentation for IMP-2.x will follow.

IMP-2.x source code

The development of IMP-2.x currently occurs in the bioconda branch of the IMP GitLab repository. We plan to merge this branch into the master branch in the future, but have kept it separate for now.

Installation in a nutshell

The following assumes all installations to occur in or under /path/to/imp/, i.e., conda, etc. should be installed there. Feel free to choose a different path, but please adjust the respective paths below accordingly.

These three main steps are required to install IMP 2.0:

  • cloning the IMP repository
  • downloading the database (required for filtering of human-derived sequences, rRNA-derived sequences, etc.)
  • getting the conda environment ready

The following steps assume that you do not yet have a conda environment called test.

cd /path/to/imp/
git clone -b bioconda https://git-r3lab.uni.lu/IMP/IMP.git # Get the code base
wget https://webdav-r3lab.uni.lu/public/R3lab/IMP/db/default.tgz # This will take some time (30 - 60 minutes)
tar -xzf default.tgz # Unpack the database
conda env create -n test -f IMP/conda/envs/all.yaml r==3.3.2 # This will take some time to download and install
source activate test # Activate your newly created environment
cd miniconda3/envs/test/opt/krona/
ktUpdateTaxonomy.sh # Update files required by Krona
cd -
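Before running the steps above, it may help to confirm that the command-line tools they rely on are available. The following is a minimal sketch; `missing_tools` is a hypothetical helper and not part of IMP.

```shell
# Print every required command that is not on PATH
# (prints nothing if all of them are found).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: the tool list mirrors the commands used in the steps above.
# missing_tools git wget tar conda
```

An empty output means that all listed tools were found.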

To provide legacy support, vcftools and R packages need to be installed separately (i.e., manually) into the conda-environment:

For vcftools:

wget --no-check-certificate https://webdav-r3lab.uni.lu/public/R3lab/IMP/vcftools_0.1.12b.tar.gz -nv
tar -xzf vcftools_0.1.12b.tar.gz
cd vcftools_0.1.12b && make && make install && cd -
cp -r vcftools_0.1.12b/bin/* miniconda3/envs/test/bin/
cp -r vcftools_0.1.12b/perl/* miniconda3/envs/test/lib/perl5/site_perl/5.22.0

For R: Please check first whether your R configuration is “clean” (see below), and then run:

CHECKPOINTDIR="/path/to/imp"; mkdir -p $CHECKPOINTDIR/.checkpoint && Rscript --vanilla -e "install.packages('checkpoint', repos='https://cloud.r-project.org/'); library(checkpoint); checkpoint('2016-06-20', checkpointLocation='${CHECKPOINTDIR}', project='${CHECKPOINTDIR}/IMP/docker/'); source('http://bioconductor.org/biocLite.R'); biocLite('genomeIntervals', dependencies=TRUE); .libPaths()"
# Verify that at the end of the command only paths prefixed with `/path/to/imp` are printed. 

N.B. Should you modify some of the R packages manually at some point, please make sure that they are installed into the right directory (see also https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html with respect to R_LIBS, R_LIBS_USER, etc.).

If your environment is already set up, you can jump directly to running IMP.

Detailed Installation Guide

Several steps above are kept rather short or have been skipped for brevity. More detailed information on the installation of IMP can be found here.

Conda installation and setup

Install Miniconda3 for 64-bit Linux

To start, download miniconda3 and install it:

cd /path/to/imp # If this directory does not exist, please create it.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh # Enable execution of the installer
bash Miniconda3-latest-Linux-x86_64.sh # Specify `/path/to/imp/miniconda3` as the installation directory (the prompt comes after the license agreement prompt). Also, consider answering 'yes' when asked whether the miniconda path should be prepended to your `PATH` variable

This should take about 5 - 10 minutes and will consume around 2 GB of disk space.
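Since this step and several later ones need a fair amount of disk space (around 2 GB here, about 6 GB for the Krona taxonomy, and 5.9 GB for the database), a quick check of the free space under the installation directory can avoid a failed download. This is only a sketch; `has_free_gb` is a hypothetical helper name.

```shell
# Succeed if the filesystem holding DIR has at least MIN_GB gigabytes available.
# Usage: has_free_gb DIR MIN_GB
has_free_gb() (
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # The fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
)

# Usage:
# has_free_gb /path/to/imp 15 || echo "less than 15 GB free under /path/to/imp" >&2
```

The threshold of 15 GB in the usage line is an illustrative value, not a measured requirement.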

Please note that Miniconda3 is Python 3 based. We tested our pipeline on a 64-bit Linux system. After installing Miniconda, please close the console and open a new terminal for the changes to take effect.

Add channels to the conda environment

# Please remember that the CHANNEL ORDER IS IMPORTANT
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels imp

This step is not strictly required here because these channels are already listed in the YAML file. However, we recommend running it to make sure all priorities are set correctly.
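Because `conda config --add` places each new channel at the top of the list, the four commands above should leave the channels in the order imp, bioconda, conda-forge, defaults (highest priority first). The sketch below checks exactly that order; `channels_in_order` is a hypothetical helper, and it assumes that no additional channels are configured.

```shell
# Read the output of `conda config --show channels` on stdin and succeed only
# if the channels appear in the expected priority order (highest first).
channels_in_order() (
    expected="imp bioconda conda-forge defaults"
    # Keep only the list entries ("  - name") and join them with single spaces.
    actual=$(sed -n 's/^[[:space:]]*-[[:space:]]*//p' | tr '\n' ' ' | sed 's/ *$//')
    [ "$actual" = "$expected" ]
)

# Usage:
# conda config --show channels | channels_in_order && echo "channel order OK"
```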

Get the IMP code base

git clone -b bioconda https://git-r3lab.uni.lu/IMP/IMP.git # Get the code base

The codebase of IMP-2.x is currently within the bioconda branch, but is expected to be merged into the master branch in the future.

Creating conda environment

Here, we create a conda environment from a file that lists the software packages and dependencies required for IMP.

Note: Replace the environment name test with the name of the environment you want to create. To learn more about creating an environment from a YAML file, please refer to this link.

conda env create -n test -f IMP/conda/envs/all.yaml r==3.3.2

N.B. As the list of dependencies is quite long, this step will take quite some time to resolve the environment and perform the individual installations. In our tests, it typically took around 25 - 30 minutes. Hence, this might be a good time to grab something to drink or go for a little walk.

At the end of the installation, you should see something along the following lines being printed:

Krona installed.  You still need to manually update the taxonomy
databases before Krona can generate taxonomic reports.  The update
script is ktUpdateTaxonomy.sh.  The default location for storing
taxonomic databases is /home/users/claczny/imp-install-test/miniconda3/envs/test/opt/krona/taxonomy

If you would like the taxonomic data stored elsewhere, simply replace
this directory with a symlink.  For example:

rm -rf /home/users/claczny/imp-install-test/miniconda3/envs/test/opt/krona/taxonomy
mkdir /path/on/big/disk/taxonomy
ln -s /path/on/big/disk/taxonomy /home/users/claczny/imp-install-test/miniconda3/envs/test/opt/krona/taxonomy
ktUpdateTaxonomy.sh

Updating the taxonomy is explained below in “Update Krona-related files”; run it after you have activated your newly created environment.

  • To see all your environments, run:
conda info --envs

The active environment will be marked with an asterisk, e.g.,

base                     /path/to/imp/miniconda3
test                  *  /path/to/imp/miniconda3/envs/test
  • Activate the environment
conda activate test

Should you receive something along the following lines:

CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.
If your shell is Bash or a Bourne variant, enable conda for the current user with

    $ echo ". /home/users/claczny/imp-install-test/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc

or, for all users, enable conda with

    $ sudo ln -s /home/users/claczny/imp-install-test/miniconda3/etc/profile.d/conda.sh /etc/profile.d/conda.sh

The options above will permanently enable the 'conda' command, but they do NOT
put conda's base (root) environment on PATH.  To do so, run

    $ conda activate

in your terminal, or to put the base environment on PATH permanently, run

    $ echo "conda activate" >> ~/.bashrc

Previous to conda 4.4, the recommended way to activate conda was to modify PATH in
your ~/.bashrc file.  You should manually remove the line that looks like

    export PATH="/home/users/claczny/imp-install-test/miniconda3/bin:$PATH"

^^^ The above line should NO LONGER be in your ~/.bashrc file! ^^^

Either follow the instructions or use source activate test instead.

All Python 3 software and dependencies required to run IMP are installed inside the environment. Python 2 dependencies are installed during the first run of the pipeline, which might delay the very first run to some extent.

  • Update Krona-related files: Please run the following:
    cd miniconda3/envs/test/opt/krona/
    ktUpdateTaxonomy.sh
    cd -
    

Again, this step will take some time (35 - 40 minutes) and require some extra disk space (~6 GB in miniconda3/envs/test/opt/krona).

  • To see the list of installed packages in the active environment, run:
conda list

Dependencies

This is the list of all the tools that are installed inside the conda environment you created using the YAML file provided by IMP:

Python 3 dependencies:

  • fastuniq==1.1
  • idba==1.1.1
  • cap3==10.2011
  • fastqc==0.11.3
  • prokka==1.11
  • parallel==20160622
  • sortmerna==2.1b
  • krona==2.5
  • megahit==1.0.6
  • pullseq==1.0.2
  • samtools==0.1.19
  • dendropy==4.1.0
  • screamingbackpack==0.2.333
  • maxbin2==2.2.1
  • fraggenescan==1.30
  • snakemake==4.7.0
  • tbl2asn
  • r==3.3.2
  • readline==6.2
  • htslib==1.2.1
  • perl-vcftools-vcf==0.953
  • checkm-genome==1.0.7
  • bedtools==2.18.0
  • freebayes
  • bwa==0.7.9a
  • trimmomatic==0.32

Python 2 dependencies:

  • platypus-variant==0.8.1
  • quast==4.6.3

Manual software installation

Before following the steps below, please make sure that the conda environment is activated so that the manual installation of the software mentioned below (i.e., vcftools and R packages) occurs within the environment.
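One way to guard against installing into the wrong place is to check the active environment first. conda sets the `CONDA_DEFAULT_ENV` variable when an environment is activated; the sketch below relies on that and assumes the environment was activated by name (here test). `require_env` is a hypothetical helper name.

```shell
# Succeed only if the named conda environment is the active one.
# CONDA_DEFAULT_ENV is set by conda on activation; it is empty or unset otherwise.
require_env() {
    if [ "${CONDA_DEFAULT_ENV:-}" != "$1" ]; then
        echo "conda environment '$1' is not active" >&2
        return 1
    fi
}

# Usage:
# require_env test && echo "safe to continue with the manual installation"
```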

vcftools

  • We are using vcftools 0.1.12b. To provide legacy support, we provide the respective tarball directly, rather than relying on the original developer.
 wget --no-check-certificate https://webdav-r3lab.uni.lu/public/R3lab/IMP/vcftools_0.1.12b.tar.gz -nv
 tar -xzf vcftools_0.1.12b.tar.gz
 cd vcftools_0.1.12b
 make
 make install
 cd -

Now, copy the binaries and Perl files to the bin and lib folders, respectively, of the created environment (here, test):

cp -r vcftools_0.1.12b/bin/* miniconda3/envs/test/bin/
cp -r vcftools_0.1.12b/perl/* miniconda3/envs/test/lib/perl5/site_perl/5.22.0

R packages

For the analysis, IMP uses R packages that you need to install inside your environment.

  • R with checkpoint libraries and genomeIntervals from biocLite

IMPORTANT: Make sure that you have not set any bash variables related to R, e.g., R_LIBS, R_LIBS_USER, R_LIBS_SITE. Should any of these variables be set, it cannot be guaranteed that R “stays within” the conda environment for installations, i.e., it may overwrite packages elsewhere, which is clearly undesirable.

The --vanilla option of Rscript prevents R from parsing its configuration files that specify environment variables (https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html). However, bash variables are not ignored. To make sure that your bash environment is “clean”, use env | grep "^R" | wc -l, which should return a count of 0. If not, check whether the respective variables are empty, e.g.:

env | grep "^R"

then returning

R_LIBS=

Again, if these variables are not empty as above, you can empty them, e.g.:

# An example for bash
export R_LIBS=
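The check described above can be wrapped into a small function that fails as soon as one of the three variables is exported with a non-empty value. This is a sketch only; `r_env_is_clean` is a hypothetical helper name, and it looks at R_LIBS, R_LIBS_USER and R_LIBS_SITE only.

```shell
# Succeed if none of R_LIBS, R_LIBS_USER, R_LIBS_SITE is exported with a
# non-empty value; otherwise print the offending variables and fail.
r_env_is_clean() {
    # The trailing "." in the pattern requires at least one character after "=",
    # so variables that are set but empty are accepted.
    if env | grep -E '^R_LIBS(_USER|_SITE)?=.'; then
        return 1
    fi
}

# Usage:
# r_env_is_clean && echo "R environment is clean"
```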

If you are sure that your environment is clean, run the following to install the R packages.

CHECKPOINTDIR="/path/to/imp"; mkdir -p $CHECKPOINTDIR/.checkpoint && Rscript --vanilla -e "install.packages('checkpoint', repos='https://cloud.r-project.org/'); library(checkpoint); checkpoint('2016-06-20', checkpointLocation='${CHECKPOINTDIR}', project='${CHECKPOINTDIR}/IMP/docker/'); source('http://bioconductor.org/biocLite.R');biocLite('genomeIntervals', dependencies=TRUE); .libPaths()"
# Verify that at the end of the command only paths prefixed with `/path/to/imp` are printed. 

The option project='${CHECKPOINTDIR}/IMP/docker/' makes sure that all required R libraries are installed according to the original Docker specifications.

N.B. Should you modify some of the R packages manually at some point, please make sure that they are installed into the right directory (see also https://stat.ethz.ch/R-manual/R-devel/library/base/html/libPaths.html with respect to R_LIBS, R_LIBS_USER, etc.). This situation is also discussed in https://github.com/conda-forge/r-base-feedstock/issues/37, albeit not with respect to the checkpoint package. In principle, when using this package, all installations should automatically occur in the path specified by the checkpointLocation argument. For this to happen, you have to make sure that you have activated the respective conda environment (here test) and that you load the checkpoint package with library("checkpoint") before performing any installation tasks. Not following these steps might result in overwriting existing R libraries, i.e., libraries outside of the environment (see also the description of the --vanilla option above).

We use the checkpoint library with the snapshot date 2016-06-20 to install the following R packages:

  • ggplot2
  • gtools
  • data.table
  • reshape
  • grid
  • grDevices
  • genomeIntervals
  • stringr
  • xtable
  • beanplot
  • psych

R version 3.3.2 is installed automatically when creating the environment from the all.yaml file in the previous step. This step is only required to install the R packages.

Downloading Databases

Before running, IMP needs databases that should be downloaded using the following commands:

wget --no-check-certificate https://webdav-r3lab.uni.lu/public/R3lab/IMP/db/default.tgz
tar -xzf default.tgz

Unfortunately, the database files are hosted on a server with a rather slow connection; we hope this will change soon. For now, however, this step can take some time (30 - 60 minutes) due to the database’s file size (5.9 GB). The path to the unpacked files needs to be provided when executing IMP:

DBPATH="/path/to/imp/db"
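After unpacking, a quick sanity check can confirm that the database directory exists and is not empty. The sketch below assumes that the archive unpacks into the directory named in DBPATH; `db_dir_ready` is a hypothetical helper and does not verify the contents.

```shell
# Succeed if DIR exists and contains at least one entry.
db_dir_ready() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Usage:
# db_dir_ready "/path/to/imp/db" || echo "database not found in /path/to/imp/db" >&2
```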

Configuration

IMP uses the Snakemake workflow manager to resolve all dependencies and execute the respective commands in the right order.

With snakemake:

CONFIGFILE="/path/to/imp/IMP/config.json" snakemake <MORE-OPTIONS>

The configuration file must be a valid JSON file, e.g.:

{
    "threads": 8,
    "output": "/home/user/analysis_output_directory",
    "preprocessing_filtering": false
}
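Small syntax slips such as an unquoted string value or a trailing comma make a configuration file invalid. One way to catch them before launching the pipeline is Python's built-in `json.tool` module, which is available through the Python 3 that ships with Miniconda3. `is_valid_json` is a hypothetical helper name; the check covers syntax only, not whether the keys are ones IMP understands.

```shell
# Succeed if FILE parses as JSON; otherwise python3 prints the parse error
# (with line and column) to stderr and the function fails.
is_valid_json() {
    python3 -m json.tool "$1" >/dev/null
}

# Usage:
# is_valid_json /path/to/imp/IMP/config.json || echo "please fix the configuration file" >&2
```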

Available parameters

  • threads: Maximum number of threads to use.
  • memory_total_gb: Maximum total memory (in GB) that tools may use; some tools need this limit to be set explicitly.
  • memory_per_core_gb: Maximum memory (in GB) per core that tools may use; some tools need this limit to be set explicitly.
  • tmp_dir: Path to a temporary directory.
  • raws - Metagenomics: Path to the metagenomics paired files.
  • raws - Metatranscriptomics: Path to the metatranscriptomics paired files.
  • outputdir: Path to the output directory.
  • db_path: Path to the databases.
  • preprocessing_filtering: Whether to filter reads against a database. Can be true or false.
  • assembler: The assembler to use. Can be idba or megahit.

Per tool/step parameters

Trimmomatic

  • pkg_url: Where to download the trimmomatic package to fetch the adapters databases.
  • adapter: What adapter to use.

The following parameters are taken from the Trimmomatic documentation:

  • leading: Cut bases off the start of a read, if below a threshold quality.
  • minlen: Specifies the minimum length of reads to be kept.
  • palindrome_clip_threshold: Specifies how accurate the match between the two ‘adapter ligated’ reads must be for PE palindrome read alignment.
  • simple_clip_threshold: Specifies how accurate the match between any adapter etc. sequence must be against a read.
  • trailing: Specifies the minimum quality required to keep a base.
  • seed_mismatch: Specifies the maximum mismatch count which will still allow a full match to be performed.
  • window_size: Specifies the number of bases to average across.
  • window_quality: Specifies the average quality required.
  • strictness: This value, which should be set between 0 and 1, specifies the balance between preserving as much read length as possible vs. removal of incorrect bases. A low value of this parameter favours longer reads, while a high value favours read correctness.
  • target_length: This specifies the read length which is likely to allow the location of the read within the target sequence to be determined.
  • jarfile: Path to the trimmomatic JAR file on your system. (You don’t need to set it if you are using the docker container.)

idba_ud

  • mink: Minimum k value.
  • maxk: Maximum k value.
  • step: Increment of k-mer of each iteration.
  • perid: Similarity for alignment.

vizbin

  • dimension: 50
  • kmer: 5
  • size: 4
  • theta: 0.5
  • perp: 30
  • cutoff: 1000
  • jarfile: Path to the Vizbin JAR file on your system. (You don’t need to set it if you are using the docker container.)

filtering

  • filter: Name of the filter.
  • url: URL to download database.

sortmerna

  • pkg_url: URL to download the sortmerna databases from.
  • files: Databases to use and index.

prokka

  • pkg_url: URL to download the prokka databases from.
  • databases: List of databases to use.

kegg

  • db_ec2pthy and db_hierarchy: URLs to download KEGG information from.

Running IMP

The entire workflow

Here, it is assumed that your metagenomic and metatranscriptomic files are in the /input directory and are called MG.R1.fq, MG.R2.fq for the metagenomic data and MT.R1.fq, MT.R2.fq for the metatranscriptomic data. Moreover, we assume that the results are supposed to be stored in the /output folder. Please adjust these according to your needs.

We suggest that you create a file (e.g., launch_imp.sh) with the following content, which you can then conveniently adjust.

#! /bin/bash -l
CONFIGFILE="/path/to/imp/IMP/src/config.imp.json" \
MG="/input/MG.R1.fq /input/MG.R2.fq" \
MT="/input/MT.R1.fq /input/MT.R2.fq" \
OUTPUTDIR="/output" \
CONDAPATH="/path/to/imp/IMP/conda" \
DBPATH="/path/to/imp/db" \
FILTER="hg38" \
SRCDIR="/path/to/imp/IMP/src" \
LIBDIR="/path/to/imp/IMP/lib" \
IMP_ASSEMBLER="megahit" \
snakemake -s /path/to/imp/IMP/Snakefile --use-conda -p

Upon your first execution, additional dependencies might be further resolved, e.g., building the BWA index for the filter genome (default: hg38). Thus, the runtime of the initial IMP may be longer than that of subsequent runs.
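Since a full run takes a long time, it can be worth checking that all input files exist and are non-empty before launching. A minimal sketch; `check_inputs` is a hypothetical helper and not part of IMP.

```shell
# Report every listed file that is missing or empty; fail if there is any.
check_inputs() (
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            missing=1
        fi
    done
    [ "$missing" -eq 0 ]
)

# Usage (paths as assumed in the launch script):
# check_inputs /input/MG.R1.fq /input/MG.R2.fq /input/MT.R1.fq /input/MT.R2.fq
```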

N.B. IMP currently copies over and potentially unpacks your input files. Moreover, all intermediate files are currently preserved which results in comparably large sizes of the $OUTPUTDIR. We are working on improving the I/O-footprint of IMP.

Running individual steps

There are situations where one might want to run the individual steps of the pipeline incrementally.

If you want to do so, simply specify the <step> in the snakemake call (i.e., the snakemake target <step>.done), where <step> is one of preprocessing, assembly, analysis, binning, report, or workflow.

For example, to run the preprocessing step only, you could use:

#! /bin/bash -l
CONFIGFILE="/path/to/imp/IMP/src/config.imp.json" \
MG="/input/MG.R1.fq /input/MG.R2.fq" \
MT="/input/MT.R1.fq /input/MT.R2.fq" \
OUTPUTDIR="/output" \
CONDAPATH="/path/to/imp/IMP/conda" \
DBPATH="/path/to/imp/db" \
FILTER="hg38" \
SRCDIR="/path/to/imp/IMP/src" \
LIBDIR="/path/to/imp/IMP/lib" \
IMP_ASSEMBLER="megahit" \
snakemake -s /path/to/imp/IMP/Snakefile --use-conda -p preprocessing.done
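To avoid typos in the target name, the mapping from step name to snakemake target can be wrapped in a small function that rejects unknown steps. This is only a sketch; `step_target` is a hypothetical helper, and the list of steps is the one given above.

```shell
# Map a step name to its snakemake target (<step>.done); fail on unknown names.
step_target() {
    case $1 in
        preprocessing|assembly|analysis|binning|report|workflow)
            printf '%s.done\n' "$1" ;;
        *)
            echo "unknown step: $1" >&2
            return 1 ;;
    esac
}

# Usage (as the last line of the launch script, after the variable assignments):
# snakemake -s /path/to/imp/IMP/Snakefile --use-conda -p "$(step_target assembly)"
```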

Deactivating the conda environment

You can deactivate the conda environment using the following command:

source deactivate

Thank you for using IMP!

When you use IMP and cite it, please do not forget to cite the respective tools that are used. Without the efforts of their respective authors, we could not

stand on the shoulders of giants

Questions, Suggestions, Issues

Facing problems installing or running IMP? Feel free to open an issue on the GitLab issues page, where you can get in touch with the developers directly.

Website commit id: 299f120. Commit date: Fri Dec 14 18:22:42 2018 +0100.