Last updated: 2020-06-23
Checks: 6 passed, 1 warning
Knit directory: PSYMETAB/
This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.
absolute | relative |
---|---|
/data/sgg2/jenny/projects/PSYMETAB | . |
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: ._docs
Ignored: .drake/
Ignored: analysis/.Rhistory
Ignored: analysis/._GWAS.Rmd
Ignored: analysis/._data_processing_in_genomestudio.Rmd
Ignored: analysis/._quality_control.Rmd
Ignored: analysis/GWAS/
Ignored: analysis/PRS/
Ignored: analysis/QC/
Ignored: analysis/figure/
Ignored: analysis_prep_1_clustermq.out
Ignored: analysis_prep_2_clustermq.out
Ignored: analysis_prep_3_clustermq.out
Ignored: analysis_prep_4_clustermq.out
Ignored: data/processed/
Ignored: data/raw/
Ignored: download_impute_1_clustermq.out
Ignored: init_analysis_1_clustermq.out
Ignored: init_analysis_2_clustermq.out
Ignored: init_analysis_3_clustermq.out
Ignored: init_analysis_4_clustermq.out
Ignored: init_analysis_5_clustermq.out
Ignored: init_analysis_6_clustermq.out
Ignored: packrat/lib-R/
Ignored: packrat/lib-ext/
Ignored: packrat/lib/
Ignored: post_impute_1_clustermq.out
Ignored: pre_impute_qc_1_clustermq.out
Ignored: process_init_10_clustermq.out
Ignored: process_init_11_clustermq.out
Ignored: process_init_12_clustermq.out
Ignored: process_init_13_clustermq.out
Ignored: process_init_14_clustermq.out
Ignored: process_init_15_clustermq.out
Ignored: process_init_16_clustermq.out
Ignored: process_init_17_clustermq.out
Ignored: process_init_18_clustermq.out
Ignored: process_init_19_clustermq.out
Ignored: process_init_1_clustermq.out
Ignored: process_init_20_clustermq.out
Ignored: process_init_21_clustermq.out
Ignored: process_init_22_clustermq.out
Ignored: process_init_23_clustermq.out
Ignored: process_init_24_clustermq.out
Ignored: process_init_25_clustermq.out
Ignored: process_init_26_clustermq.out
Ignored: process_init_27_clustermq.out
Ignored: process_init_28_clustermq.out
Ignored: process_init_29_clustermq.out
Ignored: process_init_2_clustermq.out
Ignored: process_init_30_clustermq.out
Ignored: process_init_31_clustermq.out
Ignored: process_init_3_clustermq.out
Ignored: process_init_4_clustermq.out
Ignored: process_init_5_clustermq.out
Ignored: process_init_6_clustermq.out
Ignored: process_init_7_clustermq.out
Ignored: process_init_8_clustermq.out
Ignored: process_init_9_clustermq.out
Ignored: prs_1_clustermq.out
Ignored: prs_2_clustermq.out
Ignored: prs_3_clustermq.out
Ignored: prs_4_clustermq.out
Untracked files:
Untracked: analysis/genetic_quality_control.Rmd
Untracked: analysis/plans.Rmd
Untracked: analysis_prep.log
Untracked: download_impute.log
Untracked: grs.log
Untracked: init_analysis.log
Untracked: process_init.log
Untracked: prs.log
Unstaged changes:
Modified: analysis/GWAS.Rmd
Modified: analysis/data_sources.Rmd
Modified: analysis/index.Rmd
Modified: analysis/pheno_quality_control.Rmd
Deleted: analysis/project.Rmd
Modified: cache_log.csv
Modified: post_impute.log
Modified: slurm_clustermq.tmpl
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | cb27d68 | Jenny | 2020-02-05 | include generated reports/ |
Rmd | bc8b30e | Sjaarda Jennifer Lynn | 2020-01-22 | typo in folder creation |
Rmd | 015e3ac | Jenny Sjaarda | 2020-01-13 | revert to older setup instructions with packrat |
Rmd | 4447514 | Jenny Sjaarda | 2020-01-13 | revert to old setup instruction |
Rmd | 642f3f1 | Jenny | 2020-01-13 | change extraction folder name |
Rmd | b7954e7 | Sjaarda Jennifer Lynn | 2020-01-09 | change extraction folder name to extractions |
Rmd | 8537fee | Sjaarda Jennifer Lynn | 2020-01-09 | misc typos |
Rmd | 253277c | Jenny | 2020-01-09 | update project details |
html | 81ca4ed | Jenny | 2019-12-19 | Build site. |
Rmd | 817ea9e | Jenny | 2019-12-19 | modify setup description |
Rmd | e6f7fb5 | Jenny | 2019-12-17 | improve website |
Rmd | f1c2d32 | Jenny | 2019-12-16 | revision to plan based on drake suggestions |
Rmd | 1da9370 | Jenny | 2019-12-12 | load drake package |
html | 46477dd | Jenny Sjaarda | 2019-12-06 | Build site. |
Rmd | b503ef0 | Sjaarda Jennifer Lynn | 2019-12-06 | add more details to website |
html | b6cb027 | Jenny Sjaarda | 2019-12-06 | Build site. |
Rmd | bee9ea8 | Sjaarda Jennifer Lynn | 2019-12-06 | add step for using wflow_status() |
Rmd | e430d04 | Sjaarda Jennifer Lynn | 2019-12-06 | modify commiting instructions |
html | d1e539c | Jenny Sjaarda | 2019-12-06 | Build site. |
Rmd | 487b5f5 | Sjaarda Jennifer Lynn | 2019-12-06 | update website, add qc description |
html | 9f1ba5e | Jenny Sjaarda | 2019-12-06 | Build site. |
Rmd | 5e454c3 | Sjaarda Jennifer Lynn | 2019-12-06 | add more details to website |
Rmd | d480e35 | Jenny | 2019-12-04 | misc annotations |
html | 125be8c | Jenny Sjaarda | 2019-12-02 | build website |
Rmd | 179fb3b | Jenny | 2019-12-02 | eval false to drake launch |
Rmd | 0dd02a7 | Jenny | 2019-12-02 | modify website |
html | 2849dcb | Jenny Sjaarda | 2019-12-02 | wflow_git_commit(all = T) |
Rmd | 49a7ba9 | Sjaarda Jennifer Lynn | 2019-12-02 | modify git ignore |
Last updated: 2020-06-23
Code version: 468f89ecd55d9e84ca3bdd041921a20f764ee2ed
To reproduce the results from this project, please follow these instructions. In general, drake was used to manage long-running code and workflowr was used to manage the website. All processing scripts were run from the root sgg directory. The project was initialized using the workflowr R package, see here.
On sgg server:
project_name <- "PSYMETAB"
library("workflowr")
wflow_start(project_name) # creates directory called project_name
options("workflowr.view" = FALSE) # if using cluster
wflow_build() # create directories
options(workflowr.sysgit = "")
wflow_publish(c("analysis/index.Rmd", "analysis/about.Rmd", "analysis/license.Rmd"),
"Publish the initial files for myproject")
wflow_use_github("jennysjaarda") # select option 2: manually create new repository
wflow_git_push()
You have now successfully created a GitHub repository for your project that is accessible on GitHub and on the servers. Next, set up a local copy.
Within terminal of personal computer, clone the git repository.
cd ~/Dropbox/UNIL/projects/
git clone https://GitHub.com/jennysjaarda/PSYMETAB.git PSYMETAB
Open the project in Atom (or your preferred text editor) and modify the following files:
In the .Rprofile file (workflowr does not work well with the system Git, sysgit), add:
options(workflowr.sysgit = "")
options("workflowr.view" = FALSE)
In the .gitignore file, add the following lines:
analysis/*
data/*
!analysis/*.Rmd
!data/*.md
.git/
Return to sgg server and run the following:
project_dir=/data/sgg2/jenny/projects/PSYMETAB
mkdir $project_dir/data/raw
mkdir $project_dir/data/processed
mkdir $project_dir/data/raw/reference_files
mkdir $project_dir/data/raw/phenotype_data
mkdir $project_dir/data/raw/extractions
mkdir $project_dir/data/processed/phenotype_data
mkdir $project_dir/data/processed/extractions
mkdir $project_dir/docs/assets
mkdir $project_dir/docs/generated_reports
This will create the following directory structure in PSYMETAB/:
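The same subdirectories can also be created from within R instead of the shell; a minimal sketch, assuming the project path above:

```r
# Create the data/ and docs/ subdirectories from R.
# recursive = TRUE behaves like `mkdir -p`, creating parents as needed.
project_dir <- "/data/sgg2/jenny/projects/PSYMETAB"
subdirs <- c(
  "data/raw/reference_files", "data/raw/phenotype_data", "data/raw/extractions",
  "data/processed/phenotype_data", "data/processed/extractions",
  "docs/assets", "docs/generated_reports"
)
for (d in file.path(project_dir, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```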
PSYMETAB/
├── .gitignore
├── .Rprofile
├── _workflowr.yml
├── analysis/
│ ├── about.Rmd
│ ├── index.Rmd
│ ├── license.Rmd
│ └── _site.yml
├── code/
│ ├── README.md
├── data/
│ ├── README.md
│ ├── raw/
│ │ ├── phenotype_data/
│ │ ├── reference_files/
│ │ └── extractions/
│ ├── processed/
│ │ ├── phenotype_data/
│ │ ├── reference_files/
│ │ └── extractions/
├── docs/
│ ├── generated_reports/
│ └── assets/
├── PSYMETAB.Rproj
├── output/
│ └── README.md
└── README.md
Raw PLINK data (ped and map files) were copied from the CHUV folder (L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS) to the data/ directory after being built in GenomeStudio.
Packrat is a dependency management system for R. It is useful for making your project:
1. Isolated
2. Portable
3. Reproducible
Initialize a packrat project by simply running:
packrat::init("/data/sgg2/jenny/projects/PSYMETAB")
packrat::set_opts(auto.snapshot = T)
This creates a packrat directory in your project folder. Now, every time you launch R from this directory or run install.packages(), packrat will automatically keep track of your packages and versions. You are no longer in an ordinary R project; you’re in a packrat project. The main difference is that a packrat project has its own private package library. Any packages you install from inside a packrat project are only available to that project, and packages you install outside of the project are not available to the project.
We likely won’t need to do any more than this, but some additional functions exist that are useful if we need to move our project to a new disk or computer.
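For example, moving the project could look like the following sketch (the calls are standard packrat functions; the exact sequence is illustrative):

```r
# On the original machine: record current package versions in
# packrat/packrat.lock (auto.snapshot usually does this automatically).
packrat::snapshot()

# Optionally bundle the whole project (sources plus lockfile) into a tarball
# that can be copied to the new machine.
packrat::bundle()

# On the new machine: after unbundling (or copying the project directory),
# rebuild the private library from the lockfile.
packrat::restore()
```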
Note that Parts A and B happen in parallel.
library(drake)
For execution of the drake plan, see make.R. For the drake plan(s), see code/plan.R.
To run drake plans on SLURM nodes with parallel backends, there are two options:
1. clustermq: requires a zeromq installation; the template is passed using the template argument within make.
2. future (batchtools): the template is registered with future::plan() and used by make directly.
For either option, a template needs to be registered and edited manually according to our cluster’s requirements/needs. We will prepare both templates, because we will use both backends, depending on the plan.
# load and save template from `drake` using `drake_hpc_template_file` function, edit manually.
drake_hpc_template_file("slurm_clustermq.tmpl")
drake_hpc_template_file("slurm_batchtools.tmpl")
# register the plans
options(clustermq.scheduler = "slurm", clustermq.template = "slurm_clustermq.tmpl")
future::plan(batchtools_slurm, template = "slurm_batchtools.tmpl")
The files created above were edited manually to match slurm_clustermq.tmpl and slurm_batchtools.tmpl.
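With the templates registered, a plan can be executed with either backend. A hedged sketch using drake's make() (the plan object `plan` and the resource values are illustrative, not this project's actual settings):

```r
library(drake)

# Option 1: clustermq backend -- workers are submitted as a SLURM job array
# using slurm_clustermq.tmpl; the template list fills the {{ }} placeholders.
make(plan,
     parallelism = "clustermq",
     jobs = 4,
     template = list(partition = "sgg", memory = 7900))

# Option 2: future backend -- uses the future::plan(batchtools_slurm, ...)
# registered above with slurm_batchtools.tmpl.
make(plan,
     parallelism = "future",
     jobs = 4)
```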
cat(readLines('slurm_clustermq.tmpl'), sep = '\n')
#!/bin/sh
# From https://github.com/mschubert/clustermq/wiki/SLURM
#SBATCH --job-name={{ job_name }} # job name
#SBATCH --partition={{ partition }} # partition
#SBATCH --output={{ log_file | /dev/null }} # you can add .%a for array index
#SBATCH --error={{ log_file | /dev/null }} # log file
#SBATCH --mem-per-cpu={{ memory | 7900 }} # memory
#SBATCH --array=1-{{ n_jobs }} # job array
#SBATCH --cpus-per-task={{ cpus }}
# module load R # Uncomment if R is an environment module.
####ulimit -v $(( 1024 * {{ memory | 4096 }} ))
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker("{{ master }}")'
cat(readLines('slurm_batchtools.tmpl'), sep = '\n')
#!/bin/bash
## Via https://github.com/mllg/batchtools/blob/master/inst/templates/
## Job Resource Interface Definition
##
## ntasks [integer(1)]: Number of required tasks,
## Set larger than 1 if you want to further parallelize
## with MPI within your job.
## cpus [integer(1)]: Number of required cpus per task,
## Set larger than 1 if you want to further parallelize
## with multicore/parallel within each task.
## memory [integer(1)]: Memory in megabytes for each cpu.
## Default is 7900 Mo/core
## partition [string(1)]: Partition requested.
## Default is "sgg".
##
## Default resources can be set in your .batchtools.conf.R by defining the variable
## 'default.resources' as a named list.
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
#########################
# Set defaults if needed.
if (!"partition" %in% names(resources)) {
resources$partition = "sgg"
}
-%>
#SBATCH --job-name=<%= job.name %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --ntasks=1
#SBATCH --account=sgg
#SBATCH --partition=<%= resources$partition %>
<%= if (!is.null(resources[["cpus"]])) sprintf(paste0("#SBATCH --cpus-per-task='", resources[["cpus"]], "'")) %>
<%= if (array.jobs) sprintf("#SBATCH --array=1-%i", nrow(jobs)) else "" %>
<%= if (!is.null(resources[["memory"]])) sprintf(paste0("#SBATCH --mem-per-cpu='", resources[["memory"]], "'")) %>
## module add ...
## Run R:
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
Follow the general workflow outlined by workflowr, with some minor revisions to accommodate the workflow between a personal computer and the remote server:
1. Create a new R Markdown file in analysis/ (optionally using wflow_open()). Usually the file is created manually on a personal computer and pushed to the server to build later. If creating manually, add the following to the top of the R Markdown file, with an appropriate name for Title:
---
title: "Title"
site: workflowr::wflow_site
output:
  workflowr::wflow_html:
    toc: true
---
2. Write documentation and perform analyses in the R Markdown file.
3. Run commit and push to upload the revised R Markdown file to the GitHub repository.
4. On the server, pull the changes using wflow_git_pull() (or git pull from the Terminal within the cloned repository).
5. Within the R console, run wflow_build(). This will create HTML files within the docs/ folder. These files cannot be viewed directly on the server, but they can be transferred and viewed via FileZilla, or viewed directly by mounting the remote directory on your personal computer using SSHFS (recommended).
6. Return to step 2 until satisfied with the result (optionally, edit the Rmd file directly on the server using vi if only small modifications are necessary).
7. Run wflow_status() to check the status of the repository.
8. Run wflow_publish() to commit the source files (R Markdown files or other files in code/, data/, and output/), build the HTML files, and commit the HTML files. If there are uncommitted files in the directory that are not .Rmd files, wflow_publish(all = TRUE) does not work. Alternatively, run the following with an informative message:
repo_status <- wflow_status()
rmd_commit <- c(rownames(repo_status$status)[repo_status$status$modified],
                rownames(repo_status$status)[repo_status$status$unpublished],
                rownames(repo_status$status)[repo_status$status$scratch])
wflow_publish(rmd_commit,
              message = "Updating website")
9. Run wflow_git_push() (or git push in the Terminal).
sessionInfo()
# R version 3.5.3 (2019-03-11)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: CentOS Linux 7 (Core)
#
# Matrix products: default
# BLAS: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRblas.so
# LAPACK: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRlapack.so
#
# locale:
# [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
# [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
# [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
# [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
# [9] LC_ADDRESS=C LC_TELEPHONE=C
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] workflowr_1.6.0
#
# loaded via a namespace (and not attached):
# [1] Rcpp_1.0.3 rprojroot_1.3-2 packrat_0.5.0 digest_0.6.25
# [5] later_1.0.0 R6_2.4.1 backports_1.1.6 git2r_0.26.1
# [9] magrittr_1.5 evaluate_0.14 highr_0.8 stringi_1.4.5
# [13] rlang_0.4.5 fs_1.3.1 promises_1.1.0 whisker_0.4
# [17] rmarkdown_1.18 tools_3.5.3 stringr_1.4.0 glue_1.4.0
# [21] yaml_2.2.0 httpuv_1.5.2 xfun_0.11 compiler_3.5.3
# [25] htmltools_0.4.0 knitr_1.26