Last updated: 2021-03-04

Checks: 6 1

Knit directory: PSYMETAB/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Using absolute paths to the files within your workflowr project makes it difficult for you and others to run your code on a different machine. Change the absolute path(s) below to the suggested relative path(s) to make your code more reproducible.

absolute relative
/data/sgg2/jenny/projects/PSYMETAB .

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    ._docs
    Ignored:    .drake/
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/._GWAS.Rmd
    Ignored:    analysis/._data_processing_in_genomestudio.Rmd
    Ignored:    analysis/._quality_control.Rmd
    Ignored:    analysis/GWAS/
    Ignored:    analysis/GWAS_results.knit.md
    Ignored:    analysis/GWAS_results.utf8.md
    Ignored:    analysis/PRS/
    Ignored:    analysis/QC/
    Ignored:    analysis/Rlogo2.png
    Ignored:    analysis/figure/
    Ignored:    analysis/rplot.jpg
    Ignored:    analysis_prep_10_clustermq.out
    Ignored:    analysis_prep_11_clustermq.out
    Ignored:    analysis_prep_12_clustermq.out
    Ignored:    analysis_prep_1_clustermq.out
    Ignored:    analysis_prep_2_clustermq.out
    Ignored:    analysis_prep_3_clustermq.out
    Ignored:    analysis_prep_4_clustermq.out
    Ignored:    analysis_prep_5_clustermq.out
    Ignored:    analysis_prep_6_clustermq.out
    Ignored:    analysis_prep_7_clustermq.out
    Ignored:    analysis_prep_8_clustermq.out
    Ignored:    analysis_prep_9_clustermq.out
    Ignored:    data/processed/
    Ignored:    data/raw/
    Ignored:    packrat/lib-R/
    Ignored:    packrat/lib-ext/
    Ignored:    packrat/lib/
    Ignored:    process_init_10_clustermq.out
    Ignored:    process_init_11_clustermq.out
    Ignored:    process_init_12_clustermq.out
    Ignored:    process_init_13_clustermq.out
    Ignored:    process_init_14_clustermq.out
    Ignored:    process_init_15_clustermq.out
    Ignored:    process_init_16_clustermq.out
    Ignored:    process_init_17_clustermq.out
    Ignored:    process_init_18_clustermq.out
    Ignored:    process_init_19_clustermq.out
    Ignored:    process_init_1_clustermq.out
    Ignored:    process_init_20_clustermq.out
    Ignored:    process_init_21_clustermq.out
    Ignored:    process_init_22_clustermq.out
    Ignored:    process_init_2_clustermq.out
    Ignored:    process_init_3_clustermq.out
    Ignored:    process_init_4_clustermq.out
    Ignored:    process_init_5_clustermq.out
    Ignored:    process_init_6_clustermq.out
    Ignored:    process_init_7_clustermq.out
    Ignored:    process_init_8_clustermq.out
    Ignored:    process_init_9_clustermq.out
    Ignored:    prs_1_clustermq.out
    Ignored:    prs_2_clustermq.out
    Ignored:    prs_3_clustermq.out
    Ignored:    prs_4_clustermq.out
    Ignored:    prs_5_clustermq.out
    Ignored:    prs_6_clustermq.out
    Ignored:    prs_7_clustermq.out
    Ignored:    prs_8_clustermq.out
    Ignored:    ukbb_analysis_10_clustermq.out
    Ignored:    ukbb_analysis_11_clustermq.out
    Ignored:    ukbb_analysis_12_clustermq.out
    Ignored:    ukbb_analysis_13_clustermq.out
    Ignored:    ukbb_analysis_14_clustermq.out
    Ignored:    ukbb_analysis_15_clustermq.out
    Ignored:    ukbb_analysis_16_clustermq.out
    Ignored:    ukbb_analysis_17_clustermq.out
    Ignored:    ukbb_analysis_18_clustermq.out
    Ignored:    ukbb_analysis_19_clustermq.out
    Ignored:    ukbb_analysis_1_clustermq.out
    Ignored:    ukbb_analysis_20_clustermq.out
    Ignored:    ukbb_analysis_21_clustermq.out
    Ignored:    ukbb_analysis_22_clustermq.out
    Ignored:    ukbb_analysis_2_clustermq.out
    Ignored:    ukbb_analysis_3_clustermq.out
    Ignored:    ukbb_analysis_4_clustermq.out
    Ignored:    ukbb_analysis_5_clustermq.out
    Ignored:    ukbb_analysis_6_clustermq.out
    Ignored:    ukbb_analysis_7_clustermq.out
    Ignored:    ukbb_analysis_8_clustermq.out
    Ignored:    ukbb_analysis_9_clustermq.out

Untracked files:
    Untracked:  Rlogo.png
    Untracked:  Rlogo2.png
    Untracked:  analysis_prep.log
    Untracked:  download_impute.log
    Untracked:  extract_sig.log
    Untracked:  grs.log
    Untracked:  init_analysis.log
    Untracked:  output/PSYMETAB_GWAS_UKBB_comparison.csv
    Untracked:  output/PSYMETAB_GWAS_UKBB_comparison2.csv
    Untracked:  output/PSYMETAB_GWAS_baseline_CEU_result.csv
    Untracked:  output/PSYMETAB_GWAS_subgroup_CEU_result.csv
    Untracked:  output/coffee_consumed_Neale_UKBB_analysis.csv
    Untracked:  process_init.log
    Untracked:  prs.log
    Untracked:  rplot.jpg
    Untracked:  test
    Untracked:  ukbb_analysis.log

Unstaged changes:
    Modified:   analysis/plans.Rmd
    Modified:   cache_log.csv
    Modified:   post_impute.log

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File Version Author Date Message
Rmd 7a65110 Jenny Sjaarda 2021-03-04 Rebuilding genetic QC Rmd
html eabaa1a Jenny Sjaarda 2021-03-03 Build site.
Rmd ae05ca9 Jenny Sjaarda 2021-03-03 Testing genetic QC Rmd
html ec83228 Jenny Sjaarda 2021-03-03 Build site.
Rmd cf00a10 Jenny Sjaarda 2021-03-03 Testing genetic QC Rmd
html 6d2a07e Jenny Sjaarda 2021-03-03 Build site.
Rmd 105ad4f Jenny Sjaarda 2021-03-03 Testing genetic QC Rmd
Rmd 551bdb5 Jenny Sjaarda 2021-03-03 testing genetics_qc
html 843aad9 Jenny Sjaarda 2021-03-03 Build site.
Rmd e98e3a0 Jenny Sjaarda 2021-03-03 testing genetics_qc
Rmd 941b66d Jenny Sjaarda 2021-03-02 add new Rmd files and respective html files
html 941b66d Jenny Sjaarda 2021-03-02 add new Rmd files and respective html files

The following document outlines and summarizes the genetic quality control and processing procedure that was followed to create a clean, imputed dataset.

Step 1: Prepare and cluster genomestudio files.

Step 1 was performed entirely on CHUV computer

Part A: Randomize IDs.

  • Genetic sampleIDs were recoded according to GPCR algorithm to ensure genetic participants are not identifiable.
  • code/radomize_IDs.r was run on CHUV computer before building GenomeStudio project.
  • Creates a new csv file which was used to create a GenomeStudio project with data provided by lab in Geneva.
  • Requires manual addition of header before uploading to GenomeStudio.
[Header],,,,,,,,,,,,,
Investigator Name,,,,,,,,,,,,,
Project Name,,,,,,,,,,,,,
Experiment Name,,,,,,,,,,,,,
Date,,,,,,,,,,,,,
[Manifests],,,,,,,,,,,,,
A,GSA_UPPC_20023490X357589_A1,,,,,,,,,,,,
[Data],,,,,,,,,,,,,
  • Some samples were found to be duplicates (i.e. 2 samples at 2 different time points were analyzed for the same individual) and they were recoded to have ID as: ${ID}002.

Part B: Create GenomeStudio files.

  • Instructions can be found here.
  • Required files:
  • Sample sheet: as csv file (created above).
  • Data repository: as idat files.
  • Manifest file: as bpm file.
  • Cluster file: as egt file.
  • Data provided from Mylene Docquier, copied from sftp and saved here: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data.
  • Create new IDs based on GPCR randomization (see code/randomize_IDs.r), and save to above folder as: Eap0819_1t26_27to29corrected_7b9b_randomizedID.csv.
  • Note that original IDs can be found in the same folder at the file: Eap0819_1t26_27to29corrected_7b9.csv, if needed.
  • Create empty folder here: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS, named: GS_project_26092019 (data of creation).
  • Using new IDs, create genome studio project as follows:
  1. Open GenomeStudio.
  2. Select: File > New Genotyping Project.
  3. Select L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS as project repository.
  4. Under project name: use “GS_project_26092019” and click “Next”.
  5. Select “Use sample sheet to load intensities” and click “Next”.
  6. Select sample, data and manifests as specified below and click “Next”:
    • Sample sheet: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data\Eap0819_1t26_27to29corrected_7b9b_randomizedID.csv,
    • Data repository: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data,
    • Manifest repository: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data.
  7. Select “Import cluster positions from cluster file” and choose cluster file located here: L:\PCN\UBPC\ANALYSES_RECHERCHE\Jenny\PSYMETAB_GWAS\data\GSPMA24v1_0-A_4349HNR_Samples.egt and click “Finish”.
  • Genome studio files were created using the following required files:
  • Sample sheet: as csv file (created above).
  • Data repository: as idat files (provided from Mylene Docquier).
  • Manifest file: as bpm file (provided from Mylene Docquier).
  • Cluster file: as egt file (provided from Mylene Docquier).

Part D: Copy files to SGG server.

  • PLINK files were copied to SGG servers using FileZilla
  • Host name: je4649@hpc1.chuv.ch
  • Password: <chuv-password>
  • Port: 22
  • Output saved to: /data/sgg2/jenny/projects/PSYMETAB_GWAS/data/raw.

All subsequent steps were performed on the sgg server and run using drake plan

Step 2: Pre-quality control data prep.

See qc_prep drake plan in code/plans.sh.

  • Processed sex and ethnicity files to be used in QC scripts.
  • Sex file was created according to the input specified on plink man page (FID, IID, sex [M/F]).
  • Ethnicity input file to be used in R script for comparison to genetically derived ethnic groups (by snpweights).
  • Recodes ethnic groups as follows:
  • Changes French codes to English.
  • Changes missing to unknown.
  • Groups small ethnic groups to missing.
  • A1 rsid conversion file was updated to remove all SNPs labeled with a [.] (see data sources).
  • Create duplicates file

Step 3: Pre-imputation quality control.

Results of Step 3-6 are saved to analysis/QC. The majority of analyses were performed using PLINK (either version 2.0 or 1.9) Each sub-spet (i.e. 0-15) corresponds to one folder within analysis/QC

Source code for Step 3 can be found at: code/pre_imputation_qc.sh.

0. Preprocessing.

  1. Create binary PLINK files (if necessary):
    • 761641 variants initially.
    • 2767 individuals initially.
  2. Exclude Y and MT variants:
    • 11641 Y / MT variants removed.
    • 750000 variants remaining.
  3. Remove duplicate variants:
    • 7945 duplicates removed.
    • 742055 variants remaining.
  4. Update sex:
    • Using the file located at: data/processed/phenotype_data/PSYMETAB_GWAS_sex.txt (created above).
    
       F    M 
    1298 1469 
  5. Remove duplicate individuals that were identified previously (i.e. two different IDs for the same participant):
    • 2 duplicates identified and removed
    • 2752 indviduals remaining.
  6. Sanity check: ensure that duplicates have the same genetic info"
    • From the PLINK webpage: Note that KING kinship coefficients are scaled such that duplicate samples have kinship 0.5, not 1. First-degree relations (parent-child, full siblings) correspond to ~0.25, second-degree relations correspond to ~0.125, etc.
    • If all duplicate ID samples have KINSHIP ~0.5, then indeed they are genetic duplicates.
   #FID1         ID1 FID2      ID2   NSNP   HETHET          IBS0  KINSHIP
1   2071 BEEEDIGO002  224 BEEEDIGO 703110 0.178137 0.00000000000 0.499539
2   1873 CQLIXEZP002   64 CQLIXEZP 703045 0.153413 0.00000142238 0.499504
3   1965 EFWKQOIK002 1433 EFWKQOIK 697403 0.151525 0.00000860335 0.496680
4   1886 HFNWJHCI002 1448 HFNWJHCI 702845 0.153089 0.00000426837 0.499547
5   2075 HROOJNCI002  553 HROOJNCI 702167 0.155970 0.00000284833 0.499257
6   1974 IOAWLZGK002  549 IOAWLZGK 704278 0.153028 0.00000000000 0.499847
7   2314 KLFEBCIE002 1916 KLFEBCIE 700799 0.153949 0.00000570777 0.499007
8   2073 LWCGLSDP002  317 LWCGLSDP 702226 0.150114 0.00000427213 0.499363
9   2379 PBAIFEMQ002 2070 PBAIFEMQ 700642 0.154083 0.00000285452 0.498820
10  2009 PNWDYVRH002  494 PNWDYVRH 703993 0.153736 0.00000284094 0.499806
11  2068 QHNUPGWK002  318 QHNUPGWK 702891 0.154500 0.00000569078 0.499363
12  1928 QZAUHIPY002  559 QZAUHIPY 702896 0.144711 0.00000142269 0.499826
13  2067 SSITXXAY002  283 SSITXXAY 702409 0.152603 0.00000284734 0.499418
14  1947 WKBFDWJF002  566 WKBFDWJF 703642 0.153506 0.00000284235 0.499783
15  1657 XABRILAR002 1385 XABRILAR 698282 0.154672 0.00000000000 0.497315

1. Strand alignment (so all SNPs are on positive strand).

  1. Update chromosome
  2. Update position
  3. Flip alleles on negative strand
  4. Extract SNPs that are not in strand file
  5. Remove any non autosomal / X chromosomal variants (might have changed due to the pos and chr update):
    • 11 duplicates removed.
    • 741937 variants remaining.
  6. Update rsids using data/processed/reference_files/rsid_conversion.txt
  7. Remove duplicate variants:
    • 6723 duplicates removed.
    • 734328 variants remaining.

2. Removal of SNPs that have MAF zero.

  1. Calculate frequency of all SNPs
  2. Remove MAF 0 SNPs:
    • 93578 variants with MAF = 0.
    • 640750 variants remaining.

3. Missingness.

  1. Exclude variants with >10% missingness (using geno --0.1):
    • 6480 variants removed.
    • 634270 variants remaining.
  2. Exclude individuals with >10% missingness (using mind --0.1):
    • 3 individuals removed.
    • 2749 individuals remaining.
  3. Exclude variants with >5% missingness (using geno --0.05):
    • 8256 variants removed.
    • 626014 variants remaining.
  4. Exclude individuals with >5% missingness (using mind --0.05):
    • 3 individuals removed.
    • 2746 individuals remaining.
  5. Exclude variants with >1% missingness (using geno --0.01):
    • 35957 variants removed.
    • 590057 variants remaining.
  6. Exclude individuals with >1% missingness (using mind --0.01):
    • 5 individuals removed.
    • 2741 individuals remaining.

*Total removed: 50693 variants (7.91%) and 11 individuals (0.40%).**

4. Sex check.

  1. Perform sex check.
  2. Remove unambiguous sex violations.
    • 26 individuals removed.
    • 2715 individuals remaining.

5. Imputation preparation.

  1. Write frequency of final QC’d file (from #4) to file (using -- freq).
  2. Using McCarthy Group Tools, QC’d files were prepared for imputation using the script HRC-1000G-check-bim-NoReadKey.pl (download link).
    • This script checks: Strand, alleles, position, Ref/Alt assignments and frequency differences.
    • Produces: A set of plink commands to update or remove SNPs based on the checks as well as a file (FreqPlot) of cohort allele frequency vs reference panel allele frequency.
    • Updates: Strand, position, ref/alt assignment.
    • Removes: A/T & G/C SNPs if MAF > 0.4, SNPs with differing alleles, SNPs with > 0.2 allele frequency difference (can be removed/changed in V4.2.2), SNPs not in reference panel
  3. Replace underscores in fam file since vcf conversion uses understcores between FID and IID.
  4. Run the generated Run-plink.sh script from #2.
  5. Sort outputed vcf files.
  6. Download zipped files to personl or CHUV files to copy to Michigan Imputation Server.

Step 4: Imputation.

Source code for Step 4 can be found at: code/download_imputation.sh and code/check_imputation.sh.

6. Run and download imputation.

  1. Upload downloaded QC’d vcf.gz files to Michigan Imputation Server as follows:
    • Select Run, Genotype Imputation (Minimac4).
    • Reference panel: HRC r1.1 2016 (GRCh37/hg19).
    • Array build: GRCh37/hg19.
    • rsq Filter: off.
    • Phasing: Eagle v2.4 (phased output).
    • Population: EUR.
    • Mode: Quality Control & Imputation.

2. Download imputation, using password from email retrieve the following files: - QC report. - QC stats. - Logs. - Imputation results.

 [1] "archive"           "chr1.info.gz"      "chr10.info.gz"    
 [4] "chr11.info.gz"     "chr12.info.gz"     "chr13.info.gz"    
 [7] "chr14.info.gz"     "chr15.info.gz"     "chr16.info.gz"    
[10] "chr17.info.gz"     "chr18.info.gz"     "chr19.info.gz"    
[13] "chr2.info.gz"      "chr20.info.gz"     "chr21.info.gz"    
[16] "chr22.info.gz"     "chr3.info.gz"      "chr4.info.gz"     
[19] "chr5.info.gz"      "chr6.info.gz"      "chr7.info.gz"     
[22] "chr8.info.gz"      "chr9.info.gz"      "qcreport.html"    
[25] "snps-excluded.txt"
   Chr Num imputed variants
1    1              3069931
2    2              3392237
3    3              2821894
4    4              2787581
5    5              2588168
6    6              2460111
7    7              2289305
8    8              2242705
9    9              1686471
10  10              1927503
11  11              1936990
12  12              1848117
13  13              1385433
14  14              1270436
15  15              1139215
16  16              1281297
17  17              1090072
18  18              1104755
19  19               868554
20  20               884983
21  21               531276
22  22               524544
23 all             39131578

7. Check imputation.

Step 5: Post imputation quality control.

9. Extract typed SNPs.

  1. Write a list of only genotypd data to run in snpweights software (using require=info "TYPED" flag).
   Chr Num imputed variants R2 filtered Typed SNPs
1    1              3069931     2223452      44252
2    2              3392237     2454557      45514
3    3              2821894     2059285      37277
4    4              2787581     2043343      34139
5    5              2588168     1887050      32223
6    6              2460111     1813223      38957
7    7              2289305     1659374      30649
8    8              2242705     1632513      28611
9    9              1686471     1218207      23815
10  10              1927503     1411585      27714
11  11              1936990     1421749      27722
12  12              1848117     1352003      26603
13  13              1385433     1019900      19346
14  14              1270436      920528      17920
15  15              1139215      816986      17002
16  16              1281297      894859      18447
17  17              1090072      771058      16646
18  18              1104755      798012      15625
19  19               868554      607422      12997
20  20               884983      631946      13557
21  21               531276      374376       7582
22  22               524544      367153       8104
23 all             39131578    28378581     544702
  1. Hard call back to genotypes (bim, bed, and fam files) using the --hard-call-threshold flag set at 0.1.
  2. Merge by chromosome genotyped data into single file.

10. Merge imputed SNPs.

  1. Convert fileset from step #8 back to vcf as there is no merge function in plink (using the flag --recode vcf id-paste=iid vcf-dosage=HDS).
  2. Merge using bcftools concat.
  3. Convert back to pgen using plink2 --vcf <output-name> dosage=HDS.

11. Relatedness (using data from #10).

  1. Calculate missingness.
  2. Remove subjects and individuals with missingness >10%.
    • 1184212 variants removed.
    • 0 individuals removed.
  3. Compute KING table (--make-king-table).
  4. Identify duplicate individuals using kinship threshold > 0.35 as recommended by PLINK.
    #FID1 ID1 FID2 ID2 NSNP HETHET IBS0 KINSHIP
    1353_WIELRZDD 1353_WIELRZDD 1140_FJNJEXCM 1140_FJNJEXCM 26566719 0.0685663 0.0000000 0.499473
    1457_ABDJGAXW 1457_ABDJGAXW 104_ZXDVAJUD 104_ZXDVAJUD 27026535 0.0515445 0.0000000 0.499956
    1821_0030GE 1821_0030GE 628_SJJCZOXT 628_SJJCZOXT 27036506 0.0520661 0.0000000 0.499937
    1937_OINDBCQM 1937_OINDBCQM 1936_SONUEGRB 1936_SONUEGRB 27021954 0.0516029 0.0000001 0.499964
    2020_JNQSZGGL 2020_JNQSZGGL 1305_XTVNZTRY 1305_XTVNZTRY 26975108 0.0526075 0.0000000 0.499941
    2084_JO001 2084_JO001 220_UUHDCMNL 220_UUHDCMNL 27017927 0.0512845 0.0000000 0.499938
    2148_JO280 2148_JO280 2146_JO276 2146_JO276 27011983 0.0519508 0.0000001 0.499942
    2187_S008 2187_S008 1298_YLXVRYDT 1298_YLXVRYDT 27014700 0.0535244 0.0000000 0.499975
    2239_S060 2239_S060 1851_0062GE 1851_0062GE 26706944 0.0543625 0.0000003 0.499584
    2271_S092 2271_S092 1813_0021GE 1813_0021GE 26980867 0.0534746 0.0000000 0.499881
    2272_S093 2272_S093 2174_JO426 2174_JO426 27029496 0.0521779 0.0000000 0.499968
    2637_139CSM 2637_139CSM 2302_HLHILECG 2302_HLHILECG 27037160 0.0520094 0.0000000 0.499967
    2678_184CSM 2678_184CSM 2042_WXPQPUWI 2042_WXPQPUWI 27013112 0.0525789 0.0000000 0.499924
    2689_198CSM 2689_198CSM 509_CHMLAUKH 509_CHMLAUKH 26954876 0.0518025 0.0000002 0.499894
    2693_202CSM 2693_202CSM 1955_OQLLRBMM 1955_OQLLRBMM 27019285 0.0515285 0.0000000 0.499944
    2760_274UAS 2760_274UAS 295_YGGZRBYJ 295_YGGZRBYJ 27047919 0.0523141 0.0000000 0.499972
    • A set of 16 duplicate particpants were found that were not expected.
    • These duplicates needed to be manually checked in database with Celine (see below).
  5. Manually determine cause for duplicates.
    • These pairs of individuals were found to be in both studies under different IDs (PSYMETAB and Severines):
    # 139CSM  HLHILECG
    # 184CSM  WXPQPUWI
    # 198CSM  CHMLAUKH
    # 202CSM  OQLLRBMM
    # 274UAS  YGGZRBYJ
    # JO001 UUHDCMNL
    # S008  YLXVRYDT
    # S092  0021GE
    • These pairs of individuals are duplicates but with different codes and we are keeping the one on the right:
    # ABDJGAXW  ZXDVAJUD
    # JO280 JO276
    # S093  JO426
    • These pair of individuals are not the same and the genetic info matches the one on the right (remove left):
    # JNQSZGGL  XTVNZTRY
    # WIELRZDD  FJNJEXCM
    # OINDBCQM  SONUEGRB
    • These pair of individuals could not be properly sorted and are therefore both being deleted:
    # S060  0062GE
    # 0030GE    SJJCZOXT
  6. Create file of duplicates to be removed (based on above criteria).
  7. Remove duplicates.
    • 2715 individuals originally.
    • 18 individuals removed.
    • 2697 individuals remaining.
  8. Recalculate kinship and identify remaining related individauls using kinship threshold > 0.0884) as recommended by KING authors.
    • #r countLines(paste0("analysis/QC/11_relatedness/", study_name, "2.related.filter.smiss"))[1] - 1 related pairs identified.
    • #r countLines(paste0("analysis/QC/11_relatedness/", study_name, "2.duplicates"))[1] - 1 duplicate pairs identified.
  9. Determine maximal set of unrelated individuals based on kinship results.
    • This was done using personal script (code/qc/relatedness_filter.r).
    • PLINK does not guarantee maximum number of samples (see here: Relationship-based pruning), where it states: PLINK tries to maximize the final sample size, but this maximum independent set problem is NP-hard, so we use a greedy algorithm which does not guarantee an optimal result.
  10. Remove related individuals (note that if using mixed model design we can include all the related IDs, but remove the problematic duplicates).
    • 50 related IDs removed.
    • 2647 individuals remaining.

12. Ethnicty check and admixture estimation.

  1. Clean typed data (from #9).
    • Remove duplicate samples (from #11).
    • Remove genotypes with missingness > 10% (using --geno 0.1).
    • Remove genotypes with MAF < 5% (using --maf 0.05).
    • Remove genotypes with HWE < 5e-4 (using --hwe 5e-4).
    • After filters, 249634 SNPs remaining.
    • After filters, 2697 individuals remaining.
  2. Run snpweights software from Alkes Price software page using SNP weights for European, West African, East Asian and Native American ancestral populations (downloaded here).
    • SNPweights is a software package for inferring genome-wide genetic ancestry using SNP weights precomputed from large external reference panels (Chen et al. 2013 Bioinformatics).
    • The software relies on external databases to infer ancestry weights, so related individuals can safely be included.
    • This software creates a file with the name ${output_name}.NA.predpc.
    • The 10 columns in the ${output_name}.NA.predpc output file are: sample ID, population label, number of SNPs used for inference, predicted PC1, predicted PC2, predicted PC3, % YRI ancestry, % CEU ancestry, % East Asian ancestry, and % Native American ancestry.
  3. Filter to Europeans.
    • Using threshold % CEU ancestry > 0.8 as done in other papers.
    • 543 non-European individuals removed.
    • 2154 European individuals remaining.
  4. Run snpweights in Europeans using SNP weights for NW, SE and AJ ancestral populations of European Americans (downloaded here).
    • The 8 columns in the ${output_name}.CEU80.EA.predpc output file are: sample ID, population label, number of SNPs used for inference, predicted PC1, predicted PC2, % Northwest European ancestry, % Southeast European ancestry, % Ashkenazi Jewish ancestry.
    • Note that nothing was done with these computed snpweights.
  5. Using data from step 1, create pruned set of SNPs.
    • Remove related individuals (idenified in #11).
    • PLINK parameters set as: --indep-pairwise 50 5 0.2.
    • After pruning, 152879 SNPs removed
    • Could consider excluding regions suggested by flaspca by adding the following line: --exclude range $flashpca/exclusion_regions_hg19.txt.
  6. Create an unrelated dataset for calculating PCs using snps from step 5.
    • After pruning, 96755 SNPs remaining.
    • 50 related samples removed.
    • 2647 unrelated samples remaining.
  7. Create a full set for projecting PCs with same set of pruned SNPs.
    • As above, 96755 SNPs remaining.
    • No samples removed, 2697 samples in total.
  8. Perform PCA using flashpca in unrelated set (from step 6).
  9. Perform PCA projections onto related sample set (from 12.7)
  10. Split data into seperate ethnic groups as determined in 12.2.
    • For all ethnic groups except YRI (African), use a threshold of % ancestry > 0.8.
    • For African, use a threshold of % ancestry > 0.7.
    • If a participant has no ethnic group that meets this criteria (note that it is impossible to follow into multiple categories), they are classified as MIXED, in this way 5 ethnic groups remain: CEU (European), EA (East Asian), NA (Native American), YRI (West African), and MIXED.
      AFRICAN EAST_ASIAN EUROPEAN LATIN OTHER SOUTH_ASIAN UNKNOWN Total
CEU         0          0     1540     0    72           1     541  2154
EA          0         17        0     0     2           3       7    29
MIXED      54          4       79     0    99          55     135   426
NA          0          0        0     1     3           0       1     5
YRI        47          4        1     1     3           0      27    83

Version Author Date
ec83228 Jenny Sjaarda 2021-03-03

Version Author Date
ec83228 Jenny Sjaarda 2021-03-03

13. HWE check.

  1. Using imputed data from #10, remove duplicates (from #11).
  2. Split data into separate ethnicity files.
  3. Within each ethnic group:
    • Remove related IDs (from #11).
    • If remaing n is < 100 ignore.
    • If remaining n is > 100, evaluate HWE for each variant.
  4. Identify SNPs failing HWE in at least one ethnic group.
  5. Remove HWE violations (p < 0.00000001).
    • 21062 SNPs removed.
    • 28357543 SNPs remaining.

14. MAF check.

  1. Using HWE filtered file (from #11 - which has duplicates removed), split data in separate ethnicity files.
  2. Within each ethnic group with n > 100 (related individuals removed), calculate allele frequency.
  3. Filter to only SNPs that have at least one MAF > threshold (0.01).
    • 17975192 SNPs removed.
    • 10382351 SNPs remaining.

Step 6: Final processing.

Source code for Step 5 can be found at: code/final_processing.sh.

15. Final processing.

  1. Seperate into ethnic groups with n > 100 and one file with all samples included (using data from #12).
  2. Convert each ethnic subset and full file to bgen format (some programs require this format).
  3. Convert each ethnic subset and full file to vcf format (some programs require this format).
  4. Create imputation info file (could be used to further filter SNPs later).
  5. Within each ethnic specific file, calculate final intra-ethnics PCs:
    1. Extract ethnicity samples from typed file with duplicates removed (from #12).
    2. Re-clean data within each ethnicity before final PCA (as before, using the followings flags: --geno 0.1, --maf 0.05, --hwe 5e-4).
    3. Prune, excluding related individuals (as before, using: --indep-pairwise 50 5 0.2).
      • PLINK2 will give an error if there are less than 50 samples to caclulate LD estimates.
      • This can be overriden --bad-ld, but they suggest that this is almost always a bad idea.
      • Of the five ethnic groups created in #12, the following groups did not proceed past this step due to insufficient sample size:
    
    
    PCs were not successfully computed for ethnic group: EA. 
    The indep log file reads: 
    
     [1] "PLINK v2.00a3LM 64-bit Intel (10 Mar 2020)"                                                     
     [2] "Options in effect:"                                                                             
     [3] "  --bfile EA/PSYMETAB_GWAS.EA.QC"                                                               
     [4] "  --indep-pairwise 50 5 0.2"                                                                    
     [5] "  --out EA/PSYMETAB_GWAS.EA.indep"                                                              
     [6] "  --remove ../../11_relatedness/PSYMETAB_GWAS_related_ids.txt"                                  
     [7] "  --threads 16"                                                                                 
     [8] ""                                                                                               
     [9] "Hostname: node01.cluster"                                                                       
    [10] "Working directory: /data/sgg2/jenny/projects/PSYMETAB/analysis/QC/15_final_processing/final_pca"
    [11] "Start time: Mon Jun 29 18:24:51 2020"                                                           
    [12] ""                                                                                               
    [13] "Random number seed: 1593447891"                                                                 
    [14] "128666 MiB RAM detected; reserving 64333 MiB for main workspace."                               
    [15] "Using up to 16 threads (change this with --threads)."                                           
    [16] "29 samples (0 females, 0 males, 29 ambiguous; 29 founders) loaded from"                         
    [17] "EA/PSYMETAB_GWAS.EA.QC.fam."                                                                    
    [18] "262668 variants loaded from EA/PSYMETAB_GWAS.EA.QC.bim."                                        
    [19] "Note: No phenotype data present."                                                               
    [20] "--remove: 28 samples remaining."                                                                
    [21] "28 samples (0 females, 0 males, 28 ambiguous; 28 founders) remaining after main"                
    [22] "filters."                                                                                       
    [23] "Error: This run estimates linkage disequilibrium between variants, but there"                   
    [24] "are less than 50 samples to estimate from.  You should perform this operation"                  
    [25] "on a larger dataset."                                                                           
    [26] "(Strictly speaking, you can also override this error with --bad-ld, but this is"                
    [27] "almost always a bad idea.)"                                                                     
    [28] ""                                                                                               
    [29] "End time: Mon Jun 29 18:24:51 2020"                                                             
    
    --------------------------------------------------------------------------- 
    
    
    PCs were not successfully computed for ethnic group: NA. 
    The indep log file reads: 
    
     [1] "PLINK v2.00a3LM 64-bit Intel (10 Mar 2020)"                                                     
     [2] "Options in effect:"                                                                             
     [3] "  --bfile NA/PSYMETAB_GWAS.NA.QC"                                                               
     [4] "  --indep-pairwise 50 5 0.2"                                                                    
     [5] "  --out NA/PSYMETAB_GWAS.NA.indep"                                                              
     [6] "  --remove ../../11_relatedness/PSYMETAB_GWAS_related_ids.txt"                                  
     [7] "  --threads 16"                                                                                 
     [8] ""                                                                                               
     [9] "Hostname: node01.cluster"                                                                       
    [10] "Working directory: /data/sgg2/jenny/projects/PSYMETAB/analysis/QC/15_final_processing/final_pca"
    [11] "Start time: Mon Jun 29 18:25:50 2020"                                                           
    [12] ""                                                                                               
    [13] "Random number seed: 1593447950"                                                                 
    [14] "128666 MiB RAM detected; reserving 64333 MiB for main workspace."                               
    [15] "Using up to 16 threads (change this with --threads)."                                           
    [16] "5 samples (0 females, 0 males, 5 ambiguous; 5 founders) loaded from"                            
    [17] "NA/PSYMETAB_GWAS.NA.QC.fam."                                                                    
    [18] "208292 variants loaded from NA/PSYMETAB_GWAS.NA.QC.bim."                                        
    [19] "Note: No phenotype data present."                                                               
    [20] "--remove: 5 samples remaining."                                                                 
    [21] "5 samples (0 females, 0 males, 5 ambiguous; 5 founders) remaining after main"                   
    [22] "filters."                                                                                       
    [23] "Error: This run estimates linkage disequilibrium between variants, but there"                   
    [24] "are less than 50 samples to estimate from.  You should perform this operation"                  
    [25] "on a larger dataset."                                                                           
    [26] "(Strictly speaking, you can also override this error with --bad-ld, but this is"                
    [27] "almost always a bad idea.)"                                                                     
    [28] ""                                                                                               
    [29] "End time: Mon Jun 29 18:25:50 2020"                                                             
    
    --------------------------------------------------------------------------- 
    1. Create an unrelated set for calculating PCs with same set of pruned SNPs.
    2. Create a full set based on pruning in unrelated samples.
    3. Compute PCA components in unrelated set.
    4. Project onto full set.

In general, clean, ethnic specific imputed data were used for analyses (from #15.1), and ethnic-specific PCs were used as adjustments (from #15.5).

16. Compressing large intermediate files.

Run the following commands to compress some large intermediate files.

project_dir="/data/sgg2/jenny/projects/PSYMETAB"
output_dir=$project_dir/analysis/QC

cd $output_dir/06_imputation_get
mkdir archive
tar --exclude='*.info.gz' --exclude qcreport.html --exclude snps-excluded.txt --exclude archive -czvf archive/06_imputation_get.tar.gz * --remove-files

cd $output_dir/10_merge_imputed
mkdir archive
tar --exclude='*.psam' --exclude='*.pvar' -czvf archive/10_merge_imputed.tar.gz * --remove-files

cd $output_dir/13_hwecheck
mkdir archive
tar --exclude='*.hardy.sig.unique' --exclude '*.hwecheck.step13.pvar' --exclude archive -czvf archive/13_hwecheck.tar.gz * --remove-files

cd $output_dir/14_mafcheck
mkdir archive
tar --exclude='*_low_maf_snps.txt' --exclude '*/*.afreq' --exclude '*.mafcheck.step14.pvar' --exclude archive -czvf archive/14_mafcheck.tar.gz * --remove-files

If you need to unzip these files, it can be done as follows: tar xvzf file.tar.gz


sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRblas.so
LAPACK: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRlapack.so

locale:
[1] en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tidyselect_0.2.5        rbgen_0.1               ukbtools_0.11.3        
 [4] hrbrthemes_0.8.0        OpenImageR_1.1.6        fuzzyjoin_0.1.5        
 [7] kableExtra_1.1.0        R.utils_2.9.2           R.oo_1.23.0            
[10] R.methodsS3_1.7.1       TwoSampleMR_0.4.25      reader_1.0.6           
[13] NCmisc_1.1.6            optparse_1.6.4          readxl_1.3.1           
[16] ggthemes_4.2.0          tryCatchLog_1.1.6       futile.logger_1.4.3    
[19] DataExplorer_0.8.0      taRifx_1.0.6.1          qqman_0.1.4            
[22] MASS_7.3-51.5           bit64_0.9-7             bit_1.1-14             
[25] rslurm_0.5.0            rmeta_3.0               devtools_2.2.1         
[28] usethis_1.5.1           data.table_1.12.8       clustermq_0.8.8.1      
[31] future.batchtools_0.8.1 future_1.15.1           rlang_0.4.5            
[34] knitr_1.26              drake_7.12.0.9000       forcats_0.4.0          
[37] stringr_1.4.0           dplyr_0.8.3             purrr_0.3.3            
[40] readr_1.3.1             tidyr_1.0.3             tibble_2.1.3           
[43] ggplot2_3.3.2           tidyverse_1.3.0         pacman_0.5.1           
[46] processx_3.4.1          workflowr_1.6.0        

loaded via a namespace (and not attached):
  [1] backports_1.1.6      systemfonts_0.2.3    plyr_1.8.5          
  [4] igraph_1.2.5         storr_1.2.1          listenv_0.8.0       
  [7] digest_0.6.25        foreach_1.4.7        htmltools_0.4.0     
 [10] tiff_0.1-5           fansi_0.4.1          magrittr_1.5        
 [13] checkmate_1.9.4      memoise_1.1.0        base64url_1.4       
 [16] doParallel_1.0.15    remotes_2.1.0        globals_0.12.5      
 [19] extrafont_0.17       modelr_0.1.5         extrafontdb_1.0     
 [22] prettyunits_1.1.0    jpeg_0.1-8.1         colorspace_1.4-1    
 [25] rvest_0.3.5          rappdirs_0.3.1       haven_2.2.0         
 [28] xfun_0.11            callr_3.4.0          crayon_1.3.4        
 [31] jsonlite_1.6         iterators_1.0.12     brew_1.0-6          
 [34] glue_1.4.0           gtable_0.3.0         webshot_0.5.2       
 [37] pkgbuild_1.0.6       Rttf2pt1_1.3.8       scales_1.1.0        
 [40] futile.options_1.0.1 DBI_1.1.0            Rcpp_1.0.3          
 [43] xtable_1.8-4         viridisLite_0.3.0    progress_1.2.2      
 [46] txtq_0.2.0           htmlwidgets_1.5.1    httr_1.4.1          
 [49] getopt_1.20.3        calibrate_1.7.5      ellipsis_0.3.0      
 [52] farver_2.0.1         XML_3.98-1.20        pkgconfig_2.0.3     
 [55] dbplyr_1.4.2         labeling_0.3         reshape2_1.4.3      
 [58] later_1.0.0          munsell_0.5.0        cellranger_1.1.0    
 [61] tools_3.5.3          cli_2.0.1            generics_0.0.2      
 [64] broom_0.5.3          fastmap_1.0.1        evaluate_0.14       
 [67] yaml_2.2.0           fs_1.3.1             packrat_0.5.0       
 [70] nlme_3.1-143         mime_0.8             whisker_0.4         
 [73] formatR_1.7          proftools_0.99-2     xml2_1.2.2          
 [76] compiler_3.5.3       rstudioapi_0.10      png_0.1-7           
 [79] filelock_1.0.2       testthat_2.3.1       reprex_0.3.0        
 [82] stringi_1.4.5        highr_0.8            ps_1.3.0            
 [85] desc_1.2.0           gdtools_0.2.2        lattice_0.20-38     
 [88] vctrs_0.2.4          pillar_1.4.3         lifecycle_0.1.0     
 [91] networkD3_0.4        httpuv_1.5.2         R6_2.4.1            
 [94] promises_1.1.0       gridExtra_2.3        sessioninfo_1.1.1   
 [97] codetools_0.2-16     lambda.r_1.2.4       assertthat_0.2.1    
[100] pkgload_1.0.2        rprojroot_1.3-2      withr_2.1.2         
[103] batchtools_0.9.12    parallel_3.5.3       hms_0.5.3           
[106] grid_3.5.3           rmarkdown_1.18       git2r_0.26.1        
[109] shiny_1.4.0          lubridate_1.7.4