Phenotypic data quality control and processing

Last updated: 2021-03-10

Checks: 6 passed, 1 failed

Knit directory: PSYMETAB/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20191126)

The command set.seed(20191126) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 91fc98e

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    ._docs
    Ignored:    .drake/
    Ignored:    analysis/.Rhistory
    Ignored:    analysis/._GWAS.Rmd
    Ignored:    analysis/._data_processing_in_genomestudio.Rmd
    Ignored:    analysis/._quality_control.Rmd
    Ignored:    analysis/GWAS/
    Ignored:    analysis/PRS/
    Ignored:    analysis/QC/
    Ignored:    analysis/Rlogo2.png
    Ignored:    analysis/figure/
    Ignored:    analysis/rplot.jpg
    Ignored:    analysis_prep_10_clustermq.out
    Ignored:    analysis_prep_11_clustermq.out
    Ignored:    analysis_prep_12_clustermq.out
    Ignored:    analysis_prep_1_clustermq.out
    Ignored:    analysis_prep_2_clustermq.out
    Ignored:    analysis_prep_3_clustermq.out
    Ignored:    analysis_prep_4_clustermq.out
    Ignored:    analysis_prep_5_clustermq.out
    Ignored:    analysis_prep_6_clustermq.out
    Ignored:    analysis_prep_7_clustermq.out
    Ignored:    analysis_prep_8_clustermq.out
    Ignored:    analysis_prep_9_clustermq.out
    Ignored:    data/processed/
    Ignored:    data/raw/
    Ignored:    packrat/lib-R/
    Ignored:    packrat/lib-ext/
    Ignored:    packrat/lib/
    Ignored:    process_init_10_clustermq.out
    Ignored:    process_init_11_clustermq.out
    Ignored:    process_init_12_clustermq.out
    Ignored:    process_init_13_clustermq.out
    Ignored:    process_init_14_clustermq.out
    Ignored:    process_init_15_clustermq.out
    Ignored:    process_init_16_clustermq.out
    Ignored:    process_init_17_clustermq.out
    Ignored:    process_init_18_clustermq.out
    Ignored:    process_init_19_clustermq.out
    Ignored:    process_init_1_clustermq.out
    Ignored:    process_init_20_clustermq.out
    Ignored:    process_init_21_clustermq.out
    Ignored:    process_init_22_clustermq.out
    Ignored:    process_init_2_clustermq.out
    Ignored:    process_init_3_clustermq.out
    Ignored:    process_init_4_clustermq.out
    Ignored:    process_init_5_clustermq.out
    Ignored:    process_init_6_clustermq.out
    Ignored:    process_init_7_clustermq.out
    Ignored:    process_init_8_clustermq.out
    Ignored:    process_init_9_clustermq.out
    Ignored:    process_ukbb_10_clustermq.out
    Ignored:    process_ukbb_11_clustermq.out
    Ignored:    process_ukbb_12_clustermq.out
    Ignored:    process_ukbb_13_clustermq.out
    Ignored:    process_ukbb_14_clustermq.out
    Ignored:    process_ukbb_15_clustermq.out
    Ignored:    process_ukbb_16_clustermq.out
    Ignored:    process_ukbb_17_clustermq.out
    Ignored:    process_ukbb_18_clustermq.out
    Ignored:    process_ukbb_19_clustermq.out
    Ignored:    process_ukbb_1_clustermq.out
    Ignored:    process_ukbb_20_clustermq.out
    Ignored:    process_ukbb_21_clustermq.out
    Ignored:    process_ukbb_22_clustermq.out
    Ignored:    process_ukbb_2_clustermq.out
    Ignored:    process_ukbb_3_clustermq.out
    Ignored:    process_ukbb_4_clustermq.out
    Ignored:    process_ukbb_5_clustermq.out
    Ignored:    process_ukbb_6_clustermq.out
    Ignored:    process_ukbb_7_clustermq.out
    Ignored:    process_ukbb_8_clustermq.out
    Ignored:    process_ukbb_9_clustermq.out
    Ignored:    prs_1_clustermq.out
    Ignored:    prs_2_clustermq.out
    Ignored:    prs_3_clustermq.out
    Ignored:    prs_4_clustermq.out
    Ignored:    prs_5_clustermq.out
    Ignored:    prs_6_clustermq.out
    Ignored:    prs_7_clustermq.out
    Ignored:    prs_8_clustermq.out
    Ignored:    ukbb_analysis_10_clustermq.out
    Ignored:    ukbb_analysis_11_clustermq.out
    Ignored:    ukbb_analysis_12_clustermq.out
    Ignored:    ukbb_analysis_13_clustermq.out
    Ignored:    ukbb_analysis_14_clustermq.out
    Ignored:    ukbb_analysis_15_clustermq.out
    Ignored:    ukbb_analysis_16_clustermq.out
    Ignored:    ukbb_analysis_17_clustermq.out
    Ignored:    ukbb_analysis_18_clustermq.out
    Ignored:    ukbb_analysis_19_clustermq.out
    Ignored:    ukbb_analysis_1_clustermq.out
    Ignored:    ukbb_analysis_20_clustermq.out
    Ignored:    ukbb_analysis_21_clustermq.out
    Ignored:    ukbb_analysis_22_clustermq.out
    Ignored:    ukbb_analysis_2_clustermq.out
    Ignored:    ukbb_analysis_3_clustermq.out
    Ignored:    ukbb_analysis_4_clustermq.out
    Ignored:    ukbb_analysis_5_clustermq.out
    Ignored:    ukbb_analysis_6_clustermq.out
    Ignored:    ukbb_analysis_7_clustermq.out
    Ignored:    ukbb_analysis_8_clustermq.out
    Ignored:    ukbb_analysis_9_clustermq.out

Untracked files:
    Untracked:  Rlogo.png
    Untracked:  Rlogo2.png
    Untracked:  analysis/meetings.Rmd
    Untracked:  analysis_prep.log
    Untracked:  code/extractions/Aurelie09032021_extract.sh
    Untracked:  code/extractions/MarcodePieri_09032021_extract.sh
    Untracked:  download_impute.log
    Untracked:  extract_sig.log
    Untracked:  grs.log
    Untracked:  init_analysis.log
    Untracked:  output/PSYMETAB_GWAS_UKBB_comparison.csv
    Untracked:  output/PSYMETAB_GWAS_UKBB_comparison2.csv
    Untracked:  output/PSYMETAB_GWAS_baseline_CEU_result.csv
    Untracked:  output/PSYMETAB_GWAS_subgroup_CEU_result.csv
    Untracked:  output/coffee_consumed_Neale_UKBB_analysis.csv
    Untracked:  process_init.log
    Untracked:  process_ukbb.log
    Untracked:  prs.log
    Untracked:  rplot.jpg
    Untracked:  ukbb_analysis.log

Unstaged changes:
    Modified:   analysis/GWAS_results.Rmd
    Modified:   analysis/_site.yml
    Modified:   analysis/genetic_quality_control.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/pheno_quality_control.Rmd
    Modified:   analysis/plans.Rmd
    Modified:   analysis/setup.Rmd
    Modified:   cache_log.csv
    Modified:   code/extractions/AurelieReymond_extract.sh
    Modified:   post_impute.log

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	941b66d	Jenny Sjaarda	2021-03-02	add new Rmd files and respective html files
Rmd	85955b3	Jenny Sjaarda	2020-06-23	Update site
html	85955b3	Jenny Sjaarda	2020-06-23	Update site
html	55b0a65	Jenny Sjaarda	2020-06-23	Build site.
Rmd	9c2cda4	Jenny Sjaarda	2020-06-23	wflow_publish(“analysis/pheno_quality_control.Rmd”)

The following document outlines and summarizes the phenotypic quality control and processing procedure that was followed to create a clean dataset.

Phenotypic data was extracted and provided by Celine (see Data sources).
In January 2020, Celine detected various problems in the phenotype data that had no known explanation (possibly manual entry errors).
After discussion with Celine, Chin, and Enrique (manager of the database), it was decided to make a few changes to the way data is entered into the database to help avoid manual errors.
Summary of the agreed changes, translated from the French email correspondence between Celine and Enrique on 13/02/2020:

A short summary of what we agreed:
1. Change how a new month is entered, so that only one of the proposed values (0-1-2-3-6-…) can be selected and no other number can be typed in.
2. Add a field for the identity of the person who enters or modifies an assessment and a field for the date of that modification (proposed location: below “control”?).
3. Remove the option to carry over data from a previous month and instead clear all fields EXCEPT those entered in the “Antipsychotics” and “Co-Medication” windows.
4. Add validation of the evaluation date when the data of an entered assessment is saved.

After this update and manual revisions, new data was provided on 16/03/2020: data/raw/phenotype_data/PHENO_GWAS_160320_noaccent.csv

pheno_file <- "data/raw/phenotype_data/PHENO_GWAS_160320_noaccent.csv"

pheno_raw <- readr::read_delim(pheno_file, col_types = cols(.default = col_character()), delim = ",") %>% type_convert(col_types = cols())

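# Clean the raw phenotype extract and add the flag columns used below.
# Relies on objects defined elsewhere in the project and not shown on this page:
# the helpers check_sex(), remove_outliers(), check_height(), ever_drug(),
# check_drug() and rename_meds(), and the global leeway_time (90 days, per the
# month mismatch description).
process_pheno_raw <- function(pheno_raw) {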
  
  output <- pheno_raw %>%
    mutate_all( ~ replace(., . == 999, NA)) %>% filter(!is.na(PatientsTaille) &
                                                         !is.na(Poids)) %>%
    mutate(Date = as.Date(Date, format = '%d.%m.%y'))  %>%
    filter(!is.na(Date)) %>% arrange(PatientsRecNum, Date)  %>%
    mutate(AP1 = gsub(" ", "_", AP1)) %>% mutate_at("AP1", as.factor) %>% mutate(AP1 = gsub("_.*$", "", AP1)) %>% mutate(AP1 = na_if(AP1, "")) %>% ## merge retard/depot with original
    filter(!is.na(AP1)) %>%
    group_by(GEN) %>%  mutate(sex = check_sex(Sexe)) %>%  filter(!is.na(Sexe)) %>% ## if any sex is missing take sex from other entries
    ungroup() %>%
    filter(remove_outliers(PatientsTaille)) %>%  # this removes patients with the following heights (cm): 106 106 106 106 106 106 106 106 116  96  96  90
    #filter(remove_outliers(Poids)) %>%
    group_by(GEN) %>%
    mutate_at("PatientsTaille", as.numeric) %>% mutate(height = check_height(PatientsTaille)) %>% ### take average of all heights
    mutate_at(vars(Quetiapine:Doxepine), list(ever_drug = ever_drug)) %>% ungroup() %>%  ### create ever on any drug
    rename(weight = Poids) %>%
    mutate(BMI = weight / (height / 100) ^ 2) %>% filter(!is.na(BMI)) %>% ## create BMI
    group_by(GEN, PatientsRecNum) %>% mutate(drug_instance = row_number()) %>%
    mutate(date_difference = as.numeric(difftime(lag(Date), Date, units = "days"))) %>%
    mutate(
      AP1 = case_when(
        AP1 == "Risperdal" ~ "Risperidone",
        AP1 == "Paliperidone" ~ "Risperidone",
        TRUE ~ AP1
      )
    ) %>%
    mutate(
      follow_up = case_when(
        abs(date_difference) >= (Mois - lag(Mois)) * 30 - leeway_time &
          abs(date_difference) <= (Mois - lag(Mois)) * 30 + leeway_time ~ "sensible",
        is.na(date_difference) ~ "NA",
        Mois == 0 ~ "new_regimen",
        date_difference < 0 ~ "leeway_exceeds",
        TRUE ~ "duplicate"
      )
    ) %>%
    mutate(
      month_descrepency = case_when(
        Mois < lag(Mois) | date_difference >= 0 ~ "month_discrepency",
        TRUE ~ "sensible"
      )
    ) %>%
    mutate(drug_match = check_drug(PatientsRecNum, AP1)) %>%
    mutate(date_difference_first =  as.numeric(difftime(Date, first(Date)), units = "days")) %>%
    ungroup() %>%
    
    group_by(GEN) %>%
    mutate(AP1_mod = rename_meds(AP1, PatientsRecNum, Date)) %>%
    ungroup()
  
  return(output)

}
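The helpers called inside process_pheno_raw() are not shown on this page. As a rough illustration only, the sketch below guesses at three of them from the inline comments ("if any sex is missing take sex from other entries", "take average of all heights", "create ever on any drug"). The `*_sketch` names, their bodies, the 0/1 coding of the drug column, and the toy data are all assumptions, not the project's code.

```r
library(dplyr)

# Hypothetical stand-ins for check_sex(), check_height() and ever_drug().
check_sex_sketch <- function(sex) {
  observed <- unique(sex[!is.na(sex)])
  # Exactly one distinct non-missing value: propagate it to every row;
  # otherwise (none, or conflicting values) return a typed NA.
  value <- if (length(observed) == 1) observed else sex[NA_integer_]
  rep(value, length(sex))
}

check_height_sketch <- function(height) {
  # Average of all recorded heights for the individual.
  rep(mean(height, na.rm = TRUE), length(height))
}

ever_drug_sketch <- function(indicator) {
  # 1 if the individual was ever on the drug (assumes a 0/1 indicator column).
  rep(as.integer(any(indicator == 1, na.rm = TRUE)), length(indicator))
}

# Invented toy data: A has one missing sex, B has conflicting sexes, C has none.
toy <- tibble(
  GEN = c("A", "A", "B", "B", "C"),
  Sexe = c("F", NA, "M", "F", NA),
  PatientsTaille = c(170, 172, 180, 180, 165),
  Quetiapine = c(0, 1, 0, 0, 0)
)

toy_checked <- toy %>%
  group_by(GEN) %>%
  mutate(
    sex = check_sex_sketch(Sexe),
    height = check_height_sketch(PatientsTaille),
    Quetiapine_ever_drug = ever_drug_sketch(Quetiapine)
  ) %>%
  ungroup()

toy_checked
```

Under these assumptions, individual A gets sex "F" on both rows, while B and C end up with a missing sex, which is what the sex_problem flag further down looks for.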

options("tidylog.display" = list())  # turn off

t <- process_pheno_raw(pheno_raw)

options("tidylog.display" = NULL)    # turn on
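The AP1 clean-up inside process_pheno_raw() (strip everything after the first space or underscore, turn empty strings into NA, and recode Risperdal and Paliperidone as Risperidone) can be tried in isolation. This is a minimal base R sketch; `normalise_ap1` is a hypothetical name and the example drug strings are invented to mimic the "retard/depot" variants mentioned in the code comment.

```r
# Hypothetical helper that mirrors the AP1 steps of process_pheno_raw().
normalise_ap1 <- function(ap1) {
  # Keep only the part before the first space or underscore,
  # so that e.g. a "retard" or "depot" suffix is dropped.
  out <- sub("[ _].*$", "", ap1)
  # Empty strings become missing values.
  out[!is.na(out) & out == ""] <- NA_character_
  # Brand name and metabolite are treated as Risperidone.
  out[out %in% c("Risperdal", "Paliperidone")] <- "Risperidone"
  out
}

ap1_raw <- c("Quetiapine retard", "Risperdal", "Paliperidone depot", "", NA, "Aripiprazole")
ap1_clean <- normalise_ap1(ap1_raw)
ap1_clean
# "Quetiapine" "Risperidone" "Risperidone" NA NA "Aripiprazole"
```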

## missing date
missing_date <- t %>% filter(is.na(Date)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "missing_Date")

## missing sex
na_sex <- t %>% filter(is.na(sex)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "sex_problem")

## missing AP1
missing_AP1 <- t %>% filter(is.na(AP1)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "missing_AP1")

## missing PatientsRecNum (none)
missing_patrec <- t %>% filter(is.na(PatientsRecNum)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "missing_PatientsRecNum")

## month mismatch
leeway_exceeds <- t %>% filter(follow_up == "leeway_exceeds") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "month_mismatch")

head(leeway_exceeds)

# A tibble: 6 x 4
  GEN      Date       PatientsRecNum problem_category
  <chr>    <date>              <dbl> <chr>           
1 LHORDBHE 2008-09-21              3 month_mismatch  
2 LHORDBHE 2010-04-20              5 month_mismatch  
3 PGLTWVVK 2011-10-11             17 month_mismatch  
4 BPOCXXYD 2011-04-13             18 month_mismatch  
5 BPOCXXYD 2018-08-23             18 month_mismatch  
6 NALFXWBN 2010-04-11             19 month_mismatch

t %>% filter(GEN=="YSFHMSHX") %>% dplyr::select(follow_up, Date, date_difference, Mois)

# A tibble: 15 x 4
   follow_up      Date       date_difference  Mois
   <chr>          <date>               <dbl> <dbl>
 1 NA             2007-01-14              NA     0
 2 leeway_exceeds 2007-07-18            -185     3
 3 sensible       2007-10-28            -102     6
 4 sensible       2008-04-27            -182    12
 5 NA             2008-10-30              NA     0
 6 sensible       2008-11-26             -27     1
 7 sensible       2008-12-28             -32     3
 8 NA             2009-02-19              NA     0
 9 sensible       2009-03-18             -27     1
10 sensible       2009-04-19             -32     2
11 sensible       2009-05-25             -36     3
12 sensible       2009-08-31             -98     6
13 sensible       2009-12-07             -98     9
14 sensible       2010-03-15             -98    12
15 leeway_exceeds 2017-02-22           -2536  1204

## month discrepency
problem_ids <- t %>% filter(month_descrepency == "month_discrepency") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "month_discrepency")

head(problem_ids)

# A tibble: 6 x 4
  GEN      Date       PatientsRecNum problem_category 
  <chr>    <date>              <dbl> <chr>            
1 CZAFDOTO 2010-10-07            476 month_discrepency
2 JWRLCQCT 2017-06-20            741 month_discrepency
3 AGROMGBJ 2018-09-10            852 month_discrepency
4 PBAIFEMQ 2013-06-11           1425 month_discrepency
5 QWJBVPKW 2017-04-06           1558 month_discrepency
6 UGCKMMCC 2014-03-26           1663 month_discrepency

t %>% filter(GEN=="JWWJQJGS") %>% dplyr::select(follow_up, Date, date_difference, Mois, month_descrepency)

# A tibble: 9 x 5
  follow_up      Date       date_difference  Mois month_descrepency
  <chr>          <date>               <dbl> <dbl> <chr>            
1 NA             2010-03-02              NA     0 sensible         
2 sensible       2010-03-21             -19     1 sensible         
3 sensible       2010-06-17             -88     3 sensible         
4 sensible       2010-09-21             -96     6 sensible         
5 sensible       2011-01-24            -125    12 sensible         
6 NA             2010-01-24              NA    12 sensible         
7 leeway_exceeds 2010-03-01             -36     2 month_discrepency
8 sensible       2010-06-17            -108     6 sensible         
9 sensible       2010-09-21             -96     9 sensible
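Rows 6 to 9 of the table above can be used to reproduce the month discrepancy flag in isolation. This is a minimal sketch, assuming those four rows belong to a single PatientsRecNum (2762) and simplifying the rule to "Mois is lower than in the previous visit"; the column name `month_check` is hypothetical.

```r
library(dplyr)

# Dates and Mois values copied from rows 6 to 9 of the JWWJQJGS table.
visits <- tibble(
  PatientsRecNum = 2762,
  Date = as.Date(c("2010-01-24", "2010-03-01", "2010-06-17", "2010-09-21")),
  Mois = c(12, 2, 6, 9)
)

visits_checked <- visits %>%
  arrange(PatientsRecNum, Date) %>%
  group_by(PatientsRecNum) %>%
  mutate(
    # Flag a visit whose Mois is lower than the Mois of the preceding visit.
    month_check = if_else(
      !is.na(lag(Mois)) & Mois < lag(Mois),
      "month_discrepency",
      "sensible"
    )
  ) %>%
  ungroup()

visits_checked$month_check
# "sensible" "month_discrepency" "sensible" "sensible"
```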

# drug mismatch
problem_drugs <- t %>% filter(drug_match=="non-match") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "drug_mismatch")

flagged_rows <- rbind(missing_date, na_sex, missing_AP1, leeway_exceeds, problem_ids, problem_drugs)
write.table(flagged_rows, "data/raw/phenotype_data/PHENO_GWAS_160320_flagged_rows.txt", row.names = FALSE, col.names = TRUE, quote = TRUE)

table(flagged_rows$problem_category)


    drug_mismatch month_discrepency    month_mismatch       sex_problem 
               34                27               700                98
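The summary that follows reports both individuals and rows per category. One way such counts could be derived from a table shaped like flagged_rows is sketched here; the toy data and the `category_counts` name are invented for illustration and do not reproduce the real numbers.

```r
library(dplyr)

# Invented toy table with the same key columns as flagged_rows.
flagged_toy <- tibble(
  GEN = c("AAAA", "AAAA", "BBBB", "CCCC", "CCCC", "CCCC"),
  problem_category = c("month_mismatch", "month_mismatch", "month_mismatch",
                       "sex_problem", "sex_problem", "drug_mismatch")
)

# Number of distinct individuals and number of rows per problem category.
category_counts <- flagged_toy %>%
  group_by(problem_category) %>%
  summarise(individuals = n_distinct(GEN), rows = n())

category_counts
```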

Summary of problems identified:

1. Missing date: the date column is empty; 19 individuals/19 rows, e.g. DMLWTARC.
2. Sex problems: sex is either missing for all entries of an individual or both sexes are listed for the same individual; 47 individuals/239 rows, e.g. GYYEHMDR (both sexes listed), IDDAXPMK (empty sex fields).
3. Missing AP1: the follow-up drug is missing; 28 individuals/32 rows, e.g. YSTTKYJE.
4. Month discrepancy: when a participant’s data is sorted by date, the Mois value is lower than in the previous row (e.g. month 3 is recorded on 1 January 2010 but month 0 on 1 March 2010 for the same PatientsRecNum); 38 individuals/39 rows, e.g. JWWJQJGS, where month 2 is recorded on 2010-03-01, after month 12 on 2010-01-24, for the same PatientsRecNum (2762).
5. Month mismatch: the time between two follow-ups is more than 90 days off from what the Date and Mois columns imply. These may be less of a problem, but I have still flagged them. For example, suppose a participant has month 0 on January 1 and month 3 on September 1. The number of days between the two dates is 244, whereas the Mois column implies 3 * 30 = 90 days; with the 90-day leeway the upper limit is 3 * 30 + 90 = 180 days. Since 244 is greater than 180, the month 3 follow-up is flagged. Note that 90 days is an arbitrary cutoff. 162 individuals/206 rows, e.g. YSFHMSHX on 2007-07-18.
6. Drug mismatch: two drugs are listed for the same GEN and PatientsRecNum; 25 individuals (the number of rows is not given because it is unclear which entries are correct), e.g. BPOCXXYD has both aripiprazole and amisulpride listed for PatientsRecNum 18.
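The month mismatch arithmetic above can be checked with a few lines of base R. This is a sketch only: `within_leeway` is a hypothetical helper, the year 2020 is assumed for the January 1 to September 1 example (a leap year, which gives the 244 days quoted), and the other two calls use dates and Mois values from the YSFHMSHX table.

```r
leeway_time <- 90  # days; the arbitrary cutoff quoted in the text

# TRUE when the gap between two visits is within `leeway` days of the gap
# implied by the Mois values (30 days per month), FALSE otherwise.
within_leeway <- function(date_prev, date_curr, mois_prev, mois_curr,
                          leeway = leeway_time) {
  observed_days <- as.numeric(difftime(date_curr, date_prev, units = "days"))
  expected_days <- (mois_curr - mois_prev) * 30
  observed_days >= expected_days - leeway & observed_days <= expected_days + leeway
}

# Worked example from the text: month 0 on January 1, month 3 on September 1.
gap_days <- as.numeric(difftime(as.Date("2020-09-01"), as.Date("2020-01-01"), units = "days"))
gap_days  # 244
example_ok <- within_leeway(as.Date("2020-01-01"), as.Date("2020-09-01"), 0, 3)
example_ok  # FALSE: 244 > 3 * 30 + 90 = 180, so the visit is flagged

# YSFHMSHX: 185 days between month 0 and month 3 (flagged),
# 102 days between month 3 and month 6 (within the leeway).
ysf_first <- within_leeway(as.Date("2007-01-14"), as.Date("2007-07-18"), 0, 3)
ysf_second <- within_leeway(as.Date("2007-07-18"), as.Date("2007-10-28"), 3, 6)
```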

Sent data/raw/phenotype_data/PHENO_GWAS_160320_flagged_rows.txt to Celine for revision (with the help of Claire, Marianna and Nermine).
New data received on 16/04/2020: PHENO_GWAS_160420.xlsx (processed according to the description on the Data Sources page).
Some flagged rows were not errors and were therefore left unchanged. According to Celine’s response (17/04/2020), this mainly concerned:
Drug mismatch: risperidone and paliperidone are considered the same drug (this flag has since been corrected).
Month mismatch: for all patients whose data came from an annual check-up, previously entered in a different database, the value in the month variable represents annual visits.
Some month mismatches/discrepancies could not be corrected… but we checked that the dates were correct for those, so you can rely on dates to calculate the duration.
For all sex mismatches, there should be no more mistakes… tell me if you find more. And we filled in all the AP1 values and dates that we could find.
The procedure above was repeated to see how many problems remained.

pheno_file2 <- "data/raw/phenotype_data/PHENO_GWAS_160420_noaccent.csv"
pheno_raw2 <- readr::read_delim(pheno_file2, col_types = cols(.default = col_character()), delim = ",") %>% type_convert(col_types = cols())

options("tidylog.display" = list())  # turn off

t <- process_pheno_raw(pheno_raw2)

options("tidylog.display" = NULL)    # turn on

## repeat above procedure to see if all problems are fixed

## missing date
missing_date <- t %>% filter(is.na(Date)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "missing_Date")

## missing sex
na_sex <- t %>% filter(is.na(sex)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "sex_problem")

## missing AP1
missing_AP1 <- t %>% filter(is.na(AP1)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "missing_AP1")

## missing PatientsRecNum (none)
missing_patrec <- t %>% filter(is.na(PatientsRecNum)) %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "missing_PatientsRecNum")

## month mismatch
leeway_exceeds <- t %>% filter(follow_up == "leeway_exceeds") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "month_mismatch")

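## month discrepency
## the label must match the value assigned in process_pheno_raw() ("month_discrepency");
## the recorded run below used the misspelling "month_descrepency", which matches no rows
problem_ids <- t %>% filter(month_descrepency == "month_discrepency") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "month_discrepency")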

# drug mismatch
problem_drugs <- t %>% filter(drug_match=="non-match") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "drug_mismatch")

flagged_rows2 <- rbind(missing_date, na_sex, missing_AP1, leeway_exceeds, problem_ids, problem_drugs)

table(flagged_rows2$problem_category)


 drug_mismatch month_mismatch 
            15            683

t %>% filter(drug_match=="non-match") %>% dplyr::select(AP1, drug_match, GEN) %>% unique %>% arrange(GEN)

# A tibble: 6 x 3
  AP1          drug_match GEN     
  <chr>        <chr>      <chr>   
1 Aripiprazole non-match  BPOCXXYD
2 Amisulpride  non-match  BPOCXXYD
3 Mirtazapine  non-match  KOPRATFS
4 Quetiapine   non-match  KOPRATFS
5 Amisulpride  non-match  VGWWZXDK
6 Risperidone  non-match  VGWWZXDK
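The drug mismatch flag (more than one AP1 for the same GEN and PatientsRecNum, after treating Risperdal and Paliperidone as Risperidone) can be sketched as follows. The real check_drug() is not shown on this page, so `check_drug_sketch`, the "match" label and the second toy participant are assumptions; only the "non-match" label and the BPOCXXYD example come from this page.

```r
library(dplyr)

# Hypothetical stand-in for check_drug(): one label per group of visits.
check_drug_sketch <- function(ap1) {
  if (n_distinct(ap1, na.rm = TRUE) > 1) "non-match" else "match"
}

drug_toy <- tibble(
  GEN = c("BPOCXXYD", "BPOCXXYD", "TOYGEN01", "TOYGEN01"),
  PatientsRecNum = c(18, 18, 1, 1),
  AP1 = c("Aripiprazole", "Amisulpride", "Risperdal", "Paliperidone")
)

drug_checked <- drug_toy %>%
  # Same recoding as in process_pheno_raw().
  mutate(AP1 = case_when(
    AP1 %in% c("Risperdal", "Paliperidone") ~ "Risperidone",
    TRUE ~ AP1
  )) %>%
  group_by(GEN, PatientsRecNum) %>%
  mutate(drug_match = check_drug_sketch(AP1)) %>%
  ungroup()

drug_checked$drug_match
# "non-match" "non-match" "match" "match"
```

Recoding before the comparison is what keeps a risperidone/paliperidone pair from being flagged.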

A few problems still remained:

1. Missing date: visits with a missing date should be removed.
2. Sex problems: participants with missing sex information, or with both sexes listed across visits, should be removed.
3. Missing AP1: visits with missing AP1 information should be removed.
4. Month discrepancy: visits that occur after a previous visit according to the Date column, but before it according to the Mois column, have been checked; the dates are correct and the Mois column should be ignored. Celine’s explanation for why these were not all corrected: “…we can not correct the month entry without deleting and reentering all data from one visit… thus we only checked that the dates were correct! It means that the mistakes you still have with month mismatchs, you should use dates without considering the mois column.”
5. Month mismatch: as for (4), the Mois column should be ignored.
6. Drug mismatch: 4 participants were identified with multiple drugs listed for the same PatientsRecNum; these were sent to Celine and corrected in the subsequent data extraction (see the table above for the list of participants corrected).
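Since the Mois column is to be ignored for these visits, follow-up time can be taken from the dates instead, in the same way as the date_difference_first column in process_pheno_raw(). This is a minimal sketch using the four JWWJQJGS visits of PatientsRecNum 2762; the `approx_month` column and its 30-days-per-month conversion are assumptions for illustration.

```r
library(dplyr)

# Dates copied from the JWWJQJGS visits of PatientsRecNum 2762; Mois is left out on purpose.
followup <- tibble(
  GEN = "JWWJQJGS",
  PatientsRecNum = 2762,
  Date = as.Date(c("2010-01-24", "2010-03-01", "2010-06-17", "2010-09-21"))
) %>%
  arrange(GEN, PatientsRecNum, Date) %>%
  group_by(GEN, PatientsRecNum) %>%
  mutate(
    # Days elapsed since the first visit of this drug course.
    days_since_first = as.numeric(difftime(Date, first(Date), units = "days")),
    # Rough month index, assuming 30 days per month.
    approx_month = round(days_since_first / 30)
  ) %>%
  ungroup()

followup$days_since_first
# 0 36 144 240
```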

New data received on 16/04/2020: PHENO_GWAS_160420_corr.xlsx (processed according to description at Data Sources page).
Procedure above was repeated to confirm that Drug mis-match issues were solved (no other changes were made to the database).

pheno_file3 <- "data/raw/phenotype_data/PHENO_GWAS_160420_corr_noaccent.csv"
pheno_raw3 <- readr::read_delim(pheno_file3, col_types = cols(.default = col_character()), delim = ",") %>% type_convert(col_types = cols())

options("tidylog.display" = list())  # turn off

t <- process_pheno_raw(pheno_raw3)

options("tidylog.display" = NULL)    # turn on

# drug mismatch
problem_drugs <- t %>% filter(drug_match=="non-match") %>% dplyr::select(GEN, Date, PatientsRecNum) %>%
  mutate(problem_category = "drug_mismatch")

t %>% filter(drug_match=="non-match") %>% dplyr::select(AP1, drug_match, GEN) %>% unique %>% arrange(GEN)

# A tibble: 0 x 3
# … with 3 variables: AP1 <chr>, drug_match <chr>, GEN <chr>

# empty table

sessionInfo()

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRblas.so
LAPACK: /data/sgg2/jenny/bin/R-3.5.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rbgen_0.1               ukbtools_0.11.3         hrbrthemes_0.8.0       
 [4] OpenImageR_1.1.6        fuzzyjoin_0.1.5         kableExtra_1.1.0       
 [7] R.utils_2.9.2           R.oo_1.23.0             R.methodsS3_1.7.1      
[10] TwoSampleMR_0.4.25      reader_1.0.6            NCmisc_1.1.6           
[13] optparse_1.6.4          readxl_1.3.1            ggthemes_4.2.0         
[16] tryCatchLog_1.1.6       futile.logger_1.4.3     DataExplorer_0.8.0     
[19] taRifx_1.0.6.1          qqman_0.1.4             MASS_7.3-51.5          
[22] bit64_0.9-7             bit_1.1-14              rslurm_0.5.0           
[25] rmeta_3.0               devtools_2.2.1          usethis_1.5.1          
[28] data.table_1.12.8       clustermq_0.8.8.1       future.batchtools_0.8.1
[31] future_1.15.1           rlang_0.4.5             knitr_1.26             
[34] drake_7.12.0.9000       forcats_0.4.0           stringr_1.4.0          
[37] dplyr_0.8.3             purrr_0.3.3             readr_1.3.1            
[40] tidyr_1.0.3             tibble_2.1.3            ggplot2_3.3.2          
[43] tidyverse_1.3.0         pacman_0.5.1            processx_3.4.1         
[46] workflowr_1.6.0        

loaded via a namespace (and not attached):
  [1] backports_1.1.6      systemfonts_0.2.3    plyr_1.8.5          
  [4] igraph_1.2.5         storr_1.2.1          listenv_0.8.0       
  [7] digest_0.6.25        foreach_1.4.7        htmltools_0.4.0     
 [10] tiff_0.1-5           fansi_0.4.1          magrittr_1.5        
 [13] checkmate_1.9.4      memoise_1.1.0        base64url_1.4       
 [16] doParallel_1.0.15    remotes_2.1.0        globals_0.12.5      
 [19] extrafont_0.17       modelr_0.1.5         extrafontdb_1.0     
 [22] prettyunits_1.1.0    jpeg_0.1-8.1         colorspace_1.4-1    
 [25] rvest_0.3.5          rappdirs_0.3.1       haven_2.2.0         
 [28] xfun_0.11            callr_3.4.0          crayon_1.3.4        
 [31] jsonlite_1.6         iterators_1.0.12     brew_1.0-6          
 [34] glue_1.4.0           gtable_0.3.0         webshot_0.5.2       
 [37] pkgbuild_1.0.6       Rttf2pt1_1.3.8       scales_1.1.0        
 [40] futile.options_1.0.1 DBI_1.1.0            Rcpp_1.0.3          
 [43] xtable_1.8-4         viridisLite_0.3.0    progress_1.2.2      
 [46] txtq_0.2.0           htmlwidgets_1.5.1    httr_1.4.1          
 [49] getopt_1.20.3        calibrate_1.7.5      ellipsis_0.3.0      
 [52] XML_3.98-1.20        pkgconfig_2.0.3      dbplyr_1.4.2        
 [55] utf8_1.1.4           tidyselect_0.2.5     reshape2_1.4.3      
 [58] later_1.0.0          munsell_0.5.0        cellranger_1.1.0    
 [61] tools_3.5.3          cli_2.0.1            generics_0.0.2      
 [64] broom_0.5.3          fastmap_1.0.1        evaluate_0.14       
 [67] yaml_2.2.0           fs_1.3.1             packrat_0.5.0       
 [70] nlme_3.1-143         mime_0.8             whisker_0.4         
 [73] formatR_1.7          proftools_0.99-2     xml2_1.2.2          
 [76] compiler_3.5.3       rstudioapi_0.10      png_0.1-7           
 [79] filelock_1.0.2       testthat_2.3.1       reprex_0.3.0        
 [82] stringi_1.4.5        ps_1.3.0             desc_1.2.0          
 [85] gdtools_0.2.2        lattice_0.20-38      vctrs_0.2.4         
 [88] pillar_1.4.3         lifecycle_0.1.0      networkD3_0.4       
 [91] httpuv_1.5.2         R6_2.4.1             promises_1.1.0      
 [94] gridExtra_2.3        sessioninfo_1.1.1    codetools_0.2-16    
 [97] lambda.r_1.2.4       assertthat_0.2.1     pkgload_1.0.2       
[100] rprojroot_1.3-2      withr_2.1.2          batchtools_0.9.12   
[103] parallel_3.5.3       hms_0.5.3            grid_3.5.3          
[106] rmarkdown_1.18       git2r_0.26.1         shiny_1.4.0         
[109] lubridate_1.7.4