Important. This document is only useful for UKB-approved KCL researchers and their collaborators with an account on the Rosalind/CREATE HPC cluster.
The ukbkings package works with a Rosalind/CREATE UKB project directory that has been set up to use the package. See ukbproject for setup of the directory structure of a UKB project on Rosalind/CREATE.
Contents:
0. tldr
1. Installation
2. Container
3. Field subset file
4. Read and write data
5. Categorical codes
6. Record level data
7. Genetic data
Install the package
devtools::install_github("kenhanscombe/ukbkings", dependencies = TRUE, force = TRUE)
Or, use the docker container with ukbkings and dependencies installed
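From an interactive cluster session (this is the same command described in section 2):

# If on Rosalind load singularity; not needed on CREATE
module load apps/singularity/3.5.3
singularity run docker://onekenken/ukbkings:0.2.2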
Write a serialised R dataframe to file for your required fields.
bio_phen(
project_dir = "<path_to_project_directory>",
field = "<path_to_required_fields_file>", # one per line, no header
out = "<stem_of_output_file>" # e.g. "data/ukb" writes "data/ukb.rds"
)
Note. bio_phen reads a withdrawal file from the project directory and replaces phenotype values for to-be-excluded samples with NA.
Read the generated dataset into R.
df <- readRDS("data/ukb.rds")
Change to your user directory and make a study folder. Start an interactive cluster session with sufficient memory to read the UKB data.
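For example (placeholder names; substitute your own paths):

cd <path_to_user_directory>
mkdir <study_folder>
cd <study_folder>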
The procedure below describes loading the HPC R module and installing ukbkings. Section 2 describes an easier alternative: running a containerized version of R with ukbkings and dependencies pre-installed.
Load the default cluster version of R. From your study folder in your user directory on Rosalind:
srun -p shared,brc --mem=30G --pty /bin/bash
module avail 2>&1 | grep R
module load <default_R_from_above>
R
Or, from your study folder in your user directory on CREATE:
srun -p cpu --time=0-1 --mem=30G --pty /bin/bash
module -r spider '^r$'
module load <default_R_from_above>
R
Install from Github
devtools::install_github("kenhanscombe/ukbkings", dependencies = TRUE, force = TRUE)
Alternatively, from your interactive SLURM session, use a container with ukbkings and dependencies pre-installed
# If on Rosalind load singularity; not needed on CREATE
module load apps/singularity/3.5.3
singularity run docker://onekenken/ukbkings:0.2.2
Note. This must be done from within your study folder in your user directory, as this will be the working directory in the containerized R session.
Note. All code blocks below are R, unless otherwise specified.
Load libraries
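The examples below use dplyr and stringr functions (the pipe, select, filter, str_detect) alongside ukbkings, so load all three:

library(ukbkings)
library(dplyr)
library(stringr)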
Check help (press ‘q’ to exit).
?ukbkings
Note. All included data and functionality are described in the package index, and on the ukbkings webpage under the Reference tab.
Check help on specific function, e.g.,
?bio_phen
Point to the project directory.
project_dir <- "<absolute_path_to_project_directory>"
You need a file with required fields, one per line, no header.
Read the project field-to-name “field finder” file, inspect the variable metadata, and display the number of baskets included.
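A minimal sketch of this step, assuming the field finder is returned by bio_field and includes a basket column (check ?bio_field and the column names on your project's file):

f <- bio_field(project_dir)

# Inspect the variable metadata
head(f)

# Number of baskets included (assumes a basket column)
n_distinct(f$basket)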
Search for the variables you require and add their field codes to a file, one per line, no header. You can page through the file.
Or, search the name column
f %>%
select(field, name) %>%
filter(str_detect(name, "vegetables"))
f %>%
select(field, name) %>%
filter(str_detect(name, "ldl|triglycerides"))
Alternatively, search the UKB showcase for a variable of interest, then filter on the field column in the field-to-name dataframe (useful if multiple instances are required). For example, if you search for “cholesterol medication”, the field stem you want is 6177.
f %>%
select(field, name) %>%
filter(str_detect(field, "6177"))
bio_field_add is a convenience function for creating the required-fields file (one field per line). By default the function appends fields. Create the field subset file in your UKB user directory
f %>%
select(field, name) %>%
filter(str_detect(field, "6177")) %>%
bio_field_add("small_field_subset.txt")
Inspect the field selection file.
system("cat small_field_subset.txt")
Read the required fields and save them as an rds file in your user directory. The argument out should be a file stem path in your UKB user directory
bio_phen(
project_dir,
field = "small_field_subset.txt",
out = "small_phenotype_subset"
)
Note. Dates in the UKB data are recorded in a variety of formats, some of which are non-standard (e.g., 2009-01-12T11:28:56 raises “character string is not in a standard unambiguous format”). All date variables have been left in character format for the user to convert as needed.
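As a hedged illustration of one possible conversion, lubridate (not a ukbkings dependency) parses this format directly:

# Parse an ISO 8601-style character date-time
lubridate::ymd_hms("2009-01-12T11:28:56")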
Check the size of your file and read in your dataset
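For example, with the out stem used above:

file.size("small_phenotype_subset.rds")
df <- readRDS("small_phenotype_subset.rds")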
If required, rename columns from the default UKB field names to the descriptive names used in the field-to-name “field finder” name column.
df <- bio_rename(df, f)
When there are duplicate fields (across baskets/datasets), drop the duplicates you don’t want, rename the remaining fields by dropping the "_<keep_basket>" suffix, then apply bio_rename
df <- df %>%
select(!ends_with("_<drop_basket>")) %>%
rename_with(~ str_replace(.x, pattern = "_<keep_basket>", replacement = "")) %>%
bio_rename(f)
Categorical field codings are included in the field finder.
Retrieve the numerical “Value” and associated “Meaning” for each categorical code.
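A minimal sketch, assuming the coding lookup is read with the package's bio_code function (see ?bio_code):

codes <- bio_code(project_dir)
head(codes)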
Look up a particular coding.
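For example (the coding ID 100349 is purely illustrative; filter on a Coding value present in your lookup table):

codes %>%
  filter(Coding == 100349)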
To query and read record-level data, use bio_record and bio_record_map. List all record level data available for your project
bio_record(project_dir)
You can retrieve the data as a disk.frame, which you can inspect with functions like head, names, etc. This data is still on disk and so does not require a large amount of memory to read into R.
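For example (the death record is used purely as an illustration; any record name listed by bio_record(project_dir) works):

death_diskf <- bio_record(project_dir, record = "death")
names(death_diskf)
head(death_diskf)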
You can also pipe the disk.frame through dplyr verbs, e.g., for column selection and row filtering. Use collect to read the results from disk into an R dataframe. For example, to retrieve GP prescription data
gp_scripts_diskf <- bio_record(project_dir, record = "gp_scripts")
gp_scripts_df <- gp_scripts_diskf %>%
filter(str_detect(drug_name, "fluoxetine")) %>%
select(eid, data_provider, issue_date) %>%
collect()
Use the subset argument as a convenient way to read only the data for the samples you’re interested in, e.g., those with data on your phenotype of interest. This will automatically read the data into a dataframe in R; you do not need to collect.
sample_subset <- c(321, 654, 987)
gp_scripts_df <- bio_record(
project_dir,
record = "gp_scripts",
subset = sample_subset
)
To inspect several records at once use bio_record_map, which maps a function to the data on disk without reading it into R. For example, to quickly find which variables are in which tables
bio_record_map(project_dir, func = names)
By default the function is mapped to all record tables. You can also specify the records you’re interested in
bio_record_map(
project_dir,
func = head,
records = c("gp_clinical", "gp_scripts", "gp_registrations")
)
For GP record details see UKB documentation:
For COVID-19 record details see UKB documentation:
For HES in-patient record details see UKB documentation:
For death record details see UKB documentation:
For paths to the genetic data available for the project
bio_gen_ls(project_dir)
For the genotyped data “sample information” files (the fam file and the sample QC file)
bio_gen_fam(project_dir)
bio_gen_sqc(project_dir)
Read the relatedness data into a dataframe with bio_gen_related. To get a dataframe of related samples to remove, use bio_gen_related_remove, which uses GreedyRelated with the default relatedness threshold, thresh = 0.044
bio_gen_related(project_dir)
bio_gen_related_remove(project_dir)
To assign 1000 Genomes super population ancestry to your project-specific pseudo-IDs.
bio_gen_ancestry(project_dir)
To write a PLINK file of samples to keep, use bio_gen_write_plink_input: data is either a vector of samples to keep, or a dataframe with sample IDs to keep in the first column; out is the file path to write the output file to.
bio_gen_write_plink_input(data, out)
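A hedged usage sketch (the sample IDs and output path are hypothetical):

samples_to_keep <- c(321, 654, 987)  # hypothetical pseudo-IDs
bio_gen_write_plink_input(samples_to_keep, "samples_to_keep.txt")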