Important. This document is only useful for UKB-approved KCL researchers and their collaborators with an account on the Rosalind/CREATE HPC cluster.
The ukbkings package works with a Rosalind/CREATE UKB project directory that has been set up to use the package. See ukbproject for setup of the directory structure of a UKB project on Rosalind/CREATE.
Contents:
0. tldr
1. Installation
2. Container
3. Field subset file
4. Read and write data
5. Categorical codes
6. Record level data
7. Genetic data
Install the package
devtools::install_github("kenhanscombe/ukbkings", dependencies = TRUE, force = TRUE)
Or, use the docker container with ukbkings and dependencies installed
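From an interactive cluster session (this is the same command described in section 2):

# If on Rosalind load singularity; not needed on CREATE
module load apps/singularity/3.5.3
singularity run docker://onekenken/ukbkings:0.2.2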
Write a serialised R dataframe to file for your required fields.
bio_phen(
project_dir = "<path_to_project_directory>",
field = "<path_to_required_fields_file>", # one per line, no header
out = "<stem_of_output_file>" # e.g. "data/ukb" writes "data/ukb.rds"
)
Note. bio_phen reads a withdrawal file from the project directory and replaces phenotype values for to-be-excluded samples with NA.
Read the generated dataset into R.
df <- readRDS("data/ukb.rds")
Change to your user directory and make a study folder. Start an interactive cluster session with sufficient memory to read the UKB data.
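For example (placeholder names; substitute your own paths):

cd <path_to_user_directory>
mkdir <study_folder>
cd <study_folder>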
The procedure below describes loading the HPC R module and installing ukbkings. Section 2 describes an easier alternative: running a containerized version of R with ukbkings and dependencies pre-installed.
Load the default cluster version of R. From your study folder in your user directory on Rosalind:
srun -p shared,brc --mem=30G --pty /bin/bash
module avail 2>&1 | grep R
module load <default_R_from_above>
R
Or, from your study folder in your user directory on CREATE:
srun -p cpu --time=0-1 --mem=30G --pty /bin/bash
module -r spider '^r$'
module load <default_R_from_above>
R
Install from Github
devtools::install_github("kenhanscombe/ukbkings", dependencies = TRUE, force = TRUE)
Alternatively, from your interactive SLURM session, use a container with ukbkings and dependencies pre-installed
# If on Rosalind load singularity; not needed on CREATE
module load apps/singularity/3.5.3
singularity run docker://onekenken/ukbkings:0.2.2
Note. This must be done from within your study folder in your user directory, as this will be the working directory in the containerized R session.
Note. All code blocks below are R, unless otherwise specified.
Load libraries
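The examples below use dplyr and stringr functions (the pipe, select, filter, str_detect) alongside ukbkings, so load all three:

library(ukbkings)
library(dplyr)
library(stringr)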
Check help (press ‘q’ to exit).
?ukbkings
Note. All included data and functionality are described in the package index, and on the ukbkings webpage under the Reference tab.
Check help on specific function, e.g.,
?bio_phen
Point to the project directory.
project_dir <- "<absolute_path_to_project_directory>"
You need a file with required fields, one per line, no header.
Read the project field-to-name “field finder” file, inspect the variable metadata, and display the number of baskets included.
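A minimal sketch of this step, assuming the field finder is returned by bio_field and includes a basket column (check ?bio_field and the column names on your project's file):

f <- bio_field(project_dir)

# Inspect the variable metadata
head(f)

# Number of baskets included (assumes a basket column)
n_distinct(f$basket)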
Search for the variables you require and add their field codes to a file, one per line, no header. You can page through the file.
Or, search the name column
f %>%
select(field, name) %>%
filter(str_detect(name, "vegetables"))
f %>%
select(field, name) %>%
filter(str_detect(name, "ldl|triglycerides"))
Alternatively, search the UKB showcase for a variable of interest, then filter on the field column in the field-to-name dataframe (useful if multiple instances are required). For example, if you search for “cholesterol medication”, the field stem you want is 6177.
f %>%
select(field, name) %>%
filter(str_detect(field, "6177"))
bio_field_add is a convenience function for creating the required-fields file (one field per line). By default the function appends fields. Create the field subset file in your UKB user directory
f %>%
select(field, name) %>%
filter(str_detect(field, "6177")) %>%
bio_field_add("small_field_subset.txt")
Inspect the field selection file.
system("cat small_field_subset.txt")
Read the required fields and save them as an rds file in your user directory. The argument out should be a file stem path in your UKB user directory
bio_phen(
project_dir,
field = "small_field_subset.txt",
out = "small_phenotype_subset"
)
Note. Dates in the UKB data are recorded in a variety of formats, some of which are non-standard (e.g., 2009-01-12T11:28:56 raises “character string is not in a standard unambiguous format”). All date variables have been left in character format for the user to convert as needed.
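As a hedged illustration of one possible conversion, lubridate (not a ukbkings dependency) parses this format directly:

# Parse an ISO 8601-style character date-time
lubridate::ymd_hms("2009-01-12T11:28:56")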
Check the size of your file and read in your dataset
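For example, with the out stem used above:

file.size("small_phenotype_subset.rds")
df <- readRDS("small_phenotype_subset.rds")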
If required, rename columns from the default UKB field names to the descriptive names used in the field-to-name “field finder” name column.
df <- bio_rename(df, f)
When there are duplicate fields (across baskets/datasets), drop the duplicates you don’t want, rename the remaining fields by dropping the "_<keep_basket>" suffix, then apply bio_rename
df <- df %>%
select(!ends_with("_<drop_basket>")) %>%
rename_with(~ str_replace(.x, pattern = "_<keep_basket>", replacement = "")) %>%
bio_rename(f)
Categorical field codings are included in the field finder.
Retrieve the numerical “Value” and associated “Meaning” for each categorical code.
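A minimal sketch, assuming the coding lookup is read with the package's bio_code function (see ?bio_code):

codes <- bio_code(project_dir)
head(codes)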
Look up a particular coding.
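For example (the coding ID 100349 is purely illustrative; filter on a Coding value present in your lookup table):

codes %>%
  filter(Coding == 100349)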
To query and read record-level data, use bio_record and bio_record_map. List all record level data available for your project
bio_record(project_dir)
You can retrieve the data as a disk.frame, which you can inspect with functions like head, names, etc. This data is still on disk and so does not require a large amount of memory to read into R.
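For example (the death record is used purely as an illustration; any record name listed by bio_record(project_dir) works):

death_diskf <- bio_record(project_dir, record = "death")
names(death_diskf)
head(death_diskf)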
You can also pipe the disk.frame through dplyr verbs, e.g., for column selection and row filtering. Use collect to read the results from disk into an R dataframe. For example, to retrieve GP prescription data
gp_scripts_diskf <- bio_record(project_dir, record = "gp_scripts")
gp_scripts_df <- gp_scripts_diskf %>%
filter(str_detect(drug_name, "fluoxetine")) %>%
select(eid, data_provider, issue_date) %>%
collect()
Use the subset argument as a convenient way to read only the data for the samples you’re interested in, e.g., those with data on your phenotype of interest. This will automatically read the data into a dataframe in R; you do not need to collect.
sample_subset <- c(321, 654, 987)
gp_scripts_df <- bio_record(
project_dir,
record = "gp_scripts",
subset = sample_subset
)
To inspect several records at once use bio_record_map, which maps a function to the data on disk without reading it into R. For example, to quickly find which variables are in which tables
bio_record_map(project_dir, func = names)
By default the function is mapped to all record tables. You can also specify the records you’re interested in
bio_record_map(
project_dir,
func = head,
records = c("gp_clinical", "gp_scripts", "gp_registrations")
)
For GP record details see UKB documentation:
For COVID-19 record details see UKB documentation:
For HES in-patient record details see UKB documentation:
For death record details see UKB documentation:
For paths to the genetic data available for the project
bio_gen_ls(project_dir)
For the genotyped data “sample information” files (the fam file and the sample QC file)
bio_gen_fam(project_dir)
bio_gen_sqc(project_dir)
Read the relatedness data into a dataframe with bio_gen_related. To get a dataframe of related samples to remove, use bio_gen_related_remove, which uses GreedyRelated with the default relatedness threshold, thresh = 0.044
bio_gen_related(project_dir)
bio_gen_related_remove(project_dir)
To assign 1000 Genomes super population ancestry to your project-specific pseudo-IDs.
bio_gen_ancestry(project_dir)
To write a PLINK file of samples to keep, use bio_gen_write_plink_input: data is either a vector of samples to keep, or a dataframe with sample IDs to keep in the first column; out is the file path to write the output file to.
bio_gen_write_plink_input(data, out)
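A hedged usage sketch (the sample IDs and output path are hypothetical):

samples_to_keep <- c(321, 654, 987)  # hypothetical pseudo-IDs
bio_gen_write_plink_input(samples_to_keep, "samples_to_keep.txt")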