vignettes/explore-ukb-data.Rmd
explore-ukb-data.Rmd
The UK Biobank is a resource that includes detailed health-related and genetic data on about 500,000 individuals and is available to the research community. ukbtools removes all the upfront data wrangling required to get a single dataset for statistical analysis, and provides tools to assist in quality control, query of disease diagnoses, and retrieval of genetic metadata.
Download and decrypt your data with the supplied helper programs. To use ukbtools, you need to create a UKB fileset (.tab, .r, and .html):
ukb_unpack decrypts your downloaded ukbxxxx.enc file, outputting a ukbxxxx.enc_ukb file. ukb_conv with the r flag converts the decrypted data to a tab-delimited file ukbxxxx.tab and an R script ukbxxxx.r that reads the tab file. The docs flag creates an html file containing a field-code-to-description table (among others).
Note. Full details of the data download and decrypt process are given in the Using UK Biobank Data documentation . Updated versions of these helper programs exist. Other than small name changes (underscores removed) they appear to function similarly.
In R,
# Install from CRAN
install.packages("ukbtools")
# Install latest development version
devtools::install_github("kenhanscombe/ukbtools", build_vignettes = TRUE, dependencies = TRUE)
The function ukb_df()
takes the stem of your fileset and returns a dataframe with usable column names.
You can also specify the path to your fileset if it is not in the current directory. For example, if your fileset is in a subdirectory of the working directory called data
my_ukb_data <- ukb_df("ukbxxxx", path = "/full/path/to/my/ukb/fileset/data")
Use ukb_df_field
to create a field code-to-descriptive name key, as dataframe or named lookup vector.
my_ukb_key <- ukb_df_field("ukbxxxx", path = "/full/path/to/my/ukb/fileset/data")
Note. You can move the three files in your fileset after creating them with ukb_conv
, but they should be kept together. ukb_df()
automatically updates the read call in the R source file to point to the correct directory (the current directly by default, or the directory specified by path
).
Memory and efficiency
To reduce you memory usage, you could save your new UKB dataset with
save(my_ukb_data, file = "my_ukb_data.rda")
. Load the dataset withload("my_ukb_data.rda")
. A UKB dataset from my largest UKB fileset which included a 2.6 GB .tab file took a little under 2 minutes to create withukb_df
. The associated .rda file was 138 MB and loaded in a little under 1.5 mins.
If you have multiple UKB downloads, first read each one in, then merge them with your preferred method. You could use ukb_df_full_join
which is a thin wrapper around dplyr::full_join
applied recursively with purrr::reduce
.
ukbxxxx_data <- ukb_df("ukbxxxx")
ukbyyyy_data <- ukb_df("ukbyyyy")
ukbzzzz_data <- ukb_df("ukbzzzz")
ukb_df_full_join(ukbxxxx_data, ukbyyyy_data, ukbzzzz_data)
Repeated variables.
The join key is set to “eid” only (default value of the
by
parameter). Any additional variables common to any two tables will have “.x” and “.y” appended to their names. If you are satisfied the additional variables are identical to the original, the copies can be safely deleted. For example, ifsetequal(my_ukb_data$var, my_ukb_data$var.x)
isTRUE
, then my_ukb_data$var.x can be dropped. Adlyr::full_join
is like the set operation union in that all observation from all tables are included, i.e., all samples are included even if they are not included in all datasets.Repeated variable names within UKB datasets are unlikely to occur.
ukb_df
creates variable names by combining a snake_case descriptor with the variable’s field code, index and array. This should be sufficient to uniquely identify the variable. However, if an index_array combination is incorrectly repeated in the original UKB data, this will result in a duplicated variable name. We observed two instances. The variables were encoded–0.0 ,–1.0 ,––1.0 , andukb_df
created a variable named var_0_0, var_1_0, var_1_0. This is probably a typo that should have been–0.0 ,–1.0 ,–2.0 , consistent with UKB official documentation describing the field as having 3 values for index. We have providedukb_df_duplicated_names
to identify duplicated names within a dataset. This will allow the user to make changes as appropriate. We expect the occurrence of such duplicates will be rare.
ukbxxxx.tab, ukbxxxx.r, ukbxxxx.html
A minimal example fileset is included with the package, in the subdirectory inst/extdata. This fileset will allow the user to test the the read (ukb_df
, ukb_df_field
) and summarise (ukb_context
) functionality.
# To load the example data
path_to_example_data <- system.file("extdata", package = "ukbtools")
df <- ukb_df("ukbxxxx", path = path_to_example_data)
# To create a field code to name key
df_field <- ukb_df_field("ukbxxxx", path = path_to_example_data)
The full path to the raw test data can be retrieved with system.file("extdata", "ukbXXXX.tab", package = "ukbtools")
.
As an exploratory step you might want to look at the demographics of a particular subset of the UKB sample relative to a reference sample. For example, using the nonmiss.var
argument of ukb_context
will produce a plot of the primary demographics (sex, age, ethnicity, and Townsend deprivation score) and employment status and assessment centre, for the subsample with data on your variable of interest compared to those without data (i.e. NA
).
ukb_context(my_ukb_data, nonmiss.var = "my_variable_of_interest")
It is also possible to supply a logical vector with subset.var
to define the subset and reference sample. This is particularly useful for understanding a subgroup within the UKB study, e.g., overweight individuals.
subgroup_of_interest <- (my_ukb_data$body_mass_index_bmi_0_0 >= 25)
ukb_context(my_ukb_data, subset.var = subgroup_of_interest)
Primary demographic data for a UKB subset of interest. The subset is individuals with BMI >= 25; the reference is BMI < 25. Barplots are displayed as proportions, e.g., about 1/3 of all participants who identified as “Chinese” were overweight compared to about 2/3 of all participants who identified as “British”. ukb_context also allows the user to draw barplots as “stacked” or “side-by-side” bars representing counts, which would reveal there were many more “British” participants (442,698) than there were “Chinese” (1,574).
The International Classification of Diseases (ICD) query functions begin ukb_icd_
. Type ukb_icd_
tab to see the family of functions. The full ICD code definition tables are available as datasets (icd9codes
, icd10codes
). ICD chapter-level tables describing disease blocks are also available for query as datasets (icd9chapters
, icd10chapters
)
Sample query
ukb_icd_diagnosis
takes one or more individual ids and returns a dataframe with a potential message noting ids with no diagnoses
ukb_icd_diagnosis(my_ukb_data, id = "0000000", icd.version = 10)
Diagnosis query
To retrieve the “meaning” of an ICD code use ukb_icd_code_meaning
accepts one or more ICD codes and returns a dataframe of codes and their associated meanings
ukb_icd_code_meaning(icd.code = "I74", icd.version = 10)
Search for a class of diseases with a keyword. Supplying multiple keywords as a character vector will return all ICD entries containing any of the keywords.
ukb_icd_keyword("cardio", icd.version = 10)
You can calculate the prevalence of a diagnosis in the UKB study (or a subset of the full sample) using ukb_icd_prevalence
. The icd.diagnosis
argument takes a regular expression, and so can also be used to retrieve prevalence of a disease “class”, i.e., the proportion of individuals with any diagnosis in the disease class.
# ICD-10 code I74, Arterial embolism and thrombosis
ukb_icd_prevalence(my_ukb_data, icd.version = 10, icd.diagnosis = "I74")
# ICD-10 chapter 9, disease block I00–I99, Diseases of the circulatory system
ukb_icd_prevalence(my_ukb_data, icd.version = 10, icd.diagnosis = "I")
# ICD-10 chapter 2, C00-D49, Neoplasms
ukb_icd_prevalence(my_ukb_data, icd.version = 10, icd.diagnosis = "C|D[0-4].")
To retrieve frequency for one or more ICD diagnoses by the levels of a reference variable, e.g., sex (male or female) use ukb_icd_freq_by
. If the variable is continuous, it is divided into N approximately equal-sized groups (default = 10) within which ICD diagnosis frequency is calculated. ukb_icd_freq_by
also includes an option freq.plot
to produce a figure of ICD diagnosis frequency by reference variable. Diagnoses of interest are passed to icd.code
. The default ICD codes are the WHO top 3 causes of death worldwide (2015): coronary artery disease (CAD), cerebrovascular disease/ stroke, lower respiratory tract infection (LTRI).
ukb_icd_freq_by(my_ukb_data, reference.var = "body_mass_index_bmi_0_0", freq.plot = TRUE)
ukb_icd_freq_by(my_ukb_data, reference.var = "sex_0_0", freq.plot = TRUE)
ICD diagnosis frequency by reference variable. Panel A shows frequency of the query diagnoses with respect to BMI; panel B with respect to sex. Any diagnoses can be queried with the icd.code
parameter (default is CAD, Stroke, LRTI)
Setting freq.plot = FALSE
(default) returns a dataframe of the frequencies. Values for the reference variable group ranges are in the column “group”.
UKB genetic data resources
Interim release metadata
The genetic metadata functions were written to retrieve genetic metadata from the phenotype file for the interim genotype release. The associated QC was described in genotyping and quality control and imputation and association. The fields retrieved became obsolete when the full genotyping results were released at the end of 2017.
Full release metadata
With the release of the full sample (500K individuals) genotypes, sample QC (ukb_sqc_v2.txt) and relatedness (ukbA_rel_sP.txt) data are now supplied as separate files. The contents of these files, along with all other genetic files are described in UKB Resource 531.
Defunct functions
ukb_gen_meta
ukb_gen_pcs
ukb_gen_excl
ukb_gen_rel
ukb_gen_het
ukb_gen_excl_to_na
ukb_gen_write_plink_excl
The above functions remain in the ukbtools package but return the message:
Error: ukb_gen_<name> is defunct. See help("ukb_defunct")
The documentation for
ukb_defunct
includes the information about the change in the way UKB serve the genetic (meta)data.
The sample QC file ukb_sqc_v2.txt has no header. To include a header use ukb_gen_sqc_names
. From UKB Resource 531: “There are currently 2 versions of this file (UKB ukb_sqc_v2.txt) in circulation. The newer version is described below and contains column headers on the first row. The older (deprecated) version lacks the column headers and has two additional Affymetrix internal values prefixing the columns listed below”.
ukb_gen_sqc_names
handles this if used to include column names in a dataframe of sample QC data. If a character vector of names is needed (e.g. to supply on reading in sample QC data) use the argument col_names_only = TRUE
and trim if required.
# With ukb_sqc_v2.txt read into the dataframe my_sqc_data
my_sqc_data <- ukb_gen_sqc_names(my_sqc_data)
# For a character vector of column names
ukb_gen_sqc_names(col_names_only = TRUE)
Relatedness count
ukb_gen_rel_count
now works supplied relatedness file ukbA_rel_sP.txt with header: ID1, ID2, HetHet, IBS0, Kinship. Supply a dataframe of the relatedness data for a dataframe of counts of degree of relatedness (Duplicates/MZ twins, 1st-degree, 2nd-degree, 3rd-degree).
# With ukbA_rel_sP.txt read into the dataframe my_relatedness_data
ukb_gen_rel_count(my_relatedness_data)
You can also create a plot of degree of relatedness, e.g., for samples with data on a variable of interest.
ukb_gen_rel_count(ukb_relatedness, plot = TRUE)
Degree of relatedness. KING kinship coefficient ranges > 0.354, (0.177, 0.354], (0.0884, 0.177] and (0.0442, 0.0884] corresponds to duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationships respectively.
Maximal subset of unrelateds
There are many ways to remove related individuals from phenotypic data for genetic analyses. You could simply exclude all individuals indicated as having “excess relatedness” and include those “used in pca calculation” (these variables are included in the sample QC data, ukb_sqc_v2.txt). This list is based on the complete dataset and possibly removes more samples than you need to for your phenotype of interest. Ideally, you want a maximum independent set, i.e., to remove the minimum number of individuals with data on the phenotype of interest, so that no pair exceeds some cutoff for relatedness. ukb_gen_samples_to_remove
returns a list of samples to remove in order to achieve a maximal set of unrelateds for a given phenotype (default cutoff > 0.0884
- KING coefficient corresponding to 3rd degree relatedness). For the subset of related pairs (ukbA_rel_sP.txt) in which both have data on the phenotype of interest use ukb_gen_related_with_data
.
# With ukbA_rel_sP.txt read into the dataframe my_relatedness_data
# and an integer vector samples_with_phenotype of samples who have
# data on the phenotype of interest
ukb_gen_related_with_data(my_relatedness_data, ukb_with_data = samples_with_phenotype)
ukb_gen_samples_to_remove(my_relatedness_data, ukb_with_data = samples_with_phenotype)
ukbtools includes convenience functions to write phenotype and covariate files for BGENIE and PLINK.
BGENIE phenotype and covariate files are space-delimited, include column names, and have missing values coded as -999. They must also be in .sample file order. The sample file is downloaded with your bulk genotype data and described in UKB Resource 531. The sample file is a rectangular file with two header lines. Using ukb_gen_read_sample
will read it in correctly and label the columns. ukb_gen_write_bgenie
sorts input data to match .sample file id order and writes the data to disk.
# Read .sample file supplied with bulk genetic data
my_sample_file <- ukb_gen_read_sample("path/to/sample_file")
# Write a BGENIE format phenotype or covariate file
ukb_gen_write_bgenie(
my_ukb_data,
path = "path/to/bgenie_input_file",
ukb.sample = my_sample_file,
ukb.variables = c("variable1", "variable2", "variable3")
)
The BGENIE usage page uses the example files example.pheno and example.cov - it is not clear whether the suffixes are obligatory. Use them to be safe.
PLINK phenotype and covariate files are either space- or tab-delimited, column names are optional, first two columns must contain family ID and individual ID respectively, and missing values are “normally expected to be encoded as -9” but also “nonnumeric values such as ‘NA’ … (are) treated as missing”. ukb_gen_write_plink
writes a space-delimited file with column names, UKB ID is automatically written to column 1 and 2 and labelled FID IID, and missing values are coded as NA
. The missing value to be used can be changed with the na.strings
argument. See PLINK standard data input for further details.
# Write a PLINK format phenotype or covariate file
ukb_gen_write_plink(
my_ukb_data,
path = "path/to/plink_input_file",
ukb.variables = c("variable1", "variable2", "variable3")
)
PLINK does not require that individuals in phenotype and covariate files are in any particular order, but you may want to reconcile the individuals you include in your analysis with those in the fam file (from your hard-called data). Read the fam file into R with ukb_gen_read_fam
Sample exclusions
BGENIE does not have an option to read/remove exclusions. You can replace data values in the phenotype with NA
where the individual is to-be-excluded based on genetic metadata considerations. Writing the updated variable to your phenotype file (with the supplied write functions), effectively excludes the individuals from any analysis.
PLINK --remove
takes a space- or tab-delimited file with family IDs in the first column and individual IDs in the second column, without column names. See PLINK input filtering for further details. (The missing value approach described above also works for PLINK.)
Use the sample QC file (ukb_sqc_v2.txt) to determine to-be-excluded samples.