Using UKB primary care prescription data • ukbkings

UKB primary care presciption data are included in the Primary Care Linked data, accessed through the UKB Data Portal. The data can be explored online with SQL queries, the results of which can be downloaded. Alternatively, the full table can be downloaded.

For each KCL UKB application with access to primary care prescription data, we have downloaded the full table and converted it to an R disk.frame for fast local queries with ukbkings bio_record and bio_record_map. (See the article Access UKB data on Rosalind/CREATE for a detailed description of the use of bio_record and bio_record_map to query all record-level data available to a particular KCL UKB application)

Query prescription data

bio_record returns a diskframe, an object which can be queried in chunks without reading the data into R.

library(ukbkings)
library(dplyr)
library(stringr)
library(readxl)
library(lubridate)

project_dir <- "<path_to_project_directory>"

diskf <- bio_record(project_dir, record = "gp_scripts")
head(diskf)

       eid data_provider issue_date read_2 bnf_code dmd_code                                    drug_name quantity
1: XXXXXXX             X XX/XX/XXXX  bi33. 02050501                           Lisinopril 10mg tablets   56.000
2: XXXXXXX             X XX/XX/XXXX  blb1. 02060200                            Amlodipine 5mg tablets   56.000
3: XXXXXXX             X XX/XX/XXXX  bxd5. 02120400                          Simvastatin 40mg tablets   56.000
4: XXXXXXX             X XX/XX/XXXX  l412. 12010151      Otosporin ear drops (GlaxoSmithKline UK Ltd)   10.000
5: XXXXXXX             X XX/XX/XXXX  bi33. 02050501                           Lisinopril 10mg tablets   28.000
6: XXXXXXX             X XX/XX/XXXX  blb1. 02060200                            Amlodipine 5mg tablets   28.000

You can use dplyr verbs to filter rows, select columns, etc., and finally collect to retrieve the resulting data into a local dataframe. Below is an example of using a combination of bio_record and dplyr verbs to retrieve antidepressant prescriptions.

The primary care prescription data include 3 prescription coding systems

code	meaning
bnf_code	British National Formulary (BNF) is a reference book containing the standard list of medicines used in UK
dmd_code	Dictionary of Medicines and Devices (dm+d), the catalogue of things that can be prescribed in the NHS
read_2	Read Codes (version 2 (v2)) are a coded thesaurus of clinical terms (used in the NHS since 1985). There is also a version 3 (CTV3 or v3)

To retrieve prescriptions for a particular drug or drug class, search all three drug code systems and drug_name which includes name and dosage information combined. Below is an example retrieving antidepressant prescriptions.

Antidepressant prescriptions

Search by BNF code

bnf <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>%
    collect()

Search by Read 2 code

Here I’m using bio_code_primary_care to read the UKB-supplied primary care code lookup tables to find Read 2 code equivalents of BNF antidepressant codes.

read2_from_bnf <- bio_code_primary_care(project_dir, "read_v2_drugs_bnf") %>%
    filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>%
    pull(read_code)

# Make a regular expression for search filter
read2_rgx <- read2_from_bnf %>%
    str_replace_all("\\.", "\\\\.") %>%
    str_c("^", ., collapse = "|") %>%
    regex()

read2 <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(str_detect(read_2, read2_rgx)) %>%
    collect() %>%
    mutate(read_2 = str_replace(read_2, "00$", ""))

The mutate drops a trailing “00” on some read2 codes (a UKB artifact?).

Note. Fabbri et al. (2021) Genetic and clinical characteristics of treatment-resistant depression using primary care records in two UK cohorts use an augmented list of read 2 codes, which includes the additional codes: du6z. du61. gdez. gdey. gde2. gde1.

This list is in supplementary material table 1, available from the UKB resources folder on Rosalind/CREATE. (See the paper’s code for treatment-resistant depression phenotype derivation, as well as antidepressant extraction)

# -curated antidepressants - PMID33753889
cf_path <- file.path(
    dirname(project_dir),
    "resources/pmid33753889_medrxiv_supplementary_table1.xlsx"
)

excel_sheets(cf_path)

[1] "readv2_mood_psych_sub" "readv3_mood_psych_sub" "ADs_read_2_codes"      "bnf_drug_names"        "dmd_drug_names"        "read2_drug_names"

Chiara’s augmented list of read 2 codes is in the sheet “ADs_read_2_codes”.

# CF Read 2 code list
cf_path <- file.path(
    dirname(project_dir),
    "resources/pmid33753889_medrxiv_supplementary_table1.xlsx"
)

read2_from_cf <- read_excel(cf_path, sheet = "ADs_read_2_codes") %>%
    pull(read2_code)

read2_rgx <- read2_from_cf %>%
    str_replace_all("\\.", "\\\\.") %>%
    str_c("^", ., collapse = "|") %>%
    regex()

read2 <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(str_detect(read_2, read2_rgx)) %>%
    collect() %>%
    mutate(read_2 = str_replace(read_2, "00$", ""))

Search by dm+d code and/or drug name

You could also search for known dm+d codes in column dmd_code, and/or known drug names in drug_name, to retrieve any prescriptions with missing BNF and Read 2 codes. For example, Chiara’s list of curated antidepressant drug names is included in the ukbkings dataset drug_dmd_antidep.

drug_name_rgx <- str_c(drug_dmd_antidep$dmd_name) %>%
    str_c(collapse = "|") %>%
    regex(ignore_case = TRUE)

other <- bio_record(project_dir, record = "gp_scripts") %>%
    filter(bnf_code == "" & read_2 == "") %>%
    filter(str_detect(drug_name, drug_name_rgx)) %>%
    collect()

The filter on empty strings (missing values) in bnf_code and read_2 code searches among prescriptions not already captured by a bnf or read 2 code.

Merge prescriptions

px <- bind_rows(bnf, read2, other) %>%
    distinct() %>%
    as_tibble()

Initial clean and format

Replace empty strings with NA, fix column types, and drop prescriptions without an associated date (if appropriate).

px <- px %>%
    mutate(
        across(.fns = ~ na_if(.x, "")),
        across(read_2:quantity, as.character),
        issue_date = as_date(issue_date, format = "%d/%m/%Y"),
    ) %>%
    filter(!is.na(issue_date))

The drug_name and quantity columns include a mix of incomplete information (e.g., drug, dosage, form (i.e., capsule, tablet, etc), count, company) and style (upper/lower case, integer/floating point, same form expressed in many ways (e.g., tab, tablet, tablets, tablet(s))). A random sample of 20 rows shows some the above

# A tibble: 20 × 8
       eid data_provider issue_date read_2 bnf_code        dmd_code  drug_name                                        quantity             
                     
 1 XXXXXXX             X XXXX-XX-XX NA     04.03.03.00.00  NA        Citalopram 20mg tablets                          28 tablets           
 2 XXXXXXX             X XXXX-XX-XX NA     04.03.04.00.00  NA        Venlafaxine 75mg modified-release capsules       56 capsule(s) - 75 mg
 3 XXXXXXX             X XXXX-XX-XX NA     0403010N0AAAEAE NA        IMIPRAMINE HYDROCHLORIDE TABLETS 25MG            60.000               
 4 XXXXXXX             X XXXX-XX-XX da91.  NA              NA        NA                                               NA                   
 5 XXXXXXX             X XXXX-XX-XX da93.  NA              NA        NA                                               NA                   
 6 XXXXXXX             X XXXX-XX-XX NA     040301          NA        Dosulepin Hydrochloride CAPS 25MG                168.000              
 7 XXXXXXX             X XXXX-XX-XX NA     04.03.03.00.00  NA        Prozac 20mg capsules (Eli Lilly and Company Ltd) 1 OP                 
 8 XXXXXXX             X XXXX-XX-XX NA     04.03.03.00.00  NA        Fluoxetine 20mg capsules                         60 capsule           
 9 XXXXXXX             X XXXX-XX-XX NA     04.03.03.00.00  NA        Fluoxetine 20mg capsules                         60 capsules          
10 XXXXXXX             X XXXX-XX-XX NA     04.03.04.00.00  NA        Venlafaxine 75mg modified-release capsules       84 capsule(s)        
11 XXXXXXX             X XXXX-XX-XX NA     04.03.04.00.00  NA        Mirtazapine 45mg tablets                         28 tablet            
12 XXXXXXX             X XXXX-XX-XX NA     04.03.03.00.00  NA        Sertraline 100mg tablets                         7 tablet             
13 XXXXXXX             X XXXX-XX-XX da41.  04030300        NA        Fluoxetine 20mg capsules                         30.000               
14 XXXXXXX             X XXXX-XX-XX d712.  NA              321746008 AMITRIPTYLINE HCl tabs 25mg                      56.000               
15 XXXXXXX             X XXXX-XX-XX NA     04.03.01.00.00  NA        Amitriptyline 25mg tablets                       168 tablet(s) - 25 mg
16 XXXXXXX             X XXXX-XX-XX daBz.  NA              NA        NA                                               NA                   
17 XXXXXXX             X XXXX-XX-XX NA     04.03.01.00.00  NA        Amitriptyline 10mg tablets                       28 tablets           
18 XXXXXXX             X XXXX-XX-XX NA     04.03.01.00.00  NA        Clomipramine 50mg capsules                       168 capsules - 50 mg 
19 XXXXXXX             X XXXX-XX-XX NA     04.03.03.00.00  NA        Sertraline 100mg tablets                         14 tablets           
20 XXXXXXX             X XXXX-XX-XX d79z.  NA              NA        NA                                               NA

In this subset of prescriptions, there are 1,655 and 4,830 unique entries for drug_name and quantity respectively.

px %>%
    distinct(drug_name) %>%
    nrow()
px %>%
    distinct(quantity) %>%
    nrow()

[1] 1655
[1] 4830

A first pass tidy up of these two columns significantly reduces the number of unique entries.

tidy_rgx <- "[\\*\\-\\&\\'\\,]|\\(.*[\\)\\.{3}]|\\.{3}|\\.000$|\\[.*\\]|[\\[\\]]"
# \\/ part of dosage

px <- px %>%
    mutate(
        across(c(drug_name, quantity), ~ {
            tolower(.x) %>%
                str_replace_all(tidy_rgx, " ") %>%
                str_trim() %>%
                str_squish()
        })
    )

px %>%
    distinct(drug_name) %>%
    nrow()
px %>%
    distinct(quantity) %>%
    nrow()

[1] 1116
[1] 3568

Additional tidying

This data will require further tidying, some of which will be specific to the subset of prescriptions you collect. In the following, I’ve created a series of regular expressions to capture alternatives for the same descriptor, then added a new columns for drug, dose, and form.

tab_rgx <- c("tablets", "tablet", "tabs", "tab", "soltab") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

cap_rgx <- c("capsules", "capsule", "caps", "cap", "ca", "cps", "capsules\\d+mg") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

sol_rgx <- c("solution", "soln") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

drp_rgx <- c("drop", "drops") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

sus_rgx <- c("suspension", "susp", "sus") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

syr_rgx <- c("syrup", "syrp", "syr") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

liq_rgx <- c("liquid", "liq") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

eli_rgx <- c("elixir", "eli") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

mix_rgx <- c("mixture", "mix") %>%
    str_c("\\b", ., "\\b", collapse = "|") %>%
    regex()

px <- px %>%
    mutate(
        drug = word(drug_name, 1),
        dose = str_extract(drug_name, "\\d*\\ *mg|\\d*\\ *microgram") %>% str_trim(),
        form = case_when(
            str_detect(drug_name, tab_rgx) |
                str_detect(quantity, tab_rgx) ~ "tablet",
            str_detect(drug_name, cap_rgx) |
                str_detect(quantity, cap_rgx) ~ "capsule",
            str_detect(drug_name, sol_rgx) |
                str_detect(quantity, sol_rgx) ~ "solution",
            str_detect(drug_name, drp_rgx) |
                str_detect(quantity, drp_rgx) ~ "drops",
            str_detect(drug_name, sus_rgx) |
                str_detect(quantity, sus_rgx) ~ "suspension",
            str_detect(drug_name, syr_rgx) |
                str_detect(quantity, syr_rgx) ~ "syrup",
            str_detect(drug_name, liq_rgx) |
                str_detect(quantity, liq_rgx) ~ "liquid",
            str_detect(drug_name, eli_rgx) |
                str_detect(quantity, eli_rgx) ~ "elixir",
            str_detect(drug_name, mix_rgx) |
                str_detect(quantity, mix_rgx) ~ "mixture"
        )
    )

This is not an exhautive clean of the data. You could continue to clean this data by iteratively running the above chunk, then inspecting the remaining unique entries in drug_name for missing values in drug/ form/ dose as follows

px %>%
    filter(is.na(form)) %>%
    distinct(drug_name) %>%
    print(n = Inf)

Add further terms to the regular expressions above, and repeat until the data are sufficiently clean for your purposes.

3. Other medication reference data

data	source
drug_pharmgkb	PharmGKB curated drug list (https://www.pharmgkb.org/downloads, drugs.zip)
drug_gwas	ukb.gwas.medication.supplementary.data.1.xlsx - Genome-wide association study of medication-use and associated disease in the UK Biobank, Nature Communications
drug_dmd_antidep	A curated list of dm+d antidepressant drug names from Fabbri et al.

4. Resources

NHS prescription drugs - https://openprescribing.net/bnf/ - https://www.thedatalab.org/blog/161/prescribing-data-bnf-codes/ - https://digital.nhs.uk/services/terminology-and-classifications/read-codes - https://digital.nhs.uk/data-and-information/areas-of-interest/prescribing/practice-level-prescribing-in-england-a-summary

ATC codes - https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm - https://bioportal.bioontology.org/ontologies/ATC/?p=summary