vignettes/primary-care-prescriptions.Rmd
primary-care-prescriptions.Rmd
UKB primary care presciption data are included in the Primary Care Linked data, accessed through the UKB Data Portal. The data can be explored online with SQL queries, the results of which can be downloaded. Alternatively, the full table can be downloaded.
For each KCL UKB application with access to primary care prescription data, we have downloaded the full table and converted it to an R disk.frame for fast local queries with ukbkings bio_record
and bio_record_map
. (See the article Access UKB data on Rosalind/CREATE for a detailed description of the use of bio_record
and bio_record_map
to query all record-level data available to a particular KCL UKB application)
bio_record
returns a diskframe, an object which can be queried in chunks without reading the data into R.
library(ukbkings)
library(dplyr)
library(stringr)
library(readxl)
library(lubridate)
project_dir <- "<path_to_project_directory>"
diskf <- bio_record(project_dir, record = "gp_scripts")
head(diskf)
eid data_provider issue_date read_2 bnf_code dmd_code drug_name quantity 1: XXXXXXX X XX/XX/XXXX bi33. 02050501Lisinopril 10mg tablets 56.000 2: XXXXXXX X XX/XX/XXXX blb1. 02060200 Amlodipine 5mg tablets 56.000 3: XXXXXXX X XX/XX/XXXX bxd5. 02120400 Simvastatin 40mg tablets 56.000 4: XXXXXXX X XX/XX/XXXX l412. 12010151 Otosporin ear drops (GlaxoSmithKline UK Ltd) 10.000 5: XXXXXXX X XX/XX/XXXX bi33. 02050501 Lisinopril 10mg tablets 28.000 6: XXXXXXX X XX/XX/XXXX blb1. 02060200 Amlodipine 5mg tablets 28.000
You can use dplyr verbs to filter
rows, select
columns, etc., and finally collect
to retrieve the resulting data into a local dataframe. Below is an example of using a combination of bio_record
and dplyr verbs to retrieve antidepressant prescriptions.
The primary care prescription data include 3 prescription coding systems
code | meaning |
---|---|
bnf_code | British National Formulary (BNF) is a reference book containing the standard list of medicines used in UK |
dmd_code | Dictionary of Medicines and Devices (dm+d), the catalogue of things that can be prescribed in the NHS |
read_2 | Read Codes (version 2 (v2)) are a coded thesaurus of clinical terms (used in the NHS since 1985). There is also a version 3 (CTV3 or v3) |
To retrieve prescriptions for a particular drug or drug class, search all three drug code systems and drug_name which includes name and dosage information combined. Below is an example retrieving antidepressant prescriptions.
bnf <- bio_record(project_dir, record = "gp_scripts") %>%
filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>%
collect()
Here I’m using bio_code_primary_care
to read the UKB-supplied primary care code lookup tables to find Read 2 code equivalents of BNF antidepressant codes.
read2_from_bnf <- bio_code_primary_care(project_dir, "read_v2_drugs_bnf") %>%
filter(str_detect(bnf_code, "^04\\.?03\\.?0[1-4]")) %>%
pull(read_code)
# Make a regular expression for search filter
read2_rgx <- read2_from_bnf %>%
str_replace_all("\\.", "\\\\.") %>%
str_c("^", ., collapse = "|") %>%
regex()
read2 <- bio_record(project_dir, record = "gp_scripts") %>%
filter(str_detect(read_2, read2_rgx)) %>%
collect() %>%
mutate(read_2 = str_replace(read_2, "00$", ""))
The mutate
drops a trailing “00” on some read2 codes (a UKB artifact?).
Note. Fabbri et al. (2021) Genetic and clinical characteristics of treatment-resistant depression using primary care records in two UK cohorts use an augmented list of read 2 codes, which includes the additional codes: du6z. du61. gdez. gdey. gde2. gde1.
This list is in supplementary material table 1, available from the UKB resources folder on Rosalind/CREATE. (See the paper’s code for treatment-resistant depression phenotype derivation, as well as antidepressant extraction)
# -curated antidepressants - PMID33753889
cf_path <- file.path(
dirname(project_dir),
"resources/pmid33753889_medrxiv_supplementary_table1.xlsx"
)
excel_sheets(cf_path)
[1] "readv2_mood_psych_sub" "readv3_mood_psych_sub" "ADs_read_2_codes" "bnf_drug_names" "dmd_drug_names" "read2_drug_names"
Chiara’s augmented list of read 2 codes is in the sheet “ADs_read_2_codes”.
# CF Read 2 code list
cf_path <- file.path(
dirname(project_dir),
"resources/pmid33753889_medrxiv_supplementary_table1.xlsx"
)
read2_from_cf <- read_excel(cf_path, sheet = "ADs_read_2_codes") %>%
pull(read2_code)
read2_rgx <- read2_from_cf %>%
str_replace_all("\\.", "\\\\.") %>%
str_c("^", ., collapse = "|") %>%
regex()
read2 <- bio_record(project_dir, record = "gp_scripts") %>%
filter(str_detect(read_2, read2_rgx)) %>%
collect() %>%
mutate(read_2 = str_replace(read_2, "00$", ""))
You could also search for known dm+d codes in column dmd_code, and/or known drug names in drug_name, to retrieve any prescriptions with missing BNF and Read 2 codes. For example, Chiara’s list of curated antidepressant drug names is included in the ukbkings dataset drug_dmd_antidep
.
drug_name_rgx <- str_c(drug_dmd_antidep$dmd_name) %>%
str_c(collapse = "|") %>%
regex(ignore_case = TRUE)
other <- bio_record(project_dir, record = "gp_scripts") %>%
filter(bnf_code == "" & read_2 == "") %>%
filter(str_detect(drug_name, drug_name_rgx)) %>%
collect()
The filter
on empty strings (missing values) in bnf_code and read_2 code searches among prescriptions not already captured by a bnf or read 2 code.
Merge prescriptions
Replace empty strings with NA
, fix column types, and drop prescriptions without an associated date (if appropriate).
px <- px %>%
mutate(
across(.fns = ~ na_if(.x, "")),
across(read_2:quantity, as.character),
issue_date = as_date(issue_date, format = "%d/%m/%Y"),
) %>%
filter(!is.na(issue_date))
The drug_name and quantity columns include a mix of incomplete information (e.g., drug, dosage, form (i.e., capsule, tablet, etc), count, company) and style (upper/lower case, integer/floating point, same form expressed in many ways (e.g., tab, tablet, tablets, tablet(s))). A random sample of 20 rows shows some the above
# A tibble: 20 × 8 eid data_provider issue_date read_2 bnf_code dmd_code drug_name quantity1 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Citalopram 20mg tablets 28 tablets 2 XXXXXXX X XXXX-XX-XX NA 04.03.04.00.00 NA Venlafaxine 75mg modified-release capsules 56 capsule(s) - 75 mg 3 XXXXXXX X XXXX-XX-XX NA 0403010N0AAAEAE NA IMIPRAMINE HYDROCHLORIDE TABLETS 25MG 60.000 4 XXXXXXX X XXXX-XX-XX da91. NA NA NA NA 5 XXXXXXX X XXXX-XX-XX da93. NA NA NA NA 6 XXXXXXX X XXXX-XX-XX NA 040301 NA Dosulepin Hydrochloride CAPS 25MG 168.000 7 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Prozac 20mg capsules (Eli Lilly and Company Ltd) 1 OP 8 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Fluoxetine 20mg capsules 60 capsule 9 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Fluoxetine 20mg capsules 60 capsules 10 XXXXXXX X XXXX-XX-XX NA 04.03.04.00.00 NA Venlafaxine 75mg modified-release capsules 84 capsule(s) 11 XXXXXXX X XXXX-XX-XX NA 04.03.04.00.00 NA Mirtazapine 45mg tablets 28 tablet 12 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Sertraline 100mg tablets 7 tablet 13 XXXXXXX X XXXX-XX-XX da41. 04030300 NA Fluoxetine 20mg capsules 30.000 14 XXXXXXX X XXXX-XX-XX d712. NA 321746008 AMITRIPTYLINE HCl tabs 25mg 56.000 15 XXXXXXX X XXXX-XX-XX NA 04.03.01.00.00 NA Amitriptyline 25mg tablets 168 tablet(s) - 25 mg 16 XXXXXXX X XXXX-XX-XX daBz. NA NA NA NA 17 XXXXXXX X XXXX-XX-XX NA 04.03.01.00.00 NA Amitriptyline 10mg tablets 28 tablets 18 XXXXXXX X XXXX-XX-XX NA 04.03.01.00.00 NA Clomipramine 50mg capsules 168 capsules - 50 mg 19 XXXXXXX X XXXX-XX-XX NA 04.03.03.00.00 NA Sertraline 100mg tablets 14 tablets 20 XXXXXXX X XXXX-XX-XX d79z. NA NA NA NA
In this subset of prescriptions, there are 1,655 and 4,830 unique entries for drug_name and quantity respectively.
[1] 1655 [1] 4830
A first pass tidy up of these two columns significantly reduces the number of unique entries.
tidy_rgx <- "[\\*\\-\\&\\'\\,]|\\(.*[\\)\\.{3}]|\\.{3}|\\.000$|\\[.*\\]|[\\[\\]]"
# \\/ part of dosage
px <- px %>%
mutate(
across(c(drug_name, quantity), ~ {
tolower(.x) %>%
str_replace_all(tidy_rgx, " ") %>%
str_trim() %>%
str_squish()
})
)
px %>%
distinct(drug_name) %>%
nrow()
px %>%
distinct(quantity) %>%
nrow()
[1] 1116 [1] 3568
This data will require further tidying, some of which will be specific to the subset of prescriptions you collect. In the following, I’ve created a series of regular expressions to capture alternatives for the same descriptor, then added a new columns for drug, dose, and form.
tab_rgx <- c("tablets", "tablet", "tabs", "tab", "soltab") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
cap_rgx <- c("capsules", "capsule", "caps", "cap", "ca", "cps", "capsules\\d+mg") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
sol_rgx <- c("solution", "soln") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
drp_rgx <- c("drop", "drops") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
sus_rgx <- c("suspension", "susp", "sus") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
syr_rgx <- c("syrup", "syrp", "syr") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
liq_rgx <- c("liquid", "liq") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
eli_rgx <- c("elixir", "eli") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
mix_rgx <- c("mixture", "mix") %>%
str_c("\\b", ., "\\b", collapse = "|") %>%
regex()
px <- px %>%
mutate(
drug = word(drug_name, 1),
dose = str_extract(drug_name, "\\d*\\ *mg|\\d*\\ *microgram") %>% str_trim(),
form = case_when(
str_detect(drug_name, tab_rgx) |
str_detect(quantity, tab_rgx) ~ "tablet",
str_detect(drug_name, cap_rgx) |
str_detect(quantity, cap_rgx) ~ "capsule",
str_detect(drug_name, sol_rgx) |
str_detect(quantity, sol_rgx) ~ "solution",
str_detect(drug_name, drp_rgx) |
str_detect(quantity, drp_rgx) ~ "drops",
str_detect(drug_name, sus_rgx) |
str_detect(quantity, sus_rgx) ~ "suspension",
str_detect(drug_name, syr_rgx) |
str_detect(quantity, syr_rgx) ~ "syrup",
str_detect(drug_name, liq_rgx) |
str_detect(quantity, liq_rgx) ~ "liquid",
str_detect(drug_name, eli_rgx) |
str_detect(quantity, eli_rgx) ~ "elixir",
str_detect(drug_name, mix_rgx) |
str_detect(quantity, mix_rgx) ~ "mixture"
)
)
This is not an exhautive clean of the data. You could continue to clean this data by iteratively running the above chunk, then inspecting the remaining unique entries in drug_name for missing values in drug/ form/ dose as follows
Add further terms to the regular expressions above, and repeat until the data are sufficiently clean for your purposes.
data | source |
---|---|
drug_pharmgkb | PharmGKB curated drug list (https://www.pharmgkb.org/downloads, drugs.zip) |
drug_gwas | ukb.gwas.medication.supplementary.data.1.xlsx - Genome-wide association study of medication-use and associated disease in the UK Biobank, Nature Communications |
drug_dmd_antidep | A curated list of dm+d antidepressant drug names from Fabbri et al. |
NHS prescription drugs - https://openprescribing.net/bnf/ - https://www.thedatalab.org/blog/161/prescribing-data-bnf-codes/ - https://digital.nhs.uk/services/terminology-and-classifications/read-codes - https://digital.nhs.uk/data-and-information/areas-of-interest/prescribing/practice-level-prescribing-in-england-a-summary
ATC codes - https://ec.europa.eu/health/documents/community-register/html/reg_hum_atc.htm - https://bioportal.bioontology.org/ontologies/ATC/?p=summary