A thin wrapper around purrr::reduce and dplyr::full_join to merge multiple UKB datasets.
ukb_df_full_join(..., by = "eid")
...: Comma-separated, unquoted names of the UKB datasets to merge (created with ukb_df). Arguments are passed to list.
by: Variable used to merge multiple dataframes (default = "eid").
The function takes a comma-separated list of unquoted datasets. Because the join key is explicitly set to "eid" only (the default value of the by parameter), any additional variables common to any two tables will have ".x" and ".y" appended to their names. If you are satisfied that the suffixed copies are identical to the original, they can be safely deleted. For example, if setequal(my_ukb_data$var, my_ukb_data$var.x) is TRUE, then my_ukb_data$var.x can be dropped. A dplyr::full_join is like the set operation union in that all observations from all tables are included, i.e., all samples are included even if they are not present in all datasets.
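
The join behaviour described above can be sketched with plain purrr and dplyr calls on toy data. This is a minimal illustration, not the ukbtools source: the data frames ukb_a and ukb_b and their columns are hypothetical stand-ins for datasets created with ukb_df, and the coalesce step at the end is one conservative way to collapse the suffixed copies rather than something the function does for you.

```r
library(dplyr)
library(purrr)

# Hypothetical toy datasets; "sex" appears in both, "eid" is the join key.
ukb_a <- data.frame(eid = c(1, 2, 3), sex = c("F", "M", "F"),
                    age = c(50, 61, 47), stringsAsFactors = FALSE)
ukb_b <- data.frame(eid = c(2, 3, 4), sex = c("M", "F", "M"),
                    bmi = c(27.1, 22.4, 30.2), stringsAsFactors = FALSE)

# Fold the list of datasets pairwise with a full join on "eid" only.
merged <- purrr::reduce(
  list(ukb_a, ukb_b),
  function(x, y) dplyr::full_join(x, y, by = "eid")
)

# All four eids are kept (union of samples), and the shared non-key
# column is split into sex.x and sex.y.
names(merged)
nrow(merged)

# Compare the two copies only where both are observed.
both <- !is.na(merged$sex.x) & !is.na(merged$sex.y)
if (all(merged$sex.x[both] == merged$sex.y[both])) {
  # Assumption: if the copies agree, keep a single column filled from either.
  merged$sex <- dplyr::coalesce(merged$sex.x, merged$sex.y)
  merged$sex.x <- NULL
  merged$sex.y <- NULL
}
```

Comparing only the rows where both copies are non-missing avoids treating the NA values introduced by the full join as disagreements.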
NB. ukb_df_full_join will fail if any variable names are repeated **within** a single UKB dataset. This is unlikely to occur; however, ukb_df creates variable names by combining a snake_case descriptor with the variable's **index** and **array**. If an index_array combination is incorrectly repeated, this will result in a duplicated variable name. If the join fails, you can use ukb_df_duplicated_name to find duplicated names. See vignette(topic = "explore-ukb-data", package = "ukbtools") for further details.
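
As a quick pre-check before merging, repeated column names within a single dataset can also be spotted with base R. This is only a sketch on a hypothetical data frame (ukb_c and the column name standing_height_0_0 are made up for illustration); it is not the implementation of ukb_df_duplicated_name.

```r
# Hypothetical dataset in which one descriptor_index_array name is repeated.
# check.names = FALSE stops data.frame() from silently renaming the duplicate.
ukb_c <- data.frame(
  eid = 1:2,
  standing_height_0_0 = c(170, 182),
  standing_height_0_0 = c(171, 183),
  check.names = FALSE
)

# Names that occur more than once within this single dataset.
dup <- unique(names(ukb_c)[duplicated(names(ukb_c))])
dup
```

An empty result for every dataset suggests that repeated names will not be the reason a join fails.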