Let’s discuss joins | Crystal Lewis
Working with data would be a lot less complicated if we always had only one dataset to work with. However, in the world of research, we often have multiple datasets, collected from different instruments, participants, or time periods, and our research questions often require these data to be linked in some way. Yet, before combining data, it's important to consider what type of join makes the most sense for our specific purposes, as well as how to correctly perform those joins. This blog post reviews the various ways we might consider combining our data.
Types of joins
Generally, there are two ways to link our data: horizontally or vertically. When linking or joining data horizontally, we match rows by one or more variables (i.e., keys), creating a wider dataset. When joining vertically, column names are matched and datasets are stacked on top of one another, creating a longer dataset. Joins can be performed in many different programs (e.g., SQL, R, Stata, SAS). Most of this post is applicable to any language, but examples in R will be provided.
Horizontal joins
Joining data horizontally, also called merging, can be used in a variety of scenarios. A few examples include:
- Linking data across instruments (a student survey + a student assessment)
- Linking data across time (a student survey in the fall + a student survey in the spring)
- Linking data across participants (a student assessment + a teacher survey)
- Linking for de-identification purposes (a student survey with name + a student roster with study ID)
There are several different types of horizontal joins you can perform. In this post we are going to discuss mutating joins that increase the size of your dataset, as opposed to horizontal joins used for filtering data (learn more here). The joins we'll discuss include:
- Left join
  - In a left join, all cases in the dataset on the left (or our first chosen dataset) are maintained. Cases from the left dataset are joined with any matching data that exists in the dataset on the right (or our second chosen dataset). If additional, non-matching cases exist in our right dataset, they will not be carried over.
  - Here we typically expect the combined dataset to have the same number of rows as our original left-side dataset.
- Right join
  - In a right join, all cases in the dataset on the right (or our second chosen dataset) are maintained. Cases from the right dataset are joined with any matching data that exists in the dataset on the left (or our first chosen dataset). If additional, non-matching cases exist in our left dataset, they will not be carried over.
  - Here we typically expect the combined dataset to have the same number of rows as our original right-side dataset.
- Full join
  - In a full join, cases from both datasets are maintained. Any cases that exist in one dataset but not the other will still be maintained in the final combined dataset.
- Inner join
  - In an inner join, only cases that exist in both datasets are maintained. If a case exists in one dataset but not the other, it will not exist in the combined dataset.
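If it helps to see these four behaviors side by side, here is a minimal base R sketch using merge() on invented toy data (the datasets and values are made up for illustration; dplyr's left_join(), right_join(), full_join(), and inner_join() behave analogously):

```r
# Hypothetical toy data: ids 2 and 3 exist in both, 1 only on the left, 4 only on the right
svy    <- data.frame(stu_id = c(1, 2, 3), score = c(10, 12, 9))
roster <- data.frame(stu_id = c(2, 3, 4), grade = c("A", "B", "C"))

left  <- merge(svy, roster, by = "stu_id", all.x = TRUE)  # keeps all left rows
right <- merge(svy, roster, by = "stu_id", all.y = TRUE)  # keeps all right rows
full  <- merge(svy, roster, by = "stu_id", all   = TRUE)  # keeps rows from both
inner <- merge(svy, roster, by = "stu_id")                # keeps only matches

c(nrow(left), nrow(right), nrow(full), nrow(inner))  # 3 3 4 2
```

Note how the unmatched rows get NA filled in: in the full join, stu_id 4 has no score and stu_id 1 has no grade.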
There are two important rules when performing horizontal joins.
- Variable names cannot repeat.
  - This means if a variable is named gender in your student survey dataset and a variable is also named gender in your district records demographics dataset, those names will need to be edited (e.g., district gender could be renamed to d_gender). You can learn more about variable naming here.
  - This rule does not apply to your linking keys (e.g., study ID), which are often named identically across datasets.
- Each dataset must contain a key.
  - There are two types of keys that allow you to link data files: primary and foreign keys. Every dataset should include a primary key that uniquely identifies rows in the dataset. Datasets may also include foreign keys, which contain values associated with a primary key in another table. While primary keys cannot include missing or duplicated values, those values are allowed in foreign keys.
  - Keys are typically one variable (e.g., a unique study ID), but they may also consist of more than one variable (e.g., first name + last name), in which case they are called a concatenated key (also known as a compound or composite key). In Figure 2, primary keys are denoted by rectangles, and foreign keys are denoted by ovals. Arrows show how data can be linked through both primary and foreign keys.
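These key rules can also be verified in code before joining. A minimal base R sketch, with hypothetical student and teacher tables standing in for real data:

```r
# Hypothetical tables: students carry a foreign key (tch_id) pointing at teachers
teachers <- data.frame(tch_id = c(406, 407), q1 = c(5, 5))
students <- data.frame(stu_id = c(20056, 20134, 20149),
                       tch_id = c(406, 407, 406))

# Primary keys must be unique and non-missing
stopifnot(!anyDuplicated(students$stu_id), !anyNA(students$stu_id))

# Foreign keys may repeat and may be NA, but every non-missing value
# should exist as a primary key in the parent table
stopifnot(all(students$tch_id %in% teachers$tch_id))
```

Running checks like these before a join catches duplicated IDs or orphaned foreign keys early, before they silently multiply or drop rows.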
Left join
Left joins are probably one of the most common types of joins. While they can be used in many scenarios, this type of join is often helpful in the data de-identification process. Let's say we have a teacher questionnaire dataset + a sample roster dataset.
Here we want to add our study ID (tch_id) to our questionnaire. In order to do this, we can join our questionnaire file with our roster file using a concatenated primary key (f_name and l_name). This works great as long as names are spelled identically across files. 😉
We see that our combined dataset has only three cases, which is exactly what we would expect when using a left join. The additional case in our right dataset should not carry over. While we could have used other types of joins, in this scenario we did not want to bring over the information for "Dejana Robert" because she doesn't exist in our questionnaire data and bringing her in would just create an empty row of data.
Let's try this join using R. The order of the concatenated key should not matter, but I tend to merge by last name, then first name. Here I will also fully de-identify the file by removing f_name and l_name after completing the join. I also reordered the variables to put tch_id at the front.
library(dplyr)
tch_svy |>
left_join(tch_roster, by = c("l_name", "f_name")) |>
select(tch_id, item1, item2, item3)
# A tibble: 3 x 4
tch_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 407 4 5 4
2 409 5 1 3
3 410 3 2 3
Note: While it is typically best practice, and much more convenient, for keys to be identically named across files, in programs like R you can still join data even if keys are named differently. See this example, where the variable names in the survey are f_name and l_name but in the roster they are first_name and last_name.
library(dplyr)
tch_svy |>
left_join(tch_roster, by = c("l_name" = "last_name", "f_name" = "first_name")) |>
select(tch_id, item1, item2, item3)
# A tibble: 3 x 4
tch_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 407 4 5 4
2 409 5 1 3
3 410 3 2 3
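As an aside, base R's merge() can perform the same differently-named-keys join via its by.x and by.y arguments. A sketch with invented stand-in data (the names and values below are hypothetical, not the post's actual files):

```r
# Hypothetical survey and roster with differently named key columns
tch_svy <- data.frame(f_name = c("Ann", "Bea"),
                      l_name = c("Lee", "Cho"),
                      item1  = c(4, 5))
tch_roster <- data.frame(first_name = c("Ann", "Bea", "Cal"),
                         last_name  = c("Lee", "Cho", "Day"),
                         tch_id     = c(407, 409, 406))

# by.x names the key columns in the left table, by.y in the right table
merged <- merge(tch_svy, tch_roster,
                by.x = c("l_name", "f_name"),
                by.y = c("last_name", "first_name"),
                all.x = TRUE)  # all.x = TRUE makes this a left join
nrow(merged)  # 2
```

The merged result keeps the left table's key names (l_name, f_name), just as dplyr's `by = c("l_name" = "last_name", ...)` form does.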
Right join
What if, however, we wanted a dataset with our full study sample in it and we did not care whether or not there was missing data for cases? In this case, we could use the same scenario as Figure 2, but instead do a right join (note that we could also just change the order of the datasets and use a left join again). Now when you join your datasets on your compound key, you will end up with four cases in your data.
Let’s try this again using R.
library(dplyr)
tch_svy |>
right_join(tch_roster, by = c("l_name", "f_name")) |>
select(tch_id, item1, item2, item3)
# A tibble: 4 x 4
tch_id item1 item2 item3
<dbl> <dbl> <dbl> <dbl>
1 407 4 5 4
2 409 5 1 3
3 410 3 2 3
4 406 NA NA NA
Full join
Full joins are very common in research. Imagine a scenario where you are collecting multiple instruments on participants, or you are collecting the same instrument on participants over multiple time points. In those cases, you may have missing data for some participants (i.e., a participant was absent for one of those data collections), but you still want any data that you were able to collect to appear in your combined dataset.
Let’s say you have a student questionnaire + a student assessment. In this case we want all data from both forms to exist in our combined dataset.
If we performed a full join on these two datasets using stu_id
as our key this time, we would expect our final dataset to have 5 cases (or rows). There is one case in each dataset that does not exist in the other.
Let’s see what a full join looks like in R.
library(dplyr)
stu_svy |>
full_join(stu_assess, by = "stu_id")
# A tibble: 5 x 7
stu_id item1 item2 item3 math1 math2 math3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20056 4 5 4 21 25 32
2 20134 5 1 3 15 22 41
3 20149 3 2 3 NA NA NA
4 20159 3 0 1 16 30 50
5 20160 NA NA NA 32 19 25
Inner join
Our last horizontal join type is one that I personally use less often, but there are certainly many scenarios where this will be useful. Let’s take, for instance, the case of a pre and post survey. In this case, we may only want cases in our combined data where a participant has both pre and post data. This is when an inner join can be very useful.
Let’s take a look at our example pre questionnaire + post questionnaire.
Using an inner join, we can merge data using our stu_id key again. However, in the case of longitudinal data, our variables are named identically over time, which violates one of our horizontal join rules. Therefore, in order to create unique variable names, and to be able to associate each variable with a time point of data collection, we must first concatenate a time period to each repeating variable before merging data. How time is assigned is completely up to the researcher (read here for more info). In this case, I added the words "pre" and "post" as prefixes, separated by a delimiter for readability and ease of use.
We see that our combined dataset only shows three cases because those are the only cases with both pre and post data available. However, if there had been an empty row for stu_id = 20149 in the post questionnaire data, that case would have been pulled into the combined dataset.
Let's see what an inner join looks like in R.
library(dplyr)
# First rename variables with pre and post prefixes
stu_svy_pre <- stu_svy_pre |>
rename_with(~ paste0("pre_", .), .cols = -stu_id)
stu_svy_post <- stu_svy_post |>
rename_with(~ paste0("post_", .), .cols = -stu_id)
# Then join data
stu_svy_pre |>
inner_join(stu_svy_post, by = "stu_id")
# A tibble: 3 x 7
stu_id pre_item1 pre_item2 pre_item3 post_item1 post_item2 post_item3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20056 4 5 4 5 1 2
2 20134 5 1 3 5 0 3
3 20159 3 0 1 4 0 3
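The prefixing step above is just a rename of the non-key columns. For illustration, here is a base R sketch of the same idea, using toy pre/post data (invented values, not the ones from this post); note that merge() defaults to an inner join:

```r
# Hypothetical pre/post data with identically named item columns
pre  <- data.frame(stu_id = c(1, 2, 3), item1 = c(4, 5, 3))
post <- data.frame(stu_id = c(1, 2),    item1 = c(5, 5))

# Prefix every column except the key so names no longer collide
names(pre)[-1]  <- paste0("pre_",  names(pre)[-1])
names(post)[-1] <- paste0("post_", names(post)[-1])

# merge() with no all* arguments is an inner join: only ids in both survive
combined <- merge(pre, post, by = "stu_id")
names(combined)  # "stu_id" "pre_item1" "post_item1"
```

Because stu_id 3 has no post row, the combined data has only two rows.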
Note: Inner joins are not just for longitudinal data; they can be used in any of the other scenarios we discussed. Similarly, longitudinal data can also be joined using any of the other methods we discussed.
Many relationships
Until now we have discussed scenarios that are considered one-to-one merges. In these cases, we only expect one participant in a dataset to be joined to one instance of that same participant in the other dataset.
However, there are scenarios where this will not be the case. Take for instance a case where we are merging information across participant groups (e.g., merging student data with teacher data, or merging teacher data with school data). In these cases, one teacher is often associated with multiple students, and one school is often associated with multiple teachers. When we merge data like this, we are working with a one-to-many or a many-to-one merge, depending on which dataset is first or second. In this scenario, we would expect to see repeating data in our combined dataset.
As one example, let's say we have a student questionnaire + a teacher questionnaire.
We can combine these data using the tch_id variable, which exists in both datasets. However, when we join, it will be a one-to-many or a many-to-one join depending on the order of the datasets and which type of join we use.
Let's say, for example, we use a left join, with the student questionnaire dataset on the left and the teacher questionnaire dataset on the right. Here we will be doing a many-to-one join, where multiple students are associated with a single teacher: two students will be linked with tch_id = 406 and two students will be linked with tch_id = 407.
After combining data, we see that we have four cases in our data, as we expect from using a left join, but the teacher information we merged in now repeats (twice for each tch_id).
Note: The typical rules of a left or right join don't apply when doing a one-to-many join, though. Say, for example, we moved our teacher dataset to the left and performed a left join. This would now be a one-to-many join, with one teacher being associated with many students. In this case, the final row count in your combined dataset will not match the row count of your original left dataset. Instead, it will match the row count of the many dataset (i.e., the teacher-level dataset will become a student-level dataset). So instead of two, the final row count will be four.
Let's perform a many-to-one left join using R, with our student data on the left and our teacher data on the right.
library(dplyr)
stu_svy |>
left_join(tch_svy, by = "tch_id")
# A tibble: 4 x 8
stu_id tch_id item1 item2 item3 q1 q2 q3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 20056 406 4 5 4 5 1 2
2 20134 407 5 1 3 5 0 3
3 20149 406 3 2 3 5 1 2
4 20159 407 3 0 1 5 0 3
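The row-count behavior described in the note above can be checked mechanically: whichever side's key contains duplicates is the "many" side, and the combined row count follows it. A base R sketch with invented data:

```r
# Hypothetical teacher ("one") and student ("many") tables
tch <- data.frame(tch_id = c(406, 407), q1 = c(5, 5))
stu <- data.frame(stu_id = 1:4, tch_id = c(406, 407, 406, 407))

# tch_id is unique in tch but repeats in stu, so this is one-to-many
anyDuplicated(tch$tch_id)  # 0: no duplicates
anyDuplicated(stu$tch_id)  # non-zero: duplicates present

# Even with the "one" side on the left, a left join returns 4 rows, not 2
nrow(merge(tch, stu, by = "tch_id", all.x = TRUE))  # 4
```

If you are on dplyr 1.1.0 or later, its join functions also accept a `relationship` argument (e.g., `relationship = "many-to-one"`) that errors when the data violate the relationship you expected.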
Vertical joins
Similar to horizontal joins, there are many use cases for joining data vertically, also called appending data. Some examples include:
- Combining similar data across cohorts
- Combining similar data collected from different sites or links
- Combining similar data collected across time
However, there are some fundamental differences between horizontal and vertical joins.
- Rather than joining data on keys, columns are matched by variable names.
- In this case you don't want unique variable names. Here, it is imperative that variables are named and formatted identically across datasets.
Let's take an example where we have a questionnaire collected across two cohorts of teachers. We can append these data, creating a longer dataset. The inclusion of the cohort variable allows users to know which data are associated with which cohort in the combined data.
Let's see what this type of join would look like in R. As a reminder, across the two datasets, if any of these variables are not identically named and formatted (e.g., numeric type, character type), you will get an error.
library(dplyr)
bind_rows(svy_c1, svy_c2)
# A tibble: 7 x 5
tch_id cohort item1 item2 item3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 406 1 4 5 4
2 407 1 5 1 3
3 406 1 3 2 3
4 407 1 3 0 1
5 415 2 5 1 2
6 418 2 5 0 3
7 419 2 4 0 3
You can also append longitudinal data. For example, if your repeated measures analysis requires data to be in long format, rather than wide (as seen with horizontal joins), longitudinal data can be vertically joined. In this case, rather than concatenating a time component to your time-varying variables, a time variable is added to the data to delineate which rows belong to which time period. How a time component is assigned is completely up to you. Here I chose to add a variable named wave and assign numeric values to my time points (read here for more info).
Note: Even if one dataset contains variables that don't exist in the other (e.g., a variable was added to the questionnaire at a later time), appending will still work.
Let's see this done using R.
library(dplyr)
# First add a wave variable
svy_w1 <- svy_w1 |>
mutate(wave = 1)
svy_w2 <- svy_w2 |>
mutate(wave = 2)
# Then append data
bind_rows(svy_w1, svy_w2) |>
relocate(wave, .after = tch_id)
# A tibble: 7 x 5
tch_id wave item1 item2 item3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 406 1 4 5 NA
2 407 1 5 1 NA
3 409 1 3 2 NA
4 410 1 3 0 NA
5 406 2 5 1 2
6 407 2 5 0 3
7 410 2 4 0 3
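As shown above, bind_rows() fills the item3 values missing from wave 1 with NA. Base R's rbind() instead errors when column names differ, so a dependency-free sketch of the same behavior has to pad the missing columns first (toy data and a hypothetical helper, invented for illustration):

```r
# Hypothetical waves where item3 was only added in wave 2
w1 <- data.frame(tch_id = c(406, 407), item1 = c(4, 5))
w2 <- data.frame(tch_id = c(406, 407), item1 = c(5, 5), item3 = c(2, 3))

# Add any columns a data frame lacks, filled with NA
pad_missing <- function(df, cols) {
  for (m in setdiff(cols, names(df))) df[[m]] <- NA
  df
}

all_cols <- union(names(w1), names(w2))
w1 <- pad_missing(w1, all_cols)
w2 <- pad_missing(w2, all_cols)

# With matching columns in matching order, rbind() can stack the waves
stacked <- rbind(w1[all_cols], w2[all_cols])
nrow(stacked)  # 4
```

This is essentially what bind_rows() does for you automatically, which is one reason it is the more convenient choice here.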
Note: Notice that tch_id no longer uniquely identifies rows in our combined dataset. When appending longitudinal data, we now have a concatenated primary key that uniquely identifies rows (tch_id + wave).
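One quick way to confirm that a concatenated key really is unique is to check for duplicates on the combination of columns rather than on either column alone. A small sketch with invented long-format data:

```r
# Hypothetical long-format data keyed by tch_id + wave
svy <- data.frame(tch_id = c(406, 407, 406, 407),
                  wave   = c(1, 1, 2, 2))

anyDuplicated(svy$tch_id)               # non-zero: tch_id alone repeats
anyDuplicated(svy[c("tch_id", "wave")]) # 0: the pair uniquely identifies rows
```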
Combining joins
In large research studies, it is common to combine horizontal and vertical joins. Take for instance a study that collects two waves of a teacher questionnaire, for two cohorts.
These data can be combined in many ways depending on your needs. One way we could combine these data might be:
- First, horizontally join within cohort. Here I'm choosing to do a full join.
- Then, append the cohorts.
Note: Since cohort appears in both datasets and we are joining horizontally, we will need to choose which cohort variable to keep in our full join. We don't want to keep both, not only because this will cause confusion, but also because they are identically named, violating one of our rules of horizontal joins. In both cases, I choose to drop the cohort variable from the right-side dataset because the left-side dataset has the most complete information. This will not always be the case. In some cases, you may want to use the right side, or combine information across both variables into one complete variable.
1. Let's first join cohort 1.
library(dplyr)
# First rename variables with wave 1 (w1) and wave 2 (w2) prefixes
# Also, drop cohort from the wave 2 dataset
tch_svy_w1_c1 <- tch_svy_w1_c1 |>
rename_with(~ paste0("w1_", .), .cols = -c(tch_id, cohort))
tch_svy_w2_c1 <- tch_svy_w2_c1 |>
rename_with(~ paste0("w2_", .), .cols = -c(tch_id, cohort)) |>
select(-cohort)
# Then horizontally join across waves
tch_svy_w1w2_c1 <- tch_svy_w1_c1 |>
full_join(tch_svy_w2_c1, by = "tch_id")
tch_svy_w1w2_c1
# A tibble: 3 x 6
tch_id cohort w1_item1 w1_item2 w2_item1 w2_item2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 406 1 4 5 5 1
2 407 1 5 1 5 0
3 408 1 4 4 NA NA
2. Next let's join cohort 2.
# First rename variables with wave 1 (w1) and wave 2 (w2) prefixes
# Also, drop cohort from the wave 2 dataset
tch_svy_w1_c2 <- tch_svy_w1_c2 |>
rename_with(~ paste0("w1_", .), .cols = -c(tch_id, cohort))
tch_svy_w2_c2 <- tch_svy_w2_c2 |>
rename_with(~ paste0("w2_", .), .cols = -c(tch_id, cohort)) |>
select(-cohort)
# Then horizontally join across waves
tch_svy_w1w2_c2 <- tch_svy_w1_c2 |>
full_join(tch_svy_w2_c2, by = "tch_id")
tch_svy_w1w2_c2
# A tibble: 3 x 6
tch_id cohort w1_item1 w1_item2 w2_item1 w2_item2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 415 2 4 3 5 3
2 418 2 4 1 4 0
3 419 2 3 2 NA NA
3. Now we can append the cohorts.
bind_rows(tch_svy_w1w2_c1, tch_svy_w1w2_c2)
# A tibble: 6 x 6
tch_id cohort w1_item1 w1_item2 w2_item1 w2_item2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 406 1 4 5 5 1
2 407 1 5 1 5 0
3 408 1 4 4 NA NA
4 415 2 4 3 5 3
5 418 2 4 1 4 0
6 419 2 3 2 NA NA
Note: We don't have to combine the data this way. We could change the order in which we join the data (i.e., append cohorts first, then horizontally join waves), or we could change the structure of the data entirely (e.g., append waves in long format, as well as append cohorts, creating a very long dataset).
Let's look at one more scenario where we are combining joins. In this case we have a teacher questionnaire collected across two waves, and we have school-level demographic data, also collected across two waves.
Again, we could combine these data in several ways, but here we are going to:
- First, append the two waves of data into long format.
- Then, horizontally join the school data using a left join.
1. Append within teacher data.
# First add a wave variable
tch_svy_w1 <- tch_svy_w1 |>
mutate(wave = 1)
tch_svy_w2 <- tch_svy_w2 |>
mutate(wave = 2)
# Then append
tch_svy <- bind_rows(tch_svy_w1, tch_svy_w2) |>
relocate(wave, .after = tch_id)
tch_svy
# A tibble: 6 x 5
tch_id wave sch_id q1 q2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 406 1 22 4 5
2 407 1 22 5 1
3 408 1 24 4 4
4 406 2 22 4 3
5 407 2 22 4 1
6 408 2 24 3 2
2. Append within school data.
# First add a wave variable
sch_w1 <- sch_w1 |>
mutate(wave = 1)
sch_w2 <- sch_w2 |>
mutate(wave = 2)
# Then append
sch_svy <- bind_rows(sch_w1, sch_w2) |>
relocate(wave, .after = sch_id)
sch_svy
# A tibble: 4 x 4
sch_id wave item1 item2
<dbl> <dbl> <dbl> <dbl>
1 22 1 500 62
2 24 1 415 85
3 22 2 520 55
4 24 2 430 90
3. Then left join teacher data with school data.
Note: Notice that because we have longitudinal data appended in long format, I have to use a concatenated primary key to join our data.
tch_svy |>
left_join(sch_svy, by = c("sch_id", "wave"))
# A tibble: 6 x 7
tch_id wave sch_id q1 q2 item1 item2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 406 1 22 4 5 500 62
2 407 1 22 5 1 500 62
3 408 1 24 4 4 415 85
4 406 2 22 4 3 520 55
5 407 2 22 4 1 520 55
6 408 2 24 3 2 430 90
Additional resources
This blog post is just a primer to get you started thinking about joins. There are many more joins, as well as many more combinations of joins, that can be used! In the end, it all depends on what is useful for your project and your purposes (read here for more info). Also, just because you can join data doesn't mean you need to rush into it. Datasets can easily be stored separately until it becomes necessary for you to join them. I usually think this is the best method because it allows you to more easily update individual files as needed, and it prevents you from potentially joining in a way that is ultimately not necessary or not aligned with what is needed (more info can be found here).
For further reading, check out these additional very helpful resources!