Working with lyrics data

The taylor_all_songs and taylor_album_songs data sets contain the lyrics for each of Taylor Swift’s songs, as well as audio characteristics. In each data set, the lyrics are stored as a list-column.

library(taylor)
library(dplyr)

track_lyrics <- taylor_album_songs %>%
  select(album_name, track_name, lyrics)

track_lyrics
#> # A tibble: 240 × 3
#>    album_name   track_name                 lyrics           
#>    <chr>        <chr>                      <list>           
#>  1 Taylor Swift Tim McGraw                 <tibble [55 × 4]>
#>  2 Taylor Swift Picture To Burn            <tibble [33 × 4]>
#>  3 Taylor Swift Teardrops On My Guitar     <tibble [36 × 4]>
#>  4 Taylor Swift A Place In This World      <tibble [27 × 4]>
#>  5 Taylor Swift Cold As You                <tibble [24 × 4]>
#>  6 Taylor Swift The Outside                <tibble [37 × 4]>
#>  7 Taylor Swift Tied Together With A Smile <tibble [36 × 4]>
#>  8 Taylor Swift Stay Beautiful             <tibble [51 × 4]>
#>  9 Taylor Swift Should've Said No          <tibble [44 × 4]>
#> 10 Taylor Swift Mary's Song (Oh My My My)  <tibble [38 × 4]>
#> # ℹ 230 more rows

In other words, both taylor_all_songs and taylor_ablum_songs are nested data frames. Each has one row per track, and the lyrics for each track are are stored in another data frame nested within each row.

There are three primary ways to access data from a nested list-column. The first is to extract individual list elements. For example, if we want to see the lyrics for “Cruel Summer,” we can look up which row of the data set contains “Cruel Summer” and then access that element of the list.

track_row <- which(track_lyrics$track_name == "Cruel Summer")
track_lyrics$lyrics[[track_row]]
#> # A tibble: 62 × 4
#>     line lyric                                            element element_artist
#>    <int> <chr>                                            <chr>   <chr>         
#>  1     1 (Yeah, yeah, yeah, yeah)                         Intro   Taylor Swift  
#>  2     2 Fever dream high in the quiet of the night       Verse 1 Taylor Swift  
#>  3     3 You know that I caught it (Oh yeah, you're righ… Verse 1 Taylor Swift  
#>  4     4 Bad, bad boy, shiny toy with a price             Verse 1 Taylor Swift  
#>  5     5 You know that I bought it (Oh yeah, you're righ… Verse 1 Taylor Swift  
#>  6     6 Killing me slow, out the window                  Pre-Ch… Taylor Swift  
#>  7     7 I'm always waiting for you to be waiting below   Pre-Ch… Taylor Swift  
#>  8     8 Devils roll the dice, angels roll their eyes     Pre-Ch… Taylor Swift  
#>  9     9 What doesn't kill me makes me want you more      Pre-Ch… Taylor Swift  
#> 10    10 And it's new, the shape of your body             Chorus  Taylor Swift  
#> # ℹ 52 more rows

As expected, this returns another data frame, with one row for each line in the song. However, this approach only allows us to unnest one track at a time. A more efficient method for extracting data in a nested list-column is to use tidyr::unnest(). This approach unnests all of the data in a list-column at once. Rather than having a data set that is one row per song, we now have a data set that is one row per line per song.

library(tidyr)

track_lyrics %>%
  unnest(lyrics)
#> # A tibble: 12,151 × 6
#>    album_name   track_name  line lyric                    element element_artist
#>    <chr>        <chr>      <int> <chr>                    <chr>   <chr>         
#>  1 Taylor Swift Tim McGraw     1 "He said the way my blu… Verse 1 Taylor Swift  
#>  2 Taylor Swift Tim McGraw     2 "Put those Georgia star… Verse 1 Taylor Swift  
#>  3 Taylor Swift Tim McGraw     3 "I said, \"That's a lie… Verse 1 Taylor Swift  
#>  4 Taylor Swift Tim McGraw     4 "Just a boy in a Chevy … Verse 1 Taylor Swift  
#>  5 Taylor Swift Tim McGraw     5 "That had a tendency of… Verse 1 Taylor Swift  
#>  6 Taylor Swift Tim McGraw     6 "On backroads at night"  Verse 1 Taylor Swift  
#>  7 Taylor Swift Tim McGraw     7 "And I was right there … Verse 1 Taylor Swift  
#>  8 Taylor Swift Tim McGraw     8 "And then the time we w… Verse 1 Taylor Swift  
#>  9 Taylor Swift Tim McGraw     9 "But when you think Tim… Chorus  Taylor Swift  
#> 10 Taylor Swift Tim McGraw    10 "I hope you think my fa… Chorus  Taylor Swift  
#> # ℹ 12,141 more rows

If we are interested in the lyrics for only a specific album or a specific song, we can always use dplyr::filter() to include only the data we are interested in.

track_lyrics %>%
  filter(track_name == "Cruel Summer") %>%
  unnest(lyrics)
#> # A tibble: 62 × 6
#>    album_name track_name    line lyric                    element element_artist
#>    <chr>      <chr>        <int> <chr>                    <chr>   <chr>         
#>  1 Lover      Cruel Summer     1 (Yeah, yeah, yeah, yeah) Intro   Taylor Swift  
#>  2 Lover      Cruel Summer     2 Fever dream high in the… Verse 1 Taylor Swift  
#>  3 Lover      Cruel Summer     3 You know that I caught … Verse 1 Taylor Swift  
#>  4 Lover      Cruel Summer     4 Bad, bad boy, shiny toy… Verse 1 Taylor Swift  
#>  5 Lover      Cruel Summer     5 You know that I bought … Verse 1 Taylor Swift  
#>  6 Lover      Cruel Summer     6 Killing me slow, out th… Pre-Ch… Taylor Swift  
#>  7 Lover      Cruel Summer     7 I'm always waiting for … Pre-Ch… Taylor Swift  
#>  8 Lover      Cruel Summer     8 Devils roll the dice, a… Pre-Ch… Taylor Swift  
#>  9 Lover      Cruel Summer     9 What doesn't kill me ma… Pre-Ch… Taylor Swift  
#> 10 Lover      Cruel Summer    10 And it's new, the shape… Chorus  Taylor Swift  
#> # ℹ 52 more rows

Finally, sometimes we want to perform a calculation on each element of a list-column. In this case, we don’t necessarily need to unnest each element. Instead, we can use a combination of dplyr::mutate() and purrr::map() to apply a function to each element of the list-column. For example, we if want to know the number of lines in each song, we can apply nrow() to each element of the list column. Because nrow() returns and integer value, we’ll use vapply() (we could also use purrr::map_int()).

track_lyrics %>%
  filter(album_name == "Lover") %>%
  mutate(lines = vapply(lyrics, nrow, integer(1)))
#> # A tibble: 18 × 4
#>    album_name track_name                             lyrics            lines
#>    <chr>      <chr>                                  <list>            <int>
#>  1 Lover      I Forgot That You Existed              <tibble [45 × 4]>    45
#>  2 Lover      Cruel Summer                           <tibble [62 × 4]>    62
#>  3 Lover      Lover                                  <tibble [33 × 4]>    33
#>  4 Lover      The Man                                <tibble [48 × 4]>    48
#>  5 Lover      The Archer                             <tibble [45 × 4]>    45
#>  6 Lover      I Think He Knows                       <tibble [65 × 4]>    65
#>  7 Lover      Miss Americana & The Heartbreak Prince <tibble [62 × 4]>    62
#>  8 Lover      Paper Rings                            <tibble [65 × 4]>    65
#>  9 Lover      Cornelia Street                        <tibble [53 × 4]>    53
#> 10 Lover      Death By A Thousand Cuts               <tibble [60 × 4]>    60
#> 11 Lover      London Boy                             <tibble [58 × 4]>    58
#> 12 Lover      Soon You'll Get Better                 <tibble [46 × 4]>    46
#> 13 Lover      False God                              <tibble [50 × 4]>    50
#> 14 Lover      You Need To Calm Down                  <tibble [40 × 4]>    40
#> 15 Lover      Afterglow                              <tibble [48 × 4]>    48
#> 16 Lover      ME!                                    <tibble [64 × 4]>    64
#> 17 Lover      It's Nice To Have A Friend             <tibble [28 × 4]>    28
#> 18 Lover      Daylight                               <tibble [59 × 4]>    59

The resulting data frame is still one row per song, because we have not unnested the lyrics. However, our summary statistic has been added as an additional column.