skipgrams with {slide}

October 18, 2019

In addition to the the essential Tidy Text Mining With R, Julia Silge’s personal blog is a frequent reference for me when doing tidy text analysis. Yesterday, I was working on detecting unusual bigrams in a corpus of restaurant reviews, and remembered a blog post featuring a slide_windows() function to simplify preparing skipgrams for computing pointwise mutual information via widyr::pairwise_pmi().

Here’s the original sliding windows code:

library(tidytext)
library(dplyr)
library(purrr)
library(slide)

slide_windows <- function(tbl, doc_var, window_size) {
  # each word gets a skipgram (window_size words) starting on the first
  # e.g. skipgram 1 starts on word 1, skipgram 2 starts on word 2
  
  each_total <- tbl %>% 
    group_by(!!doc_var) %>% 
    mutate(doc_total = n(),
            each_total = pmin(doc_total, window_size, na.rm = TRUE)) %>%
    pull(each_total)
  
  rle_each <- rle(each_total)
  counts <- rle_each[["lengths"]]
  counts[rle_each$values != window_size] <- 1
  
  # each word get a skipgram window, starting on the first
  # account for documents shorter than window
  id_counts <- rep(rle_each$values, counts)
  window_id <- rep(seq_along(id_counts), id_counts)
  
  # within each skipgram, there are window_size many offsets
  indexer <- (seq_along(rle_each[["values"]]) - 1) %>%
    map2(rle_each[["values"]] - 1,
          ~ seq.int(.x, .x + .y)) %>% 
    map2(counts, ~ rep(.x, .y)) %>%
    flatten_int() +
    window_id
  
  tbl[indexer, ] %>%
    bind_cols(data_frame(window_id)) %>%
    group_by(window_id) %>%
    filter(n_distinct(!!doc_var) == 1) %>%
    ungroup
}

Here’s what it returns on a trivial data set:

test <- tibble(
  speaker = c("banner", "strange"),
  text = c("thanos is coming", "who is thanos")
  )

test %>% 
  unnest_tokens(word, text) %>% 
  slide_windows(quo(speaker), 2)

## # A tibble: 8 x 3
##   speaker word   window_id
##   <chr>   <chr>      <int>
## 1 banner  thanos         1
## 2 banner  is             1
## 3 banner  is             2
## 4 banner  coming         2
## 5 strange who            4
## 6 strange is             4
## 7 strange is             5
## 8 strange thanos         5

The function name gave me an idea: I could use Davis Vaughan’s {slide} as another way to compute the necessary sliding windows. {slide} is especially exciting if you’ve ever had to do rolling computations with business calendars, but it is, as advertised, quite general-purpose.

Here’s the function with slide():

new_slide_windows <- function(tbl, doc_var, window_size) {
  
  window_size <- window_size - 1

  grams <- slide(tbl, ~.x, .after = window_size, .step = 1, .complete = TRUE)
  
  # because .complete returns NULL if a group is not complete
  # and I am matching the numbering scheme of the original function
  safe_mutate <- safely(mutate)
  
  out <- map2(grams, 1:length(grams), ~safe_mutate(.x, window_id = .y))
  
  out %>%
    transpose() %>% 
    pluck("result") %>% 
    compact() %>%
    # this discards any data frames with varying speakers
    discard(~length(unique(.x[[doc_var]])) > 1) %>% 
    bind_rows()
}

And it returns the same results:

test %>% 
  unnest_tokens(word, text) %>% 
  new_slide_windows("speaker", 2)

## # A tibble: 8 x 3
##   speaker word   window_id
##   <chr>   <chr>      <int>
## 1 banner  thanos         1
## 2 banner  is             1
## 3 banner  is             2
## 4 banner  coming         2
## 5 strange who            4
## 6 strange is             4
## 7 strange is             5
## 8 strange thanos         5

Rearranged slightly:

new_slide_windows <- function(tbl, doc_var, window_size) {
  
  window_size <- window_size - 1
  
  grams <- slide(tbl, ~.x, .after = window_size, .step = 1, .complete = TRUE) %>%
    compact()  %>% 
    discard(~length(unique(.x[[doc_var]])) > 1) 
  
  out <- map2(grams, 1:length(grams), ~mutate(.x, window_id = .y))
  
  out %>%
    bind_rows()
}

{tidytext} has gained a skip-gram tokenizer since the initial blog post was published, so none of this is strictly necessary, but this blog wasn’t going to start itself!


comments powered by Disqus