4 Simple collexeme analysis

Simple collexeme analysis investigates relationships between pairs of constructions, typically between a syntactic construction and a lexical item (Stefanowitsch 2013). It requires four values that are entered into a 2×2 contingency table: a) the frequency of a particular lexeme in a given construction; b) the frequency of all other lexemes in the same construction; c) the frequency of the lexeme in question outside of the given construction (i.e., in all other constructions); d) the frequency of all other lexemes outside of the construction. Table 4.1 summarizes the input that is needed for each of the four cells.

Tab. 4.1: Frequency information needed for simple collexeme analysis

|                                | Word lᵢ of Class L          | Other Words of Class L       | Total                    |
|--------------------------------|-----------------------------|------------------------------|--------------------------|
| Construction c of Class C      | Frequency of L(lᵢ) in C(c)  | Frequency of L(¬lᵢ) in C(c)  | Total frequency of C(c)  |
| Other Constructions of Class C | Frequency of L(lᵢ) in C(¬c) | Frequency of L(¬lᵢ) in C(¬c) | Total frequency of C(¬c) |
| Total                          | Total frequency of L(lᵢ)    | Total frequency of L(¬lᵢ)    | Total frequency of C     |
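
Before we turn to real data, the following minimal sketch (with invented frequencies) shows how the four cells translate into a 2×2 matrix in R, to which an association measure such as the chi-squared test can then be applied:

# minimal sketch with invented frequencies: the four cells of
# Table 4.1 arranged as a 2x2 contingency matrix
ctab <- matrix(c(50,      # lexeme l in construction c          (cell a)
                 950,     # other lexemes in construction c     (cell b)
                 2000,    # lexeme l in other constructions     (cell c)
                 997000), # other lexemes, other constructions  (cell d)
               nrow = 2)
chisq.test(ctab) # one possible association measure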

Admittedly, Table 4.1 is a bit hard to read, so it is probably helpful to work with examples. In the following, I will draw on so-called snowclones, i.e. extravagant formulaic patterns with an open slot and usually with a lexically fixed source phrase (see e.g. Bergs 2019, Ungerer & Hartmann 2020), to illustrate collostructional analysis. Consider, for instance, [the mother of all X], which is often mentioned as a paradigm example of a snowclone, as there is a relatively clear source (Saddam Hussein’s “the mother of all battles”), and it is used productively with a broad range of variation in the PP-internal noun slot.

To get some hands-on experience, let us read in a dataset from the webcorpus ENCOW16A (Schäfer & Bildhauer 2012). These data have been compiled and annotated by Tobias Ungerer and myself in the context of a joint project on snowclones.

# load the tidyverse (for read_csv, filter, the pipe, etc.)
library(tidyverse)

# read data
moa <- read_csv("data/mother_of_all_x/mother_of_all_ENCOW.csv")

# exclude false hits (only rows annotated with keep == "y")
moa <- filter(moa, keep == "y")

# print summary table
tibble(
  `number of tokens` = nrow(moa),
  `number of types` = length(unique(moa$lemma)),
  `Frequency of hangover` = length(which(moa$lemma=="hangover"))
) %>% kableExtra::kable()
| number of tokens | number of types | Frequency of hangover |
|------------------|-----------------|-----------------------|
| 4127             | 1669            | 62                    |

(Note: If you see kableExtra::kable() or DT::datatable() anywhere in this tutorial, as in the function above, you can ignore it - those functions are just used to create the tables for the HTML file you’re currently reading.)

A simple collexeme analysis can be used to check which lexemes occur with above-chance frequency in this slot, and which lexemes are rarer than we would expect, given a chance distribution. Take, for example, the lexical item hangover. To compute the collostruction strength of hangover, we have to fill the four cells as shown in Table 4.2.

Tab. 4.2: Example contingency table for one slot filler

|                                | Word lᵢ of Class L          | Other Words of Class L         | Total                    |
|--------------------------------|-----------------------------|--------------------------------|--------------------------|
| Construction c of Class C      | mother of all hangovers     | mother of all [¬hangover]      | Total frequency of C(c)  |
| Other Constructions of Class C | [¬mother of all] + hangover | [¬mother of all] + [¬hangover] | Total frequency of C(¬c) |
| Total                          | Total frequency of L(lᵢ)    | Total frequency of L(¬lᵢ)      | Total frequency of C     |

The phrase the mother of all hangovers occurs 62 times in the corpus. Altogether, [mother of all X] is attested 4127 times in the corpus. The term hangover occurs a bit more than 18,000 times in the corpus, as a look at the ENCOW lemma list shows (available here). This list can also be used to obtain the total frequency of all nouns in the corpus, a value that we will need later on. I’ve commented out the code below, but you can uncomment it and try it out with the frequency list downloaded from webcorpora.org.

# For this tutorial, I've compiled a lemma list
# containing only the lemmas attested in the "mother of all"
# construction using this commented-out code:

# library(vroom) # for fast processing of large files
# encow <- vroom("/Users/stefanhartmann/sciebo/Tutorials/collostructions_tutorial/data/encow16ax.lp.tsv", col_names = c("Lemma", "POS", "Freq"))

## only nouns:
## filter using the Penn POS tags used in ENCOW
# encow[encow$POS %in% c("NN", 
#                        "NNS", 
#                        "NNP", 
#                        "NNPS"),]$Freq %>% sum # 1805183579


# encow <- filter(encow, Lemma %in% moa$lemma)
# write_excel_csv(encow, "data/moa_encow_freqs.csv")

encow <- read_csv("data/moa_encow_freqs.csv")

# some lemmas occur twice with different POS tags,
# so we first keep only the noun tags and then sum the
# frequencies of lemmas attested with several noun tags.
# note: the pattern is anchored with "^N" so that it matches
# only the noun tags NN, NNS, NNP, NNPS (an unanchored "N.*"
# would also match e.g. IN)
encow <- encow[grep("^N", encow$POS),]

encow <- encow %>% group_by(Lemma) %>% summarise(Freq = sum(Freq))

encow[which(encow$Lemma=="hangover"),]
## # A tibble: 1 × 2
##   Lemma     Freq
##   <chr>    <dbl>
## 1 hangover 18482

Finally, we can compute the value for the fourth cell by taking the sum of all nouns in the corpus (obtained from the lemma frequency list) and then subtracting the values that fill the other three cells. The total number of nouns in the corpus is 1805183579 (this is the number we get by summing up the frequencies of all lemmas tagged as nouns in the full lemma frequency list linked above).

It is, however, not always easy to decide which value should enter the fourth cell (collostructional analysis has sometimes been criticized for this, see e.g. Schmid & Küchenhoff 2013, Gries 2013). For example, regarding the [mother of all N] construction, one could discuss whether the value of the fourth cell should be the frequency of all other nouns in the corpus, including proper names, or whether proper names should be excluded (especially if they are not attested in the construction, which could be seen as evidence that the construction does not readily combine with proper names).

Anyway, we will work with the full set of nouns here, without excluding proper names, as examples like the mother of all Karens seem perfectly possible. We can thus fill our cells as follows:

# the mother of all hangovers:
a <- moa %>% filter(lemma == "hangover") %>% nrow()

# the mother of all ¬hangovers:
b <- moa %>% filter(lemma != "hangover") %>% nrow()

# ¬(the mother of all) hangovers:
c <- encow[which(encow$Lemma=="hangover"),]$Freq - a

# ¬(the mother of all) ¬hangovers:
# the sum of 1805183579 has been calculated from the full
# ENCOW lemma frequency list
d <- 1805183579 - a - b - c


# table:
tibble(
  a = a,
  b = b,
  c = c,
  d = d
) %>% kableExtra::kable()
| a  | b    | c     | d          |
|----|------|-------|------------|
| 62 | 4065 | 18420 | 1805161032 |
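
Before we run a test, a brief aside on the discussion above: the choice of the value for the fourth cell directly affects the association measure. The following sketch simply recomputes the chi-squared statistic under two corpus sizes to illustrate the point (the second corpus size is purely invented for illustration):

# illustrate how the corpus-size assumption affects the
# association measure (the second corpus size is invented)
for (corpsize in c(1805183579, 900000000)) {
  d_alt <- corpsize - a - b - c
  print(chisq.test(matrix(c(a, b, c, d_alt), nrow = 2))$statistic)
}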

Using those values, we can now compute an association measure. For the sake of simplicity, we will use the chi-squared test here; Stefanowitsch & Gries (2003) use the Fisher-Yates exact test instead (we will come back to that below).

matrix(c(a,b,c,d), nrow = 2) %>% chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 89392, df = 1, p-value < 2.2e-16
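
For comparison, here is a sketch of how the Fisher-Yates exact test used by Stefanowitsch & Gries (2003) can be computed for the same table. With cell values as extreme as ours, the p-value of fisher.test() can underflow to 0, so the second variant computes the one-tailed p-value directly from the hypergeometric distribution on the log scale; dividing by log(10) yields the negative decadic logarithm of the p-value, which is how collostruction strength is conventionally reported:

# Fisher-Yates exact test on the same table (the p-value may
# underflow to 0 for very strong associations)
fisher.test(matrix(c(a, b, c, d), nrow = 2))$p.value

# one-tailed p-value via the hypergeometric distribution,
# computed on the log scale to avoid underflow
logp <- phyper(a - 1, a + c, b + d, a + b,
               lower.tail = FALSE, log.p = TRUE)
-logp / log(10) # collostruction strength as -log10(p)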

Thus, according to the chi-squared test, the lexeme hangover occurs significantly more often in the [mother of all N] construction than would be expected given a chance distribution. This also becomes clear if we compare the observed distribution to the expected one; R’s function for the chi-squared test, chisq.test(), automatically computes the expected frequencies. In the table below, the expected values are given in parentheses:

# expected:
exp <- matrix(c(a,b,c,d), nrow = 2) %>% chisq.test()
## Warning in chisq.test(.): Chi-squared approximation may be incorrect
exp <- exp$expected

# observed:
obs <-  matrix(c(a,b,c,d), nrow = 2)

tibble(
  x = c("mother", "not mother"),
  hangover = c(
    paste0(obs[1], " (", round(exp[1], digits = 3), ")"),
    paste0(obs[3], " (", round(exp[3], digits = 3), ")")
  ),
  not_hangover = c(
    paste0(obs[2], " (", round(exp[2], digits = 3), ")"),
    paste0(obs[4], " (", round(exp[4], digits = 3), ")")
  )
) %>% kableExtra::kable()
| x          | hangover          | not_hangover                |
|------------|-------------------|-----------------------------|
| mother     | 62 (0.042)        | 4065 (4126.958)             |
| not mother | 18420 (18481.958) | 1805161032 (1805160970.042) |
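
In case you are wondering where these expected values come from: the expected frequency of a cell is its row total multiplied by its column total, divided by the grand total. We can verify this for the hangover cell:

# expected frequency of the "mother of all hangovers" cell:
# row total * column total / grand total
N <- a + b + c + d
(a + b) * (a + c) / N # ~0.042, as in the table above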

Of course we want to compute the collostruction strength not only for this one lexeme, but for all that occur in our dataset. We could write a function that automates what we’ve done so far; but luckily, Susanne Flach has already done that for us and developed an entire R package dedicated to collostructional analysis. It is not (yet) available on CRAN, but you can download it from her website. After installing it following the instructions given there, you should be able to load it:

library(collostructions)

The input that the package expects is simple - just check the documentation of the individual functions, for simple collexeme analysis ?collex, where it says that the first argument to be provided to the function should be “[a] data frame with types in a construction in column 1 (WORD), construction frequencies in column 2 (CXN.FREQ) and corpus frequencies in column 3 (CORP.FREQ).” Thus, we have to create a joint dataframe first - currently, the construction frequencies for the first two cells are in the dataframe named moa, while the corpus frequencies for the two remaining cells are in the dataframe encow. We first use the table function to get a dataframe containing the frequency of each lexeme in the [mother of all N] construction…

# frequency within "mother of all"
tbl1 <- moa %>% 
  select(lemma) %>% # select the "lemma" column
  table %>% # tabulate
  as_tibble() %>%  # convert to dataframe
  setNames(c("Lemma", "Freq_mother")) # set the column names

… and then join this table with the list of corpus frequencies, encow:

tbl1 <- left_join(tbl1, encow)
## Joining, by = "Lemma"

Note that in this case, we don’t have to explicitly specify by which columns we want to join the data because there’s a column called “Lemma” in both dataframes, so the left_join function automatically uses the “Lemma” column as the variable to join by. If the columns had different names, say “Lemma” and “lemma”, we would have had to specify the columns explicitly, e.g. left_join(tbl1, encow, by = c("lemma" = "Lemma")).

Note that tbl1 conforms to the input required by the collex function, as cited above: The word in the first column, the construction frequency of each lexical item in the second, and its corpus frequency in the third. With the numbers in the second and third column of our new dataframe tbl1, we can fill the first two cells of our contingency table. Now you may wonder what happened to the third and fourth cells. Well, both can be easily computed as soon as we have the total number of words belonging to a certain category (in our case, nouns) in the corpus. We have seen above that ENCOW16A contains 1805183579 nouns. We can provide this value to collex as the corpsize argument. Thus, we have everything we need to do the simple collexeme analysis. However, the following line of code throws an error:

collex(tbl1, corpsize = 1805183579)
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used
##                   Lemma Freq_mother Freq
## 900           megamover           2   NA
## 96             beamprop           1   NA
## 164            branding           1   NA
## 228         categorical           1   NA
## 274        clear-cutter           1   NA
## 312       complementary           1   NA
## 386             cryptic           1   NA
## 502               dying           1   NA
## 504              earach           1   NA
## 646        gap-and-crap           1   NA
## 816             kicking           1   NA
## 873                 mal           1   NA
## 994             nugshot           1   NA
## 1009        omnishamble           1   NA
## 1042 parliament-squarer           1   NA
## 1166              rapid           1   NA
## 1170           rattling           1   NA
## 1247         rollicking           1   NA
## 1310        shellacking           1   NA
## 1477            tamtram           1   NA
## 1506        thighburner           1   NA
## 1574         unclogging           1   NA
## 1575      uncontainable           1   NA
## 1578       unpleaseable           1   NA
## 1632           whooping           1   NA
## 1646             wobbly           1   NA
## Error in collex(tbl1, corpsize = 1805183579): 
## Your input contains the above lines with incomplete cases.

Why? The answer is simple, but maybe not obvious. In the moa dataframe, we are dealing with manually annotated data: Tobias and I corrected the lemmatization of the individual lexemes because we wanted a reliable picture of their frequency in the construction. For the entire corpus, this is of course not possible. This is why some lexemes are attested in our [mother of all N] dataset, but not in the corpus frequency list (there, they are lemmatized with an empty string “” or with a wrong lemma). As the error message above shows, they have a frequency value of NA. We can replace NA with 0 in such cases using the helpful replace_na function:

tbl1 <- tbl1 %>% replace_na(list(Freq = 0))
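
As a quick sanity check, we can confirm that no missing frequencies remain:

# sanity check: there should be no NA values left
# in the corpus frequency column
sum(is.na(tbl1$Freq)) # should be 0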

Will this help?

collex(tbl1, corpsize = 1805183579)
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used
##                   Lemma Freq_mother Freq
## 900           megamover           2    0
## 96             beamprop           1    0
## 164            branding           1    0
## 228         categorical           1    0
## 274        clear-cutter           1    0
## 312       complementary           1    0
## 386             cryptic           1    0
## 502               dying           1    0
## 504              earach           1    0
## 646        gap-and-crap           1    0
## 816             kicking           1    0
## 873                 mal           1    0
## 994             nugshot           1    0
## 1009        omnishamble           1    0
## 1042 parliament-squarer           1    0
## 1166              rapid           1    0
## 1170           rattling           1    0
## 1247         rollicking           1    0
## 1310        shellacking           1    0
## 1477            tamtram           1    0
## 1506        thighburner           1    0
## 1574         unclogging           1    0
## 1575      uncontainable           1    0
## 1578       unpleaseable           1    0
## 1632           whooping           1    0
## 1646             wobbly           1    0
## Error in collex(tbl1, corpsize = 1805183579): 
## For the above item(s), the frequency in the construction is larger than the frequency in the corpus.

Not really, as the function complains about the same rows again - rightly so, because it doesn’t make much sense to look at items that are, at least according to the lemmatization of the corpus, not attested in the corpus at all. It would be like counting the number of apples drawn from a bag of pears. There are several ways to solve this problem. If we wanted very precise results, we could manually query the corpus for the tokens that are only attested in the [mother of all N] database, ideally taking spelling variants into account, thus obtaining the corpus frequency for each of the affected lexical items individually. Given the low frequency of the affected lemmas in the construction, however, this seems rather unnecessary. Personally, I tend to exclude those items (and make this transparent when I’m reporting the results in a paper).

# keep only items whose construction frequency does not
# exceed their corpus frequency
tbl1 <- tbl1[which(tbl1$Freq_mother <= tbl1$Freq),]
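
If you want to be extra careful, you can additionally assert that no offending rows remain:

# defensive check: all construction frequencies should now be
# less than or equal to the corresponding corpus frequencies
stopifnot(all(tbl1$Freq_mother <= tbl1$Freq))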

Now we have a table that should work as input for collex:

tbl1 %>% collex() %>% DT::datatable()
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used

Et voilà, we have a collexeme table we can work with - for example, by interpreting the results: Among other things, it is interesting to see that the items most strongly attracted to the construction tend to have a negative semantic prosody, with many of them being rather colloquial, which might contribute to the “extravagant” nature of the construction.
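
If you prefer to work with the results programmatically rather than in the interactive table, you can store the output of collex() in a regular dataframe. A minimal sketch, assuming the default settings of collex(), which sorts the output by collostruction strength (the exact column names may differ between package versions, so inspect them with str() first):

# store the result as a regular dataframe
result <- collex(tbl1, corpsize = 1805183579)

# inspect the structure (column names may vary across
# versions of the collostructions package)
str(result)

# since the output is sorted by collostruction strength,
# the head shows the most strongly attracted collexemes
head(result, 10)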