5 Distinctive collexeme analysis

Construction Grammar has long been interested in alternations, e.g. the dative alternation between the ditransitive (or double-object) construction (I gave her the book) and the to-dative construction (I gave the book to her). Such alternations are interesting for a variety of reasons, one of them being that Construction Grammarians tend to assume a “principle of no synonymy” (see Goldberg 1995, but see Uhrig 2015 for critical remarks). Comparing the slot fillers across constructions can help to reveal potentially quite subtle meaning differences between such near-synonymous constructions.

To stick with the example of “snowclones”, we could say that [the mother of all N] competes with the less “snowclony” [the ADJ-est N ever/of all time], even though an obvious difference between the two constructions is of course that the [mother of all N] construction, while evoking an implicit superlative, leaves the way in which something is superlative underspecified.

Despite those differences, let us compare the N slots in the [mother of all N] construction and the [ADJ-est N ever] construction using distinctive collexeme analysis. The basic logic of distinctive collexeme analysis is very similar to that of simple collexeme analysis, except that we don’t compare construction-internal and construction-external frequencies, but instead frequencies within two constructions. While this entails the obvious shortcoming that corpus frequencies are not taken into account, it can help get a clearer picture of the commonalities and differences of the two constructions.

The basic contingency table for distinctive collexeme analysis is shown in Table @ref(tab:dca_tbl).

(#tab:dca_tbl) Contingency table for distinctive collexeme analysis

| | Word lᵢ of Class L | Other Words of Class L | Total |
|---|---|---|---|
| Construction c₁ of Class C | Frequency of L(lᵢ) in C(c₁) | Frequency of L(¬lᵢ) in C(c₁) | Total frequency of C(c₁) |
| Construction c₂ of Class C | Frequency of L(lᵢ) in C(c₂) | Frequency of L(¬lᵢ) in C(c₂) | Total frequency of C(c₂) |
| Total | Total frequency of L(lᵢ) in C(c₁, c₂) | Total frequency of L(¬lᵢ) in C(c₁, c₂) | Total frequency of C(c₁, c₂) |

Again, this table is easier to understand if we translate it to our example, i.e. the comparison between [mother of all N] and [ADJ-est N ever] (where we ignore the ADJ slot and only focus on the N slot):

(#tab:dca_tbl2) Example contingency table for distinctive collexeme analysis

| | Word lᵢ of Class L | Other Words of Class L | Total |
|---|---|---|---|
| Construction c₁ of Class C | Frequency of "hangover" in the "mother of all" construction | Frequency of all other nouns in the "mother of all" construction | Total frequency of "mother of all" |
| Construction c₂ of Class C | Frequency of "hangover" in the "ADJ-est N ever" construction | Frequency of all other nouns in the "ADJ-est N ever" construction | Total frequency of "ADJ-est N ever" |
| Total | Total frequency of "hangover" in both constructions | Total frequency of all other nouns in the two constructions | Total frequency of both constructions |
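
To make the table more concrete, here is a minimal sketch in base R that fills the four inner cells for a single lemma and tests them for association. All counts are invented for illustration and are not the corpus frequencies. Fisher's exact test is used here as one association measure that is common in collostructional analysis; the sketch makes no assumption about which measure collex.dist applies by default.

```r
# Hypothetical counts, invented for illustration only
hangover_mother <- 30    # "hangover" in [mother of all N]
other_mother    <- 970   # all other nouns in [mother of all N]
hangover_ever   <- 5     # "hangover" in [ADJ-est N ever]
other_ever      <- 9995  # all other nouns in [ADJ-est N ever]

# 2 x 2 table: rows = constructions, columns = target lemma vs. other nouns
dca_toy <- matrix(
  c(hangover_mother, other_mother,
    hangover_ever,   other_ever),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Construction = c("mother of all N", "ADJ-est N ever"),
    Noun         = c("hangover", "other nouns")
  )
)

addmargins(dca_toy)      # adds the row and column totals of the table

# Fisher's exact test on the four inner cells
toy_test <- fisher.test(dca_toy)
toy_test$p.value         # small value: noun and construction are associated
toy_test$estimate        # odds ratio above 1: attraction to the first row
```

With these made-up numbers the odds ratio is well above 1, i.e. the toy lemma would count as distinctive for the construction in the first row.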

Let us briefly work through this example. We already have the data for [mother of all N], but we still need the data for [ADJ-est N ever]. To obtain them, I have queried ENCOW16A via NoSketchEngine using the search query "[tT]he" ".*est" [tag="N.*"] "ever". We import the results using the getNSE function from the package concordances, which is available via GitHub and which you can install using the following command:

if(!is.element("devtools", installed.packages())) {
  install.packages("devtools")
}

devtools::install_github("hartmast/concordances")

Now we load the package using

library(concordances)

And we read in the data:

ever <- getNSE("data/adj_est_n_ever/the_ADJest_N_ever.xml", xml = TRUE, tags = TRUE, context_tags = FALSE)
## Processing tags in the keyword column ...

Now we have to extract the lemmas in the noun slot. The original data have lemma tags, and getNSE has extracted them for us to a separate column (named “Tag2_Key”):

head(ever)
##                                                   Left
## 1          . ... to Failed States Widely acclaimed as 
## 2     to be found in Destiny of the Daleks as part of 
## 3 into the world . The 175 years of his life would be 
## 4   a huge favorite ? All he did this season was have 
## 5    Hitchcock and Bloch . The first Psycho is one of 
## 6   , you have n't seen anything yet ! Quite possibly 
##                          Key
## 1   the best television ever
## 2     the biggest rouse ever
## 3 the greatest blessing ever
## 4         the best year ever
## 5     the darkest works ever
## 6       the best sequel ever
##                                                 Right
## 1        , US crime saga The Wire finally arrives on 
## 2        . Davros , and the Doctor for that matter , 
## 3  bestowed upon " the families of the earth " . The 
## 4      for a quarterback in breaking NFL records for 
## 5           committed to celluloid , and much of its 
## 6           written , " Sony Talks About PSP " takes 
##                                                 Key_with_anno
## 1   the /the  best /good  television /television  ever /ever 
## 2           the /the  biggest /big  rouse /rouse  ever /ever 
## 3  the /the  greatest /great  blessing /blessing  ever /ever 
## 4               the /the  best /good  year /year  ever /ever 
## 5           the /the  darkest /dark  works /work  ever /ever 
## 6           the /the  best /good  sequel /sequel  ever /ever 
##                     Tag1_Key                 Tag2_Key
## 1   the best television ever the good television ever
## 2     the biggest rouse ever       the big rouse ever
## 3 the greatest blessing ever  the great blessing ever
## 4         the best year ever       the good year ever
## 5     the darkest works ever       the dark work ever
## 6       the best sequel ever     the good sequel ever

Thus, we just have to extract the third word in each row of the Tag2_Key column. Extracting the third word from a single character string is simple using the strsplit command:

unlist(strsplit("The best function ever", 
                split =  " "))[3] # whitespace as separator
## [1] "function"

We can apply this function over an entire vector, or in our case: a column of a dataframe, using sapply:

ever_n <- sapply(1:nrow(ever), function(i) unlist(strsplit(ever$Tag2_Key[i], " "))[3])
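
As an alternative to looping over row indices, the same extraction can be sketched by splitting the whole column at once, since strsplit accepts a character vector. The toy vector below merely stands in for ever$Tag2_Key (its values are copied from the head() output above), and the object names are hypothetical.

```r
# Toy stand-in for the Tag2_Key column
tag2_toy <- c("the good television ever",
              "the big rouse ever",
              "the great blessing ever")

# strsplit() is vectorised: it returns a list with one token vector per string
tokens <- strsplit(tag2_toy, " ")

# Take the third token of each element; vapply() enforces a character result
nouns_toy <- vapply(tokens, function(w) w[3], character(1))
nouns_toy  # "television", "rouse", "blessing"
```

If a string has fewer than three tokens, w[3] yields NA, so such rows would show up as missing values rather than raising an error.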

From this list of nouns, we can create a frequency table:

ever_n_tbl <- table(ever_n) %>% 
  as_tibble() %>% 
  setNames(c("Lemma", "Freq_ever")) %>% 
  arrange(desc(Freq_ever)) # arrange in descending order

head(ever_n_tbl)
## # A tibble: 6 × 2
##   Lemma      Freq_ever
##   <chr>          <int>
## 1 thing           3724
## 2 (unknown)       2362
## 3 hero            1719
## 4 holiday         1673
## 5 attendance      1466
## 6 game            1250

We can now merge this with our table tbl1 compiled in Section 4.

tbl1 <- left_join(tbl1, ever_n_tbl)
## Joining, by = "Lemma"
head(tbl1)
## # A tibble: 6 × 4
##   Lemma       Freq_mother   Freq Freq_ever
##   <chr>             <int>  <dbl>     <int>
## 1 ab                    1  11706        NA
## 2 abomination           4  16830         2
## 3 accelerator           1  25388         1
## 4 accent                1 101851        10
## 5 accident              2 451655         8
## 6 achievement           1 355966        14

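There are of course some lexemes that occur in only one of the two constructions, hence the NAs in the output. Note that left_join keeps only the rows of tbl1, so lemmas that appear in ever_n_tbl but not in tbl1 are not carried over by this join. We can replace the NAs with zeros using the replace_na function, and as this function takes a list as its argument, we can do so simultaneously for multiple columns: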

tbl1 <- tbl1 %>% replace_na(list(Freq_mother = 0, Freq_ever = 0))
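
The effect of the join type can be sketched with two invented toy frequency lists (all names and counts below are hypothetical). Whether a full join is preferable depends on the analysis; the sketch only illustrates which lemmas survive each join before the NAs are replaced.

```r
library(dplyr)
library(tidyr)

# Invented toy frequency lists, for illustration only
mother_toy <- tibble(Lemma = c("battle", "hangover"), Freq_mother = c(12L, 7L))
ever_toy   <- tibble(Lemma = c("hangover", "thing"),  Freq_ever   = c(3L, 40L))

# left_join(): keeps only the lemmas present in the first table
left_toy <- left_join(mother_toy, ever_toy, by = "Lemma") %>%
  replace_na(list(Freq_ever = 0L))

# full_join(): keeps the lemmas from either table
full_toy <- full_join(mother_toy, ever_toy, by = "Lemma") %>%
  replace_na(list(Freq_mother = 0L, Freq_ever = 0L))

left_toy$Lemma   # "battle" "hangover"
full_toy$Lemma   # "battle" "hangover" "thing"
```

In the toy data, "thing" is attested only in the second list, so it appears with Freq_mother = 0 after the full join and is absent after the left join.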

Now we have all we need for a distinctive collexeme analysis. In the collostructions package, the function collex.dist performs this analysis. According to the function's documentation (see ?collex.dist), we have two options to pass our data to the function: “EITHER as aggregated frequencies with the types in column A (WORD), and the frequencies of WORD in the first construction in column 2 and in the frequencies of WORD in the second construction in column 3, OR as raw data, i.e., one observation per line, where column 1 must contain the construction types and column 2 must contain the collexeme.”
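
The two input shapes described in the quotation can be sketched in base R as follows. The column names (WORD, CXN) and the counts are invented for illustration; how to tell collex.dist which of the two formats it is receiving should be checked in ?collex.dist.

```r
# Aggregated format: one row per lemma, one frequency column per construction
# (invented toy counts)
agg_toy <- data.frame(
  WORD        = c("battle", "thing"),
  Freq_mother = c(2, 0),
  Freq_ever   = c(1, 3)
)

# Raw format: one observation per row, construction first, collexeme second
raw_toy <- data.frame(
  CXN  = c(rep("mother", sum(agg_toy$Freq_mother)),
           rep("ever",   sum(agg_toy$Freq_ever))),
  WORD = c(rep(agg_toy$WORD, times = agg_toy$Freq_mother),
           rep(agg_toy$WORD, times = agg_toy$Freq_ever))
)

raw_toy

# Cross-tabulating the raw data recovers the aggregated counts
toy_xtab <- table(raw_toy$WORD, raw_toy$CXN)
toy_xtab
```

Both objects carry the same information; the aggregated form is simply more compact when a frequency list already exists.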

As we already have the frequency list, we go for the first option. In fact, we simply have to select the relevant columns from the tbl1 dataframe and can pass them to collex.dist.

tbl1 %>% select(Lemma, Freq_mother, Freq_ever) %>% collex.dist() %>% DT::datatable()
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used
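
The warning suggests that the function compares class(x) with a single string. A tibble carries three class labels, so that comparison has length three. The sketch below illustrates this with an invented toy tibble. It is an assumption, not verified against the package here, that converting the selected columns with as.data.frame() before calling collex.dist avoids the warning. Note also that from R 4.2.0 onwards an if() condition of length greater than one is an error rather than a warning.

```r
library(tibble)

# Invented toy frequency table, for illustration only
toy <- tibble(Lemma       = c("battle", "thing"),
              Freq_mother = c(5L, 0L),
              Freq_ever   = c(1L, 9L))

class(toy)                 # "tbl_df" "tbl" "data.frame": three elements
class(as.data.frame(toy))  # "data.frame": a single element

# The comparison named in the warning message, applied to both objects
length(class(toy) == "list")                 # 3
length(class(as.data.frame(toy)) == "list")  # 1
```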

Again, we see that certain words like battle, hangover, and crisis occur much more often in the [mother of all N] construction than in the [ADJ-est N ever] construction. We can reverse the list to take a look at the collexemes that tend to occur in the [ADJ-est N ever] construction instead:

tbl1 %>% select(Lemma, Freq_mother, Freq_ever) %>% collex.dist(reverse = TRUE) %>% DT::datatable()
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used

Comparing the distinctive collexemes of both constructions is quite instructive. Overall, it seems that [mother of all N] tends to combine more with abstract concepts and nouns denoting events, while [ADJ-est N ever] combines with terms that denote more concrete or individuated concepts such as persons, objects, and cultural products.