5 Distinctive collexeme analysis
Construction Grammar has long been interested in alternations, e.g. the dative alternation between the ditransitive (or double-object) construction (I gave her the book) and the to-dative construction (I gave the book to her). Such alternations are interesting for a variety of reasons, one of them being that Construction Grammarians tend to assume a “Principle of No Synonymy” (see Goldberg 1995, but see Uhrig 2015 for critical remarks): constructions that differ in form are expected to differ in meaning or function as well. Comparing the slot fillers across constructions can help to detect potentially quite subtle meaning differences between such near-synonymous constructions.
To stick with the example of “snowclones”, we could say that [the mother of all N] competes with the less “snowclony” [the ADJ-est N ever/of all times], even though an obvious difference between the two constructions is that [the mother of all N], while evoking an implicit superlative, leaves underspecified in which respect something is superlative.
Despite those differences, let us compare the N slots in the [mother of all N] construction and the [ADJ-est N ever] construction using distinctive collexeme analysis. The basic logic of distinctive collexeme analysis is very similar to that of simple collexeme analysis, except that we don’t compare construction-internal and construction-external frequencies, but instead frequencies within two constructions. While this entails the obvious shortcoming that corpus frequencies are not taken into account, it can help get a clearer picture of the commonalities and differences of the two constructions.
The basic contingency table for distinctive collexeme analysis is shown in Table @ref(tab:dca_tbl).
| | Word lᵢ of Class L | Other Words of Class L | Total |
|---|---|---|---|
| Construction c₁ of Class C | Frequency of L(lᵢ) in C(c₁) | Frequency of L(¬lᵢ) in C(c₁) | Total frequency of C(c₁) |
| Construction c₂ of Class C | Frequency of L(lᵢ) in C(c₂) | Frequency of L(¬lᵢ) in C(c₂) | Total frequency of C(c₂) |
| Total | Total frequency of L(lᵢ) in C(c₁, c₂) | Total frequency of L(¬lᵢ) in C(c₁, c₂) | Total frequency of C(c₁, c₂) |
Again, this table is easier to understand if we translate it to our example, i.e. the comparison between [mother of all N] and [ADJ-est N ever] (where we ignore the ADJ slot and only focus on the N slot):
| | Word lᵢ of Class L | Other Words of Class L | Total |
|---|---|---|---|
| Construction c₁ of Class C | Frequency of "hangover" in the "mother of all" construction | Frequency of all other nouns in the "mother of all" construction | Total frequency of "mother of all" |
| Construction c₂ of Class C | Frequency of "hangover" in the "ADJ-est N ever" construction | Frequency of all other nouns in the "ADJ-est N ever" construction | Total frequency of "ADJ-est N ever" |
| Total | Total frequency of "hangover" in both constructions | Total frequency of all other nouns in the two constructions | Total frequency of both constructions |
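To make the logic of the table concrete, here is a minimal sketch in base R. The counts are invented purely for illustration (they are not the ENCOW16A frequencies), and Fisher's exact test is used as one common association measure; it is an assumption here, not necessarily the measure applied later in this section.

```r
# Hypothetical counts, invented for illustration only (not corpus data)
hangover_tbl <- matrix(
  c(30,  970,    # "mother of all": hangover, all other nouns
    10, 9990),   # "ADJ-est N ever": hangover, all other nouns
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Construction = c("mother_of_all", "ADJest_N_ever"),
    Noun = c("hangover", "other")
  )
)

# Add the row and column totals shown in the table above
addmargins(hangover_tbl)

# Expected frequency of "hangover" in [mother of all N] if noun and
# construction were independent: row total * column total / grand total
expected <- sum(hangover_tbl[1, ]) * sum(hangover_tbl[, 1]) / sum(hangover_tbl)
expected

# Fisher's exact test on the 2x2 table
ft <- fisher.test(hangover_tbl)
ft$p.value
```

With these made-up numbers, the observed frequency of "hangover" in [mother of all N] is far above the expected one, which is what it means for a noun to be distinctive for that construction.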
Let us briefly work through this example. We already have the data for [mother of all N], but we still need the data for [ADJ-est N ever]. To obtain them, I have queried ENCOW16A via NoSketchEngine using the search query `"[tT]he" ".*est" [tag="N.*"] "ever"`. We import the results using the `getNSE` function from the package `concordances`, which is available via GitHub and which you can install using the following command:
# install devtools first if it is not available yet
if(!is.element("devtools", installed.packages())) {
  install.packages("devtools")
}
devtools::install_github("hartmast/concordances")
Now we load the package using
library(concordances)
And we read in the data:
ever <- getNSE("data/adj_est_n_ever/the_ADJest_N_ever.xml", xml = TRUE, tags = TRUE, context_tags = FALSE)
## Processing tags in the keyword column ...
Now we have to extract the lemmas in the noun slot. The original data have lemma tags, and `getNSE` has extracted them for us to a separate column (named “Tag2_Key”):
head(ever)
## Left
## 1 . ... to Failed States Widely acclaimed as
## 2 to be found in Destiny of the Daleks as part of
## 3 into the world . The 175 years of his life would be
## 4 a huge favorite ? All he did this season was have
## 5 Hitchcock and Bloch . The first Psycho is one of
## 6 , you have n't seen anything yet ! Quite possibly
## Key
## 1 the best television ever
## 2 the biggest rouse ever
## 3 the greatest blessing ever
## 4 the best year ever
## 5 the darkest works ever
## 6 the best sequel ever
## Right
## 1 , US crime saga The Wire finally arrives on
## 2 . Davros , and the Doctor for that matter ,
## 3 bestowed upon " the families of the earth " . The
## 4 for a quarterback in breaking NFL records for
## 5 committed to celluloid , and much of its
## 6 written , " Sony Talks About PSP " takes
## Key_with_anno
## 1 the /the best /good television /television ever /ever
## 2 the /the biggest /big rouse /rouse ever /ever
## 3 the /the greatest /great blessing /blessing ever /ever
## 4 the /the best /good year /year ever /ever
## 5 the /the darkest /dark works /work ever /ever
## 6 the /the best /good sequel /sequel ever /ever
## Tag1_Key Tag2_Key
## 1 the best television ever the good television ever
## 2 the biggest rouse ever the big rouse ever
## 3 the greatest blessing ever the great blessing ever
## 4 the best year ever the good year ever
## 5 the darkest works ever the dark work ever
## 6 the best sequel ever the good sequel ever
Thus, we just have to extract the third word in each row of the `Tag2_Key` column. Extracting the third word from a single character string is simple using the `strsplit` command:
unlist(strsplit("The best function ever",
split = " "))[3] # whitespace as separator
## [1] "function"
We can apply this operation to an entire vector, or in our case to a column of a dataframe, using `sapply`:
ever_n <- sapply(1:nrow(ever), function(i) unlist(strsplit(ever$Tag2_Key[i], " "))[3])
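As an alternative sketch, `strsplit` is itself vectorised, so the row index is not strictly needed. The example below uses a toy vector (`tag2_key`, a hypothetical stand-in for `ever$Tag2_Key`, with values copied from the `head()` output above) and `vapply`, which additionally checks that each result is a single character string.

```r
# Toy stand-in for the Tag2_Key column (hypothetical object name)
tag2_key <- c("the good television ever",
              "the big rouse ever",
              "the dark work ever")

# strsplit() returns one character vector per input string;
# vapply() picks the third element of each and enforces a character result
third_word <- vapply(strsplit(tag2_key, " ", fixed = TRUE),
                     function(w) w[3], character(1))
third_word
```

Strings with fewer than three words yield `NA` rather than an error, so it may be worth checking the result for missing values before tabulating.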
From this list of nouns, we can create a frequency table:
ever_n_tbl <- table(ever_n) %>%
  as_tibble() %>%
  setNames(c("Lemma", "Freq_ever")) %>%
  arrange(desc(Freq_ever)) # arrange in descending order
head(ever_n_tbl)
## # A tibble: 6 × 2
## Lemma Freq_ever
## <chr> <int>
## 1 thing 3724
## 2 (unknown) 2362
## 3 hero 1719
## 4 holiday 1673
## 5 attendance 1466
## 6 game 1250
We can now merge this with our table `tbl1` compiled in Section 4.
tbl1 <- left_join(tbl1, ever_n_tbl)
## Joining, by = "Lemma"
head(tbl1)
## # A tibble: 6 × 4
## Lemma Freq_mother Freq Freq_ever
## <chr> <int> <dbl> <int>
## 1 ab 1 11706 NA
## 2 abomination 4 16830 2
## 3 accelerator 1 25388 1
## 4 accent 1 101851 10
## 5 accident 2 451655 8
## 6 achievement 1 355966 14
There are of course some lexemes that occur only in one construction and not in the other, hence the NAs in the output. We can replace them with zeros using the `replace_na` function, and as this function takes a list as its argument, we can do so simultaneously for multiple columns:
tbl1 <- tbl1 %>% replace_na(list(Freq_mother = 0, Freq_ever = 0))
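The join and the NA replacement can be sketched on toy data. The example below uses base R `merge` instead of `left_join` so that it runs without additional packages; the two frequency lists are invented for illustration. It also shows one consequence of a left join: lemmas attested only in the second table are dropped, whereas a full join keeps them. Whether such lemmas should enter the analysis is a judgment call that the sketch does not settle.

```r
# Toy frequency lists, invented for illustration (hypothetical object names)
mother_freq <- data.frame(Lemma = c("battle", "hangover", "ab"),
                          Freq_mother = c(50, 30, 1))
ever_freq   <- data.frame(Lemma = c("battle", "hangover", "thing"),
                          Freq_ever = c(5, 10, 3724))

# all.x = TRUE mirrors left_join(): only lemmas from the first table survive
left <- merge(mother_freq, ever_freq, by = "Lemma", all.x = TRUE)
left

# all = TRUE mirrors full_join(): lemmas from either table are kept
full <- merge(mother_freq, ever_freq, by = "Lemma", all = TRUE)

# Replace missing frequencies with 0, analogous to replace_na() above
full[is.na(full)] <- 0
full
```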
Now we have all we need for a distinctive collexeme analysis. In the `collostructions` package, we can use `collex.dist` to perform it. According to the package documentation (see `?collex.dist`), we have two options to pass our data to the function: “EITHER as aggregated frequencies with the types in column A (WORD), and the frequencies of WORD in the first construction in column 2 and in the frequencies of WORD in the second construction in column 3, OR as raw data, i.e., one observation per line, where column 1 must contain the construction types and column 2 must contain the collexeme.”
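To illustrate how the two input formats relate to each other, the following base R sketch converts a small raw data set (one observation per line) into aggregated frequencies. The data and the object names (`raw`, `counts`, `agg`) are invented for illustration.

```r
# Hypothetical raw data: column 1 = construction, column 2 = collexeme
raw <- data.frame(
  Cxn   = c("mother", "mother", "mother", "ever", "ever", "ever"),
  Lemma = c("battle", "battle", "hangover", "thing", "thing", "battle")
)

# Cross-tabulate lemmas by construction
counts <- table(raw$Lemma, raw$Cxn)

# Aggregated format: word, frequency in construction 1, frequency in construction 2
agg <- data.frame(
  Lemma       = rownames(counts),
  Freq_mother = as.vector(counts[, "mother"]),
  Freq_ever   = as.vector(counts[, "ever"])
)
agg
```

Lemmas that are absent from one construction automatically receive a frequency of 0 in the cross-tabulation, so no separate NA handling is needed in this direction.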
As we already have the frequency list, we go for the first option. In fact, we simply have to select the relevant columns from the `tbl1` dataframe and can pass them to `collex.dist`.
tbl1 %>% select(Lemma, Freq_mother, Freq_ever) %>% collex.dist() %>% DT::datatable()
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used
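The warning has a simple cause: a tibble carries three class names, so the comparison `class(x) == "list"` yields three logical values instead of one. From R 4.2.0 onwards, an `if` condition of length greater than one is an error rather than a warning, so the same call may stop altogether on a current installation, depending on the installed version of `collostructions` (this is an assumption; the source does not say which versions are affected). A minimal sketch of the mechanism, using a plain data frame that is given tibble-style class names so that no extra package is needed:

```r
# A data frame with a tibble-like class vector (class names only)
x <- data.frame(Lemma = "battle", Freq_mother = 50, Freq_ever = 5)
class(x) <- c("tbl_df", "tbl", "data.frame")

# One logical value per class name, hence "the condition has length > 1"
cmp <- class(x) == "list"
length(cmp)

# as.data.frame() strips the extra class names and leaves a single class
y <- as.data.frame(x)
class(y)
```

If the call fails, inserting `as.data.frame()` into the pipe before `collex.dist()` is one possible workaround.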
Again, we see that certain words like battle, hangover, crisis occur much more often in the [mother of all N] construction than in the [ADJ-est N ever] construction. We can reverse the list to take a look at the collexemes that rather tend to occur in the [ADJ-est N ever] construction:
tbl1 %>% select(Lemma, Freq_mother, Freq_ever) %>% collex.dist(reverse = TRUE) %>% DT::datatable()
## Warning in if (class(x) == "list") {: the condition has length > 1 and
## only the first element will be used
Comparing the distinctive collexemes of both constructions is quite instructive: overall, it seems as if [mother of all N] tends to combine more with abstract concepts and nouns denoting events, while [ADJ-est N ever] combines with terms that denote more concrete/individuated concepts like persons, objects, cultural products etc.