r - Extract and count common word-pairs from a character vector
How can I find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, common pairs would be "crude oil", "oil market", and "million barrels".

The small code example below tries to identify frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed by another frequent term. The attempt crashed and burned.

Any guidance would be appreciated on how to create a data frame that shows the common pairs in the first column ("Pairs") and the number of times they appeared in the text in the second column ("Count").
library(qdap)
library(tm)
library(stringr)

# From the crude data set, create a text object from the first 3 documents, then clean it
text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
text <- tolower(text)
text <- tm::removeNumbers(text)
text <- str_replace_all(text, "  ", " ")  # replace double spaces with a single space
text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
text <- removeWords(text, stopwords(kind = "smart"))

# Pick the top 10 individual words by frequency, since they form the common pairs
freq.terms <- head(freq_terms(text.var = text), 10)

# Create a pattern from the top words for the regex expression below
freq.terms.pat <- str_c(freq.terms$WORD, collapse = "|")

# Match frequent terms that are followed by another frequent term
pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")
Here my effort falters.
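If I read my mistake correctly, the immediate problem is that the pattern is passed as a literal string, so the regex engine searches for the text "freq.terms.pat" rather than the frequent terms themselves. A minimal sketch of interpolating the variable into the pattern (the grouping parentheses are my assumption, so the alternation binds correctly; note the lookahead is not consumed, so this still only extracts the first word of each pair):

pat <- paste0("(", freq.terms.pat, ")(?= (", freq.terms.pat, "))")
pairs <- str_extract_all(string = text, pattern = pat)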
Not knowing Java or Python, these did not help me, but they may be useful references for others: Java count word pairs, Python count word pairs.
Thank you.
One idea here is to create a new corpus of bigrams:
A bigram or digram is every sequence of two adjacent elements in a string of tokens.
A recursive function to extract the bigrams:
bigram <- function(xs){
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
}
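A quick usage sketch (the example tokens are made up):

bigram(c("crude", "oil", "market"))
# [1] "crude_oil"  "oil_market"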
Then apply it to the crude data from the tm package. (I did only light text cleaning here; the needed steps depend on the text.)
res <- unlist(lapply(crude, function(x){
  x <- tm::removeNumbers(tolower(x))
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)        # collapse multiple spaces into one
  ## after cleaning, compute the frequency using table
  freqs <- table(bigram(strsplit(x, " ")[[1]]))
  freqs[freqs > 1]
}))

as.data.frame(tail(sort(res), 5))
                          tail(sort(res), 5)
reut-00022.xml.hold_a                      3
reut-00022.xml.in_the                      3
reut-00011.xml.of_the                      4
reut-00022.xml.a_futures                   4
reut-00010.xml.abdul_aziz                  5
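As an aside, the recursion goes one call deeper per token, so on very long token vectors you may prefer a vectorized equivalent. A sketch (bigram2 is my name for it, not part of the code above):

bigram2 <- function(xs){
  if (length(xs) < 2) return(character(0))
  paste(head(xs, -1), tail(xs, -1), sep = '_')
}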
The bigrams "abdul_aziz" and "a_futures" are the most common here. You should re-clean the data to remove stop words (of, the, ...), but this should be a start.
EDIT after OP's comments:
In case you want the bigram frequency over the whole corpus, one idea is to compute the bigrams in the loop and then compute the frequency on the loop result. I take the opportunity to add some better text cleaning:
library(data.table)

res <- unlist(lapply(crude, function(x){
  x <- removeNumbers(tolower(x))
  x <- removeWords(x, words = c("the", "of"))
  x <- removePunctuation(x)
  x <- gsub('\n|[[:punct:]]', ' ', x)
  x <- gsub(' +', ' ', x)
  ## after cleaning, compute the bigrams (keeping only words longer than 2 characters)
  words <- strsplit(x, " ")[[1]]
  bigrams <- bigram(words[nchar(words) > 2])
}))

xx <- as.data.frame(table(res))
setDT(xx)[order(Freq)]
#                  res Freq
#    1:  abdulaziz_bin    1
#    2:   ability_hold    1
#    3:   ability_keep    1
#    4:   ability_sell    1
#    5:     able_hedge    1
#   ---
# 2177:     last_month    6
# 2178:      crude_oil    7
# 2179:   oil_minister    7
# 2180:      world_oil    7
# 2181:     oil_prices   14
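To get exactly the two-column data frame asked for in the question, one last step could rename and reorder the columns. A sketch (the Pairs/Count names follow the question; pairs.dt is a hypothetical name):

pairs.dt <- setDT(as.data.frame(table(res)))[order(-Freq)]
setnames(pairs.dt, c("res", "Freq"), c("Pairs", "Count"))
head(pairs.dt)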