r - Extract and count common word-pairs from character vector


How can I find frequent pairs of adjacent words in a character vector? Using the crude data set, for example, common pairs would be "crude oil", "oil market", and "million barrels".

The small example below tries to identify the frequent terms and then, using a positive lookahead assertion, count how many times those frequent terms are followed by another frequent term. The attempt crashed and burned.

Any guidance would be appreciated on how to create a data frame that shows the common pairs in the first column ("pairs") and the number of times they appeared in the text in the second column ("count").

    library(qdap)
    library(tm)
    library(stringr)

    # from the crude data set, create a text object from the first 3 documents, then clean it
    text <- c(crude[[1]][1], crude[[2]][1], crude[[3]][1])
    text <- tolower(text)
    text <- tm::removeNumbers(text)
    text <- str_replace_all(text, "  ", " ")  # replace double spaces with a single space
    text <- str_replace_all(text, pattern = "[[:punct:]]", " ")
    text <- removeWords(text, stopwords(kind = "SMART"))

    # pick the top 10 individual words by frequency, since they form the common pairs
    freq.terms <- head(freq_terms(text.var = text), 10)

    # build a pattern from the top words for the regex below
    freq.terms.pat <- str_c(freq.terms$word, collapse = "|")

    # match frequent terms followed by another frequent term
    pairs <- str_extract_all(string = text, pattern = "freq.terms.pat(?= freq.terms.pat)")

And here my effort falters.
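For reference, one reason the last line falls over is that the variable name is quoted inside the pattern, so the regex searches for the literal text "freq.terms.pat". A minimal sketch of interpolating the variable into the lookahead instead (the terms and text here are made-up stand-ins, not the real freq_terms() output):

```r
library(stringr)

# hypothetical frequent-term pattern; in the question it comes from freq_terms()
freq.terms.pat <- "crude|oil|market"
text <- "the crude oil market fell"

# group the alternation and paste it into the pattern instead of quoting the name
pattern <- paste0("(", freq.terms.pat, ")(?= (", freq.terms.pat, "))")
m <- str_match_all(text, pattern)[[1]]

# column 2 is the matched term, column 3 the term seen by the lookahead
pairs <- paste(m[, 2], m[, 3])
pairs
# "crude oil"  "oil market"
```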

Not knowing Java or Python, these questions did not help me, though they may be useful references for others: Java count word pairs, Python count word pairs.

Thank you.

One idea here is to create a new corpus of bigrams:

A bigram or digram is every sequence of two adjacent elements in a string of tokens.

A recursive function to extract the bigrams:

    bigram <- function(xs){
      if (length(xs) >= 2)
        c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
    }
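A quick sanity check of the helper (the sample vector is invented):

```r
bigram <- function(xs){
  if (length(xs) >= 2)
    c(paste(xs[seq(2)], collapse = '_'), bigram(tail(xs, -1)))
}

# every token is paired with its right-hand neighbour
bigram(c("crude", "oil", "market"))
# "crude_oil"  "oil_market"
```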

Then apply it to the crude data from the tm package. (I do some text cleaning here, but the exact steps depend on your text.)

    res <- unlist(lapply(crude, function(x){
      x <- tm::removeNumbers(tolower(x))
      x <- gsub('\n|[[:punct:]]', ' ', x)
      x <- gsub('  +', ' ', x)
      ## after cleaning, compute the frequencies using table
      freqs <- table(bigram(strsplit(x, " ")[[1]]))
      freqs[freqs > 1]
    }))

    as.data.frame(tail(sort(res), 5))
                              tail(sort(res), 5)
    reut-00022.xml.hold_a                      3
    reut-00022.xml.in_the                      3
    reut-00011.xml.of_the                      4
    reut-00022.xml.a_futures                   4
    reut-00010.xml.abdul_aziz                  5

The bigrams "abdul aziz" and "a futures" are common, so you should re-clean the data to remove stop words ("of", "the", ...). But this should be a start.
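That re-cleaning step could look like this, using tm's built-in English stop word list rather than a hand-picked one (a sketch; the sample string is invented):

```r
library(tm)

x <- "the price of crude oil"
x <- removeWords(x, stopwords("english"))  # blanks out "the", "of", ...
x <- trimws(gsub('  +', ' ', x))           # collapse the leftover spaces
x
# "price crude oil"
```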

EDIT after OP's comments:

In case you want bigram frequencies over the whole corpus, the idea is to compute the bigrams within the loop and then compute the frequency of the combined result. For good measure I also added some better text processing and cleaning.

    library(data.table)

    res <- unlist(lapply(crude, function(x){
      x <- removeNumbers(tolower(x))
      x <- removeWords(x, words = c("the", "of"))
      x <- removePunctuation(x)
      x <- gsub('\n|[[:punct:]]', ' ', x)
      x <- gsub('  +', ' ', x)
      ## after cleaning, compute the bigrams over words longer than 2 characters
      words <- strsplit(x, " ")[[1]]
      bigrams <- bigram(words[nchar(words) > 2])
    }))

    xx <- as.data.frame(table(res))
    setDT(xx)[order(Freq)]
    #                 res Freq
    #    1: abdulaziz_bin    1
    #    2:  ability_hold    1
    #    3:  ability_keep    1
    #    4:  ability_sell    1
    #    5:    able_hedge    1
    #   ---
    # 2177:    last_month    6
    # 2178:     crude_oil    7
    # 2179:  oil_minister    7
    # 2180:     world_oil    7
    # 2181:    oil_prices   14
