In R, for all columns count factors/characters occurence, group by key -


this question needs solving data.table or dplyr. have dataset (data frame) looks this:

summary(mooc_events)  signup_id         time                source               event          min.   :     1   min.   :2013-10-27   browser:3869940   access    :3112191    1st qu.: 18721   1st qu.:2013-12-19   server :4287337   discussion: 649259    median : 48331   median :2014-05-30                     navigate  :1009309    mean   : 63476   mean   :2014-04-05                     page_close:1237883    3rd qu.:110375   3rd qu.:2014-06-15                     problem   :1261170    max.   :200905   max.   :2014-08-01                     video     : 796958                                                            wiki      :  90507       artefact_sha           length:8157277      class :character    mode  :character   

one signup_id have multiple events, there many rows starts same signup_id.

what i'm trying achieve aggregated dataset (data.table or data frame) columns there distinct values per particular column, grouped signup_id, data this:

signup_id, source_browser, source_server, event_access, event_discussion, ... , event_wiki, artefact_sha_{first_element_in_whole_dataset}, ..., artefact_sha_{last_element_in_whole_dataset}  1, 23, 37, 9, 0, ..., 3, 7, ..., 1 2, 2, 7, 2, 2, ..., 1, 0, ..., 0 

in other words, it's counting occurrences given set of columns, grouped single column signup_id, i'm not interesting in grouping e.g. both signup_id , source.

column naming not strict (_ can substituted makes sense).

(let's skip time column @ moment)

best regards , upfront.

it's more reshaping problem solved using tidyr , reshape2 libraries.

reshaping tidyr , counting occurences reshape2 :

my example doesn't include artefact_sha because didn't understand want it.

library(dplyr) # or library(magrittr) pipe syntax library(tidyr) library(reshape2)  set.seed(42) mooc_events <- data.frame(signup_id = rep(1:3, each = 5),                      time = sys.date(),                      source = sample(c("browser", "server"), 15, rep = true),                      event = sample(c("access", "discussion", "navigate"), 15, rep = true),                      stringsasfactors = false)  mooc_events.m <-    mooc_events %>%    gather(key, value, -c(signup_id, time)) %>%    unite(var, key, value, sep = "_")  mytable <- dcast(mooc_events.m, signup_id ~ var, fun.aggregate = length)  > mytable   signup_id event_access event_discussion event_navigate source_browser source_server 1         1            1                2              2              1             4 2         2            2                0              3              1             4 3         3            0                3              2              3             2 

Comments

Popular posts from this blog

c# - Validate object ID from GET to POST -

node.js - Custom Model Validator SailsJS -

php - Find a regex to take part of Email -