How to have a Spark DataFrame be constantly updated as writes occur in the DB backend?
Basically I have Spark sitting in front of a database, and I'm wondering how to go about having the DataFrame updated with new data from the backend.
The trivial way I can think of to solve this is to re-run the query against the database every couple of minutes, but that is inefficient and still results in stale data for the time between updates.
I'm not 100% sure whether the database I'm working with has this restriction, but I think rows are only ever added; there are no modifications to existing rows.
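For context, here is a minimal sketch of the polling approach described above, assuming a JDBC-accessible database; the URL, table name, and credentials are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession

object PollingReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("naive-polling")
      .getOrCreate()

    // Hypothetical connection details; replace with your own.
    val jdbcUrl = "jdbc:postgresql://db-host:5432/mydb"
    val table   = "events"

    while (true) {
      // Re-read the whole table every couple of minutes.
      // Inefficient, and data is stale between refreshes.
      val df = spark.read
        .format("jdbc")
        .option("url", jdbcUrl)
        .option("dbtable", table)
        .option("user", "spark")
        .option("password", "secret")
        .load()

      df.createOrReplaceTempView("events")
      println(s"Refreshed ${df.count()} rows")

      Thread.sleep(2 * 60 * 1000) // wait two minutes before polling again
    }
  }
}
```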
A DataFrame is an RDD + schema + many other functionalities. By basic Spark design, RDDs are immutable; hence, you cannot update a DataFrame after it has been materialized. In your case, you can mix Spark Streaming and Spark SQL as below (a code sketch follows the list):
- In the DB, write the data to a queue alongside the writes to the tables
- Use Spark's queue stream to consume the queue and create DStreams (RDDs every X seconds)
- For each incoming RDD, combine it with the existing DataFrame to create a new DataFrame
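A minimal sketch of this streaming-plus-SQL mix, assuming an external poller (not shown) pushes RDDs of new rows into the queue, and interpreting the last step as a union since rows are append-only; the schema and view name are hypothetical:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object QueueStreamUnion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("queue-stream-union").getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(10))

    // Hypothetical schema matching the rows written to the queue.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("payload", StringType)
    ))

    // An external process would enqueue RDDs of newly written rows here.
    val rddQueue = new mutable.Queue[RDD[Row]]()
    val stream   = ssc.queueStream(rddQueue)

    // Start from the current snapshot of the table (could also be loaded via JDBC).
    var current: DataFrame =
      spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

    stream.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val incoming = spark.createDataFrame(rdd, schema)
        // Rows are append-only, so a union suffices; dedup or join if rows could change.
        current = current.union(incoming)
        current.createOrReplaceTempView("events")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each batch replaces the registered "events" view, so downstream SQL queries always see the latest combined DataFrame rather than the stale initial snapshot.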