Spark with Avro, Kryo and Parquet


I'm struggling to understand Avro, Kryo and Parquet in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't all be doing the same thing.

Parquet describes itself as a columnar storage format, and I kind of get that, but when I'm saving a Parquet file, can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job itself, i.e. for sending objects over the network during a shuffle or when spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?

Parquet works very well when you only need to read a few columns when querying your data. However, if your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better/faster.
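A minimal sketch of the difference, assuming hypothetical dataset paths and column names, and assuming the spark-avro module is on the classpath for the Avro read. With Parquet, selecting two columns means only those column chunks are read from disk; with a record-based format, each record is decoded in full even if you only keep two fields:

```scala
import org.apache.spark.sql.SparkSession

object ColumnPruningExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-column-pruning")
      .master("local[*]") // local run, just for illustration
      .getOrCreate()

    // Parquet: only the user_id and timestamp column chunks are read.
    val events = spark.read.parquet("data/events.parquet")
    events.select("user_id", "timestamp").show(10)

    // Avro (record-based): every record is read in full, then trimmed.
    val eventsAvro = spark.read.format("avro").load("data/events.avro")
    eventsAvro.select("user_id", "timestamp").show(10)

    spark.stop()
  }
}
```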

Another limitation of Parquet is that it is essentially a write-once format. So you typically need to collect data in some staging area and write it out to a Parquet file once a day (for example).

This is where you might want to use Avro. E.g. you can collect Avro-encoded records in a Kafka topic or in local files and have a batch job that converts all of them to a Parquet file at the end of the day. This is fairly easy to implement thanks to the parquet-avro library, which provides tools to convert between the Avro and Parquet formats automatically.
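A rough sketch of such a batch conversion using parquet-avro, assuming hypothetical input/output paths and a single staged Avro file (a real daily job would loop over all staged files):

```scala
import org.apache.avro.file.DataFileReader
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.ParquetWriter

import java.io.File
import scala.jdk.CollectionConverters._

object AvroToParquet {
  def main(args: Array[String]): Unit = {
    // Hypothetical staging input and daily Parquet output
    val avroFile    = new File("staging/records.avro")
    val parquetPath = new Path("output/records.parquet")

    // The schema travels with the Avro file, so we can read it back out.
    val reader = new DataFileReader[GenericRecord](
      avroFile, new GenericDatumReader[GenericRecord]())
    val schema = reader.getSchema

    // parquet-avro derives the Parquet schema from the Avro schema.
    val writer: ParquetWriter[GenericRecord] =
      AvroParquetWriter.builder[GenericRecord](parquetPath)
        .withSchema(schema)
        .build()

    try {
      reader.iterator().asScala.foreach(writer.write)
    } finally {
      writer.close()
      reader.close()
    }
  }
}
```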

And of course you can use Avro outside of Spark/big data. It is a general-purpose serialization format, similar to Google Protobuf or Apache Thrift.
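For example, a small standalone sketch of Avro binary serialization with no Spark involved, assuming a made-up "User" schema:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.EncoderFactory

import java.io.ByteArrayOutputStream

object PlainAvroExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical schema, defined inline for the example
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"User","fields":[
        |  {"name":"name","type":"string"},
        |  {"name":"age","type":"int"}
        |]}""".stripMargin)

    // Build a record and serialize it to compact Avro binary
    val record: GenericRecord = new GenericData.Record(schema)
    record.put("name", "Alice")
    record.put("age", 42)

    val out     = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()

    println(s"Serialized ${out.size()} bytes")
  }
}
```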

