Spark with Avro, Kryo and Parquet
I'm struggling to understand how Avro, Kryo and Parquet relate to each other in the context of Spark. They are all related to serialization, but I've seen them used together, so they can't all be doing the same thing.

Parquet describes itself as a columnar storage format, and I kind of get that, but when I'm saving a Parquet file, can Avro or Kryo have anything to do with it? Or are they only relevant during the Spark job itself, i.e. for sending objects over the network during a shuffle or for spilling to disk? How do Avro and Kryo differ, and what happens when you use them together?
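For context, my current (possibly wrong) understanding of the Kryo side is that it is a job-level setting rather than anything tied to a file format. A minimal sketch of how I would enable it (MyRecord and the app name are just placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder class standing in for whatever my job actually shuffles.
case class MyRecord(id: Long, name: String)

// Kryo is switched on at the SparkConf level; as far as I can tell it only
// affects shuffle traffic, serialized caching and disk spills, not the
// format of files such as Parquet.
val conf = new SparkConf()
  .setAppName("kryo-question")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a short id instead of the full class name.
  .registerKryoClasses(Array(classOf[MyRecord]))

val spark = SparkSession.builder().config(conf).getOrCreate()
```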
Parquet works well when you only need to read a few columns when querying your data. If your schema has lots of columns (30+) and your queries/jobs need to read all of them, then record-based formats (like Avro) will work better/faster.
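For example, in Spark the column pruning happens automatically as soon as you project a subset of columns from a Parquet source. A rough sketch (the path and column names here are invented):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-pruning").getOrCreate()

// Because Parquet is columnar, selecting two columns out of a wide table
// only reads those two columns' data from disk; the other columns are skipped.
val events = spark.read.parquet("/warehouse/events")   // made-up path
events.select("user_id", "event_time").show()
```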
Another limitation of Parquet is that it is essentially a write-once format. You need to collect data in a staging area and write it out as a Parquet file once a day, for example.
This is where you might want to use Avro: e.g. you can collect Avro-encoded records in a Kafka topic or in local files and have a batch job that converts all of them into a Parquet file at the end of the day. This is easy to implement thanks to the parquet-avro library, which provides tools to convert between the Avro and Parquet formats automatically.
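As a rough sketch of that end-of-day conversion step, writing Avro GenericRecords into a Parquet file with parquet-avro can look roughly like this (the schema, field names and path are invented, and it assumes the parquet-avro and Hadoop client jars are on the classpath):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

// Invented schema for illustration: one string field and one long field.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"Event","fields":[
    |  {"name":"user_id","type":"string"},
    |  {"name":"ts","type":"long"}
    |]}""".stripMargin)

// Writer that takes Avro records and lays them out as Parquet column chunks.
val writer = AvroParquetWriter
  .builder[GenericRecord](new Path("/staging/events/2016-01-01.parquet"))
  .withSchema(schema)
  .withCompressionCodec(CompressionCodecName.SNAPPY)
  .build()

// In the real batch job these records would come from Kafka or local Avro files.
val record = new GenericData.Record(schema)
record.put("user_id", "u42")
record.put("ts", 1453939200000L)
writer.write(record)
writer.close()
```

The same library also ships an AvroParquetReader for going back in the other direction.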
And of course you can use Avro outside of Spark/BigData as well. It is a serialization format similar to Google Protobuf or Apache Thrift.
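For instance, encoding and decoding a single record with nothing but the core Avro library, no Spark involved, looks roughly like this (the User schema is made up):

```scala
import java.io.ByteArrayOutputStream

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Made-up schema, just to show the round trip.
val schema = new Schema.Parser().parse(
  """{"type":"record","name":"User","fields":[
    |  {"name":"name","type":"string"},
    |  {"name":"age","type":"int"}
    |]}""".stripMargin)

val user = new GenericData.Record(schema)
user.put("name", "Ada")
user.put("age", 36)

// Encode to Avro binary (this is what you would push to Kafka or a file).
val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](schema).write(user, encoder)
encoder.flush()
val bytes = out.toByteArray

// Decode it back with the same schema.
val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
val roundTrip = new GenericDatumReader[GenericRecord](schema).read(null, decoder)
println(roundTrip) // {"name": "Ada", "age": 36}
```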