R - Seeking on a gz connection is unpredictable
I'm having trouble seeking around in gzfiles in R. Here's an example:
    set.seed(123)
    m=data.frame(z=runif(10000),x=rnorm(10000))
    write.csv(m,"m.csv")
    system("gzip m.csv")
    file.info("m.csv.gz")$size
    [1] 195975
That creates m.csv.gz, which R says I can seek on, and seek seems to agree:
    gzf=gzfile("m.csv.gz")
    open(gzf,"rb")
    isSeekable(gzf)
    [1] TRUE
Now small jumps, back and forth, seem to work, but if I try a big jump, I get an error:
    seek(gzf,10)
    [1] 10
    seek(gzf,20)
    [1] 10
    seek(gzf,10)
    [1] 20
    seek(gzf,1000)
    [1] 100
    Warning message:
    In seek.connection(gzf, 1000) :
      seek on a gzfile connection returned an internal error
However, if I reset the connection and start again, I can get to 1000 if I do it in 100-byte steps:
    for(i in seq(100,1000,by=100)){seek(gzf,i)}
    seek(gzf,NA)
    [1] 1000
R has some harsh words on using seek on Windows: "Use of ‘seek’ on Windows is discouraged." But this is on a Linux box (R 3.1.1, 32 bit). Similar code in Python using the gz library works fine, seeking all over.
R 3.2.0 is more informative:
    Warning messages:
    1: In seek.connection(gzf, 1000) :
      invalid or incomplete compressed data
    2: In seek.connection(gzf, 1000) :
      seek on a gzfile connection returned an internal error
Any ideas? I've submitted this as a bug report now.
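This is an educated guess: small jumps are handled within the decoded buffer, but when you seek by more than the buffer size, R performs a raw seek and then tries to decode the gzip stream from the middle of a chunk, which leads to the decoding error. That would be a bug within the R library. I suggest you skip (read and discard data) instead of using seek, since the underlying library cannot do anything more than that anyway, so it won't have an impact on performance.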
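A minimal sketch of the skip approach, assuming it is acceptable to reopen the file and read forward from the start. The helper name `skip_to` and the chunk size are hypothetical, not part of R; only `gzfile`, `readBin`, `writeBin` and `close` are real base R functions.

```r
# Hypothetical helper: position a gzfile connection at an uncompressed byte
# offset by reading and discarding data, never calling seek().
skip_to <- function(path, offset, chunk = 65536L) {
  con <- gzfile(path, "rb")
  remaining <- offset
  while (remaining > 0) {
    got <- readBin(con, what = "raw", n = min(chunk, remaining))
    if (length(got) == 0L) {
      close(con)
      stop("offset is past the end of the uncompressed data")
    }
    remaining <- remaining - length(got)
  }
  con  # the caller is responsible for close(con)
}

# Small demonstration on a temporary file.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "wb")
writeBin(as.raw(rep(0:255, 40)), out)   # 10240 uncompressed bytes
close(out)

con <- skip_to(tmp, 1000)
next_bytes <- readBin(con, what = "raw", n = 4)
close(con)
```

Going backwards with this approach means closing the connection and skipping forward again from the beginning.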
I checked RFC 1952 and RFC 1951. In gzip you can know the complete uncompressed size of the file before extracting it, by reading each member's ISIZE field (stored in the member trailer) and summing them. But you cannot know how big a deflated block is without decoding it (the size of each symbol depends on the dictionary), so you cannot seek in a common gzip stream.
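As an illustration of the ISIZE point, here is a rough sketch that reads the field from a file, assuming the file holds a single gzip member smaller than 4 GiB (ISIZE is the uncompressed length modulo 2^32). The helper name `gzip_isize` is hypothetical.

```r
# Hypothetical helper: read ISIZE, the last 4 bytes of a gzip member
# (little-endian, uncompressed length modulo 2^32, per RFC 1952).
gzip_isize <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  stopifnot(length(bytes) >= 18)   # 10-byte header plus 8-byte trailer
  last4 <- as.integer(bytes[(length(bytes) - 3):length(bytes)])
  sum(last4 * 256^(0:3))           # least significant byte comes first
}

# Small demonstration on a temporary file.
tmp <- tempfile(fileext = ".gz")
out <- gzfile(tmp, "wb")
writeBin(as.raw(rep(0:255, 40)), out)   # 10240 uncompressed bytes
close(out)

isize <- gzip_isize(tmp)
```

This only tells you the total size. It says nothing about where a given uncompressed offset lives inside the compressed stream, which is why it does not help with seeking.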
If you want to seek in a gzip file, you must index it beforehand.
The dictzip library adds headers that allow seeking.
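To show the general idea of indexing beforehand, here is a rough sketch that compresses data in independent chunks and keeps a table of where each compressed chunk starts. This is not the dictzip format and the result is not a valid .gz file. The names `chunk_compress` and `read_at` are hypothetical, and the sketch assumes the data fits in memory as a non-empty raw vector.

```r
# Compress `data` (a non-empty raw vector) as independent chunks and record
# where each compressed chunk starts inside the combined blob.
chunk_compress <- function(data, chunk_size = 4096L) {
  starts <- seq(1, length(data), by = chunk_size)
  chunks <- lapply(starts, function(s) {
    memCompress(data[s:min(s + chunk_size - 1, length(data))], type = "gzip")
  })
  sizes <- vapply(chunks, length, integer(1))
  list(
    blob = do.call(c, chunks),
    offsets = cumsum(c(0, sizes[-length(sizes)])),  # 0-based chunk starts
    sizes = sizes,
    chunk_size = chunk_size,
    total = length(data)
  )
}

# Return n bytes starting at a 0-based uncompressed offset, decompressing
# only the chunks that overlap the requested range.
read_at <- function(idx, offset, n) {
  stopifnot(offset >= 0, n >= 1, offset + n <= idx$total)
  first <- offset %/% idx$chunk_size + 1
  last <- (offset + n - 1) %/% idx$chunk_size + 1
  parts <- lapply(first:last, function(k) {
    from <- idx$offsets[k] + 1
    memDecompress(idx$blob[from:(from + idx$sizes[k] - 1)], type = "gzip")
  })
  plain <- do.call(c, parts)
  skip <- offset - (first - 1) * idx$chunk_size
  plain[(skip + 1):(skip + n)]
}

payload <- as.raw(rep(0:255, 80))   # 20480 bytes
idx <- chunk_compress(payload)
piece <- read_at(idx, offset = 5000, n = 3000)
```

Smaller chunks make random access cheaper but cost compression ratio, since each chunk is compressed without reference to the others.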