r - Seeking on a gz connection is unpredictable


I'm having trouble seeking around gzfiles in R. Here's an example:

    set.seed(123)
    m=data.frame(z=runif(10000),x=rnorm(10000))
    write.csv(m,"m.csv")
    system("gzip m.csv")
    file.info("m.csv.gz")$size
    [1] 195975

That creates m.csv.gz, which R says it can seek on, and the help for seek seems to agree:

    gzf=gzfile("m.csv.gz")
    open(gzf,"rb")
    isSeekable(gzf)
    [1] TRUE

Now small jumps, back and forth, seem to work, but if I try a big jump, I get an error:

    seek(gzf,10)
    [1] 10
    seek(gzf,20)
    [1] 10
    seek(gzf,10)
    [1] 20
    seek(gzf,1000)
    [1] 100
    Warning message:
    In seek.connection(gzf, 1000) :
      seek on a gzfile connection returned an internal error

However, if I reset the connection and start again, I can get to 1000 if I do it in 100-byte steps:

    for(i in seq(100,1000,by=100)){seek(gzf,i)}
    seek(gzf,NA)
    [1] 1000

R has some harsh words on using seek in Windows: "Use of ‘seek’ on Windows is discouraged." But this is on a Linux box (R 3.1.1, 32 bit). Similar code in Python using the gzip module works fine, seeking all over.

R 3.2.0 is slightly more informative:

    Warning messages:
    1: In seek.connection(gzf, 1000) :
      invalid or incomplete compressed data
    2: In seek.connection(gzf, 1000) :
      seek on a gzfile connection returned an internal error

Any ideas? I've submitted this as a bug report now.

This is just an educated guess: small jumps are handled within the decoded buffer, but when you seek by more than the buffer size, the library performs a raw seek and then tries to decode gzip from the middle of a chunk, which leads to a decoding error. This is most likely a bug within the R library. I suggest you use skip instead of seek; since the underlying library cannot do any better, it won't have an impact on performance.
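As a rough illustration of the skip approach, the sketch below reaches an uncompressed offset by reading forward and discarding bytes rather than calling seek. The helper name skip_to and the chunk size are hypothetical choices, not part of R; the demo writes its own small gzip file so it does not depend on m.csv.gz.

```r
# Hypothetical helper: reach an uncompressed offset by reading forward
# and discarding bytes instead of calling seek().
skip_to <- function(con, offset, chunk = 65536L) {
  remaining <- offset
  while (remaining > 0) {
    got <- readBin(con, what = "raw", n = min(chunk, remaining))
    if (length(got) == 0L) {
      stop("reached end of stream before offset ", offset)
    }
    remaining <- remaining - length(got)
  }
  invisible(offset)
}

# Self-contained demo on a small temporary gzip file.
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wb")
writeBin(as.raw(rep(0:255, 40)), out)  # 10240 bytes of known content
close(out)

gzf <- gzfile(path, "rb")
skip_to(gzf, 1000)                      # discard the first 1000 bytes
bytes <- readBin(gzf, what = "raw", n = 4)
close(gzf)
print(bytes)
```

This only moves forward. To go back, the connection would have to be closed and reopened, then skipped from the start again.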

I checked RFC 1952 and RFC 1951. In gzip you can know the complete size of the file before extracting by reading each member's trailer and summing the ISIZE fields, but you cannot know how big a deflated block is without decoding it (the size of each symbol is defined by a dictionary), so you cannot seek in a common gzip stream.
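For the common case of a file with a single gzip member, the ISIZE value can be read from the last four bytes of the file without decompressing anything. The sketch below assumes exactly that case; the helper name gzip_isize is hypothetical, and it reads the whole compressed file into memory for simplicity, which is only reasonable for small files. RFC 1952 stores ISIZE little-endian as the uncompressed size modulo 2^32.

```r
# Hypothetical helper: read ISIZE from the trailer of a gzip file.
# Assumes a single-member file; with several members this returns only
# the size of the last one.
gzip_isize <- function(path) {
  size <- file.info(path)$size
  bytes <- readBin(path, what = "raw", n = size)
  trailer <- as.integer(tail(bytes, 4))   # little-endian, 4 bytes
  sum(trailer * 256^(0:3))
}

# Self-contained demo: compress 10240 bytes, then read the size back.
path <- tempfile(fileext = ".gz")
out <- gzfile(path, "wb")
writeBin(as.raw(rep(0:255, 40)), out)
close(out)

isize <- gzip_isize(path)
print(isize)
```

This gives the total uncompressed size, but says nothing about where a given uncompressed offset sits inside the compressed stream, which is why it does not help with seeking.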

If you want to seek in a gzip file, you must index it beforehand.

Dictzip is a library that adds headers to allow seeking.

