matlab - Textscan on file with large number of lines -


I'm trying to analyze a large file using textscan in MATLAB. The file in question is 12 GB in size and contains 250 million lines with 7 (floating-point) numbers in each, delimited by whitespace. Because it does not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time. According to the documentation, this should allow processing of "arbitrarily large delimited text file[s]". However, this only lets me scan 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).

To debug, I attempted to go to several positions in the file using the fseek function, for example like this:

    fileinfo = dir(filename);
    fid = fopen(filename);
    fseek(fid, floor(fileinfo.bytes/10), 'bof');
    textscan(fid, '%f %f %f %f %f %f %f', 'delimiter', ' ');

I'm assuming that the way I'm using fseek here moves the position indicator to about 10% of the file. (I'm aware that this doesn't mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I substitute fileinfo.bytes/10 with fileinfo.bytes/2 (i.e. moving to about 50% of the file), everything breaks down and textscan only returns an empty 1x7 cell.

I looked at the file using a text editor for large files, and it shows that the entire file looks fine and that there should be no reason for textscan to be confused. The only possible explanation I can think of is that something goes wrong on a deeper level that I have little understanding of. Any suggestions would be appreciated!

Edit

The relevant part of the code I used is this:

    while ~feof(fid)
        data = textscan(fid, formatstring, nlines, 'delimiter', ' '); %// read nlines
        %// do stuff
    end

I first tried fixing this using ftell and fseek, as suggested by Hoki below. This gave the same error I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the headerlines solution (also suggested below), like this:

    i = 0;
    while ~feof(fid)
        frewind(fid)
        data = textscan(fid, formatstring, nlines, 'delimiter', ' ', 'headerlines', i*nlines);
        %// do stuff
        i = i + 1;
    end

This seems to read in the data without producing errors; it is, however, incredibly slow.

I'm not entirely sure I understand what headerlines does in this context, but it seems to make textscan ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail, for some reason I don't understand yet.

fseek moves the pointer in a file when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you want to skip records of known length. On a text file it is more dangerous and confusing (unless you are absolutely sure that each line is the same size and each element on the line is at the exact same place/column, which doesn't happen often).
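If you do jump to an arbitrary byte offset in a text file anyway, one way to get back onto a line boundary is to throw away the rest of the current line before parsing. The following is only a minimal sketch under stated assumptions: the small demo file and the variable names `status` and `partial` are made up for illustration and are not from the question. It also checks the return value of fseek, which is 0 on success and -1 on failure.

```matlab
% Hypothetical demo file: 25 lines with 7 integers each (line k holds 7k-6 ... 7k)
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%d %d %d %d %d %d %d\n', reshape(1:175, 7, 25));
fclose(fid);

fileinfo = dir(filename);
fid = fopen(filename, 'r');

% Jump to roughly the middle of the file and verify that the seek worked
status = fseek(fid, floor(fileinfo.bytes/2), 'bof');
if status ~= 0
    error('demo:seekFailed', 'fseek failed: %s', ferror(fid));
end

% The pointer is probably in the middle of a line: discard the remainder
partial = fgetl(fid);

% The next read now starts at the beginning of a line
m = textscan(fid, '%f %f %f %f %f %f %f', 1, 'Delimiter', ' ');

fclose(fid);
delete(filename);
```

The cost of this is that the partially skipped line is lost, so it suits spot checks at a given position more than a complete pass over the file.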

There are several ways to read a text file block by block:

1) Use the headerlines option

To skip a block of lines in a text file, you can use the headerlines parameter of textscan, for example:

    readformat = '%f %f %f %f %f %f %f';   %// read format specifier
    nlines = 10000;                        %// number of lines to read per block

    fileinfo = dir(filename);

    %// read the first block
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' '); %// read the first 10000 lines
    fclose(fid);
    %// do something with your "m" data

Then, when you want to read the second block:

    %// later, read the second block:
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' ', 'headerlines', nlines); %// read lines 10001 to 20000
    fclose(fid);

And if you have many blocks, for the nth block, just adapt:

    %// and for the nth block:
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' ', 'headerlines', (n-1)*nlines);
    fclose(fid);

If necessary (if you have many blocks), just code the last version in a loop.

Note that this is only valid if you close the file after each block reading (so the file pointer starts at the beginning of the file when you open it again). Closing the file after reading a block of data is safer if your processing might take a long time or may error out (you don't want files to remain open for too long, or to lose the fid if you crash).
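Coding the headerlines version in a loop could look roughly like the sketch below. The demo file, the block size, and the names `n` and `totalrows` are assumptions made for illustration only. Keep in mind that every pass makes textscan skip over all earlier lines again, so the total work grows roughly quadratically with the number of blocks, which fits the slowness reported in the question's edit.

```matlab
% Hypothetical demo file: 25 lines with 7 integers each
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%d %d %d %d %d %d %d\n', reshape(1:175, 7, 25));
fclose(fid);

readformat = '%f %f %f %f %f %f %f';
nlines = 10;        % lines per block (tiny here, just for the demo)
n = 0;              % number of blocks read so far
totalrows = 0;      % rows processed so far

while true
    % Reopen the file so that headerlines counts from the first line
    fid = fopen(filename, 'r');
    m = textscan(fid, readformat, nlines, 'Delimiter', ' ', ...
                 'HeaderLines', n*nlines);
    fclose(fid);

    % An empty block means there was nothing left to read
    if isempty(m{1})
        break
    end

    % Placeholder for the real processing of block "m"
    totalrows = totalrows + numel(m{1});
    n = n + 1;
end

delete(filename);
```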


2) Read by block (without closing the file)

If the processing of a block is quick and safe enough that you're sure it won't bomb out, you can afford not to close the file. In that case, the file pointer stays where textscan stopped, so you can:

  • read a block (do not close the file): m = textscan(fid, readformat, nlines)
  • process it and save the result (and release the memory)
  • read the next block with the same call: m = textscan(fid, readformat, nlines)

In this case you don't need the headerlines parameter, because textscan resumes reading exactly where it stopped.
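The steps above can be sketched as a single loop. This is a minimal illustration, not the asker's actual code: the demo file and the names `nblocks` and `totalrows` are made up. The check for an empty block is an addition of mine; it is meant to stop the loop and report the byte offset if textscan ever returns nothing before the end of the file, instead of spinning forever on `~feof(fid)`.

```matlab
% Hypothetical demo file: 25 lines with 7 integers each
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%d %d %d %d %d %d %d\n', reshape(1:175, 7, 25));
fclose(fid);

readformat = '%f %f %f %f %f %f %f';
nlines = 10;        % lines per block (tiny here, just for the demo)
nblocks = 0;
totalrows = 0;

fid = fopen(filename, 'r');
while ~feof(fid)
    % Each call continues from where the previous one stopped
    m = textscan(fid, readformat, nlines, 'Delimiter', ' ');

    % Guard: no data returned, so stop and report where the read stalled
    if isempty(m{1})
        if ~feof(fid)
            warning('demo:emptyBlock', ...
                    'textscan returned no data at byte offset %d', ftell(fid));
        end
        break
    end

    % Placeholder for the real processing of block "m"
    totalrows = totalrows + numel(m{1});
    nblocks = nblocks + 1;
end
fclose(fid);
delete(filename);
```

If the warning fires on a real file, looking at the raw text at the reported offset (for example with fgetl) is a reasonable next debugging step.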


3) Use ftell and fseek

Lastly, you can use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.

ftell returns the current position in an open file, so use it to record the position where you last stopped reading, then use fseek the next time to go straight to that position. Something like:

    %// read the first block
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' ');
    lastposition = ftell(fid);
    fclose(fid);

    %// do some stuff

    %// then read another block:
    fid = fopen(filename);
    fseek(fid, lastposition, 'bof');   %// fseek(fid, offset, origin)
    m = textscan(fid, readformat, nlines, 'delimiter', ' ');
    lastposition = ftell(fid);
    fclose(fid);
    %// and so on ...
