matlab - Textscan on file with large number of lines
I'm trying to analyze a large file using textscan in MATLAB. The file in question is 12 GB in size and contains 250 million lines with 7 (floating-point) numbers in each, delimited by whitespace. Because it does not fit into the RAM of my desktop, I'm using the approach suggested in the MATLAB documentation, i.e. loading and analyzing a smaller block of the file at a time; according to the documentation this should allow for processing "arbitrarily large delimited text file[s]". This only allows me to scan about 43% of the file, after which textscan starts returning empty cells (despite there still being data left to scan in the file).
To debug, I attempted to go to several positions in the file using the fseek function, for example like this:
    fileinfo = dir(filename);
    fid = fopen(filename);
    fseek(fid, floor(fileinfo.bytes/10), 'bof');
    textscan(fid, '%f %f %f %f %f %f %f', 'delimiter', ' ');
I'm assuming that the way I'm using fseek here moves the position indicator to 10% of the file. (I'm aware this doesn't necessarily mean the indicator is at the beginning of a line, but if I run textscan twice I get a satisfactory answer.) Now, if I substitute fileinfo.bytes/10 with fileinfo.bytes/2 (i.e. moving to 50% of the file), everything breaks down and textscan only returns an empty 1x7 cell.
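One way to rule out the seek itself is to check the status that fseek returns (0 on success, -1 on failure) and to discard the partial line with fgetl before calling textscan. The following is only a sketch on a small, made-up three-column temporary file; it does not reproduce or explain the failure at 50% of the 12 GB file.

```matlab
%// Build a small whitespace-delimited test file (hypothetical stand-in for the real data).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%f %f %f\n', rand(3, 100));   %// 100 lines, 3 numbers each
fclose(fid);

fileinfo = dir(filename);
fid = fopen(filename, 'r');

%// Jump to roughly the middle of the file and check that the seek succeeded.
status = fseek(fid, floor(fileinfo.bytes/2), 'bof');
if status ~= 0
    msg = ferror(fid);
    fclose(fid);
    error('fseek failed: %s', msg);
end

fgetl(fid);                          %// throw away the rest of the current line
m = textscan(fid, '%f %f %f', 5);    %// read the next 5 complete lines
fclose(fid);
delete(filename);
```

If status is nonzero, the message from ferror(fid) says why the position could not be set, which separates a failed seek from a failed parse.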
I looked at the file using a text editor for large files, and this shows that the entire file looks fine and that there should be no reason for textscan to be confused. The only possible explanation I can think of is that something goes wrong on a much deeper level that I have little understanding of. Any suggestions would be greatly appreciated!
Edit
The relevant part of the code I used is this:
    while ~feof(fid)
        data = textscan(fid, formatstring, nlines, 'delimiter', ' '); %// read nlines
        %// do stuff
    end
I first tried fixing it using ftell and fseek, as suggested by Hoki below. This gave exactly the same error as I got before: MATLAB was unable to read in more than approximately 43% of the file. Then I tried using the headerlines solution (also suggested below), like this:
    i = 0;
    while ~feof(fid)
        frewind(fid)
        data = textscan(fid, formatstring, nlines, 'delimiter', ' ', 'headerlines', i*nlines);
        %// do stuff
        i = i + 1;
    end
This seems to read in the data without producing errors; it is, however, incredibly slow.
I'm not entirely sure I understand what headerlines does in this context, but it seems to make textscan completely ignore everything that comes before the specified line. This doesn't seem to happen when using textscan in the "appropriate" way (either with or without ftell and fseek): in both cases it tries to continue from its last position, but to no avail because of some reason I don't understand yet.
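Because textscan stops at the first field it cannot convert and can then keep returning empty cells, a while ~feof(fid) loop may spin on the same position. A possible way to see what textscan is stuck on is to stop at the first empty block and dump the raw bytes at that position. This is a sketch only: the test file and its deliberately broken line are invented for the illustration, and nothing here is known about the real file.

```matlab
%// Hypothetical test file: 12 good lines, one non-numeric line, 12 more good lines.
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%f %f %f\n', rand(3, 12));
fprintf(fid, 'oops not numbers\n');
fprintf(fid, '%f %f %f\n', rand(3, 12));
fclose(fid);

formatstring = '%f %f %f';
nlines = 5;
stoppos = -1;      %// byte offset where textscan gave up (-1 = never)
rawtext = '';      %// raw characters found at that offset

fid = fopen(filename, 'r');
while ~feof(fid)
    data = textscan(fid, formatstring, nlines);
    if isempty(data{1})
        %// Nothing was parsed: record where we are and look at the raw bytes.
        stoppos = ftell(fid);
        rawtext = fread(fid, 80, '*char')';
        break
    end
    %// do stuff with data
end
fclose(fid);
delete(filename);

fprintf('textscan stopped at byte %d, next characters:\n%s\n', stoppos, rawtext);
```

On the real file, comparing stoppos with the file size and inspecting rawtext would show whether the reader is stuck on unexpected characters or on something else.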
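fseek moves the pointer in a file and is best used when you know precisely where (or by how many bytes) you want to move the cursor. It is very useful for binary files when you just want to skip some records of known length. On a text file it is more dangerous and confusing than anything (unless you are absolutely sure that each line is the same size and each element on the line is at the same exact place/column, which doesn't happen often).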
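To illustrate the binary case, here is a minimal sketch that jumps straight to one fixed-length record. The file layout (20 records of 7 doubles each) and all variable names are assumptions made up for the example.

```matlab
%// Write a hypothetical binary file: 20 records, each made of 7 doubles.
filename = [tempname '.bin'];
nfields = 7;
records = reshape(1:nfields*20, nfields, 20);   %// column k holds record k
fid = fopen(filename, 'w');
fwrite(fid, records, 'double');                 %// written column by column
fclose(fid);

recordbytes = nfields * 8;    %// a double takes 8 bytes
k = 13;                       %// record we want to read

fid = fopen(filename, 'r');
fseek(fid, (k-1)*recordbytes, 'bof');   %// skip the first k-1 records
rec = fread(fid, nfields, 'double');    %// read exactly one record
fclose(fid);
delete(filename);
```

This only works because every record has the same byte length, which is exactly what a text file with variable-width numbers does not guarantee.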
There are several ways to read a text file block by block:
1) Use the headerlines option
To skip a block of lines in a text file, you can use the headerlines parameter of textscan, for example:
    readformat = '%f %f %f %f %f %f %f'; %// read format specifier
    nlines = 10000; %// number of lines to read per block
    fileinfo = dir(filename);
    %// read the first block
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' '); %// read the first 10000 lines
    fclose(fid)
    %// do something with your "m" data
Then, when you want to read the second block:
    %// later read the second block:
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' ', 'headerlines', nlines); %// read lines 10001 to 20000
    fclose(fid)
And if you have many blocks, for the nth block just adapt:
    %// and for the nth block:
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' ', 'headerlines', (n-1)*nlines);
    fclose(fid)
If necessary (if you have many blocks), just code this last version in a loop.
Note that this only works if you close the file after reading each block (so that the file pointer starts at the beginning of the file when you open it again). Closing the file after reading a block of data is also safer if your processing might take a long time or may error out (you don't want to have files remain open too long, or to lose the fid if you crash).
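The loop version could look like the sketch below. It runs on a small, made-up three-column temporary file instead of the real data, relies on the default whitespace delimiter, and stops as soon as a block comes back shorter than nlines; the variable total is only there to show that every line was read once.

```matlab
%// Build a small whitespace-delimited test file (hypothetical stand-in for the real data).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%f %f %f\n', rand(3, 25));   %// 25 lines, 3 numbers each
fclose(fid);

readformat = '%f %f %f';
nlines = 10;    %// lines per block
total = 0;      %// lines read so far
n = 0;          %// number of blocks already read

while true
    fid = fopen(filename, 'r');
    %// Skip the n blocks already processed, then read one more block.
    m = textscan(fid, readformat, nlines, 'headerlines', n*nlines);
    fclose(fid);

    nread = numel(m{1});
    total = total + nread;
    %// do something with your "m" data

    if nread < nlines
        break    %// a short (or empty) block means the end of the file was reached
    end
    n = n + 1;
end
delete(filename);
```

Keep in mind that every pass has to walk through all the skipped lines again before it reaches the new block, so the total work grows roughly with the square of the number of blocks. On a file with hundreds of millions of lines this becomes very slow.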
2) Read by block (without closing the file)
If the processing of a block is quick and safe enough that you're sure it won't bomb out, you could afford not to close the file. In this case, the textscan file pointer will stay where it stopped, so you could:
- read a block (do not close the file):
    m = textscan(fid, readformat, nlines)
- process it and save your result (and release memory)
- read the next block with the same call:
    m = textscan(fid, readformat, nlines)
In this case you wouldn't need the headerlines parameter, because textscan will resume reading exactly where it stopped.
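Put together, the steps above could be sketched as follows. The temporary test file, the blocksizes bookkeeping and the try/catch wrapper (added so the file still gets closed if the processing errors out) are assumptions for the illustration and not part of the original suggestion.

```matlab
%// Build a small whitespace-delimited test file (hypothetical stand-in for the real data).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%f %f %f\n', rand(3, 25));   %// 25 lines, 3 numbers each
fclose(fid);

readformat = '%f %f %f';
nlines = 10;         %// lines per block
blocksizes = [];     %// number of lines obtained in each block

fid = fopen(filename, 'r');
try
    while true
        m = textscan(fid, readformat, nlines);   %// resumes where the last call stopped
        if isempty(m{1})
            break    %// nothing more could be read
        end
        blocksizes(end+1) = numel(m{1});
        %// process the block and save the result here
    end
catch err
    fclose(fid);     %// do not leave the file open if the processing fails
    rethrow(err);
end
fclose(fid);
delete(filename);
```

Breaking on an empty block rather than relying only on feof avoids an endless loop when textscan stops before the end of the file.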
3) Use ftell and fseek
Lastly, you could use fseek to start reading the file at the precise position you want, but in this case I recommend using it in conjunction with ftell.
ftell returns the current position in an open file, so use it to find out at which position you stopped reading last time, then use fseek next time to go straight to that position. Something like:
    %// read the first block
    fid = fopen(filename);
    m = textscan(fid, readformat, nlines, 'delimiter', ' ');
    lastposition = ftell(fid);
    fclose(fid)
    %// do stuff
    %// read another block:
    fid = fopen(filename);
    fseek(fid, lastposition, 'bof');
    m = textscan(fid, readformat, nlines, 'delimiter', ' ');
    lastposition = ftell(fid);
    fclose(fid)
    %// and so on ...
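As a loop, this approach might look like the sketch below, again on a small, made-up temporary file with the default whitespace delimiter. Note that fseek takes the byte offset as its second argument and the origin ('bof') as its third, and returns 0 on success.

```matlab
%// Build a small whitespace-delimited test file (hypothetical stand-in for the real data).
filename = [tempname '.txt'];
fid = fopen(filename, 'w');
fprintf(fid, '%f %f %f\n', rand(3, 25));   %// 25 lines, 3 numbers each
fclose(fid);

readformat = '%f %f %f';
nlines = 10;          %// lines per block
lastposition = 0;     %// byte offset where the previous block ended
total = 0;            %// lines read so far

while true
    fid = fopen(filename, 'r');
    status = fseek(fid, lastposition, 'bof');   %// go straight to the saved position
    if status ~= 0
        msg = ferror(fid);
        fclose(fid);
        error('fseek failed: %s', msg);
    end
    m = textscan(fid, readformat, nlines);
    lastposition = ftell(fid);                  %// remember where this block ended
    fclose(fid);

    if isempty(m{1})
        break    %// nothing more could be read
    end
    total = total + numel(m{1});
    %// do stuff with the block
end
delete(filename);
```

Unlike the headerlines variant, each pass starts directly at the saved offset, so earlier lines are not scanned again.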