python - Iterating over a file omitting lines based on a condition efficiently
Ahoi! I've been tasked with improving the performance of bit.ly's data_hacks sample.py as a practice exercise.

I have Cythonized part of the code and included a PCG random generator, which has so far improved performance to 20 seconds (down from 72s), along with optimizing the print output (by using a basic C function instead of Python's write()).

This has worked well, but aside from these fix-ups, I'd now like to optimize the loop itself.
The basic function, as seen in bit.ly's sample.py:

    def run(sample_rate):
        input_stream = sys.stdin
        for line in input_stream:
            if random.randint(1, 100) <= sample_rate:
                sys.stdout.write(line)
My implementation:

    cdef int take_sample(float sample_rate):
        cdef unsigned int floor = 1
        cdef unsigned int top = 100
        if pcg32_random() % 100 <= sample_rate:
            return 1
        else:
            return 0

    def run(float sample_rate, file):
        cdef char* line
        with open(file, 'rb') as f:
            for line in f:
                if take_sample(sample_rate):
                    out(line)
What I would like to improve on now is skipping the next line (and preferably doing so repeatedly) whenever take_sample() doesn't return true.
My current implementation of this:

    def run(float sample_rate, file):
        cdef char* line
        with open(file, 'rb') as f:
            for line in f:
                out(line)
                while not take_sample(sample_rate):
                    next(f)
which appears to do nothing to improve performance - leading me to suspect I've merely replaced a continue call after an if condition at the top of the loop with next(f).
So my question is this:

Is there a more efficient way to loop over the file (in Cython)?
I'd like to omit lines entirely, meaning a line should only really be accessed if I call out() - is that the case in Python's for loop? Is line a pointer (or something comparable) to a line of the file, or does the loop itself load the line?
I realize I could improve on this by writing it entirely in C, but I'd like to know how far I can push this while staying in Python/Cython.
Update: I've tested a C variant of the code - using the same test case - and it clocks in at under 2s (surprising no one). So, while the random generator and the file I/O are the two major bottlenecks generally speaking, it should be pointed out that Python's file handling is darn slow.

So, is there a way to make use of C's file reading, other than implementing the loop in Cython? The overhead is still slowing the Python code down significantly, which makes me wonder whether I've hit the sonic wall of performance when it comes to file handling with Cython.
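For reference, here is a rough sketch of what I imagine C-level reading from Cython could look like, using Cython's bundled libc.stdio declarations. take_sample() and out() are my existing helpers from above; run_stdio and the buffer size are just placeholders, so treat this as a sketch rather than working code:

    from libc.stdio cimport FILE, fopen, fgets, fclose

    def run_stdio(float sample_rate, filename):
        cdef char buf[65536]          # line buffer; size is an arbitrary choice
        cdef FILE* f = fopen(filename.encode(), "rb")
        if f == NULL:
            raise IOError("could not open %s" % filename)
        try:
            # fgets reads one '\n'-terminated line at a time, entirely C-side,
            # bypassing Python file objects
            while fgets(buf, sizeof(buf), f) != NULL:
                if take_sample(sample_rate):
                    out(buf)
        finally:
            fclose(f)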
If the file is small, you may read it whole with .readlines() at once (possibly reducing IO traffic) and iterate over the sequence of lines. If the sample rate is small enough, you may also consider sampling from a geometric distribution, which may be more efficient; a sketch of that idea follows below.
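A rough, untested sketch of the geometric idea: instead of rolling the dice once per line, draw the gap to the next kept line directly (run_geometric is just an illustrative name, and it assumes 0 < sample_rate < 100):

    import math
    import random
    import sys

    def run_geometric(sample_rate, stream=sys.stdin):
        # probability of keeping any single line; assumes 0 < sample_rate < 100
        p = sample_rate / 100.0
        log_q = math.log(1.0 - p)
        while True:
            # gap = number of lines skipped before the next kept line;
            # floor(log(U) / log(1 - p)) with U in (0, 1] is geometric
            gap = int(math.log(1.0 - random.random()) / log_q)
            for _ in range(gap):
                if next(stream, None) is None:
                    return
            line = next(stream, None)
            if line is None:
                return
            sys.stdout.write(line)

This replaces one random draw per line with one draw per kept line, which is where the win comes from at low sample rates.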
I do not know Cython, but I would also consider the following (see the sketch after this list):
- simplifying take_sample() by removing the unnecessary variables and returning the boolean result of the test instead of an integer,
- changing the signature of take_sample() to take_sample(int) to avoid the int-to-float conversion on every test.
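Something along these lines, perhaps (untested, as said I don't know Cython; it assumes your pcg32_random() and Cython's bint boolean type):

    cdef bint take_sample(int sample_rate):
        # int-to-int comparison, no float conversion; the bint return
        # type hands back the boolean result of the test directly
        return pcg32_random() % 100 <= sample_rate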
[edit]
According to the comment of @hpaulj, it may be better if you use .read().split('\n') instead of the .readlines() suggested by me.
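For example (path is a placeholder filename), noting that the two differ slightly in what they yield:

    with open(path) as f:
        lines = f.readlines()         # keeps the trailing '\n' on each line

    with open(path) as f:
        lines = f.read().split('\n')  # no '\n'; trailing '' if file ends with one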