Discussion:
[Pyparsing] Incremental parsing with no gaps between parsed ranges?
Dan Lenski
2014-10-27 15:57:10 UTC
I'm using PyParsing to parse some rather large text files with a C-like
format (braces and semicolons and all that). PyParsing works just great,
but it is slow and consumes a very large amount of memory due to the
size of my files.

I wanted to try to implement an incremental parsing approach wherein I'd
parse the top-level elements of the source file one-by-one.

The scanString method seems like the obvious way to do this. However, I
want to make sure that there is no invalid/unparseable text in-between
the sections parsed by scanString, and can't figure out a good way to do
this. Here's a simplified example that shows the problem I'm having:

sample="""f1(1,2,3); f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""

from pyparsing import *

COMMENT = Suppress('#' + restOfLine)
SEMI,COMMA,LPAREN,RPAREN = map(Suppress,';,()')

ident = Word(alphas, alphanums+"_")
integer = Word(nums+"+-",nums)

statement = (ident("fn") + LPAREN +
             Group(Optional(delimitedList(integer)))("arguments") + RPAREN + SEMI)

p = statement.ignore(COMMENT)

for res, start, end in p.scanString(sample):
print "***** (%d,%d)" % (start, end)
print res.dump()

When I run this, the ranges returned by scanString are discontiguous
due to unparsed text between them: (0,10), (11,25), (53,62), (88,98).

Two of these gaps are whitespace or comments, which should not trigger
an error, but one of them (`this should be an error;`) contains
unparseable text, which I want to catch.

Is there a way to use pyparsing to parse a file incrementally while
still ensuring that the entire input could be parsed with the specified
parser grammar? Perhaps it is possible to make scanString "greedy" so
that it parses valid whitespace or comments following each range? If so,
that would help me resolve this issue since it would ensure that gaps
only occur between the returned ranges when there's an error.
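
To make the question concrete, here is a rough sketch of the kind of
gap check I have in mind (`filler` and `scan_no_gaps` are names I just
made up for illustration, not pyparsing API):

filler = Suppress(ZeroOrMore(COMMENT | White(" \t\r\n")))

def scan_no_gaps(expr, text):
    last_end = 0
    for tokens, start, end in expr.scanString(text):
        gap = text[last_end:start]
        try:
            # parseAll=True forces the entire gap to match the filler
            filler.parseString(gap, parseAll=True)
        except ParseException:
            raise ValueError("unparseable text at offset %d: %r"
                             % (last_end, gap))
        last_end = end
        yield tokens, start, end
    # check any text remaining after the final match, too
    filler.parseString(text[last_end:], parseAll=True)

This illustrates the check I'm after, but ideally scanString itself
would consume the filler so that no post-hoc gap inspection is needed.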

Thanks,
Dan Lenski

PS-I found this related thread on the discussion board, but there
doesn't appear to be a resolution for this issue in it:
http://pyparsing.wikispaces.com/share/view/30891763


------------------------------------------------------------------------------
Daniel Lenski
2014-10-27 16:32:14 UTC
Thanks for the quick response, Paul.

With a 10 MiB input file of which no top-level element is longer than ~10
kB, it takes about 5 GiB of memory and 5 minutes before parseString()
starts returning results. I tried enablePackrat() and memory usage is
somewhat higher but speed is not appreciably improved. Based on the
docstring, I wouldn't expect enablePackrat to make a big improvement, since
every lengthy block in the grammar I'm trying to parse is introduced with a
unique keyword, so I don't think there's much backtracking-and-reparsing.

So I think I'm going to have to do incremental parsing in order to get
reasonably fast feedback from the parser. Do you have any suggestions for
how to do this? I'm trying to figure out if there's a good way to do greedy
consumption of trailing (whitespace|comments) at the end of each valid
top-level element.
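
The rough idea would be something like this (untested sketch, with
`top_level` standing in for my real top-level parser element):

# Tack an explicit whitespace/comment filler onto the top-level
# element, so that each match extends right up to the start of the
# next element; any gap between scanString ranges is then a real error.
trailing = Suppress(ZeroOrMore(COMMENT | White(" \t\r\n")))
greedy_top_level = top_level + trailing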

-Dan

Post by Paul McGuire
Before you go too far down this path, try enabling packrat parsing, which
should help both performance and memory footprint.

ParserElement.enablePackrat()

-- Paul
Post by Dan Lenski
I'm using PyParsing to parse some rather large text files with a C-like
format (braces and semicolons and all that). PyParsing works just great,
but it is slow and consumes a very large amount of memory due to the
size of my files.
------------------------------------------------------------------------------
Dan Lenski
2014-10-27 19:08:44 UTC
Post by Daniel Lenski
So I think I'm going to have to do incremental parsing in order to get
reasonably fast feedback from the parser. Do you have any suggestions for
how to do this? I'm trying to figure out if there's a good way to do greedy
consumption of trailing (whitespace|comments) at the end of each valid
top-level element.
I modified parseString very slightly and came up with
parseConsumeString().

This version calls self._parse() followed by self.preParse() repeatedly
to do what I want when self is the parser for a "top-level" item.

def parseConsumeString(self, instring, parseAll=False,
                       yieldLoc=True, loopResetCache=False):
    if not loopResetCache:
        ParserElement.resetCache()
    if not self.streamlined:
        self.streamline()
        #~ self.saveAsList = True
    for e in self.ignoreExprs:
        e.streamline()
    if not self.keepTabs:
        instring = instring.expandtabs()
    try:
        loc = 0
        while loc < len(instring):
            sloc = loc
            if loopResetCache:
                ParserElement.resetCache()
            # parse one top-level element starting at loc...
            loc, tokens = self._parse(instring, loc)
            if yieldLoc:
                yield tokens, sloc, loc
            else:
                yield tokens
            # ...then skip any trailing whitespace before the next one
            loc = self.preParse(instring, loc)
    except ParseBaseException as exc:
        if not parseAll:
            return
        if ParserElement.verbose_stacktrace:
            raise
        else:
            # catch and re-raise exception from here; clears out
            # pyparsing internal stack trace
            raise exc

By the way, I moved ParserElement.resetCache() into the loop, in order
to drastically reduce memory consumption with packrat caching. Memory
consumption goes down from around 6G peak to around 100M peak, while
running about 15-20% faster. This is on a Core i7 980X with 8G of RAM,
Win7.

In [1]: import my_parser_module as P

In [2]: sample = open("large_file").read()
In [3]: len(sample)
Out[3]: 9153816

In [4]: %timeit -n1 for r in P.parseConsumeString(P.TopLevel.ignore(P.COMMENT), sample, True, True, True): pass
1 loops, best of 3: 1min 10s per loop

In [6]: %timeit -n1 for r in P.parseConsumeString(P.TopLevel.ignore(P.COMMENT), sample, True, True, False): pass
1 loops, best of 3: 1min 22s per loop

Thanks,
Dan


------------------------------------------------------------------------------