Dan Lenski
2014-10-27 15:57:10 UTC
I'm using PyParsing to parse some rather large text files with a C-like
format (braces and semicolons and all that). PyParsing works just great,
but it is slow and consumes a very large amount of memory due to the
size of my files.
I wanted to try to implement an incremental parsing approach wherein I'd
parse the top-level elements of the source file one-by-one.
The scanString method seems like the obvious way to do this. However, I
want to make sure that there is no invalid/unparseable text in-between
the sections parsed by scanString, and can't figure out a good way to do
this. Here's a simplified example that shows the problem I'm having:
sample="""f1(1,2,3); f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""
from pyparsing import *
COMMENT=Suppress('#' + restOfLine())
SEMI,COMMA,LPAREN,RPAREN = map(Suppress,';,()')
ident = Word(alphas, alphanums+"_")
integer = Word(nums+"+-",nums)
# Parentheses let the expression continue onto the next line.
statement = (ident("fn") + LPAREN +
             Group(Optional(delimitedList(integer)))("arguments") + RPAREN + SEMI)
p = statement.ignore(COMMENT)
for res, start, end in p.scanString(sample):
    print("***** (%d,%d)" % (start, end))
    print(res.dump())
When I run this, the ranges returned by scanString are discontiguous
due to unparsed text between them ((0,10),(11,25),(53,62),(88,98)).
Two of these gaps are whitespace or comments, which should not trigger
an error, but one of them (`this should be an error;`) contains
unparseable text, which I want to catch.
Is there a way to use pyparsing to parse a file incrementally while
still ensuring that the entire input could be parsed with the specified
parser grammar? Perhaps it is possible to make scanString "greedy" so
that it parses valid whitespace or comments following each range? If so,
that would help me resolve this issue since it would ensure that gaps
only occur between the returned ranges when there's an error.
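To make the check concrete, here is a rough sketch of validating the gaps outside of pyparsing, assuming comments always run from '#' to the end of the line. The helper name find_bad_gaps is hypothetical and not part of pyparsing; it only consumes the (start, end) pairs that scanString yields, and here it is fed the ranges quoted above rather than a live parser.

```python
import re
from itertools import chain

# Assumption: a comment runs from '#' to the end of the line.
COMMENT_RE = re.compile(r"#[^\n]*")

def find_bad_gaps(text, ranges):
    """Return (start, end, leftover) for each gap that holds real text.

    `ranges` is an iterable of (start, end) pairs in ascending order,
    such as those produced by scanString. It is consumed lazily, and
    the tail after the last match is checked as well.
    """
    bad = []
    pos = 0
    sentinel = [(len(text), len(text))]  # forces a check of the tail
    for start, end in chain(ranges, sentinel):
        gap = text[pos:start]
        # Drop comments, then whitespace; anything left was never parsed.
        leftover = COMMENT_RE.sub("", gap).strip()
        if leftover:
            bad.append((pos, start, leftover))
        pos = end
    return bad

sample = """f1(1,2,3); f2_no_args( );
# comment out: foo(4,5,6);
bar(7,8);
this should be an error;
baz(9,10);
"""
# The ranges reported by scanString for this sample.
ranges = [(0, 10), (11, 25), (53, 62), (88, 98)]

for start, end, leftover in find_bad_gaps(sample, ranges):
    print("unparsed text at (%d,%d): %r" % (start, end, leftover))
```

This works, but it duplicates the comment syntax outside the grammar, which is why a "greedy" scanString would be preferable.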
Thanks,
Dan Lenski
PS-I found this related thread on the discussion board, but there
doesn't appear to be a resolution for this issue in it:
http://pyparsing.wikispaces.com/share/view/30891763
------------------------------------------------------------------------------