[Pyparsing] more refinement but still lost

Discussion:

Eric S. Johansson

2012-09-18 02:57:35 UTC

Refining the testing process a little more, I've come up with some simple test cases that represent actual usage. I'm still not grokking the given example. Heck, I'm having trouble generating a BNF description. Funny how when you design for human speech, you get something hard to parse. :-)

How does a parser like this handle recursion? For example:

"test_9": "[nested_text [one text] some [not [very ] plain] text]",

I expect to walk depth first and on the way back, there are calls to my code so I can do "stuff". I expect something like the following calls in this sequence:

call arg name=one, parent="nested_text", text="text"
call found_plain_text, text="some"
call arg name=very, parent="not", text="plain"
call keyword name=not
call found_plain_text, text="text"
call keyword name=nested_text
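
The call order described above is a depth-first, post-order walk: children are handled first, and the enclosing element's own callback fires on the way back up. A minimal sketch in plain Python, assuming the parser has already turned the input into nested lists (strings for words, lists for bracketed groups, first item of each list being its name). The `walk` function and the event names are hypothetical and not part of pyparsing:

```python
def walk(node, events, parent=None):
    """Visit a nested list depth first, recording events on the way back up.

    Assumes node[0] is the element name, other strings are plain text and
    nested lists are child elements. Event names are illustrative only.
    """
    name = node[0] if node and isinstance(node[0], str) else None
    children = node[1:] if name is not None else node
    for item in children:
        if isinstance(item, list):
            walk(item, events, parent=name)      # descend first
        else:
            events.append(("text", item, name))  # plain text inside `name`
    # the element's own callback fires only after all of its children
    events.append(("element", name, parent))
    return events


tree = ["nested_text", ["one", "text"], "some", ["not", ["very"], "plain"], "text"]
for event in walk(tree, []):
    print(event)
```

Each tuple stands in for one call into user code; replacing `events.append(...)` with real function calls gives the "do stuff on the way back" behaviour.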

Anyway, here are my latest test cases and results. I'm really lost here. The docs are not helping. I need a mentor chat.

from pyparsing import *

all_tests = {
"test_1": "some plain text",
"test_2": "[simple ]",
"test_3": "[simple_text some plain text]",
"test_4": "[onearg [one ]]",
"test_5": "[twoarg [one ] [two ]]",
"test_6": "[onearg_text [one some plain text]]",
"test_7": "[twoarg_text [one ] [two some plain text arg]]",
"test_8": "[nested_text some [not plain] text]",
"test_9": "[nested_text [one text] some [not [very ] plain] text]",
"test_10": "[nested_text_escaped [one text] some [not [very ] plain] bracketed \[text\]]",
"test_11": """[nested_text_escaped_indented
[one text] some
[not
[very ]
plain
]
bracked \[text\]
]""",

}

LBRACK,RBRACK = map(Suppress,'[]')
escapedChar = Combine('\\' + oneOf(list(printables)))
keyword = Word(alphas,alphanums).setName("keyword").setDebug()
argword = Word(alphas,alphanums).setName("argword").setDebug()
arg = Forward()
dss = Forward()

text = ZeroOrMore(escapedChar | originalTextFor(OneOrMore(Word(printables,excludeChars='[]"\'\\'))) | quotedString | dss)

arg << Group(LBRACK + argword("arg") + Group(text)("text") + RBRACK)
arg.setName("arg").setDebug()

dss << Group(LBRACK + keyword("keyword") + Group(ZeroOrMore(arg))("args") +
Group(text)("text") + RBRACK)

parser = ZeroOrMore(dss)

j = ""

for i,j in all_tests.items():
    print("------ %s ---------" % i)
    test = parser.parseString(j)
    print("keyword is: %s" % test.keyword)
    print("arg is: %s" % test.arg)
    print("text is: %s" % test.text)

------------------------- results ---------------------------
------ test_11 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['nested']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
Match keyword at loc 60(2,30)
Matched keyword -> ['one']
Match arg at loc 64(2,34)
Exception raised:Expected "[" (at char 64), (line:2, col:34)
Match keyword at loc 105(3,30)
Matched keyword -> ['not']
Match arg at loc 145(4,36)
Match argword at loc 146(4,37)
Matched argword -> ['very']
Matched arg -> [['very', []]]
Match arg at loc 152(4,43)
Exception raised:Expected "[" (at char 189), (line:5, col:36)
keyword is:
arg is:
text is:
------ test_10 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['nested']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
Match keyword at loc 22(1,23)
Matched keyword -> ['one']
Match arg at loc 26(1,27)
Exception raised:Expected "[" (at char 26), (line:1, col:27)
Match keyword at loc 38(1,39)
Matched keyword -> ['not']
Match arg at loc 42(1,43)
Match argword at loc 43(1,44)
Matched argword -> ['very']
Matched arg -> [['very', []]]
Match arg at loc 49(1,50)
Exception raised:Expected "[" (at char 50), (line:1, col:51)
keyword is:
arg is:
text is:
------ test_7 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['twoarg']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
Match keyword at loc 14(1,15)
Matched keyword -> ['one']
Match arg at loc 18(1,19)
Exception raised:Expected "[" (at char 18), (line:1, col:19)
Match keyword at loc 21(1,22)
Matched keyword -> ['two']
Match arg at loc 25(1,26)
Exception raised:Expected "[" (at char 25), (line:1, col:26)
keyword is:
arg is:
text is:
------ test_6 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['onearg']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
Match keyword at loc 14(1,15)
Matched keyword -> ['one']
Match arg at loc 18(1,19)
Exception raised:Expected "[" (at char 18), (line:1, col:19)
keyword is:
arg is:
text is:
------ test_5 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['twoarg']
Match arg at loc 8(1,9)
Match argword at loc 9(1,10)
Matched argword -> ['one']
Matched arg -> [['one', []]]
Match arg at loc 14(1,15)
Match argword at loc 16(1,17)
Matched argword -> ['two']
Matched arg -> [['two', []]]
Match arg at loc 21(1,22)
Exception raised:Expected "[" (at char 21), (line:1, col:22)
keyword is:
arg is:
text is:
------ test_4 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['onearg']
Match arg at loc 8(1,9)
Match argword at loc 9(1,10)
Matched argword -> ['one']
Matched arg -> [['one', []]]
Match arg at loc 14(1,15)
Exception raised:Expected "[" (at char 14), (line:1, col:15)
keyword is:
arg is:
text is:
------ test_3 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['simple']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
keyword is:
arg is:
text is:
------ test_2 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['simple']
Match arg at loc 8(1,9)
Exception raised:Expected "[" (at char 8), (line:1, col:9)
keyword is:
arg is:
text is:
------ test_1 ---------
keyword is:
arg is:
text is:
------ test_9 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['nested']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
Match keyword at loc 14(1,15)
Matched keyword -> ['one']
Match arg at loc 18(1,19)
Exception raised:Expected "[" (at char 18), (line:1, col:19)
Match keyword at loc 30(1,31)
Matched keyword -> ['not']
Match arg at loc 34(1,35)
Match argword at loc 35(1,36)
Matched argword -> ['very']
Matched arg -> [['very', []]]
Match arg at loc 41(1,42)
Exception raised:Expected "[" (at char 42), (line:1, col:43)
keyword is:
arg is:
text is:
------ test_8 ---------
Match keyword at loc 1(1,2)
Matched keyword -> ['nested']
Match arg at loc 7(1,8)
Exception raised:Expected "[" (at char 7), (line:1, col:8)
Match keyword at loc 19(1,20)
Matched keyword -> ['not']
Match arg at loc 23(1,24)
Exception raised:Expected "[" (at char 23), (line:1, col:24)
keyword is:
arg is:
text is:
[Finished in 0.1s]

Paul McGuire

2012-09-18 10:46:00 UTC

all_tests = {
"test_1": "some plain text",
"test_2": "[simple ]",
"test_3": "[simple_text some plain text]",
"test_4": "[onearg [one ]]",
"test_5": "[twoarg [one ] [two ]]",
"test_6": "[onearg_text [one some plain text]]",
"test_7": "[twoarg_text [one ] [two some plain text arg]]",
"test_8": "[nested_text some [not plain] text]",
"test_9": "[nested_text [one text] some [not [very ] plain] text]",
"test_10": "[nested_text_escaped [one text] some [not [very ] plain] bracketed \[text\]]",
"test_11": """[nested_text_escaped_indented
[one text] some
[not
[very ]
plain
]
bracked \[text\]
]""",

}

# a simple BNF:
#
# listExpr ::= '[' listContent ']'
# listContent ::= (contentsWord | escapedChar | listExpr)*
# contentsWord ::= printableCharacter+
#
#
# Some notes:
# 1. listContent could be empty, "[]" is a valid listExpr
# 2. contentsWord cannot contain '\', '[' or ']' characters, or
# else we couldn't distinguish delimiters from contents, or
# detect escapes
#

from pyparsing import *

# start with the basics
LBRACK,RBRACK = map(Suppress,"[]")
escapedChar = Combine('\\' + oneOf(list(printables)))
contentsWord = Word(printables,excludeChars=r"\[]")

# define a placeholder for a nested list, since we need to
# reference it before it is fully defined
listExpr = Forward()

# the contents of a list is zero or more contents words, escaped chars, or lists
listContent = ZeroOrMore(contentsWord | escapedChar | listExpr)

# a list is a listContent enclosed in []'s - enclose
# in a Group so that pyparsing will maintain the nested structure
#
# since listExpr was already defined as a Forward, we use '<<' to
# "inject" the definition into the already defined Forward
listExpr << Group(LBRACK + listContent + RBRACK)

# parse the test string - note that the results no longer contain
# the parsed '[' and ']' characters, but they do retain the
# nesting of the original string in nested lists
for name,testStr in all_tests.items():
    print("%s %s" % (name, listContent.parseString(testStr).asList()))

# pyparsing includes a short-cut to simplify defining nested
# structures like this
print(nestedExpr('[',']').parseString(all_tests['test_9']).asList())

Paul McGuire

2012-09-18 11:55:17 UTC

Sorry for my terse reply earlier - hit send too early!

Eric, you have definitely taken on an ambitious first project for pyparsing.
Writing BNFs takes some practice, but it is important to really get your
thoughts down about how the parser is supposed to work before getting mired
down in Words, Groups, Forwards, etc. In your nested terms, let the
recursion in the BNF take care of nesting []'s - when you have LBRACK/RBRACK
in two different levels of your nesting, it's a sign you should rethink just
how you have defined the contents of this group.

Here's my earlier post, with annotating comments.

-- Paul

all_tests = {
"test_1": "some plain text",
"test_2": "[simple ]",
"test_3": "[simple_text some plain text]",
"test_4": "[onearg [one ]]",
"test_5": "[twoarg [one ] [two ]]",
"test_6": "[onearg_text [one some plain text]]",
"test_7": "[twoarg_text [one ] [two some plain text arg]]",
"test_8": "[nested_text some [not plain] text]",
"test_9": "[nested_text [one text] some [not [very ] plain] text]",
"test_10": "[nested_text_escaped [one text] some [not [very ] plain] bracketed \[text\]]",
"test_11": """[nested_text_escaped_indented
[one text] some
[not
[very ]
plain
]
bracked \[text\]
]""",
}

# a simple BNF:
#
# listExpr ::= '[' listContent ']'
# listContent ::= (contentsWord | escapedChar | listExpr)*
# contentsWord ::= printableCharacter+
#
#
# Some notes:
# 1. listContent could be empty, "[]" is a valid listExpr
# 2. contentsWord cannot contain '\', '[' or ']' characters, or
# else we couldn't distinguish delimiters from contents, or
# detect escapes
#

from pyparsing import *

# start with the basics
LBRACK,RBRACK = map(Suppress,"[]")
escapedChar = Combine('\\' + oneOf(list(printables)))
contentsWord = Word(printables,excludeChars=r"\[]")

# define a placeholder for a nested list, since we need to
# reference it before it is fully defined
listExpr = Forward()

# the contents of a list is zero or more contents words, escaped chars, or lists
listContent = ZeroOrMore(contentsWord | escapedChar | listExpr)

# a list is a listContent enclosed in []'s - enclose
# in a Group so that pyparsing will maintain the nested structure
#
# since listExpr was already defined as a Forward, we use '<<' to
# "inject" the definition into the already defined Forward
listExpr << Group(LBRACK + listContent + RBRACK)

# parse the test string - note that the results no longer contain
# the parsed '[' and ']' characters, but they do retain the
# nesting of the original string in nested lists
for name,testStr in all_tests.items():
    print("%s %s" % (name, listContent.parseString(testStr).asList()))

prints:

test_11 [['nested_text_escaped_indented', ['one', 'text'], 'some', ['not', ['very'], 'plain'], 'bracked', '\\[', 'text', '\\]']]
test_10 [['nested_text_escaped', ['one', 'text'], 'some', ['not', ['very'], 'plain'], 'bracketed', '\\[', 'text', '\\]']]
test_7 [['twoarg_text', ['one'], ['two', 'some', 'plain', 'text', 'arg']]]
test_6 [['onearg_text', ['one', 'some', 'plain', 'text']]]
test_5 [['twoarg', ['one'], ['two']]]
test_4 [['onearg', ['one']]]
test_3 [['simple_text', 'some', 'plain', 'text']]
test_2 [['simple']]
test_1 ['some', 'plain', 'text']
test_9 [['nested_text', ['one', 'text'], 'some', ['not', ['very'], 'plain'], 'text']]
test_8 [['nested_text', 'some', ['not', 'plain'], 'text']]

# pyparsing includes a short-cut to simplify defining nested
# structures like this
print(nestedExpr('[',']').parseString(all_tests['test_9']).asList())
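
For comparison, the three-rule BNF in the comments above also maps directly onto a hand-written recursive-descent parser in plain Python, with the brackets handled in exactly one place. This is only a sketch and not how pyparsing works internally; `parse` and `parse_content` are hypothetical names, and unlike the pyparsing version it also accepts a backslash followed by whitespace as an escape.

```python
def parse_content(text, pos=0):
    """listContent ::= (contentsWord | escapedChar | listExpr)*"""
    items = []
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1                          # whitespace only separates tokens
        elif ch == "[":
            # listExpr ::= '[' listContent ']' -- the only rule that sees brackets
            inner, pos = parse_content(text, pos + 1)
            if pos >= len(text) or text[pos] != "]":
                raise ValueError("missing ']' at position %d" % pos)
            items.append(inner)               # nesting is kept as a nested list
            pos += 1
        elif ch == "]":
            break                             # the caller consumes the ']'
        elif ch == "\\":
            if pos + 1 >= len(text):
                raise ValueError("dangling backslash at position %d" % pos)
            items.append(text[pos:pos + 2])   # escapedChar kept as a 2-char token
            pos += 2
        else:
            # contentsWord: run of characters up to whitespace, '[', ']' or '\'
            start = pos
            while (pos < len(text) and not text[pos].isspace()
                   and text[pos] not in "[]\\"):
                pos += 1
            items.append(text[start:pos])
    return items, pos


def parse(text):
    items, pos = parse_content(text)
    if pos != len(text):
        raise ValueError("unexpected ']' at position %d" % pos)
    return items


print(parse("[nested_text some [not plain] text]"))
```

The recursion in `parse_content` plays the role of the `Forward`: the rule refers to itself, and the bracket handling lives in one branch rather than at two different levels of the grammar.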

Eric S. Johansson

2012-09-18 18:45:16 UTC

----- Original Message -----

Sent: Tuesday, September 18, 2012 7:55:17 AM
Subject: RE: [Pyparsing] more refinement but still lost
> Sorry for my terse reply earlier - hit send too early!

Not a problem. That happens to me with speech recognition errors. Sucks being disabled and then your tools bite you in the ass.

> Eric, you have definitely taken on an ambitious first project for
> pyparsing. Writing BNFs takes some practice, but it is important to really get
> your thoughts down about how the parser is supposed to work before getting
> mired down in Words, Groups, Forwards, etc. In your nested terms, let
> the recursion in the BNF take care of nesting []'s - when you have
> LBRACK/RBRACK in two different levels of your nesting, it's a sign you should
> rethink just how you have defined the contents of this group.

Yes it is an ambitious project. As I've probably said earlier, I am disabled and as a result have become quite smart about user interfaces. Sadly, not being able to write code or webpages gets in the way of proving my intellectual capabilities vis-à-vis getting a job. The usability model behind this project should allow me to show off some of my UI chops.[1]

The syntax was one I inherited from another project, but it turned out to be a great base for a speech-recognition-friendly method of generating webpages. I know that differentiating between arguments and keywords that use the same bracket is problematic. I've been wrestling with that one for quite a while. One model says change the notation between keywords and arguments. Another says change the notation completely so that the problem doesn't come up. I think the fundamental structure should remain the same because it works for speech recognition use. It could use a little boost from a smart editor, but hey, I will live with what I've got. In any case, I'm open to suggestions for reworking the syntax/notation.

<mixed text> ::= [<plaintext>]*
                 [<open square bracket> <keyword>
                     [<open square bracket> <argument> [<mixed text>] <close square bracket>]*
                     [<mixed text>]
                  <close square bracket>]
                 [<plaintext>]*

I think that's the best definition of how it is now. I'm trying to think of a way to make arguments-as-functions go away and be replaced by keyword definitions everywhere.
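
One possible reading of the grammar above, as a minimal sketch in plain Python: once a bracketed group has been parsed into a nested list, the first word is the keyword, any bracketed groups that come directly after it (before any plain text) are arguments, and everything that follows is mixed text. The `split_element` helper and the dictionary layout are hypothetical, and the "leading groups are arguments" rule is an assumption taken from the grammar, not something the parser enforces.

```python
def split_element(node):
    """Split one parsed [ ... ] group into keyword, args and mixed text.

    Assumption: bracketed groups that directly follow the keyword, before
    any plain text, are arguments; everything after that is mixed text.
    """
    keyword, rest = node[0], node[1:]
    args = []
    while rest and isinstance(rest[0], list):
        args.append(rest[0])   # leading nested groups are treated as arguments
        rest = rest[1:]
    return {"keyword": keyword, "args": args, "text": rest}


node = ["nested_text", ["one", "text"], "some", ["not", ["very"], "plain"], "text"]
print(split_element(node))
```

This also shows where the notation is ambiguous: a nested keyword that happens to come right after the outer keyword is indistinguishable from an argument, which is the problem with using the same bracket for both.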

> Here's my earlier post, with annotating comments.

Thank you. I'll take a look later today.

--- eric

[1] https://docs.google.com/document/d/1In11apApKozw_UOPAhVz0ePqns72_6652Dra34xWp4E/edit
http://blog.esjworks.com
core ideas behind using speech recognition for programming