[Pyparsing] Problems with white space in line break aware parsing

Discussion:

Hans-Peter Jansen

2013-09-23 12:19:40 UTC

Hi,

after years of writing hand-crafted parsers for many reasons, a new task
seemed like a good candidate for starting with pyparsing. The first
steps look very promising, BTW. Fiddling with regexps can be quite
mind-boggling, while using these more or less simple Python expressions is
much handier.

I have to process some machine-generated PDF content, and I have no
influence on how it is created.

After extracting the text with PDFMiner, I have to parse what some people
would call an unholy mess. The major point is that it depends on line
breaks and empty lines.

Attached is my starting point. Please excuse some German labels.

The script tries to parse the address data in three different forms, but
address1 is the one that creates problems. The 4th address in the test data
contains such a beast. The problem here is that the line between "Herrn
Pumuckl" and "Bibi Blocksberggasse 1" contains a blank. I try to detect an
empty line with:

ParserElement.setDefaultWhitespaceChars(' \t\r')
NL = LineEnd().suppress()
empty = (NL + NL).suppress()

Although the blank is one of the default whitespace chars, it seems to get
in the way of the empty expression test. Why?

Let me know if the script is still too complex; I can reduce it, but this
might help those who try to achieve something similar.

Thanks in advance,
Pete

Hans-Peter Jansen

2013-09-23 13:04:27 UTC

Something removed the script. Hmm.

Inlined below..

# -*- coding: utf-8 -*-

from pyparsing import *

ParserElement.setDefaultWhitespaceChars(' \t\r')
NL = LineEnd().suppress()
empty = (NL + NL).suppress()

line = restOfLine + NL
line.setParseAction(lambda t: [t[0].strip()])

name1 = line('name1')
name2 = line('name2')
strasse = line('strasse')
plz = Word(alphanums).setResultsName('plz')
ort = line('ort')
land = line('land')
bestimmt = Literal(u'Bestimmt für').suppress()

address1 = Group(name1 + name2 + empty + strasse + plz + ort + land) + empty
address2 = Group(name1 + name2 + strasse + plz + ort + land) + bestimmt
address3 = Group(name1 + strasse + plz + ort + land) + bestimmt

address = empty + Suppress(u'Warenempfänger') + empty + (address1 ^ address2 ^ address3)

teststr = u"""

Warenempfänger

Metronom Tick-Tack
12, Zone Industrielle Schéleck 22
3225 Bettembourg
Luxemburg
Bestimmt für

Warenempfänger

Humfti-Bumfti AG
Herr Wichtig
Landwehrstr. 1
34454 Bad Arolsen-Mengeringhausen
Deutschland
Bestimmt für

Warenempfänger

Fa. Simsalabim

Im Acker 88
76437 Rastatt
Deutschland
Bestimmt für

Warenempfänger

Hotzenplotz GmbH
Herrn Pumuckl

Bibi Blocksberggasse 1
66955 Pirmasens
Deutschland

Warenempfänger

Uga Uga

Am Nashorn 66
66424 Homburg / Saar
Deutschland
Bestimmt für

"""

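import logging
log = logging.getLogger(__name__)

# scanString yields (tokens, start, end) for every match found in the text
for idx, (tok, sloc, eloc) in enumerate(address.scanString(teststr)):
    try:
        print('page %s: (0x%x, 0x%x): \n%s' % (idx, sloc, eloc, tok[0].asDict()))
    except ParseException as err:
        log.error('page %s: %s' % (idx, err))
        log.error(err.line)
        log.error(' ' * (err.column - 1) + '^')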

Hans-Peter Jansen

2013-09-23 23:28:28 UTC

Got it,

it was a matter of excluding the right things..

Sorry for the disturbance,
Pete
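
The thread does not record what the final fix looked like. If lines that hold only a blank are what breaks the empty-line test, one possible workaround (not necessarily what was done here) is to normalize such lines before the text reaches the grammar. The sketch below uses only the standard `re` module; `normalize_blank_lines` is a hypothetical helper name, not part of pyparsing or of the script above.

```python
import re


def normalize_blank_lines(text):
    """Empty out lines that contain nothing but spaces, tabs or carriage returns.

    Hypothetical pre-processing helper (an assumption, not from the thread):
    after this step an "empty" line really is two consecutive newlines.
    """
    # (?m) lets ^ and $ match at every line boundary, not only at the
    # start and end of the whole string.
    return re.sub(r'(?m)^[ \t\r]+$', '', text)


# The middle line holds a single blank, as described for the 4th address.
sample = u"Hotzenplotz GmbH\nHerrn Pumuckl\n \nBibi Blocksberggasse 1\n"
cleaned = normalize_blank_lines(sample)
print(repr(cleaned))
```

The cleaned text could then be passed to `address.scanString(...)` in place of the raw string. Lines with real content are left alone, including any trailing blanks they carry.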