Hans-Peter Jansen
2013-09-23 12:19:40 UTC
Hi,
after years of writing hand-crafted parsers for various purposes, a new task
looked like a good candidate for getting started with pyparsing. The first
steps look very promising, BTW. Fiddling with regexps can be quite
mind-boggling, while composing more or less simple Python expressions is much
handier.
I have to process some machine-generated PDF content, and I have no
influence on how it is created.
After extracting the text with PDFMiner, I have to parse what some people
would call an unholy mess. The major point is that it depends on line
breaks and empty lines.
Attached is my starting point. Please excuse some German labels...
The script tries to parse the address data in three different forms, but
address1 is the one that causes problems. The 4th address in the test data
contains such a beast. The problem is that the line between "Herr Pumuckl"
and "Bibi Blocksbergstrasse" contains a blank. I try to detect an empty line
with:
ParserElement.setDefaultWhitespaceChars(' \t\r')
NL = LineEnd().suppress()
empty = (NL + NL).suppress()
Although the blank is part of the default whitespace chars, it seems to get
in the way of the empty expression test. Why?
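One possible workaround, which sidesteps the question rather than answering
it, is to normalize the extracted text before it reaches the parser, so that a
line holding only blanks or tabs becomes a truly empty line. The sketch below
assumes the text is available as a single string; the name
normalize_blank_lines is hypothetical and is neither part of the attached
script nor of pyparsing.

```python
import re

# Matches a line that consists only of blanks, tabs or carriage returns.
# With re.MULTILINE, ^ and $ anchor at every line start and line end.
_BLANK_ONLY = re.compile(r"^[ \t\r]+$", re.MULTILINE)


def normalize_blank_lines(text):
    """Hypothetical helper: turn whitespace-only lines into empty lines.

    Lines with real content are left untouched, including any trailing
    whitespace they may carry.
    """
    return _BLANK_ONLY.sub("", text)


# The sample uses the two names quoted in this post, separated by a line
# that contains a single blank.
sample = "Herr Pumuckl\n \nBibi Blocksbergstrasse\n"
cleaned = normalize_blank_lines(sample)
```

After this step the two names are separated by two consecutive newlines, which
is the pattern the empty expression above is meant to detect.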
Let me know if the script is still too complex. I can reduce it, but it
might help those who try to achieve something similar.
Thanks in advance,
Pete