[Pyparsing] HTML Injection OR Word boundary detection from HTML

Discussion:

Geoff Jukes

2013-01-15 19:09:36 UTC

Hi,

First - Sorry for the long email and lack of PyParsing example code.

I'm trying to modify some HTML, wrapping 'words' in 'MARK' tags. I've tried
BeautifulSoup, HTMLParser, and Regex's, all with limited success. I think
PyParsing is the right solution - all the other solutions are more for
scraping/extracting data from HTML.

I hate asking questions without some code, but I'm so new to PyParsing that
I really am not sure where to start. My gut tells me it's the right tool for
the job though. Can anyone help me?

Take the following HTML as an example:

-----

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">

<head>

<title>Some Book</title>

<link rel="stylesheet"
type="application/vnd.adobe-page-template+xml" href="page-template.xpgt"/>

<style>

.italic {font-style:
italic;}

.bold {font-style: bold;}

</style>

</head>

<body class="text" id="text">

<div class="chapter" id="ch02">

<div class="chapterHead">

<h2
class="chapterNumber">Some Book</h2>

</div>

<div class="chapterBody">

I have so many National
Geographic’s at home.

</div>

</div>

</body>

</html>

-----

There are 2 lines of interest:

-----

I have so many National
Geographic’s at home.

<h2 class="chapterNumber">Some Book</h2>

-----

I am trying to wrap the 'words' in 'MARK' tags. So my 'perfect' result would
be:

-----

<h2 class="chapterNumber">Some
Book</h2>

I have so many National Geographic’s at home.

-----

Now there is obviously some complexity in there, over and above the 'mark'
injection. For example, the word "Geographic's" is split by a close-span,
which started before the word 'National'. So the formatting is 'replayed'
during the injection. There are also some differences in the 'MARK' location
- Sometimes 'tight' to the word, sometimes with 'SPAN' tags inside.

I don't expect PyParsing to be able to do that for me (I would love it if it
could!) and so I am happy to have PyParsing generate 'broken' HTML that I
will fix up in post-processing. So the following output would be acceptable:

-----

<h2 class="chapterNumber">Some
Book</h2>

I have so many National Geographic’s at home.

-----

Note that the "National Geographic's" tags are now 'broken'. A 'word' can be
described as: any text that is terminated by a space or punctuation, but
excluding the terminator. An added complexity in my full file is that quote
marks can also terminate a word, but only if the mark is not an apostrophe
(e.g. "I'm excited" has 2 words: I'm, excited). And quotes could be HTML
entities. But again, I am happy to deal with that in post-processing.
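
The word rule above can be sketched with a regular expression from the standard library. This is only an illustration of the definition, assuming plain text with no tags or entities in it; the names `WORD` and `find_words` are made up for this example.

```python
import re

# Assumption: a word is a run of letters, optionally continued by an
# apostrophe (ASCII or typographic) that is itself followed by letters.
# A quote mark at the edge of a word is therefore left outside the match.
WORD = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def find_words(text):
    """Return (word, start, end) for every word found in plain text."""
    return [(m.group(), m.start(), m.end()) for m in WORD.finditer(text)]

for word, start, end in find_words("I'm excited, 'really' excited."):
    print("%s between %s and %s" % (word, start, end))
```

Here "I'm" stays whole while the quotes around "really" are excluded. An entity such as `&rsquo;` is not recognised as an apostrophe by this pattern, so entities would still need separate handling.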

An acceptable alternative would be for PyParsing to return the start and end
locations of 'whole' words (taking into account any interspersed HTML like
the close-span in Geographic's); then I can 'shuffle' the MARK tag injection
in post-processing.
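
One way to picture "start and end locations of whole words" is to build a tag-free copy of the text in which every character remembers where it came from. The sketch below uses only the standard library, not PyParsing; the tag pattern is deliberately naive, and the `span` markup in the sample is hypothetical.

```python
import re

TAG = re.compile(r"<[^>]*>")  # naive: treats any <...> run as a tag
WORD = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def word_spans(html):
    """Return (word, start, end) tuples with offsets into the original html."""
    chars, offsets = [], []
    pos = 0
    # Copy every character that lies outside a tag, remembering its index.
    for tag in list(TAG.finditer(html)) + [None]:
        stop = tag.start() if tag else len(html)
        for i in range(pos, stop):
            chars.append(html[i])
            offsets.append(i)
        pos = tag.end() if tag else stop
    text = "".join(chars)
    # Find words in the tag-free text, then map both ends back.
    return [(m.group(), offsets[m.start()], offsets[m.end() - 1] + 1)
            for m in WORD.finditer(text)]

# Hypothetical markup: a span that closes in the middle of a word.
sample = 'many <span class="italic">National Geo</span>graphic’s at home.'
for word, start, end in word_spans(sample):
    print("%s between %s and %s [%s]" % (word, start, end, sample[start:end]))
```

The third word comes back as one span whose slice of the original is `Geo</span>graphic’s`, tag included. Because tags are simply dropped, text on either side of a block-level tag would be glued together, and comments, `<style>` content and entities are not handled.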

Again, I'm sorry for not posting example code - I'm still wrapping my head
around how PyParsing works. So if anyone can give me pointers, I'm happy to
do the legwork myself! I'm going to spend all day trying to work this out.

Many thanks in advance,

Geoff

Geoff Jukes

2013-01-15 21:01:21 UTC

With the following:

-----
from pyparsing import *
html = EXAMPLE_HTML_FROM_PREVIOUS_POST
word = Word(printables)
# skip over tags and entities when looking for words
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)

text = word
# scanString yields (tokens, start, end) for each match
for w, s, e in text.scanString(html):
    print('%s between %s and %s [%s]' % (w, s, e, html[s:e]))
-----

I get (with some omissions):

-----
....
['I'] between 2443 and 2444 [

Paul McGuire

2013-01-16 04:11:48 UTC

Geoff -

Congratulations on your first steps with pyparsing. You have found
scanString and how it returns the start and end locations of each match.
Pyparsing also includes transformString, a wrapper around scanString that
does the kind of injection you are after. transformString runs each
expression's parse actions, and a parse action can modify or enhance the
matched text by returning a different string than the one passed in through
the tokens argument. See how I've added a parse action to a slightly
different version of your word expression:

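word = Word(alphas, printables, excludeChars='<>&')
word.ignore(anyOpenTag)
word.ignore(anyCloseTag)
word.ignore(commonHTMLEntity)

tagnum = 0
def addMarkTags(tokens):
    # number each word and wrap it in a tag; the tag markup was lost from
    # this archived copy, so the <mark id="N"> form below is an assumption
    global tagnum
    tagnum += 1
    return '<mark id="%d">%s</mark>' % (tagnum, tokens[0])
word.setParseAction(addMarkTags)

print(word.transformString(html))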

This will print:

<h2 class="chapterNumber">Some Book</h2>

I have so many National Geographic&rsquo;s at home.

I think transformString is the avenue to follow for this project.
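
To make the "wrapper around a scan" idea concrete, here is a rough standard-library sketch of the mechanism. It is not pyparsing's implementation; `scan`, `transform` and `add_mark` are illustrative names, the regular expression is a simplified stand-in for a word expression, and the `<mark id="N">` output format is an assumption.

```python
import itertools
import re

WORD = re.compile(r"[A-Za-z]+")  # simplified stand-in for a word expression

def scan(text):
    """Yield (token, start, end) for each match, in document order."""
    for m in WORD.finditer(text):
        yield m.group(), m.start(), m.end()

def transform(text, action):
    """Rebuild text, replacing each match with action(token)."""
    out, last_end = [], 0
    for token, start, end in scan(text):
        out.append(text[last_end:start])  # copy unmatched text verbatim
        out.append(action(token))         # substitute the action's result
        last_end = end
    out.append(text[last_end:])           # trailing unmatched text
    return "".join(out)

ids = itertools.count(1)

def add_mark(token):
    # Assumed output format: a numbered mark tag around each word.
    return '<mark id="%d">%s</mark>' % (next(ids), token)

print(transform("Some Book", add_mark))
# prints: <mark id="1">Some</mark> <mark id="2">Book</mark>
```

Everything between matches is copied through untouched, which is why skipped tags and whitespace survive the rewrite.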

-- Paul
