html - Finding a regexp pattern not preceded by something
I have the following HTML file structure:
<table>
  <tr class="heading">
    <td colspan="2">
      <h2 class="groupheader">public types</h2> <!-- don't want that! we're in table.-->
    </td>
  </tr>
  <tr>...</tr>
</table>
<h2 class="groupheader">detailed description</h2> <!-- want until next h2-->
<div class="textblock"><p>provides functions control generation of single data log file. </p>
<h4>example</h4>
<div class="fragment"><div class="line">test <a href="aaa">stuff</a>();</div>
<div class="line">...</div>
<div class="line">...</div>
</div>
</div> <!-- end of first result -->
<h2 class="groupheader">member</h2> <!-- want until next h2 or hr-->
<a class="anchor"></a>
<div class="memitem">
<div class="memproto">
  <table class="memname">
    <tr>
      <td class="memname">enum <a class="el" href="...">test</a></td>
    </tr>
  </table>
</div><div class="memdoc">
<hr><!-- end of 2nd result -->
Using a regexp, I need the content between each title and the next title or hr tag, except when the title is inside a table.
So far, I've got the h2->h2|hr content. It goes like this:
(?s)(<h2 class="groupheader">.*?)(<h2|<hr)
How can I skip the content under an h2 that is contained in a table? I've tried noodling with a negative lookbehind, but I'm not getting anywhere.
Thanks for the help.
Note that HTML should be parsed with an appropriate parser.
Now, since you've left HTML-looking input, and the task is to get the "content between each title and the next title or hr tag, except if it's in a table", let me show how it can be done.
You can obtain the substrings you need with the help of a tempered greedy token ((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)
(it matches any symbol that does not start one of the alternatives in the negative lookahead, so the match cannot run past a closing </table> tag, the next <h2, or an <hr,
while a complete inner <table>...</table> is consumed as a unit) and a positive lookahead at the end:
(?s)<h2 class="groupheader">[^<]*<\/h2>\s*((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)(?=<h2|<hr)
See the demo.
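To try the pattern outside a regex tester, here is a minimal Python sketch. The pattern is the one given above, unchanged; the SAMPLE string is a shortened stand-in for the markup in the question, and the helper name extract_sections is only illustrative.

```python
import re

# Pattern from the answer, unchanged. (?s) lets "." match newlines too.
PATTERN = re.compile(
    r'(?s)<h2 class="groupheader">[^<]*<\/h2>\s*'
    r'((?:(?!<\/table|<h2|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)'
    r'(?=<h2|<hr)'
)

# Shortened, made-up version of the question's markup (assumption).
SAMPLE = (
    '<table><tr class="heading"><td colspan="2">'
    '<h2 class="groupheader">public types</h2>'
    '</td></tr><tr>...</tr></table>'
    '<h2 class="groupheader">detailed description</h2>'
    '<div class="textblock"><p>text</p></div>'
    '<h2 class="groupheader">member</h2>'
    '<div class="memproto"><table class="memname">'
    '<tr><td>enum</td></tr></table></div>'
    '<hr>'
)


def extract_sections(html):
    """Return the captured body (group 1) of every matching section."""
    return [m.group(1) for m in PATTERN.finditer(html)]


sections = extract_sections(SAMPLE)
for body in sections:
    print(body)
```

The heading inside the outer table produces no match: its body stops at </table>, where the final lookahead for <h2 or <hr fails. The nested memname table in the last section is swallowed whole by the <table\b[^<]*>.*?<\/table> alternative, so it does not end the match early.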
Note that instead of h2 you can use h\d+ to support any level of h.
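One possible reading of that note, sketched in Python: replace h2 with h\d+ in the opening and closing tags, and with h\d in both lookaheads. This is an assumption about how the substitution is meant to be applied, and the sample string is made up. Under this reading, any heading in the body ends the section, so an inner heading such as the <h4>example</h4> in the question's markup would cut that result short. The closing tag is also not forced to have the same level as the opening one.

```python
import re

# Assumed generalization: h\d+ / h\d wherever the original pattern had h2.
ANY_LEVEL = re.compile(
    r'(?s)<h\d+ class="groupheader">[^<]*<\/h\d+>\s*'
    r'((?:(?!<\/table|<h\d|<hr)(?:<table\b[^<]*>.*?<\/table>|.))*)'
    r'(?=<h\d|<hr)'
)

# Made-up input with two different heading levels.
SAMPLE_LEVELS = (
    '<h3 class="groupheader">overview</h3><p>intro</p>'
    '<h2 class="groupheader">member</h2><p>details</p><hr>'
)

# With a single capture group, findall returns the group-1 strings.
bodies = ANY_LEVEL.findall(SAMPLE_LEVELS)
for body in bodies:
    print(body)
```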