python - How to iterate through XML data to remove the next duplicate element using lxml

- May 15, 2010

I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one.

Example:

From this "input":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

I would like to get this "output":

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
</root>

To do this, I came up with the following code:

from io import StringIO

from lxml import etree

xml = '''
<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data2</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>'''

# simulate the above XML being read from a file
xml_file = StringIO(xml)

# read the XML from the file
tree = etree.parse(xml_file)
root = tree.getroot()

# iterate over the "b" elements
for element in root.iter('b'):
    # Check whether the last "b" element has been reached.
    # On the last element getnext() returns None, so accessing ".attrib"
    # raises an AttributeError exception, which terminates the loop.
    try:
        # attributes of the current element
        elem_attrib_act = element.attrib
        # attributes of the next element
        elem_attrib_next = element.getnext().attrib
    except AttributeError:
        # if there is no other element, break
        break
    print('attributes of current elem:', elem_attrib_act,
          'attributes of next elem:', elem_attrib_next)
    if elem_attrib_act == elem_attrib_next:
        print('next elem is a duplicate of the current one -> remove it')
        # Removing the next element with this approach is not working:
        # if I uncomment it, it removes the element of "data2" but then stops.
        # How do I remove the next duplicate element?
        #element.getparent().remove(element.getnext())
    else:
        print('next elem is not a duplicate of the current one')

print('result:')
print(etree.tostring(root, encoding='unicode'))

Uncommenting the line

#element.getparent().remove(element.getnext())

removes the element around "data2" but then stops execution. The resulting XML is this one:

<root>
    <b attrib1="abc" attrib2="def">
        <c>data1</c>
    </b>
    <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data4</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data5</c>
    </b>
    <b attrib1="abc" attrib2="def">
        <c>data6</c>
    </b>
</root>

My impression is that I am "cutting the branch I am sitting on"...

Any suggestions on how to solve this one?

I think your suspicion is correct. If you put a print statement before the break in the except block, you can see that it's breaking because this element has already been removed (I think):

<b attrib1="abc" attrib2="def">
    <c>data2</c>
</b>
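The following is a minimal sketch of that suspicion, assuming lxml is installed. It does not reproduce the original loop. It only shows that an element removed from its parent is detached, so getnext() on it returns None, and reading .attrib on None is what raises the AttributeError. The three-element document and the variable names are made up for illustration.

```python
from lxml import etree

# Hypothetical three-element document, only for this demonstration
root = etree.fromstring('<root><b n="1"/><b n="2"/><b n="3"/></root>')
first, second, third = list(root)

# While attached, the second element has a following sibling
print(second.getnext() is third)

root.remove(second)

# After removal the element is detached: no parent and no siblings
print(second.getparent())
print(second.getnext())

try:
    second.getnext().attrib
except AttributeError as exc:
    # This is the same kind of error that ends the loop in the question
    print('AttributeError:', exc)
```

If the loop's iterator hands back such a detached element on the next step, which seems to be what happens here, the except block fires and the loop ends early.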

Try using getprevious() instead of getnext(). I have also updated it to use a list comprehension, which avoids the error on the first element (which of course would raise an exception at .getprevious()):

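# Build the list first so that removing elements does not disturb the loop,
# and skip the first element because it has no previous sibling
for element in [e for e in root.iter('b')][1:]:
    try:
        if element.getprevious().attrib == element.attrib:
            element.getparent().remove(element)
    except AttributeError:
        # getprevious() returned None
        print('except')
print(etree.tostring(root, encoding='unicode'))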

Results:

<root>
<b attrib1="abc" attrib2="def">
    <c>data1</c>
</b>
<b attrib1="uvw" attrib2="xyz">
    <c>data3</c>
</b>
<b attrib1="abc" attrib2="def">
    <c>data4</c>
</b>
</root>
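As a variation on the same idea, here is a sketch that loops over a snapshot of the children and compares each one with the last element that was kept, so nothing is looked up on an element that has already been removed. The helper name drop_consecutive_duplicates is hypothetical, not part of lxml. The sketch only considers children with the given tag and only uses calls that lxml shares with the standard library's ElementTree, so it falls back to the latter if lxml is not available.

```python
try:
    from lxml import etree  # as used in the question
except ImportError:
    # Assumption: only calls shared with the standard library are used below
    import xml.etree.ElementTree as etree

XML = """<root>
    <b attrib1="abc" attrib2="def"><c>data1</c></b>
    <b attrib1="abc" attrib2="def"><c>data2</c></b>
    <b attrib1="uvw" attrib2="xyz"><c>data3</c></b>
    <b attrib1="abc" attrib2="def"><c>data4</c></b>
    <b attrib1="abc" attrib2="def"><c>data5</c></b>
    <b attrib1="abc" attrib2="def"><c>data6</c></b>
</root>"""


def drop_consecutive_duplicates(parent, tag):
    """Remove each `tag` child whose attributes equal those of the last kept one."""
    last_kept = None
    # list(parent) is a snapshot, so removing children cannot disturb the loop
    for child in list(parent):
        if child.tag != tag:
            continue
        if last_kept is not None and dict(child.attrib) == dict(last_kept.attrib):
            parent.remove(child)
        else:
            last_kept = child


root = etree.fromstring(XML)
drop_consecutive_duplicates(root, "b")
remaining = [b.find("c").text for b in root]
print(remaining)  # ['data1', 'data3', 'data4']
```

Comparing against the last kept element rather than the previous sibling gives the same result for this input, and it keeps working when several duplicates follow each other.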
