Python - How to iterate through XML data to remove the next duplicate element using lxml
I am struggling to come up with a simple solution that iterates over XML data and removes the next element if it is a duplicate of the current one.
Example:
From this "input":
    <root>
      <b attrib1="abc" attrib2="def">
        <c>data1</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data2</c>
      </b>
      <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data4</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data5</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data6</c>
      </b>
    </root>
I would like to get this "output":
    <root>
      <b attrib1="abc" attrib2="def">
        <c>data1</c>
      </b>
      <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data4</c>
      </b>
    </root>
To do this, I came up with the following code:
    from io import StringIO

    from lxml import etree

    xml = '''
    <root>
      <b attrib1="abc" attrib2="def">
        <c>data1</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data2</c>
      </b>
      <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data4</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data5</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data6</c>
      </b>
    </root>'''

    # Simulate the above XML being read from a file
    file = StringIO(xml)

    # Read the XML from the file
    tree = etree.parse(file)
    root = tree.getroot()

    # Iterate over the "b" elements
    for element in root.iter('b'):
        # Check whether the last "b" element has been reached.
        # On the last element, getnext() returns None, so accessing .attrib
        # raises an AttributeError, which terminates the loop.
        try:
            # Attributes of the current element
            elem_attrib_act = element.attrib
            # Attributes of the next element
            elem_attrib_next = element.getnext().attrib
        except AttributeError:
            # If there is no other element, break
            break
        print('attributes of actual elem:', elem_attrib_act,
              'attributes of next elem:', elem_attrib_next)
        if elem_attrib_act == elem_attrib_next:
            print('next elem is a duplicate of the actual one -> remove it')
            # Removing the next element with this approach is not working.
            # If you uncomment it, it removes the element of "data2" but then stops.
            # How can the next duplicate element be removed?
            #element.getparent().remove(element.getnext())
        else:
            print('next elem is not a duplicate of the actual one')

    print('result:')
    print(etree.tostring(root))
Uncommenting the line
#element.getparent().remove(element.getnext())
removes the element around "data2" but then stops execution. The resulting XML is this one:
    <root>
      <b attrib1="abc" attrib2="def">
        <c>data1</c>
      </b>
      <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data4</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data5</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data6</c>
      </b>
    </root>
My impression is that I "cut the branch I am sitting on"...
Any suggestions on how to solve this one?
I think your suspicion is correct. If you put a print statement before the break in the except block, you can see that it is breaking because the element has been removed (I think):
    <b attrib1="abc" attrib2="def">
      <c>data2</c>
    </b>
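A minimal sketch of what removal does to an element in lxml may help here. It uses a shortened, hypothetical version of the sample XML and only shows that a removed element is detached from the tree, so its getnext() returns None. That would explain the early break if, as the output above suggests, the iterator hands back the just-removed element on its next step.

```python
from lxml import etree

# Shortened, hypothetical version of the sample data
root = etree.fromstring(
    '<root>'
    '<b attrib1="abc"><c>data1</c></b>'
    '<b attrib1="abc"><c>data2</c></b>'
    '<b attrib1="uvw"><c>data3</c></b>'
    '</root>'
)

first, second, third = root  # the three "b" children
root.remove(second)

# The removed element is detached: it has no parent and no next sibling.
parent_after_removal = second.getparent()
next_after_removal = second.getnext()
print(parent_after_removal, next_after_removal)

# The element that stayed in the tree now points past the removed one.
following_text = first.getnext()[0].text
print(following_text)

# The same expression as in the question now fails on the removed element.
try:
    second.getnext().attrib
    raised = False
except AttributeError:
    raised = True
print('AttributeError raised:', raised)
```

If the loop receives the detached element, the try block raises AttributeError straight away and the break is taken, even though more "b" elements remain in the tree.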
Try using getprevious() instead of getnext(). I have updated the code to use a list comprehension, which avoids an error on the first element (whose .getprevious() returns None and would of course raise an exception):
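    # Build the full list first, so removing elements does not disturb the loop,
    # and skip the first "b" element because it has no previous sibling.
    for element in [e for e in root.iter('b')][1:]:
        try:
            if element.getprevious().attrib == element.attrib:
                element.getparent().remove(element)
        except AttributeError:
            print('except')
    print(etree.tostring(root))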
Results:
    <root>
      <b attrib1="abc" attrib2="def">
        <c>data1</c>
      </b>
      <b attrib1="uvw" attrib2="xyz">
        <c>data3</c>
      </b>
      <b attrib1="abc" attrib2="def">
        <c>data4</c>
      </b>
    </root>
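As an alternative sketch (not the code from the answer above), the same result can be reached by looping over a snapshot list of the children and comparing each one with the last element that was kept. The helper name remove_consecutive_duplicates is hypothetical, and the fallback to the standard library xml.etree.ElementTree is an assumption made so that the example runs without lxml; it only uses calls that both libraries provide.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library if lxml is missing
    import xml.etree.ElementTree as etree

XML = (
    '<root>'
    '<b attrib1="abc" attrib2="def"><c>data1</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data2</c></b>'
    '<b attrib1="uvw" attrib2="xyz"><c>data3</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data4</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data5</c></b>'
    '<b attrib1="abc" attrib2="def"><c>data6</c></b>'
    '</root>'
)


def remove_consecutive_duplicates(parent, tag):
    """Remove each `tag` child whose attributes equal those of the last kept one."""
    last_kept = None
    # findall() returns a list, so removing children does not disturb the loop.
    for child in parent.findall(tag):
        if last_kept is not None and dict(child.attrib) == dict(last_kept.attrib):
            parent.remove(child)
        else:
            last_kept = child


root = etree.fromstring(XML)
remove_consecutive_duplicates(root, 'b')

remaining = [b.find('c').text for b in root.findall('b')]
print(remaining)  # ['data1', 'data3', 'data4']
```

Note that this compares against the last kept "b" child rather than the immediate previous sibling, so the two approaches only agree when, as in the sample data, no other elements sit between the "b" elements.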