parsing - Split text in a text file on the basis of comma and space (Python)


I need to parse the text of a text file into 2 categories:

  1. university
  2. location (example: lahore, peshawar, jamshoro, faisalabad)

But the text file contains the following text:

    "imperial college of business studies, lahore"
    "government college university faisalabad"
    "imperial college of business studies lahore"
    "university of peshawar, peshawar"
    "university of sindh, jamshoro"
    "london school of economics"
    "lahore school of economics, lahore"

I have written code to separate the locations on the basis of the comma. The code below works for the first line of the file and prints 'lahore', but after that it gives the following error: 'list index out of range'.

    file = open(path, 'r')
    content = file.read().split('\n')
    for line in content:
        rep = line.replace('"', '')
        # take the part after the first comma
        loc = rep.split(',')[1]
        print("uni: " + rep)
        print("loc: " + str(loc))

Please help, I'm stuck on this. Thanks.

It appears that you can only be sure a line has a location if there is a comma. So it would make sense to parse the file in 2 passes. The first pass can build a set holding all the known locations. You can start this set off with some known examples or problem cases.
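As a rough illustration of the first pass on its own, here is a minimal sketch. The names `collect_locations`, `seed` and `sample` are made up for this example. It uses `str.rpartition`, which returns an empty separator when the line has no comma, so nothing is indexed that might not exist (which is what goes wrong with `split(',')[1]`).

```python
def collect_locations(lines, seed=()):
    """Pass 1 sketch: collect the text after the last comma of each line."""
    locations = set(seed)  # hypothetical seed of already known locations
    for line in lines:
        line = line.strip(' ,"\n')
        # rpartition gives ('', '', line) when there is no comma,
        # so lines without a comma are simply skipped.
        _, sep, tail = line.rpartition(',')
        if sep:
            locations.add(tail.strip())
    return locations


# Three lines taken from the sample file in the question.
sample = [
    '"imperial college of business studies, lahore"',
    '"government college university faisalabad"',
    '"university of sindh, jamshoro"',
]
print(sorted(collect_locations(sample)))  # ['jamshoro', 'lahore']
```

The second sample line has no comma, so it adds nothing to the set. That is why a few seed values are useful for locations that never appear after a comma.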

Pass 2 can then also use the comma to match known locations, but if there is no comma, the line is split into a set of words. The intersection of these words with the location set should give you the location. If there is no intersection, the line is flagged as "unknown".
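The intersection step can be tried in isolation first. This is only a sketch: `match_locations` and the `known` set are names invented for the example, and the set is assumed to have been filled already.

```python
import re


def match_locations(line, known):
    """Return the words of `line` that are also known locations."""
    words = set(re.findall(r"\w+", line))
    return words & known  # set intersection


# Assumed to be the result of the first pass plus some seed values.
known = {"lahore", "faisalabad", "london"}

print(match_locations("government college university faisalabad", known))  # {'faisalabad'}
print(match_locations("university of sindh", known))                       # set()
```

An empty result is the "unknown" case.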

    import re

    # Seed the set with locations that never appear after a comma in the file.
    locations = set(["london", "faisalabad"])

    with open(path, 'r') as f_input:
        unknown = 0

        # Pass 1: build the set of locations from lines that contain a comma.
        for line in f_input:
            line = line.strip(' ,"\n')
            if ',' in line:
                loc = line.rsplit(",", 1)[1].strip()
                locations.add(loc)

        # Pass 2: try to find a location in every line.
        f_input.seek(0)

        for line in f_input:
            line = line.strip(' "\n')
            if ',' in line:
                uni, loc = line.rsplit(",", 1)
                loc = loc.strip()
            else:
                uni = line
                # No comma: look for any word that is a known location.
                loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)

                if loc_matches:
                    loc = list(loc_matches)[0]
                else:
                    loc = "<unknown location>"
                    unknown += 1

            uni = uni.strip()

            print("uni:", uni)
            print("loc:", loc)

        print("unknown locations:", unknown)

The output would be:

    uni: imperial college of business studies
    loc: lahore
    uni: government college university faisalabad
    loc: faisalabad
    uni: imperial college of business studies lahore
    loc: lahore
    uni: university of peshawar
    loc: peshawar
    uni: university of sindh
    loc: jamshoro
    uni: london school of economics
    loc: london
    uni: lahore school of economics
    loc: lahore
    unknown locations: 0
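One thing to be aware of: a set has no order, so `list(loc_matches)[0]` picks an arbitrary match if a line without a comma contains more than one known location. None of the sample lines do. If your data can, one possible refinement is to take the right-most matching word, on the assumption that the location trails the name. That assumption is mine, not something the question guarantees. The helper name `last_known_location` and the test line below are made up for illustration.

```python
import re


def last_known_location(line, locations):
    """Return the right-most word of `line` that is a known location, else None."""
    for word in reversed(re.findall(r"\w+", line)):
        if word in locations:
            return word
    return None


known = {"lahore", "london"}

# Hypothetical line with two known locations and no comma.
print(last_known_location("lahore school of economics london", known))  # london
print(last_known_location("university of sindh", known))                # None
```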
