dictionary - Use Levenshtein distance for keys in defaultdict in Python

- September 15, 2012

I am doing sequencing analysis, and I'm trying to create a default dictionary of genetic sequences based on identifiers. Looking at the following example, I have created a dict and put both sequences agagag and atatat in the same list because they have the same identifier, cccccc:

Input:

ccccccagagag
ccccccatatat

Code:

    from collections import defaultdict
    d = defaultdict(list)
    d['cccccc'].append('agagag')
    d['cccccc'].append('atatat')

The problem I have is that if a key sequence is within a Levenshtein distance of 1 of an existing key, I want it to be treated as the same key. So if I come across a sequence that looks like this:

ccccctacacac

I want to go through the dict, see that there is a cccccc key, see that distance('cccccc', 'ccccct') < 2, and then maybe change ccccct to cccccc and append the sequence to the same list as above.

Hopefully there is a way of doing this. Thanks.

You can use difflib.SequenceMatcher. Its ratio() method returns 1 for equal sequences, so you can use how far the ratio falls below 1 to compare keys.

In your case:

    >>> import difflib
    >>> difflib.SequenceMatcher(None, 'cccccc', 'ccccct').ratio()
    0.8333333333333334
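A note on what that number means: ratio() is computed as 2.0*M/T, where M is the number of matched characters and T is the combined length of both strings, so it is a similarity score and not an edit distance. The sketch below is only an illustration; key_ratio is a hypothetical helper name, and the keys cccctt and cctccc are made-up examples that are not in the question. It suggests that a 0.8 threshold works for a substitution at the end of a key, but that a substitution in the middle of a repetitive key can score lower even though its Levenshtein distance is still 1.

```python
import difflib

def key_ratio(a, b):
    """Similarity score in [0, 1]; hypothetical helper wrapping SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# One substitution at the end: 5 matched characters, 2 * 5 / 12.
end_sub = key_ratio('cccccc', 'ccccct')

# Two substitutions at the end: 4 matched characters, 2 * 4 / 12.
two_subs = key_ratio('cccccc', 'cccctt')

# One substitution in the middle of a repetitive key (made-up example).
# SequenceMatcher matches contiguous blocks, so this can score below 0.8.
mid_sub = key_ratio('cccccc', 'cctccc')

print(end_sub > 0.8, two_subs > 0.8, mid_sub > 0.8)
```

If keys can differ anywhere in the identifier, it may be safer to compare them with a real edit distance than with this ratio.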

Demo:

    >>> from collections import defaultdict
    >>> from itertools import combinations
    >>> import difflib
    >>> li = ['aaaaaaacdcba', 'ccccccatatat', 'ccccccagagag', 'ccccctacacac', 'aaaaaaacacac']
    >>> d = defaultdict(list)
    >>> for i in li:
    ...     d[i[:6]].append(i[6:])   # the first 6 characters are the key
    ...
    >>> keys = list(d.keys())
    >>> for i, j in combinations(keys, 2):
    ...     if difflib.SequenceMatcher(None, i, j).ratio() > 0.8:
    ...         d[i].extend(d[j])    # merge the similar key's list into the first key
    ...         del d[j]
    ...
    >>> d
    defaultdict(<type 'list'>, {'aaaaaa': ['acdcba', 'acacac'], 'cccccc': ['atatat', 'agagag', 'acacac']})
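Since the question asks for a Levenshtein distance below 2 rather than a similarity ratio, here is a minimal sketch of that variant. It is not part of the answer above: levenshtein and group_reads are hypothetical helper names, the 6-character key length is taken from the example input, and the distance is the textbook dynamic-programming version rather than a library call.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ''
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def group_reads(reads, key_len=6, max_dist=1):
    """Group reads by identifier, reusing an existing key within max_dist."""
    groups = defaultdict(list)
    for read in reads:
        key, payload = read[:key_len], read[key_len:]
        # Use the first existing key that is close enough, else the new key.
        match = next((k for k in groups if levenshtein(k, key) <= max_dist), key)
        groups[match].append(payload)
    return groups

reads = ['aaaaaaacdcba', 'ccccccatatat', 'ccccccagagag',
         'ccccctacacac', 'aaaaaaacacac']
groups = group_reads(reads)
print(dict(groups))
```

In this sketch the first key seen becomes the representative for its group, so the result depends on input order, and each new read is compared against every existing key. That is fine for a handful of identifiers but would likely need a smarter lookup for large data sets.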
