dictionary - Use Levenshtein distance for keys in defaultdict in Python


I am doing sequencing analysis, and I'm trying to create a default dictionary of genetic sequences based on their identifiers. In the following example, I have created a dict and put both sequences agagag and atatat in the same list, because they have the same identifier, cccccc:

Input:

ccccccagagag
ccccccatatat

Code:

from collections import defaultdict
d = defaultdict(list)
d['cccccc'].append('agagag')
d['cccccc'].append('atatat')

The problem I have is that if a key sequence is within a Levenshtein distance of 1 of an existing key, I want it to be treated as the same key. So if I come across a sequence that looks like this:

ccccctacacac 

I want to go through the dict, see that cccccc is already there, see that distance('cccccc', 'ccccct') < 2, and then change ccccct to cccccc and append acacac to the same list as above.

Hopefully there is a way of doing this. Thanks.

You can use difflib.SequenceMatcher. Its ratio() method returns 1 for equal sequences, and you can use how far the ratio falls below 1 to compare keys.

In this case:

>>> import difflib
>>> difflib.SequenceMatcher(None, 'cccccc', 'ccccct').ratio()
0.8333333333333334
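Note that ratio() is a similarity score, not an edit distance. If the literal Levenshtein threshold from the question is needed, a small dynamic-programming function is one way to get it. The following is only a minimal sketch; the name levenshtein is an illustrative choice and not something difflib provides:

```python
def levenshtein(a, b):
    """Return the edit distance (insertions, deletions, substitutions) between a and b."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # substitute, or free if the characters match
            ))
        prev = curr
    return prev[-1]


print(levenshtein('cccccc', 'ccccct'))  # 1
print(levenshtein('cccccc', 'aaaaaa'))  # 6
```

With this, the check from the question can be written as levenshtein(key_a, key_b) < 2.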

Demo:

>>> from collections import defaultdict
>>> from itertools import combinations
>>> import difflib
>>> li = ['aaaaaaacdcba', 'ccccccatatat', 'ccccccagagag', 'ccccctacacac', 'aaaaaaacacac']
>>> d = defaultdict(list)
>>> for i in li:
...     d[i[:6]].append(i[6:])   # the first 6 characters are the identifier
...
>>> keys = list(d.keys())        # snapshot, since keys are deleted below
>>> for i, j in combinations(keys, 2):
...     if difflib.SequenceMatcher(None, i, j).ratio() > 0.8:
...         d[i].extend(d[j])    # merge the near-duplicate key into the first one
...         del d[j]
...
>>> d
defaultdict(<type 'list'>, {'aaaaaa': ['acdcba', 'acacac'], 'cccccc': ['atatat', 'agagag', 'acacac']})
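The demo merges keys after the whole dict has been built. The question describes the opposite order: look for a close key at the moment each sequence arrives. A rough sketch of that is below. The names hamming and add_sequence and their parameters are hypothetical, and it assumes every identifier has the same fixed length. Under that assumption, a Levenshtein distance of at most 1 is the same as at most one mismatched position, so a plain mismatch count is swapped in for the edit distance. For variable-length identifiers, a true Levenshtein function would have to be passed as distance instead.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions; only meaningful for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming() needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def add_sequence(groups, read, key_len=6, max_dist=1, distance=hamming):
    """Append the read's payload under a close existing key, else under its own key."""
    key, payload = read[:key_len], read[key_len:]
    if key not in groups:                 # membership test does not create the key
        for existing in groups:
            if distance(existing, key) <= max_dist:
                key = existing            # reuse the first close-enough key
                break
    groups[key].append(payload)
    return key


groups = defaultdict(list)
for read in ['ccccccagagag', 'ccccccatatat', 'ccccctacacac', 'aaaaaaacacac']:
    add_sequence(groups, read)

print(dict(groups))
# {'cccccc': ['agagag', 'atatat', 'acacac'], 'aaaaaa': ['acacac']}
```

Each new identifier is compared against every existing key, and the first close match wins, so the grouping can depend on the order in which the sequences arrive.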
