dictionary - Use Levenshtein distance for keys in defaultdict in Python
I am doing sequencing analysis, and I'm trying to create a default dictionary of genetic sequences based on identifiers. In the following example, I have created a dict and put both sequences, agagag and atatat, in the same list because they have the same identifier, cccccc:
Input:
ccccccagagag
ccccccatatat
Code:
from collections import defaultdict
d = defaultdict(list)
d['cccccc'].append('agagag')
d['cccccc'].append('atatat')
The problem I have is that if a key sequence is within a Levenshtein distance of 1 of an existing key, I want it to be treated as the same key. So if I come across a sequence that looks like this:
ccccctacacac
I want to go through the dict, see that there is a key cccccc, and see that distance('cccccc', 'ccccct') < 2. Then maybe change ccccct to cccccc and append the sequence to the same list as above.
Hopefully there is a way of doing this. Thanks.
You can use difflib.SequenceMatcher. Its ratio() method returns 1 for equal sequences, and you can use how far the ratio falls below 1 to compare keys.
In your case:
>>> import difflib
>>> difflib.SequenceMatcher(None, 'cccccc', 'ccccct').ratio()
0.8333333333333334
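Note that ratio() is a similarity score rather than an edit distance: it is computed as 2*M/T, where M is the number of matched characters and T is the combined length of both strings, so 0.8333 here is 2*5/12. If the strict Levenshtein threshold from the question is what matters, a small dynamic-programming function is one option. This is a minimal sketch; the function name levenshtein is my own and is not part of the standard library.

```python
def levenshtein(a, b):
    """Return the edit distance (insert, delete, substitute) between a and b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(levenshtein('cccccc', 'ccccct'))  # 1
print(levenshtein('cccccc', 'aaaaaa'))  # 6
```

With this, the condition from the question can be written directly as levenshtein(key1, key2) < 2.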
Demo:
>>> from collections import defaultdict
>>> from itertools import combinations
>>> import difflib
>>> li = ['aaaaaaacdcba', 'ccccccatatat', 'ccccccagagag', 'ccccctacacac', 'aaaaaaacacac']
>>> d = defaultdict(list)
>>> for i in li:
...     d[i[:6]].append(i[6:])  # first 6 characters are the identifier
...
>>> keys = d.keys()
>>> for i, j in combinations(keys, 2):
...     if difflib.SequenceMatcher(None, i, j).ratio() > 0.8:
...         d[i].extend(d[j])  # merge the similar key's sequences into i
...         del d[j]
...
>>> d
defaultdict(<type 'list'>, {'aaaaaa': ['acdcba', 'acacac'], 'cccccc': ['atatat', 'agagag', 'acacac']})
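The pairwise merge above runs after all keys have been collected. An alternative is to resolve the key at insertion time, so that an identifier within one edit of an existing key is filed under that key straight away. The following is only a sketch under stated assumptions: within_one_edit and canonical_key are hypothetical helper names of my own, the identifier is assumed to be the first 6 characters as in the demo, and the linear scan over existing keys is only reasonable for a modest number of identifiers.

```python
from collections import defaultdict


def within_one_edit(a, b):
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # at most one substituted character
    return a[i:] == b[i + 1:]  # b has exactly one extra character


def canonical_key(d, key):
    """Return an existing key within one edit of key, or key itself if none."""
    if key in d:
        return key
    for existing in d:
        if within_one_edit(existing, key):
            return existing
    return key


d = defaultdict(list)
for read in ['ccccccagagag', 'ccccccatatat', 'ccccctacacac', 'aaaaaaacacac']:
    ident, seq = read[:6], read[6:]
    d[canonical_key(d, ident)].append(seq)

print(dict(d))
# {'cccccc': ['agagag', 'atatat', 'acacac'], 'aaaaaa': ['acacac']}
```

One caveat with this approach: the result depends on insertion order, because whichever variant of an identifier is seen first becomes the key that later near-matches are filed under.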