Python Gaussian kernel density: calculate a score for new values
This is my code:

    import numpy as np
    from scipy.stats import gaussian_kde
    from scipy.stats import norm
    from numpy import linspace, hstack
    from pylab import plot, show, hist
    import re
    import json

    attribute_file = "path"
    attribute_values = [line.rstrip('\n') for line in open(attribute_file)]

    osservazioni = []  # assume the list of observations is loaded here
    obs = np.asarray(osservazioni)
    obs = np.sort(obs, kind='mergesort')
    x_min = obs[0]    # smallest observation (obs is sorted)
    x_max = obs[-1]   # largest observation

    # obtaining the pdf (my_pdf is a function!)
    my_pdf = gaussian_kde(obs)

    # plotting the result
    x = linspace(0, x_max, 1000)
    plot(x, my_pdf(x), 'r')              # distribution function
    hist(obs, density=True, alpha=.3)    # histogram (density replaces the removed normed argument)
    show()

    new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
    for e in new_values:
        print(str(e) + " - " + str(my_pdf(e)*100*2))
The problem: the obs array contains the list of all observations. I need to calculate a score (between 0 and 1) for new values:
[-1, 0, 2, 3, 4, 500, 768]
So the value -1 should get a moderate score: it does not appear in the distribution, but it is next to the value 1, which is very common in the observations.
The reason is that you have many more 1's in your observations than 768's. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.
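Up to a multiplicative constant, the formula for the prediction is:

$$\hat{f}(x) \propto \sum_{x_i \in D} K\left(\frac{x - x_i}{h}\right)$$

where K is your kernel, D your observations, and h your bandwidth. Looking at the doc for gaussian_kde, we see that if no value is provided for bw_method, the bandwidth is estimated automatically, in a way that doesn't suit you here.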
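To make the formula concrete, here is a minimal sketch that evaluates the kernel sum by hand and compares it with what gaussian_kde returns. The sample array is a toy stand-in built only from the counts quoted in this answer (it is not your real data), and kernel_sum is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy sample (an assumption, not the asker's data): it only reuses the
# counts quoted in this answer (921 ones, 357 twos, 39 values of 768).
sample = np.concatenate([np.full(921, 1.0), np.full(357, 2.0), np.full(39, 768.0)])

kde = gaussian_kde(sample)            # bw_method=None, so Scott's rule is used
h = kde.factor * sample.std(ddof=1)   # effective bandwidth in data units


def kernel_sum(x, data, h):
    """Unnormalized estimate: sum over i of K((x - x_i) / h), Gaussian K."""
    u = (x - data) / h
    return np.exp(-0.5 * u ** 2).sum()


# Dividing by n * h * sqrt(2 * pi) recovers the density computed by SciPy.
x0 = -1.0
manual = kernel_sum(x0, sample, h) / (len(sample) * h * np.sqrt(2 * np.pi))
print("bandwidth:", h)
print("manual density:", manual, "gaussian_kde:", kde(x0)[0])
```

On this toy sample the default bandwidth should come out much larger than the gap between -1 and 1, which is why the many 1s still weigh heavily at x = -1.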
So you can try different values: the larger the bandwidth, the more points far from your new data are taken into account, the limit case being an almost constant predicted function.
On the other hand, a very small bandwidth only takes really close points into account, which is the thing you want.
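A quick numerical check of this, as a sketch only: the sample is again a toy stand-in built from the counts quoted in this answer, and far_point and the two factors are arbitrary illustrative choices. Note that a scalar bw_method is used by gaussian_kde as a factor multiplying the standard deviation of the data, not as an absolute bandwidth.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy sample (assumption): built only from the counts quoted in this answer.
sample = np.concatenate([np.full(921, 1.0), np.full(357, 2.0), np.full(39, 768.0)])

far_point = 400.0  # hypothetical query, far from every observation
densities = {}
for factor in (0.01, 5.0):
    # A scalar bw_method is used directly as kde.factor.
    kde = gaussian_kde(sample, bw_method=factor)
    densities[factor] = kde(far_point)[0]
    print("factor={0}: density at {1} = {2:.3e}".format(
        factor, far_point, densities[factor]))
```

With the small factor, no observation is close enough to contribute at 400; with the large one, every observation does.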
Some graphs to illustrate the influence of the bandwidth:
The code used:
    import matplotlib.pyplot as plt

    f, axarr = plt.subplots(2, 2, figsize=(10, 10))
    for i, h in enumerate([0.01, 0.1, 1, 5]):
        # a scalar bw_method is used directly as kde.factor
        my_pdf = gaussian_kde(osservazioni, h)
        axarr[i//2, i%2].plot(x, my_pdf(x), 'r')  # distribution function
        axarr[i//2, i%2].set_title("bandwidth: {0}".format(h))
        # histogram (density replaces the removed normed argument)
        axarr[i//2, i%2].hist(osservazioni, density=True, alpha=.3)
With your current code, for x = -1, the value of K((x-x_i)/h) for all the x_i's equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 1s in your observations, and also 357 2s).
On the other hand, for x = 768, the value of the kernel is 1 for all the x_i's equal to 768, but there are not many such points (39, to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.
If you don't want this behavior, you can decrease the size of your Gaussian kernel: this way the penalty (K(-2)) paid because of the distance between -1 and 1 will be higher. But I think this would be overfitting your observations.
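The arithmetic behind this can be sketched with the counts quoted above. This is an illustration under assumptions: it uses an unnormalized Gaussian kernel with K(0) = 1, it ignores every observation other than the 1s, 2s and 768s, and h here is a bandwidth in data units (for gaussian_kde, a scalar bw_method is instead a factor multiplying the standard deviation of the data). The names gaussian_kernel and kernel_sums are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel, so that K(0) = 1."""
    return math.exp(-0.5 * u * u)


def kernel_sums(h):
    # At x = -1: the 921 ones are at distance 2, the 357 twos at distance 3.
    at_minus_one = 921 * gaussian_kernel(2 / h) + 357 * gaussian_kernel(3 / h)
    # At x = 768: the 39 observations equal to 768 each contribute K(0) = 1.
    at_768 = 39 * gaussian_kernel(0)
    return at_minus_one, at_768


for h in (5.0, 1.0, 0.5):  # hypothetical bandwidths, in data units
    low, high = kernel_sums(h)
    print("h={0}: sum at -1 = {1:.3f}, sum at 768 = {2:.3f}".format(h, low, high))
```

For h = 5 and h = 1 the many small terms at -1 still add up to more than 39, while for h = 0.5 the penalty for the distance is large enough that 768 wins.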
A formula to determine whether a new sample is acceptable (compared to your empirical distribution) or not is more of a statistical problem; you can have a look at stats.stackexchange.com.
You can always try to use a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function by dividing it by its maximal value.
After that, all predicted values will be between 0 and 1:
    maxdensityvalue = np.max(my_pdf(x))
    for e in new_values:
        print("{0} {1}".format(e, my_pdf(e)/maxdensityvalue))
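The same idea can be wrapped in a small self-contained function, as a sketch only. The name normalized_scores, its default bw_method and the grid size are assumptions for illustration, and the sample is a toy stand-in built from the counts quoted in this answer. One design choice to note: the maximum is searched on a grid, at the observations and at the new values themselves, because a narrow kernel can put its peak between two grid points, which would let a score slip above 1.

```python
import numpy as np
from scipy.stats import gaussian_kde


def normalized_scores(observations, new_values, bw_method=0.01, grid_size=2000):
    """Score new values in [0, 1] as density / (largest density found)."""
    observations = np.asarray(observations, dtype=float)
    new_values = np.asarray(new_values, dtype=float)
    kde = gaussian_kde(observations, bw_method=bw_method)
    # Candidate locations for the peak: a regular grid over the data range,
    # the distinct observations, and the new values being scored.
    grid = np.linspace(observations.min(), observations.max(), grid_size)
    candidates = np.concatenate([grid, np.unique(observations), new_values])
    max_density = kde(candidates).max()
    return kde(new_values) / max_density


# Toy sample (assumption): built only from the counts quoted in this answer.
sample = np.concatenate([np.full(921, 1.0), np.full(357, 2.0), np.full(39, 768.0)])
new_values = [-1, 0, 2, 3, 4, 500, 768]
scores = normalized_scores(sample, new_values)
for value, score in zip(new_values, scores):
    print("{0} {1:.4f}".format(value, score))
```

Because the score is only a rescaled density, it still favors values near the most frequent observations; it is not a probability.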