Python Gaussian kernel density: calculate score for new values


I have this code:

    import numpy as np
    from scipy.stats import gaussian_kde  # scipy.stats.kde is no longer importable in recent SciPy
    from scipy.stats import norm
    from numpy import linspace, hstack
    from pylab import plot, show, hist

    import re
    import json

    attribute_file = "path"

    attribute_values = [line.rstrip('\n') for line in open(attribute_file)]

    osservazioni = []
    # assume the list of observations (osservazioni) is loaded

    obs = np.asarray(osservazioni)
    obs = np.sort(obs, kind='mergesort')
    x_min = obs[0]   # smallest observation (the array is sorted)
    x_max = obs[-1]  # largest observation

    # obtaining the pdf (my_pdf is a function!)
    my_pdf = gaussian_kde(obs)

    # plotting the result
    x = linspace(0, x_max, 1000)

    plot(x, my_pdf(x), 'r')  # distribution function

    hist(obs, density=True, alpha=.3)  # histogram (density replaces the removed normed argument)
    show()

    new_values = np.asarray([-1, 0, 2, 3, 4, 5, 768])[:, np.newaxis]
    for e in new_values:
        print(str(e) + " - " + str(my_pdf(e) * 100 * 2))

The problem: the obs array contains a list of observations. I need to calculate a score (between 0 and 1) for new values:

[-1, 0, 2, 3, 4, 500, 768]

So the value -1 must have a decent score: it doesn't appear in the distribution, but it is next to the value 1, which is very common in the observations.

The reason is that you have many more 1's in your observations than 768's. So even if -1 is not exactly 1, it gets a high predicted value, because the histogram has a much larger value at 1 than at 768.

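Up to a multiplicative constant, the formula for the prediction is:

$$\hat{f}(x) \propto \sum_{x_i \in D} K\left(\frac{x - x_i}{h}\right)$$

where K is your kernel, D your observations and h your bandwidth. Looking at the doc for gaussian_kde, we see that if no value is provided for bw_method, it is estimated automatically (Scott's rule by default), which here doesn't suit you.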
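To make the formula concrete, here is a minimal sketch that evaluates the kernel sum by hand and compares it with gaussian_kde. The sample, the helper name unnormalized_kde and the evaluation points are made up for illustration. It assumes one-dimensional data, where SciPy's effective bandwidth is kde.factor times the sample standard deviation.

```python
import numpy as np
from scipy.stats import gaussian_kde


def unnormalized_kde(x, data, h):
    """Sum of Gaussian kernels K((x - x_i) / h) over all observations x_i."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    u = (x[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1)


# Made-up sample, only for illustration
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=2.0, size=500)

kde = gaussian_kde(data)
# Effective bandwidth in 1-D: scaling factor times the sample standard deviation
h = kde.factor * data.std(ddof=1)

points = np.array([-1.0, 0.0, 3.0])
manual = unnormalized_kde(points, data, h)
ratio = kde(points) / manual  # the same constant at every point
```

The ratio is the multiplicative constant mentioned above, 1 / (n * h * sqrt(2 * pi)), so only the kernel sum decides which points score higher than others.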

So you can try different values: the larger the bandwidth, the more points far from your new data are taken into account, the limit case being an almost constant predicted function.

On the other hand, a very small bandwidth only takes really close points into account, which is what I think you want.

Some graphs to illustrate the influence of the bandwidth:

[Figure: estimated density (red) over the histogram of the observations, one panel per bandwidth: 0.01, 0.1, 1, 5]

Here is the code used:

    import matplotlib.pyplot as plt

    f, axarr = plt.subplots(2, 2, figsize=(10, 10))
    for i, h in enumerate([0.01, 0.1, 1, 5]):
        # A scalar bw_method is used as the scaling factor (kde.factor),
        # which SciPy multiplies by the standard deviation of the data
        my_pdf = gaussian_kde(osservazioni, h)
        axarr[i//2, i%2].plot(x, my_pdf(x), 'r')  # distribution function
        axarr[i//2, i%2].set_title("bandwidth: {0}".format(h))
        axarr[i//2, i%2].hist(osservazioni, density=True, alpha=.3)  # histogram
    plt.show()

With your current code, for x = -1, the value of K((x - x_i)/h) for all the x_i's that are equal to 1 is smaller than 1, but you add up a lot of these values (there are 921 1s in your observations, and also 357 2s).

On the other hand, for x = 768, the value of the kernel is 1 for all the x_i's that are 768, but there are not many such points (39 to be precise). So here a lot of "small" terms make a larger sum than a small number of larger terms.

If you don't want this behavior, you can decrease the size of your Gaussian kernel: this way the penalty (K(-2)) paid because of the distance between -1 and 1 will be higher. But I think that this would be overfitting your observations.
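The arithmetic behind the last three paragraphs can be sketched with the counts quoted above. This is an illustration only: the two bandwidths are hypothetical, every observation other than the 1s, 2s and 768s is ignored, and the kernel is left unnormalized so that K(0) = 1.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian kernel, so that K(0) = 1."""
    return math.exp(-0.5 * u * u)


# Counts quoted in the text; all other observations are ignored here
counts = {1: 921, 2: 357, 768: 39}


def kernel_sum(x, h):
    """Sum of K((x - x_i) / h) over the observations listed in counts."""
    return sum(n * gaussian_kernel((x - x_i) / h) for x_i, n in counts.items())


for h in (5.0, 0.5):  # hypothetical bandwidths, one wide and one narrow
    print("h = {0}: sum at -1 = {1:.3f}, sum at 768 = {2:.3f}".format(
        h, kernel_sum(-1, h), kernel_sum(768, h)))
```

With the wide kernel the sum at -1 is far larger than the sum at 768, because hundreds of terms slightly below 1 outweigh 39 terms equal to 1. With the narrow kernel the order flips, which is the higher penalty described above.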

A formula to determine whether a new sample is acceptable (compared to your empirical distribution) or not is more of a statistical problem; you can have a look at stats.stackexchange.com.

You can try to use a low value for the bandwidth, which will give you a peaked predicted function. Then you can normalize this function by dividing it by its maximal value.

After that, all predicted values will be between 0 and 1:

    maxdensityvalue = np.max(my_pdf(x))
    for e in new_values:
        print("{0} {1}".format(e, my_pdf(e)/maxdensityvalue))
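The same idea can be wrapped in a small helper. This is a sketch under assumptions, not the code from the answer: make_score, its default bw_method of 0.01 and the synthetic observations are made up for illustration. Since the true peak can fall between the points where the maximum is searched, the result is clipped so it never exceeds 1. Note that a scalar bw_method in SciPy is a scaling factor applied to the standard deviation of the data, not the bandwidth h itself.

```python
import numpy as np
from scipy.stats import gaussian_kde


def make_score(observations, bw_method=0.01, grid_size=1000):
    """Return a function mapping values to density / max density, clipped to [0, 1]."""
    obs = np.asarray(observations, dtype=float)
    kde = gaussian_kde(obs, bw_method=bw_method)
    # Search for the peak on a regular grid and at the distinct observed values
    grid = np.linspace(obs.min(), obs.max(), grid_size)
    candidates = np.concatenate([grid, np.unique(obs)])
    max_density = kde(candidates).max()

    def score(values):
        values = np.atleast_1d(np.asarray(values, dtype=float))
        return np.clip(kde(values) / max_density, 0.0, 1.0)

    return score


# Made-up observations that only mimic the counts quoted in the text
observations = np.concatenate(
    [np.full(921, 1.0), np.full(357, 2.0), np.full(39, 768.0)])
score = make_score(observations)
scores = score([-1, 0, 2, 3, 4, 5, 768])
```

The scores are relative to the most common region of the observations, so a rare but genuinely observed value such as 768 still gets a low score. Whether that is acceptable depends on what the score is meant to express.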
