pandas - Create advanced frequency table with Python


I'm trying to make a frequency table based on a dataframe with pandas and Python. In fact, it's the same as a previous question of mine that used R.

Let's say I have a dataframe in pandas that looks like this (in fact the dataframe is larger, but for illustrative purposes I limited the rows):

node    |   precedingWord
-------------------------
a-bom       de
a-bom       die
a-bom       de
a-bom       een
a-bom       n
a-bom       de
acroniem    het
acroniem    t
acroniem    het
acroniem    n
acroniem    een
act         de
act         het
act         die
act         dat
act         t
act         n
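For reference, the sample dataframe above can be built like this (a minimal sketch that just reproduces the data shown in the table; only pandas is assumed):

import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    "node": ["a-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingWord": ["de", "die", "de", "een", "n", "de",
                      "het", "t", "het", "n", "een",
                      "de", "het", "die", "dat", "t", "n"],
})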

I'd like to use these values to make a count of precedingWords per node, but with subcategories. For instance: one column for values titled neuter, one for nonNeuter, and a last one for rest. neuter would contain all rows whose precedingWord is one of t, het, dat; nonNeuter would contain de and die; and rest would contain everything that doesn't belong to neuter or nonNeuter. (It would be nice if this were dynamic, in other words that rest uses some sort of inverse of the variables used for neuter and nonNeuter, or subtracts the neuter and nonNeuter counts from the number of rows per node.)

Example output (in a new dataframe, let's call it freqDf) would look like this:

node    |   neuter   | nonNeuter   | rest
-----------------------------------------
a-bom       0          4             2
acroniem    3          0             2
act         3          2             1

I found an answer to a similar question, but the use case isn't quite the same. It seems to me that in that question the variables are independent. In my case, however, there are obviously multiple rows with the same node, which should be collapsed into a single row with a frequency, as shown in the expected output above.

I thought of something like this (untested):

def specificFreq(d):
    for uniqueWord in d['node']:
        return pd.Series({'node': uniqueWord,
            'neuter': sum(d['node' == uniqueWord] & d['precedingWord'] == 't|het|dat'),
            'nonNeuter': sum(d['node' == uniqueWord] & d['precedingWord'] == 'de|die'),
            'rest': len(uniqueWord) - neuter - nonNeuter})
            # number of rows for this specific word, with the neuter and nonNeuter values above subtracted

df.groupby('node').apply(specificFreq)

But I highly doubt this is the correct way of doing it.

As proposed in the R solution, you can first assign a gender label to each row and then perform a cross tabulation:

df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"
df.loc[df.precedingWord.isin(neuter + non_neuter) == 0, "gender"] = "rest"
# neuter + non_neuter is the concatenation of both lists.

pd.crosstab(df.node, df.gender)
gender    neuter  non_neuter  rest
node
a-bom          0           4     2
acroniem       3           0     2
act            3           2     1

This one is better because if a word in neuter or non_neuter is not present in precedingWord, it won't raise a KeyError the way the former solution below does.
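For completeness, here is a self-contained sketch of the same approach, with the neuter and non_neuter lists (only defined further down in the older solution) spelled out. As a small variation of my own, it initialises gender to "rest" first, so the third .loc line is not needed:

import pandas as pd

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

# label every row, then cross-tabulate node against the new gender column
df["gender"] = "rest"
df.loc[df.precedingWord.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingWord.isin(non_neuter), "gender"] = "non_neuter"

freqDf = pd.crosstab(df.node, df.gender)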


The former solution, which is less clean:

Given your dataframe, you can make a simple cross tabulation:

ct = pd.crosstab(df.node, df.precedingWord)

which gives:

precedingWord  dat  de  die  een  het  n  t
node
a-bom            0   3    1    1    0  1  0
acroniem         0   0    0    1    2  1  1
act              1   1    1    0    1  1  1

Then, you want to sum the relevant columns together:

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]
freqDf = pd.DataFrame()

freqDf["neuter"] = ct[neuter].sum(axis=1)
ct.drop(neuter, axis=1, inplace=True)

freqDf["non_neuter"] = ct[non_neuter].sum(axis=1)
ct.drop(non_neuter, axis=1, inplace=True)

freqDf["rest"] = ct.sum(axis=1)

which gives freqDf:

          neuter  non_neuter  rest
node
a-bom          0           4     2
acroniem       3           0     2
act            3           2     1
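As a side note, if you would rather not mutate ct with the inplace drops, a variant along these lines computes rest by subtraction instead, which also matches the dynamic behaviour you asked about. This is just a sketch that reuses the neuter and non_neuter lists and the untouched ct from above:

freqDf = pd.DataFrame(index=ct.index)
freqDf["neuter"] = ct[neuter].sum(axis=1)
freqDf["non_neuter"] = ct[non_neuter].sum(axis=1)
# rest = total occurrences per node minus the two labelled categories
freqDf["rest"] = ct.sum(axis=1) - freqDf["neuter"] - freqDf["non_neuter"]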

HTH

