pandas - Create advanced frequency table with Python
I am trying to make a frequency table based on a dataframe with pandas and Python. In fact, it's the same as a previous question of mine, which used R.
Let's say I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):
    node     | precedingword
    -------------------------
    a-bom      de
    a-bom      die
    a-bom      de
    a-bom      een
    a-bom      n
    a-bom      de
    acroniem   het
    acroniem   t
    acroniem   het
    acroniem   n
    acroniem   een
    act        de
    act        het
    act        die
    act        dat
    act        t
    act        n
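For anyone who wants to reproduce the example, the sample rows above could be built roughly like this. It is only a sketch: the variable name df matches the code used further down, and the column names are taken from the table header.

```python
import pandas as pd

# Sample rows copied from the table above: (node, precedingword)
rows = [
    ("a-bom", "de"), ("a-bom", "die"), ("a-bom", "de"),
    ("a-bom", "een"), ("a-bom", "n"), ("a-bom", "de"),
    ("acroniem", "het"), ("acroniem", "t"), ("acroniem", "het"),
    ("acroniem", "n"), ("acroniem", "een"),
    ("act", "de"), ("act", "het"), ("act", "die"),
    ("act", "dat"), ("act", "t"), ("act", "n"),
]
df = pd.DataFrame(rows, columns=["node", "precedingword"])

# Number of rows per node, as a quick sanity check
print(df.groupby("node").size())
```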
I'd like to use these values to make a count of the precedingwords per node, but with subcategories. For instance: one column to add the values to titled neuter, another titled non-neuter, and a last one titled rest. neuter would contain all rows where precedingword is one of these values: t, het, dat. non-neuter would contain de and die, and rest would contain everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic, in other words if rest used some sort of reversed variable of the ones used for neuter and non-neuter, or if it subtracted the values in neuter and non-neuter from the number of rows with that node.)
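To make the intended rule concrete, here is a minimal sketch in plain Python. The names NEUTER, NON_NEUTER and classify are made up for illustration; the point is that rest is never listed explicitly but is whatever falls outside the two known sets.

```python
# Hypothetical names, for illustration only.
NEUTER = {"t", "het", "dat"}
NON_NEUTER = {"de", "die"}

def classify(word):
    """Return the category of a preceding word; 'rest' is the fallback."""
    if word in NEUTER:
        return "neuter"
    if word in NON_NEUTER:
        return "non-neuter"
    return "rest"

print([classify(w) for w in ["het", "de", "een", "n"]])
# → ['neuter', 'non-neuter', 'rest', 'rest']
```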
Example output (in a new dataframe, let's say freqdf) would look like this:
    node     | neuter | nonneuter | rest
    -----------------------------------------
    a-bom      0        4           2
    acroniem   3        0           2
    act        3        2           1
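I found an answer to a similar question, but the use case isn't exactly the same. It seems to me that in that question all variables are independent. However, in my case it is obvious that I have multiple rows with the same node, which should all be brought down to a single row with frequencies, as shown in the expected output above.
I thought of something like this (untested):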
    def specificfreq(d):
        for uniqueword in d['node']:
            return pd.Series({'node': uniqueword,
                              'neuter': sum(d['node' == uniqueword] & d['precedingword'] == 't|het|dat'),
                              'nonneuter': sum(d['node' == uniqueword] & d['precedingword'] == 'de|die'),
                              # number of rows with the specific word, minus the neuter and nonneuter values above
                              'rest': len(uniqueword) - neuter - nonneuter})

    df.groupby('node').apply(specificfreq)
But I highly doubt this is the correct way of doing this.
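For comparison, here is one way the same idea (a count per node, with rest as the remainder) could be written so that it runs. It is a sketch rather than the only option: it sums boolean masks per node instead of calling apply, and NEUTER and NON_NEUTER are assumed names for the two word lists.

```python
import pandas as pd

df = pd.DataFrame({
    "node": ["a-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingword": "de die de een n de het t het n een de het die dat t n".split(),
})

NEUTER = ["t", "het", "dat"]   # assumed names for the word lists
NON_NEUTER = ["de", "die"]

# Boolean masks: True where the preceding word belongs to the category
is_neuter = df["precedingword"].isin(NEUTER)
is_non_neuter = df["precedingword"].isin(NON_NEUTER)

# Summing a boolean mask per node counts the matching rows
freqdf = pd.DataFrame({
    "neuter": is_neuter.groupby(df["node"]).sum(),
    "nonneuter": is_non_neuter.groupby(df["node"]).sum(),
}).astype(int)

# rest = number of rows for the node minus the two known categories
freqdf["rest"] = df.groupby("node").size() - freqdf["neuter"] - freqdf["nonneuter"]
print(freqdf)
```

Because rest is computed by subtraction, it stays correct when words are added to either list.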
As proposed in the R solution, you can first change the name and then perform the cross tabulation:
    neuter = ["t", "het", "dat"]
    non_neuter = ["de", "die"]

    df.loc[df.precedingword.isin(neuter), "gender"] = "neuter"
    df.loc[df.precedingword.isin(non_neuter), "gender"] = "non_neuter"
    # neuter + non_neuter is the concatenation of both lists
    df.loc[~df.precedingword.isin(neuter + non_neuter), "gender"] = "rest"
    pd.crosstab(df.node, df.gender)

    gender    neuter  non_neuter  rest
    node
    a-bom          0           4     2
    acroniem       3           0     2
    act            3           2     1
This one is better because if a word in neuter or non_neuter is not present in precedingword, it won't raise a KeyError as in the former solution.
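A small self-contained check of that claim, as a sketch: the made-up sample below contains no dat and no non-neuter words at all, and labelling the rows with isin still works. The default label and the final reindex are additions that are not part of the answer; the reindex is there because crosstab only creates columns for labels that actually occur.

```python
import pandas as pd

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

# Made-up sample: "dat" never occurs, and neither does any non-neuter word.
df = pd.DataFrame({
    "node": ["acroniem", "acroniem", "acroniem", "act"],
    "precedingword": ["het", "t", "een", "n"],
})

df["gender"] = "rest"  # default label, overwritten below where a list matches
df.loc[df.precedingword.isin(neuter), "gender"] = "neuter"
df.loc[df.precedingword.isin(non_neuter), "gender"] = "non_neuter"

freqdf = pd.crosstab(df.node, df.gender)
# Optional: make sure all three columns exist, even when a category is empty.
freqdf = freqdf.reindex(columns=["neuter", "non_neuter", "rest"], fill_value=0)
print(freqdf)
```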
Former solution, less clean:
Given your dataframe, you can make a simple cross tabulation:
ct = pd.crosstab(df.node, df.precedingword)
which gives:
    precedingword  dat  de  die  een  het  n  t
    node
    a-bom            0   3    1    1    0  1  0
    acroniem         0   0    0    1    2  1  1
    act              1   1    1    0    1  1  1
Then, you just want to sum some columns together:
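    neuter = ["t", "het", "dat"]
    non_neuter = ["de", "die"]

    freqdf = pd.DataFrame()
    # sum the neuter columns, then drop them from the cross tabulation
    freqdf["neuter"] = ct[neuter].sum(axis=1)
    ct.drop(neuter, axis=1, inplace=True)
    # same for the non-neuter columns
    freqdf["non_neuter"] = ct[non_neuter].sum(axis=1)
    ct.drop(non_neuter, axis=1, inplace=True)
    # whatever columns are left make up the rest
    freqdf["rest"] = ct.sum(axis=1)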
which gives freqdf:
              neuter  non_neuter  rest
    node
    a-bom          0           4     2
    acroniem       3           0     2
    act            3           2     1
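If the column-summing approach is preferred, the KeyError mentioned earlier and the in-place drop calls can both be avoided. The following is a sketch of one possible variant, not the code from the answer: it uses reindex to select the wanted columns (absent ones are filled with 0) and computes rest by subtraction, so ct itself is left untouched.

```python
import pandas as pd

df = pd.DataFrame({
    "node": ["a-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingword": "de die de een n de het t het n een de het die dat t n".split(),
})

neuter = ["t", "het", "dat"]
non_neuter = ["de", "die"]

ct = pd.crosstab(df.node, df.precedingword)

freqdf = pd.DataFrame(index=ct.index)
# reindex selects the listed columns and fills absent ones with 0,
# so a word that never occurs in the data does not raise a KeyError
freqdf["neuter"] = ct.reindex(columns=neuter, fill_value=0).sum(axis=1)
freqdf["non_neuter"] = ct.reindex(columns=non_neuter, fill_value=0).sum(axis=1)
# everything that is neither neuter nor non-neuter
freqdf["rest"] = ct.sum(axis=1) - freqdf["neuter"] - freqdf["non_neuter"]
print(freqdf)
```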
HTH