python - Pandas dataframe vectorized sampling
I have a simple df from which I form a pivot_table:
import pandas as pd

d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])
I would like to randomly sample 1 row for each different element of column 'one' of the resulting pivot object. (In this example, 'a' will always be sampled, while there are more options for 'b' and 'c'.) I just began using the 0.18.0 version of pandas and am aware of the .sample method. I messed with the .groupby method, applying a sampling function like this:
grouped = pivot.groupby('one').apply(lambda x: x.sample(n=1, replace=False))
I get a KeyError whenever I try variations on this theme, so I thought it was time for a fresh perspective on this seemingly simple question...
Thanks for any assistance!
The KeyError is raised since 'one' is not a column in pivot but the name of an index level:
In [11]: pivot
Out[11]:
         four  three
one two
a   6.0   6.0    6.0
b   4.0   4.0    4.0
    5.0   5.0    5.0
c   1.0   1.0    1.0
    2.0   2.0    2.0
    3.0   3.0    3.0
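To check this directly, here is a minimal sketch that rebuilds the pivot from the question and inspects where 'one' lives. The variable names index_names and column_names are only illustrative. Note that the KeyError is specific to older pandas such as 0.18; as far as I know, newer versions also let groupby resolve a string against index level names.

```python
import pandas as pd

# Same data as in the question.
d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# 'one' and 'two' became index levels; only the remaining
# columns are left as data columns.
index_names = list(pivot.index.names)
column_names = list(pivot.columns)

print(index_names)
print(column_names)
print('one' in pivot.columns)  # False: 'one' is not a column
```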
You have to use the level argument:
In [12]: pivot.groupby(level='one').apply(lambda x: x.sample(n=1, replace=False))
Out[12]:
             four  three
one one two
a   a   6.0   6.0    6.0
b   b   4.0   4.0    4.0
c   c   1.0   1.0    1.0
This isn't quite right since the index is repeated! It's better with as_index=False:
In [13]: pivot.groupby(level='one', as_index=False).apply(lambda x: x.sample(n=1))
Out[13]:
           four  three
  one two
0 a   6.0   6.0    6.0
1 b   4.0   4.0    4.0
2 c   2.0   2.0    2.0
Note: this picks a random row each time.
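If you want the draw to be repeatable, or want to keep the original ('one', 'two') index, something along these lines may help. This is only a sketch: it swaps in group_keys=False instead of as_index=False, and the fixed random_state is an assumption added here for reproducibility, not part of the calls above.

```python
import pandas as pd

# Same data as in the question.
d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# group_keys=False stops pandas from prepending the group key,
# so the 'one' level is not repeated in the result.
# random_state=0 (an assumption) makes every run pick the same rows.
sampled = (pivot.groupby(level='one', group_keys=False)
                .apply(lambda x: x.sample(n=1, random_state=0)))

print(sampled)
```

Drop random_state to get a different row on each call.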
As an alternative, here is a potentially more performant variant (that pulls out a subframe):
In [21]: df.iloc[[np.random.choice(x) for x in g.indices.values()]]
Out[21]:
   four one  three  two
1   5.0   b    5.0  5.0
3   3.0   c    3.0  3.0
0   6.0   a    6.0  6.0
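The snippet above never defines g. Judging by the output (which still has 'one' as a column and the original integer index), it is presumably df.groupby('one'), and the sketch below makes that assumption explicit. The seeded RandomState and the names rng, positions and picked are likewise illustrative additions.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)

# Assumption: g groups the original frame (not the pivot) by 'one'.
g = df.groupby('one')

# g.indices maps each group key to an array of integer row positions;
# pick one position per group, then fetch all rows in a single iloc call.
rng = np.random.RandomState(0)  # seeded only to make the sketch repeatable
positions = [rng.choice(idx) for idx in g.indices.values()]
picked = df.iloc[positions]

print(picked)
```

This avoids calling .sample on a sub-DataFrame for every group, which is where the possible speed-up would come from.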