python - Pandas DataFrame vectorized sampling


I have a simple DataFrame from which I build a pivot_table:

    import pandas as pd

    d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
         'two': [6., 5., 4., 3., 2., 1.],
         'three': [6., 5., 4., 3., 2., 1.],
         'four': [6., 5., 4., 3., 2., 1.]}
    df = pd.DataFrame(d)
    pivot = pd.pivot_table(df, index=['one', 'two'])

I want to randomly sample 1 row for each distinct element of column 'one' in the resulting pivot object. (In this example, 'a' will always be sampled, while there are more options for 'b' and 'c'.) I have just begun using version 0.18.0 of pandas and am aware of the .sample method. I tried the .groupby method, applying a sampling function like this:

    grouped = pivot.groupby('one').apply(lambda x: x.sample(n=1, replace=False))

I get a KeyError with this and with the variations on the theme that I tried, so I thought it was time for a fresh perspective on this seemingly simple question...

Thanks for any assistance!

The KeyError is raised because 'one' is not a column in pivot but the name of an index level:

    In [11]: pivot
    Out[11]:
             four  three
    one two
    a   6.0   6.0    6.0
    b   4.0   4.0    4.0
        5.0   5.0    5.0
    c   1.0   1.0    1.0
        2.0   2.0    2.0
        3.0   3.0    3.0
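To check this directly, the following minimal sketch rebuilds the frame from the question and inspects the index level names and the column labels. The variable names index_names, column_names and one_is_column are illustrative only and do not appear in the original answer.

```python
import pandas as pd

d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# 'one' and 'two' became levels of the row MultiIndex ...
index_names = list(pivot.index.names)
# ... so only the value columns 'four' and 'three' remain as columns.
column_names = sorted(pivot.columns)
one_is_column = 'one' in pivot.columns

print(index_names, column_names, one_is_column)
```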

You have to use the level argument:

    In [12]: pivot.groupby(level='one').apply(lambda x: x.sample(n=1, replace=False))
    Out[12]:
                 four  three
    one one two
    a   a   6.0   6.0    6.0
    b   b   4.0   4.0    4.0
    c   c   1.0   1.0    1.0

This isn't quite right, since the index level 'one' is repeated! It's slightly better with as_index=False:

    In [13]: pivot.groupby(level='one', as_index=False).apply(lambda x: x.sample(n=1))
    Out[13]:
               four  three
      one two
    0 a   6.0   6.0    6.0
    1 b   4.0   4.0    4.0
    2 c   2.0   2.0    2.0

Note: this picks a random row from each group every time it is run.
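If a repeatable draw is needed, a seed can be passed through random_state. The sketch below assumes a newer pandas than the 0.18.0 mentioned in the question (1.1 or later), where GroupBy.sample is available. It is a swapped-in alternative to the apply-based call above, not the answer's own method, and the seed value 0 is arbitrary.

```python
import pandas as pd

d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)
pivot = pd.pivot_table(df, index=['one', 'two'])

# GroupBy.sample draws n rows per group and keeps the original
# (one, two) index, so no index level is repeated.
# random_state=0 is an arbitrary seed chosen for this sketch.
sampled = pivot.groupby(level='one').sample(n=1, random_state=0)

# The same seed gives the same rows on a second call.
again = pivot.groupby(level='one').sample(n=1, random_state=0)

print(sampled)
```

Dropping random_state restores the behaviour described in the note, with a fresh random row per group on each call.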


As an alternative, here is a potentially more performant variant that pulls out a subframe. The snippet does not define g; it is presumably df.groupby('one'):

    In [21]: df.iloc[[np.random.choice(x) for x in g.indices.values()]]
    Out[21]:
       four one  three  two
    1   5.0   b    5.0  5.0
    3   3.0   c    3.0  3.0
    0   6.0   a    6.0  6.0
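A self-contained sketch of this variant follows. It assumes that g is the original frame grouped by column 'one', which the answer does not state. It also uses a seeded NumPy Generator in place of np.random.choice so the draw can be repeated. The names rng, positions and result are illustrative.

```python
import numpy as np
import pandas as pd

d = {'one': ['a', 'b', 'b', 'c', 'c', 'c'],
     'two': [6., 5., 4., 3., 2., 1.],
     'three': [6., 5., 4., 3., 2., 1.],
     'four': [6., 5., 4., 3., 2., 1.]}
df = pd.DataFrame(d)

# Assumption: g groups the original (unpivoted) frame by column 'one'.
g = df.groupby('one')

# Seed chosen arbitrarily for this sketch.
rng = np.random.default_rng(0)

# g.indices maps each group key to the integer positions of its rows in df;
# pick one position per group, then pull those rows out in a single iloc call.
positions = [int(rng.choice(pos)) for pos in g.indices.values()]
result = df.iloc[positions]

print(result)
```

Because the rows are selected by position in one indexing step, no per-group DataFrame is built, which is where the potential speed-up over apply would come from.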
