r - Distinct (dplyr) not working well - unique observation based on criteria -


i have data duplicated records , of them should not there (mark , recov should once per band, recap can present several times). want select unique observations (band) based on values in column (variable=="mark") , keep rest of data "recap" , "recov".

i used dyplr, group data band select unique record when column variable=="mark", code:

uniq <- df %>%group_by(band)  %>% distinct(variable=="mark") 

i found out not working well, when looking observations values variable=="recap" have been deleted (example: in band=113749924, recap value 1993 missing, same case in band=113728509 there recap value missing)

this data example:

structure(list(band = c(113728501l, 113728502l, 113728503l, 113728504l,  113728505l, 113728505l, 113728506l, 113728506l, 113728507l, 113728508l,  113728509l, 113728509l, 113728509l, 113728509l, 113728510l, 113728510l,  113729709l, 113729709l, 113729709l, 113729710l, 113729711l, 113729712l,  113729713l, 113729714l, 113729715l, 113729716l, 113729717l, 113729718l,  113729719l, 113729720l, 113729720l, 113729721l, 113729722l, 113729723l,  113729724l, 113729725l, 113729726l, 113729727l, 113729728l, 113729729l,  113729730l, 113729731l, 113729732l, 113729733l, 113729733l, 113729733l,  113729734l, 113729735l, 113729735l, 113729735l, 113729914l, 113729914l,  113729914l, 113729914l, 113729915l, 113729916l, 113729917l, 113729918l,  113729919l, 113729920l, 113729921l, 113729922l, 113729923l, 113729924l,  113729925l, 113729926l, 113729927l, 113729928l, 113729929l, 113749923l,  113749924l, 113749924l, 113749924l), variable = structure(c(1l,  1l, 1l, 1l, 1l, 3l, 1l, 2l, 1l, 1l, 1l, 1l, 2l, 2l, 1l, 2l, 1l,  3l, 2l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 3l, 1l, 1l,  1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 3l, 2l, 1l, 1l, 3l,  2l, 1l, 1l, 2l, 2l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l,  1l, 1l, 1l, 1l, 1l, 1l, 3l, 2l), .label = c("mark", "recap",  "recov"), class = "factor"), year = c(1994l, 1994l, 1994l, 1994l,  1994l, 2012l, 1994l, 1999l, 1994l, 1994l, 1994l, 1994l, 2002l,  2003l, 1994l, 1996l, 1994l, 2002l, 1998l, 1994l, 1994l, 1994l,  1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1995l,  1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l,  1994l, 1994l, 1994l, 1994l, 2002l, 2001l, 1994l, 1994l, 1999l,  1998l, 1994l, 1994l, 1999l, 2005l, 1994l, 1994l, 1994l, 1994l,  1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l,  1994l, 1994l, 1991l, 1991l, 1994l, 1993l)), .names = c("band",  "variable", "year"), class = "data.frame", row.names = c(na,  -73l)) 

at end have (example 113749924):

band      year variable 113749924 1991 mark 113749924 1993 recap 113749924 1994 recov 

could please me find wrong or maybe suggest me alternative code?

thanks lot!

one option group_by 'band', filter rows 'variable' 'mark', distinct rows , bind (bind_rows) filtered dataset 'variable' not 'mark'.

df %>%   group_by(band) %>%   filter(variable=="mark") %>%   ungroup() %>%  distinct() %>%  bind_rows(., filter(df, variable!="mark")) %>%  arrange(band) %>%  data.frame        band variable year 1  113728501     mark 1994 2  113728502     mark 1994 3  113728503     mark 1994 4  113728504     mark 1994 5  113728505     mark 1994 6  113728505    recov 2012 7  113728506     mark 1994 8  113728506    recap 1999 9  113728507     mark 1994 10 113728508     mark 1994 11 113728509     mark 1994  ###only 1 mark. 12 113728509    recap 2002 13 113728509    recap 2003 14 113728510     mark 1994 15 113728510    recap 1996 16 113729709     mark 1994 17 113729709    recov 2002 18 113729709    recap 1998 19 113729710     mark 1994 20 113729711     mark 1994 21 113729712     mark 1994 22 113729713     mark 1994 23 113729714     mark 1994 24 113729715     mark 1994 25 113729716     mark 1994 26 113729717     mark 1994 27 113729718     mark 1994 28 113729719     mark 1994 29 113729720     mark 1994 30 113729720    recov 1995 31 113729721     mark 1994 32 113729722     mark 1994 33 113729723     mark 1994 34 113729724     mark 1994 35 113729725     mark 1994 36 113729726     mark 1994 37 113729727     mark 1994 38 113729728     mark 1994 39 113729729     mark 1994 40 113729730     mark 1994 41 113729731     mark 1994 42 113729732     mark 1994 43 113729733     mark 1994 44 113729733    recov 2002 45 113729733    recap 2001 46 113729734     mark 1994 47 113729735     mark 1994 48 113729735    recov 1999 49 113729735    recap 1998 50 113729914     mark 1994 51 113729914    recap 1999 52 113729914    recap 2005 53 113729915     mark 1994 54 113729916     mark 1994 55 113729917     mark 1994 56 113729918     mark 1994 57 113729919     mark 1994 58 113729920     mark 1994 59 113729921     mark 1994 60 113729922     mark 1994 61 113729923     mark 1994 62 113729924     mark 1994 63 113729925     mark 1994 64 113729926     mark 1994 65 113729927     mark 1994 66 113729928     mark 1994 67 113729929     mark 1994 68 113749923     mark 1991 69 113749924     mark 1991 70 113749924    recov 1994 71 113749924    recap 1993 

or option group_by both 'band', , 'variable', create logical condition row_number() greater 1 , 'variable' 'mark', negate (!) , filter rows.

df %>%      group_by(band, variable) %>%      filter(!(row_number() >1 & variable =="mark")) 

Comments

Popular posts from this blog

PySide and Qt Properties: Connecting signals from Python to QML -

c# - DevExpress.Wpf.Grid.InfiniteGridSizeException was unhandled -

scala - 'wrong top statement declaration' when using slick in IntelliJ -