r - Distinct (dplyr) not working well - unique observation based on criteria -
i have data duplicated records , of them should not there (mark
, recov
should once per band
, recap
can present several times). want select unique observations (band
) based on values in column (variable=="mark"
) , keep rest of data "recap"
, "recov"
.
i used dyplr
, group data band select unique record when column variable=="mark"
, code:
uniq <- df %>%group_by(band) %>% distinct(variable=="mark")
i found out not working well, when looking observations values variable=="recap"
have been deleted (example: in band=113749924
, recap value 1993 missing, same case in band=113728509
there recap value missing)
this data example:
structure(list(band = c(113728501l, 113728502l, 113728503l, 113728504l, 113728505l, 113728505l, 113728506l, 113728506l, 113728507l, 113728508l, 113728509l, 113728509l, 113728509l, 113728509l, 113728510l, 113728510l, 113729709l, 113729709l, 113729709l, 113729710l, 113729711l, 113729712l, 113729713l, 113729714l, 113729715l, 113729716l, 113729717l, 113729718l, 113729719l, 113729720l, 113729720l, 113729721l, 113729722l, 113729723l, 113729724l, 113729725l, 113729726l, 113729727l, 113729728l, 113729729l, 113729730l, 113729731l, 113729732l, 113729733l, 113729733l, 113729733l, 113729734l, 113729735l, 113729735l, 113729735l, 113729914l, 113729914l, 113729914l, 113729914l, 113729915l, 113729916l, 113729917l, 113729918l, 113729919l, 113729920l, 113729921l, 113729922l, 113729923l, 113729924l, 113729925l, 113729926l, 113729927l, 113729928l, 113729929l, 113749923l, 113749924l, 113749924l, 113749924l), variable = structure(c(1l, 1l, 1l, 1l, 1l, 3l, 1l, 2l, 1l, 1l, 1l, 1l, 2l, 2l, 1l, 2l, 1l, 3l, 2l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 3l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 3l, 2l, 1l, 1l, 3l, 2l, 1l, 1l, 2l, 2l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 1l, 3l, 2l), .label = c("mark", "recap", "recov"), class = "factor"), year = c(1994l, 1994l, 1994l, 1994l, 1994l, 2012l, 1994l, 1999l, 1994l, 1994l, 1994l, 1994l, 2002l, 2003l, 1994l, 1996l, 1994l, 2002l, 1998l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1995l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 2002l, 2001l, 1994l, 1994l, 1999l, 1998l, 1994l, 1994l, 1999l, 2005l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1994l, 1991l, 1991l, 1994l, 1993l)), .names = c("band", "variable", "year"), class = "data.frame", row.names = c(na, -73l))
at end have (example 113749924):
band year variable 113749924 1991 mark 113749924 1993 recap 113749924 1994 recov
could please me find wrong or maybe suggest me alternative code?
thanks lot!
one option group_by
'band', filter
rows 'variable' 'mark', distinct
rows , bind (bind_rows
) filter
ed dataset 'variable' not 'mark'.
df %>% group_by(band) %>% filter(variable=="mark") %>% ungroup() %>% distinct() %>% bind_rows(., filter(df, variable!="mark")) %>% arrange(band) %>% data.frame band variable year 1 113728501 mark 1994 2 113728502 mark 1994 3 113728503 mark 1994 4 113728504 mark 1994 5 113728505 mark 1994 6 113728505 recov 2012 7 113728506 mark 1994 8 113728506 recap 1999 9 113728507 mark 1994 10 113728508 mark 1994 11 113728509 mark 1994 ###only 1 mark. 12 113728509 recap 2002 13 113728509 recap 2003 14 113728510 mark 1994 15 113728510 recap 1996 16 113729709 mark 1994 17 113729709 recov 2002 18 113729709 recap 1998 19 113729710 mark 1994 20 113729711 mark 1994 21 113729712 mark 1994 22 113729713 mark 1994 23 113729714 mark 1994 24 113729715 mark 1994 25 113729716 mark 1994 26 113729717 mark 1994 27 113729718 mark 1994 28 113729719 mark 1994 29 113729720 mark 1994 30 113729720 recov 1995 31 113729721 mark 1994 32 113729722 mark 1994 33 113729723 mark 1994 34 113729724 mark 1994 35 113729725 mark 1994 36 113729726 mark 1994 37 113729727 mark 1994 38 113729728 mark 1994 39 113729729 mark 1994 40 113729730 mark 1994 41 113729731 mark 1994 42 113729732 mark 1994 43 113729733 mark 1994 44 113729733 recov 2002 45 113729733 recap 2001 46 113729734 mark 1994 47 113729735 mark 1994 48 113729735 recov 1999 49 113729735 recap 1998 50 113729914 mark 1994 51 113729914 recap 1999 52 113729914 recap 2005 53 113729915 mark 1994 54 113729916 mark 1994 55 113729917 mark 1994 56 113729918 mark 1994 57 113729919 mark 1994 58 113729920 mark 1994 59 113729921 mark 1994 60 113729922 mark 1994 61 113729923 mark 1994 62 113729924 mark 1994 63 113729925 mark 1994 64 113729926 mark 1994 65 113729927 mark 1994 66 113729928 mark 1994 67 113729929 mark 1994 68 113749923 mark 1991 69 113749924 mark 1991 70 113749924 recov 1994 71 113749924 recap 1993
or option group_by
both 'band', , 'variable', create logical condition row_number()
greater 1 , 'variable' 'mark', negate (!
) , filter
rows.
df %>% group_by(band, variable) %>% filter(!(row_number() >1 & variable =="mark"))
Comments
Post a Comment