
I am very new to R programming and need help with the following problem.

I have the data frame below as input. I want to return the rows that have the same EntryName but a different Sequence.

EntryName    Entry   GeneNames    Organism  Length  Sequence  Postion
HXA13_HUMAN  P31271  HOXA13 HOX   Human     388     AAAA      12
SOX21_HUMAN  Q9Y651  SOX21 SOX25  Human     276     AAAA      13
RBM24_HUMAN  Q9BX46  RBM24 RNPC6  Human     236     AAAE      14
MZT1_HUMAN   Q08AG7  MZT1 C13orf  Human     191     AAAK      15
HXA13_HUMAN  P51589  HOXA13 HOXk  Human     100     ABAB      120

Starting from the rows with Sequence AAAA, I want to return the entire row for every entry whose EntryName matches an AAAA row's EntryName but whose Sequence is different, together with the matching AAAA row.

I am expecting the output below:

EntryName    Entry   GeneNames    Organism  Length  Sequence  Postion
HXA13_HUMAN  P31271  HOXA13 HOX   Human     388     AAAA      12
HXA13_HUMAN  P51589  HOXA13 HOXk  Human     100     ABAB      120

Along with the R script, a MongoDB solution would also be helpful.
Thank you so much in advance!

2 Answers


  1. We could use a grouped filter that keeps every EntryName having 'AAAA' among its sequences:

    library(dplyr)
    df1 %>%
        group_by(EntryName) %>%
        filter('AAAA' %in% Sequence) %>%
        ungroup()
    

    Or, to keep only the EntryNames that have more than one distinct Sequence:

    df1 %>%
        group_by(EntryName) %>%
        filter(n_distinct(Sequence) > 1) %>%
        ungroup()
    

    Output (of the second version; on this data the first version also keeps the single SOX21_HUMAN row):

    # A tibble: 2 × 7
      EntryName   Entry  GeneNames   Organism Length Sequence Postion
      <chr>       <chr>  <chr>       <chr>     <int> <chr>      <int>
    1 HXA13_HUMAN P31271 HOXA13 HOX  Human       388 AAAA          12
    2 HXA13_HUMAN P51589 HOXA13 HOXk Human       100 ABAB         120
    

    Data:

    df1 <- structure(list(EntryName = c("HXA13_HUMAN", "SOX21_HUMAN", "RBM24_HUMAN", 
    "MZT1_HUMAN", "HXA13_HUMAN"), Entry = c("P31271", "Q9Y651", "Q9BX46", 
    "Q08AG7", "P51589"), GeneNames = c("HOXA13 HOX", "SOX21 SOX25", 
    "RBM24 RNPC6", "MZT1 C13orf", "HOXA13 HOXk"), Organism = c("Human", 
    "Human", "Human", "Human", "Human"), Length = c(388L, 276L, 236L, 
    191L, 100L), Sequence = c("AAAA", "AAAA", "AAAE", "AAAK", "ABAB"
    ), Postion = c(12L, 13L, 14L, 15L, 120L)), 
    class = "data.frame", row.names = c(NA, 
    -5L))
    
  2. Base R:

    subset(df1, EntryName %in% unique(EntryName[Sequence == "AAAA"]))
    
     EntryName   Entry  GeneNames   Organism Length Sequence Postion
      <chr>       <chr>  <chr>       <chr>     <int> <chr>      <int>
    1 HXA13_HUMAN P31271 HOXA13 HOX  Human       388 AAAA          12
    2 SOX21_HUMAN Q9Y651 SOX21 SOX25 Human       276 AAAA          13
    3 HXA13_HUMAN P51589 HOXA13 HOXk Human       100 ABAB         120
    

    We could also use any() in a grouped filter:

    library(dplyr)
    df1 %>%
      group_by(EntryName) %>%
      filter(any(Sequence=="AAAA")) %>%
      ungroup()
    
     EntryName   Entry  GeneNames   Organism Length Sequence Postion
      <chr>       <chr>  <chr>       <chr>     <int> <chr>      <int>
    1 HXA13_HUMAN P31271 HOXA13 HOX  Human       388 AAAA          12
    2 SOX21_HUMAN Q9Y651 SOX21 SOX25 Human       276 AAAA          13
    3 HXA13_HUMAN P51589 HOXA13 HOXk Human       100 ABAB         120
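
Note that the filters that only test for "AAAA" also return the SOX21_HUMAN row, which is not in the expected output because that EntryName has no second sequence. One way to get exactly the two expected rows may be to require both conditions at once: the EntryName has an AAAA row and it has more than one distinct Sequence. The following is a minimal base R sketch under that reading of the question; the names target, with_target, n_seq, multi_seq and result are made up for illustration.

```r
# Same sample data as in the answers above.
df1 <- data.frame(
  EntryName = c("HXA13_HUMAN", "SOX21_HUMAN", "RBM24_HUMAN", "MZT1_HUMAN", "HXA13_HUMAN"),
  Entry     = c("P31271", "Q9Y651", "Q9BX46", "Q08AG7", "P51589"),
  GeneNames = c("HOXA13 HOX", "SOX21 SOX25", "RBM24 RNPC6", "MZT1 C13orf", "HOXA13 HOXk"),
  Organism  = "Human",
  Length    = c(388L, 276L, 236L, 191L, 100L),
  Sequence  = c("AAAA", "AAAA", "AAAE", "AAAK", "ABAB"),
  Postion   = c(12L, 13L, 14L, 15L, 120L),
  stringsAsFactors = FALSE
)

target <- "AAAA"  # sequence to start from (illustrative variable name)

# EntryNames that have at least one row with the target sequence
with_target <- unique(df1$EntryName[df1$Sequence == target])

# Number of distinct sequences per EntryName
n_seq <- tapply(df1$Sequence, df1$EntryName, function(s) length(unique(s)))

# EntryNames that occur with more than one distinct sequence
multi_seq <- names(n_seq)[n_seq > 1]

# Keep rows whose EntryName satisfies both conditions
result <- df1[df1$EntryName %in% intersect(with_target, multi_seq), ]
print(result)
```

With dplyr, the same idea would be to put both conditions into one grouped filter, i.e. filter("AAAA" %in% Sequence, n_distinct(Sequence) > 1) after group_by(EntryName).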
    