skip to Main Content

I have a data-set that looks similar to the one shown below.

Cat_1   Cat_2
SEO Branded
SEO Branded
SEO Branded
SEO Branded
SEO Branded
SEO Branded
SEO Branded
SEO Branded
SEO Non branded
SEO Non branded
SEO Non branded
SEO Non branded
SEO Unknown
SEO Unknown
SEO Unknown

I am trying to replace the Cat_2 factors = “Unknown” with either “Branded” or “Non branded”. The number of “Unknown” factors to be replaced with “Branded” or “Non branded” is equal to the distribution of the “Branded” and “Non branded” factors. The actual distribution of “Branded” and “Non branded” factors is not an easy 66.67/33.33 split and will change depending on the data frame being worked on.

I am using the following code to get the percentage split:

a<-(sum(dataframe$Cat_2=="Branded"))
b<-(sum(dataframe$Cat_2=="Non Branded"))
c<-a+b
percenta<-a/c
percentb<-b/c
d<-(sum(dataframe$Cat_2=="Unknown"))

To make things easier I then created a separate data-frame, dataframe.2, that only contains the Cat_2=”Unknown” factors which looks like:

Cat_1   Cat_2
SEO Unknown
SEO Unknown
SEO Unknown

I am trying to use a for loop to iterate over the contents of dataframe.2 and replace each instance of “Unknown” with “Branded” up to the point that the number of factors with “Branded” = percenta*d (where percenta and d are as per the equations above). The code I am using is:

for (row in dataframe.2){
  if (sum(dataframe.2$Cat_2=="Unknown")<percenta*d){
    revalue(dataframe.2$Cat_2, c("Unknown"="Branded"))
} else{(revalue(dataframe.2$Cat_2, c("Unknown"="Non Branded")))
}}

However it doesn’t change the Cat_2 factors within the dataframe.2

2

Answers


  1. We can do this much faster and not use any for loops at all, as well as running only on your initial data (dubbed df here).

    First, find how many “Branded” we need:

    brandedunknowns <- round(sum(df$Cat_2 == "Unknown") * 
                         (sum(df$Cat_2 == "Branded") / sum(df$Cat_2 != "Unknown")))
    

    This simply finds the total proportion of Branded in known, multiplies by the number of unknowns, and rounds to the nearest number.

    Then, we make the first brandedunknowns unknowns to be Branded:

    df$Cat_2[df$Cat_2 == "Unknown"][1 : brandedunknowns] <- "Branded"
    

    Then turn the rest into “Non Branded”:

    df$Cat_2[df$Cat_2 == "Unknown"] <- "Non Branded" 
    
    Login or Signup to reply.
  2. Here is another possibility. The result is different from the previous one in the sense that the “Unknown” entries are not filled sequentially with a block of identical entries, but randomly with “Branded” and “Non branded”. The choice is defined by a probability that corresponds to the ratio of the initial values of “Branded” and “Non branded”:

    x <- as.data.frame(as.matrix(table(df))) 
    

    Here we have used table() to generate a helpful summary of the data and converted the result into a dataframe to facilitate the accessibility of the entries in the table.

    Next, we can calculate the ratio of “Branded” among the labeled (i.e., not “Unknown”) entries:

    branded_ratio <- x$Freq[x$Cat_2=="Branded"] / sum(x$Freq[x$Cat_2!="Unknown"])
    

    With this ratio, we can generate a random sequence of 0 and 1 with the length of the number of “Unknown” entries and store it in a variable idx. The probability to obtain a zero corresponds to the previously calculated ratio of “Branded” labels:

    set.seed(123) #set seed for reproducibility
    idx <- sample(2,x$Freq[x$Cat_2=="Unknown"], prob=c(branded_ratio, 1-branded_ratio),replace=T) - 1 
    

    Finally, we can assign “Non branded” or “Branded” to the entries with “Unknown” according to the sequence of zeros and ones calculated before:

    df$Cat_2[df$Cat_2=="Unknown"] <- ifelse(idx,"Non branded","Branded") 
    

    And this is the result for the example in the OP:

    > df
    #   Cat_1       Cat_2
    #1    SEO     Branded
    #2    SEO     Branded
    #3    SEO     Branded
    #4    SEO     Branded
    #5    SEO     Branded
    #6    SEO     Branded
    #7    SEO     Branded
    #8    SEO     Branded
    #9    SEO Non branded
    #10   SEO Non branded
    #11   SEO Non branded
    #12   SEO Non branded
    #13   SEO     Branded
    #14   SEO Non branded
    #15   SEO     Branded
    

    data

    df <- structure(list(Cat_1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "SEO", class = "factor"), 
    Cat_2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    2L, 2L, 3L, 3L, 3L), .Label = c("Branded", "Non branded", 
    "Unknown"), class = "factor")), .Names = c("Cat_1", "Cat_2"), 
    class = "data.frame", row.names = c(NA, -15L))
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search