Split dataframe into two groups - Photoshop

Luciano
July 11, 2015
2 Answers

I’ve simulated this data.frame:

library(plyr); library(ggplot2)
count <- rev(seq(0, 500, 20))
tide <- seq(0, 5, length.out = length(count))
df <- data.frame(count, tide)

count_sim <- unlist(llply(count, function(x) rnorm(20, x, 50)))
count_sim_df <- data.frame(tide=rep(tide,each=20), count_sim)

And it can be plotted like this:

ggplot(df, aes(tide, count)) +
   geom_jitter(data = count_sim_df, aes(tide, count_sim), position = position_jitter(width = 0.09)) +
   geom_line(color = "red")

I now want to split count_sim_df into two groups: high and low. When I plot the split count_sim_df, it should look like this (everything in green and blue is photoshopped). The part I'm finding tricky is getting the overlap between high and low around the middle values of tide.

This is how I want to split count_sim_df into high and low:

assign half of count_sim_df to high and half of count_sim_df to low
reassign the values of count to create overlap between high and low around the middle values of tide

Tags: dataframe ggplot2 r visualization

Answers

Here’s my revised suggestion. I hope it helps.

middle_tide <- mean(count_sim_df$tide)
hilo_margin <- 0.3
# Cutoffs of the transition band: +/- 30% around the mean tide
lo_cut <- middle_tide * (1 - hilo_margin)
hi_cut <- middle_tide * (1 + hilo_margin)
# The band is inclusive so that points exactly on a cutoff are not dropped
middle_df <- subset(count_sim_df, tide >= lo_cut & tide <= hi_cut)
upper_df <- count_sim_df[count_sim_df$tide > hi_cut, ]
lower_df <- count_sim_df[count_sim_df$tide < lo_cut, ]
# Assign each point in the band to high (1) or low (2) at random
idx <- sample(2, nrow(middle_df), replace = TRUE)
count_sim_high <- rbind(middle_df[idx == 1, ], upper_df)
count_sim_low <- rbind(middle_df[idx == 2, ], lower_df)
p <- ggplot(df, aes(tide, count)) + 
   geom_jitter(data = count_sim_high, aes(tide, count_sim), position = position_jitter(width = 0.09), alpha = 0.4, col = 3, size = 3) + 
   geom_jitter(data = count_sim_low, aes(tide, count_sim), position = position_jitter(width = 0.09), alpha = 0.4, col = 4, size = 3) + 
   geom_line(color = "red")
p  # print the plot

I might still not have fully understood your procedure for splitting into high and low, especially what you mean by “reassigning the value of count”. Here I have defined a transition region spanning ±30% of the mean value of tide and randomly assigned each point within that region to either the “high” or the “low” set with equal probability, so roughly half of the points end up in each set.
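Because each point in the transition region is drawn independently, the two sets are only approximately equal in size there. If an exactly even split is wanted, one option is to shuffle a balanced vector of labels instead. The following is a minimal base R sketch of that idea, not part of the original answer; the `group` column, the `lo_cut`/`hi_cut`/`in_band` names, and the seed are assumptions made for illustration, and the data are rebuilt so the block runs on its own.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Rebuild the simulated data from the question (base R only)
count <- rev(seq(0, 500, 20))
tide  <- seq(0, 5, length.out = length(count))
count_sim_df <- data.frame(
  tide      = rep(tide, each = 20),
  count_sim = rnorm(20 * length(count), rep(count, each = 20), 50)
)

# Transition band: +/- 30% around the mean tide
middle_tide <- mean(count_sim_df$tide)
hilo_margin <- 0.3
lo_cut <- middle_tide * (1 - hilo_margin)
hi_cut <- middle_tide * (1 + hilo_margin)

# Points above the band are "high"; everything else starts as "low"
count_sim_df$group <- ifelse(count_sim_df$tide > hi_cut, "high", "low")

# Inside the band, hand out a shuffled vector with equal numbers of each
# label (the counts differ by one if the band holds an odd number of points)
in_band <- count_sim_df$tide >= lo_cut & count_sim_df$tide <= hi_cut
n_band  <- sum(in_band)
count_sim_df$group[in_band] <- sample(rep(c("high", "low"), length.out = n_band))

# Label counts inside the band
table(count_sim_df$group[in_band])
```

The resulting `group` column can be mapped to colour in a single `geom_jitter` layer instead of building separate high and low data frames.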

Here’s a way to generate the sample dataset and the groupings using relatively little code and just base R:

library(ggplot2)
count <- rev(seq(0, 500, 20))
tide <- seq(0, 5, length.out = length(count))
df <- data.frame(count, tide)

count_sim_df <- data.frame(tide = rep(tide,each=20),
                           count = rnorm(20 * nrow(df), rep(count, each = 20), 50))
margin <- 0.3
count_sim_df$`tide level` <-
  with(count_sim_df,
    factor((tide >= quantile(tide, 0.5 + margin / 2) |
           (tide >= quantile(tide, 0.5 - margin / 2) & sample(0:1, length(tide), TRUE))),
           labels = c("Low", "High")))
ggplot(df, aes(x = tide, y = count)) +
  geom_line(colour = "red") +
  geom_point(aes(colour = `tide level`), count_sim_df, position = "jitter") +
  scale_colour_manual(values = c(High = "green", Low = "blue"))
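To check that the overlap falls where intended, the grouping can be cross-tabulated against tide. The sketch below repeats the quantile-based split with the intermediate steps named, then tabulates the result; `lower_q`, `upper_q`, `coin`, `is_high`, and `level_by_tide` are illustrative names and the seed is arbitrary. Rows below the lower quantile should contain only "Low", rows at or above the upper quantile only "High", and only rows in between can contain both.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Same simulated data as above
count <- rev(seq(0, 500, 20))
tide  <- seq(0, 5, length.out = length(count))
count_sim_df <- data.frame(
  tide  = rep(tide, each = 20),
  count = rnorm(20 * length(count), rep(count, each = 20), 50)
)

# Overlap band defined by quantiles of tide, as in the answer
margin  <- 0.3
lower_q <- quantile(count_sim_df$tide, 0.5 - margin / 2)
upper_q <- quantile(count_sim_df$tide, 0.5 + margin / 2)

# One fair coin flip per point; it only matters inside the band
coin    <- sample(c(TRUE, FALSE), nrow(count_sim_df), replace = TRUE)
is_high <- count_sim_df$tide >= upper_q |
  (count_sim_df$tide >= lower_q & coin)
count_sim_df$`tide level` <- factor(is_high, levels = c(FALSE, TRUE),
                                    labels = c("Low", "High"))

# Counts of Low and High at each tide value
level_by_tide <- table(round(count_sim_df$tide, 1),
                       count_sim_df$`tide level`,
                       dnn = c("tide", "tide level"))
print(level_by_tide)
```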
