
I have a fairly large dataset (~50K entries) which I use to generate a correlation matrix. This works well, using "only" ~20GB RAM.

Then, I want to extract only the unique pairwise combinations from it and convert them into a data frame. This is where I run into issues: either too much RAM usage or overflowing the indexing variable(s). I know the matrix has ~2.5 billion cells (about 1.25 billion unique pairs), so I am aware it explodes a bit in size, but still.
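
For scale, the sizes involved can be estimated with some quick arithmetic. This is only a back-of-the-envelope sketch for the mock dimensions (n = 50000), assuming 8 bytes per double and ignoring the label columns and any temporary copies:

```r
n <- 50000                       # stored as a double, so n^2 cannot overflow

cells <- n^2                     # entries in the full matrix
pairs <- n * (n - 1) / 2         # unique pairs, diagonal excluded

cells > .Machine$integer.max     # TRUE: the matrix itself is a "long vector"
pairs > .Machine$integer.max     # FALSE: the result fits in ordinary integer indexing

cells * 8 / 1e9                  # ~20 GB for the matrix of doubles
pairs * 8 / 1e9                  # ~10 GB for the value column alone
```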

I have tried different ways to achieve this, but with no success.

Mock data:

df = matrix(runif(1),nrow=50000, ncol=50000, dimnames=list(seq(1,50000,by=1), seq(1,50000,by=1)))

Trying to extract upper/lower triangle from the correlation matrix and then reshape it:

df[lower.tri(df, diag = T),] = NA
df = reshape2::melt(df, na.rm = T)

crashes with:

Error in df[lower.tri(bla, diag = T), ] = NA : 
  long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522

It crashes with the same error if you do only: df = df[lower.tri(df, diag = T),]
(I did read through "Large Matrices in R: long vectors not supported yet", but I didn't find it helpful for my situation.)

I also tried:

df = subset(as.data.frame(as.table(df)),
       match(Var1, names(annotation_table)) > match(Var2, names(annotation_table)))

to stick to base R, but it eventually ran out of memory after ~1 day. The most RAM-intensive part is as.data.frame(as.table(df)), so I also tried replacing it with reshape2::melt(df), but that ran out of RAM as well.

I am running the code on an Ubuntu machine with 128GB RAM. I do have larger machines, but I would have expected this amount of RAM to suffice.

Any help would be highly appreciated. Thank you.

2 Answers


  1. Chosen as BEST ANSWER

    Okay, after digging a bit more and trying other stuff, I found a solution in a previous post that eventually worked:

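    # row/column positions of every element strictly above the diagonal
    upper_ind = which(upper.tri(df, diag = FALSE), arr.ind = TRUE)
    gc() # clean up and free some RAM
    # look up the labels and the values for those positions
    df = data.frame(first = dimnames(df)[[2]][upper_ind[, 2]],
                    second = dimnames(df)[[1]][upper_ind[, 1]],
                    correlation = df[upper_ind])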
    

    For reference, testing it out on my real data (49,100 x 49,100 correlation matrix):

    • only retrieving the index of the elements within the upper triangle of the correlation matrix (first command) peaked at ~90GB RAM
    • putting together the data frame (second command) peaked at ~100GB of RAM

    If you have such large datasets, I really suggest you call the garbage collector between the two commands as it actually helped. Timewise, it took less than 10 minutes. It is not ideal, but given my setup and time constraints, it is a solution.
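
One possible way to lower the peak further is to process the columns in blocks, so that the full index matrix never has to exist at once. The following is only a sketch under that assumption, shown on a toy matrix and not tried at the full 49,100 x 49,100 size; `upper_block` and the block size of 2 are hypothetical, not something taken from the post mentioned above:

```r
# Hypothetical helper: long-format upper triangle for the columns `cols` of x.
upper_block <- function(x, cols) {
  n_above <- cols - 1L                  # entries above the diagonal in each column
  rows <- sequence(n_above)             # row numbers 1..(j-1) for each column j
  cls  <- rep.int(cols, n_above)        # matching column numbers
  data.frame(first = colnames(x)[cls],
             second = rownames(x)[rows],
             correlation = x[cbind(rows, cls)])
}

# Toy correlation matrix standing in for the real one.
set.seed(1)
x <- cor(matrix(rnorm(60), 10, 6))
dimnames(x) <- list(LETTERS[1:6], LETTERS[1:6])

# Split the column numbers into blocks of 2 and process one block at a time.
blocks <- split(seq_len(ncol(x)), ceiling(seq_len(ncol(x)) / 2))
res <- do.call(rbind, lapply(blocks, function(cols) upper_block(x, cols)))
```

With a matrix of the size discussed here, each block could be written to disk (for example with write.table and append = TRUE) instead of being combined with rbind, so that only one block of indices is held in memory at a time.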

    Thank you @Robert Hacken for spotting the erroneous comma in df[lower.tri(df, diag = T),] = NA (i.e., the comma before the closing bracket should be removed).
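
To illustrate the point about the comma on a toy matrix (the 4 x 4 matrix `m` is made up for this example): without the trailing comma the logical matrix acts as an element-wise mask, whereas with the comma R treats it as a row subscript.

```r
m <- matrix(1:16, nrow = 4)

# With the trailing comma the mask is taken as a row subscript and the call fails
# (on this toy matrix with a different error than the long-vector one).
res <- tryCatch(m[lower.tri(m, diag = TRUE), ], error = function(e) e)
inherits(res, "error")

# Without the comma the mask is applied element by element.
m[lower.tri(m, diag = TRUE)] <- NA
m[!is.na(m)]   # 5 9 10 13 14 15: the values strictly above the diagonal
```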

    I think that what @Mikael Jagan has proposed might be more memory efficient, but I did not manage to successfully run his code.


  2. If you have as much RAM as you say, then this really should work without issue for n much larger than 6. If you see errors not related to memory usage, then you should share the code that you evaluated, since you have probably made a mistake adapting the example …

    set.seed(0)
    n <- 6L
    x <- provideDimnames(cor(matrix(rnorm(n * n), n, n)))
    x
    
                A           B            C           D           E            F
    A  1.00000000  0.42679900  0.113100027 -0.03952030 -0.02406114 -0.693427730
    B  0.42679900  1.00000000  0.519377903  0.06136646 -0.51713799 -0.331961466
    C  0.11310003  0.51937790  1.000000000 -0.43996491 -0.32225557 -0.006199606
    D -0.03952030  0.06136646 -0.439964909  1.00000000 -0.42053358  0.537301520
    E -0.02406114 -0.51713799 -0.322255571 -0.42053358  1.00000000 -0.367595700
    F -0.69342773 -0.33196147 -0.006199606  0.53730152 -0.36759570  1.000000000
    
    s <- seq_len(n) - 1L
    nms <- dimnames(x)
    dat <- data.frame(val = x[sequence(s, seq.int(1L, length(x), n))],
                      row = gl(n, 1L, labels = nms[[1L]])[sequence(s, 1L)], 
                      col = rep.int(gl(n, 1L, labels = nms[[2L]]), s))
    dat
    
                val row col
    1   0.426798998   A   B
    2   0.113100027   A   C
    3   0.519377903   B   C
    4  -0.039520302   A   D
    5   0.061366463   B   D
    6  -0.439964909   C   D
    7  -0.024061141   A   E
    8  -0.517137993   B   E
    9  -0.322255571   C   E
    10 -0.420533577   D   E
    11 -0.693427730   A   F
    12 -0.331961466   B   F
    13 -0.006199606   C   F
    14  0.537301520   D   F
    15 -0.367595700   E   F
    

    If you are using a version of R older than 4.0.0, where sequence is defined differently, then you’ll want something like:

    sequence <- function(nvec, from = 1L, by = 1L)
        unlist(.mapply(seq.int,
                       list(from = as.integer(from),
                            by = as.integer(by),
                            length.out = as.integer(nvec)),
                       NULL),
               recursive = FALSE, use.names = FALSE)
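
One detail that may matter when scaling the example up: for n = 50000, length(x) is 2.5e9, which exceeds .Machine$integer.max, so the linear indices into x have to be doubles. Whether sequence() accepts `from` values beyond the integer range depends on its implementation, which is not verified here. A minimal sketch that builds the same indices explicitly as doubles, assuming the same column-by-column layout as in the example above (the names `offset` and `idx` are made up):

```r
n <- 6L
s <- seq_len(n) - 1L                  # number of entries above the diagonal, per column

# Offset of each column in the underlying vector, computed in double precision
# so that (n - 1) * n cannot overflow for large n.
offset <- (seq_len(n) - 1) * n        # 0, n, 2n, ...

# Column offset repeated once per entry, plus the row number 1..(j-1) within column j.
idx <- rep.int(offset, s) + sequence(s)

idx
```

x[idx] would then select the upper-triangle values in the same order as in the example. At n = 50000, idx alone takes roughly 10 GB, so this is not free either.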
    