
For the past two days, I have been running into a strange problem.

STAR from https://github.com/alexdobin/STAR is a program used to build suffix array indexes. I have been using this program for years, and it worked fine until recently.

These days, whenever I run STAR, it always gets killed.

root@localhost: STAR --runMode genomeGenerate --runThreadN 10 --limitGenomeGenerateRAM 31800833920 --genomeDir star_GRCh38 --genomeFastaFiles GRCh38.fa --sjdbGTFfile GRCh38.gtf --sjdbOverhang 100
.
.
.
Killed

root@localhost: STAR --runMode genomeGenerate --runThreadN 10 --genomeDir star_GRCh38 --genomeFastaFiles GRCh38.fa --sjdbGTFfile GRCh38.gtf --sjdbOverhang 100
Jun 03 10:15:08 ..... started STAR run
Jun 03 10:15:08 ... starting to generate Genome files
Jun 03 10:17:24 ... starting to sort Suffix Array. This may take a long time...
Jun 03 10:17:51 ... sorting Suffix Array chunks and saving them to disk...
Killed

A month ago, the same command with the same inputs and the same parameters ran fine. It does use some memory, but not a lot.

I have tried three recently released versions of this program, and all of them failed. So I do not think this is a problem with the STAR program itself, but with my server configuration.

I have also tried running the program as both root and a normal user, with no luck either way.

I suspect there is some limit on memory usage on my server.

But I do not know how the memory is being limited. I wonder if someone can give me some hints.

Thanks!

Tong

Below is my debugging process and system info.

The command dmesg -T | grep -E -i -B5 'killed process' shows it is an out-of-memory problem.

But before STAR is killed, the top command shows only about 5% of memory occupied by this program.

[Mon Jun  1 23:43:00 2020] [40479]  1002 40479   101523    18680     112      487             0 /anaconda2/bin/
[Mon Jun  1 23:43:00 2020] [40480]  1002 40480   101526    18681     112      486             0 /anaconda2/bin/
[Mon Jun  1 23:43:00 2020] [40481]  1002 40481   101529    18682     112      485             0 /anaconda2/bin/
[Mon Jun  1 23:43:00 2020] [40482]  1002 40482   101531    18673     111      493             0 /anaconda2/bin/
[Mon Jun  1 23:43:00 2020] Out of memory: Kill process 33822 (STAR) score 36 or sacrifice child
[Mon Jun  1 23:43:00 2020] Killed process 33822 (STAR) total-vm:23885188kB, anon-rss:10895128kB, file-rss:4kB, shmem-rss:0kB

[Wed Jun  3 10:02:13 2020] [12296]  1002 12296   101652    18681     113      486             0 /anaconda2/bin/
[Wed Jun  3 10:02:13 2020] [12330]  1002 12330   101679    18855     112      486             0 /anaconda2/bin/
[Wed Jun  3 10:02:13 2020] [12335]  1002 12335   101688    18682     112      486             0 /anaconda2/bin/
[Wed Jun  3 10:02:13 2020] [12365]  1349 12365    30067     1262      11        0             0 bash
[Wed Jun  3 10:02:13 2020] Out of memory: Kill process 7713 (STAR) score 40 or sacrifice child
[Wed Jun  3 10:02:13 2020] Killed process 7713 (STAR) total-vm:19751792kB, anon-rss:12392428kB, file-rss:0kB, shmem-rss:0kB
--
[Wed Jun  3 10:42:17 2020] [ 4697]  1002  4697   101526    18681     112      486             0 /anaconda2/bin/
[Wed Jun  3 10:42:17 2020] [ 4698]  1002  4698   101529    18682     112      485             0 /anaconda2/bin/
[Wed Jun  3 10:42:17 2020] [ 4699]  1002  4699   101532    18680     112      487             0 /anaconda2/bin/
[Wed Jun  3 10:42:17 2020] [ 4701]  1002  4701   101534    18673     110      493             0 /anaconda2/bin/
[Wed Jun  3 10:42:17 2020] Out of memory: Kill process 21097 (STAR) score 38 or sacrifice child
[Wed Jun  3 10:42:17 2020] Killed process 21097 (STAR) total-vm:19769500kB, anon-rss:11622928kB, file-rss:884kB, shmem-rss:0kB
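
Note that in the kill lines above, STAR's anon-rss is only about 11-12 GB, which on a 251 GB machine is roughly the 5% that top reported, so the process being killed does not seem to be the one holding most of the memory. One rough way to cross-check what top shows while STAR is running is the following sketch; it assumes a single STAR process and the standard procps tools:

pid=$(pgrep -x STAR | head -n 1)
# VmSize is the virtual size, VmRSS the memory actually resident in RAM
grep -E 'VmSize|VmRSS' /proc/$pid/status
# the same numbers from ps, refreshed every 5 seconds
watch -n 5 "ps -o pid,vsz,rss,pmem,comm -p $pid"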

The command free -hl shows I have enough memory:

              total        used        free      shared  buff/cache   available
Mem:           251G         10G         11G        227G        229G         12G
Low:           251G        240G         11G
High:            0B          0B          0B
Swap:           29G         29G          0B

Also, as shown by ulimit -a, no virtual memory limit is set.

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1030545
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1030545
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Here are the versions of my CentOS and kernel (output of hostnamectl):

hostnamectl
   Static hostname: localhost.localdomain
         Icon name: computer-server
           Chassis: server
  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-514.26.2.el7.x86_64
      Architecture: x86-64

Here is the content of /etc/security/limits.conf (cat /etc/security/limits.conf):

#*               soft    core            0
#*               hard    rss             10000
#@student        hard    nproc           20
#@faculty        soft    nproc           20
#@faculty        hard    nproc           50
#ftp             hard    nproc           0
#@student        -       maxlogins       4
* soft nofile 65536
* hard nofile 65536

#@intern    hard    as 162400000
#@intern    hard    nproc 150

# End of file

As suggested, I have updated the question with the output of df -h:

Filesystem           Size  Used Avail Use% Mounted on
devtmpfs             126G     0  126G    0% /dev
tmpfs                126G  1.3M  126G    1% /dev/shm
tmpfs                126G  4.0G  122G    4% /run
tmpfs                126G     0  126G    0% /sys/fs/cgroup
/dev/mapper/cl-root  528G  271G  257G   52% /
/dev/sda1            492M  246M  246M   51% /boot
tmpfs                 26G     0   26G    0% /run/user/0
tmpfs                 26G     0   26G    0% /run/user/1002
tmpfs                 26G     0   26G    0% /run/user/1349
tmpfs                 26G     0   26G    0% /run/user/1855

$ ls -a /dev/shm/
.  ..
$ grep Shmem /proc/meminfo
Shmem:          238640272 kB

Several tmpfs mounts are sized at 126G each. I am googling this, but I am still not sure what should be done.
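
Since ls -a /dev/shm/ shows nothing while Shmem in /proc/meminfo is about 238 GB, the shared memory apparently does not live in tmpfs files but in System V shared memory segments. One way to look at them is with ipcs; this is a sketch assuming the standard util-linux ipcs:

# list System V shared memory segments with owner, size and attach count
ipcs -m
# summary of total pages allocated / resident / swapped
ipcs -m -u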

It turned out to be a shared memory problem caused by a program that terminated abnormally.

I used ipcrm to clear all the shared memory, and then STAR runs fine.

$ ipcrm
.....
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           251G         11G        226G        3.9G         14G        235G
Swap:           29G        382M         29G
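
For anyone who hits the same problem, the cleanup looks roughly like this (a sketch rather than the exact commands I ran; it assumes the stale segments belong to the current user and that no process is still attached to them):

# list the segments; nattch shows how many processes are attached
ipcs -m
# remove every shared memory segment the current user is allowed to remove
ipcrm --all=shm
# or remove a single segment by its id
ipcrm -m <shmid>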

2 Answers


  1. It looks like the problem is with shared memory: you have 227G of memory eaten up by shared objects.

    Shared memory files are persistent. Have a look in /dev/shm and any other tmpfs mounts to see if there are large files that can be removed to free up more physical memory (RAM+swap).

    $ ls -l /dev/shm
    ...
    $ df -h | grep -E '^Filesystem|^tmpfs'
    ...
    
  2. When I run a program called STAR, it will always be killed.

    It probably has some memory leak. Even old programs may have residual bugs, and they could appear in some very particular cases.

    Check with strace(1) or ltrace(1) and pmap(1). Also learn to query /proc/; see proc(5), top(1), and htop(1). See LinuxAteMyRam and read about memory over-commitment and virtual address space, and perhaps a textbook on operating systems.

    If you have access to the source code of your STAR, consider recompiling it with all warnings and debug info (with GCC, you would pass -Wall -Wextra -g to gcc or g++) then use valgrind and/or some address sanitizer. If you don’t have legal access to the source code of STAR, contact the entity (person or organization) which provided it to you.

    You might also be interested in that draft report, in the Clang static analyzer, or in Frama-C (or in writing your own GCC plugin).

    So I do not think this is a problem with the STAR program itself, but with my server configuration.

    I recommend using valgrind or gdb and inspecting /proc/ to validate that optimistic hypothesis.
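
    For example, a minimal way to follow this advice (a sketch; STAR_debug and main.cpp are placeholder names, not STAR's actual build):

    # memory map of the running STAR process (assumes a single STAR process)
    pmap -x $(pgrep -x STAR | head -n 1)
    # rebuild with warnings and debug info, then hunt for leaks with valgrind
    g++ -Wall -Wextra -g -o STAR_debug main.cpp
    valgrind --leak-check=full ./STAR_debug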
