skip to Main Content

Working on Debian 11 (Bullseye) first with the distribution’s MariaDB version 10.5 and now with the version 10.6.7 from MariaDB repositories.

I’m failing to get correct indexes for some big tables from the dump of a genetics database (ensembl homo_sapiens_variation_106_37) from here: ftp://ftp.ensembl.org/pub/grch37/release-106/mysql/homo_sapiens_variation_106_37/

The one table is variation_feature:

CREATE TABLE `variation_feature` (
  `variation_feature_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `seq_region_id` int(10) unsigned NOT NULL,
  `seq_region_start` int(11) NOT NULL,
  `seq_region_end` int(11) NOT NULL,
  `seq_region_strand` tinyint(4) NOT NULL,
  `variation_id` int(10) unsigned NOT NULL,
  `allele_string` varchar(50000) DEFAULT NULL,
  `ancestral_allele` varchar(50) DEFAULT NULL,
  `variation_name` varchar(255) DEFAULT NULL,
  `map_weight` int(11) NOT NULL,
  `flags` set('genotyped') DEFAULT NULL,
  `source_id` int(10) unsigned NOT NULL,
  `consequence_types` set('intergenic_variant','splice_acceptor_variant','splice_donor_variant','stop_lost','coding_sequence_variant','missense_variant','stop_gained','synonymous_variant','frameshift_variant','non_coding_transcript_variant','non_coding_transcript_exon_variant','mature_miRNA_variant','NMD_transcript_variant','5_prime_UTR_variant','3_prime_UTR_variant','incomplete_terminal_codon_variant','intron_variant','splice_region_variant','downstream_gene_variant','upstream_gene_variant','start_lost','stop_retained_variant','inframe_insertion','inframe_deletion','transcript_ablation','transcript_fusion','transcript_amplification','transcript_translocation','TFBS_ablation','TFBS_fusion','TFBS_amplification','TFBS_translocation','regulatory_region_ablation','regulatory_region_fusion','regulatory_region_amplification','regulatory_region_translocation','feature_elongation','feature_truncation','regulatory_region_variant','TF_binding_site_variant','protein_altering_variant','start_retained_variant') NOT NULL DEFAULT 'intergenic_variant',
  `variation_set_id` set('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24','25','26','27','28','29','30','31','32','33','34','35','36','37','38','39','40','41','42','43','44','45','46','47','48','49','50','51','52','53','54','55','56','57','58','59','60','61','62','63','64') NOT NULL DEFAULT '',
  `class_attrib_id` int(10) unsigned DEFAULT '0',
  `somatic` tinyint(1) NOT NULL DEFAULT '0',
  `minor_allele` varchar(50) DEFAULT NULL,
  `minor_allele_freq` float DEFAULT NULL,
  `minor_allele_count` int(10) unsigned DEFAULT NULL,
  `alignment_quality` double DEFAULT NULL,
  `evidence_attribs` set('367','368','369','370','371','372','418','421','573','585') DEFAULT NULL,
  `clinical_significance` set('uncertain significance','not provided','benign','likely benign','likely pathogenic','pathogenic','drug response','histocompatibility','other','confers sensitivity','risk factor','association','protective','affects') DEFAULT NULL,
  `display` int(1) DEFAULT '1',
  PRIMARY KEY (`variation_feature_id`),
  KEY `pos_idx` (`seq_region_id`,`seq_region_start`,`seq_region_end`),
  KEY `variation_idx` (`variation_id`),
  KEY `variation_set_idx` (`variation_set_id`),
  KEY `consequence_type_idx` (`consequence_types`),
  KEY `source_idx` (`source_id`)
) ENGINE=MyISAM AUTO_INCREMENT=743963234 DEFAULT CHARSET=latin1;

It has over 700,000,000 records and occupies on the disk:

# ls -lh variation_feature.*
-rw-rw---- 1 mysql mysql 56K Mai  3 09:41 variation_feature.frm
-rw-rw---- 1 mysql mysql 55G Mai  2 20:44 variation_feature.MYD
-rw-rw---- 1 mysql mysql 61G Mai  2 22:27 variation_feature.MYI

Despite not getting any errors importing the variation_feature.txt some essential indexes are not working.
In this case, selecting a known row of data based on variation_id won’t return anything, e.g.

SELECT * 
FROM variation_feature 
WHERE variation_id = 617544728;

--> nothing

The value 617544728 seems not to be in the index, because

SELECT variation_id 
FROM variation_feature 
WHERE variation_id = 617544728;

--> nothing

Disabling the index and waiting for the long table scan returns the row:

ALTER TABLE variation_feature ALTER INDEX variation_idx IGNORED;

SELECT * 
FROM variation_feature 
WHERE variation_id = 617544728;

variation_feature_id  seq_region_id  seq_region_start  seq_region_end  seq_region_strand  variation_id  allele_string  ancestral_allele  variation_name  map_weight  flags   source_id  consequence_types  variation_set_id                                                                             class_attrib_id  somatic  minor_allele  minor_allele_freq  minor_allele_count  alignment_quality  evidence_attribs                 clinical_significance  display
--------------------  -------------  ----------------  --------------  -----------------  ------------  -------------  ----------------  --------------  ----------  ------  ---------  -----------------  -------------------------------------------------------------------------------------------  ---------------  -------  ------------  -----------------  ------------------  -----------------  -------------------------------  ---------------------  -------
           632092737          27511         230845794       230845794                  1     617544728  A/G            G                 rs699                    1  <null>          1  missense_variant   2,5,6,9,10,11,12,13,15,16,17,23,24,25,26,30,40,42,43,44,45,47,48,49,50,51,52,53,54,55,56,57                2    false  A                        0.2949                1477             <null>  368,370,371,372,418,421,573,585  benign,risk factor           1

myisamchk is fixing the indexes without error, but the index "variation_idx" won’t work.

DROPping and re-CREATing the one index runs without error, but the index won’t work.

The other indexes are OK.

In another genome of this database (ensembl homo_sapiens_variation_106_38 – slightly bigger) from here: ftp://ftp.ensembl.org/pub/release-106/mysql/homo_sapiens_variation_106_38/ I have the same problem (on another computer but with the same program installations).
With one difference: there is also the PRIMARY KEY (variation_feature_id) not working.

myisamchk is also running without error, but to no avail.

mysqlcheck (version 10.6 running very slow compared to 10.5) returns on the first computer then:

homo_sapiens_variation_106_37.variation_feature
error    : Key in wrong position at page 22405134336
error    : Corrupt

Now, this we know, but no repair tool can really repair or give a hint, what’s wrong.

I’ve CREATEd an index on variation_name: it’s working.

My changes to mariadb.cnf to adapt to the huge databases and the mysql versions of ensembl:

[mysqld]

## this was some time ago - because some bug mysql or mariadb didn't took this from the system
default-time-zone = Europe/Berlin

## ensembl has mysql version 5.6
## because of table creation scripts I need compatibility:
show_compatibility_56 = ON
performance_schema

## ensembl writes in DATETIME fields: "0000-00-00 00:00:00"
## the new sql_mode in 5.7 doesn't allow it any more
## SHOW VARIABLES LIKE 'sql_mode' ;
## ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
## so change sql_mode deleting NO_ZERO_IN_DATE,NO_ZERO_DATE
sql_mode = ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION

datadir = /mnt/SRVDATA/var/lib/mysql
tmpdir  = /mnt/WORK/tmp

[mariadb]

[mariadb-10.6]

## MyISAM for building ensembl homo_sapiens
lower_case_table_names=1

bulk_insert_buffer_size = 1G
myisam_sort_buffer_size = 56G
sort_buffer_size = 56G

Thank you for the patience to read all this.


Edit, unresolved, supplementary information, March 3, 2023

10 months later I’m bumping this with supplementary information.

I’m working now on a Workstation with 2 Xeons and 96 GB ECC RAM. The hard drives are new.

MariaDB is on Debian 11 (bullseye) in the version: "Ver 15.1 Distrib 10.9.5-MariaDB, for debian-linux-gnu (x86_64) using readline EditLine wrapper"

The database I need is already three versions further and the files can be found under ftp://ftp.ensembl.org/pub/release-109/mysql/homo_sapiens_variation_109_38/

I encounter the same issue trying to load the variation_feature.txt (which is 103G big and contains 747758162 lines/records).

I tried both MyISAM and ARIA engines with the same described effect.

Despite "LOAD DATA LOCAL INFILE", myisamchk and aria_chk running and ending with success, I can select a record based on a condition on the PRIMARY KEY or variation_idx only if I disable the indexes (and wait a long time).

If I run a mysqlchk the table will be marked as faulty with a key in wrong index position.

The same cycle all over again: loading OK, mysqlchk NOT OK, repairing OK, mysqlchk NOT OK!

Following the recommendation to use InnoDB I started the "LOAD DATA LOCAL INFILE" with "set unique_checks=0" 7 days ago (is still running), having the following customized innodb configs:

innodb_file_per_table = ON
innodb_buffer_pool_size = 76G
innodb_log_buffer_size = 1G
innodb_log_file_size = 2G
innodb_write_io_threads = 16

top shows MariaDB has reserved 83.4Gb virtual resources and filled actually 79Gb IN RAM but uses only ~2% CPU.
iotop shows a poor disk activity of ~0.5-2Mb/sec.

Thank you for any hint!

2

Answers


  1. sort_buffer_size should be limited to 1% of RAM. The huge setting you have could cause swapping, which is terrible for performance. Ditto for myisam_sort_buffer_size, though it it probably not relevant to the Question.

    You should consider moving from MyISAM to InnoDB, if for no other reason than avoiding corrupt indexes. It will, however, increase the disk footprint (data + indexes) to upwards of 300GB.

    InnoDB is much better at concurrent access to huge tables.

    Login or Signup to reply.
  2. I think the advice of Rick James, to use InnoDB, was a good one, but it shouldn’t take 7 days, of course.

    To speeds up this process you can start with a table which doesn’t have any indexes at all. Building the indexes is one of the reasons this process is so slow.

    700,000,000 records is also a lot. You could split this into more manageable chucks. Say 700 files of 1 million records. See the split command. Writing a script that will load so many files will be essential.

    After loading these files into a database table you can create the indexes one by one.

    This is basically a "Divide and conquer" strategy. That being said, you are pushing the boundaries of what these databases normally have to cope with. My databases do normally grow to a few gigs, but rarely on a single table. I always try to keep my databases within reasonable size limits.

    If my tables get really big, I use partitioning to make them more efficient. This is a "Divide and conquer" strategy build into MySQL. Note that you can partition your table before you’ve created all the indexes.

    Finally, you could ask yourself if it is really necessary to store everything in a single table. Especially the allele_string varchar(50000) column could make your table very big. Why not put this in another table, or several other tables? If you create a script to load multiple smaller CSV files, you can also use that to disperse incoming data among several tables.

    I’m sorry, but I have no real answer as to what’s going wrong with your current setup. Everything tells me that it could work, but somehow it doesn’t.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search