skip to Main Content

My project https://simbiat.ru/fftracker has a table, that is a junction table containing IDs of achievements earned by each character ID. It works fine, but my concerns is its size: before optimization it was 6.1 GBs and after it – 4 GBs. And it’s not the size of the index

Data    3.1 GiB
Index   901.5   MiB
Overhead    2.5 MiB
Effective   4.0 GiB
Total   4.0 GiB

I do not see any performance impact of it now, but I am concerned about the size, especially since the table is updated frequently and grows larger and larger. Even though it has just 3 columns (definition generated by phpMyAdmin):

CREATE TABLE `ff__character_achievement` (
  `characterid` int(10) UNSIGNED NOT NULL COMMENT 'Character ID taken from Lodestone URL (https://eu.finalfantasyxiv.com/lodestone/character/characterid/)',
  `achievementid` smallint(5) UNSIGNED NOT NULL COMMENT 'Achievement ID taken from Lodestone (https://eu.finalfantasyxiv.com/lodestone/character/characterid/achievement/detail/achievementid/)',
  `time` date NOT NULL COMMENT 'Date when achievement was received according to Lodestone'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;


ALTER TABLE `ff__character_achievement`
  ADD PRIMARY KEY (`characterid`,`achievementid`) USING BTREE,
  ADD KEY `ach` (`achievementid`) USING BTREE;


ALTER TABLE `ff__character_achievement`
  ADD CONSTRAINT `char_ach_ach` FOREIGN KEY (`achievementid`) REFERENCES `ff__achievement` (`achievementid`) ON DELETE CASCADE ON UPDATE CASCADE,
  ADD CONSTRAINT `char_ach_char` FOREIGN KEY (`characterid`) REFERENCES `ff__character` (`characterid`) ON DELETE CASCADE ON UPDATE CASCADE;
COMMIT;

Considering that original tables are 408KBs and 707MBs and fill with lots of other data (varchars and text even), I would expect the junction table to be 2GBs with indexes at most. Yes, I can remove the date column, but it does not help much: 2.8 instead of 3.1GBs. Which seems weird to me, but that’s the values reported by MariaDB.
The purpose of the table is to dynamically determine the rarest achievements as seen on https://simbiat.ru/fftracker/statistics/achievements/ and also generate random list of characters that have the achievements as seen on https://simbiat.ru/fftracker/achievement/1/ (as example). I think that for this purpose 4GBs for a table is too excessive.
So my question is, whether there may be some kind of way to store the same data and easily utilize it for the same purpose, but have smaller footprint?
From what I know it may be possible to store achievement IDs for each character as a JSON string, but I will definitely lose CONSTRAINT (which I prefer to retain) and I do not see an easy to use way to search through respective JSONs, unless doing a WHERE or FULLTEXT search, which does not seem to be appropriate for this.
Any suggestion will be appreciated.

2

Answers


  1. Keep in mind storing the data as JSON will probably make it larger, not smaller.

    I ran some tests by converting a data dump from Stack Overflow into JSON data, and found it caused the size of the data to increase 2x – 3x. Integers are worse than strings in this way.

    See my presentation How to Use JSON in MySQL Wrong, or the video. This applies to MariaDB as well.

    The size of a row should be based on the size of the data types. Four bytes for INT, two bytes for SMALLINT, three bytes for DATE. There’s a bit of overhead as you know. Let’s call it 16 bytes per row. So for it to occupy 4GiB, you must have on the order of 270 million rows, right?

    That’s a fair bit of data, but it’s not a deal-breaker. At my company, we have many tables that have over a billion rows, with much larger content than a couple of integers. We start to get a bit worried if the table is over 500GiB. At that point, operations like backups or schema changes become too difficult.

    There might be ideas to shrink these rows like using a SMALLINT instead of an INT for the character id. But these ideas inevitably make your use of the data more complicated.

    You have to ask yourself if it’s going to make a difference. Even if you cut the size of the data types in half, pretty soon you’ll just accumulate twice as many rows, and then you’ll be back where you started.

    Ultimately, you just have to decide if this database is worth investing into adequate server capacity, in storage and RAM.

    Login or Signup to reply.
  2. Analysis of Size

    The 3 columns take this much for the data BTree and PRIMARY KEY:

    • 4 bytes for INT
    • 2 bytes for SMALLINT
    • 3 bytes for DATE
    • Total, including overhead, is about 40 bytes.

    Since the lone secondary index has two of those 3 columns, I would not expect it to be much smaller, so I am surprised by the ratio of 3.1 : 0.9 GiB

    3.1GiB/40B ~= 80M — Is that about how many rows in that table? So, my 40B guestimate is a bit high, but Bill’s was a bit low. Still we are all "in the same ballpark" for this kind of computation. Caveat: The 142M is actually an approximation; the real value could easily be Bill’s 270M or my 80. The real count is available only from COUNT(*) unless you are running a version of MariaDB that gives the exact number.

    The FKs may be forcing an extra index on the reference tables, but do not impact this table (except for slowing down INSERTs).

    Possible workaround

    How many different "achievements" are there? If less than 64, we can discuss the use of SET or BIGINT UNSIGNED. Such a column could be added to the player table and consume only 8 (or fewer) bytes per player. Since I don’t know what you are doing with the data, I can’t be more specific. See also 1 << n, &, BIT_COUNT().

    Please show us the SELECT that needs optimizing.

    Another approach

    Invent 3000 distinct 3-letter words. Have a TEXT column for each of the 142M rows; it would contain the relevant ‘words’. Add FULLTEXT(col) to build a fulltext index of those "words". Now searching 142M rows for one or a small number of "id" "words" would be very efficient.

    The space for the column would be 4 bytes per id per row (plus some overhead).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search