Context
To describe the situation I currently find myself in, here are the three tables relevant to this:
CREATE TABLE steam_user (
    id SERIAL PRIMARY KEY,
    username TEXT NOT NULL,
    steam_id TEXT NOT NULL,
    profile_url TEXT NOT NULL,
    avatar_url TEXT NOT NULL
);

CREATE TABLE achievement (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    display_name TEXT NOT NULL,
    description TEXT NOT NULL,
    icon_url TEXT DEFAULT '' NOT NULL,
    categories SMALLINT[] DEFAULT '{}' NOT NULL
);

CREATE TABLE achievements_on_steam_users (
    id SERIAL PRIMARY KEY,
    steam_user_id INTEGER NOT NULL, /* foreign key */
    achievement_id INTEGER NOT NULL, /* foreign key */
    achieved BOOLEAN NOT NULL DEFAULT FALSE,
    unlock_time TIMESTAMP NOT NULL DEFAULT '1970-01-01 00:00:00'
);
I’m pulling achievements for a specific set of games from the Steam API and then connecting them together in the achievements_on_steam_users junction table.
The problem
achievements_on_steam_users gains over 1,000 rows every time I pull achievement data for a user. This will happen once every time a new user’s data is queried on the site. I pull the state of all achievements from Steam and then save it to my database.
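Roughly, that save step boils down to an upsert along these lines (a simplified sketch rather than my exact code; it assumes a UNIQUE constraint on (steam_user_id, achievement_id), which the DDL above doesn’t declare):

/* Sketch: save the state of one achievement for one user.
   Assumes a UNIQUE constraint on (steam_user_id, achievement_id),
   which is not declared in the CREATE TABLE above. */
INSERT INTO achievements_on_steam_users (steam_user_id, achievement_id, achieved, unlock_time)
VALUES ($1, $2, $3, $4)
ON CONFLICT (steam_user_id, achievement_id)
DO UPDATE SET achieved = EXCLUDED.achieved,
              unlock_time = EXCLUDED.unlock_time;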
Use Case
One of the pages requires me to query all of these entries for one specific user. Optimally, I’d prefer to be able to JOIN this query with the achievement table. Furthermore, I would like to support the case where users can reset their achievement data (on Steam). This is why I would prefer to have rows with the achieved = FALSE state in them and just update them as I go, instead of only adding completed achievements to the list, which I’d then have to delete.
These are the two main queries that I’m worrying about:
SELECT *
FROM achievements_on_steam_users
WHERE steam_user_id = $1 /* some id */
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on achievements_on_steam_users  (cost=0.00..25.60 rows=1328 width=21)
  Filter: (steam_user_id = 1)
SELECT *
FROM achievements_on_steam_users
INNER JOIN achievement ON achievements_on_steam_users.achievement_id = achievement.id
WHERE steam_user_id = $1
QUERY PLAN
--------------------------------------------------------------------------------
Hash Right Join  (cost=42.20..107.98 rows=1328 width=282)
  Hash Cond: (achievement.id = achievements_on_steam_users.achievement_id)
  ->  Seq Scan on achievement  (cost=0.00..62.28 rows=1328 width=261)
  ->  Hash  (cost=25.60..25.60 rows=1328 width=21)
        ->  Seq Scan on achievements_on_steam_users  (cost=0.00..25.60 rows=1328 width=21)
              Filter: (steam_user_id = 1)
Keep in mind that these numbers aren’t indicative of much, since I currently only have test data (my own) in the database.
Question
Would it make sense to instead store these values in a json column directly inside the steam_user table? I would define a schema similar to:
{
    // This could be any ID from the `achievement` table
    "333": {
        "achieved": true,
        "unlockTime": "1970-01-01 00:00:00"
    }
}
For those unfamiliar with Steam achievements: the data inside the achievement table changes very rarely, and even then the IDs would never change; at most, new IDs would get added.
Is that a good idea? Maybe someone more familiar with this volume of data can tell me whether the queries would slow down considerably in this case?
For now, all of my mitigations have been on the code side, and I’ve worked around what is, as of now, a non-issue. However, I’d like to reduce the number of rows before this actually gets released publicly, since I feel the verbosity is unnecessary.
Answers
Very unlikely, unless your data is very dynamically defined and is rarely queried in its parts, only as a whole.
The JSON schema would have a far larger storage footprint. Your current schema, incidentally, can be improved by removing the unnecessary id column and making the two FKs into a composite primary key.
Your current schema takes, per row and after accounting for padding, 24 header bytes plus 24 bytes of data, for a total of 48 bytes per row, or roughly 48,000 bytes per user. Incidentally, Postgres is particularly bad for small rows; some DBMSs are much more efficient.
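In DDL terms, that suggestion would look roughly like this (a sketch; the REFERENCES clauses are spelled out here as an assumption, since your posted DDL only marks the foreign keys in comments):

CREATE TABLE achievements_on_steam_users (
    steam_user_id INTEGER NOT NULL REFERENCES steam_user (id),
    achievement_id INTEGER NOT NULL REFERENCES achievement (id),
    achieved BOOLEAN NOT NULL DEFAULT FALSE,
    unlock_time TIMESTAMP NOT NULL DEFAULT '1970-01-01 00:00:00',
    /* the composite key replaces the SERIAL id and also provides an index
       whose leading column is steam_user_id, usable by your first query */
    PRIMARY KEY (steam_user_id, achievement_id)
);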
The JSON schema, on the other hand, requires much more. It’s hard to tell without actual data, but a quick check using a fiddle gives about 70 kB for a single user. On top of that, it’s much less efficient to query a single value.
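If you want to run a similar check against your own data, something along these lines reports the size of the aggregated blob for one user (a sketch that mirrors the JSON shape you proposed):

/* Build the proposed per-user JSON blob on the fly and measure its size in bytes. */
SELECT pg_column_size(
           jsonb_object_agg(
               achievement_id::text,
               jsonb_build_object('achieved', achieved,
                                  'unlockTime', unlock_time)
           )
       ) AS blob_bytes
FROM achievements_on_steam_users
WHERE steam_user_id = 1;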
So it only really makes sense to use this type of JSON structure when the whole blob is being queried always (or almost always), not when you need individual values. Even then, it’s still against first principles of database design, because referential integrity can’t be enforced, and would only be advised if you already have JSON or if performance demanded it for some reason.
What I don’t really understand is why you need all those rows. It sounds as if you only actually need to know when that achievement_id was attained by the user, so you only need to store rows where achieved is true. Then you can have a schema like the one sketched below, or, for a JSON schema, a structure like the second sketch.
Never use an array of primary keys instead of a junction table to model an m-to-n relationship. That will perform worse, and you cannot have referential integrity.
To convince yourself, benchmark both solutions with realistic amounts of data, or just write the SQL statement that joins the tables and take a look at it. The join statement for the data model with the arrays will look complicated (see the sketch below), which usually is a bad sign.
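For illustration, here is roughly what that would look like against a hypothetical array column, say steam_user.achievement_ids INTEGER[] (not part of the schema in the question), next to the plain junction-table join:

/* Hypothetical array model: the achievement IDs live in an INTEGER[] column,
   so the elements cannot be declared as foreign keys to achievement.id. */
SELECT a.*
FROM steam_user u
CROSS JOIN LATERAL unnest(u.achievement_ids) AS t(achievement_id)
JOIN achievement a ON a.id = t.achievement_id
WHERE u.id = $1;

/* Junction-table model: a straightforward two-table join. */
SELECT a.*
FROM achievements_on_steam_users aosu
JOIN achievement a ON a.id = aosu.achievement_id
WHERE aosu.steam_user_id = $1;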