
CONTEXT:

I have a MySQL 8 table as follows.

id  |   field1  | field2    |   ...... | field10    | timeStamp
_______________________________________________________________

The fieldN columns are FLOAT. id is unique and indexed. The values going into this table are generated by distributed program nodes on a network; each node produces data for one or more fields. Altogether I have around 35,000 nodes generating data every minute, which works out to about 600 updates per second.
(Note: it started as a much smaller set of nodes and has slowly kept growing.) Currently, for each request coming from a node, I run an UPDATE query like this:

Node 1 eg: UPDATE table set field1=24.5, field6=44.6,timeStamp=now() where id=1
Node 2 eg: UPDATE table set field3=53.3, field9=4.2, field10=98.8, timeStamp=now() where id=2
Node 3 eg: UPDATE table set field1=55.5, field8=7.2, field9=8.8, timeStamp=now() where id=3

etc. So which fields, and how many of them, get updated depends on the node, as shown above: some have data for two fields, some for just one, some for five or six, and so on.

TASK AT HAND:

To reduce the number of DB connections/queries (currently one per node per minute), I am putting things together to use Kafka to collect all these updates and then run them in batches instead of as individual queries.

Once the data from the various nodes is in Kafka, I am not sure of the best way to proceed. I did some reading and came across the following options.
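Whichever SQL option is chosen, the Kafka-side consumer first has to fold the per-node messages into one batch. As a sketch, assuming each message is a JSON payload like `{"id": 1, "field1": 24.5}` (the payload schema is my assumption, not something fixed above):

```python
import json

def merge_updates(messages):
    """Merge raw per-node messages into one dict per node id.

    Each message is assumed to be a JSON object carrying "id" plus
    whatever fieldN values that node reported. If a node reports the
    same field twice within a batch, the later value wins.
    """
    batch = {}
    for raw in messages:
        msg = json.loads(raw)
        node_id = msg.pop("id")
        batch.setdefault(node_id, {}).update(msg)
    return batch

# In the real consumer, the `messages` list would be collected from
# e.g. KafkaConsumer.poll() and flushed every few seconds.
```

The resulting `{node_id: {field: value}}` mapping is a convenient input for any of the batching approaches below.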

1. Transactions
2. CASE
3. Multi-query

I run everything in Docker. I ran tests based on the test script at https://github.com/yanxurui/keepcoding/blob/master/php/mysql/multiple_update.php and found that the CASE method seems to be the fastest overall. Multi-query shows the lowest execution time from the client's point of view, but it appears to just dump all the queries into MySQL, which then takes much longer to actually finish them (as observed via docker stats). The CASE method is faster end to end.

MY SOLUTION

After considering various approaches, here is what I am thinking.

UPDATE table set 
    field1= CASE    when id=1 THEN 24.5 
                    when id=3 THEN 55.5
                    else field1
            END,
    field3= CASE    when id=2 THEN 53.3
                    else field3
            END,
    etc..,
    timeStamp=now() 
    WHERE id IN (1,2,3,...)
    

In other words, for each field in the table I have to iterate through the list of nodes to see whether they reported a value for that field, and so on.
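That iteration can be sketched as a small query builder. This is an illustrative sketch (function name and batch shape are mine, not from the question); values go in as `%s` bind parameters, and only table/column identifiers are interpolated:

```python
def build_case_update(table, batch, all_fields):
    """Build one batched UPDATE ... CASE statement plus bind parameters.

    `batch` maps node id -> {field: value}; `all_fields` lists the
    updatable columns (field1..field10 here). Only fields that at
    least one node reported get a CASE clause; unreported fields fall
    back to their current value via the ELSE branch.
    """
    ids = sorted(batch)
    set_clauses, params = [], []
    for field in all_fields:
        whens = []
        for node_id in ids:
            if field in batch[node_id]:
                whens.append("WHEN id=%s THEN %s")
                params.extend([node_id, batch[node_id][field]])
        if whens:
            set_clauses.append(
                f"{field} = CASE {' '.join(whens)} ELSE {field} END")
    set_clauses.append("timeStamp = NOW()")
    params.extend(ids)
    id_placeholders = ", ".join(["%s"] * len(ids))
    sql = (f"UPDATE {table} SET " + ", ".join(set_clauses)
           + f" WHERE id IN ({id_placeholders})")
    return sql, params
```

The `(sql, params)` pair can then be handed to `cursor.execute(sql, params)` on any DB-API driver.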

QUESTIONS

  1. Am I right in selecting the CASE method? Or did I miss anything in the tests I did?
  2. If the CASE method is the right way to go, is my solution the right way to frame the UPDATE query?

2 Answers


  1. Save the data to be used for the UPDATE into a temporary table of the same structure (if a column should be updated, it contains the corresponding value; otherwise NULL). Then use

    UPDATE maintable
      JOIN temptable USING (id)
      SET maintable.column1 = COALESCE(temptable.column1, maintable.column1),
    --    ...
          maintable.column10 = COALESCE(temptable.column10, maintable.column10),
          `timestamp` = NOW();
    

    I’d also recommend altering the table to set the ON UPDATE CURRENT_TIMESTAMP option on the timestamp column, and removing the explicit update of this column from the query.
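    A sketch of how the client side of this answer could look, assuming a temporary table created with `CREATE TEMPORARY TABLE temptable LIKE maintable` (the helper name and batch shape are illustrative, not part of the answer):

```python
def temp_table_update_statements(batch, all_fields, table="maintable"):
    """Return (insert_sql, rows, update_sql) for the temp-table approach.

    `batch` maps node id -> {field: value}. Fields a node did not
    report are sent as NULL, so COALESCE in the UPDATE keeps the
    existing value for them.
    """
    cols = ["id"] + all_fields
    insert_sql = (f"INSERT INTO temptable ({', '.join(cols)}) VALUES "
                  f"({', '.join(['%s'] * len(cols))})")
    rows = [[node_id] + [batch[node_id].get(f) for f in all_fields]
            for node_id in sorted(batch)]
    sets = ", ".join(
        f"{table}.{f} = COALESCE(temptable.{f}, {table}.{f})"
        for f in all_fields)
    update_sql = (f"UPDATE {table} JOIN temptable USING (id) "
                  f"SET {sets}, `timestamp` = NOW()")
    return insert_sql, rows, update_sql
```

    Typical use would be `cursor.executemany(insert_sql, rows)`, then `cursor.execute(update_sql)`, then `TRUNCATE temptable` before the next batch.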

  2. Have you tried something like this (with ON DUPLICATE KEY UPDATE)?

    INSERT INTO table (id,field1,field2,...,field10) VALUES 
    (1, 24.5,null,null,null,..null),
    (2, null,null,null,null,..98.8),
    (3, 55.5,null,null,null,..8.8)
        ON DUPLICATE KEY UPDATE 
      field1=IF(VALUES(field1) IS NULL, field1, VALUES(field1)), 
      field2=IF(VALUES(field2) IS NULL, field2, VALUES(field2)),
      ...
      field10=IF(VALUES(field10) IS NULL, field10, VALUES(field10)),
      `timestamp` = NOW();
    
    (Note the `IS NULL` test: `VALUES(field1) = NULL` always evaluates to NULL, which IF() treats as false, so the old value would never be kept.)
    
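    A sketch of building that multi-row upsert with bind parameters instead of literal NULLs (helper name and batch shape are mine; `COALESCE(VALUES(f), f)` is an equivalent shorthand for the IF() test above — note that `VALUES()` in this context still works on MySQL 8.0 but is deprecated from 8.0.20 in favour of row aliases):

```python
def build_upsert(batch, all_fields, table="mytable"):
    """Build a multi-row INSERT ... ON DUPLICATE KEY UPDATE statement.

    `batch` maps node id -> {field: value}. Unreported fields become
    NULL parameters, and COALESCE(VALUES(f), f) keeps the old value
    for them on the update path.
    """
    cols = ["id"] + all_fields
    row_ph = "(" + ", ".join(["%s"] * len(cols)) + ")"
    params = []
    for node_id in sorted(batch):
        params.extend([node_id] + [batch[node_id].get(f) for f in all_fields])
    updates = ", ".join(
        f"{f} = COALESCE(VALUES({f}), {f})" for f in all_fields)
    sql = (f"INSERT INTO {table} ({', '.join(cols)}) VALUES "
           + ", ".join([row_ph] * len(batch))
           + f" ON DUPLICATE KEY UPDATE {updates}, `timestamp` = NOW()")
    return sql, params
```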