How to add multihreading to a continuous Python function? - Ubuntu

EmmaLincoln
November 6, 2022
149 views
0 votes
2 Answers

I have an infinite loop Python function for measuring how fast are SECP256K1 public Keys are generated.

The script:

from time import time

a = 0
b = 7
n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141
gx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
gy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
prime = 2**256 - 2**32 - 977

def addition(currentX, currentY, gx, gy, a, b, prime):
    if gy == 0:
        return (None, None)
    elif currentX is None and currentY is None:
        return (gx, gy)
    elif currentX == gx and currentY != gy:
        return (None, None)
    elif currentX == gx and currentY == gy and currentY == 0:
        return (None, None)
    elif currentX == gx and currentY == gy:
        s1 = (3 * pow(gx, 2, prime) + a) % prime
        s2 = (gy * 2) % prime
        s = (s1 * pow(s2, (prime - 2), prime)) % prime
        currentX = (s ** 2 - 2 * gx) % prime
        currentY = (s * (gx - currentX) - gy) % prime
    elif currentX != gx:
        s1 = (currentY - gy)
        s2 = (currentX - gx)
        s = (s1 * pow(s2, (prime - 2), prime)) % prime
        currentX = ((s ** 2) - gx - currentX) % prime
        currentY = ((s * (gx - currentX)) - gy) % prime

    return (currentX, currentY)


def secp256k1BinaryExpansion(privateKey, gx, gy, a, b, prime):
    #if pow(gy, 2, prime) != (pow(gx, 3, prime) + a * gx + b) % prime:
        #return "The point is not on the curve"
    coef = privateKey
    currentX, currentY = gx, gy
    resultX, resultY = None, None
    while coef:
        if coef & 1:
            resultX, resultY = addition(resultX, resultY, currentX, currentY, a, b, prime)
        currentX, currentY = addition(currentX, currentY, currentX, currentY, a, b, prime)
        coef >>= 1
    return (resultX, resultY)


def testLoop(gx, gy, a, b, prime):
    count = 1 #Count is the number of all calculations
    counter = 0 #Counter is for measuring the speed of the function
    timeOne = time()
    pubX, pubY = None, None
    while True:
        pubX, pubY = secp256k1BinaryExpansion(count, gx, gy, a, b, prime)
        #print("Case ", count,":", pubX,pubY)
        count += 1
        counter += 1
        timeTwo = time()
        if (timeTwo - timeOne) >= 10:
            print("The speed is: ", counter / (timeTwo - timeOne), "c/s")
            timeOne = time()
            counter = 0

testLoop(gx, gy, a, b, prime)

Whenever I am launching the script on Pycharm, it outputs aroud 100 c/s on Windows and 300 c/s on Ubuntu.

When it happens, on both os, only 1 core out ouf 4 gets loaded with this task for 100%, hence only 25% of CPU power is allocated to this.
The CPU: intel core i5-4440 cpu @ 3.10ghz

I’d like to allocate 2-3 cores to the task, so it gets loaded like: 50-75%.

The truth is I’ve read documentation and watched tutorials on Python Parallelism/Multithreading and it’s confusing.

Not really sure how to allocate a single job across the cores.

May be you could help out?

Answers

- J233r244meRichard
- November 6, 2022 at 3:12 pm
- 0 votes
0
You can use the package gmpy2 for that and it is 10 times faster (on my machine) by just changing the beginning of the code to:
```
import gmpy2

a = gmpy2.mpz(0)
b = gmpy2.mpz(7)
n = gmpy2.mpz(0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141)
gx = gmpy2.mpz(0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798)
gy = gmpy2.mpz(0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8)
prime = gmpy2.mpz(2**256 - 2**32 - 977)
```
The reason is that gmpy2 use GMP internally which is highly optimized in C while large Python integers are not much optimized comparatively in the CPython interpreter (though it is also implemented in C).

Since your numbers are 256-bit wise, implementing that in C/C++ using an implementation specifically optimized for 256-bit wise integers should be significantly faster.

Regarding the parallel implementation, you need to use multiprocessing to get ride of the Global Interpreter Lock (GIL) that will serialize all the computational operations in the provided code. The thing is multiprocessing require data to be pickled and to send/receive data between multiple processes which are expensive operations. Since each call to secp256k1BinaryExpansion takes 12 us on my machine with gmpy2, the pickling overhead can be significant. This is particularly true since gmpy2 integers should be converted to integers first, then pickled, then sent to the target process, then unpickled and then converted back to a gmpy2 integer. Not to mention the same process should be applied on the results of the secp256k1BinaryExpansion function, that is 2 integers. It takes about 1 us per integer on my machine without the inter-process communication. This means the maximum speed up with multiple processes is 4x (certainly significantly less in practice due to the synchronizations). Input/Output results need to be sent/received by quite large chunks so to mitigate the overheads. If you still want to do that, then you need to use a multiprocessing.Queue to communicate data between processes in the infinite loop.

I strongly advise you to use a native language like C/C++ to do that as it will not only make GMP operation faster, but also remove most of the high overhead (multiprocessing, pickling and inter-process synchronization). If you want to do that, you can write a thread pool and use atomic queues for the multi-thread communication. Working on chunks is still important in that case since the overhead of using multiple thread will still certainly be significant (especially since the secp256k1BinaryExpansion function should be significantly faster).
Login or Signup to reply.

as outlined by @Jérôme Richard using pure python for this is inefficient but i will just to show how this may be done in parallel, multithreading won’t benefit you here as these functions are locked by the GIL, so multithreading will run as fast as serial code, so you must use multiprocessing.

if this is only for benchmarking purpose you can pass iterools.repeat to produce an infinite iterator as an input argument to imap_unordered from multiprocessing.pool, now the input has to be a tuple, since imap_unordered needs the input to be 1 argument, we just unpack the input inside the function, then in the return just unpack your returns in a loop from the returns_itertor.

from time import time
import itertools
import multiprocessing
a = 0
b = 7
n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141
gx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
gy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
prime = 2**256 - 2**32 - 977

def addition(currentX, currentY, gx, gy, a, b, prime):
    if gy == 0:
        return (None, None)
    elif currentX is None and currentY is None:
        return (gx, gy)
    elif currentX == gx and currentY != gy:
        return (None, None)
    elif currentX == gx and currentY == gy and currentY == 0:
        return (None, None)
    elif currentX == gx and currentY == gy:
        s1 = (3 * pow(gx, 2, prime) + a) % prime
        s2 = (gy * 2) % prime
        s = (s1 * pow(s2, (prime - 2), prime)) % prime
        currentX = (s ** 2 - 2 * gx) % prime
        currentY = (s * (gx - currentX) - gy) % prime
    elif currentX != gx:
        s1 = (currentY - gy)
        s2 = (currentX - gx)
        s = (s1 * pow(s2, (prime - 2), prime)) % prime
        currentX = ((s ** 2) - gx - currentX) % prime
        currentY = ((s * (gx - currentX)) - gy) % prime

    return (currentX, currentY)


def secp256k1BinaryExpansion(input_tuple):
    privateKey, gx, gy, a, b, prime = input_tuple
    #if pow(gy, 2, prime) != (pow(gx, 3, prime) + a * gx + b) % prime:
        #return "The point is not on the curve"
    coef = privateKey
    currentX, currentY = gx, gy
    resultX, resultY = None, None
    while coef:
        if coef & 1:
            resultX, resultY = addition(resultX, resultY, currentX, currentY, a, b, prime)
        currentX, currentY = addition(currentX, currentY, currentX, currentY, a, b, prime)
        coef >>= 1
    return (resultX, resultY)


def testLoop(gx, gy, a, b, prime):
    count = 1 #Count is the number of all calculations
    counter = 0 #Counter is for measuring the speed of the function
    timeOne = time()
    pubX, pubY = None, None
    input_tuple = (count, gx, gy, a, b, prime)
    infinite_iterator = itertools.repeat(input_tuple)
    number_of_cores = 2
    chunk_size = 10
    with multiprocessing.Pool(number_of_cores) as pool:
        returns_iterator = pool.imap_unordered(secp256k1BinaryExpansion,infinite_iterator,chunksize=chunk_size)
        for item in returns_iterator:
            pubX, pubY = item  # unpack the return
            #print("Case ", count,":", pubX,pubY)
            count += 1
            counter += 1
            timeTwo = time()
            if (timeTwo - timeOne) >= 10:
                print("The speed is: ", counter / (timeTwo - timeOne), "c/s")
                timeOne = time()
                counter = 0

if __name__ == "__main__":  # must be used on windows
    testLoop(gx, gy, a, b, prime)

The speed is:  10040.80010823988 c/s

changing the number_of_cores will make your pc use more cores, and increase the throughput, but in production you might want to use C++ or cython with gmpy2 or another library for arithmetic as outlined in the other answer, playing with the chunk_size is also beneficial.

the reason why this has almost 10 times the speed you used to see is because this is actually using 3 cores, 1 for checking the time and incrementing the counters and 2 cores for doing the actual maths … those two cores are operating as fast as python can work unhindered by the loop overhead.

Please signup or login to give your own answer.

Click here to cancel reply.

How to add multihreading to a continuous Python function? – Ubuntu

Answers