I have an infinite loop Python function for measuring how fast are SECP256K1 public Keys are generated.
The script:
from time import time
a = 0
b = 7
n = 0xfffffffffffffffffffffffffffffffebaaedce6af48a03bbfd25e8cd0364141
gx = 0x79be667ef9dcbbac55a06295ce870b07029bfcdb2dce28d959f2815b16f81798
gy = 0x483ada7726a3c4655da4fbfc0e1108a8fd17b448a68554199c47d08ffb10d4b8
prime = 2**256 - 2**32 - 977
def addition(currentX, currentY, gx, gy, a, b, prime):
if gy == 0:
return (None, None)
elif currentX is None and currentY is None:
return (gx, gy)
elif currentX == gx and currentY != gy:
return (None, None)
elif currentX == gx and currentY == gy and currentY == 0:
return (None, None)
elif currentX == gx and currentY == gy:
s1 = (3 * pow(gx, 2, prime) + a) % prime
s2 = (gy * 2) % prime
s = (s1 * pow(s2, (prime - 2), prime)) % prime
currentX = (s ** 2 - 2 * gx) % prime
currentY = (s * (gx - currentX) - gy) % prime
elif currentX != gx:
s1 = (currentY - gy)
s2 = (currentX - gx)
s = (s1 * pow(s2, (prime - 2), prime)) % prime
currentX = ((s ** 2) - gx - currentX) % prime
currentY = ((s * (gx - currentX)) - gy) % prime
return (currentX, currentY)
def secp256k1BinaryExpansion(privateKey, gx, gy, a, b, prime):
#if pow(gy, 2, prime) != (pow(gx, 3, prime) + a * gx + b) % prime:
#return "The point is not on the curve"
coef = privateKey
currentX, currentY = gx, gy
resultX, resultY = None, None
while coef:
if coef & 1:
resultX, resultY = addition(resultX, resultY, currentX, currentY, a, b, prime)
currentX, currentY = addition(currentX, currentY, currentX, currentY, a, b, prime)
coef >>= 1
return (resultX, resultY)
def testLoop(gx, gy, a, b, prime):
count = 1 #Count is the number of all calculations
counter = 0 #Counter is for measuring the speed of the function
timeOne = time()
pubX, pubY = None, None
while True:
pubX, pubY = secp256k1BinaryExpansion(count, gx, gy, a, b, prime)
#print("Case ", count,":", pubX,pubY)
count += 1
counter += 1
timeTwo = time()
if (timeTwo - timeOne) >= 10:
print("The speed is: ", counter / (timeTwo - timeOne), "c/s")
timeOne = time()
counter = 0
testLoop(gx, gy, a, b, prime)
Whenever I am launching the script on Pycharm, it outputs aroud 100 c/s on Windows and 300 c/s on Ubuntu.
When it happens, on both os, only 1 core out ouf 4 gets loaded with this task for 100%, hence only 25% of CPU power is allocated to this.
The CPU: intel core i5-4440 cpu @ 3.10ghz
I’d like to allocate 2-3 cores to the task, so it gets loaded like: 50-75%.
The truth is I’ve read documentation and watched tutorials on Python Parallelism/Multithreading and it’s confusing.
Not really sure how to allocate a single job across the cores.
May be you could help out?
2
Answers
You can use the package gmpy2 for that and it is 10 times faster (on my machine) by just changing the beginning of the code to:
The reason is that gmpy2 use GMP internally which is highly optimized in C while large Python integers are not much optimized comparatively in the CPython interpreter (though it is also implemented in C).
Since your numbers are 256-bit wise, implementing that in C/C++ using an implementation specifically optimized for 256-bit wise integers should be significantly faster.
Regarding the parallel implementation, you need to use multiprocessing to get ride of the Global Interpreter Lock (GIL) that will serialize all the computational operations in the provided code. The thing is multiprocessing require data to be pickled and to send/receive data between multiple processes which are expensive operations. Since each call to
secp256k1BinaryExpansion
takes 12 us on my machine with gmpy2, the pickling overhead can be significant. This is particularly true since gmpy2 integers should be converted to integers first, then pickled, then sent to the target process, then unpickled and then converted back to a gmpy2 integer. Not to mention the same process should be applied on the results of thesecp256k1BinaryExpansion
function, that is 2 integers. It takes about 1 us per integer on my machine without the inter-process communication. This means the maximum speed up with multiple processes is 4x (certainly significantly less in practice due to the synchronizations). Input/Output results need to be sent/received by quite large chunks so to mitigate the overheads. If you still want to do that, then you need to use amultiprocessing.Queue
to communicate data between processes in the infinite loop.I strongly advise you to use a native language like C/C++ to do that as it will not only make GMP operation faster, but also remove most of the high overhead (multiprocessing, pickling and inter-process synchronization). If you want to do that, you can write a thread pool and use atomic queues for the multi-thread communication. Working on chunks is still important in that case since the overhead of using multiple thread will still certainly be significant (especially since the
secp256k1BinaryExpansion
function should be significantly faster).as outlined by @Jérôme Richard using pure python for this is inefficient but i will just to show how this may be done in parallel, multithreading won’t benefit you here as these functions are locked by the GIL, so multithreading will run as fast as serial code, so you must use multiprocessing.
if this is only for benchmarking purpose you can pass
iterools.repeat
to produce an infinite iterator as an input argument toimap_unordered
from multiprocessing.pool, now the input has to be a tuple, sinceimap_unordered
needs the input to be 1 argument, we just unpack the input inside the function, then in the return just unpack your returns in a loop from the returns_itertor.changing the
number_of_cores
will make your pc use more cores, and increase the throughput, but in production you might want to use C++ or cython with gmpy2 or another library for arithmetic as outlined in the other answer, playing with thechunk_size
is also beneficial.the reason why this has almost 10 times the speed you used to see is because this is actually using 3 cores, 1 for checking the time and incrementing the counters and 2 cores for doing the actual maths … those two cores are operating as fast as python can work unhindered by the loop overhead.