
I’m writing code that will subtract corresponding bytes in two arrays and count the number of resulting bytes surpassing a given threshold. AFAIU, it would really benefit from .NET SIMD, but System.Numerics.Vector.IsHardwareAccelerated returns false when I compile C# on Raspberry Pi 4.

My dotnet version is 3.1.406, I’ve added

  <PropertyGroup>
    <Optimize>true</Optimize>
  </PropertyGroup>

to the csproj, and I am running the Release configuration.

Is there any way I can leverage SIMD support in .NET on Raspberry Pi 4? Maybe with .NET 5?

Update
I installed .NET 5 and tried .NET Intrinsics, but none is supported:

Console.WriteLine(System.Runtime.Intrinsics.Arm.AdvSimd.IsSupported); //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.Aes.IsSupported);  //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.ArmBase.IsSupported); //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.Crc32.IsSupported); //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.Dp.IsSupported); //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.Rdm.IsSupported); //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.Sha1.IsSupported); //false
Console.WriteLine(System.Runtime.Intrinsics.Arm.Sha256.IsSupported); //false

I’m on 32-bit Raspbian (Debian derivative), is there any chance I need 64-bit version for this to work?

P.S. To clarify, in plain C# the algorithm looks like this:

        public static int ScalarTest(byte[] lhs, byte[] rhs)
        {
            var result = 0;

            for (int index = 0; index < lhs.Length; index++)
            {
                var a = lhs[index];
                var b = rhs[index];
                if (b > a)
                {
                    (b, a) = (a, b);
                }
                result += ((a - b) >= 16) ? 1 : 0;
            }

            return result;
        }

2 Answers


  1. Chosen as BEST ANSWER

    Following @Soonts's answer, I switched to 64-bit Raspbian. Here is what I got in .NET 5; most of the instructions I'm looking for are supported.

    Console.WriteLine(System.Runtime.InteropServices.RuntimeInformation.OSDescription);
    //Linux 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020
    
    Console.WriteLine(System.Runtime.InteropServices.RuntimeInformation.ProcessArchitecture);
    //Arm64
    
    Console.WriteLine(System.Environment.Is64BitOperatingSystem);           //true
    
    Console.WriteLine(System.Numerics.Vector.IsHardwareAccelerated);        //true
    Console.WriteLine(Vector<byte>.Count);                                  //16
    Console.WriteLine(Vector<sbyte>.Count);                                 //16
    Console.WriteLine(Vector<short>.Count);                                 //8
    Console.WriteLine(Vector<ushort>.Count);                                //8
    Console.WriteLine(Vector<int>.Count);                                   //4
    Console.WriteLine(Vector<uint>.Count);                                  //4
    Console.WriteLine(Vector<long>.Count);                                  //2
    Console.WriteLine(Vector<ulong>.Count);                                 //2
    
    Console.WriteLine(Vector<float>.Count);                                 //4
    Console.WriteLine(Vector<double>.Count);                                //2
    
    Console.WriteLine(System.Runtime.Intrinsics.Arm.AdvSimd.IsSupported);   //true
    Console.WriteLine(System.Runtime.Intrinsics.Arm.Aes.IsSupported);       //false
    Console.WriteLine(System.Runtime.Intrinsics.Arm.ArmBase.IsSupported);   //true
    Console.WriteLine(System.Runtime.Intrinsics.Arm.Crc32.IsSupported);     //true
    Console.WriteLine(System.Runtime.Intrinsics.Arm.Dp.IsSupported);        //false
    Console.WriteLine(System.Runtime.Intrinsics.Arm.Rdm.IsSupported);       //false
    Console.WriteLine(System.Runtime.Intrinsics.Arm.Sha1.IsSupported);      //false
    Console.WriteLine(System.Runtime.Intrinsics.Arm.Sha256.IsSupported);    //false
    

    After implementing the algorithm, which compares two byte arrays and counts the elements whose absolute difference exceeds a certain threshold, I got the following benchmark measurements on my Pi 4 (average of 3 runs after warmup):

    C# Loop:

    59ms

    System.Numerics.Vector:

    21ms

    System.Runtime.Intrinsics.Arm.AdvSimd:

    17ms

    System.Runtime.Intrinsics.Arm.AdvSimd with optimized vector creation from https://gist.github.com/IKoshelev/325f0e10bee0806d7bb2c9d63d09ba9e

    2ms !!!
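
For readers who want to see what the two vectorized variants might look like, here is a minimal sketch. It is not the code that was benchmarked above and it does not reproduce the approach from the linked gist; the class and method names (`DiffCounter`, `CountWithVector`, `CountWithAdvSimd`) are hypothetical, and reinterpreting the input with `MemoryMarshal.Cast` is my own assumption about one way to avoid per-iteration vector construction. The AdvSimd path only runs where `AdvSimd.Arm64.IsSupported` is true and otherwise falls through to the scalar loop.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class DiffCounter
{
    // Portable variant: the runtime picks the width of Vector<byte>.
    public static int CountWithVector(byte[] lhs, byte[] rhs, byte threshold = 16)
    {
        if (lhs.Length != rhs.Length) throw new ArgumentException("Arrays must have equal length.");
        int width = Vector<byte>.Count, index = 0, result = 0, pending = 0;
        var limit = new Vector<byte>(threshold);
        var acc = Vector<byte>.Zero;
        for (; index <= lhs.Length - width; index += width)
        {
            var a = new Vector<byte>(lhs, index);
            var b = new Vector<byte>(rhs, index);
            var diff = Vector.Max(a, b) - Vector.Min(a, b);     // |a - b| without underflow
            var mask = Vector.GreaterThanOrEqual(diff, limit);  // 0xFF where diff >= threshold
            acc += mask & Vector<byte>.One;                     // per-lane hit counters
            // Flush before any 8-bit lane counter can overflow.
            if (++pending == 255) { result += SumLanes(acc); acc = Vector<byte>.Zero; pending = 0; }
        }
        result += SumLanes(acc);
        return result + CountScalar(lhs, rhs, index, threshold);
    }

    // ARM64 variant using AdvSimd intrinsics, with a scalar fallback elsewhere.
    public static int CountWithAdvSimd(ReadOnlySpan<byte> lhs, ReadOnlySpan<byte> rhs, byte threshold = 16)
    {
        if (lhs.Length != rhs.Length) throw new ArgumentException("Spans must have equal length.");
        int index = 0, result = 0, pending = 0;
        if (AdvSimd.Arm64.IsSupported)
        {
            // Assumption: view the bytes as 16-byte vectors instead of building each vector.
            ReadOnlySpan<Vector128<byte>> a = MemoryMarshal.Cast<byte, Vector128<byte>>(lhs);
            ReadOnlySpan<Vector128<byte>> b = MemoryMarshal.Cast<byte, Vector128<byte>>(rhs);
            var limit = Vector128.Create(threshold);
            var acc = Vector128<byte>.Zero;
            for (int block = 0; block < a.Length; block++)
            {
                var diff = AdvSimd.AbsoluteDifference(a[block], b[block]);
                var mask = AdvSimd.CompareGreaterThanOrEqual(diff, limit);
                acc = AdvSimd.Subtract(acc, mask);              // 0xFF is -1, so this adds 1 per hit
                if (++pending == 255)
                {
                    result += AdvSimd.Arm64.AddAcrossWidening(acc).ToScalar();
                    acc = Vector128<byte>.Zero;
                    pending = 0;
                }
            }
            result += AdvSimd.Arm64.AddAcrossWidening(acc).ToScalar();
            index = a.Length * 16;
        }
        return result + CountScalar(lhs, rhs, index, threshold);
    }

    // Scalar loop for the tail (or for everything when no SIMD path ran).
    private static int CountScalar(ReadOnlySpan<byte> lhs, ReadOnlySpan<byte> rhs, int start, byte threshold)
    {
        int result = 0;
        for (int i = start; i < lhs.Length; i++)
            if (Math.Abs(lhs[i] - rhs[i]) >= threshold) result++;
        return result;
    }

    private static int SumLanes(Vector<byte> v)
    {
        int sum = 0;
        for (int i = 0; i < Vector<byte>.Count; i++) sum += v[i];
        return sum;
    }
}
```

Both methods keep per-lane 8-bit counters and flush them every 255 blocks, so the horizontal sum is paid rarely rather than once per block. Timings will depend on array size and runtime version, so measure on your own device rather than expecting the numbers above.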


  2. Although the API is designed and even documented, the implementation is missing. Take a look. 8-byte SIMD vectors have been an essential part of the NEON ISA for well over a decade (it was introduced in 2005), yet the .NET runtime only implements them when compiling for ARM64 (released in 2013).

    I don’t work for Microsoft and have no idea how exactly they compile their binaries, but the source code shows they have at least some support for NEON when building for the ARM64 target. If you want these intrinsics in .NET, try a 64-bit OS.

    There’s a workaround: implement your performance-critical pieces in C++, compile a shared library for Linux, then use [DllImport] to consume these functions from .NET. I have built non-trivial Linux software that way (example), using the following gcc flags to build the shared libraries: -march=native -mfpu=neon-fp16 -mfp16-format=ieee -ffast-math -O3 -fPIC. This approach works on a 32-bit OS and doesn’t require anything special from the .NET runtime; I’ve tested it with .NET Core 2.1.
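
The managed side of that workaround could look roughly like the sketch below. Everything named here is hypothetical: the library name `diffcount` (which .NET on Linux would resolve to `libdiffcount.so`), the exported function `count_diffs`, and its signature are assumptions for illustration, not something taken from the project mentioned above. The wrapper falls back to a managed loop when the native library is not present, so the code still runs without it.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeDiffCounter
{
    // Assumed native export (hypothetical), compiled into libdiffcount.so:
    //   extern "C" int32_t count_diffs(const uint8_t* lhs, const uint8_t* rhs,
    //                                  int32_t length, uint8_t threshold);
    [DllImport("diffcount", EntryPoint = "count_diffs")]
    private static extern int CountDiffsNative(byte[] lhs, byte[] rhs, int length, byte threshold);

    public static int Count(byte[] lhs, byte[] rhs, byte threshold = 16)
    {
        if (lhs.Length != rhs.Length) throw new ArgumentException("Arrays must have equal length.");
        try
        {
            // byte[] is blittable, so the runtime pins the arrays and passes raw pointers.
            return CountDiffsNative(lhs, rhs, lhs.Length, threshold);
        }
        catch (Exception e) when (e is DllNotFoundException || e is EntryPointNotFoundException)
        {
            // Managed fallback when the native library or its export is unavailable.
            int result = 0;
            for (int i = 0; i < lhs.Length; i++)
                if (Math.Abs(lhs[i] - rhs[i]) >= threshold) result++;
            return result;
        }
    }
}
```

Passing whole arrays in a single call keeps the P/Invoke overhead negligible compared with the work done per call; calling a native function once per element would likely erase any SIMD gain.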
