Fast delegate invoke
Problem
Do you need to invoke an action a million times? Delegates are slow because the JIT generally cannot inline a call made through a delegate. A call that is not inlined carries overhead: jumps and redundant data movement. Sure, you can inline the code manually or replace delegates with methods, but that means a lot of boilerplate code.
However, there is a trick. .NET dynamic methods (System.Reflection.Emit.DynamicMethod) let you generate IL code at runtime. You can emit a call instruction that targets the delegate's MethodInfo directly, and the JIT compiler may inline that call (or not: it depends on the size of the method and the compiler settings).
Dynamic methods
Instance Action
Let’s create a method that calls an Action<int>.
First, create a DynamicMethod instance:
DynamicMethod meth = new("CallFun", typeof(void), new[] {typeof(int), typeof(object)});
Second, obtain an ILGenerator:
ILGenerator il = meth.GetILGenerator();
And finally, emit IL code:
il.Emit(OpCodes.Ldarg_1); // load instance
il.Emit(OpCodes.Ldarg_0); // load int argument
il.Emit(OpCodes.Call, original.Method); // call MethodInfo of delegate
il.Emit(OpCodes.Ret); // return;
Let’s invoke our dynamic method:
Action<int,object?> generated = meth.CreateDelegate<Action<int,object?>>();
generated(7653, original.Target);
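The steps above can be put together into one small program. This is a minimal sketch, not the definitive way to do it: the class name InstanceCallDemo and the summing lambda are made up for illustration, and it passes skipVisibility: true on the assumption that the compiler-generated method behind the lambda is not public and could otherwise fail the JIT access check.

```csharp
#nullable enable
using System;
using System.Reflection.Emit;

public static class InstanceCallDemo
{
    public static int Run()
    {
        int sum = 0;
        // Capturing a local makes the compiler emit an instance method on a closure class.
        Action<int> original = x => sum += x;

        // Assumption: skipVisibility is needed because the closure method is not public.
        DynamicMethod meth = new("CallFun", typeof(void),
            new[] { typeof(int), typeof(object) },
            typeof(InstanceCallDemo).Module, skipVisibility: true);

        ILGenerator il = meth.GetILGenerator();
        il.Emit(OpCodes.Ldarg_1);               // load instance (the closure object)
        il.Emit(OpCodes.Ldarg_0);               // load int argument
        il.Emit(OpCodes.Call, original.Method); // direct call to the delegate's method
        il.Emit(OpCodes.Ret);                   // return

        Action<int, object?> generated = meth.CreateDelegate<Action<int, object?>>();
        generated(7653, original.Target);
        return sum; // the lambda ran once, so sum equals the argument passed above
    }
}
```

Calling InstanceCallDemo.Run() invokes the lambda once through the generated method and returns the accumulated value.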
Static Action
If the previous section threw an error, that’s probably because the delegate wraps a static method.
Anonymous methods are compiled to instance methods by the current C# compiler (even if they have the static modifier), but local functions and normal methods can be static.
You can check this with original.Method.IsStatic.
To call a static method you don’t need an instance, so remove the second argument:
DynamicMethod meth = new("CallFun", typeof(void), new[] {typeof(int)});
ILGenerator il = meth.GetILGenerator();
il.Emit(OpCodes.Ldarg_0); // load int argument
il.Emit(OpCodes.Call, original.Method); // call MethodInfo of delegate
il.Emit(OpCodes.Ret); // return;
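To round off the static case, here is a hedged sketch that checks IsStatic before generating the caller and then creates the delegate. StaticCallDemo, Store, LastValue and Build are hypothetical names used only for this example.

```csharp
#nullable enable
using System;
using System.Reflection.Emit;

public static class StaticCallDemo
{
    public static int LastValue;

    // An ordinary static method; a delegate created from it has Target == null.
    public static void Store(int value) => LastValue = value;

    public static Action<int> Build(Action<int> original)
    {
        // The generated code passes no instance, so reject instance methods early.
        if (!original.Method.IsStatic)
            throw new ArgumentException("Expected a delegate over a static method.", nameof(original));

        DynamicMethod meth = new("CallFun", typeof(void), new[] { typeof(int) },
            typeof(StaticCallDemo).Module, skipVisibility: true);

        ILGenerator il = meth.GetILGenerator();
        il.Emit(OpCodes.Ldarg_0);               // load int argument
        il.Emit(OpCodes.Call, original.Method); // call the static method directly
        il.Emit(OpCodes.Ret);                   // return

        return meth.CreateDelegate<Action<int>>();
    }
}
```

Usage: Action<int> generated = StaticCallDemo.Build(StaticCallDemo.Store); generated(42); stores 42 in StaticCallDemo.LastValue.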
Loops
The methods above are not very efficient: you just create another delegate that calls the method. If you want to invoke the original delegate in a loop efficiently, emit the loop inside the DynamicMethod.
You can write the code you need in C# and use the Rider IL viewer to see the IL you have to emit.
The IL for for (int i = 0; i < end; i++) {} will be:
// set end to 1000
IL_0000: ldc.i4 1000
IL_0005: stloc.0
// set i to 0
IL_0006: ldc.i4.0
IL_0007: stloc.1
// jump to the loop condition check
IL_0008: br.s IL_0018
// start of loop
// loop body
// ...
// i++
IL_0014: ldloc.1 // i
IL_0015: ldc.i4.1
IL_0016: add
IL_0017: stloc.1 // i
// if (i < end) jump back to the loop body at IL_000a
IL_0018: ldloc.1 // i
IL_0019: ldloc.0 // end
IL_001a: blt.s IL_000a
// end of loop
So, our loop method will be:
// setup
Action<int> original = i => Console.WriteLine($"[{i}] Hello!");
// the arguments are: (iterations, instance)
// iterations is the loop count, not an argument of the original delegate
DynamicMethod meth = new("CallFun", typeof(void), new[] {typeof(int), typeof(object)});
ILGenerator il = meth.GetILGenerator();
// init i
il.DeclareLocal(typeof(int)); // declare i
il.Emit(OpCodes.Ldc_I4_0); // load 0 as int
il.Emit(OpCodes.Stloc_0); // store to i
// define loop labels
Label testLabel = il.DefineLabel();
Label execLabel = il.DefineLabel();
// loop start
il.Emit(OpCodes.Br, testLabel);
il.MarkLabel(execLabel);
// loop body
il.Emit(OpCodes.Ldarg_1); // load instance
il.Emit(OpCodes.Ldloc_0); // load i
il.Emit(OpCodes.Call, original.Method); // call
// i++
il.Emit(OpCodes.Ldloc_0); // load i
il.Emit(OpCodes.Ldc_I4_1); // load 1 as int
il.Emit(OpCodes.Add);
il.Emit(OpCodes.Stloc_0); // store to i
// loop test (if (i < iterations) goto exec;)
il.MarkLabel(testLabel);
il.Emit(OpCodes.Ldloc_0); // load i
il.Emit(OpCodes.Ldarg_0); // load iterations
il.Emit(OpCodes.Blt, execLabel); // goto exec if i < iterations
// return
il.Emit(OpCodes.Ret);
// complete and invoke
Action<int,object?> generated = meth.CreateDelegate<Action<int,object?>>();
generated(10, original.Target);
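The loop emitter can also be wrapped in a helper that handles both instance and static delegates by emitting the instance load only when it is needed. This is a sketch under assumptions: LoopCallDemo, BuildLoop and Run are hypothetical names, and skipVisibility: true is again assumed to be required for compiler-generated lambda methods.

```csharp
#nullable enable
using System;
using System.Reflection.Emit;

public static class LoopCallDemo
{
    // Builds (iterations, instance) => for (i = 0; i < iterations; i++) original(i)
    public static Action<int, object?> BuildLoop(Action<int> original)
    {
        DynamicMethod meth = new("CallFunLoop", typeof(void),
            new[] { typeof(int), typeof(object) },
            typeof(LoopCallDemo).Module, skipVisibility: true);
        ILGenerator il = meth.GetILGenerator();

        il.DeclareLocal(typeof(int));           // local 0: i
        il.Emit(OpCodes.Ldc_I4_0);
        il.Emit(OpCodes.Stloc_0);               // i = 0

        Label testLabel = il.DefineLabel();
        Label execLabel = il.DefineLabel();
        il.Emit(OpCodes.Br, testLabel);         // check the condition first

        il.MarkLabel(execLabel);                // loop body
        if (!original.Method.IsStatic)
            il.Emit(OpCodes.Ldarg_1);           // load instance only for instance methods
        il.Emit(OpCodes.Ldloc_0);               // load i
        il.Emit(OpCodes.Call, original.Method);

        il.Emit(OpCodes.Ldloc_0);               // i++
        il.Emit(OpCodes.Ldc_I4_1);
        il.Emit(OpCodes.Add);
        il.Emit(OpCodes.Stloc_0);

        il.MarkLabel(testLabel);                // if (i < iterations) goto exec
        il.Emit(OpCodes.Ldloc_0);
        il.Emit(OpCodes.Ldarg_0);
        il.Emit(OpCodes.Blt, execLabel);
        il.Emit(OpCodes.Ret);

        return meth.CreateDelegate<Action<int, object?>>();
    }

    public static int Run()
    {
        int sum = 0;
        Action<int> original = i => sum += i;   // receives 0, 1, ..., 9
        Action<int, object?> loop = BuildLoop(original);
        loop(10, original.Target);
        return sum;                             // 0 + 1 + ... + 9 = 45
    }
}
```

For a static delegate the second argument is simply ignored, so the caller can pass original.Target (null) in both cases.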
Batches
Check instruction set support before execution! Many processors don't support all of the instruction sets.
The JIT compiler doesn’t auto-vectorize code the way C++ compilers do, but you can vectorize manually using System.Runtime.Intrinsics:
int iterations = 1000;
int batchSize = 4;
int end = iterations & ~(batchSize-1); // round down to a multiple of batchSize (must be a power of two)
for (int i = 0; i < end; i+=batchSize) {
// simd code
}
for (int i = end; i < iterations; i++) {
// simple code
}
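As one possible way to fill in the two loops, the sketch below multiplies every element of an int array by 3, using Sse41.MultiplyLow for full batches and plain code for the tail. BatchDemo and MultiplyBy3 are hypothetical names, and viewing the array as vectors through MemoryMarshal.Cast is an assumption chosen here to avoid unsafe code; it is not necessarily how the benchmarked code was written.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class BatchDemo
{
    public static void MultiplyBy3(int[] data)
    {
        const int batchSize = 4;                 // a Vector128<int> holds four ints
        int iterations = data.Length;
        int end = 0;                             // number of elements handled by SIMD

        if (Sse41.IsSupported)                   // check support before using the instruction
        {
            end = iterations & ~(batchSize - 1); // round down to a multiple of batchSize
            Vector128<int> factor = Vector128.Create(3);
            // View the first 'end' ints as a span of 128-bit vectors.
            Span<Vector128<int>> vectors =
                MemoryMarshal.Cast<int, Vector128<int>>(data.AsSpan(0, end));
            for (int i = 0; i < end; i += batchSize)
            {
                // simd code: four multiplications at once
                vectors[i / batchSize] = Sse41.MultiplyLow(factor, vectors[i / batchSize]);
            }
        }

        for (int i = end; i < iterations; i++)
        {
            // simple code: the tail, or everything when SSE4.1 is unavailable
            data[i] = 3 * data[i];
        }
    }
}
```

When Sse41.IsSupported is false, end stays 0 and the scalar loop processes the whole array, so the result is the same on every processor.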
Benchmarks
Formula:
- simple:
*v = 3 * *v
- Sse41 :
*(Vector128<int>*)v = Sse41.MultiplyLow(i->extraData, *(Vector128<int>*)v)
Setup:
- lib: BenchmarkDotNet
- cpu: AMD FX(tm)-8300
- logical cores: 8
- physical cores: 4
- OS: Arch Linux
The scale of the benchmark chart is logarithmic.