The results:

As you can see, in my setup a cache block size of 32k gives the best results.
But what worries me is that in every cached configuration the random 4k read test (32 queues, 16 threads) is much slower than without the cache, by about 33%.
Is there an optimization I have overlooked, something I can do to bring this up to at least the speed of the physical disk?