I am testing on a new machine (version 4.4.1) and I see that PrimoCache is not balancing its processor usage over the NUMA nodes, which is what is needed to access near memory. Almost all CPU time is spent on the first node, which can overload the first processor and leave the memory bandwidth to the second NUMA node slower than it needs to be.
This machine is still clean and not used by any other process, but I see the same on the operational machines: the first NUMA node is used much more than the second one. Note that NUMA optimization is not only about memory but also about CPU.
This was tested on a ReFS volume with a 128 GB cache, using CrystalDiskMark 8.0.6, on a DL360 Gen10 with 512 GB RAM and two Xeon Gold 6138 CPUs (no HT, 40 CPUs in total, 20 per NUMA node).
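(For anyone reproducing this: a quick way to see whether the cache memory is drawn evenly from both nodes is to compare per-node free memory before and after the cache is created. Below is a minimal sketch using only the standard Win32 call GetNumaAvailableMemoryNodeEx, nothing PrimoCache-specific.)

```c
/* Sketch: print the available memory on each NUMA node. Comparing the
 * output before and after the cache is created shows whether the cache
 * memory is drawn evenly from both nodes. Win32 only, nothing
 * PrimoCache-specific. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    for (USHORT node = 0; node <= (USHORT)highestNode; node++) {
        ULONGLONG availableBytes = 0;
        if (GetNumaAvailableMemoryNodeEx(node, &availableBytes)) {
            printf("Node %u: %.1f GB available\n",
                   node, availableBytes / (1024.0 * 1024.0 * 1024.0));
        }
    }
    return 0;
}
```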
Numa and CPU
Re: Numa and CPU
Have you enabled NUMA-Aware in PrimoCache? Please see
https://kb.romexsoftware.com/en-us/2-pr ... numa-aware
Re: Numa and CPU
Yes, I have, on all machines.
Re: Numa and CPU
See the following screenshot:
On the first node, you see almost no kernel time (the dark areas). On the second node, you see a lot of kernel time.
So, somehow, in this case almost all kernel time (meaning kernel-mode work such as PrimoCache) is located on the second node instead of being balanced over both nodes.
Re: Numa and CPU
Or, as node view (instead of core view):
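For reference, here is a minimal diagnostic sketch (plain Win32 NUMA APIs, not part of PrimoCache) that prints which logical processors belong to which NUMA node, so the per-core graphs above can be mapped to the node view:

```c
/* Sketch: map logical processors to NUMA nodes using the Win32 NUMA APIs.
 * Only a helper for reading the Task Manager graphs; unrelated to
 * PrimoCache's internals. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    for (USHORT node = 0; node <= (USHORT)highestNode; node++) {
        GROUP_AFFINITY affinity;
        if (!GetNumaNodeProcessorMaskEx(node, &affinity)) {
            continue;   /* node numbers may be sparse */
        }
        printf("NUMA node %u (processor group %u), mask 0x%llx -> CPUs:",
               node, affinity.Group, (unsigned long long)affinity.Mask);
        for (int cpu = 0; cpu < 64; cpu++) {
            if (affinity.Mask & (1ull << cpu)) {
                printf(" %d", cpu);
            }
        }
        printf("\n");
    }
    return 0;
}
```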
Re: Numa and CPU
Thank you for the detailed information!
We checked the code. CPU load balancing should primarily be left to the Windows system and the application, which decide on which CPU to schedule threads. In most cases, reads and writes in PrimoCache work in the context of the application thread. PrimoCache will not forcefully switch threads to run on another CPU, as that would incur extra processing time and reduce read and write speed.
Some reads and writes in PrimoCache are performed in the context of system threads. Normally, Windows schedules these threads and balances the CPU load. We'll look into whether there is any additional benefit in forcing these reads and writes to be balanced across different CPUs.
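To illustrate what "working in the context of the application thread" means for NUMA placement, here is a hedged user-mode sketch (plain Win32, not PrimoCache code): when a cache read or write runs on the caller's thread, its memory traffic lands on whatever node Windows happened to schedule that thread on, which a thread can inspect like this:

```c
/* Sketch: report which NUMA node the calling thread is currently executing
 * on. Illustrative only; PrimoCache is a kernel-mode driver and does not
 * use exactly this code path. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    PROCESSOR_NUMBER procNumber;
    USHORT node = 0;

    /* Which logical processor is this thread on right now? */
    GetCurrentProcessorNumberEx(&procNumber);

    /* Which NUMA node does that processor belong to? */
    if (!GetNumaProcessorNodeEx(&procNumber, &node)) {
        printf("GetNumaProcessorNodeEx failed: %lu\n", GetLastError());
        return 1;
    }

    printf("Thread is running on group %u, processor %u, NUMA node %u\n",
           procNumber.Group, procNumber.Number, node);
    return 0;
}
```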
Re: Numa and CPU
On this specific machine, each CPU has 2.4 GT/s memory across 6 channels, giving 14.4 GT/s of memory transfer per NUMA node for NEAR memory.
The CPUs are interconnected at 9.6 GT/s, making FAR memory about 33% slower than NEAR memory.
But using FAR memory also takes up to 66% of the other node's memory bandwidth (since the far access consumes the remote node's bandwidth).
And as another side effect of using FAR memory, the latency of memory access will be 2 to 4 times higher, causing extra CPU stalls on cache-line loads and lowering the effective CPU speed on both nodes.
So just using a random core really slows down the total memory throughput and increases the memory latency of the machine, wasting CPU cycles not only on the requesting NUMA node but also on the requested one.
I know this is mostly a server problem (lots of processes, lots of memory transfers and lots of disk transfers), but it might also impact gaming.
(BTW, the first message was about a new machine; the graphs and this information are from an operational machine with 20 cores, 10 on each node: two Xeon Silver 4210R CPUs.)
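Putting the quoted numbers together (a rough back-of-the-envelope calculation that treats the GT/s figures as directly proportional to usable bandwidth, ignoring differences in transfer width between the memory channels and the CPU interconnect):

```latex
\begin{aligned}
BW_{\text{near}} &= 6 \times 2.4\ \text{GT/s} = 14.4\ \text{GT/s per node} \\
BW_{\text{far}}  &\le 9.6\ \text{GT/s (CPU interconnect)} \\
\text{FAR penalty} &= 1 - \tfrac{9.6}{14.4} \approx 33\%\ \text{slower than NEAR} \\
\text{remote-node load} &= \tfrac{9.6}{14.4} \approx 66\%\ \text{of the other node's bandwidth consumed}
\end{aligned}
```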
Re: Numa and CPU
For a caching program, it is difficult to predict which node's memory is better for cached data. Applications running on different nodes may access the same cached data at the same time, and the cost of switching an application to another node far exceeds the cost of accessing FAR memory. So we evenly allocate memory from each node and let Windows and the application decide which node the process runs on.
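As an illustration of the "evenly allocate the memory of each node" approach, here is a hedged user-mode sketch using VirtualAllocExNuma. PrimoCache itself allocates its cache in kernel mode, so this is not its actual code; it only demonstrates drawing an equal share of memory from each node while leaving thread scheduling to Windows:

```c
/* Sketch: commit one block of memory per NUMA node, preferring that node's
 * physical pages. Illustrative only; PrimoCache's real allocation path is
 * in kernel mode. */
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    const SIZE_T blockSize = 64ull * 1024 * 1024;   /* 64 MB per node (arbitrary) */

    for (ULONG node = 0; node <= highestNode; node++) {
        /* Ask for physical pages preferably from this node. */
        void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, blockSize,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     node);
        if (p == NULL) {
            printf("Allocation on node %lu failed: %lu\n", node, GetLastError());
            continue;
        }
        /* Touch the pages so they are actually faulted in on that node. */
        memset(p, 0, blockSize);
        printf("Committed %llu bytes preferring node %lu at %p\n",
               (unsigned long long)blockSize, node, p);
        VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}
```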