I am testing on a new machine (version 4.4.1) and I see that PrimoCache is not balancing its processor usage over the NUMA nodes, which is what is needed to access near memory. Almost all CPU time is spent on the first node, which can overload the first processor and leave the memory bandwidth to the second NUMA node slower than it needs to be.
This machine is still clean and not used by any other process, but I see the same on the operational machines: the first NUMA node is used much more than the second one. Note that NUMA optimization is not only about memory but also about CPU.
This was tested on a ReFS volume with a 128 GB cache, using CrystalDiskMark 8.0.6, on a DL360 Gen10 with 512 GB RAM and two Xeon Gold 6138 CPUs (no HT, 40 CPUs in total, 20 per NUMA node).
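(For anyone reproducing this: a quick way to see whether the cache memory is drawn evenly from both nodes is to compare per-node free memory before and after the cache is created. Below is a minimal sketch using only the standard Win32 call GetNumaAvailableMemoryNodeEx, nothing PrimoCache-specific.)

```c
/* Sketch: print the available memory on each NUMA node. Comparing the
 * output before and after the cache is created shows whether the cache
 * memory is drawn evenly from both nodes. Win32 only, nothing
 * PrimoCache-specific. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    for (USHORT node = 0; node <= (USHORT)highestNode; node++) {
        ULONGLONG availableBytes = 0;
        if (GetNumaAvailableMemoryNodeEx(node, &availableBytes)) {
            printf("Node %u: %.1f GB available\n",
                   node, availableBytes / (1024.0 * 1024.0 * 1024.0));
        }
    }
    return 0;
}
```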
Numa and CPU
Re: Numa and CPU
Have you enabled NUMA-Aware in PrimoCache? Please see
https://kb.romexsoftware.com/en-us/2-pr ... numa-aware
Re: Numa and CPU
Yes, I have, on all machines.
Re: Numa and CPU
See the following screenshot:
On the first node, you see almost no kernel time (the dark areas). On the second node, you see a lot of kernel time.
So, somehow, in this case almost all kernel time (meaning kernel-mode work such as PrimoCache) is located on the second node instead of being balanced over both nodes.
Re: Numa and CPU
Or, as node view (instead of core view):
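For reference, here is a minimal diagnostic sketch (plain Win32 NUMA APIs, not part of PrimoCache) that prints which logical processors belong to which NUMA node, so the per-core graphs above can be mapped to the node view:

```c
/* Sketch: map logical processors to NUMA nodes using the Win32 NUMA APIs.
 * Only a helper for reading the Task Manager graphs; unrelated to
 * PrimoCache's internals. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    for (USHORT node = 0; node <= (USHORT)highestNode; node++) {
        GROUP_AFFINITY affinity;
        if (!GetNumaNodeProcessorMaskEx(node, &affinity)) {
            continue;   /* node numbers may be sparse */
        }
        printf("NUMA node %u (processor group %u), mask 0x%llx -> CPUs:",
               node, affinity.Group, (unsigned long long)affinity.Mask);
        for (int cpu = 0; cpu < 64; cpu++) {
            if (affinity.Mask & (1ull << cpu)) {
                printf(" %d", cpu);
            }
        }
        printf("\n");
    }
    return 0;
}
```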
Re: Numa and CPU
Thank you for the detailed information!
We checked the code. CPU load balancing should primarily be left to the Windows system and the application, which decide on which CPU to schedule threads. In most cases, reads and writes in PrimoCache work in the context of the application thread. PrimoCache will not forcefully switch threads to run on another CPU, as that would incur extra processing time and reduce read and write speed.
Some reads and writes in PrimoCache are performed in the context of system threads. Normally, Windows schedules these threads and balances the CPU load. We'll look into whether there is any additional benefit in forcing these reads and writes to be balanced across different CPUs.
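To illustrate what "working in the context of the application thread" means for NUMA placement, here is a hedged user-mode sketch (plain Win32, not PrimoCache code): when a cache read or write runs on the caller's thread, its memory traffic lands on whatever node Windows happened to schedule that thread on, which a thread can inspect like this:

```c
/* Sketch: report which NUMA node the calling thread is currently executing
 * on. Illustrative only; PrimoCache is a kernel-mode driver and does not
 * use exactly this code path. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    PROCESSOR_NUMBER procNumber;
    USHORT node = 0;

    /* Which logical processor is this thread on right now? */
    GetCurrentProcessorNumberEx(&procNumber);

    /* Which NUMA node does that processor belong to? */
    if (!GetNumaProcessorNodeEx(&procNumber, &node)) {
        printf("GetNumaProcessorNodeEx failed: %lu\n", GetLastError());
        return 1;
    }

    printf("Thread is running on group %u, processor %u, NUMA node %u\n",
           procNumber.Group, procNumber.Number, node);
    return 0;
}
```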
Re: Numa and CPU
On this specific machine, each CPU has 2.4 GT/s memory across 6 channels, giving 14.4 GT/s of memory transfer per NUMA node for NEAR memory.
The CPUs are interconnected at 9.6 GT/s, making FAR memory about 33% slower than NEAR memory.
But using FAR memory also takes up to 66% of the other node's memory bandwidth (since the far access consumes the remote node's bandwidth).
And as another side effect of using FAR memory, the latency of memory access will be 2 to 4 times higher, causing extra CPU stalls on cache-line loads and lowering the effective CPU speed on both nodes.
So just using a random core really slows down the total memory throughput and increases the memory latency of the machine, wasting CPU cycles not only on the requesting NUMA node but also on the requested one.
I know this is mostly a server problem (lots of processes, lots of memory transfers and lots of disk transfers), but it might also impact gaming.
(BTW, the first message was about a new machine; the graphs and this information are from an operational machine with 20 cores, 10 on each node: two Xeon Silver 4210R CPUs.)
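Putting the quoted numbers together (a rough back-of-the-envelope calculation that treats the GT/s figures as directly proportional to usable bandwidth, ignoring differences in transfer width between the memory channels and the CPU interconnect):

```latex
\begin{aligned}
BW_{\text{near}} &= 6 \times 2.4\ \text{GT/s} = 14.4\ \text{GT/s per node} \\
BW_{\text{far}}  &\le 9.6\ \text{GT/s (CPU interconnect)} \\
\text{FAR penalty} &= 1 - \tfrac{9.6}{14.4} \approx 33\%\ \text{slower than NEAR} \\
\text{remote-node load} &= \tfrac{9.6}{14.4} \approx 66\%\ \text{of the other node's bandwidth consumed}
\end{aligned}
```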
Re: Numa and CPU
For a caching program, it is difficult to predict which node's memory is better for cached data. Applications running on different nodes may access the same cached data at the same time, and the cost of switching an application to another node far exceeds the cost of accessing FAR memory. So we evenly allocate memory from each node and let Windows and the application decide which node the process runs on.
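As an illustration of the "evenly allocate the memory of each node" approach, here is a hedged user-mode sketch using VirtualAllocExNuma. PrimoCache itself allocates its cache in kernel mode, so this is not its actual code; it only demonstrates drawing an equal share of memory from each node while leaving thread scheduling to Windows:

```c
/* Sketch: commit one block of memory per NUMA node, preferring that node's
 * physical pages. Illustrative only; PrimoCache's real allocation path is
 * in kernel mode. */
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }

    const SIZE_T blockSize = 64ull * 1024 * 1024;   /* 64 MB per node (arbitrary) */

    for (ULONG node = 0; node <= highestNode; node++) {
        /* Ask for physical pages preferably from this node. */
        void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, blockSize,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     node);
        if (p == NULL) {
            printf("Allocation on node %lu failed: %lu\n", node, GetLastError());
            continue;
        }
        /* Touch the pages so they are actually faulted in on that node. */
        memset(p, 0, blockSize);
        printf("Committed %llu bytes preferring node %lu at %p\n",
               (unsigned long long)blockSize, node, p);
        VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}
```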