
Re: Event 129 secnvme

Posted: Tue Mar 05, 2019 9:50 am
by Support
@neatchee, thank you for the bug report!
Could you tell us your hardware configuration (motherboard/CPU/RAM/storage disks) and send a screenshot of the PrimoCache main dialog showing the cache configuration and statistics? Also, which game is it?
Thanks a lot!

Re: Event 129 secnvme

Posted: Tue Mar 05, 2019 11:14 am
by neatchee
Motherboard: Asus Prime Z270-A
BIOS: Rev 1302 (latest)
CPU: Intel Core i7-7700K
RAM: 16GB Kingston HyperX Black DDR4-3000 (2x8GB)
Storage 1: (C:\, Operating System) - KINGSTON SSDNow V300 - 120GB (SV300S37A120G)
Storage 2: (D:\, Bulk Storage, Games, etc)- HGST Deskstar 4TB 7200RPM (HDN726040ALE614)
Storage 3: (E:\, Scratch, Documents, etc) - OCZ-VERTEX2 - 120GB (OCZSSD2-2VTXE120G)
Storage 4: (L2CACHE for D:\) - Samsung 970 EVO - 500GB (MZ-V7E500BW)

Game(s) causing issue: Anthem, Destiny 2

And here is the screenshot you requested...
PLEASE NOTE: Whenever this crash occurs, the cache is cleared (good!). This screenshot was taken after re-populating the cache with only the Anthem data (by triggering a full read on all files in the game directory, using Python) so the cache hit rate is obviously low.
Under normal usage I get a 99+% cache hit rate.
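
For anyone who wants to repeat the "full read" step, here is a minimal sketch of how it could be done in Python. This is not the script used above. The function name `read_all_files` and the chunk size are assumptions made for illustration, and whether a read actually lands in the L2 cache depends on how the cache task is configured.

```python
import os


def read_all_files(root, chunk_size=1024 * 1024):
    """Read every file under `root` once, discarding the data.

    Returns (files_read, bytes_read) for files that were read completely.
    """
    files_read = 0
    bytes_read = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            file_bytes = 0
            try:
                with open(path, "rb") as handle:
                    # Read in chunks so large game archives do not fill RAM.
                    while True:
                        chunk = handle.read(chunk_size)
                        if not chunk:
                            break
                        file_bytes += len(chunk)
            except OSError:
                # Locked or unreadable files are skipped and not counted.
                continue
            files_read += 1
            bytes_read += file_bytes
    return files_read, bytes_read
```

It would be called with the game directory, e.g. `read_all_files(r"D:\Games\Anthem")` (a hypothetical path; substitute the real install location).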


[Image: screenshot of the PrimoCache main dialog]

Re: Event 129 secnvme

Posted: Wed Mar 06, 2019 8:20 am
by Support
Thank you very much for your detailed information! We'll try to set up a similar computer and do the testing.

Re: Event 129 secnvme

Posted: Wed Mar 06, 2019 8:43 am
by neatchee
Some more information from testing:
  • I removed some applications that were regularly querying the device (e.g. hardware monitor software) to make sure that wasn't interfering
  • I tried setting the M.2 PCIe lanes to x2 (instead of x4)
Neither of these helped :(
However, it is important to note that when I switched to x2 PCIe lanes, the driver reporting the error changed to storahci.
(This is expected, but noteworthy because it means the problem is not specific to NVMe; it will happen with SATA/AHCI too.)
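
To compare which driver is logging Event 129, the System log can be filtered by source. Below is a rough sketch that builds a `wevtutil` query for that. The helper names are made up for illustration, and it assumes the provider name matches the source shown in Event Viewer (`secnvme` or `storahci` here).

```python
import subprocess


def build_event_query(provider, event_id=129):
    """Build an XPath filter for one provider and event ID."""
    return "*[System[Provider[@Name='{}'] and (EventID={})]]".format(
        provider, event_id
    )


def build_wevtutil_args(provider, event_id=129, max_events=20):
    """Build the argument list for `wevtutil qe` against the System log."""
    return [
        "wevtutil", "qe", "System",
        "/q:" + build_event_query(provider, event_id),
        "/c:{}".format(max_events),  # limit the number of events returned
        "/rd:true",                  # newest events first
        "/f:text",                   # plain-text output
    ]


def recent_events(provider, event_id=129, max_events=20):
    """Run the query and return the raw text output (Windows only)."""
    result = subprocess.run(
        build_wevtutil_args(provider, event_id, max_events),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running `recent_events("secnvme")` and `recent_events("storahci")` before and after a lane change should show which source the resets are logged under.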

Next steps:
  • I have requested a warranty replacement from Samsung, just to make sure the device isn't defective
  • I am getting another M.2 SSD - Crucial MX500 M.2 SATA (not NVMe) - and will see if I can reproduce the issue
I would be happy to help collect additional test data! I work as a software tester at a big video game studio (you've definitely heard of our games heh) so I am happy to do some advanced debugging if you aren't able to get the issue to happen for you!

Re: Event 129 secnvme

Posted: Wed Mar 06, 2019 4:31 pm
by Support
We do appreciate your testing! We're looking forward to the results.

Re: Event 129 secnvme

Posted: Wed Mar 06, 2019 8:12 pm
by neatchee
My current theory is that this is a hardware defect: a flaw in the design of the 970 EVO, poor motherboard design, or a common manufacturing defect of the 970 EVO.
  • I was able to reproduce this issue (only once) using the Samsung Magician benchmarking utility.
  • If Link Power Management is enabled, I see dramatic instability for the device even without PrimoCache.
  • This suggests a device-level failure, likely related to voltage, during very high throughput or when transitioning out of low power states quickly (when waking up for high throughput requests).
  • The M.2 slots on my motherboard are run through the PCH (not the dedicated PCIe controller in the CPU), so I began experimenting with small voltage adjustments.
  • After increasing the PCH voltage by approx. 0.02 V I believe I am seeing increased stability.
These results are preliminary, so not confirmed yet. Needs more testing time. :)

EDIT: No dice. Took a little longer than previous cases, but same behavior. I'm not inclined to continue tuning voltage for this issue until I've tested a replacement drive.

Re: Event 129 secnvme

Posted: Thu Mar 07, 2019 9:52 am
by Support
Interesting finding! :)

Re: Event 129 secnvme

Posted: Fri Mar 08, 2019 3:45 am
by Jaga
Just to add to the information on this topic:

I have a new 1TB Samsung 970 EVO and a 32GB L1 read/write (w/deferred writes) Cache Task in PrimoCache (that is caching the EVO), and don't have any problems whatsoever. PrimoCache never has problems, and I never see any errors thrown.

It sounds like the problems you're seeing may be related to the special drive access that PrimoCache uses on the L2STORAGE volume, which in your case is the NVMe. It's definitely not a device defect, since PrimoCache can talk just fine with the 970 EVO when it's a cached volume.

Hopefully Samsung Magician isn't running all the time either, since it's a drive management utility that -can- mess with caching on a drive if it's enabled at Windows startup.

Re: Event 129 secnvme

Posted: Fri Mar 08, 2019 4:12 am
by neatchee
Jaga wrote: "It sounds like the problems you're seeing may be related to the special drive access that PrimoCache uses on the L2STORAGE volume"
This is my best guess as well, considering I don't seem to have problems just using the drive on its own as storage.
Jaga wrote: "It's definitely not a device defect, since PrimoCache can talk just fine with the 970 EVO when it's a cached volume."
I don't necessarily agree with this assessment. In fact, quite the opposite: using the 970 EVO as a cache volume, instead of the volume being cached, dramatically increases workload and thus any device defect would have more opportunity to manifest (especially if it's related to voltage instability or something similar). Besides, the cache operates normally for some time before the failure occurs, so it's not as simple as "PrimoCache = issue".
Jaga wrote: "Hopefully Samsung Magician isn't running all the time either, since it's a drive management utility that -can- mess with caching on a drive if it's enabled at Windows startup."
Not sure how I feel about this... I'm currently running Magician at startup, but it's also almost completely passive. The only active operation it performs on the drive, as far as I'm aware, is running a benchmark test. Otherwise it's just there to read S.M.A.R.T. values and check firmware.
I haven't tried running without Magician since I initially upgraded the drive's firmware, so maybe I'll give that a shot if the RMA replacement doesn't help.

Re: Event 129 secnvme

Posted: Fri Mar 08, 2019 4:30 am
by Jaga
Will be interesting to hear what the replacement drive does for you.