(L1 &) L2 & defer write latency (Native & Average unaffected!)

enetec
Level 2
Posts: 8
Joined: Sat Feb 04, 2023 11:19 am

(L1 &) L2 & defer write latency (Native & Average unaffected!)

Post by enetec »

Hi guys,
first of all, let me congratulate you on this GREAT piece of software!

I'm an IT manager and I found it recently, while searching for a product of this type for a particular project I'm setting up. It fits my needs in a great way!

Because of my engineering background, and needing maximum reliability for my "critical" project, I've read ALL the "fuc**d" manual, documentation and FAQs, plus many posts here, and then I set up a testing system to try all the options, run some synthetic & real-world benchmarks, and fine-tune everything for my needs...

It was during these tests that I found what I think is a BUG or, at least, a badly documented behaviour that could be very dangerous if not well understood by users.

After many tests I was finally working with this (very good IMHO!) configuration:
- L1 R/W: 3 GB / 1 GB
- L2 R/W: (about...) 650 GB / 100 GB
- Block size: 32 KB
- Strategy: Read & Write
- Defer write: 1 sec
- Mode: Intelligent
- Options: FreeWritten, FlushSleep
- Prefetch: Disabled
- Volatile cache content: enabled
- Free cache on written: enabled
- Flush L1 to L2: disabled

As you can see, this is a good performance strategy with an open eye to data safety (volatile cache content enabled, and only 1 sec of defer-write latency, which makes all the difference anyway!).

It was during the final tests that I found the BUG (or the poorly documented behaviour!): the defer-write latency is honoured *ONLY* by the L1 cache, while the L2 cache retains data for a LONG time (even at idle!) without flushing it: not within 1 second, not within seconds, not even within minutes!
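
Just to be explicit about what I expected from the documentation, here is a tiny Python sketch of my mental model (my own toy model, NOT PrimoCache internals!): every dirty block, no matter whether it sits in L1 or L2, should hit the disk once the 1 sec latency has expired.

Code: Select all

import time

LATENCY = 1.0  # defer-write latency in seconds (my setting)

# Toy model, NOT PrimoCache internals: each dirty block remembers
# when it entered the cache, regardless of its level.
dirty_blocks = [
    {"level": "L1", "dirtied_at": time.monotonic()},
    {"level": "L2", "dirtied_at": time.monotonic()},
]

def flush_expired(blocks):
    """What I expected: flush EVERY block older than LATENCY,
    whether it lives in L1 or in L2."""
    now = time.monotonic()
    remaining = []
    for block in blocks:
        if now - block["dirtied_at"] >= LATENCY:
            print(f"flushing {block['level']} block to disk")
        else:
            remaining.append(block)
    return remaining

time.sleep(1.1)                      # wait past the latency...
dirty_blocks = flush_expired(dirty_blocks)
assert not dirty_blocks              # ...no deferred data should remain

In my tests only the L1 blocks behave like this; the L2 ones stay dirty far longer.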

Since, as you documented, L2 content is lost anyway in case of a failure/crash, this behaviour is really dangerous: a user could think there is only 1 second's worth of data left to be flushed, when instead (as in some of my tests...) there are GBs & GBs of data in L2 still waiting to be flushed...!!! :o

I don't know if this is a bug or intended behaviour, BUT either way it is not well documented, and nothing in the setup screen indicates that the defer-write latency applies to L1 only!

I would like a comment from Support on this... :?:

P.S.: obviously I've changed my testing setup to use L2 as read-only, but I lose some of the "populating benefit" this way... :cry:

EDIT: updated title with recent discoveries... see latest posts of the topic! ;)
Last edited by enetec on Sun Feb 12, 2023 9:46 pm, edited 1 time in total.
Nick7
Level 5
Posts: 46
Joined: Sun Jun 25, 2017 7:50 am

Re: L2 & defer write latency

Post by Nick7 »

Just a note: 'maximum reliability in my "critical" project' and defer write do not go together at all.
In case of a crash or power loss, you LOSE data.
Even worse: with defer write, data is written out of order, which can cause silent corruption.

Only use defer write on data you do not care much about - as in you are OK to lose it (or restore from backup).
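
To see why out-of-order writes are so nasty, here is a generic sketch (nothing PrimoCache-specific): an application writes a data block first and a commit marker second; if the cache flushes them in the opposite order and the machine crashes in between, the disk ends up with a marker that points at garbage.

Code: Select all

# Generic sketch of the write-reordering hazard, not PrimoCache-specific.
# The application relies on ordering: data block first, commit marker second.
journal = []
journal.append(("write", "data-block"))    # 1st: the payload
journal.append(("write", "commit-mark"))   # 2nd: "the payload is valid"

# A write-back cache is free to flush in whatever order it likes;
# sorting here just stands in for an arbitrary reordering:
flushed_to_disk = sorted(journal, key=lambda op: op[1])

# Power loss after the first physical write:
on_disk_after_crash = flushed_to_disk[:1]
print(on_disk_after_crash)  # [('write', 'commit-mark')] -> marker, no data

# On recovery the application sees a valid commit marker pointing at
# garbage. No error is ever reported: that's a silent corruption.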

Read caching is fine.
enetec
Level 2
Posts: 8
Joined: Sat Feb 04, 2023 11:19 am

Re: L2 & defer write latency

Post by enetec »

Yes, I know, but I've already done some simulated blackouts (and I have a BIG UPS anyway) and some sudden hard resets, without big issues at 1 sec latency on L1. With only 2 seconds it's already a bit different...

BUT let's try not to go off-topic here... we are talking about L2 not complying with the defer-write latency (and not by small amounts of time: it takes LOTS of MINUTES before it even begins to flush!!).
enetec
Level 2
Posts: 8
Joined: Sat Feb 04, 2023 11:19 am

Re: L2 & defer write latency

Post by enetec »

I read from viewtopic.php?p=15666#p15666 ...

"The option "Flush L1 to L2" applies when specified defer-write latency hasn't expired. If the latency has expired, this means all deferred write data , whether in L1 or L2, should be flushed into disks.

So I think mine has to be considered as a BUG and not as a not well documented behaviour...
Support
Support Team
Posts: 3623
Joined: Sun Dec 21, 2008 2:42 am

Re: L2 & defer write latency

Post by Support »

How do you judge that the deferred write-data in the L2 cache has not been flushed?

Only the statistic "Deferred Blocks" indicates how many cache blocks currently hold deferred write-data. So you can check whether it drops to 0 after the latency time, provided no more writes are incoming.
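
If you want to automate the check, the idea in sketch form is below (read_deferred_blocks() is a hypothetical placeholder standing in for reading the value shown in the stats panel):

Code: Select all

import time

def read_deferred_blocks() -> int:
    # Hypothetical placeholder: substitute the actual "Deferred Blocks"
    # value read from the PrimoCache stats panel.
    return 0

# With no more writes incoming, the counter should reach 0 shortly
# after the configured defer-write latency has expired.
latency = 1.0
deadline = time.monotonic() + latency + 5.0   # generous grace period
while time.monotonic() < deadline:
    if read_deferred_blocks() == 0:
        print("all deferred data flushed within the latency window")
        break
    time.sleep(0.5)
else:
    print("deferred data still pending: latency not honoured")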
enetec
Level 2
Posts: 8
Joined: Sat Feb 04, 2023 11:19 am

Re: L2 & defer write latency

Post by enetec »

Support wrote: Mon Feb 06, 2023 7:58 am How do you judge that the deferred write-data in the L2 cache has not been flushed?

Only the statistic "Deferred Blocks" indicates how many cache blocks currently hold deferred write-data. So you can check whether it drops to 0 after the latency time, provided no more writes are incoming.
Exactly. The counter remains at very BIG numbers and doesn't start to flush even several seconds after the writes have ended; and if I force a cache flush by command, I can see GBs (& GBs) of writes hitting the disk before it finishes, whenever I've previously run a big job such as .VDI compacting or heavy video processing...

This doesn't seem to happen when only the L1 cache is used for writes...

Is there a way I can provide you with a LOG of what I see?
Support
Support Team
Posts: 3623
Joined: Sun Dec 21, 2008 2:42 am

Re: L2 & defer write latency

Post by Support »

Could you please record your screen to show the problem and the changes in the PrimoCache stats (keeping the PrimoCache GUI displayed on top of the screen)? You can send the video or a link to [email protected]. Thank you in advance.
enetec
Level 2
Posts: 8
Joined: Sat Feb 04, 2023 11:19 am

Re: L2 & defer write latency

Post by enetec »

Support wrote: Mon Feb 06, 2023 12:01 pm Could you please record your screen to show the problem and the changes in the PrimoCache stats (keeping the PrimoCache GUI displayed on top of the screen)? You can send the video or a link to [email protected]. Thank you in advance.
OK, I've recorded a video on a test system where the issue is clearly visible...

In the video we compact a .VDI (similar to a .VHD) file of about 34 GB on the same HDD (reading from and writing to the same disk). The disk is an old RAID5 of HDDs. The L1 cache (R & W) is (obviously) in RAM and the L2 (R & W) is on a fast NVMe.
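
For anyone who wants to reproduce the workload: the compact step is the standard VirtualBox one; a minimal driver script (the path is just an example, and VBoxManage must be on the PATH) could look like this.

Code: Select all

import subprocess

# Reproduces the workload shown in the video: compacting a ~34 GB .VDI.
# The path is an example only; VirtualBox 6.x syntax shown
# (older releases used "modifyhd" instead of "modifymedium").
subprocess.run(
    ["VBoxManage", "modifymedium", "disk", r"D:\VMs\test.vdi", "--compact"],
    check=True,
)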

I've added some notes throughout the video... it's not that short, but all the phases were needed (compaction and post-compaction).

For a technical reason the video is split into two parts, with a short section missing that I failed to capture (nothing important anyway...). I'm sorry about that. :roll:

But it's all self-explanatory anyway... we reach about 33/34 GB of data cached in L1+L2 with only a few MB written to disk... and then comes a flush (slow too, if not forced...!) that takes MINUTES... :wtf:

Video files:

Part 1: https://mega.nz/file/coQ1gZZa#5YX-rLKDy ... IYVrLU5fJY

Part 2: https://mega.nz/file/hww3ECpT#L5M2_BQcT ... LA3iCrnWJs

Hope this is interesting for you all... ;)

Let me know... best regards! :wave:
Support
Support Team
Posts: 3623
Joined: Sun Dec 21, 2008 2:42 am

Re: L2 & defer write latency

Post by Support »

Sorry for the late response. The videos describe the issue very clearly. Thank you!

We think this issue was probably caused by the following two things:

1) "Urgent Writes" was triggered, because L1 cache space was small and the option "Flush L1 to L2" was not checked. "Urgent Writes" is very slow to flush data to target disks and "Normal Writes" by the latency expiration will not be triggered when "Urgent Writes" is triggered. For more information about parameter tuning on Defer-Write, please see
https://www.romexsoftware.com/en-us/pri ... write.html

2) The disk had lots of read IOs during the cloning (because reading from the disk itself) and it seemed that the disk couldn't process write IOs in time. I see the read speed ranged to 40 - 80 Mb/s during the cloning, and I'm not sure about this disk's ability to handle simultaneous read and write operations.

So you can enable the option "Flush L1 to L2" and then check if the issue is alleviated. This option should be enabled by default when defer-write is enabled on both L1 and L2. Currently we leave it unchecked by default for back compatibility. However, we will change it in next releases. Also the slow issue of "Urgent Writes" would also be improved in next releases.
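
Roughly, the interaction between the two flush paths looks like this (a much simplified sketch of the behaviour described in 1), not our actual driver code):

Code: Select all

# Much simplified sketch of the behaviour described in 1),
# NOT the actual PrimoCache driver code.
class DeferWriteFlusher:
    def __init__(self) -> None:
        self.urgent_active = False  # set when L1 fills and cannot spill to L2

    def on_l1_full(self, can_spill_to_l2: bool) -> None:
        if can_spill_to_l2:
            self.spill_l1_to_l2()      # "Flush L1 to L2" checked: no urgent mode
        else:
            self.urgent_active = True  # slow "Urgent Writes" path takes over

    def on_latency_expired(self) -> None:
        # Latency-triggered "Normal Writes" are skipped while urgent
        # writes are in progress, so deferred data can sit in L2 for a
        # long time even though the 1 s latency keeps expiring.
        if not self.urgent_active:
            self.flush_deferred_blocks()

    def spill_l1_to_l2(self) -> None: ...
    def flush_deferred_blocks(self) -> None: ...

f = DeferWriteFlusher()
f.on_l1_full(can_spill_to_l2=False)  # "Flush L1 to L2" unchecked
f.on_latency_expired()               # does nothing: urgent mode owns the flush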

And below are my answers to other questions mentioned in the videos.
1) "Deferred Blocks" in the stat panel: the percentage number in parenthesis is the percentage of deferred blocks out of total cache blocks, not out of total write-data. Please see
https://www.romexsoftware.com/en-us/pri ... mance.html
2) A force flush (manual flush) is designed to be faster than a normal flush (triggered by latency expiration). Force flush will greatly occupy the disk IO processing time, causing other applications to respond slowly to read/write tasks. Just like the reason#2 mentioned above. In order not to affect other applications too much, normal flush is designed to be a little bit slower, giving idle time for other tasks. Even so, there are still many users complaining that normal flush causes other applications to respond slowly.
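
In sketch form, the trade-off in 2) is simply this (illustrative pacing values, not actual code):

Code: Select all

import time

def write_to_disk(block) -> None:
    pass  # hypothetical stand-in for one block of disk IO

def flush(blocks, forced: bool) -> None:
    """Illustration of the design trade-off described in 2): a forced
    flush writes back-to-back and monopolises the disk, while a normal
    flush leaves small idle gaps for other applications' IO."""
    for block in blocks:
        write_to_disk(block)
        if not forced:
            time.sleep(0.005)  # illustrative idle gap, value made up

flush(range(100), forced=True)   # manual flush: back-to-back, fastest
flush(range(100), forced=False)  # latency flush: throttled, gentler on apps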
enetec
Level 2
Posts: 8
Joined: Sat Feb 04, 2023 11:19 am

Re: L2 & defer write latency

Post by enetec »

OK, thank you very much for your reply! :thumbup:

The important thing for me is that you are now aware of the issue, even (more so!) since it shows up only under particular conditions... ;)

I will do some more testing in the next few days, trying "Flush L1 to L2" (as you hinted) and some other write modes ("Intelligent" was my first choice based on benchmarks, but it could do better or worse than the others in this case... I want to see!)

Only two notes about your considerations:

1) "urgent writes" is probably triggered BUT, if you look with attention to written to disk value in video, for many minutes it doesn't change AT ALL... still fixed to some hundred of MB, while several GB are stored in L2... so there should be something else too...! :?:

2) The disk, as I said, is a RAID5 of HDDs and, even under heavy READ load, it is capable of writing (I've tested it just now!). IMHO writes are not being requested AT ALL by the PrimoCache driver, since the disk's write queue stays completely empty for most of the operation, and when something is sent to the disk it doesn't generate high queue times (my little "LED" monitoring tool is very sharp on this... I've used it for a LONG time and it's very useful for spotting queue issues; see the sketch below), so the RAID seems capable of handling the writes at that level...
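
In case anyone wants to double-check the queue without my tool, Windows' built-in counters are enough; something like this (assuming an English-language Windows, where the PhysicalDisk counters carry these names):

Code: Select all

import subprocess

# Sample the physical disk queue once per second using Windows' built-in
# typeperf; counter names assume an English-language Windows install.
counter = r"\PhysicalDisk(_Total)\Current Disk Queue Length"
proc = subprocess.Popen(
    ["typeperf", counter, "-si", "1"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    print(line.strip())  # timestamp, queue length (near 0 = disk is idle)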

I'll report on my further tests anyway...

Thank you! :wave:

P.S.: next week I'll start using PrimoCache on my "production" system, which has a RAID10 of SSDs as storage... only with L1 and no defer-write at all for now. ;)