Preload currently accessed files -> improve Audio/Video/Film

Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Support »

Axel Mertes wrote:See the yellow-marked fields in the attached table for details. They show how many bytes you actually need to index a cached block, i.e. how much RAM you consume for it.
Well, the result is correct. :)
Axel Mertes
Level 9
Posts: 180
Joined: Thu Feb 03, 2011 3:22 pm

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Axel Mertes »

Hehe, correct maybe, but you have to admit: the result is confusing too!
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Support »

@Axel Mertes

I'm sorry, but I don't know much about the details of the program implementation, and the R&D team may tune the algorithm in the future. However, I'm sure that the figures are correct as of now. You can use the table to evaluate how much memory the overhead needs at a given block size.
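For illustration, here is a minimal sketch (in Python) of the kind of calculation such a table encodes; the 16 bytes per index entry is an assumed placeholder, not PrimoCache's actual per-block overhead:

# Rough estimate of cache-index RAM overhead.
BYTES_PER_INDEX_ENTRY = 16  # assumed placeholder, NOT the actual PrimoCache value

def index_overhead_bytes(cache_size_bytes, block_size_bytes):
    """RAM needed to index a cache of the given size and block size."""
    num_blocks = cache_size_bytes // block_size_bytes
    return num_blocks * BYTES_PER_INDEX_ENTRY

# Example: a 1.7 TB L2 cache with 64 KB blocks.
overhead = index_overhead_bytes(int(1.7e12), 64 * 1024)
print(f"{overhead / 1e9:.2f} GB of RAM for the index")  # ~0.42 GB with these assumptions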
Axel Mertes
Level 9
Posts: 180
Joined: Thu Feb 03, 2011 3:22 pm

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Axel Mertes »

We have now started some deeper testing.

I enabled a 1.7 TByte cache (the biggest possible; the SSD RAID is actually 4 TB) for a 10.5 TByte HDD RAID6 array.
I now have ~15 GByte of L1 and 1.7 TByte of L2 cache. Everything uses a 64 KByte block/cluster size as far as I can tell, so caching is as efficient as possible.

The deferred writes (10 s) help to reduce SSD L2 cache wear, right?

We have redundant power supplies (3x), ECC memory, and dual CPUs.
Is there anything else I could or should do to make data loss less likely?

The SSDs are currently in a RAID0 stripe (no redundancy) as cache.
Are writes really performed from L1 to L2 cache and only then to HDD RAID6, or will they just go from L1 to HDD directly?

The point is, if my L2 SSD RAID0 is affected by e.g. a failing SSD, I don't want to lose GBytes of cached data that have not yet been written to the HDD.
I know that I might already face this if the RAM fails.

So my ideal scenario would probably be no write caching at all, except for collecting/deferring writes towards the L2 SSD cache. Writes to the HDD should ideally go straight onto the HDD RAID, so that nothing is ever lost. Think of it as a read-only cache with reduced wear and tear from SSD writes to the read cache.

And are freshly written blocks automatically placed in the read cache too?
In our scenario it is extremely likely that these will be read soon after being written (we write image sequences and replay them once rendering is finished).
In fact, any recently touched data, regardless of whether it was read or written, should stay in the read cache.
Which caching scheme works like that?

Is that something currently supported in PrimoCache?
It's not so obvious to figure out, and I think it would make a lot of sense, especially with large amounts of important data.

I now want to play with the settings and do some real-world testing, to see if the cache size is fine for our needs or how we can improve the system.



In the context of the original proposal in this thread:

Is there a "debug version" or "debug mode" for PrimoCache which could create a simple record of the I/O operations that PrimoCache performs?

It would be a quickly growing file containing the source disk/block number, the I/O operation (read, write, etc.), the L1/L2 cache block involved, and when blocks are deferred, etc.

If we had such a "log file" from a few hours of runtime, I could easily predict how efficient automatic preloading of blocks in PrimoCache might be, and how many read-ahead blocks would be the best choice. It would be a very good starting point for optimizations.
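To illustrate how such a log could be used, here is a minimal sketch in Python; the log format is entirely hypothetical, since no such PrimoCache log exists yet (assumed line format: "<timestamp> <op> <disk> <block>"):

from collections import defaultdict

def sequential_read_fraction(path):
    """Fraction of reads that hit the block right after the previous read
    on the same disk; a rough upper bound on read-ahead usefulness."""
    last_block = defaultdict(lambda: None)
    reads = sequential = 0
    with open(path) as log:
        for line in log:
            _, op, disk, block = line.split()
            if op != "READ":
                continue
            block = int(block)
            reads += 1
            if last_block[disk] is not None and block == last_block[disk] + 1:
                sequential += 1
            last_block[disk] = block
    return sequential / reads if reads else 0.0

print(f"{sequential_read_fraction('io_trace.log'):.1%} of reads are sequential")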

I made some tests today. We see roughly five to ten times the performance of the uncached drive, and even more when it comes to I/Os.
Using ATTO benchmark I get between 2 and 4 GByte/s for the cached drive locally, while it was around 300 to 350 MByte/s before, without the cache.

Over a 10 GBit Ethernet connection I saw an increase to 450 to 550 MByte/s throughput, both reading and writing, using Blackmagicdesign's DiskSpeed tool. The very same drive without cache achieves around 150 MByte/s writes (the RAID has a cache too) and 80-100 MByte/s reads in real-world throughput over 10 GBit Ethernet.

Amazingly, a local OCZ RevoDrive3x2 delivers 1200 MByte/s writes and 1600 MByte/s reads in ATTO benchmark, while showing only 300 MByte/s writes and 330 MByte/s reads in Blackmagicdesign's DiskSpeed. Presumably the I/O of the server's PrimoCache is even higher than that of the local high-performance PCIe SSD!

In fact, we can now replace our FibreChannel with Ethernet, with many advantages:
- fewer switches
- fewer PCIe cards
- we are able to run defragmentation (impossible on the SAN without dismounting the drive)
- we can use Undelete Server for on-the-fly protection and fragmentation prevention

If Romex Software could add support for bigger caches (bye-bye, MBR-formatted cache) and pre-reading of blocks (adjustable e.g. per cached drive), then we might make a huge leap forward.

I'd be glad if you could comment on the above questions.
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Support »

Axel Mertes wrote:The deferred writes (10s) helps to reduce on SSD L2 cache wear, right?
No. Defer-Write has nothing to do with the L2 cache; it applies to the target volumes.
Axel Mertes wrote:I am not sure if there is anything else I could / should do to make data loss less possible?
Defer-Write will greatly improve target disks' write performance. However, a power outage or system failure (hang or crash) might result in data loss or corruption because in such scenarios the cache has no chance to write data back to the disk.
Axel Mertes wrote:Are writes really performed from L1 to L2 cache and only then to HDD RAID6, or will they just go from L1 to HDD directly?
Level-2 cache only stores read-data. Write-data will not be gathered into level-2 cache.
Axel Mertes wrote:And are fresh written blocks automatically placed in the read cache too?
Yes, if you choose the cache strategy "Read-data & Write-data". (Of course, freshly written blocks will be placed into the level-1 cache only.)
Axel Mertes wrote:Is there a "debug version" or "debug mode" for PrimoCache which could create a simple record of the I/O operations that PrimoCache performs?
Sorry, so far we don't have such a version.
Axel Mertes
Level 9
Posts: 180
Joined: Thu Feb 03, 2011 3:22 pm

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Axel Mertes »

Hi Support,

thanks for the detailed response. Given your answers, I will reformulate my questions as suggestions for future implementation:

1. Implement a caching strategy where any written block will automatically be written to the L2 cache too (rather than to the L1 cache only, as it is now).

Reason:
We create far more soon-to-be-re-read data in a short time than the L1 cache can hold. Moving these blocks to a *large* L2 cache would have a strong impact on read speeds. We run e.g. render jobs at 4K film resolution on a render farm; the machines involved might create 100 GBytes of data for review in just a few minutes. If all of it has to be re-read from HDD rather than from the L2 cache, that causes unnecessary wear on the HDD system and wastes the L2 cache's potential.

2. Implement a cache strategy which provides a pure read cache and writes blocks straight onto the target disks, but which also copies written blocks immediately to the L1 and L2 cache for imminent re-reads.

Reason:
See #1

3. Optionally, a defer-write function might be implemented for writing freshly written blocks to the L2 SSD cache.

Reason:
This will reduce wear on the L2 cache SSDs.

4. Implement an automatic pre-read of a user-customizable number of blocks into the L1/L2 cache for any *read* block.

Example:
I want to have at least e.g. 10 blocks pre-read. So whenever a block "n" is read, it is made sure that blocks "n+1" up to "n+9" are also read, in sequential order, into the cache. If a block is already in the cache (which would be the standard case in this scenario), it is checked whether blocks "n+1" to "n+9" are also already in the cache. If not, the missing blocks are read into the cache (usually just block "n+9" in this example). This way PrimoCache runs "ahead" of any application asking for data, and the number of pre-read blocks lets us adjust latency for third-party use. The check might work from "n+9" downwards to "n+1", as this would be faster; see the sketch after this suggestion. ;-)


Reason:
In sequential-read situations like our film production scenario, we have a lot of dumb software that is unable to preload. If PrimoCache implemented this, it would off-load preloading to a more general level and improve total system performance. Clearly it would pre-load a certain number of unneeded blocks too, but the majority of reads would be sequential.
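A minimal sketch of this read-ahead check, assuming a hypothetical cache interface (contains and fetch_into_cache are illustrative names, not PrimoCache internals):

READ_AHEAD = 10  # user-configurable number of blocks to keep ahead

def on_block_read(cache, disk, n):
    """Called whenever block n is read; ensure n+1 .. n+READ_AHEAD-1 are cached.
    Scanning from the far end downwards allows an early stop: in steady
    sequential reads, hitting a cached block means the closer ones are cached too."""
    for i in range(n + READ_AHEAD - 1, n, -1):  # n+9 down to n+1
        if cache.contains(disk, i):
            break  # heuristic: earlier read-ahead already covered the rest
        cache.fetch_into_cache(disk, i)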

We tend to defragment very often and very efficiently. We minimize fragmentation by using Undelete / Undelete Server, which neither deletes files immediately nor frees up "holes" in the file system. We delete all files centrally at a given time, e.g. at night, and then start the defrag process, which closes the gaps. This strategy seriously reduces fragmentation and doubles as a safety net for files deleted or overwritten by mistake.

Further, in a true disaster scenario, you can recover files far more reliably and with a higher success rate from a defragmented volume than from a fragmented one. We also see improved defragmentation speed using PrimoCache Server!


5. Implement a transaction log that is kept in RAM and flushed to disk from time to time or on user request.

Reason:
If you keep it for a full session, it can be helpful for analyzing system failures or corruption.
It is extremely helpful for predicting caching optimizations.
I can even imagine clever software analyzing such a log and adjusting optimizations based on the analyzed use case. Based on the log file, one can simulate how different cache strategies would work and how much efficiency they would bring. Given your answers to my previous comment, I am sure we could improve cache performance by a few hundred percent and at the same time improve HDD speed by freeing up unnecessary I/Os.
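As a sketch of the kind of trace-driven simulation meant here, one could replay a (hypothetical) I/O trace against a simple LRU cache model; all names below are illustrative:

from collections import OrderedDict

def simulate_lru(trace, cache_blocks):
    """trace: iterable of (op, block) tuples; returns the read hit rate
    a cache of cache_blocks blocks would have achieved."""
    cache = OrderedDict()  # block -> None, kept in least-recently-used order
    reads = hits = 0
    for op, block in trace:
        if op == "READ":
            reads += 1
            if block in cache:
                hits += 1
                cache.move_to_end(block)
                continue
        # read miss or write: (re)insert the block as most recently used
        cache[block] = None
        cache.move_to_end(block)
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return hits / reads if reads else 0.0

# Example: how would a 2 TB cache of 64 KB blocks have performed?
# trace = parse_log("io_trace.log")  # hypothetical parser for the log above
# print(simulate_lru(trace, (2 * 10**12) // (64 * 1024)))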

I'd be willing to beta test this (I have more than 20 years of experience with software testing) and to assist with improvements.
You may contact me off-list if you are interested.
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Support »

Axel Mertes wrote:1. Implement a caching strategy where any written block will automatically be written to the L2 cache too (rather than to the L1 cache only, as it is now). Reason: We create far more soon-to-be-re-read data in a short time than the L1 cache can hold. Moving these blocks to a *large* L2 cache would have a strong impact on read speeds.
When the write-data is re-read, this data will be stored into level-2 cache.
Axel Mertes wrote:2. Implement a cache strategy which provides a pure read cache and writes blocks straight onto the target disks, but which also copies written blocks immediately to the L1 and L2 cache for imminent re-reads.
If you choose the "Read & Write" cache strategy without Defer-Write enabled, write-data will be written through to the target disks and also copied to the L1 cache.
Axel Mertes wrote:4. Implement an automatic pre-read of a user-customizable number of blocks into the L1/L2 cache for any *read* block.
Thanks. This "Read-Ahead" feature might be implemented in the future.

Again, thank you very much for your kind suggestions!
Axel Mertes
Level 9
Posts: 180
Joined: Thu Feb 03, 2011 3:22 pm

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Axel Mertes »

Support wrote:
Axel Mertes wrote:1. Implement a caching strategy where any written block will automatically be written to the L2 cache too (rather than to the L1 cache only, as it is now). Reason: We create far more soon-to-be-re-read data in a short time than the L1 cache can hold. Moving these blocks to a *large* L2 cache would have a strong impact on read speeds.
When the write-data is re-read, this data will be stored into level-2 cache.
Yes, I understand that re-reading data from the HDD will force it to be moved into the L2 read cache.

However, it is a HUGE waste of performance in our scenario.

Example:

When I have to render a "4K" 4096x2160 image sequence in uncompressed 32 bit/channel OpenEXR format, each single frame is ~138 MByte in size. For a 40-second sequence at 24 frames/second, that is 40 s * 24 frames/s * ~138 MByte/frame = ~132,480 MBytes = ~132.5 GBytes, for just 40 seconds of 4K results! The render farm can produce that data in just a few minutes.
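The arithmetic as a quick check (the ~138 MByte frame size is taken from the post; exact OpenEXR sizes vary with channel count and headers):

# Quick check of the sequence-size arithmetic from the post.
frame_mb = 138              # ~MByte per uncompressed 4K OpenEXR frame (from the post)
seconds, fps = 40, 24
total_mb = seconds * fps * frame_mb
print(total_mb, "MByte =", total_mb / 1000, "GByte")  # 132480 MByte = 132.48 GByte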

The size of this sequence alone is about 10 times the size of the L1 cache. So writing the data to the L1 cache has no effect at all, because it is wiped out long before the rendering is finished. When the user then reviews the sequence, it has to be re-read from the HDD RAID. It would be better to write data to HDD and L1 cache, and perform deferred writes from L1 to the L2 cache as well (deferred to reduce wear on the L2 cache SSDs). This off-loads the L1 cache and gets the data into the L2 cache extremely fast.

If you allowed storing write data in the L2 cache too, right at the moment it is written to the HDD, then we would not need to read from the HDD RAID at all, provided our L2 cache is big enough. Our L2 cache is currently 2 TByte. We would make it even larger if you implement this at some point, but 2 TByte is OK for the moment for a single drive.

Most of the time our users work like this:
Render, review, make changes, re-render, review, make changes, render, and so on. It can take a few dozen or even a hundred iterations until the user is satisfied with the results. So it's the very same sequence, over and over again. This could happen 100% inside the L2 cache (besides the written data also going to HDD), with the reads coming 100% from the cache, freeing up a lot of HDD RAID performance. This would significantly reduce wear and tear on the HDDs and improve the overall performance of the HDD RAID (since the sequences would not need to be read from it all the time).

I have made a huge effort to analyse the amount of data "touched" by our users on a single day. I found that we usually access around 0.5 to 2 TByte of data per working day. But the truth is that we access far more: when we add up the reads and the over-writes of the very same files, it can be a few TBytes per day. So if the L2 cache is bigger than the total size of the files, we might work from inside the cache alone. If we over-write data blocks with fresh versions, these will stay in the cache too.

Support wrote:
Axel Mertes wrote:2. Implement a cache strategy which provides a pure read cache and writes blocks straight onto the target disks, but which also copies written blocks immediately to the L1 and L2 cache for imminent re-reads.
If you choose the "Read & Write" cache strategy without Defer-Write enabled, write-data will be written through to the target disks and also copied to the L1 cache.
See example above.

I stumble over:

"...and without Defer-Write enabled, write-data will be written through to the target disks and also copied to L1 cache."

So will this copy to the L1 cache NOT happen when Defer-Write is enabled?
I always thought:
The L1 cache is used for the defer-write process, collecting blocks in the L1 cache and writing them "at once", in combined and potentially sequential write operations, to the HDD.
Now you wrote the exact opposite. Is that a mistake?


I am sure a lot of customers in my business would pay even more than the current Server version price for this, as it would save quite a lot of time and money in the long run, as one can easily demonstrate.
Support wrote:
Axel Mertes wrote:4. Implement an automatic pre-read of a user-customizable number of blocks into the L1/L2 cache for any *read* block.
Thanks. This "Read-Ahead" feature might be implemented in the future.

Again, thank you very much for your kind suggestions!
This is welcome news. A version that also outputs the cache operations (a log file...) would help a lot in simulating the results. If you provide me with a version that outputs such a log, I can do the simulation... ;-)

Btw, could you integrate a simple timer into the cache statistics, so we can see how long the cache has been running to produce the given statistics?
And whenever we reset the statistics, the timer would reset too, so we get an idea of how much data is processed in how much time.
Support
Support Team
Posts: 3627
Joined: Sun Dec 21, 2008 2:42 am

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Support »

Axel Mertes wrote:So will this copy to the L1 cache NOT happen when Defer-Write is enabled? I always thought: The L1 cache is used for the defer-write process, collecting blocks in the L1 cache and writing them "at once", in combined and potentially sequential write operations, to the HDD. Now you wrote the exact opposite. Is that a mistake?
Sorry, I didn't explain it clearly. With the "Read & Write" or "Write Only" cache strategy, write-data will always be stored in the L1 cache. In addition, without Defer-Write, write-data will also be immediately written to the target disk, while with Defer-Write the write-data will be written to the target disk after a certain delay.
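To illustrate the two paths described above, here is a conceptual sketch; the names and structure are purely illustrative and not PrimoCache's actual implementation:

# Conceptual illustration of the write paths described above.
def handle_write(l1_cache, target_disk, block, data, defer_write):
    l1_cache.store(block, data)        # always cached in L1 ("Read & Write" / "Write Only")
    if defer_write:
        l1_cache.mark_dirty(block)     # flushed to the target disk later, after the delay
    else:
        target_disk.write(block, data) # write-through: hits the target disk immediately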
Axel Mertes wrote:Btw, could you integrate a simple timer into the cache statistics, so we can see how long the cache has been running to produce the given statistics? And whenever we reset the statistics, the timer would reset too, so we get an idea of how much data is processed in how much time.
Sure, thank you for the suggestion!
Axel Mertes
Level 9
Posts: 180
Joined: Thu Feb 03, 2011 3:22 pm

Re: Preload currently accessed files -> improve Audio/Video/Film

Post by Axel Mertes »

Hi Support,

when you implement the time display / reset, you might add a curve graph for the L2 cache as well.
And what about other graphs, e.g. for the defer-write percentage? But maybe it's better then to think about implementing a logging option (e.g. at a regular interval), so we can create such curves for analysis ourselves from the logs. That would not be the same log as requested above, where *every* operation should be recorded for deep analysis and optimization.

Looking forward to your next version.