BonzaiDuck wrote:Yo, Greg!
I'm trying to wrap my brain around what you're trying to do.
Is this a database system? Is your data stored in relational tables and files?
I do computer forensics. My datasets are copies (images) of entire disk drives, capturing every physical sector on the drive. In general, that's considered unstructured data.
BonzaiDuck wrote:How big is the largest file that gets loaded from these spinner drives? I could ask why so many USB3 connections, or why you haven't built some sort of NAS, but it's peripheral.
I'm processing a 3.7TB drive right now. I made the copy of it over the weekend, so this data is now connected to my PC for the first time.
I segmented it down to 8GB per segment file, but it's misleading to think of it as 8GB files; it's really one 3.7TB image. Obviously that's too large to pre-load into the L2 cache, so I'm not doing that. To leverage the L2 cache, the first thing I did this morning was "hash" all the PSTs on the drive. There was 500GB of PSTs on the drive, and hashing them pulled them all into L2 cache. After hashing, I compared the hashes and ignored any PST copies whose hashes matched others on the drive. That got me down to 235GB of unique PSTs.
Within that 3.7TB, the largest files I care about are those PST files. To repeat, there's 235GB of unique PST files on the drive image. The biggest single PST is 40GB. I put all of those PSTs in L2 cache as my first action for the day.
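For the curious, the hash-and-dedupe step is conceptually just something like this in a Cygwin shell (a rough sketch, not what my forensic software actually runs; the directory and the *.pst glob are placeholders, and it assumes no spaces in the filenames):
cd <pst_export_dir>
md5sum *.pst | sort > pst_hashes.txt
awk '!seen[$1]++ {print $2}' pst_hashes.txt > unique_psts.txt
Any file whose hash already appeared earlier in the list gets dropped, which is the same 500GB-to-235GB reduction described above.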
I just finished parsing those PSTs. That means going through each PST and pulling out every email. It took right at 2 hours, which is crazy fast. Five years ago, with a decent PC and rotating drives, my benchmark was 1 GB of PST data processed per hour. This is going at closer to 120 GB per hour.
Note that PSTs are not read linearly when parsing them; you do a ton of random I/O. That's why it takes about 1GB/hour on rotating drives with no decent caching mechanism.
Now that the PSTs are parsed, I have to run a keyword search against the entire drive. This time I'm only searching existing (non-deleted) files, so it's 2.8TB of data I need to search. I'm going to "pause" my cache for the 3.7TB image because I don't want the PSTs/emails to be dropped.
Since only about 10% of the search will hit the L2 cache, I imagine it will take overnight to run. It's only 10:30 AM here, so I'll find out tomorrow morning whether it got done that fast or not.
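Rough math behind the "overnight" guess, assuming the USB3 spinner sustains somewhere around 100 MB/s on mostly-sequential reads (an assumption, not a measurement): the ~2.5TB that isn't cached is roughly 25,000 seconds, call it 7 hours of raw read time, before any search overhead. So overnight seems about right.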
BonzaiDuck wrote:The choice of an L1 should probably anticipate the largest file size loaded at one time. I wouldn't attempt to use an entire 1TB NVMe drive as SSD cache unless you could actually do what you asked with your thread.
Last week I saw my 2TB L2 cache down to 512GB free. I've got my L1 cache set at 32GB, but maybe I should go to 64GB based on what you said -- my biggest single PST is 40GB, and that won't fit in a 32GB L1.
BonzaiDuck wrote:For workstations, when they released the Z68 chipset or possibly others, or if you had one of several Marvell chipsets in a storage controller, you could use Intel ISRT or Marvell's Hyper-Duo for SSD-caching, but Primo is hardware and storage-mode agnostic. The original solutions of some 5 years ago limited you to an SSD cache-drive size of about 60 GB.
Thanks
BonzaiDuck wrote:The first thought one would have about this is to accelerate the boot-system disk. If it was an SSD, you couldn't do much before until we got these Sammy NVMe M.2 SSD (cards), or some other PCIE NVMe. Now you can accelerate an SATA SSD to an NVMe M.2, and cache it to RAM as well.
It's an SSD, but a SATA one. I've added it to my L2 cache setup. I don't know how much it helps, but it doesn't seem to hurt.
BonzaiDuck wrote:So that's my first priority. If it were a matter of a database, it would be on my home-server and accelerated there, limited only by my Gigabit Ethernet connection to that server. That would be second priority. But I can't imagine using more than -- say -- 100 GB caching volumes on an NVMe. I've discovered that I can cache my SATA SSD boot-disk with only about 40GB of a 100GB caching volume, and I've probably "over-cached" my HDDs with the remaining 60GB.
Primo doesn't quite work the way you might want it -- as with your question. But it works nevertheless in a stealthy way to cache to SSD or NVMe SSD.
I've found I can preload it pretty well using Cygwin (a free, open-source Linux compatibility layer for Windows). I start a Cygwin bash shell, then:
cd <image_dir>; cat * | dd of=/dev/null bs=1M status=progress
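(All that does is stream every segment file sequentially through dd to /dev/null; the reads themselves are what matter, since PrimoCache picks up the blocks as they go by.)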
With the gather interval set at "1", that seems to do an excellent job of pre-loading the cache with that image before I start my analysis.
Last week I was working with a set of 8 images spread across 6 USB3 drives. I simultaneously started a Cygwin pre-load command on each drive, and PrimoCache seemed to do an excellent job of pulling all of those images into L2 cache at the same time. That's when I saw my free L2 cache drop to 500GB or so.
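The parallel kickoff is nothing fancy -- just backgrounding one pre-load per drive, something like this (the drive letters and "images" folder names are made up):
for d in /cygdrive/e /cygdrive/f /cygdrive/g /cygdrive/h /cygdrive/i /cygdrive/j; do
  ( cd "$d/images" && cat * | dd of=/dev/null bs=1M status=progress ) &
done
wait   # block until every pre-load finishes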
BonzaiDuck wrote:I'm just clueless about data-sets the size you describe, or how they could all be a single file.
For me, I'm testing out my dual-boot [Win7/10] by splitting disk resources so they don't get mixed up during any given OS user session. My boot-system is a cheap ADATA 480GB SSD. I have a 2TB Seagate Barracuda spinner with an extension to the Program Files and other things that would be specific to a given OS -- divided in 1TB parts for each respective OS. Then there's a 1TB media drive which isn't cached at all. RAM wouldn't help it; and I'd have to split it between the OS's in different volumes to cache it any other way. It doesn't need to be cached.
But both OS's have to be able to access the same data, and change and append to it, without screwing up a cache. That data would either be on an uncached (or RAM-only cached) NVMe volume, or an HDD volume. The problem with using the media drive for those files as well: I don't want to cache a drive containing 10GB HD movies and DVR captures.
Thanks
BonzaiDuck wrote:I did a maintenance check on a 60GB caching-SSD after two years, before I flushed the cache and recreated the volume during the maintenance. It had filled up caching the OS-boot drive -- only a few GB free space. But the TBW racked up by the drive was less than 5TB.
My 2TB L2 cache drive is at 9.1 TB written and I've only had it 3 weeks. Say 0.5 TB/day. 1200 TBW means 2400 days before it dies of use. That's 6 1/2 years. That seems fine.