What the people are asking for + a few others.

Report bugs or suggestions around FancyCache
mabellon
Level 3
Posts: 10
Joined: Fri May 25, 2012 5:32 pm

Re: What the people are asking for + a few others.

Post by mabellon »

I was only half serious about the Filesystem part. That would simply be the cleanest solution - caching built right in. Obviously the work involved would mostly outweigh the gains. Personally I am OK with not-100% solutions at home as long as the error cases are fully disclosed.

Agreed. For removable flash drives, hopefully they can be detected and flagged as not persistent. Otherwise, user error. Ignoring dual boot and USB flash, I'm still not convinced that there isn't a possibility of things changing in 'offline mode' (that is, normal use-cases where the storage changes without FC, regardless of whether FC can handle them). There are just a ton of possibilities, but hopefully they figure them out or find workarounds. More food for thought - Safe Mode, restore points, defrag (moves blocks, but files are the same), chkdsk remapping bad sectors. I honestly don't use these features, so I'm not sure.

You've mentioned MD5 a few times. Please elaborate. It seems that you want to read the file off disk again, hash it and compare against the cache before using the cached copy? Sorry, I must be misunderstanding you because that would seem to defeat the entire purpose. At that point you've not only wasted CPU cycles on MD5, but also waited on the slow disk. If NTFS already had checksums/MD5 precomputed then it could be a great idea.

Also, depending on the implementation, File Access timestamps may not perform all that well. If the cache is implemented at block level, you would probably still have to re-read the file to block mapping and file timestamps every time on boot. Not to mention reading timestamps for every little file. Perhaps just an inconvenience, but still not ideal performance wise. Better than what we have currently, that's for sure.

Perhaps another possibility would be using the NTFS log/journal. I know little about it, but in theory, if the log hasn't changed since the last time FC was running, the cache is good. If the log has changed, either fallback to timestamps or rebuild the cache.
Mradr
Level 7
Posts: 87
Joined: Sun Mar 25, 2012 1:36 pm

Re: What the people are asking for + a few others.

Post by Mradr »

I was only half serious about the Filesystem part. That would simply be the cleanest solution - caching built right in. Obviously the work involved would mostly outweigh the gains. Personally I am OK with not-100% solutions at home as long as the error cases are fully disclosed.
lol I know. It would take a while to create a file system, and even then we wouldn't really see a big boost in speed. Same here. I don't mind a bit of risk as long as it's disclosed to me, but I think FC is more worried about user error than about taking a leap of faith in its users. Which I can understand. I mean, there are a few people I question... but idk. Simply placing a warning sign would get them out of a lot of trouble, and then it just falls on the user for their mistakes of misuse. Trust me, I am a software writer myself xD I spend more time trying to protect the user from themselves than on the actual program half the time. Sometimes it is just best to say, "Look, here is the program. If you "$%#" it up by bypassing my protection, don't come crying to me," lol.
Agreed. For removable flash drives, hopefully they can be detected and flagged as not persistent. Otherwise, user error. Ignoring dual boot and USB flash, I'm still not convinced that there isn't a possibility of things changing in 'offline mode' (that is, normal use-cases where the storage changes without FC, regardless of whether FC can handle them). There are just a ton of possibilities, but hopefully they figure them out or find workarounds.
I only mentioned it once ^^; We wouldn't be using MD5 specifically - I was just picking a hash algorithm for the example. Also, most of this happens on the first run of the program, as my little flow chart was saying. It only changes once, and then it only gets updated when the data changes again, per boot. FC doesn't help on first runs anyway, so there wouldn't really be much of a difference. Actually you would see a boost, because the data is already cached and can be served in time (unless it is new data, in which case it has to be cached as it runs). <- What do I mean? This:

CPU cycles are actually cheap at this point because the data is loading from the hard drive. A hard drive takes a while to read its data, usually on the order of milliseconds, while the CPU runs in nanoseconds (1 ms is 1,000,000 ns). Because of this, it is faster to check whether the data is already in our cache before hitting the HDD. Again, it only has to do this once, on the first time the data is read per boot. If there is a cache file, it will load faster because it doesn't have to hit the HDD as hard (think of it as a half-speed boost); every read after that gets the full-speed boost.
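Something like this, as a rough sketch (plain Python just to show the flow; the names and the in-memory dicts are made up, and the real thing would live at the block/driver level, not the file level):

import hashlib

persistent_cache = {}   # path -> (stored_hash, cached_bytes); imagine this survives reboots
validated = set()       # paths already re-checked since this boot

def read_file(path):
    if path in validated:                       # already verified this boot
        return persistent_cache[path][1]        # pure cache hit, no disk I/O
    with open(path, "rb") as f:                 # first access this boot:
        data = f.read()                         #   one (slow) disk read
    digest = hashlib.sha256(data).hexdigest()   #   the CPU cost is tiny next to the I/O
    entry = persistent_cache.get(path)
    if entry is None or entry[0] != digest:     # data changed offline -> refresh the entry
        persistent_cache[path] = (digest, data)
    validated.add(path)
    return persistent_cache[path][1]

The expensive part (the disk read plus the hash) only happens on the first touch per boot; after that every read of the same data comes straight out of the cache.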

Again, FC pretty much does this already, just not with persistent caching. I wrote a program like this a while back at the file system level. It cut my file read times almost in half, since the data was already cached and all it had to do was check that the file had not changed on first load, so I know it can be done.

I think at the block level there is already a checksum, but I could be wrong about that. This would also increase speed... by a lot, if FC wouldn't need to compute the checksum itself.

Also, depending on the implementation, File Access timestamps may not perform all that well. If the cache is implemented at block level, you would probably still have to re-read the file to block mapping and file timestamps every time on boot. Not to mention reading timestamps for every little file. Perhaps just an inconvenience, but still not ideal performance wise. Better than what we have currently, that's for sure.

Perhaps another possibility would be using the NTFS log/journal. I know little about it, but in theory, if the log hasn't changed since the last time FC was running, the cache is good. If the log has changed, either fallback to timestamps or rebuild the cache.
I think the log/journal you're talking about is the access timestamps. A system doesn't have to update the log/journal, per se. Windows does by default, but I know you can turn it off if you wish, so yes, you would run the risk of another user error if they turn it off and we use that method. You really only see the slowdown on the first read again, as I said above. After that, you wouldn't need to worry about re-reading the timestamps.

The method I was using protected against that, so I don't know. lol I know reading timestamps is slow, so I was going for checksums at the block level, or a hash, since those can't be changed by the user.
More food for thought - Safe Mode, restore points, defrag (moves blocks, but files are the same), chkdsk remapping bad sectors. I honestly don't use these features, so I'm not sure.
Again, as far as I know, the NTFS file system would be able to track changes within the files if those features are used. I still think using a hash or a checksum would be the better idea, even if they are a bit CPU hungry.

Btw ^^ Nice talking to you. You made me think :P haha ^^ I just hope FC/Support reads our posts sometime. I think we're on a roll figuring out something that might work and still be somewhat safe. I emailed them about it ^^; so hopefully they check it out. We could be totally wrong too, haha. I hope it goes okay and they like the idea.

Support, I could try to rewrite this a bit more clearly if you are having a hard time following our plan.

Also note:
[0.8.0].[4]
5) Throttle defer-write speed has become number 6.
6) Keep-Alive Performance Monitor with auto start is now number 5.
5) Keep-Alive Performance Monitor: added auto start and another tip about using the Windows Performance Monitor instead.
mabellon
Level 3
Posts: 10
Joined: Fri May 25, 2012 5:32 pm

Re: What the people are asking for + a few others.

Post by mabellon »

I think the log/journal you're talking about is the Access timestamps.
Nope. Timestamps are per file, stored in the MFT I believe. The NTFS journal should be a better choice.

"The change journal is much more efficient than time stamps or file notifications for determining changes in a particular namespace. Applications that must rescan an entire volume to determine changes can now scan once and subsequently refer to the change journal. The I/O cost depends on how many files have changed, not on how many files exist on the volume."

See NTFS Change Journal
http://technet.microsoft.com/en-us/libr ... s.10).aspx

And here's another article on MSDN. This one uses the anecdote of an automatic backup program. If the app is running (ideal) you can get change notifications. If you can't guarantee that, you could brute-force scan the volume; instead, MSDN suggests using the NTFS journal. They also bring up a good point - the file system indexes (Vista/Win7) aren't lost by "offline" changes, because they rely on the journal. So it can be done!
http://msdn.microsoft.com/en-us/library ... s.85).aspx
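If you want to poke at the journal from user mode, something like this dumps its current state (just a sketch using the built-in fsutil tool; a real driver would issue FSCTL_QUERY_USN_JOURNAL itself, and the text parsing below is a guess since the labels can vary by Windows version):

import subprocess

def query_usn_journal(volume="C:"):
    # Needs admin rights. Runs "fsutil usn queryjournal C:" and turns the
    # "Label : value" lines into a dict (journal ID, first/next USN, etc.).
    out = subprocess.run(["fsutil", "usn", "queryjournal", volume],
                         capture_output=True, text=True, check=True).stdout
    state = {}
    for line in out.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            state[key.strip()] = value.strip()
    return state

print(query_usn_journal("C:"))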
Mradr
Level 7
Posts: 87
Joined: Sun Mar 25, 2012 1:36 pm

Re: What the people are asking for + a few others.

Post by Mradr »

I see. I just found the time to look it up. Looks pretty cool actually. The file system updates the journal on every write/change.
It seems like it would work ^^ Though we would still have to see if it works in offline mode, and this again forces the use of NTFS, which means we're moving up from the block level to the file system level.

But it seems we also have to worry about a downside:
http://en.wikipedia.org/wiki/Journaling_file_system
Write hazards

The write cache in most operating systems sorts its writes (using the elevator algorithm or some similar scheme) to maximize throughput. To avoid an out-of-order write hazard with a metadata-only journal, writes for file data must be sorted so that they are committed to storage before their associated metadata. This can be tricky to implement because it requires coordination within the operating system kernel between the file system driver and write cache. An out-of-order write hazard can also exist if the underlying storage:
cannot write blocks atomically, or
does not honor requests to flush its write cache

To complicate matters, many mass storage devices have their own write caches, in which they may aggressively reorder writes for better performance. (This is particularly common on magnetic hard drives, which have large seek latencies that can be minimized with elevator sorting.) Some journaling file systems conservatively assume such write-reordering always takes place, and sacrifice performance for correctness by forcing the device to flush its cache at certain points in the journal (called barriers in ext3 and ext4).[4]
They also bring up a good point - the file system indexes (Vista/Win7) aren't lost by "offline" changes, because they rely on the journal. So it can be done!
Not sure I follow what you mean here. Indexing isn't really used all that much unless you do a lot of searches. lol.

Also, here is an image I found that will help us better see what we're dealing with: 4 = slowest, 1 = fastest, each level differing by a factor of, let's say, 10.
Image

From what I can tell, there is a checksum at the block layer, but I can't seem to find the command for it, which again tells me there might not be one. You would have to create one at that level.


Oh snap, we might have a winner here:
http://anselmo.homeunix.net/OReilly/boo ... l2/088.htm
The hash table array is stored in bdev_hashtable variable; it includes 64 lists of block device descriptors. Each descriptor is a block_device data structure whose fields are shown in Table 13-4.
That would be a 1000-fold increase, if it does what I think it does, compared to the NTFS journal.

Either way, the way I look at it, we just found two ways to make this both safe and usable xD
mabellon
Level 3
Posts: 10
Joined: Fri May 25, 2012 5:32 pm

Re: What the people are asking for + a few others.

Post by mabellon »

I was aware of the write issues actually - I'm ignoring write caching. I've always assumed FC was ignoring flushes to get great write performance. I don't think I would trust defer writes.
Not sure I follow what you mean here. Indexing isn't really used all that much unless you do a lot of searches. lol.
Basically the search indexer is building its own cache to improve search performance. It's obviously caching something different, but ultimately it relies on files and their contents. File changes could invalidate the cached index. As the MSDN article implies, the index doesn't need to be rebuilt every time you reboot, thanks to the NTFS journal. In this case, an invalid cached entry probably would only mean bad search results or slower searches, whereas an invalid FC cache means potential corruption. Nevertheless, the principle is the same - rely on the NTFS journal state to indicate if your cache is stale.
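Concretely, the check could be as simple as saving the journal ID and the next USN when FC shuts down and comparing them at the next startup. Roughly (only a sketch of the decision, the field names are mine, not FC's):

def cache_is_trustworthy(saved, current):
    # saved / current: dicts like {"journal_id": ..., "next_usn": ...},
    # captured at FC shutdown and at the next startup respectively.
    if saved is None:
        return False                                  # first run: nothing to trust yet
    if saved["journal_id"] != current["journal_id"]:
        return False                                  # journal was deleted/recreated offline
    if current["next_usn"] > saved["next_usn"]:
        return False   # change records were written since then; replay them or rebuild
    return True        # no recorded changes since we last saw the volume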
Mradr
Level 7
Posts: 87
Joined: Sun Mar 25, 2012 1:36 pm

Re: What the people are asking for + a few others.

Post by Mradr »

mabellon wrote:I was aware of the write issues actually - I'm ignoring write caching. I've always assumed FC was ignoring flushes to get great write performance. I don't think I would trust defer writes.
Not sure I follow what you mean here. Indexing isn't really used all that much unless you do a lot of searches. lol.
Basically the search indexer is building its own cache to improve search performance. It's obviously caching something different, but ultimately it relies on files and their contents. File changes could invalidate the cached index. As the MSDN article implies, the index doesn't need to be rebuilt every time you reboot, thanks to the NTFS journal. In this case, an invalid cached entry probably would only mean bad search results or slower searches, whereas an invalid FC cache means potential corruption. Nevertheless, the principle is the same - rely on the NTFS journal state to indicate if your cache is stale.
Ha, I guess we had two different goals then also :P Either way, we have something that will work now. FC just needs to read our comments and we just have to hope they work the way we think they work.
I've always assumed FC was ignoring flushes to get great write performance. I don't think I would trust defer writes.
They do, but I wanted a way to do both. I mean, there wouldn't really be any loss in doing both at the same time. Oh yeah? Same here when it comes to using L1 myself, but I could trust L2 if it was a non-volatile storage drive. I mean, Intel does it just fine with their RST program that just came out. It does seem to increase speeds by a lot, using the SSD as a sort of cache drive. More or less you are getting a software RAID in the end.
mabellon
Level 3
Posts: 10
Joined: Fri May 25, 2012 5:32 pm

Re: What the people are asking for + a few others.

Post by mabellon »

I mean, Intel does it just fine with their RST program that just came out.
Thanks for reminding me. I was so excited when this feature was announced... and so disappointed when I found out that the new chipset was a requirement. :(

But Intel actually doesn't handle this either. They have 2 modes. The "Enhanced mode" is write-through. So the slow disk never misses any writes and you still have slow write performance. The "Maximum" performance mode is write-back just like 'defer-writes'. It still suffers from risking corruption in that case. Source http://www.anandtech.com/show/4329/inte ... g-review/2

Might be worth risking for L2 only.
Support
Support Team
Posts: 3622
Joined: Sun Dec 21, 2008 2:42 am

Re: What the people are asking for + a few others.

Post by Support »

Regarding the persistent cache, the major problem is how to know whether the data was changed offline or not.
Here 'offline' refers to any scenario in which FC is not running, as mabellon said. Offline scenarios can be classified into two categories: FC-Aware Scenarios and FC-Unaware Scenarios.
FC-Aware Scenarios: FC knows that the cached data may have changed, such as when users temporarily stop/pause caching, do other disk tasks, and then restart caching again.
FC-Unaware Scenarios: FC doesn't get any notification that the cached data may have changed, such as when booting into another OS or mounting the disk on another computer.
Though FC-Unaware Scenarios don't always happen, we have to handle them. Otherwise we may get lots of complaints from users :(

OK, now let's get back to detection approaches for determining whether the cached data is outdated or not. When FC finds the cached data is outdated, obviously it can invalidate the existing cache and rebuild it. You guys have already discussed some approaches:
1) MD5/HASH
2) TimeStamp
3) NTFS Change Journal
The detection approach shall be reliable and quick. Basically we wouldn't prefer MD5/hash, as it has to read the original disk data, which would degrade performance a lot. TimeStamp may not be reliable; the volume's timestamp may not have changed at all. The NTFS Change Journal is a great idea, but obviously it is only for the NTFS file system, though that shall not be a problem.
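Roughly, the order we would try them in might look like this (only a sketch of the decision order being discussed here, not actual FC code; the parameters just name the three approaches above):

def validate_persistent_cache(is_ntfs, journal_unchanged, timestamps_reliable,
                              timestamps_unchanged):
    # Return True if the existing persistent cache may be reused, otherwise rebuild it.
    if is_ntfs and journal_unchanged is not None:
        return journal_unchanged            # 3) NTFS Change Journal: quick and reliable
    if timestamps_reliable:
        return timestamps_unchanged         # 2) TimeStamp: quick but weaker
    return False                            # 1) MD5/HASH reads the disk anyway,
                                            #    so just rebuild the cache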

Just some comments. Thank you, Mradr and mabellon. Great discussion!
Mradr
Level 7
Posts: 87
Joined: Sun Mar 25, 2012 1:36 pm

Re: What the people are asking for + a few others.

Post by Mradr »

I say we use the NTFS Change Journal approach then. It would force the use of NTFS, but most users running Windows are already using it for TRIM support and/or dual-booting Linux with NTFS as the common ground between the two. If anything, it would just be an "add-on" that users could click to enable the persistent cache, as long as they meet the requirements (both drives using NTFS). ^^ Sounds good to me.
manus
Level 4
Posts: 28
Joined: Fri Nov 18, 2011 6:03 pm

Re: What the people are asking for + a few others.

Post by manus »

support wrote:Regarding the persistent cache, the major problem is how to know
.....
Just some comments. Thank you, Mradr and mabellon. Great discussion!
Do you think my idea would be possible?
Maybe you could save only which blocks (addresses) are used, and not the blocks themselves, and at startup reload the blocks directly from the hard drive. With this system you are sure that the data is correct and fresh.
It's not a persistent cache, but it keeps the optimized block usage.
And you can reload sequentially by ordering the addresses to get the data more quickly.
This method can work for L1 and L2 cache.
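For example, roughly like this (only a sketch; the block size, the file name and the raw-device read are all made up for illustration):

import pickle

BLOCK_SIZE = 4096                    # assumed block size
HOT_LIST = "hot_blocks.bin"          # hypothetical list written at shutdown

def save_hot_blocks(block_addresses):
    # Persist only the addresses of the hot blocks, never their contents.
    with open(HOT_LIST, "wb") as f:
        pickle.dump(sorted(set(block_addresses)), f)

def prewarm_cache(raw_device_path):
    # Re-read the hot blocks from the disk itself in ascending address order,
    # so the data can never be stale and the I/O stays mostly sequential.
    with open(HOT_LIST, "rb") as f:
        addresses = pickle.load(f)
    cache = {}
    with open(raw_device_path, "rb") as dev:
        for addr in addresses:
            dev.seek(addr * BLOCK_SIZE)
            cache[addr] = dev.read(BLOCK_SIZE)
    return cache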