On Wed, Apr 05 2017, Jan Kara wrote:

> On Wed 05-04-17 11:43:32, NeilBrown wrote:
>> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>>
>> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
>> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
>> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
>> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
>> >> > > > Because if above is acceptable we could make reported i_version to be a sum
>> >> > > > of "superblock crash counter" and "inode i_version". We increment
>> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
>> >> > > > That way after a crash we are guaranteed each inode will report new
>> >> > > > i_version (the sum would probably have to look like "superblock crash
>> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
>> >> > > > i_version numbers we gave away but did not write to disk but still...).
>> >> > > > Thoughts?
>> >> >
>> >> > How hard is this for filesystems to support? Do they need an on-disk
>> >> > format change to keep track of the crash counter? Maybe not, maybe the
>> >> > high bits of the i_version counters are all they need.
>> >> >
>> >>
>> >> Yeah, I imagine we'd need a on-disk change for this unless there's
>> >> something already present that we could use in place of a crash counter.
>> >
>> > We could consider using the current time instead. So, put the current
>> > time (or time of last boot, or this inode's ctime, or something) in the
>> > high bits of the change attribute, and keep the low bits as a counter.
>>
>> This is a very different proposal.
>> I don't think Jan was suggesting that the i_version be split into two
>> bit fields, one the change-counter and one the crash-counter.
>> Rather, the crash-counter was multiplied by a large-number and added to
>> the change-counter with the expectation that while not every
>> change-counter landed on disk, at least 1 in every large-number would.
>> So after each crash we effectively add large-number to the
>> change-counter, and can be sure that number hasn't been used already.
>
> Yes, that was my thinking.
>
>> To store the crash-counter in each inode (which does appeal) you would
>> need to be able to remove it before adding the new crash counter, and
>> that requires bit-fields. Maybe there are enough bits.
>
> Furthermore you'd have a potential problem that you need to change
> i_version on disk just because you are reading after a crash and such
> changes tend to be problematic (think of read-only mounts and stuff like
> that).
>
>> If you want to ensure read-only files can remain cached over a crash,
>> then you would have to mark a file in some way on stable storage
>> *before* allowing any change.
>> e.g. you could use the lsb. Odd i_versions might have been changed
>> recently and crash-count*large-number needs to be added.
>> Even i_versions have not been changed recently and nothing need be
>> added.
>>
>> If you want to change a file with an even i_version, you subtract
>> crash-count*large-number
>> from the i_version, then set lsb. This is written to stable storage before
>> the change.
>>
>> If a file has not been changed for a while, you can add
>> crash-count*large-number
>> and clear lsb.
>>
>> The lsb of the i_version would be for internal use only. It would not
>> be visible outside the filesystem.
>>
>> It feels a bit clunky, but I think it would work and is the best
>> combination of Jan's idea and your requirement.
>> The biggest cost would be switching to 'odd' before any changes, and the
>> unknown is when does it make sense to switch to 'even'.
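
To make the odd/even scheme quoted above concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions, not
code from any filesystem: the names Stored, visible, make_odd, make_even
and bump are hypothetical, the flag is kept as a separate field rather
than packed into the lsb, and LARGE = 65536 is simply borrowed from the
multiplier in Jan's earlier example.

```python
from dataclasses import dataclass

# Assumed "large-number"; 65536 is the multiplier from Jan's example.
LARGE = 1 << 16


@dataclass(frozen=True)
class Stored:
    """Hypothetical per-inode on-disk state."""
    value: int  # stored counter; may go negative while the flag is set
    odd: bool   # stands in for the lsb: True means "possibly changed recently"


def visible(s: Stored, crash_count: int) -> int:
    """i_version as reported outside the filesystem."""
    return s.value + crash_count * LARGE if s.odd else s.value


def make_odd(s: Stored, crash_count: int) -> Stored:
    """Even -> odd. Must reach stable storage before the file is changed."""
    return s if s.odd else Stored(s.value - crash_count * LARGE, True)


def make_even(s: Stored, crash_count: int) -> Stored:
    """Odd -> even, for a file that has not been changed for a while."""
    return Stored(s.value + crash_count * LARGE, False) if s.odd else s


def bump(s: Stored) -> Stored:
    """Record one change; only allowed once the inode is in the odd state."""
    if not s.odd:
        raise ValueError("inode must be switched to 'odd' before changing it")
    return Stored(s.value + 1, True)


if __name__ == "__main__":
    crash_count = 3
    quiet = Stored(100, False)           # untouched file, stays even
    busy = bump(make_odd(quiet, crash_count))
    print(visible(quiet, crash_count), visible(busy, crash_count))
    # Simulate an unclean shutdown: only the per-fs crash count changes.
    print(visible(quiet, crash_count + 1), visible(busy, crash_count + 1))
```

Switching between the two states leaves the visible value unchanged,
while a crash moves only the inodes that were in the odd state, by one
multiple of LARGE.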
>
> Well, there is also a problem that you would need to somehow remember with
> which 'crash count' the i_version has been previously reported as that is
> not stored on disk with my scheme. So I don't think we can easily use your
> scheme.

I don't think there is a problem here.... maybe I didn't explain
properly or something.

I'm assuming there is a crash-count that is stored once per filesystem.
This might be a disk-format change, or maybe the "Last checked" time
could be used with ext4 (that is a bit horrible though).

Every on-disk i_version has a flag to choose between:
  - use this number as it is, but update it on-disk before any change
  - add multiple of current crash-count to this number before use.
    If you crash during an update, the i_version is thus automatically
    increased.

To change from the first option to the second option you subtract the
multiple of the current crash-count (which might make the stored
i_version negative), and flip the bit.
To change from the second option to the first, you add the multiple
of the current crash-count, and flip the bit.
In each case, the externally visible i_version does not change.
Nothing needs to be stored except the per-inode i_version and the per-fs
crash_count.

>
> So the options we have are:
>
> 1) Keep i_version as is, make clients also check for i_ctime.
>    Pro: No on-disk format changes.
>    Cons: After a crash, i_version can go backwards (but when file changes
>    i_version, i_ctime pair should be still different) or not, data can be
>    old or not.

I like to think of this approach as using the i_version as an extension
to the i_ctime.
i_ctime doesn't necessarily change on every file modification, either
because it is not a modification that is meant to change i_ctime, or
because i_ctime doesn't have the resolution to show a very small change
in time, or because the clock that is used to update i_ctime doesn't
have much resolution.
So when a change happens, if the stored ctime changes, set i_version to
zero, otherwise increment i_version.
Then the externally visible i_version is a combination of the stored
ctime and the stored i_version.
If you only used 1-second ctime resolution for versioning purposes, you
could provide a 64bit i_version as 34 bits of ctime and 30 bits of
changes-in-one-second.
It is important that the ctime granularity used is shorter than the
fastest possible restart after a crash; otherwise a counter value lost
in the crash could be reused within the same ctime tick.

I don't think that i_version going backwards should be a problem, as
long as an old version means exactly the same old data. Presumably
journalling would ensure that the data and ctime/version are updated
atomically.

>
> 2) Fsync when reporting i_version.
>    Pro: No on-disk format changes, strong consistency of i_version and
>    data.
>    Cons: Difficult to implement for filesystems due to locking constraints.
>    High performance overhead of i_version reporting.

This reminds me of the old ext3 fsync-when-renaming a file. People
might depend on it for all the wrong reasons, and other people might
studiously avoid it due to the performance implications.

>
> 3) Some variant of crash counter.
>    Pro: i_version cannot go backwards.
>    Cons: Requires on-disk format changes. After a crash data can be old
>    (however i_version increased).

If it is essential for i_version to always go forward, then I think this
is the best approach.
If an i_version reset can be tolerated, then I think a
time-plus-version-count approach is likely to be best.

Thanks,
NeilBrown

>
> Honza
> --
> Jan Kara
> SUSE Labs, CR
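
For reference, the time-plus-version-count idea from the reply to
option 1 can be sketched in a few lines of Python. This is a rough
illustration only: record_change and visible_version are hypothetical
names, the 34/30 bit split is the one suggested above for 1-second
ctime resolution, and raising an error on counter overflow is an
assumption (the thread does not say what should happen in that case).

```python
CTIME_BITS = 34   # seconds of ctime kept in the high bits
COUNT_BITS = 30   # changes-in-one-second kept in the low bits
CTIME_MASK = (1 << CTIME_BITS) - 1
COUNT_MASK = (1 << COUNT_BITS) - 1


def record_change(stored_ctime: int, stored_count: int, now: int):
    """Return the new (ctime, count) pair after one change at time `now`.

    If the stored ctime changes, the count restarts at zero; otherwise
    the count is incremented.
    """
    if now != stored_ctime:
        return now, 0
    return stored_ctime, stored_count + 1


def visible_version(ctime_sec: int, count: int) -> int:
    """Combine stored ctime and stored count into one 64-bit value."""
    if not 0 <= count <= COUNT_MASK:
        # Assumption: overflow is treated as an error in this sketch.
        raise ValueError("more changes in one second than the counter can hold")
    return ((ctime_sec & CTIME_MASK) << COUNT_BITS) | count


if __name__ == "__main__":
    ctime, count = 1491350400, 0             # an arbitrary second in April 2017
    print(visible_version(ctime, count))
    ctime, count = record_change(ctime, count, 1491350400)  # same second
    print(visible_version(ctime, count))
    ctime, count = record_change(ctime, count, 1491350401)  # next second
    print(visible_version(ctime, count))
```

Two changes within the same second differ only in the low bits, and any
change in a later second compares greater than every value that could
have been reported for an earlier second.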