From: Jan Kara <firstname.lastname@example.org>
To: NeilBrown <email@example.com>
Cc: "J. Bruce Fields" <firstname.lastname@example.org>,
	Jeff Layton <email@example.com>,
	Jan Kara <firstname.lastname@example.org>,
	Christoph Hellwig <email@example.com>,
	firstname.lastname@example.org, email@example.com,
	firstname.lastname@example.org, email@example.com,
	firstname.lastname@example.org, email@example.com
Subject: Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
Date: Wed, 5 Apr 2017 10:05:51 +0200
Message-ID: <20170405080551.GC8899@quack2.suse.cz>
In-Reply-To: <firstname.lastname@example.org>

On Wed 05-04-17 11:43:32, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
>
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> >> > > > i_version numbers we gave away but did not write to disk but still...).
> >> > > > Thoughts?
> >> >
> >> > How hard is this for filesystems to support? Do they need an on-disk
> >> > format change to keep track of the crash counter? Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> >
> >>
> >> Yeah, I imagine we'd need a on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead. So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
>
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not ever
> change-counter landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

Yes, that was my thinking.

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields. Maybe there are enough bits.

Furthermore, you'd have a potential problem: you would need to change
i_version on disk just because you are reading after a crash, and such
changes tend to be problematic (think of read-only mounts and stuff like
that).

> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb. Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
>
> If you want to change a file with an even i_version, you subtract
> crash-count*large-number
> to the i_version, then set lsb. This is written to stable storage before
> the change.
>
> If a file has not been changed for a while, you can add
> crash-count*large-number
> and clear lsb.
>
> The lsb of the i_version would be for internal use only. It would not
> be visible outside the filesystem.
>
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before an changes, and the
> unknown is when does it make sense to switch to 'even'.

Well, there is also a problem that you would need to somehow remember
with which 'crash count' the i_version has been previously reported, as
that is not stored on disk with my scheme. So I don't think we can easily
use your scheme.

So the options we have are:

1) Keep i_version as is, make clients also check for i_ctime.
   Pro: No on-disk format changes.
   Cons: After a crash, i_version can go backwards (but when the file
   changes, the i_version, i_ctime pair should still be different) or
   not, data can be old or not.

2) Fsync when reporting i_version.
   Pro: No on-disk format changes, strong consistency of i_version and
   data.
   Cons: Difficult to implement for filesystems due to locking
   constraints. High performance overhead of i_version reporting.

3) Some variant of crash counter.
   Pro: i_version cannot go backwards.
   Cons: Requires on-disk format changes. After a crash data can be old
   (however i_version increased).

								Honza
-- 
Jan Kara <email@example.com>
SUSE Labs, CR
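
As an illustration of option 3, the sum quoted earlier in the thread
("superblock crash counter" * 65536 + "inode i_version") could be modelled
roughly as below. This is only a toy sketch in Python, not kernel code:
ToySuperblock, ToyInode, reported_version and CRASH_STRIDE are names made
up for this example, and the model assumes that fewer than 65536 increments
are handed out between two writes of an inode to disk.

```python
# Toy model of the "crash counter" idea; every name here is hypothetical.
CRASH_STRIDE = 65536          # the "large number" from the thread
U64_MASK = (1 << 64) - 1      # emulate a 64-bit counter


class ToySuperblock:
    def __init__(self) -> None:
        self.crash_counter = 0

    def recover_after_unclean_shutdown(self) -> None:
        # Bumped once each time an unclean shutdown is detected.
        self.crash_counter += 1


class ToyInode:
    def __init__(self) -> None:
        self.mem_version = 0   # in-memory i_version
        self.disk_version = 0  # last i_version that reached stable storage

    def change(self) -> None:
        self.mem_version += 1

    def writeback(self) -> None:
        self.disk_version = self.mem_version

    def crash(self) -> None:
        # Increments that never reached disk are lost.
        self.mem_version = self.disk_version


def reported_version(sb: ToySuperblock, inode: ToyInode) -> int:
    """Value shown to clients: crash_counter * 65536 + i_version."""
    return (sb.crash_counter * CRASH_STRIDE + inode.mem_version) & U64_MASK


if __name__ == "__main__":
    sb, ino = ToySuperblock(), ToyInode()
    for _ in range(3):
        ino.change()
    ino.writeback()            # version 3 is on disk
    for _ in range(2):
        ino.change()           # versions 4 and 5 exist only in memory
    before = reported_version(sb, ino)
    ino.crash()
    sb.recover_after_unclean_shutdown()
    after = reported_version(sb, ino)
    print(before, after)       # prints "5 65539"
```

In this model the on-disk i_version did go backwards (from 5 to 3), but the
reported value still moved forward, which is the "Pro" listed for option 3.
The data itself may of course still be old after the crash.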