From: Jan Kara <jack@suse.cz>
To: NeilBrown <neil@brown.name>
Cc: "J. Bruce Fields" <bfields@fieldses.org>,
	Jeff Layton <jlayton@redhat.com>, Jan Kara <jack@suse.cz>,
	Christoph Hellwig <hch@infradead.org>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-nfs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-btrfs@vger.kernel.org, linux-xfs@vger.kernel.org
Subject: Re: [RFC PATCH v1 00/30] fs: inode->i_version rework and optimization
Date: Wed, 5 Apr 2017 10:05:51 +0200
Message-ID: <20170405080551.GC8899@quack2.suse.cz>
In-Reply-To: <878tnfiq7v.fsf@notabene.neil.brown.name>

On Wed 05-04-17 11:43:32, NeilBrown wrote:
> On Tue, Apr 04 2017, J. Bruce Fields wrote:
> 
> > On Thu, Mar 30, 2017 at 02:35:32PM -0400, Jeff Layton wrote:
> >> On Thu, 2017-03-30 at 12:12 -0400, J. Bruce Fields wrote:
> >> > On Thu, Mar 30, 2017 at 07:11:48AM -0400, Jeff Layton wrote:
> >> > > On Thu, 2017-03-30 at 08:47 +0200, Jan Kara wrote:
> >> > > > Because if above is acceptable we could make reported i_version to be a sum
> >> > > > of "superblock crash counter" and "inode i_version". We increment
> >> > > > "superblock crash counter" whenever we detect unclean filesystem shutdown.
> >> > > > That way after a crash we are guaranteed each inode will report new
> >> > > > i_version (the sum would probably have to look like "superblock crash
> >> > > > counter" * 65536 + "inode i_version" so that we avoid reusing possible
> >> > > > i_version numbers we gave away but did not write to disk but still...).
> >> > > > Thoughts?
> >> > 
> >> > How hard is this for filesystems to support?  Do they need an on-disk
> >> > format change to keep track of the crash counter?  Maybe not, maybe the
> >> > high bits of the i_version counters are all they need.
> >> > 
> >> 
> >> Yeah, I imagine we'd need an on-disk change for this unless there's
> >> something already present that we could use in place of a crash counter.
> >
> > We could consider using the current time instead.  So, put the current
> > time (or time of last boot, or this inode's ctime, or something) in the
> > high bits of the change attribute, and keep the low bits as a counter.
> 
> This is a very different proposal.
> I don't think Jan was suggesting that the i_version be split into two
> bit fields, one the change-counter and one the crash-counter.
> Rather, the crash-counter was multiplied by a large-number and added to
> the change-counter with the expectation that while not every
> change-counter value landed on disk, at least 1 in every large-number would.
> So after each crash we effectively add large-number to the
> change-counter, and can be sure that number hasn't been used already.

Yes, that was my thinking.
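
A minimal sketch of that arithmetic, in Python for illustration only (this
is not kernel code). The names CRASH_STRIDE and reported_i_version are
hypothetical, and the sketch assumes that fewer than 65536 increments are
ever handed out without reaching disk:

```python
# Hypothetical illustration of "crash counter * 65536 + inode i_version".
CRASH_STRIDE = 65536  # the "large number"; assumed to exceed any unwritten run


def reported_i_version(crash_counter: int, disk_i_version: int) -> int:
    """Value handed to clients: the on-disk counter offset by the crash count."""
    return crash_counter * CRASH_STRIDE + disk_i_version


# Before the crash: disk holds 100, but 105 was already reported from memory.
before_crash = reported_i_version(crash_counter=0, disk_i_version=105)

# After the crash: the inode comes back as 100 and the crash counter is bumped.
after_crash = reported_i_version(crash_counter=1, disk_i_version=100)

# The reported value still moves forward despite the lost increments.
assert after_crash > before_crash
```

Because the offset is added only when the value is reported, nothing in
the inode itself has to be rewritten after a crash.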

> To store the crash-counter in each inode (which does appeal) you would
> need to be able to remove it before adding the new crash counter, and
> that requires bit-fields.  Maybe there are enough bits.

Furthermore, you'd have a potential problem: you would need to change
i_version on disk just because you are reading after a crash, and such
changes tend to be problematic (think of read-only mounts and the like).
 
> If you want to ensure read-only files can remain cached over a crash,
> then you would have to mark a file in some way on stable storage
> *before* allowing any change.
> e.g. you could use the lsb.  Odd i_versions might have been changed
> recently and crash-count*large-number needs to be added.
> Even i_versions have not been changed recently and nothing need be
> added.
> 
> If you want to change a file with an even i_version, you subtract
>   crash-count*large-number
> from the i_version, then set lsb.  This is written to stable storage before
> the change.
> 
> If a file has not been changed for a while, you can add
>   crash-count*large-number
> and clear lsb.
> 
> The lsb of the i_version would be for internal use only.  It would not
> be visible outside the filesystem.
> 
> It feels a bit clunky, but I think it would work and is the best
> combination of Jan's idea and your requirement.
> The biggest cost would be switching to 'odd' before any changes, and the
> unknown is when does it make sense to switch to 'even'.

Well, there is also the problem that you would need to somehow remember
which 'crash count' the i_version was previously reported with, as that is
not stored on disk with my scheme. So I don't think we can easily use your
scheme.

So the options we have are:

1) Keep i_version as is, make clients also check for i_ctime.
   Pro: No on-disk format changes.
   Cons: After a crash, i_version may or may not go backwards (but when the
   file changes, the (i_version, i_ctime) pair should still be different),
   and data may or may not be old.

2) Fsync when reporting i_version.
   Pro: No on-disk format changes, strong consistency of i_version and
        data.
   Cons: Difficult to implement for filesystems due to locking constraints.
         High performance overhead of i_version reporting.

3) Some variant of crash counter.
   Pro: i_version cannot go backwards.
   Cons: Requires on-disk format changes. After a crash data can be old
         (however i_version increased).
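
To make option 1 concrete, here is a small sketch of a client-side check
that compares the pair rather than i_version alone. It is Python for
illustration only; ChangeAttr and cache_still_valid are hypothetical names,
not an existing NFS client API, and the timestamps are made-up values:

```python
from typing import NamedTuple


class ChangeAttr(NamedTuple):
    """Hypothetical pair a client would remember for a cached file."""
    i_version: int
    i_ctime_ns: int


def cache_still_valid(cached: ChangeAttr, current: ChangeAttr) -> bool:
    # The cache is trusted only if both fields are unchanged.
    return cached == current


# Client cached the file when the server reported version 105.
seen_before_crash = ChangeAttr(i_version=105, i_ctime_ns=1_000)

# Server crashed, i_version fell back to 100 on disk, and later writes
# brought it up to 105 again; ctime moved on in the meantime.
after_crash_and_rewrite = ChangeAttr(i_version=105, i_ctime_ns=2_000)

# i_version alone would wrongly match; the pair does not.
assert not cache_still_valid(seen_before_crash, after_crash_and_rewrite)
```

This only detects the reuse; it does nothing about i_version going
backwards, which is the con listed for option 1.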

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
