linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Ira Weiny <ira.weiny@intel.com>, Christoph Hellwig <hch@lst.de>,
	Dave Chinner <david@fromorbit.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Dan Williams <dan.j.williams@intel.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	linux-xfs <linux-xfs@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5
Date: Fri, 3 Apr 2020 11:29:04 -0700	[thread overview]
Message-ID: <20200403182904.GP80283@magnolia> (raw)
In-Reply-To: <20200403170338.GD29920@quack2.suse.cz>

On Fri, Apr 03, 2020 at 07:03:38PM +0200, Jan Kara wrote:
> On Fri 03-04-20 08:48:29, Ira Weiny wrote:
> > On Fri, Apr 03, 2020 at 09:27:31AM +0200, Christoph Hellwig wrote:
> > > On Thu, Apr 02, 2020 at 01:55:19PM -0700, Ira Weiny wrote:
> > > > > I'd just return an error for that case, don't play silly games like
> > > > > evicting the inode.
> > > > 
> > > > I think I agree with Christoph here.  But I want to clarify.  I was heading in
> > > > a direction of failing the ioctl completely.  But we could have the flag change
> > > > with an appropriate error which could let the user know the change has been
> > > > delayed.
> > > > 
> > > > But I don't immediately see what error code is appropriate for such an
> > > > indication.  Candidates I can envision:
> > > > 
> > > > EAGAIN
> > > > ERESTART
> > > > EUSERS
> > > > EINPROGRESS
> > > > 
> > > > None are perfect but I'm leaning toward EINPROGRESS.
> > > 
> > > I really, really dislike that idea.  The whole point of not forcing
> > > evictions is to make it clear - no this inode is "busy" you can't
> > > do that.  A reasonably smart application can try to evict itself.
> > 
> > I don't understand.  What Darrick proposed would never need any
> > evictions.  If the file has blocks allocated the FS_XFLAG_DAX flag can
> > not be changed.  So I don't see what good eviction would do at all.
> 
> I guess there's some confusion here (may well be than on my side). Darrick
> propose that we can switch FS_XFLAG_DAX only when file has no blocks
> allocated - fine by me. But that still does not mean than we can switch
> S_DAX immediately, does it? Because that would still mean we need to switch
> aops on living inode and that's ... difficult and Christoph didn't want to
> clutter the code with it.

IIRC, the reason Ira was trying to introduce this file operations lock
is because there isn't any other safe way to change the operations
dynamically, because some of those operations don't start taking any
locks at all until after we've accessed the file operations pointer.

Now that I think about it some more, that's also means we can't change
S_DAX even on files without blocks allocated.  Look at fallocate, it
doesn't even take i_rwsem until we've called into (say)
xfs_file_fallocate.

Ok, so with that in mind, I think we simply have to say that
FS_XFLAG_DAX is 100% advisory, and S_DAX will magically switch some time
in the future or after the next umount/mount cycle.

FSSETXATTR can't evict an inode it has just set FS_XFLAG_DAX on, because
it applies that change to the fd passed into ioctl(), which means that the
caller has a reference to the fd -> file -> inode, which means the inode
is unevictable until after the call completes.

IOWs, 

> So I've understood Darrick's proposal as: Just switch FS_XFLAG_DAX flag,
> S_DAX flag will magically switch when inode gets evicted and the inode gets
> reloaded from the disk again. Did I misunderstand anything?

That was /my/ understanding. :)

> And my thinking was that this is surprising behavior for the user and so it
> will likely generate lots of bug reports along the lines of "DAX inode flag
> does not work!". So I was pondering how to make the behavior less
> confusing... The ioctl I've suggested was just a poor attempt at that.

At best, userspace uses FSSETXATTR to change FS_XFLAG_DAX, and then
calls some as-yet-undefined ioctl to try to evict the inode from memory.
I'm not sure how you'd actually do that, though, considering that you'd
have to close the fd and as soon as that happens the inode can disappear
permanently.

Ok, that's not sane.  Forget I ever wrote that.

> > > But returning an error and doing a lazy change anyway is straight from
> > > the playbook for arcane and confusing API designs.
> > 
> > Jan countered with a proposal that the FS_XFLAG_DAX does change with
> > blocks allocated.  But that S_DAX would change on eviction.  Adding that
> > some eviction ioctl could be added.
> 
> No, I didn't mean that we can change FS_XFLAG_DAX with blocks allocated. I
> was still speaking about the case without blocks allocated.
> 
> > You then proposed just returning an error for that case.  (This lead me to
> > believe that you were ok with an eviction based change of S_DAX.)
> > 
> > So I agreed that changing S_DAX could be delayed until an explicit eviction.
> > But, to aid the 'smart application', a different error code could be used to
> > indicate that the FS_XFLAG_DAX had been changed but that until that explicit
> > eviction occurs S_DAX would remain.

There's no point in returning a magic error code from FSSETXATTR, as I
realized above.  To restate: FSSETXATTR can't evict the inode, so it
would always have to return the magic error code.  The best we can do is
tell userspace that they can set the advisory FS_XFLAG_DAX flag and then
check STATX_ATTR_DAX immediately afterwards.  If some day in the future
we get smarter and can change it immediately, the statx output will
reflect that.

> > So I don't fully follow what you mean by 'lazy change'?
> > 
> > Do you still really, really dislike an explicit eviction method for changing
> > the S_DAX flag?
> > 
> > If FS_XFLAG_DAX can never be changed on a file with blocks allocated and the
> > user wants to change the mode of operations on their 'data'; they would have to
> > create a new file with the proper setting and move the data there.  For example
> > copy the file into a directory marked FS_XFLAG_DAX==true?
> > 
> > I'm ok with either interface as I think both could be clear if documented.
> 
> I agree that what Darrick suggested is technically easily doable and can be
> documented. But it is not natural behavior (i.e., different than all inode
> flags we have) and we know how careful people are when reading
> documentation...

To reflect all that I've rambled in this thread, I withdraw the previous
paragraph and submit this one for consideration:

 - There exists an advisory file inode flag FS_XFLAG_DAX.

   If FS_XFLAG_DAX is set and the fs is on pmem then it will always
   enable S_DAX at inode load time; if FS_XFLAG_DAX is not set, it will
   never enable S_DAX.  The advice can be overridden by mount option.

   Changing this flag does not necessarily change the S_DAX state
   immediately but programs can query the S_DAX state via statx to
   detect when the new advice has gone into effect.

Consider this from the perspective of minimizing changes to userspace
programs between now and the future.  If your program really wants
S_DAX, you can write:

	fd = open(...);

	ioctl(fd, FSGETXATTR, &fsx);

	if (fsx.xflags & FS_XFLAG_DAX)
		return 0;

	fsx.xflags |= FS_XFLAG_DAX;
	ioctl(fd, FSSETXATTR, &fsx);

	do {
		statx(fd, STATX_GIVE_ME_EVERYTHING, &statx);
		if (statx.attrs & STATX_ATTR_DAX)
			return 0;

		/* do stupid magic to evict things */
	} while (we haven't gotten bored and wandered away);

This code snippet will work with the current limitations of the kernel,
and it'll continue working even on Linux v5000 where we finally figure
out how to change S_DAX on the fly.

--D

> 
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

  parent reply	other threads:[~2020-04-03 18:29 UTC|newest]

Thread overview: 45+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-27  5:24 [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 ira.weiny
2020-02-27  5:24 ` [PATCH V5 01/12] fs/xfs: Remove unnecessary initialization of i_rwsem ira.weiny
2020-02-27 17:25   ` Ira Weiny
2020-02-27  5:24 ` [PATCH V5 02/12] fs: Remove unneeded IS_DAX() check ira.weiny
2020-02-27  5:24 ` [PATCH V5 03/12] fs/stat: Define DAX statx attribute ira.weiny
2020-02-27  5:24 ` [PATCH V5 04/12] fs/xfs: Isolate the physical DAX flag from enabled ira.weiny
2020-02-27  5:24 ` [PATCH V5 05/12] fs/xfs: Create function xfs_inode_enable_dax() ira.weiny
2020-03-01 22:37   ` Dave Chinner
2020-02-27  5:24 ` [PATCH V5 06/12] fs: Add locking for a dynamic address space operations state ira.weiny
2020-03-02  1:26   ` Dave Chinner
2020-03-02  1:36     ` Dave Chinner
2020-02-27  5:24 ` [PATCH V5 07/12] fs: Prevent DAX state change if file is mmap'ed ira.weiny
2020-02-27  5:24 ` [PATCH V5 08/12] fs/xfs: Hold off aops users while changing DAX state ira.weiny
2020-02-27  5:24 ` [PATCH V5 09/12] fs/xfs: Clean up locking in dax invalidate ira.weiny
2020-02-27  5:24 ` [PATCH V5 10/12] fs/xfs: Allow toggle of effective DAX flag ira.weiny
2020-02-27  5:24 ` [PATCH V5 11/12] fs/xfs: Remove xfs_diflags_to_linux() ira.weiny
2020-02-27  5:24 ` [PATCH V5 12/12] Documentation/dax: Update Usage section ira.weiny
2020-03-05 15:51 ` [PATCH V5 00/12] Enable per-file/per-directory DAX operations V5 Christoph Hellwig
2020-03-09 17:04   ` Ira Weiny
2020-03-11  3:36     ` Darrick J. Wong
2020-03-11  6:29       ` Christoph Hellwig
2020-03-11 17:07         ` Dan Williams
2020-03-16  9:52           ` Jan Kara
2020-03-16  9:55             ` Christoph Hellwig
2020-04-01  4:00               ` Darrick J. Wong
2020-04-01 10:25                 ` Jan Kara
2020-04-02  8:53                   ` Christoph Hellwig
2020-04-02 20:55                     ` Ira Weiny
2020-04-03  7:27                       ` Christoph Hellwig
2020-04-03 15:48                         ` Ira Weiny
2020-04-03 17:03                           ` Jan Kara
2020-04-03 18:18                             ` Ira Weiny
2020-04-03 18:21                               ` Ira Weiny
2020-04-03 18:37                               ` Darrick J. Wong
2020-04-05  6:19                                 ` Ira Weiny
2020-04-06 10:00                               ` Jan Kara
2020-04-03 18:29                             ` Darrick J. Wong [this message]
2020-04-03 16:05                         ` Darrick J. Wong
2020-04-03  4:39                 ` Ira Weiny
2020-03-11  6:39       ` Dave Chinner
2020-03-11  6:44         ` Christoph Hellwig
2020-03-11 17:07           ` Dan Williams
2020-03-12  0:49           ` Dave Chinner
2020-03-12  3:00             ` Darrick J. Wong
2020-03-12  7:27             ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200403182904.GP80283@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=dan.j.williams@intel.com \
    --cc=david@fromorbit.com \
    --cc=hch@lst.de \
    --cc=ira.weiny@intel.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).