From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	ceph-devel@vger.kernel.org, Chao Yu <yuchao0@huawei.com>,
	Damien Le Moal <damien.lemoal@wdc.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jaegeuk Kim <jaegeuk@kernel.org>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Thumshirn <jth@kernel.org>,
	linux-cifs@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-f2fs-devel@lists.sourceforge.net, linux-mm@kvack.org,
	linux-xfs@vger.kernel.org, Miklos Szeredi <miklos@szeredi.hu>,
	Steve French <sfrench@samba.org>, Ted Tso <tytso@mit.edu>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [PATCH 03/11] mm: Protect operations adding pages to page cache with invalidate_lock
Date: Wed, 19 May 2021 08:36:37 +1000
Message-ID: <20210518223637.GJ2893@dread.disaster.area>
In-Reply-To: <20210514161730.GL9675@magnolia>

On Fri, May 14, 2021 at 09:17:30AM -0700, Darrick J. Wong wrote:
> On Fri, May 14, 2021 at 09:19:45AM +1000, Dave Chinner wrote:
> > On Thu, May 13, 2021 at 11:52:52AM -0700, Darrick J. Wong wrote:
> > > On Thu, May 13, 2021 at 07:44:59PM +0200, Jan Kara wrote:
> > > > On Wed 12-05-21 08:23:45, Darrick J. Wong wrote:
> > > > > On Wed, May 12, 2021 at 03:46:11PM +0200, Jan Kara wrote:
> > > > > > +->fallocate implementation must be really careful to maintain page cache
> > > > > > +consistency when punching holes or performing other operations that invalidate
> > > > > > +page cache contents. Usually the filesystem needs to call
> > > > > > +truncate_inode_pages_range() to invalidate the relevant range of the page cache.
> > > > > > +However the filesystem usually also needs to update its internal (and on disk)
> > > > > > +view of file offset -> disk block mapping. Until this update is finished, the
> > > > > > +filesystem needs to block page faults and reads from reloading now-stale page
> > > > > > +cache contents from the disk. VFS provides mapping->invalidate_lock for this
> > > > > > +and acquires it in shared mode in paths loading pages from disk
> > > > > > +(filemap_fault(), filemap_read(), readahead paths). The filesystem is
> > > > > > +responsible for taking this lock in its fallocate implementation and generally
> > > > > > +whenever the page cache contents need to be invalidated because a block is
> > > > > > +moving from under a page.
> > > > > > +
> > > > > > +->copy_file_range and ->remap_file_range implementations need to serialize
> > > > > > +against modifications of file data while the operation is running. For blocking
> > > > > > +changes through write(2) and similar operations, inode->i_rwsem can be used. For
> > > > > > +blocking changes through memory mapping, the filesystem can use
> > > > > > +mapping->invalidate_lock provided it also acquires it in its ->page_mkwrite
> > > > > > +implementation.
> > > > > 
> > > > > Question: What is the locking order when acquiring the invalidate_lock
> > > > > of two different files?  Is it the same as i_rwsem (increasing order of
> > > > > the struct inode pointer) or is it the same as the XFS MMAPLOCK that is
> > > > > being hoisted here (increasing order of i_ino)?
> > > > > 
> > > > > The reason I ask is that remap_file_range has to do that, but I don't
> > > > > see any conversions for the xfs_lock_two_inodes(..., MMAPLOCK_EXCL)
> > > > > calls in xfs_ilock2_io_mmap in this series.
> > > > 
> > > > Good question. Technically, I don't think there's a real need to establish a
> > > > single ordering because locks among different filesystems are never going
> > > > to be acquired together (effectively each lock type is local per sb and we
> > > > are free to define an ordering for each lock type differently). But to
> > > > maintain some sanity I guess having the same locking order for doublelock
> > > > of i_rwsem and invalidate_lock makes sense. Is there a reason why XFS uses
> > > > by-ino ordering? So that we don't have to consider two different orders in
> > > > xfs_lock_two_inodes()...
> > > 
> > > I imagine Dave will chime in on this, but I suspect the reason is
> > > hysterical raisins^Wreasons.
> > 
> > It's the locking rules that XFS has used pretty much forever.
> > Locking by inode number always guarantees the same locking order of
> > two inodes in the same filesystem, regardless of the specific
> > in-memory instances of the two inodes.
> > 
> > e.g. if we lock based on the inode structure address, in one
> > instance, we could get A -> B, then B gets recycled and
> > reallocated, then we get B -> A as the locking order for the same
> > two inodes.
> > 
> > That, IMNSHO, is utterly crazy because with non-deterministic inode
> > lock ordering like this, you can't make consistent locking rules for
> > locking the physical inode cluster buffers underlying the inodes in
> > the situation where they also need to be locked.
> 
> <nod> That's protected by the ILOCK, correct?
> 
> > We've been down this path before more than a decade ago when the
> > powers that be decreed that inode locking order is to be "by
> > structure address" rather than inode number, because "inode number
> > is not unique across multiple superblocks".
> > 
> > I'm not sure that there is anywhere that locks multiple inodes
> > across different superblocks, but here we are again....
> 
> Hm.  Are there situations where one would want to lock multiple
> /mappings/ across different superblocks?  The remapping code doesn't
> allow cross-super operations, so ... pipes and splice, maybe?  I don't
> remember that code well enough to say for sure.

Hmmmm. Doing read IO into a buffer that is mmap()d from another
file, and we take a page fault on it inside the read IO path? We're
copying from a page in one mapping and taking a fault in another
mapping and hence taking the invalidate_lock to populate the page
cache for the second mapping...
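
To make that concrete, the trigger is trivial to construct from
userspace. A minimal sketch, with made-up paths and no error
handling:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int src = open("/mnt/a/src", O_RDONLY);	/* file on fs A */
	int dst = open("/mnt/b/dst", O_RDWR);	/* file on fs B */
	size_t len = 1024 * 1024;

	/* Writable shared mapping of the second file, not prefaulted. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, dst, 0);

	/*
	 * Copying into buf faults on dst's mapping from inside src's
	 * read path, so if the read path held src's invalidate_lock
	 * across the user copy, dst's invalidate_lock would be taken
	 * nested inside it.
	 */
	read(src, buf, len);
	return 0;
}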

I haven't looked closely enough at where the invalidate_lock is held
in the read path to determine if this is an issue, but if it is then
it is also a potential deadlock scenario...
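
FWIW, if we do end up needing to doublelock two inodes'
invalidate_lock, a by-inode-number convention matching the XFS
MMAPLOCK rules would look roughly like this. A sketch only, with a
made-up helper name, not the actual xfs_lock_two_inodes() code:

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/lockdep.h>

/*
 * Sketch only: take two inodes' invalidate_lock exclusively in
 * ascending inode number order, so that the locking order for a
 * given pair of inodes is stable regardless of which in-memory
 * inode instances we happen to have.
 */
static void invalidate_lock_two(struct inode *a, struct inode *b)
{
	if (a->i_ino > b->i_ino)
		swap(a, b);
	down_write(&a->i_mapping->invalidate_lock);
	if (a != b)
		down_write_nested(&b->i_mapping->invalidate_lock,
				  SINGLE_DEPTH_NESTING);
}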

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
