Linux-ext4 Archive on lore.kernel.org
 help / color / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-xfs@vger.kernel.org, Ted Tso <tytso@mit.edu>,
	Christoph Hellwig <hch@infradead.org>,
	Amir Goldstein <amir73il@gmail.com>
Subject: Re: [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock
Date: Wed, 14 Apr 2021 10:01:13 +1000
Message-ID: <20210414000113.GG63242@dread.disaster.area> (raw)
In-Reply-To: <20210413112859.32249-2-jack@suse.cz>

On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote:
> Currently, serializing operations such as page fault, read, or readahead
> against hole punching is rather difficult. The basic race scheme is
> like:
> 
> fallocate(FALLOC_FL_PUNCH_HOLE)			read / fault / ..
>   truncate_inode_pages_range()
> 						  <create pages in page
> 						   cache here>
>   <update fs block mapping and free blocks>
> 
> Now the problem is in this way read / page fault / readahead can
> instantiate pages in page cache with potentially stale data (if blocks
> get quickly reused). Avoiding this race is not simple - page locks do
> not work because we want to make sure there are *no* pages in given
> range. inode->i_rwsem does not work because page fault happens under
> mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes
> the performance for mixed read-write workloads suffer.
> 
> So create a new rw_semaphore in the inode - i_mapping_sem - that
> protects adding of pages to page cache for page faults / reads /
> readahead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
....
> diff --git a/mm/filemap.c b/mm/filemap.c
> index bd7c50e060a9..bc82a7856d3e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -77,7 +77,8 @@
>   *        ->i_pages lock
>   *
>   *  ->i_rwsem
> - *    ->i_mmap_rwsem		(truncate->unmap_mapping_range)
> + *    ->i_mapping_sem		(acquired by fs in truncate path)
> + *      ->i_mmap_rwsem		(truncate->unmap_mapping_range)

This is now officially a confusing mess.

We have inode->i_mapping that points to an address space, which has
an internal i_mmap variable and an i_mmap_rwsem that serialises
access to the i_mmap tree.

Now we have a inode->i_mapping_sem (which is actually a rwsem, not a
sem) that sorta serialises additions to the mapping tree
(inode->i_mapping->i_pages) against truncate, but it does not
serialise accesses to the rest of the inode->i_mapping structure
itself despite the similarlities in naming.

Then we have the inode_lock() API and the i_mmap_lock() API that
wrap around the i_rwsem and i_mmap_rwsem, but there's no API for
inode_mapping_lock()...

THen we have the mmap_lock in the page fault path as well, which is
also an rwsem despite the name, and that protects something
completely different to the i_mmap and the i_mapping.

IOWs, we have 4 layers of entwined structures and locking here
that pretty much all have the same name but protect different things
and not always the obvious thing the name suggests. This makes it
really difficult to actually read the code and understand that the
correct lock is being used in the correct place...


>   *
>   *  ->mmap_lock
>   *    ->i_mmap_rwsem
> @@ -85,7 +86,8 @@
>   *        ->i_pages lock	(arch-dependent flush_dcache_mmap_lock)
>   *
>   *  ->mmap_lock
> - *    ->lock_page		(access_process_vm)
> + *    ->i_mapping_sem		(filemap_fault)
> + *      ->lock_page		(filemap_fault, access_process_vm)
>   *
>   *  ->i_rwsem			(generic_perform_write)
>   *    ->mmap_lock		(fault_in_pages_readable->do_page_fault)
> @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb,
>  {
>  	int error;
>  
> +	if (iocb->ki_flags & IOCB_NOWAIT) {
> +		if (!down_read_trylock(&mapping->host->i_mapping_sem))
> +			return -EAGAIN;
> +	} else {
> +		down_read(&mapping->host->i_mapping_sem);
> +	}

We really need a lock primitive for this. The number of times this
exact lock pattern is being replicated all through the IO path is
getting out of hand.

static inline bool
down_read_try_or_lock(struct rwsem *sem, bool try)
{
	if (try) {
		if (!down_read_trylock(sem))
			return false;
	} else {
		down_read(&mapping->host->i_mapping_sem);
	}
	return true;
}

and the callers become:

	if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT)))
		return -EAGAIN;

We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc
and we don't need to rely on cargo cult knowledge to propagate this
pattern anymore. Because I'm betting relatively few people actually
know why the code is written this way because the only place it is
documented is in an XFS commit message....

Doing this is a separate cleanup, though, and not something that
needs to be done in this patchset.

> index c5b0457415be..ac5bb50b3a4c 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	 */
>  	unsigned int nofs = memalloc_nofs_save();
>  
> +	down_read(&mapping->host->i_mapping_sem);
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */

I can't say I'm a great fan of having the mapping reach back up to
the host to lock the host. THis seems the wrong way around to me
given that most of the locking in the IO path is in "host locks
mapping" and "mapping locks internal mapping structures" order...

I also come back to the naming confusion here, in that when we look
at this in long hand from the inode perspective, this chain actually
looks like:

	lock(inode->i_mapping->inode->i_mapping_sem)

i.e. the mapping is reaching back up outside it's scope to lock
itself against other inode->i_mapping operations. Smells of layering
violations to me.

So, next question: should this truncate semanphore actually be part
of the address space, not the inode? This patch is actually moving
the page fault serialisation from the inode into the address space
operations when page faults and page cache operations are done, so
maybe the lock should also make that move? That would help clear up
the naming problem, because now we can name it based around what it
serialises in the address space, not the address space as a whole...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply index

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-13 11:28 [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races Jan Kara
2021-04-13 11:28 ` [PATCH 1/7] mm: Fix comments mentioning i_mutex Jan Kara
2021-04-13 12:38   ` Christoph Hellwig
2021-04-13 11:28 ` [PATCH 2/7] mm: Protect operations adding pages to page cache with i_mapping_lock Jan Kara
2021-04-13 12:57   ` Christoph Hellwig
2021-04-13 13:56     ` Jan Kara
2021-04-14  0:01   ` Dave Chinner [this message]
2021-04-14 12:23     ` Jan Kara
2021-04-14 21:57       ` Dave Chinner
2021-04-15 13:11         ` Jan Kara
2021-04-14 22:25     ` Matthew Wilcox
2021-04-15  2:05       ` Dave Chinner
2021-04-13 11:28 ` [PATCH 3/7] ext4: Convert to use inode->i_mapping_sem Jan Kara
2021-04-13 11:28 ` [PATCH 4/7] ext2: Convert to using i_mapping_sem Jan Kara
2021-04-13 11:28 ` [PATCH 5/7] xfs: Convert to use i_mapping_sem Jan Kara
2021-04-13 13:05   ` Christoph Hellwig
2021-04-13 13:42     ` Jan Kara
2021-04-13 11:28 ` [PATCH 6/7] zonefs: Convert to using i_mapping_sem Jan Kara
2021-04-13 11:28 ` [PATCH 7/7] fs: Remove i_mapping_sem protection from .page_mkwrite handlers Jan Kara
2021-04-13 13:09 ` [PATCH 0/7 RFC v3] fs: Hole punch vs page cache filling races Christoph Hellwig
2021-04-13 14:17   ` Jan Kara
2021-04-19 15:20 ` Matthew Wilcox
2021-04-19 16:25   ` Jan Kara
2021-04-20 22:12   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210414000113.GG63242@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=amir73il@gmail.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-ext4 Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-ext4/0 linux-ext4/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-ext4 linux-ext4/ https://lore.kernel.org/linux-ext4 \
		linux-ext4@vger.kernel.org
	public-inbox-index linux-ext4

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-ext4


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git