Linux-BTRFS Archive on lore.kernel.org
 help / color / Atom feed
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: Andres Freund <andres@anarazel.de>,
	Dave Chinner <david@fromorbit.com>,
	David Sterba <dsterba@suse.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>,
	linux-fsdevel@vger.kernel.org, jack@suse.com, hch@infradead.org,
	linux-ext4@vger.kernel.org, linux-btrfs@vger.kernel.org
Subject: Re: Odd locking pattern introduced as part of "nowait aio support"
Date: Wed, 11 Sep 2019 16:01:14 +0530
Message-ID: <20190911103117.E32C34C044@d06av22.portsmouth.uk.ibm.com> (raw)
In-Reply-To: <20190911093926.pfkkx25mffzeuo32@alap3.anarazel.de>

Hi,

On 9/11/19 3:09 PM, Andres Freund wrote:
> Hi,
> 
> On 2019-09-11 14:04:20 +1000, Dave Chinner wrote:
>> On Tue, Sep 10, 2019 at 03:33:27PM -0700, Andres Freund wrote:
>>> Hi,
>>>
>>> Especially with buffered io it's fairly easy to hit contention on the
>>> inode lock, during writes. With something like io_uring, it's even
>>> easier, because it currently (but see [1]) farms out buffered writes to
>>> workers, which then can easily contend on the inode lock, even if only
>>> one process submits writes.  But I've seen it in plenty other cases too.
>>>
>>> Looking at the code I noticed that several parts of the "nowait aio
>>> support" (cf 728fbc0e10b7f3) series introduced code like:
>>>
>>> static ssize_t
>>> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> {
>>> ...
>>> 	if (!inode_trylock(inode)) {
>>> 		if (iocb->ki_flags & IOCB_NOWAIT)
>>> 			return -EAGAIN;
>>> 		inode_lock(inode);
>>> 	}
>>
>> The ext4 code is just buggy here - we don't support RWF_NOWAIT on
>> buffered write >
> But both buffered and non-buffered writes go through
> ext4_file_write_iter(). And there's a preceding check, at least these
> days, preventing IOCB_NOWAIT to apply to buffered writes:
> 
> 	if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> 		return -EOPNOTSUPP;
> 

-EOPNOTSUPP is now taken care in ext4 iomap patch series as well.


> 
> I do really wish buffered NOWAIT was supported... The overhead of having
> to do async buffered writes through the workqueue in io_uring, even if
> an already existing page is targeted, is quite noticable. Even if it
> failed with EAGAIN as soon as the buffered write's target isn't in the
> page cache, it'd be worthwhile.
> 
> 
>>> isn't trylocking and then locking in a blocking fashion an inefficient
>>> pattern? I.e. I think this should be
>>>
>>> 	if (iocb->ki_flags & IOCB_NOWAIT) {
>>> 		if (!inode_trylock(inode))
>>> 			return -EAGAIN;
>>> 	}
>>>          else
>>>          	inode_lock(inode);
>>
>> Yes, you are right.
>>
>> History: commit 91f9943e1c7b ("fs: support RWF_NOWAIT
>> for buffered reads") which introduced the first locking pattern
>> you describe in XFS.
> 
> Seems, as part of the nowait work, the pattern was introduced in ext4,
> xfs and btrfs. And fixed in xfs.
> 
> 
>> That was followed soon after by:
>>
>> commit 942491c9e6d631c012f3c4ea8e7777b0b02edeab
>> Author: Christoph Hellwig <hch@lst.de>
>> Date:   Mon Oct 23 18:31:50 2017 -0700
>>
>>      xfs: fix AIM7 regression
>>      
>>      Apparently our current rwsem code doesn't like doing the trylock, then
>>      lock for real scheme.  So change our read/write methods to just do the
>>      trylock for the RWF_NOWAIT case.  This fixes a ~25% regression in
>>      AIM7.
>>      
>>      Fixes: 91f9943e ("fs: support RWF_NOWAIT for buffered reads")
>>      Reported-by: kernel test robot <xiaolong.ye@intel.com>
>>      Signed-off-by: Christoph Hellwig <hch@lst.de>
>>      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>>      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
>>
>> Which changed all the trylock/eagain/lock patterns to the second
>> form you quote. None of the other filesystems had AIM7 regressions
>> reported against them, so nobody changed them....
> 
> :(
> 
> 
>>> Obviously this isn't going to improve scalability to a very significant
>>> degree. But not unnecessarily doing two atomic ops on a contended lock
>>> can't hurt scalability either. Also, the current code just seems
>>> confusing.
>>>
>>> Am I missing something?
>>
>> Just that the sort of performance regression testing that uncovers
>> this sort of thing isn't widely done, and most filesystems are
>> concurrency limited in some way before they hit inode lock
>> scalability issues. Hence filesystem concurrency foccussed
>> benchmarks that could uncover it (like aim7) won't because the inode
>> locks don't end up stressed enough to make a difference to
>> benchmark performance.
> 
> Ok.  Goldwyn, do you want to write a patch, or do you want me to write
> one up?

I am anyways looking into ext4 performance issue of mixed parallel DIO 
workload. This will require some new APIs for inode locking similar to
that of XFS.
In that I can take care of this symantics reported here by you (which is 
taken care by XFS in above patch) for ext4.


Thanks
-ritesh

> 
> 
> Greetings,
> 
> Andres Freund
> 


  parent reply index

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20190910223327.mnegfoggopwqqy33@alap3.anarazel.de>
     [not found] ` <20190911040420.GB27547@dread.disaster.area>
2019-09-11  9:39   ` Andres Freund
2019-09-11 10:19     ` Christoph Hellwig
2019-09-11 10:31     ` Ritesh Harjani [this message]
2019-09-11 10:55     ` Goldwyn Rodrigues
2019-09-11 16:45     ` Fix inode sem regression for nowait Goldwyn Rodrigues
2019-09-11 16:45       ` [PATCH 1/3] btrfs: fix inode rwsem regression Goldwyn Rodrigues
2019-09-11 17:21         ` David Sterba
2019-09-11 16:45       ` [PATCH 2/3] ext4: " Goldwyn Rodrigues
2019-09-12  8:52         ` Ritesh Harjani
2019-09-12  9:26           ` Matthew Bobrowski
2019-09-23 10:10         ` Jan Kara
2019-09-23 13:18           ` Theodore Y. Ts'o
2019-09-11 16:45       ` [PATCH 3/3] f2fs: " Goldwyn Rodrigues
2019-09-12  6:17         ` Chao Yu
2019-09-13 19:46         ` Jaegeuk Kim
2019-09-16  1:16           ` Chao Yu

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190911103117.E32C34C044@d06av22.portsmouth.uk.ibm.com \
    --to=riteshh@linux.ibm.com \
    --cc=andres@anarazel.de \
    --cc=david@fromorbit.com \
    --cc=dsterba@suse.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=rgoldwyn@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Linux-BTRFS Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-btrfs/0 linux-btrfs/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-btrfs linux-btrfs/ https://lore.kernel.org/linux-btrfs \
		linux-btrfs@vger.kernel.org linux-btrfs@archiver.kernel.org
	public-inbox-index linux-btrfs

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-btrfs


AGPL code for this site: git clone https://public-inbox.org/ public-inbox