Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"

From: Andreas Dilger <adilger@dilger.ca>
To: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>,
	Joseph Qi <joseph.qi@linux.alibaba.com>,
	"Theodore Y. Ts'o" <tytso@mit.edu>,
	Joseph Qi <jiangqi903@gmail.com>,
	Ext4 Developers List <linux-ext4@vger.kernel.org>,
	Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>,
	Liu Bo <bo.liu@linux.alibaba.com>
Subject: Re: [RFC] performance regression with "ext4: Allow parallel DIO reads"
Date: Mon, 26 Aug 2019 13:10:17 -0600	[thread overview]
Message-ID: <94515D9C-045C-46EA-9F3C-E13CB2DAA1F9@dilger.ca> (raw)
In-Reply-To: <20190826083958.GA10614@quack2.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 4722 bytes --]

On Aug 26, 2019, at 2:39 AM, Jan Kara <jack@suse.cz> wrote:
> 
> On Sat 24-08-19 12:18:40, Dave Chinner wrote:
>> On Fri, Aug 23, 2019 at 09:08:53PM +0800, Joseph Qi wrote:
>>> 
>>> 
>>> On 19/8/23 18:16, Dave Chinner wrote:
>>>> On Fri, Aug 23, 2019 at 03:57:02PM +0800, Joseph Qi wrote:
>>>>> Hi Dave,
>>>>> 
>>>>> On 19/8/22 13:40, Dave Chinner wrote:
>>>>>> On Wed, Aug 21, 2019 at 09:04:57AM +0800, Joseph Qi wrote:
>>>>>>> Hi Ted，
>>>>>>> 
>>>>>>> On 19/8/21 00:08, Theodore Y. Ts'o wrote:
>>>>>>>> On Tue, Aug 20, 2019 at 11:00:39AM +0800, Joseph Qi wrote:
>>>>>>>>> 
>>>>>>>>> I've tested parallel dio reads with dioread_nolock, it
>>>>>>>>> doesn't have significant performance improvement and still
>>>>>>>>> poor compared with reverting parallel dio reads. IMO, this
>>>>>>>>> is because with parallel dio reads, it take inode shared
>>>>>>>>> lock at the very beginning in ext4_direct_IO_read().
>>>>>>>> 
>>>>>>>> Why is that a problem?  It's a shared lock, so parallel
>>>>>>>> threads should be able to issue reads without getting
>>>>>>>> serialized?
>>>>>>>> 
>>>>>>> The above just tells the result that even mounting with
>>>>>>> dioread_nolock, parallel dio reads still has poor performance
>>>>>>> than before (w/o parallel dio reads).
>>>>>>> 
>>>>>>>> Are you using sufficiently fast storage devices that you're
>>>>>>>> worried about cache line bouncing of the shared lock?  Or do
>>>>>>>> you have some other concern, such as some other thread
>>>>>>>> taking an exclusive lock?
>>>>>>>> 
>>>>>>> The test case is random read/write described in my first
>>>>>>> mail. And
>>>>>> 
>>>>>> Regardless of dioread_nolock, ext4_direct_IO_read() is taking
>>>>>> inode_lock_shared() across the direct IO call.  And writes in
>>>>>> ext4 _always_ take the inode_lock() in ext4_file_write_iter(),
>>>>>> even though it gets dropped quite early when overwrite &&
>>>>>> dioread_nolock is set.  But just taking the lock exclusively
>>>>>> in write fro a short while is enough to kill all shared
>>>>>> locking concurrency...
>>>>>> 
>>>>>>> from my preliminary investigation, shared lock consumes more
>>>>>>> in such scenario.
>>>>>> 
>>>>>> If the write lock is also shared, then there should not be a
>>>>>> scalability issue. The shared dio locking is only half-done in
>>>>>> ext4, so perhaps comparing your workload against XFS would be
>>>>>> an informative exercise...
>>>>> 
>>>>> I've done the same test workload on xfs, it behaves the same as
>>>>> ext4 after reverting parallel dio reads and mounting with
>>>>> dioread_lock.
>>>> 
>>>> Ok, so the problem is not shared locking scalability ('cause
>>>> that's what XFS does and it scaled fine), the problem is almost
>>>> certainly that ext4 is using exclusive locking during
>>>> writes...
>>>> 
>>> 
>>> Agree. Maybe I've misled you in my previous mails.I meant shared
>>> lock makes worse in case of mixed random read/write, since we
>>> would always take inode lock during write.  And it also conflicts
>>> with dioread_nolock. It won't take any inode lock before with
>>> dioread_nolock during read, but now it always takes a shared
>>> lock.
>> 
>> No, you didn't mislead me. IIUC, the shared locking was added to the
>> direct IO read path so that it can't run concurrently with
>> operations like hole punch that free the blocks the dio read might
>> currently be operating on (use after free).
>> 
>> i.e. the shared locking fixes an actual bug, but the performance
>> regression is a result of only partially converting the direct IO
>> path to use shared locking. Only half the job was done from a
>> performance perspective. Seems to me that the two options here to
>> fix the performance regression are to either finish the shared
>> locking conversion, or remove the shared locking on read and re-open
>> a potential data exposure issue...
> 
> We actually had a separate locking mechanism in ext4 code to avoid stale
> data exposure during hole punch when unlocked DIO reads were running. But
> it was kind of ugly and making things complex. I agree we need to move ext4
> DIO path conversion further to avoid taking exclusive lock when we won't
> actually need it.

It seems to me that the right solution for the short term is to revert
the patch in question, since that appears to be incomplete, and reverting
it will restore the performance.  I haven't seen any comments posted with
a counter-example that the original patch actually improved performance,
or that reverting it will cause some other performance regression.

We can then leave implementing a more complete solution to a later kernel.

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]