From: Ritesh Harjani <riteshh@linux.ibm.com>
To: Jan Kara <jack@suse.cz>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org,
linux-fsdevel@vger.kernel.org, mbobrowski@mbobrowski.org,
Ritesh Harjani <riteshh@linux.ibm.com>
Subject: Re: [RFCv3 3/4] ext4: start with shared iolock in case of DIO instead of excl. iolock
Date: Sat, 23 Nov 2019 18:47:33 +0530 [thread overview]
Message-ID: <20191123131734.D566711C050@d06av25.portsmouth.uk.ibm.com> (raw)
In-Reply-To: <20191120135514.GD9509@quack2.suse.cz>
Hello Jan,
Thanks for a thorough review.
On 11/20/19 7:25 PM, Jan Kara wrote:
> On Wed 20-11-19 10:30:23, Ritesh Harjani wrote:
>> Earlier there was no shared lock in DIO read path.
>> But this patch (16c54688592ce: ext4: Allow parallel DIO reads)
>> simplified some of the locking mechanism while still allowing
>> for parallel DIO reads by adding shared lock in inode DIO
>> read path.
>>
>> But this created problem with mixed read/write workload.
>> It is due to the fact that in DIO path, we first start with
>> exclusive lock and only when we determine that it is a ovewrite
>> IO, we downgrade the lock. This causes the problem, since
>> with above patch we have shared locking in DIO reads.
>>
>> So, this patch tries to fix this issue by starting with
>> shared lock and then switching to exclusive lock only
>> when required based on ext4_dio_write_checks().
>>
>> Other than that, it also simplifies below cases:-
>>
>> 1. Simplified ext4_unaligned_aio API to ext4_unaligned_io.
>> Previous API was abused in the sense that it was not really checking
>> for AIO anywhere also it used to check for extending writes.
>> So this API was renamed and simplified to ext4_unaligned_io()
>> which actully only checks if the IO is really unaligned.
>>
>> Now, in case of unaligned direct IO, iomap_dio_rw needs to do
>> zeroing of partial block and that will require serialization
>> against other direct IOs in the same block. So we take a excl iolock
>> for any unaligned DIO.
>> In case of AIO we also need to wait for any outstanding IOs to
>> complete so that conversion from unwritten to written is completed
>> before anyone try to map the overlapping block. Hence we take
>> excl iolock and also wait for inode_dio_wait() for unaligned DIO case.
>> Please note since we are anyway taking an exclusive lock in unaligned IO,
>> inode_dio_wait() becomes a no-op in case of non-AIO DIO.
>>
>> 2. Added ext4_extending_io(). This checks if the IO is extending the file.
>>
>> 3. Added ext4_dio_write_checks().
>> In this we start with shared iolock and only switch to exclusive iolock
>> if required. So in most cases with aligned, non-extening, dioread_nolock
>> overwrites tries to write with a shared locking.
>> If not, then we restart the operation in ext4_dio_write_checks(),
>> after acquiring excl iolock.
>>
>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>> ---
>> fs/ext4/file.c | 191 ++++++++++++++++++++++++++++++++++++-------------
>> 1 file changed, 142 insertions(+), 49 deletions(-)
>
> Thanks for the patch! Some comments below...
>
>> @@ -365,15 +383,110 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
>> .end_io = ext4_dio_write_end_io,
>> };
>>
>> +/*
>> + * The intention here is to start with shared lock acquired then see if any
>> + * condition requires an exclusive inode lock. If yes, then we restart the
>> + * whole operation by releasing the shared lock and acquiring exclusive lock.
>> + *
>> + * - For unaligned_io we never take shared lock as it may cause data corruption
>> + * when two unaligned IO tries to modify the same block e.g. while zeroing.
>> + *
>> + * - For extending writes case we don't take the shared lock, since it requires
>> + * updating inode i_disksize and/or orphan handling with exclusive lock.
>> + *
>> + * - shared locking will only be true mostly in case of overwrites with
>> + * dioread_nolock mode. Otherwise we will switch to excl. iolock mode.
>> + */
>> +static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
>> + unsigned int *iolock, bool *unaligned_io,
>> + bool *extend)
>> +{
>> + struct file *file = iocb->ki_filp;
>> + struct inode *inode = file_inode(file);
>> + loff_t offset = iocb->ki_pos;
>> + loff_t final_size;
>> + size_t count;
>> + ssize_t ret;
>> +
>> +restart:
>> + /* Fallback to buffered I/O if the inode does not support direct I/O. */
>> + if (!ext4_dio_supported(inode)) {
>> + ext4_iunlock(inode, *iolock);
>> + return ext4_buffered_write_iter(iocb, from);
>> + }
>
> I don't think it is good to hide things like this fallback to buffered IO
> in ext4_dio_write_checks(). Similarly with 'unaligned_io' and 'extend'
Yes, make sense. Yup will move above block of code in
ext4_dio_write_iter() before calling for ext4_dio_write_checks().
> variables below. So I'd rather leave this in ext4_dio_write_iter() and just
> move file_modified() from ext4_write_checks() since that seems to be the
Yes, in this patch itself that has been done. We removed file_modified()
from ext4_write_checks() & renamed it to ext4_generic_write_checks().
> only thing that cannot be always done with shared i_rwsem, can it?
AFAIU, that's right. Exclusive lock must be needed to change the
security info for inode.
>
>> +
>> + ret = ext4_generic_write_checks(iocb, from);
>> + if (ret <= 0) {
>> + ext4_iunlock(inode, *iolock);
>> + return ret;
>> + }
>> +
>> + /* Recalculate since offset & count may change above. */
>> + offset = iocb->ki_pos;
>> + count = iov_iter_count(from);
>> + final_size = offset + count;
>> +
>> + if (ext4_unaligned_io(inode, from, offset))
>> + *unaligned_io = true;
>
> No need to recheck alignment here. That cannot change over time regardless
> of locks we hold...
hmm. I think I got confused by function iov_iter_truncate() called
in ext4_generic_write_checks().
But looking at it again, I agree that alignment won't change here.
will remove the alignment check from here.
>
>> +
>> + if (ext4_extending_io(inode, offset, count))
>> + *extend = true;
>> + /*
>> + * Determine whether the IO operation will overwrite allocated
>> + * and initialized blocks. If so, check to see whether it is
>> + * possible to take the dioread_nolock path.
>> + *
>> + * We need exclusive i_rwsem for changing security info
>> + * in file_modified().
>> + */
>> + if (*iolock == EXT4_IOLOCK_SHARED &&
>> + (!IS_NOSEC(inode) || *unaligned_io || *extend ||
>> + !ext4_should_dioread_nolock(inode) ||
>> + !ext4_overwrite_io(inode, offset, count))) {
>> + ext4_iunlock(inode, *iolock);
>> + *iolock = EXT4_IOLOCK_EXCL;
>> + ext4_ilock(inode, *iolock);
>> + goto restart;
>> + }
>> +
>> + ret = file_modified(file);
>> + if (ret < 0) {
>> + ext4_iunlock(inode, *iolock);
>> + return ret;
>> + }
>> +
>> + return count;
>> +}
>> +
>> static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> {
>> ssize_t ret;
>> - size_t count;
>> - loff_t offset;
>> handle_t *handle;
>> struct inode *inode = file_inode(iocb->ki_filp);
>> - bool extend = false, overwrite = false, unaligned_aio = false;
>> - unsigned int iolock = EXT4_IOLOCK_EXCL;
>> + loff_t offset = iocb->ki_pos;
>> + size_t count = iov_iter_count(from);
>> + bool extend = false, unaligned_io = false;
>> + unsigned int iolock = EXT4_IOLOCK_SHARED;
>> +
>> + /*
>> + * We initially start with shared inode lock
>> + * unless it is unaligned IO which needs
>> + * exclusive lock anyways.
>> + */
>> + if (ext4_unaligned_io(inode, from, offset)) {
>> + unaligned_io = true;
>> + iolock = EXT4_IOLOCK_EXCL;
>> + }
>> + /*
>> + * Extending writes need exclusive lock.
>> + */
>> + if (ext4_extending_io(inode, offset, count)) {
>> + extend = true;
>> + iolock = EXT4_IOLOCK_EXCL;
>> + }
>
> You cannot read EXT4_I(inode)->i_disksize without some lock (either
> inode->i_rwsem or EXT4_I(inode)->i_data_sem). So I'd just do here a quick
> check with i_size here (probably don't set extend, but just make note to
> start with exclusive i_rwsem) and later when we hold i_rwsem, we can do a
> reliable check.
hmm. my bad. I think I later moved this block of code from
ext4_dio_write_checks() down here.
Thanks for correcting it.
Will only check against 'i_size' here then along with a
comment that a more reliable check with
shared lock has been done in ext4_dio_write_checks()
via ext4_extending_io().
>
>> + if (iolock == EXT4_IOLOCK_SHARED && !ext4_should_dioread_nolock(inode))
>> + iolock = EXT4_IOLOCK_EXCL;
>>
>> if (iocb->ki_flags & IOCB_NOWAIT) {
>> if (!ext4_ilock_nowait(inode, iolock))
>> @@ -382,47 +495,28 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> ext4_ilock(inode, iolock);
>> }
>>
>> - if (!ext4_dio_supported(inode)) {
>> - ext4_iunlock(inode, iolock);
>> - /*
>> - * Fallback to buffered I/O if the inode does not support
>> - * direct I/O.
>> - */
>> - return ext4_buffered_write_iter(iocb, from);
>> - }
>> -
>> - ret = ext4_write_checks(iocb, from);
>> - if (ret <= 0) {
>> - ext4_iunlock(inode, iolock);
>> + ret = ext4_dio_write_checks(iocb, from, &iolock, &unaligned_io,
>> + &extend);
>> + if (ret <= 0)
>> return ret;
>> - }
>>
>> - /*
>> - * Unaligned asynchronous direct I/O must be serialized among each
>> - * other as the zeroing of partial blocks of two competing unaligned
>> - * asynchronous direct I/O writes can result in data corruption.
>> - */
>> offset = iocb->ki_pos;
>> count = iov_iter_count(from);
>> - if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> - !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
>> - unaligned_aio = true;
>> - inode_dio_wait(inode);
>> - }
>>
>> /*
>> - * Determine whether the I/O will overwrite allocated and initialized
>> - * blocks. If so, check to see whether it is possible to take the
>> - * dioread_nolock path.
>> + * Unaligned direct IO must be serialized among each other as zeroing
>> + * of partial blocks of two competing unaligned IOs can result in data
>> + * corruption.
>> + *
>> + * So we make sure we don't allow any unaligned IO in flight.
>> + * For IOs where we need not wait (like unaligned non-AIO DIO),
>> + * below inode_dio_wait() may anyway become a no-op, since we start
>> + * with exclusive lock.
>> */
>> - if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
>> - ext4_should_dioread_nolock(inode)) {
>> - overwrite = true;
>> - ext4_ilock_demote(inode, iolock);
>> - iolock = EXT4_IOLOCK_SHARED;
>> - }
>> + if (unaligned_io)
>> + inode_dio_wait(inode);
>>
>> - if (offset + count > EXT4_I(inode)->i_disksize) {
>> + if (extend) {
>> handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>> if (IS_ERR(handle)) {
>> ret = PTR_ERR(handle);
>> @@ -435,12 +529,11 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> goto out;
>> }
>>
>> - extend = true;
>> ext4_journal_stop(handle);
>> }
>>
>> ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
>> - is_sync_kiocb(iocb) || unaligned_aio || extend);
>> + is_sync_kiocb(iocb) || unaligned_io || extend);
>>
>> if (extend)
>> ret = ext4_handle_inode_extension(inode, offset, ret, count);
>
> Honza
>
next prev parent reply other threads:[~2019-11-23 13:17 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-20 5:00 [RFCv3 0/4] ext4: Introducing ilock wrapper APIs & fixing i_rwsem scalablity prob. in DIO mixed-rw Ritesh Harjani
2019-11-20 5:00 ` [RFCv3 1/4] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
2019-11-20 12:51 ` Jan Kara
2019-11-22 5:53 ` Matthew Bobrowski
2019-11-20 5:00 ` [RFCv3 2/4] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
2019-11-20 11:23 ` Matthew Bobrowski
2019-11-20 12:18 ` Ritesh Harjani
2019-11-20 16:35 ` Matthew Wilcox
2019-11-23 11:51 ` Ritesh Harjani
2019-11-20 13:11 ` Jan Kara
2019-11-20 16:06 ` Darrick J. Wong
2019-11-20 5:00 ` [RFCv3 3/4] ext4: start with shared iolock in case of DIO instead of excl. iolock Ritesh Harjani
2019-11-20 13:55 ` Jan Kara
2019-11-23 13:17 ` Ritesh Harjani [this message]
2019-11-20 5:00 ` [RFCv3 4/4] ext4: Move to shared iolock even without dioread_nolock mount opt Ritesh Harjani
2019-11-20 14:32 ` Jan Kara
2019-11-26 10:51 ` Ritesh Harjani
2019-11-26 12:45 ` Ritesh Harjani
2019-11-29 17:23 ` Jan Kara
2019-11-29 17:18 ` Jan Kara
2019-12-03 11:54 ` Ritesh Harjani
2019-12-03 12:39 ` Jan Kara
2019-12-03 13:10 ` Ritesh Harjani
2019-12-03 13:48 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191123131734.D566711C050@d06av25.portsmouth.uk.ibm.com \
--to=riteshh@linux.ibm.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=mbobrowski@mbobrowski.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).