From: Ritesh Harjani <riteshh@linux.ibm.com>
To: Jan Kara <jack@suse.cz>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org,
linux-fsdevel@vger.kernel.org, mbobrowski@mbobrowski.org
Subject: Re: [RFCv3 4/4] ext4: Move to shared iolock even without dioread_nolock mount opt
Date: Tue, 3 Dec 2019 17:24:44 +0530 [thread overview]
Message-ID: <20191203115445.6F802AE059@d06av26.portsmouth.uk.ibm.com> (raw)
In-Reply-To: <20191129171836.GB27588@quack2.suse.cz>
Hello Jan,
I have compiled something based on our discussion.
Could you please share your thoughts on below.
On 11/29/19 10:48 PM, Jan Kara wrote:
> Hello Ritesh!
>
> On Tue 26-11-19 16:21:15, Ritesh Harjani wrote:
>> On 11/20/19 8:02 PM, Jan Kara wrote:
>>> On Wed 20-11-19 10:30:24, Ritesh Harjani wrote:
>>>> We were using shared locking only in case of dioread_nolock
>>>> mount option in case of DIO overwrites. This mount condition
>>>> is not needed anymore with current code, since:-
>>>>
>>>> 1. No race between buffered writes & DIO overwrites.
>>>> Since buffIO writes takes exclusive locks & DIO overwrites
>>>> will take share locking. Also DIO path will make sure
>>>> to flush and wait for any dirty page cache data.
>>>>
>>>> 2. No race between buffered reads & DIO overwrites, since there
>>>> is no block allocation that is possible with DIO overwrites.
>>>> So no stale data exposure should happen. Same is the case
>>>> between DIO reads & DIO overwrites.
>>>>
>>>> 3. Also other paths like truncate is protected,
>>>> since we wait there for any DIO in flight to be over.
>>>>
>>>> 4. In case of buffIO writes followed by DIO reads:
>>>> Since here also we take exclusive locks in ext4_write_begin/end().
>>>> There is no risk of exposing any stale data in this case.
>>>> Since after ext4_write_end, iomap_dio_rw() will wait to flush &
>>>> wait for any dirty page cache data.
>>>>
>>>> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
>>>
>>> There's one more case to consider here as I mentioned in [1]. There can be
>>
>> Yes, I should have mentioned about this in cover letter and about my
>> thoughts on that.
>> I was of the opinion that since the race is already existing
>> and it may not be caused due to this patch, so we should handle that in
>> incremental fashion and as a separate patch series after this one.
>> Let me know your thoughts on above.
>
> Yes, I'm fine with that.
Sure thanks, will do that.
>
>> Also, I wanted to have some more discussions on this race before
>> making the changes.
>> But nevertheless, it's the right time to discuss those changes here.
>>
>>> mmap write instantiating dirty page and then someone starting writeback
>>> against that page while DIO read is running still theoretically leading to
>>> stale data exposure. Now this patch does not have influence on that race
>>> but:
>>
>> Yes, agreed.
>>
>>>
>>> 1) We need to close the race mentioned above. Maybe we could do that by
>>> proactively allocating unwritten blocks for a page being faulted when there
>>> is direct IO running against the file - the one who fills holes through
>>> mmap write while direct IO is running on the file deserves to suffer the
>>> performance penalty...
>>
>> I was giving this a thought. So even if we try to penalize mmap
>> write as you mentioned above, what I am not sure about it, is that, how can
>> we reliably detect that the DIO is in progress?
>>
>> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
>> ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
>> lock protection, no?
>> Because after the check the DIO can still snoop in, right?
>
> Yes, doing this reliably will need some code tweaking. Also thinking about
> this in detail, doing a reliable check in ext4_page_mkwrite() is
> somewhat difficult so it will be probably less error-prone to deal with the
> race in the writeback path.
hmm. But if we don't do in ext4_page_mkwrite, then I am afraid on
how to handle nodelalloc scenario. Where we will directly go and
allocate block via ext4_get_block() in ext4_page_mkwrite(),
as explained below.
I guess we may need some tweaking at both places.
>
> My preferred way of dealing with this would be to move inode_dio_begin()
> call in iomap_dio_rw() a bit earlier before page cache invalidation and add
> there smp_mb_after_atomic() (so that e.g. nrpages checks cannot get
> reordered before the increment). Then the check on i_dio_count in
> ext4_writepages() will be reliable if we do it after gathering and locking
> pages for writeback (i.e., in mpage_map_and_submit_extent()) - either we
> see i_dio_count elevated and use the safe (but slower) writeback using
> unwritten extents, or we see don't and then we are sure DIO will not start
> until writeback of the pages we have locked has finished because of
> filemap_write_and_wait() call in iomap_dio_rw().
>
>
Thanks for explaining this in detail. I guess I understand this part now
Earlier my understanding towards mapping->nrpages was not complete.
AFAIU, with your above suggestion the race won't happen for delalloc
cases. But what if it is a nodelalloc mount option?
Say with above changes i.e. after tweaking iomap_dio_rw() code as you
mentioned above. Below race could still happen, right?
iomap_dio_rw()
filemap_write_and_wait_range()
inode_dio_begin()
smp_mb__after_atomic()
invalidate_inode_pages2_range()
ext4_page_mkwrite()
block_page_mkwrite()
lock_page()
ext4_get_block()
ext4_map_blocks()
//this will return IOMAP_MAPPED entry
submit_bio()
// this goes and reads the block
// with stale data allocated,
// by ext4_page_mkwrite()
Now, I am assuming that ext4_get_block() via ext4_page_mkwrite() path
may try to create the block for hole then and there itself.
And if submit_bio() from DIO path is serviced late i.e. after
ext4_get_block() has already allocated block there, then this may expose
stale data. Thoughts?
So to avoid both such races in delalloc & in nodelalloc path,
we should add the checks at both ext4_writepages() & also at
ext4_page_mkwrite().
For ext4_page_mkwrite(), why don't we just change the "get_block"
function pointer which is passed to block_page_mkwrite()
as below. This should solve our race since
ext4_dio_check_get_block() will be only called with lock_page()
held. And also with inode_dio_begin() now moved up before
invalidate_inode_pages2_range(), we could be sure
about DIO is currently running or not in ext4_page_mkwrite() path.
Does this looks correct to you?
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 381813205f99..74c33d03592c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -806,6 +806,19 @@ int ext4_get_block_unwritten(struct inode *inode,
sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}
+int ext4_dio_check_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh, int create)
+{
+ get_block_t *get_block;
+
+ if (!atomic_read(&inode->i_dio_count))
+ get_block = ext4_get_block;
+ else
+ get_block = ext4_get_block_unwritten;
+
+ return get_block(inode, iblock, bh, create);
+}
+
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
@@ -2332,7 +2345,8 @@ static int mpage_map_one_extent(handle_t *handle,
struct mpage_da_data *mpd)
struct inode *inode = mpd->inode;
struct ext4_map_blocks *map = &mpd->map;
int get_blocks_flags;
- int err, dioread_nolock;
+ int err;
+ bool dio_in_progress = atomic_read(&inode->i_dio_count);
trace_ext4_da_write_pages_extent(inode, map);
/*
@@ -2353,8 +2367,14 @@ static int mpage_map_one_extent(handle_t *handle,
struct mpage_da_data *mpd)
get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
EXT4_GET_BLOCKS_METADATA_NOFAIL |
EXT4_GET_BLOCKS_IO_SUBMIT;
- dioread_nolock = ext4_should_dioread_nolock(inode);
- if (dioread_nolock)
+
+ /*
+ * There could be race between DIO read & ext4_page_mkwrite
+ * where in delalloc case, we may go and try to allocate the
+ * block here but if DIO read is in progress then it may expose
+ * stale data, hence use unwritten blocks for allocation
+ * when DIO is in progress.
+ */
+ if (dio_in_progress)
get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
if (map->m_flags & (1 << BH_Delay))
get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
@@ -2362,7 +2382,7 @@ static int mpage_map_one_extent(handle_t *handle,
struct mpage_da_data *mpd)
err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
if (err < 0)
return err;
- if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+ if (dio_in_progress && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
if (!mpd->io_submit.io_end->handle &&
ext4_handle_valid(handle)) {
mpd->io_submit.io_end->handle = handle->h_rsv_handle;
@@ -5906,10 +5926,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
}
unlock_page(page);
/* OK, we need to fill the hole... */
- if (ext4_should_dioread_nolock(inode))
- get_block = ext4_get_block_unwritten;
- else
- get_block = ext4_get_block;
+ get_block = ext4_dio_check_get_block;
retry_alloc:
handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
ext4_writepage_trans_blocks(inode));
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 2f88d64c2a4d..09d0601e5ecb 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -465,6 +465,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (ret)
goto out_free_dio;
+ inode_dio_begin(inode);
+ smp_mb__after_atomic();
/*
* Try to invalidate cache pages for the range we're direct
* writing. If this invalidation fails, tough, the write will
@@ -484,8 +486,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
goto out_free_dio;
}
- inode_dio_begin(inode);
-
blk_start_plug(&plug);
do {
ret = iomap_apply(inode, pos, count, flags, ops, dio,
-ritesh
next prev parent reply other threads:[~2019-12-03 11:54 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-20 5:00 [RFCv3 0/4] ext4: Introducing ilock wrapper APIs & fixing i_rwsem scalablity prob. in DIO mixed-rw Ritesh Harjani
2019-11-20 5:00 ` [RFCv3 1/4] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
2019-11-20 12:51 ` Jan Kara
2019-11-22 5:53 ` Matthew Bobrowski
2019-11-20 5:00 ` [RFCv3 2/4] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
2019-11-20 11:23 ` Matthew Bobrowski
2019-11-20 12:18 ` Ritesh Harjani
2019-11-20 16:35 ` Matthew Wilcox
2019-11-23 11:51 ` Ritesh Harjani
2019-11-20 13:11 ` Jan Kara
2019-11-20 16:06 ` Darrick J. Wong
2019-11-20 5:00 ` [RFCv3 3/4] ext4: start with shared iolock in case of DIO instead of excl. iolock Ritesh Harjani
2019-11-20 13:55 ` Jan Kara
2019-11-23 13:17 ` Ritesh Harjani
2019-11-20 5:00 ` [RFCv3 4/4] ext4: Move to shared iolock even without dioread_nolock mount opt Ritesh Harjani
2019-11-20 14:32 ` Jan Kara
2019-11-26 10:51 ` Ritesh Harjani
2019-11-26 12:45 ` Ritesh Harjani
2019-11-29 17:23 ` Jan Kara
2019-11-29 17:18 ` Jan Kara
2019-12-03 11:54 ` Ritesh Harjani [this message]
2019-12-03 12:39 ` Jan Kara
2019-12-03 13:10 ` Ritesh Harjani
2019-12-03 13:48 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20191203115445.6F802AE059@d06av26.portsmouth.uk.ibm.com \
--to=riteshh@linux.ibm.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=mbobrowski@mbobrowski.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).