All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ritesh Harjani <riteshh@linux.ibm.com>
To: Jan Kara <jack@suse.cz>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, mbobrowski@mbobrowski.org
Subject: Re: [RFCv3 4/4] ext4: Move to shared iolock even without dioread_nolock mount opt
Date: Tue, 3 Dec 2019 18:40:47 +0530	[thread overview]
Message-ID: <20191203131048.A4559A4051@d06av23.portsmouth.uk.ibm.com> (raw)
In-Reply-To: <20191203123929.GE8206@quack2.suse.cz>



On 12/3/19 6:09 PM, Jan Kara wrote:
> 
> Hello Ritesh!
> 
> On Tue 03-12-19 17:24:44, Ritesh Harjani wrote:
>> On 11/29/19 10:48 PM, Jan Kara wrote:
>>>> Also, I wanted to have some more discussions on this race before
>>>> making the changes.
>>>> But nevertheless, it's the right time to discuss those changes here.
>>>>
>>>>> mmap write instantiating dirty page and then someone starting writeback
>>>>> against that page while DIO read is running still theoretically leading to
>>>>> stale data exposure. Now this patch does not have influence on that race
>>>>> but:
>>>>
>>>> Yes, agreed.
>>>>
>>>>>
>>>>> 1) We need to close the race mentioned above. Maybe we could do that by
>>>>> proactively allocating unwritten blocks for a page being faulted when there
>>>>> is direct IO running against the file - the one who fills holes through
>>>>> mmap write while direct IO is running on the file deserves to suffer the
>>>>> performance penalty...
>>>>
>>>> I was giving this a thought. So even if we try to penalize mmap
>>>> write as you mentioned above, what I am not sure about it, is that, how can
>>>> we reliably detect that the DIO is in progress?
>>>>
>>>> Say even if we try to check for atomic_read(&inode->i_dio_count) in mmap
>>>> ext4_page_mkwrite path, it cannot be reliable unless there is some sort of a
>>>> lock protection, no?
>>>> Because after the check the DIO can still snoop in, right?
>>>
>>> Yes, doing this reliably will need some code tweaking. Also thinking about
>>> this in detail, doing a reliable check in ext4_page_mkwrite() is
>>> somewhat difficult so it will be probably less error-prone to deal with the
>>> race in the writeback path.
>>
>> hmm. But if we don't do in ext4_page_mkwrite, then I am afraid on
>> how to handle nodelalloc scenario. Where we will directly go and
>> allocate block via ext4_get_block() in ext4_page_mkwrite(),
>> as explained below.
>> I guess we may need some tweaking at both places.
> 
> Ok, I forgot to mention that. Yes, the nodelalloc case in
> ext4_page_mkwrite() still needs tweaking. But that is not performance
> sensitive path at all. So we can just have there:

hmm. I was of the opinion that why use unwritten blocks or move
from written to unwritten method while we can still avoid it.

> 
> 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> 		get_block = ext4_get_block_unwritten;
> 	else
> 		get_block = ext4_get_block;
> 

Although adding a function ext4_dio_check_get_block() as described in
previous email is also trivial, which could avoid using unwritten
blocks here when DIO is not in progress.
But if you think it's not worth it, then I will go with your suggestion
here.



> and be done with it. And yes, for inodes using indirect blocks, direct IO
> reads can still theoretically expose data from blocks instantiated by hole
> filling from ext4_page_mkwrite(). But that race has always been there
> regardless of DIO locking and is hardly fixable with that on-disk format.
> 

Agreed.


> 								Honza
> 
>>
>>
>>>
>>> My preferred way of dealing with this would be to move inode_dio_begin()
>>> call in iomap_dio_rw() a bit earlier before page cache invalidation and add
>>> there smp_mb_after_atomic() (so that e.g. nrpages checks cannot get
>>> reordered before the increment).  Then the check on i_dio_count in
>>> ext4_writepages() will be reliable if we do it after gathering and locking
>>> pages for writeback (i.e., in mpage_map_and_submit_extent()) - either we
>>> see i_dio_count elevated and use the safe (but slower) writeback using
>>> unwritten extents, or we see don't and then we are sure DIO will not start
>>> until writeback of the pages we have locked has finished because of
>>> filemap_write_and_wait() call in iomap_dio_rw().
>>>
>>>
>>
>> Thanks for explaining this in detail. I guess I understand this part now
>> Earlier my understanding towards mapping->nrpages was not complete.
>>
>> AFAIU, with your above suggestion the race won't happen for delalloc
>> cases. But what if it is a nodelalloc mount option?
>>
>> Say with above changes i.e. after tweaking iomap_dio_rw() code as you
>> mentioned above. Below race could still happen, right?
>>
>> iomap_dio_rw()					
>> filemap_write_and_wait_range() 			
>> inode_dio_begin()
>> smp_mb__after_atomic()
>> invalidate_inode_pages2_range()				
>> 						ext4_page_mkwrite()
>> 						block_page_mkwrite()
>> 		  				  lock_page()
>> 						  ext4_get_block()
>>
>> ext4_map_blocks()
>> //this will return IOMAP_MAPPED entry
>>
>> submit_bio()
>> // this goes and reads the block
>> // with stale data allocated,
>> // by ext4_page_mkwrite()
>>
>>
>> Now, I am assuming that ext4_get_block() via ext4_page_mkwrite() path
>> may try to create the block for hole then and there itself.
>> And if submit_bio() from DIO path is serviced late i.e. after
>> ext4_get_block() has already allocated block there, then this may expose
>> stale data. Thoughts?
>>
>>
>> So to avoid both such races in delalloc & in nodelalloc path,
>> we should add the checks at both ext4_writepages() & also at
>> ext4_page_mkwrite().
>>
>> For ext4_page_mkwrite(), why don't we just change the "get_block"
>> function pointer which is passed to block_page_mkwrite()
>> as below. This should solve our race since
>> ext4_dio_check_get_block() will be only called with lock_page()
>> held. And also with inode_dio_begin() now moved up before
>> invalidate_inode_pages2_range(), we could be sure
>> about DIO is currently running or not in ext4_page_mkwrite() path.
>>
>> Does this looks correct to you?
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 381813205f99..74c33d03592c 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -806,6 +806,19 @@ int ext4_get_block_unwritten(struct inode *inode,
>> sector_t iblock,
>>   			       EXT4_GET_BLOCKS_IO_CREATE_EXT);
>>   }
>>
>> +int ext4_dio_check_get_block(struct inode *inode, sector_t iblock,
>> +		   struct buffer_head *bh, int create)
>> +{
>> +	get_block_t *get_block;
>> +
>> +	if (!atomic_read(&inode->i_dio_count))
>> +		get_block = ext4_get_block;
>> +	else
>> +		get_block = ext4_get_block_unwritten;
>> +
>> +	return get_block(inode, iblock, bh, create);
>> +}
>> +
>>   /* Maximum number of blocks we map for direct IO at once. */
>>   #define DIO_MAX_BLOCKS 4096
>>
>> @@ -2332,7 +2345,8 @@ static int mpage_map_one_extent(handle_t *handle,
>> struct mpage_da_data *mpd)
>>   	struct inode *inode = mpd->inode;
>>   	struct ext4_map_blocks *map = &mpd->map;
>>   	int get_blocks_flags;
>> -	int err, dioread_nolock;
>> +	int err;
>> +	bool dio_in_progress = atomic_read(&inode->i_dio_count);
>>
>>   	trace_ext4_da_write_pages_extent(inode, map);
>>   	/*
>> @@ -2353,8 +2367,14 @@ static int mpage_map_one_extent(handle_t *handle,
>> struct mpage_da_data *mpd)
>>   	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
>>   			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
>>   			   EXT4_GET_BLOCKS_IO_SUBMIT;
>> -	dioread_nolock = ext4_should_dioread_nolock(inode);
>> -	if (dioread_nolock)
>> +
>> +	/*
>> +	 * There could be race between DIO read & ext4_page_mkwrite
>> +	 * where in delalloc case, we may go and try to allocate the
>> +	 * block here but if DIO read is in progress then it may expose
>> +	 * stale data, hence use unwritten blocks for allocation
>> +	 * when DIO is in progress.
>> +	 */
>> +	if (dio_in_progress)
>>   		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
>>   	if (map->m_flags & (1 << BH_Delay))
>>   		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
>> @@ -2362,7 +2382,7 @@ static int mpage_map_one_extent(handle_t *handle,
>> struct mpage_da_data *mpd)
>>   	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
>>   	if (err < 0)
>>   		return err;
>> -	if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
>> +	if (dio_in_progress && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
>>   		if (!mpd->io_submit.io_end->handle &&
>>   		    ext4_handle_valid(handle)) {
>>   			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
>> @@ -5906,10 +5926,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>>   	}
>>   	unlock_page(page);
>>   	/* OK, we need to fill the hole... */
>> -	if (ext4_should_dioread_nolock(inode))
>> -		get_block = ext4_get_block_unwritten;
>> -	else
>> -		get_block = ext4_get_block;
>> +	get_block = ext4_dio_check_get_block;
>>   retry_alloc:
>>   	handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE,
>>   				    ext4_writepage_trans_blocks(inode));
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 2f88d64c2a4d..09d0601e5ecb 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -465,6 +465,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   	if (ret)
>>   		goto out_free_dio;
>>
>> +	inode_dio_begin(inode);
>> +	smp_mb__after_atomic();
>>   	/*
>>   	 * Try to invalidate cache pages for the range we're direct
>>   	 * writing.  If this invalidation fails, tough, the write will
>> @@ -484,8 +486,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   			goto out_free_dio;
>>   	}
>>
>> -	inode_dio_begin(inode);
>> -
>>   	blk_start_plug(&plug);
>>   	do {
>>   		ret = iomap_apply(inode, pos, count, flags, ops, dio,
>>
>>
>>
>> -ritesh
>>


  reply	other threads:[~2019-12-03 13:11 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-20  5:00 [RFCv3 0/4] ext4: Introducing ilock wrapper APIs & fixing i_rwsem scalablity prob. in DIO mixed-rw Ritesh Harjani
2019-11-20  5:00 ` [RFCv3 1/4] ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAIT Ritesh Harjani
2019-11-20 12:51   ` Jan Kara
2019-11-22  5:53   ` Matthew Bobrowski
2019-11-20  5:00 ` [RFCv3 2/4] ext4: Add ext4_ilock & ext4_iunlock API Ritesh Harjani
2019-11-20 11:23   ` Matthew Bobrowski
2019-11-20 12:18     ` Ritesh Harjani
2019-11-20 16:35       ` Matthew Wilcox
2019-11-23 11:51         ` Ritesh Harjani
2019-11-20 13:11   ` Jan Kara
2019-11-20 16:06     ` Darrick J. Wong
2019-11-20  5:00 ` [RFCv3 3/4] ext4: start with shared iolock in case of DIO instead of excl. iolock Ritesh Harjani
2019-11-20 13:55   ` Jan Kara
2019-11-23 13:17     ` Ritesh Harjani
2019-11-20  5:00 ` [RFCv3 4/4] ext4: Move to shared iolock even without dioread_nolock mount opt Ritesh Harjani
2019-11-20 14:32   ` Jan Kara
2019-11-26 10:51     ` Ritesh Harjani
2019-11-26 12:45       ` Ritesh Harjani
2019-11-29 17:23         ` Jan Kara
2019-11-29 17:18       ` Jan Kara
2019-12-03 11:54         ` Ritesh Harjani
2019-12-03 12:39           ` Jan Kara
2019-12-03 13:10             ` Ritesh Harjani [this message]
2019-12-03 13:48               ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20191203131048.A4559A4051@d06av23.portsmouth.uk.ibm.com \
    --to=riteshh@linux.ibm.com \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=mbobrowski@mbobrowski.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.