[PATCH 0/3] Optimize __ext4_block_zero_page_range() for unwritten buffers

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] Optimize __ext4_block_zero_page_range() for unwritten buffers
@ 2023-09-29 14:10 Ojaswin Mujoo
  2023-09-29 14:10 ` [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range() Ojaswin Mujoo
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-09-29 14:10 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara

As per discussion with Jan here [1], this patchset intends to exit early
from __ext4_block_zero_page_range() incase the block we are about to
zero (partially) is unwritten and unmapped, since such a block doesn't
require zeroing.

Further, also make sure that calls to ext4_zero_partial_blocks()
truncate the page cache completely beforehand, so that they don't rely
on ext4_zero_partial_block() -> __ext4_block_zero_page_range() to zero
out non block aligned edges of pagecache.

Reviews and comments are appreciated!

Regards,
ojaswin

[1]
https://lore.kernel.org/linux-ext4/20230914141920.lw2nlpzhcxwuz2y6@quack3/

Ojaswin Mujoo (3):
  ext4: treat end of range as exclusive in ext4_zero_range()
  ext4: truncate complete range in pagecache before calling
    ext4_zero_partial_blocks()
  ext4: Skip unwritten buffers in __ext4_block_zero_page_range()

 fs/ext4/extents.c | 20 +++++++++++++-------
 fs/ext4/inode.c   |  7 +++++--
 2 files changed, 18 insertions(+), 9 deletions(-)

-- 
2.39.3

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range()
  2023-09-29 14:10 [PATCH 0/3] Optimize __ext4_block_zero_page_range() for unwritten buffers Ojaswin Mujoo
@ 2023-09-29 14:10 ` Ojaswin Mujoo
  2023-10-19 15:01   ` Jan Kara
  2023-09-29 14:10 ` [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks() Ojaswin Mujoo
  2023-09-29 14:10 ` [PATCH 3/3] ext4: Skip unwritten buffers in __ext4_block_zero_page_range() Ojaswin Mujoo
  2 siblings, 1 reply; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-09-29 14:10 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara

The call to filemap_write_and_wait_range() assumes the range passed to be
inclusive, so fix the call to make sure we follow that.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/extents.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e4115d338f10..c79b4c25afc4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4522,7 +4522,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	 * Round up offset. This is not fallocate, we need to zero out
 	 * blocks, so convert interior block aligned part of the range to
 	 * unwritten and possibly manually zero out unaligned parts of the
-	 * range.
+	 * range. Here, start and partial_begin are inclusive, end and
+	 * partial_end are exclusive.
 	 */
 	start = round_up(offset, 1 << blkbits);
 	end = round_down((offset + len), 1 << blkbits);
@@ -4608,7 +4609,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		 * disk in case of crash before zeroing trans is committed.
 		 */
 		if (ext4_should_journal_data(inode)) {
-			ret = filemap_write_and_wait_range(mapping, start, end);
+			ret = filemap_write_and_wait_range(mapping, start, end - 1);
 			if (ret) {
 				filemap_invalidate_unlock(mapping);
 				goto out_mutex;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks()
  2023-09-29 14:10 [PATCH 0/3] Optimize __ext4_block_zero_page_range() for unwritten buffers Ojaswin Mujoo
  2023-09-29 14:10 ` [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range() Ojaswin Mujoo
@ 2023-09-29 14:10 ` Ojaswin Mujoo
  2023-09-29 16:15   ` Ojaswin Mujoo
  2023-09-29 14:10 ` [PATCH 3/3] ext4: Skip unwritten buffers in __ext4_block_zero_page_range() Ojaswin Mujoo
  2 siblings, 1 reply; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-09-29 14:10 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara

In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned
however we only zero out the pagecache range that is block aligned. These
functions are relying on ext4_zero_partial_blocks() ->
__ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
the pageacache. However, the right thing to do is to properly zero out the whole
range in these functions before and not rely on a different function to do it
for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
complete range.

This will also allow us to now exit early for unwritten buffer heads in
__ext4_block_zero_page_range(), in upcoming patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/extents.c | 17 +++++++++++------
 fs/ext4/inode.c   |  3 +--
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c79b4c25afc4..2dc681cab6a5 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 
 	/* Zero range excluding the unaligned edges */
 	if (max_blocks > 0) {
-		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
-			  EXT4_EX_NOCACHE);
-
 		/*
 		 * Prevent page faults from reinstantiating pages we have
 		 * released from page cache.
@@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 		 * disk in case of crash before zeroing trans is committed.
 		 */
 		if (ext4_should_journal_data(inode)) {
-			ret = filemap_write_and_wait_range(mapping, start, end - 1);
+			ret = filemap_write_and_wait_range(mapping, start,
+							   end - 1);
 			if (ret) {
 				filemap_invalidate_unlock(mapping);
 				goto out_mutex;
 			}
 		}
+	}
+
+	/*
+	 * Now truncate the pagecache and zero out non page aligned edges of the
+	 * range (if any)
+	 */
+	truncate_pagecache_range(inode, offset, offset + len - 1);
 
-		/* Now release the pages and zero block aligned part of pages */
-		truncate_pagecache_range(inode, start, end - 1);
+	if (max_blocks > 0) {
 		inode->i_mtime = inode->i_ctime = current_time(inode);
 
+		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
 		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
 					     flags);
 		filemap_invalidate_unlock(mapping);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6c490f05e2ba..de8ea8430d30 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
 		ret = ext4_update_disksize_before_punch(inode, offset, length);
 		if (ret)
 			goto out_dio;
-		truncate_pagecache_range(inode, first_block_offset,
-					 last_block_offset);
 	}
+	truncate_pagecache_range(inode, offset, offset + length - 1);
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		credits = ext4_writepage_trans_blocks(inode);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] ext4: Skip unwritten buffers in __ext4_block_zero_page_range()
  2023-09-29 14:10 [PATCH 0/3] Optimize __ext4_block_zero_page_range() for unwritten buffers Ojaswin Mujoo
  2023-09-29 14:10 ` [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range() Ojaswin Mujoo
  2023-09-29 14:10 ` [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks() Ojaswin Mujoo
@ 2023-09-29 14:10 ` Ojaswin Mujoo
  2 siblings, 0 replies; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-09-29 14:10 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara

If the buffer is unwritten then the underlying block should already return zero
for reads. Further, if it is not dirty then we can be sure that there is no data
on the folio that might get written back later. Hence we skip zeroing out the
folio and underlying block in this case.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/inode.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index de8ea8430d30..75a951ffa3cb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3659,6 +3659,10 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		BUFFER_TRACE(bh, "freed: skip");
 		goto unlock;
 	}
+	if (buffer_unwritten(bh) && !buffer_dirty(bh)) {
+		BUFFER_TRACE(bh, "unwritten and non-dirty: skip");
+		goto unlock;
+	}
 	if (!buffer_mapped(bh)) {
 		BUFFER_TRACE(bh, "unmapped");
 		ext4_get_block(inode, iblock, bh, 0);
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks()
  2023-09-29 14:10 ` [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks() Ojaswin Mujoo
@ 2023-09-29 16:15   ` Ojaswin Mujoo
  2023-10-19 16:55     ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-09-29 16:15 UTC (permalink / raw)
  To: linux-ext4, Theodore Ts'o; +Cc: Ritesh Harjani, linux-kernel, Jan Kara

On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote:
> In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned
> however we only zero out the pagecache range that is block aligned. These
> functions are relying on ext4_zero_partial_blocks() ->
> __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
> the pageacache. However, the right thing to do is to properly zero out the whole
> range in these functions before and not rely on a different function to do it
> for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
> complete range.
> 
> This will also allow us to now exit early for unwritten buffer heads in
> __ext4_block_zero_page_range(), in upcoming patch.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/ext4/extents.c | 17 +++++++++++------
>  fs/ext4/inode.c   |  3 +--
>  2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index c79b4c25afc4..2dc681cab6a5 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  
>  	/* Zero range excluding the unaligned edges */
>  	if (max_blocks > 0) {
> -		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> -			  EXT4_EX_NOCACHE);
> -
>  		/*
>  		 * Prevent page faults from reinstantiating pages we have
>  		 * released from page cache.
> @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  		 * disk in case of crash before zeroing trans is committed.
>  		 */
>  		if (ext4_should_journal_data(inode)) {
> -			ret = filemap_write_and_wait_range(mapping, start, end - 1);
> +			ret = filemap_write_and_wait_range(mapping, start,
> +							   end - 1);

I think this accidentally creeped in, will fix it in next rev.

Anyways, I had some questions that might be unrelated to this patch,
I'll add them inline:

>  			if (ret) {
>  				filemap_invalidate_unlock(mapping);
>  				goto out_mutex;
>  			}
>  		}
> +	}

So the above if (max_blocks) {...} block runs when the range spans
multiple blocks but I think the filemap_write_and_wait_range() and
ext4_update_disksize_before_punch() should be called when we are actually
spanning multiple pages, since the disksize not updating issue and the 
truncate racing with checkpoint only happen when the complete page is
truncated. Is this understanding correct? 

> +
> +	/*
> +	 * Now truncate the pagecache and zero out non page aligned edges of the
> +	 * range (if any)
> +	 */
> +	truncate_pagecache_range(inode, offset, offset + len - 1);
>  
> -		/* Now release the pages and zero block aligned part of pages */
> -		truncate_pagecache_range(inode, start, end - 1);
> +	if (max_blocks > 0) {
>  		inode->i_mtime = inode->i_ctime = current_time(inode);
>  
> +		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
>  		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
>  					     flags);
>  		filemap_invalidate_unlock(mapping);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6c490f05e2ba..de8ea8430d30 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>  		ret = ext4_update_disksize_before_punch(inode, offset, length);

In this function ext4_punch_hole() I see that we call
filemap_write_and_wait_range() and then take the inode_lock() later.
Doesn't this leave a window for the pages to get dirty again? 

For example, in ext4_zero_range(), we checkpoint using
filemap_write_and_wait_range() in case of data=journal under
inode_lock() but that's not the case here. Just wondering if this 
or any other code path might still race here? 

Regards,
ojaswin

>  		if (ret)
>  			goto out_dio;
> -		truncate_pagecache_range(inode, first_block_offset,
> -					 last_block_offset);
>  	}
> +	truncate_pagecache_range(inode, offset, offset + length - 1);
>  
>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		credits = ext4_writepage_trans_blocks(inode);
> -- 
> 2.39.3
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range()
  2023-09-29 14:10 ` [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range() Ojaswin Mujoo
@ 2023-10-19 15:01   ` Jan Kara
  0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2023-10-19 15:01 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel, Jan Kara

On Fri 29-09-23 19:40:43, Ojaswin Mujoo wrote:
> The call to filemap_write_and_wait_range() assumes the range passed to be
> inclusive, so fix the call to make sure we follow that.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

Yes, makes sense. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/extents.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e4115d338f10..c79b4c25afc4 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4522,7 +4522,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	 * Round up offset. This is not fallocate, we need to zero out
>  	 * blocks, so convert interior block aligned part of the range to
>  	 * unwritten and possibly manually zero out unaligned parts of the
> -	 * range.
> +	 * range. Here, start and partial_begin are inclusive, end and
> +	 * partial_end are exclusive.
>  	 */
>  	start = round_up(offset, 1 << blkbits);
>  	end = round_down((offset + len), 1 << blkbits);
> @@ -4608,7 +4609,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  		 * disk in case of crash before zeroing trans is committed.
>  		 */
>  		if (ext4_should_journal_data(inode)) {
> -			ret = filemap_write_and_wait_range(mapping, start, end);
> +			ret = filemap_write_and_wait_range(mapping, start, end - 1);
>  			if (ret) {
>  				filemap_invalidate_unlock(mapping);
>  				goto out_mutex;
> -- 
> 2.39.3
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks()
  2023-09-29 16:15   ` Ojaswin Mujoo
@ 2023-10-19 16:55     ` Jan Kara
  2023-11-01 15:45       ` Ojaswin Mujoo
  0 siblings, 1 reply; 8+ messages in thread
From: Jan Kara @ 2023-10-19 16:55 UTC (permalink / raw)
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel, Jan Kara

On Fri 29-09-23 21:45:29, Ojaswin Mujoo wrote:
> On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote:
> > In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned
> > however we only zero out the pagecache range that is block aligned. These
> > functions are relying on ext4_zero_partial_blocks() ->
> > __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
> > the pageacache. However, the right thing to do is to properly zero out the whole
> > range in these functions before and not rely on a different function to do it
> > for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
> > complete range.
> > 
> > This will also allow us to now exit early for unwritten buffer heads in
> > __ext4_block_zero_page_range(), in upcoming patch.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > ---
> >  fs/ext4/extents.c | 17 +++++++++++------
> >  fs/ext4/inode.c   |  3 +--
> >  2 files changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index c79b4c25afc4..2dc681cab6a5 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> >  
> >  	/* Zero range excluding the unaligned edges */
> >  	if (max_blocks > 0) {
> > -		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> > -			  EXT4_EX_NOCACHE);
> > -
> >  		/*
> >  		 * Prevent page faults from reinstantiating pages we have
> >  		 * released from page cache.
> > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> >  		 * disk in case of crash before zeroing trans is committed.
> >  		 */
> >  		if (ext4_should_journal_data(inode)) {
> > -			ret = filemap_write_and_wait_range(mapping, start, end - 1);
> > +			ret = filemap_write_and_wait_range(mapping, start,
> > +							   end - 1);
> 
> I think this accidentally creeped in, will fix it in next rev.

Yeah, just pull it in patch 1.

> >  			if (ret) {
> >  				filemap_invalidate_unlock(mapping);
> >  				goto out_mutex;
> >  			}
> >  		}
> > +	}
> 
> So the above if (max_blocks) {...} block runs when the range spans
> multiple blocks but I think the filemap_write_and_wait_range() and
> ext4_update_disksize_before_punch() should be called when we are actually
> spanning multiple pages, since the disksize not updating issue and the 
> truncate racing with checkpoint only happen when the complete page is
> truncated. Is this understanding correct? 

Why do you think the issues apply only to multiple pages? I mean even if a
single block is dirty in memory, it may be pushing i_disksize or carrying
journalled data we need to commit.

> > +	/*
> > +	 * Now truncate the pagecache and zero out non page aligned edges of the
> > +	 * range (if any)
> > +	 */
> > +	truncate_pagecache_range(inode, offset, offset + len - 1);
> >  
> > -		/* Now release the pages and zero block aligned part of pages */
> > -		truncate_pagecache_range(inode, start, end - 1);
> > +	if (max_blocks > 0) {
> >  		inode->i_mtime = inode->i_ctime = current_time(inode);
> >  
> > +		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
> >  		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> >  					     flags);
> >  		filemap_invalidate_unlock(mapping);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 6c490f05e2ba..de8ea8430d30 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> >  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> 
> In this function ext4_punch_hole() I see that we call
> filemap_write_and_wait_range() and then take the inode_lock() later.
> Doesn't this leave a window for the pages to get dirty again? 

There's definitely a race window but I think the call to
filemap_write_and_wait_range() is a leftover from the past when hole
punching could race in a nasty way. These days we have invalidate_lock so I
*think* we can just remove that filemap_write_and_wait_range() call. At
least I don't see a good reason for it now because the pages are going away
anyway. But it needs testing :).

> For example, in ext4_zero_range(), we checkpoint using
> filemap_write_and_wait_range() in case of data=journal under
> inode_lock() but that's not the case here. Just wondering if this 
> or any other code path might still race here? 

Well, that's a bit different story as the comment there explains. And yes,
invalidate_lock protects us from possible races there.

> >  		if (ret)
> >  			goto out_dio;
> > -		truncate_pagecache_range(inode, first_block_offset,
> > -					 last_block_offset);
> >  	}
> > +	truncate_pagecache_range(inode, offset, offset + length - 1);

But I have realized that changes done in this patch actually don't help
with changing ext4_zero_partial_blocks() because as soon as we drop
invalidate_lock, a page fault can come in and modify contents of partial
pages we need zeroed.

So thinking about this again with fresh mind, these block vs pagecache
consistency issues aren't probably worth it and current code flow is good
enough. Sorry for misleading you. We might just add a comment to
__ext4_block_zero_page_range() to explain that buffer_unwritten() buffers
can get there but they should be already zeroed-out and uptodate and we do
need to process them because of page cache zeroing. What do you think?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks()
  2023-10-19 16:55     ` Jan Kara
@ 2023-11-01 15:45       ` Ojaswin Mujoo
  0 siblings, 0 replies; 8+ messages in thread
From: Ojaswin Mujoo @ 2023-11-01 15:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel

On Thu, Oct 19, 2023 at 06:55:46PM +0200, Jan Kara wrote:
> On Fri 29-09-23 21:45:29, Ojaswin Mujoo wrote:
> > On Fri, Sep 29, 2023 at 07:40:44PM +0530, Ojaswin Mujoo wrote:
> > > In ext4_zero_range() and ext4_punch_hole(), the range passed could be unaligned
> > > however we only zero out the pagecache range that is block aligned. These
> > > functions are relying on ext4_zero_partial_blocks() ->
> > > __ext4_block_zero_page_range() to take care of zeroing the unaligned edges in
> > > the pageacache. However, the right thing to do is to properly zero out the whole
> > > range in these functions before and not rely on a different function to do it
> > > for us. Hence, modify ext4_zero_range() and ext4_punch_hole() to zero the
> > > complete range.
> > > 
> > > This will also allow us to now exit early for unwritten buffer heads in
> > > __ext4_block_zero_page_range(), in upcoming patch.
> > > 
> > > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> > > ---
> > >  fs/ext4/extents.c | 17 +++++++++++------
> > >  fs/ext4/inode.c   |  3 +--
> > >  2 files changed, 12 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > > index c79b4c25afc4..2dc681cab6a5 100644
> > > --- a/fs/ext4/extents.c
> > > +++ b/fs/ext4/extents.c
> > > @@ -4582,9 +4582,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > >  
> > >  	/* Zero range excluding the unaligned edges */
> > >  	if (max_blocks > 0) {
> > > -		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
> > > -			  EXT4_EX_NOCACHE);
> > > -
> > >  		/*
> > >  		 * Prevent page faults from reinstantiating pages we have
> > >  		 * released from page cache.
> > > @@ -4609,17 +4606,25 @@ static long ext4_zero_range(struct file *file, loff_t offset,
> > >  		 * disk in case of crash before zeroing trans is committed.
> > >  		 */
> > >  		if (ext4_should_journal_data(inode)) {
> > > -			ret = filemap_write_and_wait_range(mapping, start, end - 1);
> > > +			ret = filemap_write_and_wait_range(mapping, start,
> > > +							   end - 1);
> > 
> > I think this accidentally creeped in, will fix it in next rev.
> 
> Yeah, just pull it in patch 1.
> 
> > >  			if (ret) {
> > >  				filemap_invalidate_unlock(mapping);
> > >  				goto out_mutex;
> > >  			}
> > >  		}
> > > +	}
> > 
> > So the above if (max_blocks) {...} block runs when the range spans
> > multiple blocks but I think the filemap_write_and_wait_range() and
> > ext4_update_disksize_before_punch() should be called when we are actually
> > spanning multiple pages, since the disksize not updating issue and the 
> > truncate racing with checkpoint only happen when the complete page is
> > truncated. Is this understanding correct? 
> 
> Why do you think the issues apply only to multiple pages? I mean even if a
> single block is dirty in memory, it may be pushing i_disksize or carrying
> journalled data we need to commit.

Hey Jan, 

You are right, I think I was misunderstanding this code a bit, thinking
that these things would only be needed if the complete folio is absent.
Upon rechecking the paths like the writeback path I can now see that
even if the blocks till i_size are already allocated (eg, through
ext4_zero_range) then we won't actually be calling
mpage_map_and_submit_extent() which is where disksize updates.

> 
> > > +	/*
> > > +	 * Now truncate the pagecache and zero out non page aligned edges of the
> > > +	 * range (if any)
> > > +	 */
> > > +	truncate_pagecache_range(inode, offset, offset + len - 1);
> > >  
> > > -		/* Now release the pages and zero block aligned part of pages */
> > > -		truncate_pagecache_range(inode, start, end - 1);
> > > +	if (max_blocks > 0) {
> > >  		inode->i_mtime = inode->i_ctime = current_time(inode);
> > >  
> > > +		flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN | EXT4_EX_NOCACHE);
> > >  		ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
> > >  					     flags);
> > >  		filemap_invalidate_unlock(mapping);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 6c490f05e2ba..de8ea8430d30 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -3974,9 +3974,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> > >  		ret = ext4_update_disksize_before_punch(inode, offset, length);
> > 
> > In this function ext4_punch_hole() I see that we call
> > filemap_write_and_wait_range() and then take the inode_lock() later.
> > Doesn't this leave a window for the pages to get dirty again? 
> 
> There's definitely a race window but I think the call to
> filemap_write_and_wait_range() is a leftover from the past when hole
> punching could race in a nasty way. These days we have invalidate_lock so I
> *think* we can just remove that filemap_write_and_wait_range() call. At
> least I don't see a good reason for it now because the pages are going away
> anyway. But it needs testing :).

> 
> > For example, in ext4_zero_range(), we checkpoint using
> > filemap_write_and_wait_range() in case of data=journal under
> > inode_lock() but that's not the case here. Just wondering if this 
> > or any other code path might still race here? 
> 
> Well, that's a bit different story as the comment there explains. And yes,
> invalidate_lock protects us from possible races there.

Ahh okay, got it.

> 
> > >  		if (ret)
> > >  			goto out_dio;
> > > -		truncate_pagecache_range(inode, first_block_offset,
> > > -					 last_block_offset);
> > >  	}
> > > +	truncate_pagecache_range(inode, offset, offset + length - 1);
> 
> But I have realized that changes done in this patch actually don't help
> with changing ext4_zero_partial_blocks() because as soon as we drop
> invalidate_lock, a page fault can come in and modify contents of partial
> pages we need zeroed.
> 
> So thinking about this again with fresh mind, these block vs pagecache
> consistency issues aren't probably worth it and current code flow is good
> enough. Sorry for misleading you. We might just add a comment to
> __ext4_block_zero_page_range() to explain that buffer_unwritten() buffers
> can get there but they should be already zeroed-out and uptodate and we do
> need to process them because of page cache zeroing. What do you think?

Oh right, I was not thinking from the mmap path, sorry about that. In
that case I think your point makes sense, lets just let this be for now.
I'll send a v2 with the first patch of the series and also add a comment
as you suggested.

Thanks for the review and taking the time to answer my questions!

Regards,
ojaswin
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-11-01 15:45 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-29 14:10 [PATCH 0/3] Optimize __ext4_block_zero_page_range() for unwritten buffers Ojaswin Mujoo
2023-09-29 14:10 ` [PATCH 1/3] ext4: treat end of range as exclusive in ext4_zero_range() Ojaswin Mujoo
2023-10-19 15:01   ` Jan Kara
2023-09-29 14:10 ` [PATCH 2/3] ext4: truncate complete range in pagecache before calling ext4_zero_partial_blocks() Ojaswin Mujoo
2023-09-29 16:15   ` Ojaswin Mujoo
2023-10-19 16:55     ` Jan Kara
2023-11-01 15:45       ` Ojaswin Mujoo
2023-09-29 14:10 ` [PATCH 3/3] ext4: Skip unwritten buffers in __ext4_block_zero_page_range() Ojaswin Mujoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).