linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
@ 2019-06-18 14:47 Andreas Gruenbacher
  2019-06-19 16:01 ` Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Andreas Gruenbacher @ 2019-06-18 14:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: cluster-devel, linux-fsdevel, Jan Kara, Andreas Gruenbacher

Remove the mark_inode_dirty call from __generic_write_end and add it to
generic_write_end and the high-level iomap functions where necessary.
That way, large writes will only dirty inodes at the end instead of
dirtying them once per page.  This fixes a performance bottleneck on
gfs2.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/buffer.c | 26 ++++++++++++++++++--------
 fs/iomap.c  | 42 ++++++++++++++++++++++++++++++++++++++----
 2 files changed, 56 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e450c55f6434..1b51003bc9d2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2089,8 +2089,7 @@ EXPORT_SYMBOL(block_write_begin);
 void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
 		struct page *page)
 {
-	loff_t old_size = inode->i_size;
-	bool i_size_changed = false;
+	loff_t old_size;
 
 	/*
 	 * No need to use i_size_read() here, the i_size cannot change under us
@@ -2099,23 +2098,21 @@ void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
 	 * But it's important to update i_size while still holding page lock:
 	 * page writeout could otherwise come in and zero beyond i_size.
 	 */
-	if (pos + copied > inode->i_size) {
+	old_size = inode->i_size;
+	if (pos + copied > old_size)
 		i_size_write(inode, pos + copied);
-		i_size_changed = true;
-	}
 
 	unlock_page(page);
 
 	if (old_size < pos)
 		pagecache_isize_extended(inode, old_size, pos);
+
 	/*
 	 * Don't mark the inode dirty under page lock. First, it unnecessarily
 	 * makes the holding time of page lock longer. Second, it forces lock
 	 * ordering of page lock and transaction start for journaling
 	 * filesystems.
 	 */
-	if (i_size_changed)
-		mark_inode_dirty(inode);
 }
 
 int block_write_end(struct file *file, struct address_space *mapping,
@@ -2158,9 +2155,22 @@ int generic_write_end(struct file *file, struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned copied,
 			struct page *page, void *fsdata)
 {
+	struct inode *inode = mapping->host;
+	loff_t old_size;
+
+	/*
+	 * No need to use i_size_read() here, the i_size cannot change under us
+	 * because we hold i_rwsem.
+	 */
+	old_size = inode->i_size;
+
 	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
-	__generic_write_end(mapping->host, pos, copied, page);
+	__generic_write_end(inode, pos, copied, page);
 	put_page(page);
+
+	if (old_size != inode->i_size)
+		mark_inode_dirty(inode);
+
 	return copied;
 }
 EXPORT_SYMBOL(generic_write_end);
diff --git a/fs/iomap.c b/fs/iomap.c
index 23ef63fd1669..9454568a7f5e 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -881,6 +881,13 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 {
 	struct inode *inode = iocb->ki_filp->f_mapping->host;
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
+	loff_t old_size;
+
+        /*
+	 * No need to use i_size_read() here, the i_size cannot change under us
+	 * because we hold i_rwsem.
+	 */
+	old_size = inode->i_size;
 
 	while (iov_iter_count(iter)) {
 		ret = iomap_apply(inode, pos, iov_iter_count(iter),
@@ -891,6 +898,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 		written += ret;
 	}
 
+	if (old_size != inode->i_size)
+		mark_inode_dirty(inode);
+
 	return written ? written : ret;
 }
 EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
@@ -961,18 +971,30 @@ int
 iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops)
 {
+	loff_t old_size;
 	loff_t ret;
 
+        /*
+	 * No need to use i_size_read() here, the i_size cannot change under us
+	 * because we hold i_rwsem.
+	 */
+	old_size = inode->i_size;
+
 	while (len) {
 		ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
 				iomap_dirty_actor);
 		if (ret <= 0)
-			return ret;
+			goto out;
 		pos += ret;
 		len -= ret;
 	}
+	ret = 0;
 
-	return 0;
+out:
+	if (old_size != inode->i_size)
+		mark_inode_dirty(inode);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_file_dirty);
 
@@ -1039,19 +1061,31 @@ int
 iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops)
 {
+	loff_t old_size;
 	loff_t ret;
 
+        /*
+	 * No need to use i_size_read() here, the i_size cannot change under us
+	 * because we hold i_rwsem.
+	 */
+	old_size = inode->i_size;
+
 	while (len > 0) {
 		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
 				ops, did_zero, iomap_zero_range_actor);
 		if (ret <= 0)
-			return ret;
+			goto out;
 
 		pos += ret;
 		len -= ret;
 	}
+	ret = 0;
 
-	return 0;
+out:
+	if (old_size != inode->i_size)
+		mark_inode_dirty(inode);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-18 14:47 [PATCH] fs: Move mark_inode_dirty out of __generic_write_end Andreas Gruenbacher
@ 2019-06-19 16:01 ` Jan Kara
  2019-06-20  4:47 ` Dave Chinner
  2019-06-24  6:54 ` Christoph Hellwig
  2 siblings, 0 replies; 11+ messages in thread
From: Jan Kara @ 2019-06-19 16:01 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-fsdevel, Jan Kara

On Tue 18-06-19 16:47:16, Andreas Gruenbacher wrote:
> Remove the mark_inode_dirty call from __generic_write_end and add it to
> generic_write_end and the high-level iomap functions where necessary.
> That way, large writes will only dirty inodes at the end instead of
> dirtying them once per page.  This fixes a performance bottleneck on
> gfs2.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

So the patch looks correct but honestly I don't like how we duplicate inode
dirtying over several places. I was wondering whether we could not move the
logic to iomap_apply() or something like that.

								Honza

> ---
>  fs/buffer.c | 26 ++++++++++++++++++--------
>  fs/iomap.c  | 42 ++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 56 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e450c55f6434..1b51003bc9d2 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2089,8 +2089,7 @@ EXPORT_SYMBOL(block_write_begin);
>  void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
>  		struct page *page)
>  {
> -	loff_t old_size = inode->i_size;
> -	bool i_size_changed = false;
> +	loff_t old_size;
>  
>  	/*
>  	 * No need to use i_size_read() here, the i_size cannot change under us
> @@ -2099,23 +2098,21 @@ void __generic_write_end(struct inode *inode, loff_t pos, unsigned copied,
>  	 * But it's important to update i_size while still holding page lock:
>  	 * page writeout could otherwise come in and zero beyond i_size.
>  	 */
> -	if (pos + copied > inode->i_size) {
> +	old_size = inode->i_size;
> +	if (pos + copied > old_size)
>  		i_size_write(inode, pos + copied);
> -		i_size_changed = true;
> -	}
>  
>  	unlock_page(page);
>  
>  	if (old_size < pos)
>  		pagecache_isize_extended(inode, old_size, pos);
> +
>  	/*
>  	 * Don't mark the inode dirty under page lock. First, it unnecessarily
>  	 * makes the holding time of page lock longer. Second, it forces lock
>  	 * ordering of page lock and transaction start for journaling
>  	 * filesystems.
>  	 */
> -	if (i_size_changed)
> -		mark_inode_dirty(inode);
>  }
>  
>  int block_write_end(struct file *file, struct address_space *mapping,
> @@ -2158,9 +2155,22 @@ int generic_write_end(struct file *file, struct address_space *mapping,
>  			loff_t pos, unsigned len, unsigned copied,
>  			struct page *page, void *fsdata)
>  {
> +	struct inode *inode = mapping->host;
> +	loff_t old_size;
> +
> +	/*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 */
> +	old_size = inode->i_size;
> +
>  	copied = block_write_end(file, mapping, pos, len, copied, page, fsdata);
> -	__generic_write_end(mapping->host, pos, copied, page);
> +	__generic_write_end(inode, pos, copied, page);
>  	put_page(page);
> +
> +	if (old_size != inode->i_size)
> +		mark_inode_dirty(inode);
> +
>  	return copied;
>  }
>  EXPORT_SYMBOL(generic_write_end);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 23ef63fd1669..9454568a7f5e 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -881,6 +881,13 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  {
>  	struct inode *inode = iocb->ki_filp->f_mapping->host;
>  	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
> +	loff_t old_size;
> +
> +        /*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 */
> +	old_size = inode->i_size;
>  
>  	while (iov_iter_count(iter)) {
>  		ret = iomap_apply(inode, pos, iov_iter_count(iter),
> @@ -891,6 +898,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  		written += ret;
>  	}
>  
> +	if (old_size != inode->i_size)
> +		mark_inode_dirty(inode);
> +
>  	return written ? written : ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
> @@ -961,18 +971,30 @@ int
>  iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops)
>  {
> +	loff_t old_size;
>  	loff_t ret;
>  
> +        /*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 */
> +	old_size = inode->i_size;
> +
>  	while (len) {
>  		ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
>  				iomap_dirty_actor);
>  		if (ret <= 0)
> -			return ret;
> +			goto out;
>  		pos += ret;
>  		len -= ret;
>  	}
> +	ret = 0;
>  
> -	return 0;
> +out:
> +	if (old_size != inode->i_size)
> +		mark_inode_dirty(inode);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_dirty);
>  
> @@ -1039,19 +1061,31 @@ int
>  iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  		const struct iomap_ops *ops)
>  {
> +	loff_t old_size;
>  	loff_t ret;
>  
> +        /*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 */
> +	old_size = inode->i_size;
> +
>  	while (len > 0) {
>  		ret = iomap_apply(inode, pos, len, IOMAP_ZERO,
>  				ops, did_zero, iomap_zero_range_actor);
>  		if (ret <= 0)
> -			return ret;
> +			goto out;
>  
>  		pos += ret;
>  		len -= ret;
>  	}
> +	ret = 0;
>  
> -	return 0;
> +out:
> +	if (old_size != inode->i_size)
> +		mark_inode_dirty(inode);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_zero_range);
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-18 14:47 [PATCH] fs: Move mark_inode_dirty out of __generic_write_end Andreas Gruenbacher
  2019-06-19 16:01 ` Jan Kara
@ 2019-06-20  4:47 ` Dave Chinner
  2019-06-24  6:54 ` Christoph Hellwig
  2 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2019-06-20  4:47 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-fsdevel, Jan Kara

On Tue, Jun 18, 2019 at 04:47:16PM +0200, Andreas Gruenbacher wrote:
> Remove the mark_inode_dirty call from __generic_write_end and add it to
> generic_write_end and the high-level iomap functions where necessary.
> That way, large writes will only dirty inodes at the end instead of
> dirtying them once per page.  This fixes a performance bottleneck on
> gfs2.
> 
> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
> ---
>  fs/buffer.c | 26 ++++++++++++++++++--------
>  fs/iomap.c  | 42 ++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 56 insertions(+), 12 deletions(-)

....

> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 23ef63fd1669..9454568a7f5e 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -881,6 +881,13 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  {
>  	struct inode *inode = iocb->ki_filp->f_mapping->host;
>  	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
> +	loff_t old_size;
> +
> +        /*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 */
> +	old_size = inode->i_size;
>  
>  	while (iov_iter_count(iter)) {
>  		ret = iomap_apply(inode, pos, iov_iter_count(iter),
> @@ -891,6 +898,9 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
>  		written += ret;
>  	}
>  
> +	if (old_size != inode->i_size)
> +		mark_inode_dirty(inode);
> +
>  	return written ? written : ret;
>  }
>  EXPORT_SYMBOL_GPL(iomap_file_buffered_write);
> @@ -961,18 +971,30 @@ int
>  iomap_file_dirty(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops)
>  {
> +	loff_t old_size;
>  	loff_t ret;
>  
> +        /*
> +	 * No need to use i_size_read() here, the i_size cannot change under us
> +	 * because we hold i_rwsem.
> +	 */
> +	old_size = inode->i_size;
> +
>  	while (len) {
>  		ret = iomap_apply(inode, pos, len, IOMAP_WRITE, ops, NULL,
>  				iomap_dirty_actor);
>  		if (ret <= 0)
> -			return ret;
> +			goto out;
>  		pos += ret;
>  		len -= ret;
>  	}
> +	ret = 0;
>  
> -	return 0;
> +out:
> +	if (old_size != inode->i_size)
> +		mark_inode_dirty(inode);


I don't think we want to do this.

The patches I have that add range locking for XFS allow buffered
writes to run concurrently with operations that change the inode
size as long as the ranges don't overlap. To do this, XFS will not
hold the i_rwsem over any iomap call it makes in future - it will
hold a range lock instead. Hence we can have writes and other IO
operations occurring at the same time some other operation is
changing the size of the file, and that means this code no longer
does what you are intending it to do because the inode->i_size is no
longer constant across these operations...

Hence I think adding code that depends on i_rwsem to be held to
function correctly is the wrong direction to be taking the iomap
infrastructure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-18 14:47 [PATCH] fs: Move mark_inode_dirty out of __generic_write_end Andreas Gruenbacher
  2019-06-19 16:01 ` Jan Kara
  2019-06-20  4:47 ` Dave Chinner
@ 2019-06-24  6:54 ` Christoph Hellwig
  2019-06-24 18:22   ` Andreas Gruenbacher
  2 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2019-06-24  6:54 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-fsdevel, Jan Kara

At least for xfs we don't need the mark_inode_dirty at all.  Can you
solve your gfs2 requirements on top of something like the patch below?
Note that in general it seems like you should try to only update the
on-disk inode size in writeback completion anyway, otherwise you can
have a stale i_size update before the data was actually written.


diff --git a/fs/iomap.c b/fs/iomap.c
index c98107a6bf81..fcf2cbd39114 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -785,6 +785,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 		unsigned copied, struct page *page, struct iomap *iomap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	loff_t old_size = inode->i_size;
 	int ret;
 
 	if (iomap->type == IOMAP_INLINE) {
@@ -796,7 +797,12 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 		ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
 	}
 
-	__generic_write_end(inode, pos, ret, page);
+	if (pos + ret > inode->i_size)
+		i_size_write(inode, pos + ret);
+	unlock_page(page);
+
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	if (page_ops && page_ops->page_done)
 		page_ops->page_done(inode, pos, copied, page, iomap);
 	put_page(page);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-24  6:54 ` Christoph Hellwig
@ 2019-06-24 18:22   ` Andreas Gruenbacher
  2019-06-25  9:57     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Gruenbacher @ 2019-06-24 18:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Gruenbacher, cluster-devel, linux-fsdevel, Jan Kara

On Mon, 24 Jun 2019 at 08:55, Christoph Hellwig <hch@lst.de> wrote:
> At least for xfs we don't need the mark_inode_dirty at all.  Can you
> solve your gfs2 requirements on top of something like the patch below?
> Note that in general it seems like you should try to only update the
> on-disk inode size in writeback completion anyway, otherwise you can
> have a stale i_size update before the data was actually written.
>
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index c98107a6bf81..fcf2cbd39114 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -785,6 +785,7 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>                 unsigned copied, struct page *page, struct iomap *iomap)
>  {
>         const struct iomap_page_ops *page_ops = iomap->page_ops;
> +       loff_t old_size = inode->i_size;
>         int ret;
>
>         if (iomap->type == IOMAP_INLINE) {
> @@ -796,7 +797,12 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
>                 ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
>         }
>
> -       __generic_write_end(inode, pos, ret, page);
> +       if (pos + ret > inode->i_size)
> +               i_size_write(inode, pos + ret);
> +       unlock_page(page);
> +
> +       if (old_size < pos)
> +               pagecache_isize_extended(inode, old_size, pos);
>         if (page_ops && page_ops->page_done)
>                 page_ops->page_done(inode, pos, copied, page, iomap);
>         put_page(page);

That would work, but I don't like how this leaves us with a vfs function
that updates i_size without bothering to dirty the inode very much.

How about if we move the __generic_write_end call into the page_done
callback and leave special handling to the filesystem code if needed
instead?  The below patch seems to work for gfs2.

Thanks,
Andreas

---
 fs/gfs2/bmap.c   | 42 ++++++++++++++++++++++++++++++++++++------
 fs/gfs2/incore.h |  1 +
 fs/iomap.c       |  5 +++--
 3 files changed, 40 insertions(+), 8 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 93ea1d529aa3..7569770e6871 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -991,10 +991,13 @@ static void gfs2_write_unlock(struct inode *inode)
 static int gfs2_iomap_page_prepare(struct inode *inode, loff_t pos,
 				   unsigned len, struct iomap *iomap)
 {
-	unsigned int blockmask = i_blocksize(inode) - 1;
+	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
-	unsigned int blocks;
+	unsigned int blockmask, blocks;
 
+	if (!(gfs2_is_stuffed(ip) || gfs2_is_jdata(ip)))
+		return 0;
+	blockmask = i_blocksize(inode) - 1;
 	blocks = ((pos & blockmask) + len + blockmask) >> inode->i_blkbits;
 	return gfs2_trans_begin(sdp, RES_DINODE + blocks, 0);
 }
@@ -1005,10 +1008,33 @@ static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
 {
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
+	loff_t old_size;
+
+	if (!page)
+		goto out;
 
-	if (page && !gfs2_is_stuffed(ip))
+	/*
+	 * Avoid calling __generic_write_end here to prevent mark_inode_dirty
+	 * from being called for each page: it's relatively expensive on gfs2,
+	 * so we defer that to gfs2_iomap_end.
+	 */
+	old_size = inode->i_size;
+	if (pos + copied > old_size) {
+		i_size_write(inode, pos + copied);
+		set_bit(GIF_SIZE_CHANGED, &ip->i_flags);
+	}
+
+	unlock_page(page);
+
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
+
+	if (gfs2_is_jdata(ip) && !gfs2_is_stuffed(ip))
 		gfs2_page_add_databufs(ip, page, offset_in_page(pos), copied);
-	gfs2_trans_end(sdp);
+
+out:
+	if (current->journal_info)
+		gfs2_trans_end(sdp);
 }
 
 static const struct iomap_page_ops gfs2_iomap_page_ops = {
@@ -1106,8 +1132,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 		gfs2_trans_end(sdp);
 	}
 
-	if (gfs2_is_stuffed(ip) || gfs2_is_jdata(ip))
-		iomap->page_ops = &gfs2_iomap_page_ops;
+	iomap->page_ops = &gfs2_iomap_page_ops;
 	return 0;
 
 out_trans_end:
@@ -1160,6 +1185,11 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) != IOMAP_WRITE)
 		goto out;
 
+	if (test_bit(GIF_SIZE_CHANGED, &ip->i_flags)) {
+		clear_bit(GIF_SIZE_CHANGED, &ip->i_flags);
+		mark_inode_dirty(inode);
+	}
+
 	if (!gfs2_is_stuffed(ip))
 		gfs2_ordered_add_inode(ip);
 
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index c9af93ac6c73..9f620807b396 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -396,6 +396,7 @@ enum {
 	GIF_ORDERED		= 4,
 	GIF_FREE_VFS_INODE      = 5,
 	GIF_GLOP_PENDING	= 6,
+	GIF_SIZE_CHANGED	= 7,
 };
 
 struct gfs2_inode {
diff --git a/fs/iomap.c b/fs/iomap.c
index 12654c2e78f8..b5c761827966 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -788,9 +788,10 @@ iomap_write_end(struct inode *inode, loff_t pos, unsigned len,
 		ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
 	}
 
-	__generic_write_end(inode, pos, ret, page);
 	if (page_ops && page_ops->page_done)
-		page_ops->page_done(inode, pos, copied, page, iomap);
+		page_ops->page_done(inode, pos, ret, page, iomap);
+	else
+		__generic_write_end(inode, pos, ret, page);
 	put_page(page);
 
 	if (ret < len)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-24 18:22   ` Andreas Gruenbacher
@ 2019-06-25  9:57     ` Christoph Hellwig
  2019-06-25 10:50       ` Christoph Hellwig
  2019-06-25 15:00       ` Andreas Gruenbacher
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2019-06-25  9:57 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-fsdevel, Jan Kara

On Mon, Jun 24, 2019 at 08:22:43PM +0200, Andreas Gruenbacher wrote:
> That would work, but I don't like how this leaves us with a vfs function
> that updates i_size without bothering to dirty the inode very much.

This isn't a VFS function, it is a helper library.

> 
> How about if we move the __generic_write_end call into the page_done
> callback and leave special handling to the filesystem code if needed
> instead?  The below patch seems to work for gfs2.

That seems way more complicated.  I'd much rather go with something
like may patch plus maybe a big fat comment explaining that persisting
the size update is the file systems job.  Note that a lot of the modern
file systems don't use the VFS inode tracking for that, besides XFS
that includes at least btrfs and ocfs2 as well.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-25  9:57     ` Christoph Hellwig
@ 2019-06-25 10:50       ` Christoph Hellwig
  2019-06-25 18:13         ` Andreas Gruenbacher
  2019-06-25 15:00       ` Andreas Gruenbacher
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2019-06-25 10:50 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-fsdevel, Jan Kara

> That seems way more complicated.  I'd much rather go with something
> like may patch plus maybe a big fat comment explaining that persisting
> the size update is the file systems job.  Note that a lot of the modern
> file systems don't use the VFS inode tracking for that, besides XFS
> that includes at least btrfs and ocfs2 as well.

I'd suggest something like this as the baseline:

http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/iomap-i_size

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-25  9:57     ` Christoph Hellwig
  2019-06-25 10:50       ` Christoph Hellwig
@ 2019-06-25 15:00       ` Andreas Gruenbacher
  1 sibling, 0 replies; 11+ messages in thread
From: Andreas Gruenbacher @ 2019-06-25 15:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: cluster-devel, linux-fsdevel, Jan Kara

On Tue, 25 Jun 2019 at 11:57, Christoph Hellwig <hch@lst.de> wrote:
> On Mon, Jun 24, 2019 at 08:22:43PM +0200, Andreas Gruenbacher wrote:
> > That would work, but I don't like how this leaves us with a vfs function
> > that updates i_size without bothering to dirty the inode very much.
>
> This isn't a VFS function, it is a helper library.

Well, let's call it that if this suits you better.

> > How about if we move the __generic_write_end call into the page_done
> > callback and leave special handling to the filesystem code if needed
> > instead?  The below patch seems to work for gfs2.
>
> That seems way more complicated.  I'd much rather go with something
> like may patch plus maybe a big fat comment explaining that persisting
> the size update is the file systems job.  Note that a lot of the modern
> file systems don't use the VFS inode tracking for that, besides XFS
> that includes at least btrfs and ocfs2 as well.

Andreas

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-25 10:50       ` Christoph Hellwig
@ 2019-06-25 18:13         ` Andreas Gruenbacher
  2019-06-26  6:03           ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Andreas Gruenbacher @ 2019-06-25 18:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andreas Gruenbacher, cluster-devel, linux-fsdevel, Jan Kara

On Tue, 25 Jun 2019 at 12:50, Christoph Hellwig <hch@lst.de> wrote:
>
> > That seems way more complicated.  I'd much rather go with something
> > like may patch plus maybe a big fat comment explaining that persisting
> > the size update is the file systems job.  Note that a lot of the modern
> > file systems don't use the VFS inode tracking for that, besides XFS
> > that includes at least btrfs and ocfs2 as well.
>
> I'd suggest something like this as the baseline:
>
> http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/iomap-i_size

Alright, can we change this as follows?

[Also, I'm not really sure why we check for (pos + ret > inode->i_size)
when we have already read inode->i_size into old_size.]

Thanks,
Andreas

--

iomap: don't mark the inode dirty in iomap_write_end

Marking the inode dirty for each page copied into the page cache
can be very ineffcient for file systems that use the VFS dirty
inode tracking, and is completely pointless for those that don't
use the VFS dirty tracking.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/bmap.c        |  2 ++
 fs/iomap.c            | 15 ++++++++++++++-
 include/linux/iomap.h |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 93ea1d5..f4b895f 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1182,6 +1182,8 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 
 	if (ip->i_qadata && ip->i_qadata->qa_qd_num)
 		gfs2_quota_unlock(ip);
+	if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+		mark_inode_dirty(inode);
 	gfs2_write_unlock(inode);
 
 out:
diff --git a/fs/iomap.c b/fs/iomap.c
index 12654c2..cd24bd1 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -777,6 +777,7 @@ struct iomap_readpage_ctx {
 		unsigned copied, struct page *page, struct iomap *iomap)
 {
 	const struct iomap_page_ops *page_ops = iomap->page_ops;
+	loff_t old_size = inode->i_size;
 	int ret;
 
 	if (iomap->type == IOMAP_INLINE) {
@@ -788,7 +789,19 @@ struct iomap_readpage_ctx {
 		ret = __iomap_write_end(inode, pos, len, copied, page, iomap);
 	}
 
-	__generic_write_end(inode, pos, ret, page);
+	/*
+	 * Update the in-memory inode size after copying the data into the page
+	 * cache.  It's up to the file system to write the updated size to disk,
+	 * preferably after I/O completion so that no stale data is exposed.
+	 */
+	if (pos + ret > inode->i_size) {
+		i_size_write(inode, pos + ret);
+		iomap->flags |= IOMAP_F_SIZE_CHANGED;
+	}
+	unlock_page(page);
+
+	if (old_size < pos)
+		pagecache_isize_extended(inode, old_size, pos);
 	if (page_ops && page_ops->page_done)
 		page_ops->page_done(inode, pos, copied, page, iomap);
 	put_page(page);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 2103b94..1df9ea1 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -35,6 +35,7 @@
 #define IOMAP_F_NEW		0x01	/* blocks have been newly allocated */
 #define IOMAP_F_DIRTY		0x02	/* uncommitted metadata */
 #define IOMAP_F_BUFFER_HEAD	0x04	/* file system requires buffer heads */
+#define IOMAP_F_SIZE_CHANGED	0x08	/* file size has changed */
 
 /*
  * Flags that only need to be reported for IOMAP_REPORT requests:
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-25 18:13         ` Andreas Gruenbacher
@ 2019-06-26  6:03           ` Christoph Hellwig
  2019-06-26 12:07             ` Andreas Gruenbacher
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2019-06-26  6:03 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, cluster-devel, linux-fsdevel, Jan Kara

On Tue, Jun 25, 2019 at 08:13:29PM +0200, Andreas Gruenbacher wrote:
> On Tue, 25 Jun 2019 at 12:50, Christoph Hellwig <hch@lst.de> wrote:
> >
> > > That seems way more complicated.  I'd much rather go with something
> > > like may patch plus maybe a big fat comment explaining that persisting
> > > the size update is the file systems job.  Note that a lot of the modern
> > > file systems don't use the VFS inode tracking for that, besides XFS
> > > that includes at least btrfs and ocfs2 as well.
> >
> > I'd suggest something like this as the baseline:
> >
> > http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/iomap-i_size
> 
> Alright, can we change this as follows?
> 
> [Also, I'm not really sure why we check for (pos + ret > inode->i_size)
> when we have already read inode->i_size into old_size.]

Yeah, you probably want to change that to old_size.  Your changes look
good to me,

Can you just take the patch over from here as you've clearly done more
work on it and resend the whole series?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Move mark_inode_dirty out of __generic_write_end
  2019-06-26  6:03           ` Christoph Hellwig
@ 2019-06-26 12:07             ` Andreas Gruenbacher
  0 siblings, 0 replies; 11+ messages in thread
From: Andreas Gruenbacher @ 2019-06-26 12:07 UTC (permalink / raw)
  To: Christoph Hellwig, Darrick J. Wong; +Cc: cluster-devel, linux-fsdevel, Jan Kara

On Wed, 26 Jun 2019 at 08:04, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Jun 25, 2019 at 08:13:29PM +0200, Andreas Gruenbacher wrote:
> > On Tue, 25 Jun 2019 at 12:50, Christoph Hellwig <hch@lst.de> wrote:
> > >
> > > > That seems way more complicated.  I'd much rather go with something
> > > > like may patch plus maybe a big fat comment explaining that persisting
> > > > the size update is the file systems job.  Note that a lot of the modern
> > > > file systems don't use the VFS inode tracking for that, besides XFS
> > > > that includes at least btrfs and ocfs2 as well.
> > >
> > > I'd suggest something like this as the baseline:
> > >
> > > http://git.infradead.org/users/hch/xfs.git/shortlog/refs/heads/iomap-i_size
> >
> > Alright, can we change this as follows?
> >
> > [Also, I'm not really sure why we check for (pos + ret > inode->i_size)
> > when we have already read inode->i_size into old_size.]
>
> Yeah, you probably want to change that to old_size.  Your changes look
> good to me,
>
> Can you just take the patch over from here as you've clearly done more
> work on it and resend the whole series?

Ok, done. It's just the two patches; passes xfstests for xfs and gfs2,
the current users.

Darrick, can you please push this in the next merge window as "usual"?

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-06-26 12:07 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-18 14:47 [PATCH] fs: Move mark_inode_dirty out of __generic_write_end Andreas Gruenbacher
2019-06-19 16:01 ` Jan Kara
2019-06-20  4:47 ` Dave Chinner
2019-06-24  6:54 ` Christoph Hellwig
2019-06-24 18:22   ` Andreas Gruenbacher
2019-06-25  9:57     ` Christoph Hellwig
2019-06-25 10:50       ` Christoph Hellwig
2019-06-25 18:13         ` Andreas Gruenbacher
2019-06-26  6:03           ` Christoph Hellwig
2019-06-26 12:07             ` Andreas Gruenbacher
2019-06-25 15:00       ` Andreas Gruenbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).