[PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read

linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read
@ 2021-04-12 10:23 Jan Kara
  2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara

Hello,

This patch series fixes a bug manifesting as occasional generic/418 failure
when direct IO write extending a file races with buffered read. Patch 1
extends iomap DIO code to pass original IO size to ->end_dio handler so that
ext4 can tell whether everything has succeeded and we don't need expensive
DIO cleanup (possible truncate of leftover blocks beyond i_size). Patch 2
fixes the ext4 bug, patch 3 fixes unrelated problem I've found in ext4 DIO
code.

								Honza

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] iomap: Pass original DIO size to completion handler
  2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara
@ 2021-04-12 10:23 ` Jan Kara
  2021-04-12 14:07   ` kernel test robot
                     ` (2 more replies)
  2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara
  2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
  2 siblings, 3 replies; 13+ messages in thread
From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara

When extending a file with direct IO write, ext4 needs to know whether IO
has succeeded for the whole range that was prepared for it (the common fast
path). In that case we can piggy back the orphan list update with the
inode size update. In case write was actually shorter than originally
intended, we must leave inode on the orphan list until we cleanup blocks
beyond i_size. So provide the original IO size to the direct IO ->end_io
handler.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c        | 3 ++-
 fs/iomap/direct-io.c  | 5 ++++-
 fs/xfs/xfs_file.c     | 1 +
 fs/zonefs/super.c     | 3 ++-
 include/linux/iomap.h | 4 ++--
 5 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 194f5d00fa32..2505313d96b0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -369,7 +369,8 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 }
 
 static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
-				 int error, unsigned int flags)
+				 ssize_t orig_size, int error,
+				 unsigned int flags)
 {
 	loff_t offset = iocb->ki_pos;
 	struct inode *inode = file_inode(iocb->ki_filp);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bdd0d89bbf0a..a4bbf22f69bc 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -28,6 +28,7 @@ struct iomap_dio {
 	const struct iomap_dio_ops *dops;
 	loff_t			i_size;
 	loff_t			size;
+	loff_t			orig_size;
 	atomic_t		ref;
 	unsigned		flags;
 	int			error;
@@ -85,7 +86,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
 	ssize_t ret = dio->error;
 
 	if (dops && dops->end_io)
-		ret = dops->end_io(iocb, dio->size, ret, dio->flags);
+		ret = dops->end_io(iocb, dio->size, dio->orig_size, ret,
+				   dio->flags);
 
 	if (likely(!ret)) {
 		ret = dio->size;
@@ -473,6 +475,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	dio->iocb = iocb;
 	atomic_set(&dio->ref, 1);
 	dio->size = 0;
+	dio->orig_size = count;
 	dio->i_size = i_size_read(inode);
 	dio->dops = dops;
 	dio->error = 0;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..ed23ed56e345 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -442,6 +442,7 @@ static int
 xfs_dio_write_end_io(
 	struct kiocb		*iocb,
 	ssize_t			size,
+	ssize_t			orig_size,
 	int			error,
 	unsigned		flags)
 {
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 049e36c69ed7..e3e0ee4b8c6e 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -652,7 +652,8 @@ static loff_t zonefs_file_llseek(struct file *file, loff_t offset, int whence)
 }
 
 static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
-					int error, unsigned int flags)
+					ssize_t orig_size, int error,
+					unsigned int flags)
 {
 	struct inode *inode = file_inode(iocb->ki_filp);
 	struct zonefs_inode_info *zi = ZONEFS_I(inode);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index d202fd2d0f91..b175641e9540 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -252,8 +252,8 @@ int iomap_writepages(struct address_space *mapping,
 #define IOMAP_DIO_COW		(1 << 1)	/* covers COW extent(s) */
 
 struct iomap_dio_ops {
-	int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
-		      unsigned flags);
+	int (*end_io)(struct kiocb *iocb, ssize_t size, ssize_t orig_size,
+		      int error, unsigned flags);
 	blk_qc_t (*submit_io)(struct inode *inode, struct iomap *iomap,
 			struct bio *bio, loff_t file_offset);
 };
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler
  2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
@ 2021-04-12 14:07   ` kernel test robot
  2021-04-12 14:12   ` kernel test robot
  2021-04-12 16:37   ` kernel test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2021-04-12 14:07 UTC (permalink / raw)
  To: Jan Kara, Ted Tso
  Cc: kbuild-all, clang-built-linux, linux-ext4, Eric Whitney,
	linux-fsdevel, Darrick J . Wong, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 7126 bytes --]

Hi Jan,

I love your patch! Yet something to improve:

[auto build test ERROR on ext4/dev]
[also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: riscv-randconfig-r004-20210412 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 9829f5e6b1bca9b61efc629770d28bb9014dec45)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install riscv cross compiling tool for clang build
        # apt-get install binutils-riscv64-linux-gnu
        # https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
        git checkout 0d289243d061378ac42188ff5079287885575bb3
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=riscv 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/zonefs/super.c:732:49: error: too few arguments to function call, expected 5, have 4
           zonefs_file_write_dio_end_io(iocb, size, ret, 0);
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~                   ^
   fs/zonefs/super.c:654:12: note: 'zonefs_file_write_dio_end_io' declared here
   static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
              ^
>> fs/zonefs/super.c:961:14: error: incompatible function pointer types initializing 'int (*)(struct kiocb *, ssize_t, ssize_t, int, unsigned int)' (aka 'int (*)(struct kiocb *, long, long, int, unsigned int)') with an expression of type 'int (struct kiocb *, ssize_t, int, unsigned int)' (aka 'int (struct kiocb *, long, int, unsigned int)') [-Werror,-Wincompatible-function-pointer-types]
           .end_io                 = zonefs_file_read_dio_end_io,
                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   2 errors generated.


vim +732 fs/zonefs/super.c

8dcc1a9d90c10f Damien Le Moal     2019-12-25  688  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  689  static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  690  {
02ef12a663c7ac Johannes Thumshirn 2020-05-12  691  	struct inode *inode = file_inode(iocb->ki_filp);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  692  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  693  	struct block_device *bdev = inode->i_sb->s_bdev;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  694  	unsigned int max;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  695  	struct bio *bio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  696  	ssize_t size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  697  	int nr_pages;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  698  	ssize_t ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  699  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  700  	max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
02ef12a663c7ac Johannes Thumshirn 2020-05-12  701  	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  702  	iov_iter_truncate(from, max);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  703  
a8affc03a9b375 Christoph Hellwig  2021-03-11  704  	nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
89ee72376be23a Johannes Thumshirn 2020-07-16  705  	if (!nr_pages)
89ee72376be23a Johannes Thumshirn 2020-07-16  706  		return 0;
89ee72376be23a Johannes Thumshirn 2020-07-16  707  
f91ca2a370bec5 Christoph Hellwig  2021-01-26  708  	bio = bio_alloc(GFP_NOFS, nr_pages);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  709  	if (!bio)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  710  		return -ENOMEM;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  711  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  712  	bio_set_dev(bio, bdev);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  713  	bio->bi_iter.bi_sector = zi->i_zsector;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  714  	bio->bi_write_hint = iocb->ki_hint;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  715  	bio->bi_ioprio = iocb->ki_ioprio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  716  	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  717  	if (iocb->ki_flags & IOCB_DSYNC)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  718  		bio->bi_opf |= REQ_FUA;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  719  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  720  	ret = bio_iov_iter_get_pages(bio, from);
6bea0225a4bf14 Damien Le Moal     2020-12-09  721  	if (unlikely(ret))
6bea0225a4bf14 Damien Le Moal     2020-12-09  722  		goto out_release;
6bea0225a4bf14 Damien Le Moal     2020-12-09  723  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  724  	size = bio->bi_iter.bi_size;
6bea0225a4bf14 Damien Le Moal     2020-12-09  725  	task_io_account_write(size);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  726  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  727  	if (iocb->ki_flags & IOCB_HIPRI)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  728  		bio_set_polled(bio, iocb);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  729  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  730  	ret = submit_bio_wait(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  731  
6bea0225a4bf14 Damien Le Moal     2020-12-09 @732  	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
62ab1aadcccd03 Johannes Thumshirn 2021-01-27  733  	trace_zonefs_file_dio_append(inode, size, ret);
6bea0225a4bf14 Damien Le Moal     2020-12-09  734  
6bea0225a4bf14 Damien Le Moal     2020-12-09  735  out_release:
6bea0225a4bf14 Damien Le Moal     2020-12-09  736  	bio_release_pages(bio, false);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  737  	bio_put(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  738  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  739  	if (ret >= 0) {
02ef12a663c7ac Johannes Thumshirn 2020-05-12  740  		iocb->ki_pos += size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  741  		return size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  742  	}
02ef12a663c7ac Johannes Thumshirn 2020-05-12  743  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  744  	return ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  745  }
02ef12a663c7ac Johannes Thumshirn 2020-05-12  746  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 33984 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler
  2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
  2021-04-12 14:07   ` kernel test robot
@ 2021-04-12 14:12   ` kernel test robot
  2021-04-12 16:37   ` kernel test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2021-04-12 14:12 UTC (permalink / raw)
  To: Jan Kara, Ted Tso
  Cc: kbuild-all, linux-ext4, Eric Whitney, linux-fsdevel,
	Darrick J . Wong, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 2980 bytes --]

Hi Jan,

I love your patch! Yet something to improve:

[auto build test ERROR on ext4/dev]
[also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210409]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: um-allmodconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
        git checkout 0d289243d061378ac42188ff5079287885575bb3
        # save the attached .config to linux build tree
        make W=1 ARCH=um 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   cc1: warning: arch/um/include/uapi: No such file or directory [-Wmissing-include-dirs]
   fs/zonefs/super.c: In function 'zonefs_file_dio_append':
   fs/zonefs/super.c:732:2: error: too few arguments to function 'zonefs_file_write_dio_end_io'
     732 |  zonefs_file_write_dio_end_io(iocb, size, ret, 0);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/zonefs/super.c:654:12: note: declared here
     654 | static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/zonefs/super.c: At top level:
>> fs/zonefs/super.c:961:14: error: initialization of 'int (*)(struct kiocb *, ssize_t,  ssize_t,  int,  unsigned int)' {aka 'int (*)(struct kiocb *, long int,  long int,  int,  unsigned int)'} from incompatible pointer type 'int (*)(struct kiocb *, ssize_t,  int,  unsigned int)' {aka 'int (*)(struct kiocb *, long int,  int,  unsigned int)'} [-Werror=incompatible-pointer-types]
     961 |  .end_io   = zonefs_file_read_dio_end_io,
         |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/zonefs/super.c:961:14: note: (near initialization for 'zonefs_read_dio_ops.end_io')
   cc1: some warnings being treated as errors


vim +961 fs/zonefs/super.c

8dcc1a9d90c10fa Damien Le Moal 2019-12-25  959  
8dcc1a9d90c10fa Damien Le Moal 2019-12-25  960  static const struct iomap_dio_ops zonefs_read_dio_ops = {
8dcc1a9d90c10fa Damien Le Moal 2019-12-25 @961  	.end_io			= zonefs_file_read_dio_end_io,
8dcc1a9d90c10fa Damien Le Moal 2019-12-25  962  };
8dcc1a9d90c10fa Damien Le Moal 2019-12-25  963  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 24378 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] iomap: Pass original DIO size to completion handler
  2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
  2021-04-12 14:07   ` kernel test robot
  2021-04-12 14:12   ` kernel test robot
@ 2021-04-12 16:37   ` kernel test robot
  2 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2021-04-12 16:37 UTC (permalink / raw)
  To: Jan Kara, Ted Tso
  Cc: kbuild-all, linux-ext4, Eric Whitney, linux-fsdevel,
	Darrick J . Wong, Jan Kara

[-- Attachment #1: Type: text/plain, Size: 7175 bytes --]

Hi Jan,

I love your patch! Yet something to improve:

[auto build test ERROR on ext4/dev]
[also build test ERROR on xfs-linux/for-next linus/master v5.12-rc7 next-20210412]
[cannot apply to tytso-fscrypt/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: openrisc-randconfig-r032-20210412 (attached as .config)
compiler: or1k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/0d289243d061378ac42188ff5079287885575bb3
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jan-Kara/ext4-Fix-data-corruption-when-extending-DIO-write-races-with-buffered-read/20210412-182524
        git checkout 0d289243d061378ac42188ff5079287885575bb3
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=openrisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   fs/zonefs/super.c: In function 'zonefs_file_dio_append':
>> fs/zonefs/super.c:732:2: error: too few arguments to function 'zonefs_file_write_dio_end_io'
     732 |  zonefs_file_write_dio_end_io(iocb, size, ret, 0);
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/zonefs/super.c:654:12: note: declared here
     654 | static int zonefs_file_write_dio_end_io(struct kiocb *iocb, ssize_t size,
         |            ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/zonefs/super.c: At top level:
>> fs/zonefs/super.c:961:14: error: initialization of 'int (*)(struct kiocb *, ssize_t,  ssize_t,  int,  unsigned int)' {aka 'int (*)(struct kiocb *, int,  int,  int,  unsigned int)'} from incompatible pointer type 'int (*)(struct kiocb *, ssize_t,  int,  unsigned int)' {aka 'int (*)(struct kiocb *, int,  int,  unsigned int)'} [-Werror=incompatible-pointer-types]
     961 |  .end_io   = zonefs_file_read_dio_end_io,
         |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/zonefs/super.c:961:14: note: (near initialization for 'zonefs_read_dio_ops.end_io')
   cc1: some warnings being treated as errors


vim +/zonefs_file_write_dio_end_io +732 fs/zonefs/super.c

8dcc1a9d90c10f Damien Le Moal     2019-12-25  688  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  689  static ssize_t zonefs_file_dio_append(struct kiocb *iocb, struct iov_iter *from)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  690  {
02ef12a663c7ac Johannes Thumshirn 2020-05-12  691  	struct inode *inode = file_inode(iocb->ki_filp);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  692  	struct zonefs_inode_info *zi = ZONEFS_I(inode);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  693  	struct block_device *bdev = inode->i_sb->s_bdev;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  694  	unsigned int max;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  695  	struct bio *bio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  696  	ssize_t size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  697  	int nr_pages;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  698  	ssize_t ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  699  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  700  	max = queue_max_zone_append_sectors(bdev_get_queue(bdev));
02ef12a663c7ac Johannes Thumshirn 2020-05-12  701  	max = ALIGN_DOWN(max << SECTOR_SHIFT, inode->i_sb->s_blocksize);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  702  	iov_iter_truncate(from, max);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  703  
a8affc03a9b375 Christoph Hellwig  2021-03-11  704  	nr_pages = iov_iter_npages(from, BIO_MAX_VECS);
89ee72376be23a Johannes Thumshirn 2020-07-16  705  	if (!nr_pages)
89ee72376be23a Johannes Thumshirn 2020-07-16  706  		return 0;
89ee72376be23a Johannes Thumshirn 2020-07-16  707  
f91ca2a370bec5 Christoph Hellwig  2021-01-26  708  	bio = bio_alloc(GFP_NOFS, nr_pages);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  709  	if (!bio)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  710  		return -ENOMEM;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  711  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  712  	bio_set_dev(bio, bdev);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  713  	bio->bi_iter.bi_sector = zi->i_zsector;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  714  	bio->bi_write_hint = iocb->ki_hint;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  715  	bio->bi_ioprio = iocb->ki_ioprio;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  716  	bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC | REQ_IDLE;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  717  	if (iocb->ki_flags & IOCB_DSYNC)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  718  		bio->bi_opf |= REQ_FUA;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  719  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  720  	ret = bio_iov_iter_get_pages(bio, from);
6bea0225a4bf14 Damien Le Moal     2020-12-09  721  	if (unlikely(ret))
6bea0225a4bf14 Damien Le Moal     2020-12-09  722  		goto out_release;
6bea0225a4bf14 Damien Le Moal     2020-12-09  723  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  724  	size = bio->bi_iter.bi_size;
6bea0225a4bf14 Damien Le Moal     2020-12-09  725  	task_io_account_write(size);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  726  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  727  	if (iocb->ki_flags & IOCB_HIPRI)
02ef12a663c7ac Johannes Thumshirn 2020-05-12  728  		bio_set_polled(bio, iocb);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  729  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  730  	ret = submit_bio_wait(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  731  
6bea0225a4bf14 Damien Le Moal     2020-12-09 @732  	zonefs_file_write_dio_end_io(iocb, size, ret, 0);
62ab1aadcccd03 Johannes Thumshirn 2021-01-27  733  	trace_zonefs_file_dio_append(inode, size, ret);
6bea0225a4bf14 Damien Le Moal     2020-12-09  734  
6bea0225a4bf14 Damien Le Moal     2020-12-09  735  out_release:
6bea0225a4bf14 Damien Le Moal     2020-12-09  736  	bio_release_pages(bio, false);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  737  	bio_put(bio);
02ef12a663c7ac Johannes Thumshirn 2020-05-12  738  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  739  	if (ret >= 0) {
02ef12a663c7ac Johannes Thumshirn 2020-05-12  740  		iocb->ki_pos += size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  741  		return size;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  742  	}
02ef12a663c7ac Johannes Thumshirn 2020-05-12  743  
02ef12a663c7ac Johannes Thumshirn 2020-05-12  744  	return ret;
02ef12a663c7ac Johannes Thumshirn 2020-05-12  745  }
02ef12a663c7ac Johannes Thumshirn 2020-05-12  746  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21556 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 2/3] ext4: Fix occasional generic/418 failure
  2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara
  2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
@ 2021-04-12 10:23 ` Jan Kara
  2021-04-12 21:50   ` Dave Chinner
  2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
  2 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara

Eric has noticed that after pagecache read rework, generic/418 is
occasionally failing for ext4 when blocksize < pagesize. In fact, the
pagecache rework just made hard to hit race in ext4 more likely. The
problem is that since ext4 conversion of direct IO writes to iomap
framework (commit 378f32bab371), we update inode size after direct IO
write only after invalidating page cache. Thus if buffered read sneaks
at unfortunate moment like:

CPU1 - write at offset 1k                       CPU2 - read from offset 0
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
                                                ext4_readpage();
ext4_handle_inode_extension()

the read will zero out tail of the page as it still sees smaller inode
size and thus page cache becomes inconsistent with on-disk contents with
all the consequences.

Fix the problem by moving inode size update into end_io handler which
gets called before the page cache is invalidated.

Reported-by: Eric Whitney <enwlinux@gmail.com>
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 185 ++++++++++++++++++++++++++-----------------------
 1 file changed, 99 insertions(+), 86 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2505313d96b0..ec59ea078f53 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -280,11 +280,67 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	return ret;
 }
 
+static int ext4_prepare_dio_extend(struct inode *inode)
+{
+	handle_t *handle;
+	int ret;
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	ext4_fc_start_update(inode);
+	ret = ext4_orphan_add(handle, inode);
+	ext4_fc_stop_update(inode);
+	if (ret) {
+		ext4_journal_stop(handle);
+		return ret;
+	}
+	ext4_journal_stop(handle);
+	return 0;
+}
+
+static void ext4_cleanup_dio_extend(struct inode *inode, loff_t offset,
+				    ssize_t written, size_t count)
+{
+	handle_t *handle;
+	u8 blkbits = inode->i_blkbits;
+	ext4_lblk_t written_blk, end_blk;
+
+	written_blk = ALIGN(offset + written, 1 << blkbits);
+	end_blk = ALIGN(offset + count, 1 << blkbits);
+	if (written_blk < end_blk && ext4_can_truncate(inode)) {
+		ext4_truncate_failed_write(inode);
+		/*
+		 * If the truncate operation failed early, then the inode may
+		 * still be on the orphan list. In that case, we need to try
+		 * remove the inode from the in-memory linked list.
+		 */
+		if (inode->i_nlink)
+			ext4_orphan_del(NULL, inode);
+		return;
+	}
+
+	/*
+	 * We need to ensure that the inode is removed from the orphan
+	 * list if it has been added prematurely, due to writeback of
+	 * delalloc blocks.
+	 */
+	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+		if (IS_ERR(handle)) {
+			ext4_orphan_del(NULL, inode);
+			return;
+		}
+		ext4_orphan_del(handle, inode);
+		ext4_journal_stop(handle);
+	}
+}
+
 static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 					   ssize_t written, size_t count)
 {
 	handle_t *handle;
-	bool truncate = false;
 	u8 blkbits = inode->i_blkbits;
 	ext4_lblk_t written_blk, end_blk;
 	int ret;
@@ -298,73 +354,31 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 	 * as much as we intended.
 	 */
 	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
-	if (offset + count <= EXT4_I(inode)->i_disksize) {
-		/*
-		 * We need to ensure that the inode is removed from the orphan
-		 * list if it has been added prematurely, due to writeback of
-		 * delalloc blocks.
-		 */
-		if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
-			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-
-			if (IS_ERR(handle)) {
-				ext4_orphan_del(NULL, inode);
-				return PTR_ERR(handle);
-			}
-
-			ext4_orphan_del(handle, inode);
-			ext4_journal_stop(handle);
-		}
-
+	if (offset + count <= EXT4_I(inode)->i_disksize)
 		return written;
-	}
-
-	if (written < 0)
-		goto truncate;
 
 	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-	if (IS_ERR(handle)) {
-		written = PTR_ERR(handle);
-		goto truncate;
-	}
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
 
 	if (ext4_update_inode_size(inode, offset + written)) {
 		ret = ext4_mark_inode_dirty(handle, inode);
 		if (unlikely(ret)) {
-			written = ret;
 			ext4_journal_stop(handle);
-			goto truncate;
+			return ret;
 		}
 	}
 
 	/*
-	 * We may need to truncate allocated but not written blocks beyond EOF.
+	 * If there cannot be allocated but unused blocks beyond EOF, remove
+	 * the inode from the orphan list as we are done with it.
 	 */
 	written_blk = ALIGN(offset + written, 1 << blkbits);
 	end_blk = ALIGN(offset + count, 1 << blkbits);
-	if (written_blk < end_blk && ext4_can_truncate(inode))
-		truncate = true;
-
-	/*
-	 * Remove the inode from the orphan list if it has been extended and
-	 * everything went OK.
-	 */
-	if (!truncate && inode->i_nlink)
+	if (written_blk == end_blk && inode->i_nlink)
 		ext4_orphan_del(handle, inode);
 	ext4_journal_stop(handle);
 
-	if (truncate) {
-truncate:
-		ext4_truncate_failed_write(inode);
-		/*
-		 * If the truncate operation failed early, then the inode may
-		 * still be on the orphan list. In that case, we need to try
-		 * remove the inode from the in-memory linked list.
-		 */
-		if (inode->i_nlink)
-			ext4_orphan_del(NULL, inode);
-	}
-
 	return written;
 }
 
@@ -385,10 +399,28 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
 	return 0;
 }
 
+static int ext4_dio_extending_write_end_io(struct kiocb *iocb, ssize_t size,
+					   ssize_t orig_size, int error,
+					   unsigned int flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	int ret;
+
+	ret = ext4_dio_write_end_io(iocb, size, orig_size, error, flags);
+	if (ret < 0)
+		return ret;
+	return ext4_handle_inode_extension(inode, iocb->ki_pos, size,
+					   orig_size);
+}
+
 static const struct iomap_dio_ops ext4_dio_write_ops = {
 	.end_io = ext4_dio_write_end_io,
 };
 
+static const struct iomap_dio_ops ext4_dio_extending_write_ops = {
+	.end_io = ext4_dio_extending_write_end_io,
+};
+
 /*
  * The intention here is to start with shared lock acquired then see if any
  * condition requires an exclusive inode lock. If yes, then we restart the
@@ -455,11 +487,11 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	ssize_t ret;
-	handle_t *handle;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
 	size_t count = iov_iter_count(from);
 	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
+	const struct iomap_dio_ops *dio_ops = &ext4_dio_write_ops;
 	bool extend = false, unaligned_io = false;
 	bool ilock_shared = true;
 
@@ -530,33 +562,21 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		inode_dio_wait(inode);
 
 	if (extend) {
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
+		ret = ext4_prepare_dio_extend(inode);
+		if (ret)
 			goto out;
-		}
-
-		ext4_fc_start_update(inode);
-		ret = ext4_orphan_add(handle, inode);
-		ext4_fc_stop_update(inode);
-		if (ret) {
-			ext4_journal_stop(handle);
-			goto out;
-		}
-
-		ext4_journal_stop(handle);
 	}
 
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
-	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
-	if (ret == -ENOTBLK)
-		ret = 0;
-
 	if (extend)
-		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+		dio_ops = &ext4_dio_extending_write_ops;
 
+	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
+			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
+	if (ret == -ENOTBLK)
+		ret = 0;
+	ext4_cleanup_dio_extend(inode, offset, ret, count);
 out:
 	if (ilock_shared)
 		inode_unlock_shared(inode);
@@ -599,7 +619,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	size_t count;
 	loff_t offset;
-	handle_t *handle;
 	bool extend = false;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
@@ -618,26 +637,20 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	count = iov_iter_count(from);
 
 	if (offset + count > EXT4_I(inode)->i_disksize) {
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
-
-		ret = ext4_orphan_add(handle, inode);
-		if (ret) {
-			ext4_journal_stop(handle);
+		ret = ext4_prepare_dio_extend(inode);
+		if (ret < 0)
 			goto out;
-		}
-
 		extend = true;
-		ext4_journal_stop(handle);
 	}
 
 	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
 
-	if (extend)
-		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+	if (extend) {
+		if (ret > 0)
+			ret = ext4_handle_inode_extension(inode, offset, ret,
+							  count);
+		ext4_cleanup_dio_extend(inode, offset, ret, count);
+	}
 out:
 	inode_unlock(inode);
 	if (ret > 0)
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure
  2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara
@ 2021-04-12 21:50   ` Dave Chinner
  2021-04-13  9:11     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2021-04-12 21:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong

On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> Eric has noticed that after pagecache read rework, generic/418 is
> occasionally failing for ext4 when blocksize < pagesize. In fact, the
> pagecache rework just made hard to hit race in ext4 more likely. The
> problem is that since ext4 conversion of direct IO writes to iomap
> framework (commit 378f32bab371), we update inode size after direct IO
> write only after invalidating page cache. Thus if buffered read sneaks
> at unfortunate moment like:
> 
> CPU1 - write at offset 1k                       CPU2 - read from offset 0
> iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
>                                                 ext4_readpage();
> ext4_handle_inode_extension()
> 
> the read will zero out tail of the page as it still sees smaller inode
> size and thus page cache becomes inconsistent with on-disk contents with
> all the consequences.
> 
> Fix the problem by moving inode size update into end_io handler which
> gets called before the page cache is invalidated.

Confused.

This moves all the inode extension stuff into the completion
handler, when all that really needs to be done is extending
inode->i_size to tell the world there is data up to where the
IO completed. Actually removing the inode from the orphan list
does not need to be done in the IO completion callback, because...

>  	if (ilock_shared)
>  		iomap_ops = &ext4_iomap_overwrite_ops;
> -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> -	if (ret == -ENOTBLK)
> -		ret = 0;
> -
>  	if (extend)
> -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +		dio_ops = &ext4_dio_extending_write_ops;
>  
> +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
                            ^^^^^^                    ^^^^^^^^^^^^^^^^^^^ 

.... if we are doing an extending write, we force DIO to complete
before returning. Hence even AIO will block here on an extending
write, and hence we can -always- do the correct post-IO completion
orphan list cleanup here because we know a) the original IO size and
b) the amount of data that was actually written.

Hence all that remains is closing the buffered read vs invalidation
race. All this requires is for the dio write completion to behave
like XFS where it just does the inode->i_size update for extending
writes. THis means the size is updated before the invalidation, and
hence any read that occurs after the invalidation but before the
post-eof blocks have been removed will see the correct size and read
the tail page(s) correctly. This closes the race window, and the
caller can still handle the post-eof block cleanup as it does now.

Hence I don't see any need for changing the iomap infrastructure to
solve this problem. This seems like the obvious solution to me, so
what am I missing?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure
  2021-04-12 21:50   ` Dave Chinner
@ 2021-04-13  9:11     ` Jan Kara
  2021-04-13 22:45       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2021-04-13  9:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel,
	Darrick J . Wong

On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > Eric has noticed that after pagecache read rework, generic/418 is
> > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > pagecache rework just made hard to hit race in ext4 more likely. The
> > problem is that since ext4 conversion of direct IO writes to iomap
> > framework (commit 378f32bab371), we update inode size after direct IO
> > write only after invalidating page cache. Thus if buffered read sneaks
> > at unfortunate moment like:
> > 
> > CPU1 - write at offset 1k                       CPU2 - read from offset 0
> > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> >                                                 ext4_readpage();
> > ext4_handle_inode_extension()
> > 
> > the read will zero out tail of the page as it still sees smaller inode
> > size and thus page cache becomes inconsistent with on-disk contents with
> > all the consequences.
> > 
> > Fix the problem by moving inode size update into end_io handler which
> > gets called before the page cache is invalidated.
> 
> Confused.
> 
> This moves all the inode extension stuff into the completion
> handler, when all that really needs to be done is extending
> inode->i_size to tell the world there is data up to where the
> IO completed. Actually removing the inode from the orphan list
> does not need to be done in the IO completion callback, because...
> 
> >  	if (ilock_shared)
> >  		iomap_ops = &ext4_iomap_overwrite_ops;
> > -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > -	if (ret == -ENOTBLK)
> > -		ret = 0;
> > -
> >  	if (extend)
> > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > +		dio_ops = &ext4_dio_extending_write_ops;
> >  
> > +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
>                             ^^^^^^                    ^^^^^^^^^^^^^^^^^^^ 
> 
> .... if we are doing an extending write, we force DIO to complete
> before returning. Hence even AIO will block here on an extending
> write, and hence we can -always- do the correct post-IO completion
> orphan list cleanup here because we know a) the original IO size and
> b) the amount of data that was actually written.
> 
> Hence all that remains is closing the buffered read vs invalidation
> race. All this requires is for the dio write completion to behave
> like XFS where it just does the inode->i_size update for extending
> writes. THis means the size is updated before the invalidation, and
> hence any read that occurs after the invalidation but before the
> post-eof blocks have been removed will see the correct size and read
> the tail page(s) correctly. This closes the race window, and the
> caller can still handle the post-eof block cleanup as it does now.
> 
> Hence I don't see any need for changing the iomap infrastructure to
> solve this problem. This seems like the obvious solution to me, so
> what am I missing?

All that you write above is correct. The missing piece is: If everything
succeeded and all the cleanup we need is removing inode from the orphan
list (common case), we want to piggyback that orphan list removal into the
same transaction handle as the update of the inode size. This is just a
performance thing, you are absolutely right we could also do the orphan
cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
to the iomap framework.

OK, now that I write about this, maybe I was just too hung up on the
performance improvement. Probably a better way forward is that I just fix
the data corruption bug only inside ext4 (that will be also much easier to
backport) and then submit the performance improvement modifying iomap if I
can actually get performance data justifying it. Thanks for poking into
this :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure
  2021-04-13  9:11     ` Jan Kara
@ 2021-04-13 22:45       ` Dave Chinner
  2021-04-14 11:56         ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2021-04-13 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong

On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > > Eric has noticed that after pagecache read rework, generic/418 is
> > > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > > pagecache rework just made hard to hit race in ext4 more likely. The
> > > problem is that since ext4 conversion of direct IO writes to iomap
> > > framework (commit 378f32bab371), we update inode size after direct IO
> > > write only after invalidating page cache. Thus if buffered read sneaks
> > > at unfortunate moment like:
> > > 
> > > CPU1 - write at offset 1k                       CPU2 - read from offset 0
> > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > >                                                 ext4_readpage();
> > > ext4_handle_inode_extension()
> > > 
> > > the read will zero out tail of the page as it still sees smaller inode
> > > size and thus page cache becomes inconsistent with on-disk contents with
> > > all the consequences.
> > > 
> > > Fix the problem by moving inode size update into end_io handler which
> > > gets called before the page cache is invalidated.
> > 
> > Confused.
> > 
> > This moves all the inode extension stuff into the completion
> > handler, when all that really needs to be done is extending
> > inode->i_size to tell the world there is data up to where the
> > IO completed. Actually removing the inode from the orphan list
> > does not need to be done in the IO completion callback, because...
> > 
> > >  	if (ilock_shared)
> > >  		iomap_ops = &ext4_iomap_overwrite_ops;
> > > -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > -	if (ret == -ENOTBLK)
> > > -		ret = 0;
> > > -
> > >  	if (extend)
> > > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > +		dio_ops = &ext4_dio_extending_write_ops;
> > >  
> > > +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > > +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> >                             ^^^^^^                    ^^^^^^^^^^^^^^^^^^^ 
> > 
> > .... if we are doing an extending write, we force DIO to complete
> > before returning. Hence even AIO will block here on an extending
> > write, and hence we can -always- do the correct post-IO completion
> > orphan list cleanup here because we know a) the original IO size and
> > b) the amount of data that was actually written.
> > 
> > Hence all that remains is closing the buffered read vs invalidation
> > race. All this requires is for the dio write completion to behave
> > like XFS where it just does the inode->i_size update for extending
> > writes. THis means the size is updated before the invalidation, and
> > hence any read that occurs after the invalidation but before the
> > post-eof blocks have been removed will see the correct size and read
> > the tail page(s) correctly. This closes the race window, and the
> > caller can still handle the post-eof block cleanup as it does now.
> > 
> > Hence I don't see any need for changing the iomap infrastructure to
> > solve this problem. This seems like the obvious solution to me, so
> > what am I missing?
> 
> All that you write above is correct. The missing piece is: If everything
> succeeded and all the cleanup we need is removing inode from the orphan
> list (common case), we want to piggyback that orphan list removal into the
> same transaction handle as the update of the inode size. This is just a
> performance thing, you are absolutely right we could also do the orphan
> cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
> to the iomap framework.

Doesn't ext4, like XFS, keep two copies of the inode size? One for
the on-disk size and one for the in-memory size?

/me looks...

Yeah, there's ei->i_disksize that reflects the on-disk size.

And I note that the first thing that ext4_handle_inode_extension()
is already checking that the write is extending past the current
on-disk inode size before running the extension transaction.

The page cache only cares about the inode->i_size value, not the
ei->i_disksize value, so you can update them independently and still
have things work correctly. That's what XFS does in
xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
runs a transaction to atomically update the inode on-disk inode
size. Updating the VFS inode size first protects against buffered
read races while updating the on-disk size...

So for ext4, the two separate size updates don't need to be done at
the same time - you have all the state you need in the ext4 dio
write path to extend the on-disk file size on successful extending
write, and it is not dependent in any way on the current in-memory
VFS inode size that you'd update in the ->end_io callback....

> OK, now that I write about this, maybe I was just too hung up on the
> performance improvement. Probably a better way forward is that I just fix
> the data corruption bug only inside ext4 (that will be also much easier to
> backport) and then submit the performance improvement modifying iomap if I
> can actually get performance data justifying it. Thanks for poking into
> this :)

Sounds like a good plan :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] ext4: Fix occasional generic/418 failure
  2021-04-13 22:45       ` Dave Chinner
@ 2021-04-14 11:56         ` Jan Kara
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kara @ 2021-04-14 11:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel,
	Darrick J . Wong

On Wed 14-04-21 08:45:31, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> > On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > > On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > > > Eric has noticed that after pagecache read rework, generic/418 is
> > > > occasionally failing for ext4 when blocksize < pagesize. In fact, the
> > > > pagecache rework just made hard to hit race in ext4 more likely. The
> > > > problem is that since ext4 conversion of direct IO writes to iomap
> > > > framework (commit 378f32bab371), we update inode size after direct IO
> > > > write only after invalidating page cache. Thus if buffered read sneaks
> > > > at unfortunate moment like:
> > > > 
> > > > CPU1 - write at offset 1k                       CPU2 - read from offset 0
> > > > iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
> > > >                                                 ext4_readpage();
> > > > ext4_handle_inode_extension()
> > > > 
> > > > the read will zero out tail of the page as it still sees smaller inode
> > > > size and thus page cache becomes inconsistent with on-disk contents with
> > > > all the consequences.
> > > > 
> > > > Fix the problem by moving inode size update into end_io handler which
> > > > gets called before the page cache is invalidated.
> > > 
> > > Confused.
> > > 
> > > This moves all the inode extension stuff into the completion
> > > handler, when all that really needs to be done is extending
> > > inode->i_size to tell the world there is data up to where the
> > > IO completed. Actually removing the inode from the orphan list
> > > does not need to be done in the IO completion callback, because...
> > > 
> > > >  	if (ilock_shared)
> > > >  		iomap_ops = &ext4_iomap_overwrite_ops;
> > > > -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> > > > -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> > > > -	if (ret == -ENOTBLK)
> > > > -		ret = 0;
> > > > -
> > > >  	if (extend)
> > > > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > > +		dio_ops = &ext4_dio_extending_write_ops;
> > > >  
> > > > +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> > > > +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
> > >                             ^^^^^^                    ^^^^^^^^^^^^^^^^^^^ 
> > > 
> > > .... if we are doing an extending write, we force DIO to complete
> > > before returning. Hence even AIO will block here on an extending
> > > write, and hence we can -always- do the correct post-IO completion
> > > orphan list cleanup here because we know a) the original IO size and
> > > b) the amount of data that was actually written.
> > > 
> > > Hence all that remains is closing the buffered read vs invalidation
> > > race. All this requires is for the dio write completion to behave
> > > like XFS where it just does the inode->i_size update for extending
> > > writes. THis means the size is updated before the invalidation, and
> > > hence any read that occurs after the invalidation but before the
> > > post-eof blocks have been removed will see the correct size and read
> > > the tail page(s) correctly. This closes the race window, and the
> > > caller can still handle the post-eof block cleanup as it does now.
> > > 
> > > Hence I don't see any need for changing the iomap infrastructure to
> > > solve this problem. This seems like the obvious solution to me, so
> > > what am I missing?
> > 
> > All that you write above is correct. The missing piece is: If everything
> > succeeded and all the cleanup we need is removing inode from the orphan
> > list (common case), we want to piggyback that orphan list removal into the
> > same transaction handle as the update of the inode size. This is just a
> > performance thing, you are absolutely right we could also do the orphan
> > cleanup unconditionally in ext4_dio_write_iter() and thus avoid any changes
> > to the iomap framework.
> 
> Doesn't ext4, like XFS, keep two copies of the inode size? One for
> the on-disk size and one for the in-memory size?
> 
> /me looks...
> 
> Yeah, there's ei->i_disksize that reflects the on-disk size.
> 
> And I note that the first thing that ext4_handle_inode_extension()
> is already checking that the write is extending past the current
> on-disk inode size before running the extension transaction.

Yes.

> The page cache only cares about the inode->i_size value, not the
> ei->i_disksize value, so you can update them independently and still
> have things work correctly. That's what XFS does in
> xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
> runs a transaction to atomically update the inode on-disk inode
> size. Updating the VFS inode size first protects against buffered
> read races while updating the on-disk size...
> 
> So for ext4, the two separate size updates don't need to be done at
> the same time - you have all the state you need in the ext4 dio
> write path to extend the on-disk file size on successful extending
> write, and it is not dependent in any way on the current in-memory
> VFS inode size that you'd update in the ->end_io callback....

Right, that's a nice trick that will allow us to keep the original
performance (I've verified that indeed splitting inode size and orphan
updates into separate transactions costs us ~10% of performance when
appending 512-byte chunks) without touching generic dio code. Thanks for
the idea!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc()
  2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara
  2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
  2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara
@ 2021-04-12 10:23 ` Jan Kara
  2021-04-12 11:30   ` Matthew Wilcox
  2021-06-16 12:22   ` Theodore Ts'o
  2 siblings, 2 replies; 13+ messages in thread
From: Jan Kara @ 2021-04-12 10:23 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong, Jan Kara

A code in iomap alloc may overblock block number when converting it to
byte offset. Luckily this is mostly harmless as we will just use more
expensive method of writing using unwritten extents even though we are
writing beyond i_size.

Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0948a43f1b3d..7cebbb2d2e34 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3420,7 +3420,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	 * i_disksize out to i_size. This could be beyond where direct I/O is
 	 * happening and thus expose allocated blocks to direct I/O reads.
 	 */
-	else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
+	else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
 		m_flags = EXT4_GET_BLOCKS_CREATE;
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc()
  2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
@ 2021-04-12 11:30   ` Matthew Wilcox
  2021-06-16 12:22   ` Theodore Ts'o
  1 sibling, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2021-04-12 11:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong

On Mon, Apr 12, 2021 at 12:23:33PM +0200, Jan Kara wrote:
> A code in iomap alloc may overblock block number when converting it to

"overflow"?

> byte offset. Luckily this is mostly harmless as we will just use more
> expensive method of writing using unwritten extents even though we are
> writing beyond i_size.
> 
> Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0948a43f1b3d..7cebbb2d2e34 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3420,7 +3420,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
>  	 * i_disksize out to i_size. This could be beyond where direct I/O is
>  	 * happening and thus expose allocated blocks to direct I/O reads.
>  	 */
> -	else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
> +	else if (((loff_t)map->m_lblk << blkbits) >= i_size_read(inode))
>  		m_flags = EXT4_GET_BLOCKS_CREATE;
>  	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>  		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc()
  2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
  2021-04-12 11:30   ` Matthew Wilcox
@ 2021-06-16 12:22   ` Theodore Ts'o
  1 sibling, 0 replies; 13+ messages in thread
From: Theodore Ts'o @ 2021-06-16 12:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, Eric Whitney, linux-fsdevel, Darrick J . Wong

On Mon, Apr 12, 2021 at 12:23:33PM +0200, Jan Kara wrote:
> A code in iomap alloc may overblock block number when converting it to
> byte offset. Luckily this is mostly harmless as we will just use more
> expensive method of writing using unwritten extents even though we are
> writing beyond i_size.
> 
> Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
> Signed-off-by: Jan Kara <jack@suse.cz>

This was part of the patch series "ext4: Fix data corruption when
extending DIO write races with buffered read" but which fixes an
unrelated problem.  The patch series was dropped in favor of a
different approach, but it looks like this patch is still applicable,
so I've applied with a minor typo fix in the commit description.

   		       	     	      - Ted

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-06-16 12:22 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12 10:23 [PATCH 0/3] ext4: Fix data corruption when extending DIO write races with buffered read Jan Kara
2021-04-12 10:23 ` [PATCH 1/3] iomap: Pass original DIO size to completion handler Jan Kara
2021-04-12 14:07   ` kernel test robot
2021-04-12 14:12   ` kernel test robot
2021-04-12 16:37   ` kernel test robot
2021-04-12 10:23 ` [PATCH 2/3] ext4: Fix occasional generic/418 failure Jan Kara
2021-04-12 21:50   ` Dave Chinner
2021-04-13  9:11     ` Jan Kara
2021-04-13 22:45       ` Dave Chinner
2021-04-14 11:56         ` Jan Kara
2021-04-12 10:23 ` [PATCH 3/3] ext4: Fix overflow in ext4_iomap_alloc() Jan Kara
2021-04-12 11:30   ` Matthew Wilcox
2021-06-16 12:22   ` Theodore Ts'o

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).