On 3/18/24 7:56 PM, Su Yue wrote:
> From: Su Yue <glass.su@suse.com>
>
> inode ctime should be updated if ocfs2_fileattr_set is called.
>
> Signed-off-by: Su Yue <glass.su@suse.com>

Looks fine.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

> ---
>  fs/ocfs2/ioctl.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
> index b1550ba73f96..71beef7f8a60 100644
> --- a/fs/ocfs2/ioctl.c
> +++ b/fs/ocfs2/ioctl.c
> @@ -125,6 +125,7 @@ int ocfs2_fileattr_set(struct mnt_idmap *idmap,
>
>  	ocfs2_inode->ip_attr = flags;
>  	ocfs2_set_inode_flags(inode);
> +	inode_set_ctime_current(inode);
>
>  	status = ocfs2_mark_inode_dirty(handle, inode, bh);
>  	if (status < 0)
From: Su Yue <glass.su@suse.com>

inode ctime should be updated if ocfs2_fileattr_set is called.

Signed-off-by: Su Yue <glass.su@suse.com>
---
 fs/ocfs2/ioctl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index b1550ba73f96..71beef7f8a60 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -125,6 +125,7 @@ int ocfs2_fileattr_set(struct mnt_idmap *idmap,
 
 	ocfs2_inode->ip_attr = flags;
 	ocfs2_set_inode_flags(inode);
+	inode_set_ctime_current(inode);
 
 	status = ocfs2_mark_inode_dirty(handle, inode, bh);
 	if (status < 0)
--
2.44.0
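For context, attribute changes reach ocfs2_fileattr_set() through the VFS fileattr interface (e.g. the FS_IOC_SETFLAGS ioctl). A minimal userspace sketch to observe the fix; the file path and the toggled attribute bit are arbitrary choices for illustration, not from the patch:

```
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	/* path is an assumption; pass any file on an ocfs2 mount */
	const char *path = argc > 1 ? argv[1] : "/mnt/ocfs2/testfile";
	struct stat before, after;
	int fd, attr;

	fd = open(path, O_RDONLY);
	if (fd < 0 || fstat(fd, &before) < 0)
		return 1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0)
		return 1;
	attr ^= FS_NOATIME_FL;	/* any flag change goes through fileattr_set */
	sleep(1);		/* make a ctime bump clearly visible */
	if (ioctl(fd, FS_IOC_SETFLAGS, &attr) < 0)
		return 1;
	fstat(fd, &after);
	printf("ctime %s\n",
	       before.st_ctim.tv_sec != after.st_ctim.tv_sec ||
	       before.st_ctim.tv_nsec != after.st_ctim.tv_nsec ?
	       "updated" : "unchanged");
	close(fd);
	return 0;
}
```

Before the patch this prints "unchanged"; with it, the attribute change bumps ctime as expected.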
On 3/18/24 13:49, Joseph Qi wrote:
>
>
> On 3/18/24 1:29 PM, Heming Zhao wrote:
>> On 3/18/24 09:39, Joseph Qi wrote:
>>>
>>>
>>> On 3/16/24 11:44 AM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling a huge
>>>> number of small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy looking up an available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contiguous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> Let's examine the path below.
>>>>
>>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>>
>>>> 2. a user deletes a large file, triggering
>>>> ocfs2_local_alloc_seen_free_bits, which sets osb->local_alloc_state
>>>> unconditionally.
>>>>
>>>> 3. a write I/O thread runs and triggers the worst-performance path
>>>>
>>>> ```
>>>> ocfs2_reserve_clusters_with_limit
>>>> ocfs2_reserve_local_alloc_bits
>>>> ocfs2_local_alloc_slide_window //[1]
>>>> + ocfs2_local_alloc_reserve_for_window //[2]
>>>> + ocfs2_local_alloc_new_window //[3]
>>>> ocfs2_recalc_la_window
>>>> ```
>>>>
>>>> [1]:
>>>> will be called when the la window bits are used up.
>>>>
>>>> [2]:
>>>> runs while the la state is ENABLED; since this func only checks the
>>>> global_bitmap free bits, it will succeed in general.
>>>>
>>>> [3]:
>>>> will use the default la window size to search for clusters, then fail.
>>>> ocfs2_recalc_la_window then attempts other la window sizes.
>>>> The time complexity is O(n^4), resulting in a significant time
>>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>>> in write I/Os (e.g., user space 'dd').
>>>>
>>>> i.e.
>>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>>> la window default size: 106MB.
>>>> The partition is fragmented by creating & deleting a huge amount of
>>>> small files.
>>>>
>>>> before this patch, the timing of [3] would be
>>>> (the numbers are taken from the real world):
>>>> - la window size change order (size: MB):
>>>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>>> only 0.8MB succeeds; 0.8MB also triggers disabling of the la window.
>>>> ocfs2_local_alloc_new_window retries 8 times; the first 7 attempts all
>>>> run in the worst case.
>>>> - group chain number: 242
>>>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>>>> - each chain has 49 block groups
>>>> ocfs2_search_chain calls while-loop 49 times
>>>> - each bg has 32256 bits
>>>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>>> since ocfs2_find_next_zero_bit uses ffz() to find a zero bit, let's use
>>>> (32256/64) (this is not the worst value) for the timing calculation.
>>>>
>>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>>
>>>> In the worst case, a user-space write of 1MB of data will trigger 42M
>>>> scan iterations.
>>>>
>>>> under this patch, the count is '7*242*49 = 83006', reduced by nearly
>>>> three orders of magnitude.
>>>>
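The loop counts quoted above are easy to verify mechanically; a trivial standalone sketch using the message's own numbers:

```
#include <stdio.h>

int main(void)
{
	long retries = 7;	 /* failing la window sizes: 106 .. 1.6 MB */
	long chains = 242;	 /* group chain number */
	long groups = 49;	 /* block groups per chain */
	long words = 32256 / 64; /* 64-bit words scanned per group bitmap */

	printf("before: %ld\n", retries * chains * groups * words); /* 41835024 */
	printf("after:  %ld\n", retries * chains * groups);	    /* 83006 */
	return 0;
}
```
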
>>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>>> ---
>>>> v2:
>>>> 1. fix the wrong length conversion, from cpu_to_le16() to cpu_to_le32(),
>>>> for setting bg->bg_contig_free_bits.
>>>> 2. change ocfs2_find_max_contig_free_bits return type from 'void' to
>>>> 'unsigned int'.
>>>> 3. restore the ocfs2_block_group_set_bits() input parameter style; change
>>>> parameter 'failure_path' to 'fastpath'.
>>>> 4. after <3>, add new parameter 'unsigned int max_contig_bits'.
>>>> 5. after <3>, restore define 'struct ocfs2_suballoc_result' from
>>>> 'suballoc.h' to 'suballoc.c'.
>>>> 6. fix some code indentation errors.
>>>>
>>>> ---
>>>> fs/ocfs2/move_extents.c | 2 +-
>>>> fs/ocfs2/ocfs2_fs.h | 4 +-
>>>> fs/ocfs2/resize.c | 7 +++
>>>> fs/ocfs2/suballoc.c | 99 +++++++++++++++++++++++++++++++++++++----
>>>> fs/ocfs2/suballoc.h | 6 ++-
>>>> 5 files changed, 106 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>>> index 1f9ed117e78b..f9d6a4f9ca92 100644
>>>> --- a/fs/ocfs2/move_extents.c
>>>> +++ b/fs/ocfs2/move_extents.c
>>>> @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>>> }
>>>> ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>>> - goal_bit, len);
>>>> + goal_bit, len, 0, 0);
>>>> if (ret) {
>>>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>>> le16_to_cpu(gd->bg_chain));
>>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>>> index 7aebdbf5cc0a..eae18d772e93 100644
>>>> --- a/fs/ocfs2/ocfs2_fs.h
>>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>>> __le16 bg_free_bits_count; /* Free bits count */
>>>> __le16 bg_chain; /* What chain I am in. */
>>>> /*10*/ __le32 bg_generation;
>>>> - __le32 bg_reserved1;
>>>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>>>
>>> I've found some parts are using le32, while others are using le16.
>>> It seems that le16 is enough here. So why not:
>>> __le16 bg_contig_free_bits;
>>> __le16 bg_reserved1;
>>>
>>> Joseph
>>
>> IIUC, from the comment before ocfs2_la_default_mb(), the max cluster
>> size is 1MB. And from 'struct ocfs2_group_desc', the offset of
>> bg_bitmap is 0x40. Therefore, the max contiguous bit of gd:
>> (1024*1024 - 0x40)*8 = 8388096 (23-bit length, bigger than __le16)
>>
> bg_bits and bg_free_bits_count are already using __le16 now.
You are right.
I rechecked the code. The gd block size should be the block size, not
the cluster size, and the block size cannot be larger than PAGE_SIZE,
so the max value of bg_free_bits_count is (4096-0x40)*8 = 32256, which
fits in a __le16.
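Both bounds are quick to verify (a minimal userspace sketch; 0x40 is the bg_bitmap offset in 'struct ocfs2_group_desc' mentioned above):

```
#include <stdio.h>

int main(void)
{
	unsigned int le16_max = 0xffff;			    /* 65535 */
	unsigned int max_gd_1mb = (1024 * 1024 - 0x40) * 8; /* 1MB gd */
	unsigned int max_gd_4kb = (4096 - 0x40) * 8;	    /* PAGE_SIZE gd */

	printf("1MB gd: %u bits (fits __le16? %d)\n",
	       max_gd_1mb, max_gd_1mb <= le16_max);	    /* 8388096, no */
	printf("4KB gd: %u bits (fits __le16? %d)\n",
	       max_gd_4kb, max_gd_4kb <= le16_max);	    /* 32256, yes */
	return 0;
}
```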
I will modify all the related code to use __le16 instead of __le32.
-Heming
On 3/18/24 11:16 AM, Heming Zhao wrote:
> On 3/18/24 09:26, Joseph Qi wrote:
>>
>>
>> On 3/15/24 8:26 PM, Heming Zhao wrote:
>>> On 3/15/24 15:22, Joseph Qi wrote:
>>>>
>>>>
>>>> On 3/15/24 10:25 AM, Heming Zhao wrote:
>>>>> On 3/15/24 09:09, Joseph Qi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>>>>>> This patch resolves 3 journal-related issues:
>>>>>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>>>>>> protection for modifying dinode.
>>>>>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>>>>>> journal protection for writing fe->i_size.
>>>>>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>>>>>> This adjustment ensures that only content requiring protection is
>>>>>>> included in the journal.
>>>>>>>
>>>>>>
>>>>>> Any real issue you've found for above cases?
>>>>>
>>>>> No. I just found these issues when I was reading code.
>>>>>>
>>>>>> Take ocfs2_rollback_alloc_dinode_counts() for example: it will always be
>>>>>> used paired with ocfs2_alloc_dinode_update_counts(), more precisely, in
>>>>>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>>>>>> already dirtied before. Since they are in the same transaction, I don't
>>>>>> see any problem here.
>>>>>
>>>>> Yes, they share the same transaction.
>>>>> However:
>>>>> In ocfs2_alloc_dinode_update_counts(), a jbd2 access/dirty pair is
>>>>> already used to protect metadata. In my view, jbd2_journal_access starts
>>>>> the protection stage and jbd2_journal_dirty closes it; the rollback
>>>>> function should start a new access/dirty pair to inform jbd2 that the
>>>>> di_bh has been modified again.
>>>>>
>>>> Refer jbd2_journal_dirty_metadata():
>>>>
>>>> /*
>>>> * fastpath, to avoid expensive locking. If this buffer is already
>>>> * on the running transaction's metadata list there is nothing to do.
>>>> * Nobody can take it off again because there is a handle open.
>>>> * I _think_ we're OK here with SMP barriers - a mistaken decision will
>>>> * result in this test being false, so we go in and take the locks.
>>>> */
>>>>
>>>> So IMO, we don't have to access/dirty again since the buffer head is
>>>> already added before.
>>>>
>>>> Joseph
>>>
>>> Thanks for the detailed info. Regarding the jbd2 code, you are right.
>>>
>>> What is your opinion on the other modifications in my patch?
>>> (the modifications to the jbd2 protection scope.)
>>> Under the above access/dirty buffer protection rule, the bh within a
>>> function stays the same, so it seems the jbd2 dirty function could be
>>> called anywhere, regardless of where the buffer_head is last used.
>>> That's counterintuitive.
>>>
>> Which part do you refer to?
>> It seems that all of them are changing the same buffer head, which is
>> dirtied before, so it won't be a problem since the transaction hasn't
>> been committed yet.
>> BTW, dinode size change should be protected under ip_lock, so we can't
>> move it out.
>>
>> Thanks,
>> Joseph
>>
> Thank you for your explanation.
> For my patch, it's wrong to move out fe->i_size from spin_[un]lock()
> area.
>
> But on the above point, I have a different view:
> ->ip_lock is not used to protect the dinode i_size.
>
> From the comment in 'struct ocfs2_inode_info', 'spinlock_t ip_lock'
> is used to protect the items below:
> ```
> /* These fields are protected by ip_lock */
> spinlock_t ip_lock;
> u32 ip_open_count;
> struct list_head ip_io_markers;
> u32 ip_clusters;
>
> u16 ip_dyn_features;
> struct mutex ip_io_mutex; //zhm: should not include this
> u32 ip_flags; /* see below */
> u32 ip_attr; /* inode attributes */
> ```
>
> from the code of ocfs2_block_group_alloc():
> ```
> 737 spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
> 738 OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> 739 fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
> 740 le32_to_cpu(fe->i_clusters)));
> 741 spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
> ```
>
> we can see that ->ip_lock is protecting line 738, the assignment to
> "oi->ip_clusters". Because the cluster length is obtained from
> 'fe->i_clusters', if we move line 739 out of the spin_[un]lock() area,
> fe->i_clusters may change, resulting in a different value for fe->i_size.
> So author Mark put both 738 and 739 inside the spin_[un]lock() area. From
> another viewpoint, it appears that Mark thinks fe->i_size should stay
> consistent with oi->ip_clusters.
>
Make sense.

> we can also verify my analysis from ocfs2_mark_inode_dirty().
> In this function, ->ip_lock doesn't protect fe->i_size, because
> the assignment reads data from 'inode' and is not related
> to any 'ocfs2_inode_info' ip_xx items.
>
> ------------------
> Let's discuss the jbd2 access/dirty pairing.
> Because this code has already been running for almost twenty years,
> I accept leaving the existing code unchanged. But it's better to unify
> the jbd2 access/dirty code style.
>
> e.g:
> ocfs2_group_add() & ocfs2_block_group_alloc():
> - jbd2 dirty *before* spin_lock area.
>
> ocfs2_update_last_group_and_inode():
> - jbd2 dirty *after* spin_lock area.
>
> ocfs2_mark_inode_dirty():
> - jbd2 dirty *after* spin_lock area.
>
> Could we unify the code style? For example, by placing the jbd2 dirty
> *after* the spin_lock area.
>
> e.g:
> ```
> @@ -559,13 +559,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
> (input->clusters - input->frees) * cl_bpc);
> le32_add_cpu(&fe->i_clusters, input->clusters);
> - ocfs2_journal_dirty(handle, main_bm_bh);
> -
> spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
> OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
> spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
> i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));
> + ocfs2_journal_dirty(handle, main_bm_bh);
> ocfs2_update_super_and_backups(main_bm_inode, input->clusters);
> ```
>
The above change looks sane.

Joseph
On 3/18/24 09:26, Joseph Qi wrote:
>
>
> On 3/15/24 8:26 PM, Heming Zhao wrote:
>> On 3/15/24 15:22, Joseph Qi wrote:
>>>
>>>
>>> On 3/15/24 10:25 AM, Heming Zhao wrote:
>>>> On 3/15/24 09:09, Joseph Qi wrote:
>>>>> Hi,
>>>>>
>>>>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>>>>> This patch resolves 3 journal-related issues:
>>>>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>>>>> protection for modifying dinode.
>>>>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>>>>> journal protection for writing fe->i_size.
>>>>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>>>>> This adjustment ensures that only content requiring protection is
>>>>>> included in the journal.
>>>>>>
>>>>>
>>>>> Any real issue you've found for above cases?
>>>>
>>>> No. I just found these issues when I was reading code.
>>>>>
>>>>> Take ocfs2_rollback_alloc_dinode_counts() for example: it will always be
>>>>> used paired with ocfs2_alloc_dinode_update_counts(), more precisely, in
>>>>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>>>>> already dirtied before. Since they are in the same transaction, I don't
>>>>> see any problem here.
>>>>
>>>> Yes, they share the same transaction.
>>>> However:
>>>> In ocfs2_alloc_dinode_update_counts(), a jbd2 access/dirty pair is
>>>> already used to protect metadata. In my view, jbd2_journal_access starts
>>>> the protection stage and jbd2_journal_dirty closes it; the rollback
>>>> function should start a new access/dirty pair to inform jbd2 that the
>>>> di_bh has been modified again.
>>>>
>>> Refer jbd2_journal_dirty_metadata():
>>>
>>> /*
>>> * fastpath, to avoid expensive locking. If this buffer is already
>>> * on the running transaction's metadata list there is nothing to do.
>>> * Nobody can take it off again because there is a handle open.
>>> * I _think_ we're OK here with SMP barriers - a mistaken decision will
>>> * result in this test being false, so we go in and take the locks.
>>> */
>>>
>>> So IMO, we don't have to access/dirty again since the buffer head is
>>> already added before.
>>>
>>> Joseph
>>
>> Thanks for the detailed info. Regarding the jbd2 code, you are right.
>>
>> What is your opinion on the other modifications in my patch?
>> (the modifications to the jbd2 protection scope.)
>> Under the above access/dirty buffer protection rule, the bh within a
>> function stays the same, so it seems the jbd2 dirty function could be
>> called anywhere, regardless of where the buffer_head is last used.
>> That's counterintuitive.
>>
> Which part do you refer to?
> It seems that all of them are changing the same buffer head, which is
> dirtied before, so it won't be a problem since the transaction hasn't
> been committed yet.
> BTW, dinode size change should be protected under ip_lock, so we can't
> move it out.
>
> Thanks,
> Joseph
>
Thank you for your explanation.
For my patch, it's wrong to move out fe->i_size from spin_[un]lock()
area.
But on the above point, I have a different view:
->ip_lock is not used to protect the dinode i_size.
From the comment in 'struct ocfs2_inode_info', 'spinlock_t ip_lock'
is used to protect the items below:
```
/* These fields are protected by ip_lock */
spinlock_t ip_lock;
u32 ip_open_count;
struct list_head ip_io_markers;
u32 ip_clusters;
u16 ip_dyn_features;
struct mutex ip_io_mutex; //zhm: should not include this
u32 ip_flags; /* see below */
u32 ip_attr; /* inode attributes */
```
from the code of ocfs2_block_group_alloc():
```
737 spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
738 OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
739 fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
740 le32_to_cpu(fe->i_clusters)));
741 spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
```
we can see that ->ip_lock is protecting line 738, the assignment to
"oi->ip_clusters". Because the cluster length is obtained from
'fe->i_clusters', if we move line 739 out of the spin_[un]lock() area,
fe->i_clusters may change, resulting in a different value for fe->i_size.
So author Mark put both 738 and 739 inside the spin_[un]lock() area. From
another viewpoint, it appears that Mark thinks fe->i_size should stay
consistent with oi->ip_clusters.
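For illustration, this invariant can be modeled in plain C (a toy sketch, not ocfs2 code; the names and the 4KB cluster size are stand-ins): both stores happen inside one critical section, so the derived size always matches the recorded cluster count.

```
#include <pthread.h>
#include <stdio.h>

#define CLUSTER_SIZE 4096ULL		/* assumed cluster size */

static pthread_mutex_t ip_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int fe_i_clusters;	/* stands in for fe->i_clusters */
static unsigned int oi_ip_clusters;	/* stands in for oi->ip_clusters */
static unsigned long long fe_i_size;	/* stands in for fe->i_size */

static void toy_grow(unsigned int add)
{
	pthread_mutex_lock(&ip_lock);
	fe_i_clusters += add;
	oi_ip_clusters = fe_i_clusters;		  /* like line 738 */
	fe_i_size = fe_i_clusters * CLUSTER_SIZE; /* like line 739 */
	pthread_mutex_unlock(&ip_lock);
	/*
	 * If the size computation moved below the unlock, a concurrent
	 * toy_grow() could bump fe_i_clusters in between, and the stored
	 * size would be computed from a value this thread never recorded
	 * in oi_ip_clusters.
	 */
}

int main(void)
{
	toy_grow(16);
	printf("clusters=%u size=%llu\n", oi_ip_clusters, fe_i_size);
	return 0;
}
```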
we can also verify my analysis from ocfs2_mark_inode_dirty().
In this function, ->ip_lock doesn't protect fe->i_size, because
the assignment reads data from 'inode' and is not related
to any 'ocfs2_inode_info' ip_xx items.
------------------
Let's discuss the jbd2 access/dirty pairing.
Because this code has already been running for almost twenty years,
I accept leaving the existing code unchanged. But it's better to unify
the jbd2 access/dirty code style.
e.g:
ocfs2_group_add() & ocfs2_block_group_alloc():
- jbd2 dirty *before* spin_lock area.
ocfs2_update_last_group_and_inode():
- jbd2 dirty *after* spin_lock area.
ocfs2_mark_inode_dirty():
- jbd2 dirty *after* spin_lock area.
Could we unify the code style? For example, by placing the jbd2 dirty
*after* the spin_lock area.
e.g:
```
@@ -559,13 +559,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
(input->clusters - input->frees) * cl_bpc);
le32_add_cpu(&fe->i_clusters, input->clusters);
- ocfs2_journal_dirty(handle, main_bm_bh);
-
spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));
+ ocfs2_journal_dirty(handle, main_bm_bh);
ocfs2_update_super_and_backups(main_bm_inode, input->clusters);
```
Thanks,
Heming
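As an aside, the jbd2 fastpath quoted earlier in this thread can be modeled in a few lines (a toy sketch, not jbd2 code): once a buffer is on the running transaction's metadata list, dirtying it again is a no-op, which is why a second access/dirty pair for the same bh in the same transaction adds nothing.

```
#include <stdbool.h>
#include <stdio.h>

struct toy_bh {
	const char *name;
	bool on_tx_list; /* on the running transaction's metadata list */
};

static void toy_journal_access(struct toy_bh *bh)
{
	if (!bh->on_tx_list) {
		bh->on_tx_list = true;
		printf("%s: added to transaction\n", bh->name);
	}
}

static void toy_journal_dirty(struct toy_bh *bh)
{
	/* fastpath: already on the list, nothing to do */
	if (bh->on_tx_list) {
		printf("%s: already on list, nothing to do\n", bh->name);
		return;
	}
	bh->on_tx_list = true;
}

int main(void)
{
	struct toy_bh di_bh = { "di_bh", false };

	toy_journal_access(&di_bh);
	/* ...update the dinode counts, dirty it... */
	toy_journal_dirty(&di_bh);
	/* ...rollback modifies the same bh again, same transaction... */
	toy_journal_dirty(&di_bh); /* no-op: still covered */
	return 0;
}
```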
The group_search function ocfs2_cluster_group_search() should bypass groups with insufficient space to avoid unnecessary searches. This patch is particularly useful when ocfs2 is handling huge number small files, and volume fragmentation is very high. In this case, ocfs2 is busy with looking up available la window from //global_bitmap. This patch introduces a new member in the Group Description (gd) struct called 'bg_contig_free_bits', representing the max contigous free bits in this gd. When ocfs2 allocates a new la window from //global_bitmap, 'bg_contig_free_bits' helps expedite the search process. Let's image below path. 1. la state (->local_alloc_state) is set THROTTLED or DISABLED. 2. when user delete a large file and trigger ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state unconditionally. 3. a write IOs thread run and trigger the worst performance path ``` ocfs2_reserve_clusters_with_limit ocfs2_reserve_local_alloc_bits ocfs2_local_alloc_slide_window //[1] + ocfs2_local_alloc_reserve_for_window //[2] + ocfs2_local_alloc_new_window //[3] ocfs2_recalc_la_window ``` [1]: will be called when la window bits used up. [2]: under la state is ENABLED, and this func only check global_bitmap free bits, it will succeed in general. [3]: will use the default la window size to search clusters then fail. ocfs2_recalc_la_window attempts other la window sizes. the timing complexity is O(n^4), resulting in a significant time cost for scanning global bitmap. This leads to a dramatic slowdown in write I/Os (e.g., user space 'dd'). i.e. an ocfs2 partition size: 1.45TB, cluster size: 4KB, la window default size: 106MB. The partition is fragmentation by creating & deleting huge mount of small files. before this patch, the timing of [3] should be (the number got from real world): - la window size change order (size: MB): 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8 only 0.8MB succeed, 0.8MB also triggers la window to disable. ocfs2_local_alloc_new_window retries 8 times, first 7 times totally runs in worst case. - group chain number: 242 ocfs2_claim_suballoc_bits calls for-loop 242 times - each chain has 49 block group ocfs2_search_chain calls while-loop 49 times - each bg has 32256 blocks ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits. for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use (32256/64) (this is not worst value) for timing calucation. the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times) In the worst case, user space writes 1MB data will trigger 42M scanning times. under this patch, the timing is '7*242*49 = 83006', reduced by three orders of magnitude. Signed-off-by: Heming Zhao <heming.zhao@suse.com> --- v2: 1. fix wrong length converting from cpu_to_le16() to cpu_to_le32() for setting bg->bg_contig_free_bits. 2. change ocfs2_find_max_contig_free_bits return type from 'void' to 'unsigned int'. 3. restore ocfs2_block_group_set_bits() input parameters style, change parameter 'failure_path' to 'fastpath'. 4. after <3>, add new parameter 'unsigned int max_contig_bits'. 5. after <3>, restore define 'struct ocfs2_suballoc_result' from 'suballoc.h' to 'suballoc.c'. 6. modify some code indent error. 
--- fs/ocfs2/move_extents.c | 2 +- fs/ocfs2/ocfs2_fs.h | 4 +- fs/ocfs2/resize.c | 7 +++ fs/ocfs2/suballoc.c | 99 +++++++++++++++++++++++++++++++++++++---- fs/ocfs2/suballoc.h | 6 ++- 5 files changed, 106 insertions(+), 12 deletions(-) diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c index 1f9ed117e78b..f9d6a4f9ca92 100644 --- a/fs/ocfs2/move_extents.c +++ b/fs/ocfs2/move_extents.c @@ -685,7 +685,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context, } ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh, - goal_bit, len); + goal_bit, len, 0, 0); if (ret) { ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len, le16_to_cpu(gd->bg_chain)); diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h index 7aebdbf5cc0a..eae18d772e93 100644 --- a/fs/ocfs2/ocfs2_fs.h +++ b/fs/ocfs2/ocfs2_fs.h @@ -883,14 +883,14 @@ struct ocfs2_group_desc __le16 bg_free_bits_count; /* Free bits count */ __le16 bg_chain; /* What chain I am in. */ /*10*/ __le32 bg_generation; - __le32 bg_reserved1; + __le32 bg_contig_free_bits; /* max contig free bits length */ __le64 bg_next_group; /* Next group in my list, in blocks */ /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in blocks */ __le64 bg_blkno; /* Offset on disk, in blocks */ /*30*/ struct ocfs2_block_check bg_check; /* Error checking */ - __le64 bg_reserved2; + __le64 bg_reserved1; /*40*/ union { DECLARE_FLEX_ARRAY(__u8, bg_bitmap); struct { diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c index d65d43c61857..2cf3772dd175 100644 --- a/fs/ocfs2/resize.c +++ b/fs/ocfs2/resize.c @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle, u16 cl_bpc = le16_to_cpu(cl->cl_bpc); u16 cl_cpg = le16_to_cpu(cl->cl_cpg); u16 old_bg_clusters; + u32 contig_bits, old_bg_contig_free_bits; trace_ocfs2_update_last_group_and_inode(new_clusters, first_new_cluster); @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle, le16_add_cpu(&group->bg_free_bits_count, -1 * backups); } + contig_bits = ocfs2_find_max_contig_free_bits(group->bg_bitmap, + group->bg_bits, 0); + old_bg_contig_free_bits = group->bg_contig_free_bits; + group->bg_contig_free_bits = cpu_to_le32(contig_bits); + ocfs2_journal_dirty(handle, group_bh); /* update the inode accordingly. 
*/
@@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
le16_add_cpu(&group->bg_free_bits_count, backups);
le16_add_cpu(&group->bg_bits, -1 * num_bits);
le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
+ group->bg_contig_free_bits = old_bg_contig_free_bits;
}
out:
if (ret)
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..2af63124065c 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -50,6 +50,10 @@ struct ocfs2_suballoc_result {
u64 sr_blkno; /* The first allocated block */
unsigned int sr_bit_offset; /* The bit in the bg */
unsigned int sr_bits; /* How many bits we claimed */
+ unsigned int sr_max_contig_bits; /* The length for contiguous
+ * free bits, only available
+ * for cluster group
+ */
};

static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
@@ -1272,6 +1276,26 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
return ret;
}

+int ocfs2_find_max_contig_free_bits(void *bitmap,
+ unsigned int total_bits, int start)
+{
+ int offset, free_bits;
+ unsigned int contig_bits = 0;
+
+ while (start < total_bits) {
+ offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
+ if (offset == total_bits)
+ break;
+
+ start = ocfs2_find_next_bit(bitmap, total_bits, offset);
+ free_bits = start - offset;
+ if (contig_bits < free_bits)
+ contig_bits = free_bits;
+ }
+
+ return contig_bits;
+}
+
static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
struct buffer_head *bg_bh,
unsigned int bits_wanted,
@@ -1280,6 +1304,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
{
void *bitmap;
u16 best_offset, best_size;
+ u16 prev_best_size = 0;
int offset, start, found, status = 0;
struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) bg_bh->b_data;
@@ -1308,6 +1333,7 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
/* got a zero after some ones */
found = 1;
start = offset + 1;
+ prev_best_size = best_size;
}
if (found > best_size) {
best_size = found;
@@ -1320,6 +1346,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
}
}

+ /* best_size will be allocated, we save prev_best_size */
+ res->sr_max_contig_bits = prev_best_size;
if (best_size) {
res->sr_bit_offset = best_offset;
res->sr_bits = best_size;
@@ -1337,11 +1365,15 @@ int ocfs2_block_group_set_bits(handle_t *handle,
struct ocfs2_group_desc *bg,
struct buffer_head *group_bh,
unsigned int bit_off,
- unsigned int num_bits)
+ unsigned int num_bits,
+ unsigned int max_contig_bits,
+ int fastpath)
{
int status;
void *bitmap = bg->bg_bitmap;
int journal_type = OCFS2_JOURNAL_ACCESS_WRITE;
+ unsigned int start = bit_off + num_bits;
+ unsigned int contig_bits;

/* All callers get the descriptor via
* ocfs2_read_group_descriptor(). Any corruption is a code bug. */
@@ -1373,6 +1405,28 @@ int ocfs2_block_group_set_bits(handle_t *handle,
while(num_bits--)
ocfs2_set_bit(bit_off++, bitmap);

+ /*
+ * this is optimize path, caller set old contig value
+ * in max_contig_bits to bypass finding action.
+ */
+ if (fastpath) {
+ bg->bg_contig_free_bits = cpu_to_le32(max_contig_bits);
+ } else if (ocfs2_is_cluster_bitmap(alloc_inode)) {
+ /*
+ * Usually, the block group bitmap allocates only 1 bit
+ * at a time, while the cluster group allocates n bits
+ * each time. Therefore, we only save the contig bits for
+ * the cluster group.
+ */
+ contig_bits = ocfs2_find_max_contig_free_bits(bitmap,
+ le16_to_cpu(bg->bg_bits), start);
+ if (contig_bits > max_contig_bits)
+ max_contig_bits = contig_bits;
+ bg->bg_contig_free_bits = cpu_to_le32(max_contig_bits);
+ } else {
+ bg->bg_contig_free_bits = 0;
+ }
+
ocfs2_journal_dirty(handle, group_bh);

bail:
@@ -1486,7 +1540,12 @@ static int ocfs2_cluster_group_search(struct inode *inode,

BUG_ON(!ocfs2_is_cluster_bitmap(inode));

- if (gd->bg_free_bits_count) {
+ if (le16_to_cpu(gd->bg_contig_free_bits) &&
+ le16_to_cpu(gd->bg_contig_free_bits) < bits_wanted)
+ return -ENOSPC;
+
+ /* ->bg_contig_free_bits may un-initialized, so compare again */
+ if (le16_to_cpu(gd->bg_free_bits_count) >= bits_wanted) {
max_bits = le16_to_cpu(gd->bg_bits);

/* Tail groups in cluster bitmaps which aren't cpg
@@ -1555,7 +1614,7 @@ static int ocfs2_block_group_search(struct inode *inode,
BUG_ON(min_bits != 1);
BUG_ON(ocfs2_is_cluster_bitmap(inode));

- if (bg->bg_free_bits_count) {
+ if (le16_to_cpu(bg->bg_free_bits_count) >= bits_wanted) {
ret = ocfs2_block_group_find_clear_bits(OCFS2_SB(inode->i_sb),
group_bh, bits_wanted,
le16_to_cpu(bg->bg_bits),
@@ -1715,7 +1774,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
}

ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
- res->sr_bit_offset, res->sr_bits);
+ res->sr_bit_offset, res->sr_bits,
+ res->sr_max_contig_bits, 0);
if (ret < 0) {
ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
res->sr_bits,
@@ -1849,7 +1909,9 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
bg,
group_bh,
res->sr_bit_offset,
- res->sr_bits);
+ res->sr_bits,
+ res->sr_max_contig_bits,
+ 0);
if (status < 0) {
ocfs2_rollback_alloc_dinode_counts(alloc_inode,
ac->ac_bh, res->sr_bits, chain);
@@ -2163,7 +2225,9 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
bg,
bg_bh,
res->sr_bit_offset,
- res->sr_bits);
+ res->sr_bits,
+ res->sr_max_contig_bits,
+ 0);
if (ret < 0) {
ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
ac->ac_bh, res->sr_bits, chain);
@@ -2382,11 +2446,13 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
struct buffer_head *group_bh,
unsigned int bit_off,
unsigned int num_bits,
+ unsigned int max_contig_bits,
void (*undo_fn)(unsigned int bit,
unsigned long *bmap))
{
int status;
unsigned int tmp;
+ unsigned int contig_bits;
struct ocfs2_group_desc *undo_bg = NULL;
struct journal_head *jh;

@@ -2433,6 +2499,20 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
num_bits);
}

+ /*
+ * TODO: even 'num_bits == 1' (the worst case, release 1 cluster),
+ * we still need to rescan whole bitmap.
+ */
+ if (ocfs2_is_cluster_bitmap(alloc_inode)) {
+ contig_bits = ocfs2_find_max_contig_free_bits(bg->bg_bitmap,
+ le16_to_cpu(bg->bg_bits), 0);
+ if (contig_bits > max_contig_bits)
+ max_contig_bits = contig_bits;
+ bg->bg_contig_free_bits = cpu_to_le32(max_contig_bits);
+ } else {
+ bg->bg_contig_free_bits = 0;
+ }
+
if (undo_fn)
spin_unlock(&jh->b_state_lock);

@@ -2459,6 +2539,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
struct ocfs2_chain_list *cl = &fe->id2.i_chain;
struct buffer_head *group_bh = NULL;
struct ocfs2_group_desc *group;
+ u32 old_bg_contig_free_bits = 0;

/* The alloc_bh comes from ocfs2_free_dinode() or
* ocfs2_free_clusters(). The callers have all locked the
@@ -2483,9 +2564,11 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,

BUG_ON((count + start_bit) > le16_to_cpu(group->bg_bits));

+ if (ocfs2_is_cluster_bitmap(alloc_inode))
+ old_bg_contig_free_bits = group->bg_contig_free_bits;
status = ocfs2_block_group_clear_bits(handle, alloc_inode,
group, group_bh,
- start_bit, count, undo_fn);
+ start_bit, count, 0, undo_fn);
if (status < 0) {
mlog_errno(status);
goto bail;
@@ -2496,7 +2579,7 @@ static int _ocfs2_free_suballoc_bits(handle_t *handle,
if (status < 0) {
mlog_errno(status);
ocfs2_block_group_set_bits(handle, alloc_inode, group, group_bh,
- start_bit, count);
+ start_bit, count, old_bg_contig_free_bits, 1);
goto bail;
}

diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 9c74eace3adc..429fa12eea11 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -79,12 +79,16 @@ void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain);
+int ocfs2_find_max_contig_free_bits(void *bitmap,
+ unsigned int total_bits, int start);
int ocfs2_block_group_set_bits(handle_t *handle,
struct inode *alloc_inode,
struct ocfs2_group_desc *bg,
struct buffer_head *group_bh,
unsigned int bit_off,
- unsigned int num_bits);
+ unsigned int num_bits,
+ unsigned int max_contig_bits,
+ int fastpath);

int ocfs2_claim_metadata(handle_t *handle,
struct ocfs2_alloc_context *ac,
-- 
2.35.3
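As a side note for readers of this series: the core of the optimization is the new fast-reject at the top of ocfs2_cluster_group_search() in the hunk above. The sketch below is a hypothetical userspace model of that check (the struct and helper are simplified stand-ins, not the kernel definitions); it shows how a cached max-contiguous-run value lets the allocator skip a fragmented group in O(1) instead of walking its bitmap.

```
/*
 * Hypothetical userspace model of the fast-reject added to
 * ocfs2_cluster_group_search(). 'bg_contig_free_bits' caches the
 * longest contiguous free run in the group; 0 means "not yet
 * initialized" (e.g. a group written by an older kernel), in which
 * case we fall back to the total-free-bits check.
 */
#include <errno.h>
#include <stdio.h>

struct group_desc {
	unsigned int bg_free_bits_count;  /* total free bits in the group */
	unsigned int bg_contig_free_bits; /* max contiguous run, 0 = unknown */
};

static int cluster_group_search(const struct group_desc *gd,
				unsigned int bits_wanted)
{
	/* cached run known and too short: reject without scanning */
	if (gd->bg_contig_free_bits &&
	    gd->bg_contig_free_bits < bits_wanted)
		return -ENOSPC;

	/* cached run unknown (0): at least require enough free bits */
	if (gd->bg_free_bits_count >= bits_wanted)
		return 0;	/* worth scanning this group's bitmap */

	return -ENOSPC;
}

int main(void)
{
	struct group_desc gd = { .bg_free_bits_count = 512,
				 .bg_contig_free_bits = 8 };

	printf("%d\n", cluster_group_search(&gd, 64)); /* -ENOSPC, no scan */
	printf("%d\n", cluster_group_search(&gd, 8));  /* 0, scan the bitmap */
	return 0;
}
```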
On 3/15/24 15:22, Joseph Qi wrote:
>
>
> On 3/15/24 10:25 AM, Heming Zhao wrote:
>> On 3/15/24 09:09, Joseph Qi wrote:
>>> Hi,
>>>
>>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>>> This patch resolves 3 journal-related issues:
>>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>>> protection for modifying dinode.
>>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>>> journal protection for writing fe->i_size.
>>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>>> This adjustment ensures that only content requiring protection is
>>>> included in the journal.
>>>>
>>>
>>> Any real issue you've found for above cases?
>>
>> No. I just found these issues when I was reading code.
>>>
>>> Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
>>> pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
>>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>>> already dirtied before. Since they are in the same transaction, I don't
>>> see any problem here.
>>
>> Yes, they share the same transaction.
>> However:
>> In ocfs2_alloc_dinode_update_counts(), there already uses a jbd2 access/dirty
>> pair to protect metadata. In my view, jbd2_journal_access starts the
>> protection stage, jbd2_journal_dirty closes the protection stage. the rollback
>> function should start a new access/dirty pair to info jbd2 the di_bh had been
>> modified again.
>>
> Refer jbd2_journal_dirty_metadata():
>
> /*
> * fastpath, to avoid expensive locking. If this buffer is already
> * on the running transaction's metadata list there is nothing to do.
> * Nobody can take it off again because there is a handle open.
> * I _think_ we're OK here with SMP barriers - a mistaken decision will
> * result in this test being false, so we go in and take the locks.
> */
>
> So IMO, we don't have to access/dirty again since the buffer head is
> already added before.
>
> Joseph
Thanks for the detailed info. Regarding the jbd2 code, you are right.
What is your opinion on the other modifications in my patch
(the ones adjusting the jbd2 protection scope)?
Under the above access/dirty buffer protection rule, the bh is the
same within a function, so it seems the jbd2 dirty function could be
called anywhere, regardless of where the buffer_head is last used.
That is counterintuitive.
-Heming
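For readers following this subthread, the access/dirty pairing under discussion has the shape below. It is a minimal sketch modelled on the rollback helper from the patch later in this thread (same ocfs2 calls and dinode fields as quoted there); it illustrates where the protection window opens and closes, and is not a verbatim copy of the kernel source.

```
/*
 * Sketch of the jbd2 access/dirty pairing, as used by the patch's
 * reworked ocfs2_rollback_alloc_dinode_counts(). The access call opens
 * the protection window for di_bh in this transaction; the dirty call
 * closes it. Per Joseph's point above, jbd2_journal_dirty_metadata()
 * short-circuits when the buffer is already on the running
 * transaction's metadata list, so repeating the pair for the same bh
 * within one handle is harmless but not strictly required.
 */
static void rollback_sketch(handle_t *handle, struct inode *inode,
			    struct buffer_head *di_bh,
			    u32 num_bits, u16 chain)
{
	struct ocfs2_dinode *di = (struct ocfs2_dinode *)di_bh->b_data;
	struct ocfs2_chain_list *cl;
	u32 tmp_used;
	int ret;

	/* open the protection window for di_bh */
	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
				      OCFS2_JOURNAL_ACCESS_WRITE);
	if (ret < 0) {
		mlog_errno(ret);
		return;
	}

	/* modify the buffer while it is journaled */
	cl = (struct ocfs2_chain_list *)&di->id2.i_chain;
	tmp_used = le32_to_cpu(di->id1.bitmap1.i_used);
	di->id1.bitmap1.i_used = cpu_to_le32(tmp_used - num_bits);
	le32_add_cpu(&cl->cl_recs[chain].c_free, num_bits);

	/* close the window: queue di_bh on the transaction */
	ocfs2_journal_dirty(handle, di_bh);
}
```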
On 3/15/24 10:25 AM, Heming Zhao wrote:
> On 3/15/24 09:09, Joseph Qi wrote:
>> Hi,
>>
>> On 3/14/24 3:33 PM, Heming Zhao wrote:
>>> This patch resolves 3 journal-related issues:
>>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>>> protection for modifying dinode.
>>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>>> journal protection for writing fe->i_size.
>>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>>> This adjustment ensures that only content requiring protection is
>>> included in the journal.
>>>
>>
>> Any real issue you've found for above cases?
>
> No. I just found these issues when I was reading code.
>>
>> Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
>> pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
>> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
>> already dirtied before. Since they are in the same transaction, I don't
>> see any problem here.
>
> Yes, they share the same transaction.
> However:
> In ocfs2_alloc_dinode_update_counts(), there already uses a jbd2 access/dirty
> pair to protect metadata. In my view, jbd2_journal_access starts the
> protection stage, jbd2_journal_dirty closes the protection stage. the rollback
> function should start a new access/dirty pair to info jbd2 the di_bh had been
> modified again.
>
Refer jbd2_journal_dirty_metadata():
/*
* fastpath, to avoid expensive locking. If this buffer is already
* on the running transaction's metadata list there is nothing to do.
* Nobody can take it off again because there is a handle open.
* I _think_ we're OK here with SMP barriers - a mistaken decision will
* result in this test being false, so we go in and take the locks.
*/
So IMO, we don't have to access/dirty again since the buffer head is
already added before.
Joseph
On 3/15/24 09:09, Joseph Qi wrote:
> Hi,
>
> On 3/14/24 3:33 PM, Heming Zhao wrote:
>> This patch resolves 3 journal-related issues:
>> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
>> protection for modifying dinode.
>> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
>> journal protection for writing fe->i_size.
>> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
>> This adjustment ensures that only content requiring protection is
>> included in the journal.
>>
>
> Any real issue you've found for above cases?

No. I just found these issues when I was reading code.

>
> Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
> pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
> the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
> already dirtied before. Since they are in the same transaction, I don't
> see any problem here.

Yes, they share the same transaction.
However:
In ocfs2_alloc_dinode_update_counts(), there already uses a jbd2 access/dirty
pair to protect metadata. In my view, jbd2_journal_access starts the
protection stage, jbd2_journal_dirty closes the protection stage. the rollback
function should start a new access/dirty pair to info jbd2 the di_bh had been
modified again.

-Heming
Hi,

On 3/14/24 3:33 PM, Heming Zhao wrote:
> This patch resolves 3 journal-related issues:
> 1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
> protection for modifying dinode.
> 2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
> journal protection for writing fe->i_size.
> 3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
> This adjustment ensures that only content requiring protection is
> included in the journal.
>

Any real issue you've found for above cases?

Take ocfs2_rollback_alloc_dinode_counts() for example, it will always be
pair used with ocfs2_alloc_dinode_update_counts(), more precisely, in
the error case of ocfs2_block_group_set_bits(). And the 'di_bh' is
already dirtied before. Since they are in the same transaction, I don't
see any problem here.

Thanks,
Joseph

> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
> ---
> fs/ocfs2/move_extents.c | 2 +-
> fs/ocfs2/resize.c | 7 +++----
> fs/ocfs2/suballoc.c | 25 +++++++++++++++++--------
> fs/ocfs2/suballoc.h | 3 ++-
> 4 files changed, 23 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
> index 1f9ed117e78b..a631f9cf0c05 100644
> --- a/fs/ocfs2/move_extents.c
> +++ b/fs/ocfs2/move_extents.c
> @@ -687,7 +687,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
> ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
> goal_bit, len);
> if (ret) {
> - ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
> + ocfs2_rollback_alloc_dinode_counts(handle, gb_inode, gb_bh, len,
> le16_to_cpu(gd->bg_chain));
> mlog_errno(ret);
> }
> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
> index d65d43c61857..2c7e3548ae82 100644
> --- a/fs/ocfs2/resize.c
> +++ b/fs/ocfs2/resize.c
> @@ -143,15 +143,14 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
> le32_add_cpu(&cr->c_free, -1 * backups);
> le32_add_cpu(&fe->id1.bitmap1.i_used, backups);
> }
> + le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
> + ocfs2_journal_dirty(handle, bm_bh);
>
> spin_lock(&OCFS2_I(bm_inode)->ip_lock);
> OCFS2_I(bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> - le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
> spin_unlock(&OCFS2_I(bm_inode)->ip_lock);
> i_size_write(bm_inode, le64_to_cpu(fe->i_size));
>
> - ocfs2_journal_dirty(handle, bm_bh);
> -
> out_rollback:
> if (ret < 0) {
> ocfs2_calc_new_backup_super(bm_inode,
> @@ -551,12 +550,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
> le32_add_cpu(&fe->id1.bitmap1.i_used,
> (input->clusters - input->frees) * cl_bpc);
> le32_add_cpu(&fe->i_clusters, input->clusters);
> + le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
>
> ocfs2_journal_dirty(handle, main_bm_bh);
>
> spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
> OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> - le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
> spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
> i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));
>
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..e1168db659df 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -727,17 +727,16 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
> le16_to_cpu(bg->bg_free_bits_count));
> le32_add_cpu(&fe->id1.bitmap1.i_total, le16_to_cpu(bg->bg_bits));
> le32_add_cpu(&fe->i_clusters, le16_to_cpu(cl->cl_cpg));
> -
> + fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
> + le32_to_cpu(fe->i_clusters)));
> ocfs2_journal_dirty(handle, bh);
> + ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);
>
> spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
> OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
> - fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
> - le32_to_cpu(fe->i_clusters)));
> spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
> i_size_write(alloc_inode, le64_to_cpu(fe->i_size));
> alloc_inode->i_blocks = ocfs2_inode_sector_count(alloc_inode);
> - ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);
>
> status = 0;
>
> @@ -1601,19 +1600,28 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
> return ret;
> }
>
> -void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
> +void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
> + struct inode *inode,
> struct buffer_head *di_bh,
> u32 num_bits,
> u16 chain)
> {
> + int ret;
> u32 tmp_used;
> struct ocfs2_dinode *di = (struct ocfs2_dinode *) di_bh->b_data;
> struct ocfs2_chain_list *cl;
>
> + ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
> + OCFS2_JOURNAL_ACCESS_WRITE);
> + if (ret < 0) {
> + mlog_errno(ret);
> + return;
> + }
> cl = (struct ocfs2_chain_list *)&di->id2.i_chain;
> tmp_used = le32_to_cpu(di->id1.bitmap1.i_used);
> di->id1.bitmap1.i_used = cpu_to_le32(tmp_used - num_bits);
> le32_add_cpu(&cl->cl_recs[chain].c_free, num_bits);
> + ocfs2_journal_dirty(handle, di_bh);
> }
>
> static int ocfs2_bg_discontig_fix_by_rec(struct ocfs2_suballoc_result *res,
> @@ -1717,7 +1725,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
> ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
> res->sr_bit_offset, res->sr_bits);
> if (ret < 0) {
> - ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
> + ocfs2_rollback_alloc_dinode_counts(handle,
> + alloc_inode, ac->ac_bh,
> res->sr_bits,
> le16_to_cpu(gd->bg_chain));
> mlog_errno(ret);
> @@ -1851,7 +1860,7 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
> res->sr_bit_offset,
> res->sr_bits);
> if (status < 0) {
> - ocfs2_rollback_alloc_dinode_counts(alloc_inode,
> + ocfs2_rollback_alloc_dinode_counts(handle, alloc_inode,
> ac->ac_bh, res->sr_bits, chain);
> mlog_errno(status);
> goto bail;
> @@ -2165,7 +2174,7 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
> res->sr_bit_offset,
> res->sr_bits);
> if (ret < 0) {
> - ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
> + ocfs2_rollback_alloc_dinode_counts(handle, ac->ac_inode,
> ac->ac_bh, res->sr_bits, chain);
> mlog_errno(ret);
> goto out;
> diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
> index 9c74eace3adc..0c56ddce0752 100644
> --- a/fs/ocfs2/suballoc.h
> +++ b/fs/ocfs2/suballoc.h
> @@ -75,7 +75,8 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
> struct buffer_head *di_bh,
> u32 num_bits,
> u16 chain);
> -void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
> +void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
> + struct inode *inode,
> struct buffer_head *di_bh,
> u32 num_bits,
> u16 chain);
On Thu, 14 Mar 2024 09:57:57 -0700
Alison Schofield <alison.schofield@intel.com> wrote:

> On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> > From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> >
> > [
> > This is a treewide change. I will likely re-create this patch again in
> > the second week of the merge window of v6.9 and submit it then. Hoping
> > to keep the conflicts that it will cause to a minimum.
> > ]

Note, change of plans. I plan on sending this in the next merge window, as
this merge window I have this patch:

https://lore.kernel.org/linux-trace-kernel/20240312113002.00031668@gandalf.local.home/

That will warn if the source string of __string() is different than the
source string of __assign_str(). I want to make sure they are identical
before just dropping one of them.

> > diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> > index bdf117a33744..07ba4e033347 100644
> > --- a/drivers/cxl/core/trace.h
> > +++ b/drivers/cxl/core/trace.h
>
> snip to poison
>
> > @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> > ),
> >
> > TP_fast_assign(
> > - __assign_str(memdev, dev_name(&cxlmd->dev));
> > - __assign_str(host, dev_name(cxlmd->dev.parent));
> > + __assign_str(memdev);
> > + __assign_str(host);
>
> I think I get that the above changes work because the TP_STRUCT__entry for
> these did:
> __string(memdev, dev_name(&cxlmd->dev))
> __string(host, dev_name(cxlmd->dev.parent))

That's the point. They have to be identical or you will likely bug.

The __string(name, src) is used to find the string length of src which
allocates the necessary length on the ring buffer. The
__assign_str(name, src) will copy src into the ring buffer.

Similar to:

len = strlen(src);
buf = malloc(len);
strcpy(buf, str);

Where __string() is strlen() and __assign_str() is strcpy(). It doesn't
make sense to use two different strings, and if you did, it would likely
be a bug.

But the magic behind __string() does much more than just get the length of
the string, and it could easily save the pointer to the string (along with
its length) and have it copy that in the __assign_str() call, making the
src parameter of __assign_str() useless.

> > __entry->serial = cxlmd->cxlds->serial;
> > __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> > __entry->dpa = cxl_poison_record_dpa(record);
> > @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> > __entry->trace_type = trace_type;
> > __entry->flags = flags;
> > if (region) {
> > - __assign_str(region, dev_name(&region->dev));
> > + __assign_str(region);
> > memcpy(__entry->uuid, &region->params.uuid, 16);
> > __entry->hpa = cxl_trace_hpa(region, cxlmd,
> > __entry->dpa);
> > } else {
> > - __assign_str(region, "");
> > + __assign_str(region);
> > memset(__entry->uuid, 0, 16);
> > __entry->hpa = ULLONG_MAX;
>
> For the above 2, there was no helper in TP_STRUCT__entry. A recently
> posted patch is fixing that up to be __string(region, NULL) See [1],
> with the actual assignment still happening in TP_fast_assign.

__string(region, NULL) doesn't make sense. It's like:

len = strlen(NULL);
buf = malloc(len);
strcpy(buf, NULL);

??

I'll reply to that email.

-- Steve

> Does that assign logic need to move to the TP_STRUCT__entry definition
> when you merge these changes? I'm not clear how much logic is able to be
> included, ie like 'C' style code in the TP_STRUCT__entry.
>
> [1]
> https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofield@intel.com/
On Fri, Feb 23, 2024 at 12:56:34PM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
>
> [
> This is a treewide change. I will likely re-create this patch again in
> the second week of the merge window of v6.9 and submit it then. Hoping
> to keep the conflicts that it will cause to a minimum.
> ]
>
> With the rework of how the __string() handles dynamic strings where it
> saves off the source string in field in the helper structure[1], the
> assignment of that value to the trace event field is stored in the helper
> value and does not need to be passed in again.
>
> This means that with:
>
> __string(field, mystring)
>
> Which use to be assigned with __assign_str(field, mystring), no longer
> needs the second parameter and it is unused. With this, __assign_str()
> will now only get a single parameter.
>
> There's over 700 users of __assign_str() and because coccinelle does not
> handle the TRACE_EVENT() macro I ended up using the following sed script:
>
> git grep -l __assign_str | while read a ; do
> sed -e 's/\(__assign_str([^,]*[^ ,]\) *,[^;]*/\1)/' $a > /tmp/test-file;
> mv /tmp/test-file $a;
> done
>
> I then searched for __assign_str() that did not end with ';' as those
> were multi line assignments that the sed script above would fail to catch.
>
> Note, the same updates will need to be done for:
>
> __assign_str_len()
> __assign_rel_str()
> __assign_rel_str_len()
> __assign_bitmask()
> __assign_rel_bitmask()
> __assign_cpumask()
> __assign_rel_cpumask()
>
> [1] https://lore.kernel.org/linux-trace-kernel/20240222211442.634192653@goodmis.org/
>
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
> arch/arm64/kernel/trace-events-emulation.h | 2 +-
> arch/powerpc/include/asm/trace.h | 4 +-
> arch/x86/kvm/trace.h | 2 +-
> drivers/base/regmap/trace.h | 18 +--
> drivers/base/trace.h | 2 +-
> drivers/block/rnbd/rnbd-srv-trace.h | 12 +-
> drivers/cxl/core/trace.h | 24 ++--

snip to CXL

> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index bdf117a33744..07ba4e033347 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h

snip to poison

> @@ -668,8 +668,8 @@ TRACE_EVENT(cxl_poison,
> ),
>
> TP_fast_assign(
> - __assign_str(memdev, dev_name(&cxlmd->dev));
> - __assign_str(host, dev_name(cxlmd->dev.parent));
> + __assign_str(memdev);
> + __assign_str(host);

I think I get that the above changes work because the TP_STRUCT__entry for
these did:
__string(memdev, dev_name(&cxlmd->dev))
__string(host, dev_name(cxlmd->dev.parent))

> __entry->serial = cxlmd->cxlds->serial;
> __entry->overflow_ts = cxl_poison_overflow(flags, overflow_ts);
> __entry->dpa = cxl_poison_record_dpa(record);
> @@ -678,12 +678,12 @@ TRACE_EVENT(cxl_poison,
> __entry->trace_type = trace_type;
> __entry->flags = flags;
> if (region) {
> - __assign_str(region, dev_name(&region->dev));
> + __assign_str(region);
> memcpy(__entry->uuid, &region->params.uuid, 16);
> __entry->hpa = cxl_trace_hpa(region, cxlmd,
> __entry->dpa);
> } else {
> - __assign_str(region, "");
> + __assign_str(region);
> memset(__entry->uuid, 0, 16);
> __entry->hpa = ULLONG_MAX;

For the above 2, there was no helper in TP_STRUCT__entry. A recently
posted patch is fixing that up to be __string(region, NULL) See [1],
with the actual assignment still happening in TP_fast_assign.

Does that assign logic need to move to the TP_STRUCT__entry definition
when you merge these changes? I'm not clear how much logic is able to be
included, ie like 'C' style code in the TP_STRUCT__entry.

[1]
https://lore.kernel.org/linux-cxl/20240314044301.2108650-1-alison.schofield@intel.com/

Thanks for helping,
Alison

> }
This patch resolves 3 journal-related issues:
1. ocfs2_rollback_alloc_dinode_counts() forgets to provide journal
protection for modifying dinode.
2. ocfs2_block_group_alloc() & ocfs2_group_add() forget to provide
journal protection for writing fe->i_size.
3. adjusted journal_dirty scope for ocfs2_update_last_group_and_inode().
This adjustment ensures that only content requiring protection is
included in the journal.

Signed-off-by: Heming Zhao <heming.zhao@suse.com>
---
fs/ocfs2/move_extents.c | 2 +-
fs/ocfs2/resize.c | 7 +++----
fs/ocfs2/suballoc.c | 25 +++++++++++++++++--------
fs/ocfs2/suballoc.h | 3 ++-
4 files changed, 23 insertions(+), 14 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 1f9ed117e78b..a631f9cf0c05 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -687,7 +687,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
goal_bit, len);
if (ret) {
- ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
+ ocfs2_rollback_alloc_dinode_counts(handle, gb_inode, gb_bh, len,
le16_to_cpu(gd->bg_chain));
mlog_errno(ret);
}
diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
index d65d43c61857..2c7e3548ae82 100644
--- a/fs/ocfs2/resize.c
+++ b/fs/ocfs2/resize.c
@@ -143,15 +143,14 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
le32_add_cpu(&cr->c_free, -1 * backups);
le32_add_cpu(&fe->id1.bitmap1.i_used, backups);
}
+ le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
+ ocfs2_journal_dirty(handle, bm_bh);

spin_lock(&OCFS2_I(bm_inode)->ip_lock);
OCFS2_I(bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
- le64_add_cpu(&fe->i_size, (u64)new_clusters << osb->s_clustersize_bits);
spin_unlock(&OCFS2_I(bm_inode)->ip_lock);
i_size_write(bm_inode, le64_to_cpu(fe->i_size));

- ocfs2_journal_dirty(handle, bm_bh);
-
out_rollback:
if (ret < 0) {
ocfs2_calc_new_backup_super(bm_inode,
@@ -551,12 +550,12 @@ int ocfs2_group_add(struct inode *inode, struct ocfs2_new_group_input *input)
le32_add_cpu(&fe->id1.bitmap1.i_used,
(input->clusters - input->frees) * cl_bpc);
le32_add_cpu(&fe->i_clusters, input->clusters);
+ le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);

ocfs2_journal_dirty(handle, main_bm_bh);

spin_lock(&OCFS2_I(main_bm_inode)->ip_lock);
OCFS2_I(main_bm_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
- le64_add_cpu(&fe->i_size, (u64)input->clusters << osb->s_clustersize_bits);
spin_unlock(&OCFS2_I(main_bm_inode)->ip_lock);
i_size_write(main_bm_inode, le64_to_cpu(fe->i_size));

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..e1168db659df 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -727,17 +727,16 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
le16_to_cpu(bg->bg_free_bits_count));
le32_add_cpu(&fe->id1.bitmap1.i_total, le16_to_cpu(bg->bg_bits));
le32_add_cpu(&fe->i_clusters, le16_to_cpu(cl->cl_cpg));
-
+ fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
+ le32_to_cpu(fe->i_clusters)));
ocfs2_journal_dirty(handle, bh);
+ ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);

spin_lock(&OCFS2_I(alloc_inode)->ip_lock);
OCFS2_I(alloc_inode)->ip_clusters = le32_to_cpu(fe->i_clusters);
- fe->i_size = cpu_to_le64(ocfs2_clusters_to_bytes(alloc_inode->i_sb,
- le32_to_cpu(fe->i_clusters)));
spin_unlock(&OCFS2_I(alloc_inode)->ip_lock);
i_size_write(alloc_inode, le64_to_cpu(fe->i_size));
alloc_inode->i_blocks = ocfs2_inode_sector_count(alloc_inode);
- ocfs2_update_inode_fsync_trans(handle, alloc_inode, 0);

status = 0;

@@ -1601,19 +1600,28 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
return ret;
}

-void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
+void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
+ struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain)
{
+ int ret;
u32 tmp_used;
struct ocfs2_dinode *di = (struct ocfs2_dinode *) di_bh->b_data;
struct ocfs2_chain_list *cl;

+ ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode), di_bh,
+ OCFS2_JOURNAL_ACCESS_WRITE);
+ if (ret < 0) {
+ mlog_errno(ret);
+ return;
+ }
cl = (struct ocfs2_chain_list *)&di->id2.i_chain;
tmp_used = le32_to_cpu(di->id1.bitmap1.i_used);
di->id1.bitmap1.i_used = cpu_to_le32(tmp_used - num_bits);
le32_add_cpu(&cl->cl_recs[chain].c_free, num_bits);
+ ocfs2_journal_dirty(handle, di_bh);
}

static int ocfs2_bg_discontig_fix_by_rec(struct ocfs2_suballoc_result *res,
@@ -1717,7 +1725,8 @@ static int ocfs2_search_one_group(struct ocfs2_alloc_context *ac,
ret = ocfs2_block_group_set_bits(handle, alloc_inode, gd, group_bh,
res->sr_bit_offset, res->sr_bits);
if (ret < 0) {
- ocfs2_rollback_alloc_dinode_counts(alloc_inode, ac->ac_bh,
+ ocfs2_rollback_alloc_dinode_counts(handle,
+ alloc_inode, ac->ac_bh,
res->sr_bits,
le16_to_cpu(gd->bg_chain));
mlog_errno(ret);
@@ -1851,7 +1860,7 @@ static int ocfs2_search_chain(struct ocfs2_alloc_context *ac,
res->sr_bit_offset,
res->sr_bits);
if (status < 0) {
- ocfs2_rollback_alloc_dinode_counts(alloc_inode,
+ ocfs2_rollback_alloc_dinode_counts(handle, alloc_inode,
ac->ac_bh, res->sr_bits, chain);
mlog_errno(status);
goto bail;
@@ -2165,7 +2174,7 @@ int ocfs2_claim_new_inode_at_loc(handle_t *handle,
res->sr_bit_offset,
res->sr_bits);
if (ret < 0) {
- ocfs2_rollback_alloc_dinode_counts(ac->ac_inode,
+ ocfs2_rollback_alloc_dinode_counts(handle, ac->ac_inode,
ac->ac_bh, res->sr_bits, chain);
mlog_errno(ret);
goto out;
diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 9c74eace3adc..0c56ddce0752 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -75,7 +75,8 @@ int ocfs2_alloc_dinode_update_counts(struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain);
-void ocfs2_rollback_alloc_dinode_counts(struct inode *inode,
+void ocfs2_rollback_alloc_dinode_counts(handle_t *handle,
+ struct inode *inode,
struct buffer_head *di_bh,
u32 num_bits,
u16 chain);
-- 
2.35.3
On 3/14/24 10:17, Joseph Qi wrote:
> If no bits are zero, ocfs2_find_next_zero_bit() will return max size, so
> check the return value with -1 is meaningless.
> Correct this usage and cleanup the code.
>
> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> ---
> fs/ocfs2/localalloc.c | 19 ++++++-------------
> fs/ocfs2/reservations.c | 2 +-
> fs/ocfs2/suballoc.c | 6 ++----
> 3 files changed, 9 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
> index c803c10dd97e..33aeaaa056d7 100644
> --- a/fs/ocfs2/localalloc.c
> +++ b/fs/ocfs2/localalloc.c
> @@ -863,14 +863,8 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,
>
> numfound = bitoff = startoff = 0;
> left = le32_to_cpu(alloc->id1.bitmap1.i_total);
> - while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) != -1) {
> - if (bitoff == left) {
> - /* mlog(0, "bitoff (%d) == left", bitoff); */
> - break;
> - }
> - /* mlog(0, "Found a zero: bitoff = %d, startoff = %d, "
> - "numfound = %d\n", bitoff, startoff, numfound);*/
> -
> + while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) <
> + left) {
> /* Ok, we found a zero bit... is it contig. or do we
> * start over?*/
> if (bitoff == startoff) {
> @@ -976,9 +970,9 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
> start = count = 0;
> left = le32_to_cpu(alloc->id1.bitmap1.i_total);
>
> - while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
> - != -1) {
> - if ((bit_off < left) && (bit_off == start)) {
> + while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
> + left) {
> + if (bit_off == start) {
> count++;
> start++;
> continue;
> @@ -1002,8 +996,7 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
> goto bail;
> }
> }
> - if (bit_off >= left)
> - break;
> +
> count = 1;
> start = bit_off + 1;
> }
> diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
> index a9d1296d736d..1fe61974d9f0 100644
> --- a/fs/ocfs2/reservations.c
> +++ b/fs/ocfs2/reservations.c
> @@ -414,7 +414,7 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap,
>
> start = search_start;
> while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap->m_bitmap_len,
> - start)) != -1) {
> + start)) < resmap->m_bitmap_len) {
> /* Search reached end of the region */
> if (offset >= (search_start + search_len))
> break;
> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
> index 166c8918c825..961998415308 100644
> --- a/fs/ocfs2/suballoc.c
> +++ b/fs/ocfs2/suballoc.c
> @@ -1290,10 +1290,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
> found = start = best_offset = best_size = 0;
> bitmap = bg->bg_bitmap;
>
> - while((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) != -1) {
> - if (offset == total_bits)
> - break;
> -
> + while ((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) <
> + total_bits) {
> if (!ocfs2_test_bg_bit_allocatable(bg_bh, offset)) {
> /* We found a zero, but we can't use it as it
> * hasn't been put to disk yet! */
I found the same issue when I was writing the IO performance patch.
(So my patch doesn't follow the existing handling style.)
Thank you for taking the time to fix it.
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Thanks,
Heming
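The return-value semantics being fixed here are easy to model in plain C. The demo below is a hypothetical userspace stand-in for ocfs2_find_next_zero_bit() (not the kernel implementation): like the kernel helper, it returns the bitmap size rather than -1 when no zero bit remains, which is why the loop must terminate on '< size' instead of '!= -1'.

```
#include <stdio.h>

/* returns 'size', never -1, when no zero bit exists at or after 'start' */
static unsigned int find_next_zero_bit(const unsigned char *bitmap,
				       unsigned int size, unsigned int start)
{
	for (; start < size; start++)
		if (!(bitmap[start / 8] & (1u << (start % 8))))
			return start;
	return size;
}

int main(void)
{
	unsigned char bitmap[] = { 0xff, 0xbf };	/* only bit 14 is free */
	unsigned int size = 16, start = 0, off;

	/* the corrected idiom from the patch: terminate on '>= size' */
	while ((off = find_next_zero_bit(bitmap, size, start)) < size) {
		printf("free bit at %u\n", off);	/* prints 14 */
		start = off + 1;
	}
	return 0;
}
```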
If no bits are zero, ocfs2_find_next_zero_bit() will return max size, so
check the return value with -1 is meaningless.
Correct this usage and cleanup the code.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
---
fs/ocfs2/localalloc.c | 19 ++++++-------------
fs/ocfs2/reservations.c | 2 +-
fs/ocfs2/suballoc.c | 6 ++----
3 files changed, 9 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c803c10dd97e..33aeaaa056d7 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -863,14 +863,8 @@ static int ocfs2_local_alloc_find_clear_bits(struct ocfs2_super *osb,

numfound = bitoff = startoff = 0;
left = le32_to_cpu(alloc->id1.bitmap1.i_total);
- while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) != -1) {
- if (bitoff == left) {
- /* mlog(0, "bitoff (%d) == left", bitoff); */
- break;
- }
- /* mlog(0, "Found a zero: bitoff = %d, startoff = %d, "
- "numfound = %d\n", bitoff, startoff, numfound);*/
-
+ while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) <
+ left) {
/* Ok, we found a zero bit... is it contig. or do we
* start over?*/
if (bitoff == startoff) {
@@ -976,9 +970,9 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
start = count = 0;
left = le32_to_cpu(alloc->id1.bitmap1.i_total);

- while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start))
- != -1) {
- if ((bit_off < left) && (bit_off == start)) {
+ while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) <
+ left) {
+ if (bit_off == start) {
count++;
start++;
continue;
@@ -1002,8 +996,7 @@ static int ocfs2_sync_local_to_main(struct ocfs2_super *osb,
goto bail;
}
}
- if (bit_off >= left)
- break;
+
count = 1;
start = bit_off + 1;
}
diff --git a/fs/ocfs2/reservations.c b/fs/ocfs2/reservations.c
index a9d1296d736d..1fe61974d9f0 100644
--- a/fs/ocfs2/reservations.c
+++ b/fs/ocfs2/reservations.c
@@ -414,7 +414,7 @@ static int ocfs2_resmap_find_free_bits(struct ocfs2_reservation_map *resmap,

start = search_start;
while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap->m_bitmap_len,
- start)) != -1) {
+ start)) < resmap->m_bitmap_len) {
/* Search reached end of the region */
if (offset >= (search_start + search_len))
break;
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..961998415308 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1290,10 +1290,8 @@ static int ocfs2_block_group_find_clear_bits(struct ocfs2_super *osb,
found = start = best_offset = best_size = 0;
bitmap = bg->bg_bitmap;

- while((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) != -1) {
- if (offset == total_bits)
- break;
-
+ while ((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) <
+ total_bits) {
if (!ocfs2_test_bg_bit_allocatable(bg_bh, offset)) {
/* We found a zero, but we can't use it as it
* hasn't been put to disk yet! */
-- 
2.24.4
On 3/13/24 17:09, Joseph Qi wrote:
>
>
> On 3/13/24 9:34 AM, Heming Zhao wrote:
>> Hi Joseph,
>>
>> Before sending v2 patch, I need to explain ocfs2_find_max_contig_free_bits()
>> for you. See the comment in below.
>>
>> On 3/11/24 19:04, Joseph Qi wrote:
>>>
>>> Hi,
>>>
>>> Please see my comments inline.
>>>
>>> On 3/8/24 9:52 AM, Heming Zhao wrote:
>>>> The group_search function ocfs2_cluster_group_search() should
>>>> bypass groups with insufficient space to avoid unnecessary
>>>> searches.
>>>>
>>>> This patch is particularly useful when ocfs2 is handling huge
>>>> number small files, and volume fragmentation is very high.
>>>> In this case, ocfs2 is busy with looking up available la window
>>>> from //global_bitmap.
>>>>
>>>> This patch introduces a new member in the Group Description (gd)
>>>> struct called 'bg_contig_free_bits', representing the max
>>>> contigous free bits in this gd. When ocfs2 allocates a new
>>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>>> expedite the search process.
>>>>
>>>> ... ...
>>>> +/* caller should initialize contig_bits */
>>>> +void ocfs2_find_max_contig_free_bits(void *bitmap,
>>>> + unsigned int total_bits, int start,
>>>> + unsigned int *contig_bits)
>>>> +{
>>>> + int offset, free_bits;
>>>> +
>>>> + while (start < total_bits) {
>>>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>>>> + if (offset == total_bits)
>>>> + break;
>>>> +
>>>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>>>> + free_bits = start - offset;
>>>> + if (*contig_bits < free_bits)
>>>> + *contig_bits = free_bits;
>>>> + }
>>>
>>> Or we can just return the contig_bits. Something like:
>>> unsigned int ocfs2_find_max_contig_free_bits(void *bitmap,
>>> unsigned int total_bits, int start)
>>> {
>>> int offset, start;
>>> unsigned int contig_bits = 0;
>>>
>>> while () {
>>> ...
>>> }
>>>
>>> return contig_bits;
>>> }
>>
>> the input *contig_bits has two jobs:
>> - return the max contiguous free bits of the bitmap
>> - if *contig_bits is not ZERO, it keeps the max-contig-bits
>> from the last search. When the input 'start' is not zero
>> (meaning the scan starts from the middle of the bitmap), the
>> max-contig-bits of the latter part may be smaller than that of
>> the first part. In this case, the function should keep & return
>> the input *contig_bits.
>>
>
> IC, I don't think it's a good idea to mix up the 2 responsibilities.
> You've hidden the logic that it may change the input contig_bits (not
> the case of first search).
> You can also implement it something like:
>
> contig_bits = ocfs2_find_max_contig_free_bits();
> if (config_bits > sr_max_contig_bits)
> sr_max_contig_bits = contig_bits;
>
> This will be more explicit for an optimized search.
>
> Thanks,
> Joseph
>
Thanks for the explanation; your code logic is better.
- Heming
On 3/13/24 9:34 AM, Heming Zhao wrote:
> Hi Joseph,
>
> Before sending v2 patch, I need to explain ocfs2_find_max_contig_free_bits()
> for you. See the comment in below.
>
> On 3/11/24 19:04, Joseph Qi wrote:
>>
>> Hi,
>>
>> Please see my comments inline.
>>
>> On 3/8/24 9:52 AM, Heming Zhao wrote:
>>> The group_search function ocfs2_cluster_group_search() should
>>> bypass groups with insufficient space to avoid unnecessary
>>> searches.
>>>
>>> This patch is particularly useful when ocfs2 is handling huge
>>> number small files, and volume fragmentation is very high.
>>> In this case, ocfs2 is busy with looking up available la window
>>> from //global_bitmap.
>>>
>>> This patch introduces a new member in the Group Description (gd)
>>> struct called 'bg_contig_free_bits', representing the max
>>> contigous free bits in this gd. When ocfs2 allocates a new
>>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>>> expedite the search process.
>>>
>>> Let's image below path.
>>>
>>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>>
>>> 2. when user delete a large file and trigger
>>> ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>>> unconditionally.
>>>
>>> 3. a write IOs thread run and trigger the worst performance path
>>>
>>> ```
>>> ocfs2_reserve_clusters_with_limit
>>> ocfs2_reserve_local_alloc_bits
>>> ocfs2_local_alloc_slide_window //[1]
>>> + ocfs2_local_alloc_reserve_for_window //[2]
>>> + ocfs2_local_alloc_new_window //[3]
>>> ocfs2_recalc_la_window
>>> ```
>>>
>>> [1]:
>>> will be called when la window bits used up.
>>>
>>> [2]:
>>> under la state is ENABLED, and this func only check global_bitmap
>>> free bits, it will succeed in general.
>>>
>>> [3]:
>>> will use the default la window size to search clusters then fail.
>>> ocfs2_recalc_la_window attempts other la window sizes.
>>> the timing complexity is O(n^4), resulting in a significant time
>>> cost for scanning global bitmap. This leads to a dramatic slowdown
>>> in write I/Os (e.g., user space 'dd').
>>>
>>> i.e.
>>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>>> la window default size: 106MB.
>>> The partition is fragmentation by creating & deleting huge mount of
>>> small files.
>>>
>>> before this patch, the timing of [3] should be
>>> (the number got from real world):
>>> - la window size change order (size: MB):
>>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>>> only 0.8MB succeed, 0.8MB also triggers la window to disable.
>>> ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>>> runs in worst case.
>>> - group chain number: 242
>>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>>> - each chain has 49 block group
>>> ocfs2_search_chain calls while-loop 49 times
>>> - each bg has 32256 blocks
>>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>>> for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>>> (32256/64) (this is not worst value) for timing calucation.
>>>
>>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>>
>>> In the worst case, user space writes 1MB data will trigger 42M scanning
>>> times.
>>>
>>> under this patch, the timing is '7*242*49 = 83006', reduce by three
>>> orders of magnitude.
>>>
>>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>>> ---
>>> fs/ocfs2/move_extents.c | 6 +-
>>> fs/ocfs2/ocfs2_fs.h | 4 +-
>>> fs/ocfs2/resize.c | 7 +++
>>> fs/ocfs2/suballoc.c | 120 ++++++++++++++++++++++++++++------------
>>> fs/ocfs2/suballoc.h | 28 +++++++++-
>>> 5 files changed, 122 insertions(+), 43 deletions(-)
>>>
>>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>>> index 1f9ed117e78b..6b9753b8c6fb 100644
>>> --- a/fs/ocfs2/move_extents.c
>>> +++ b/fs/ocfs2/move_extents.c
>>> @@ -576,6 +576,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>> u32 move_max_hop = ocfs2_blocks_to_clusters(inode->i_sb,
>>> context->range->me_threshold);
>>> u64 phys_blkno, new_phys_blkno;
>>> + struct ocfs2_suballoc_result res = {0};
>>>
>>> phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
>>>
>>> @@ -684,8 +685,9 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>>> goto out_commit;
>>> }
>>>
>>> - ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>>> - goal_bit, len);
>>> + res.sr_bit_offset = goal_bit;
>>> + res.sr_bits = len;
>>> + ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh, &res, 0);
>>> if (ret) {
>>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>>> le16_to_cpu(gd->bg_chain));
>>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>>> index 7aebdbf5cc0a..eae18d772e93 100644
>>> --- a/fs/ocfs2/ocfs2_fs.h
>>> +++ b/fs/ocfs2/ocfs2_fs.h
>>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>>> __le16 bg_free_bits_count; /* Free bits count */
>>> __le16 bg_chain; /* What chain I am in. */
>>> /*10*/ __le32 bg_generation;
>>> - __le32 bg_reserved1;
>>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>>> __le64 bg_next_group; /* Next group in my list, in blocks */
>>> /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in blocks */
>>> __le64 bg_blkno; /* Offset on disk, in blocks */
>>> /*30*/ struct ocfs2_block_check bg_check; /* Error checking */
>>> - __le64 bg_reserved2;
>>> + __le64 bg_reserved1;
>>> /*40*/ union {
>>> DECLARE_FLEX_ARRAY(__u8, bg_bitmap);
>>> struct {
>>> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
>>> index d65d43c61857..ab30518d5f04 100644
>>> --- a/fs/ocfs2/resize.c
>>> +++ b/fs/ocfs2/resize.c
>>> @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>> u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
>>> u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
>>> u16 old_bg_clusters;
>>> + u32 contig_bits = 0, old_bg_contig_free_bits;
>>>
>>> trace_ocfs2_update_last_group_and_inode(new_clusters, first_new_cluster);
>>>
>>> @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>> le16_add_cpu(&group->bg_free_bits_count, -1 * backups);
>>> }
>>>
>>> + ocfs2_find_max_contig_free_bits(group->bg_bitmap,
>>> + group->bg_bits, 0, &contig_bits);
>>> + old_bg_contig_free_bits = group->bg_contig_free_bits;
>>> + group->bg_contig_free_bits = cpu_to_le32(contig_bits);
>>> +
>>> ocfs2_journal_dirty(handle, group_bh);
>>>
>>> /* update the inode accordingly. */
>>> @@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>>> le16_add_cpu(&group->bg_free_bits_count, backups);
>>> le16_add_cpu(&group->bg_bits, -1 * num_bits);
>>> le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
>>> + group->bg_contig_free_bits = old_bg_contig_free_bits;
>>> }
>>> out:
>>> if (ret)
>>> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
>>> index 166c8918c825..df149c3b0eed 100644
>>> --- a/fs/ocfs2/suballoc.c
>>> +++ b/fs/ocfs2/suballoc.c
>>> @@ -37,21 +37,6 @@
>>>
>>> #define OCFS2_MAX_TO_STEAL 1024
>>>
>>> -struct ocfs2_suballoc_result {
>>> - u64 sr_bg_blkno; /* The bg we allocated from. Set
>>> - to 0 when a block group is
>>> - contiguous. */
>>> - u64 sr_bg_stable_blkno; /*
>>> - * Doesn't change, always
>>> - * set to target block
>>> - * group descriptor
>>> - * block.
>>> - */
>>> - u64 sr_blkno; /* The first allocated block */
>>> - unsigned int sr_bit_offset; /* The bit in the bg */
>>> - unsigned int sr_bits; /* How many bits we claimed */
>>> -};
>>> -
>>> static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
>>> {
>>> if (res->sr_blkno == 0)
>>> @@ -1272,6 +1257,25 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
>>> return ret;
>>> }
>>>
>>> +/* caller should initialize contig_bits */
>>> +void ocfs2_find_max_contig_free_bits(void *bitmap,
>>> + unsigned int total_bits, int start,
>>> + unsigned int *contig_bits)
>>> +{
>>> + int offset, free_bits;
>>> +
>>> + while (start < total_bits) {
>>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>>> + if (offset == total_bits)
>>> + break;
>>> +
>>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>>> + free_bits = start - offset;
>>> + if (*contig_bits < free_bits)
>>> + *contig_bits = free_bits;
>>> + }
>>
>> Or we can just return the contig_bits. Something like:
>> unsigned int ocfs2_find_max_contig_free_bits(void *bitmap,
>> unsigned int total_bits, int start)
>> {
>> int offset, start;
>> unsigned int contig_bits = 0;
>>
>> while () {
>> ...
>> }
>>
>> return contig_bits;
>> }
>
> the input *contig_bits has two jobs:
> - return the max contigous free bits of bitmap
> - if *contig_bits is not ZERO, it keeps the max-contig-bits
> from last search. When the input 'start' not zero
> (means start from middle of bitmap), the max-contig-bits
> from latter part of bitmap may smaller than first part.
> at this case, function should keep & return the input
> *contig_bits.
>

IC, I don't think it's a good idea to mix up the 2 responsibilities.
You've hidden the logic that it may change the input contig_bits (not
the case of first search).
You can also implement it something like:

contig_bits = ocfs2_find_max_contig_free_bits();
if (config_bits > sr_max_contig_bits)
sr_max_contig_bits = contig_bits;

This will be more explicit for an optimized search.

Thanks,
Joseph

> the view of ocfs2_find_max_contig_free_bits():
>
> front part | latter part
> [-------------+------------------] bitmap
> 0 'start' total_bits
>
> the call stack:
>
> ocfs2_search_one_group
> + ac->ac_group_search
> | ocfs2_cluster_group_search
> | ocfs2_block_group_find_clear_bits
> | + scan area [0, res->sr_bit_offset + res->sr_bits - 1]
> | | /* zhm: save the second max-contig-bits.
> | | * if it doesn't exist, the value is 0. */
> | + res->sr_max_contig_bits = prev_best_size;
> |
> + ocfs2_block_group_set_bits
> ocfs2_find_max_contig_free_bits(...,
> start, /* zhm: may not ZERO, help to speed up the search */
> res->sr_max_contig_bits) /* zhm: the optimal value may have been reached */
>
> So we should keep ocfs2_find_max_contig_free_bits() with a return type of 'void'.
>
Hi Joseph,
Before sending the v2 patch, I need to explain ocfs2_find_max_contig_free_bits()
to you. See the comments inline below.
On 3/11/24 19:04, Joseph Qi wrote:
>
> Hi,
>
> Please see my comments inline.
>
> On 3/8/24 9:52 AM, Heming Zhao wrote:
>> The group_search function ocfs2_cluster_group_search() should
>> bypass groups with insufficient space to avoid unnecessary
>> searches.
>>
>> This patch is particularly useful when ocfs2 is handling huge
>> number small files, and volume fragmentation is very high.
>> In this case, ocfs2 is busy with looking up available la window
>> from //global_bitmap.
>>
>> This patch introduces a new member in the Group Description (gd)
>> struct called 'bg_contig_free_bits', representing the max
>> contigous free bits in this gd. When ocfs2 allocates a new
>> la window from //global_bitmap, 'bg_contig_free_bits' helps
>> expedite the search process.
>>
>> Let's image below path.
>>
>> 1. la state (->local_alloc_state) is set THROTTLED or DISABLED.
>>
>> 2. when user delete a large file and trigger
>> ocfs2_local_alloc_seen_free_bits set osb->local_alloc_state
>> unconditionally.
>>
>> 3. a write IOs thread run and trigger the worst performance path
>>
>> ```
>> ocfs2_reserve_clusters_with_limit
>> ocfs2_reserve_local_alloc_bits
>> ocfs2_local_alloc_slide_window //[1]
>> + ocfs2_local_alloc_reserve_for_window //[2]
>> + ocfs2_local_alloc_new_window //[3]
>> ocfs2_recalc_la_window
>> ```
>>
>> [1]:
>> will be called when la window bits used up.
>>
>> [2]:
>> under la state is ENABLED, and this func only check global_bitmap
>> free bits, it will succeed in general.
>>
>> [3]:
>> will use the default la window size to search clusters then fail.
>> ocfs2_recalc_la_window attempts other la window sizes.
>> the timing complexity is O(n^4), resulting in a significant time
>> cost for scanning global bitmap. This leads to a dramatic slowdown
>> in write I/Os (e.g., user space 'dd').
>>
>> i.e.
>> an ocfs2 partition size: 1.45TB, cluster size: 4KB,
>> la window default size: 106MB.
>> The partition is fragmentation by creating & deleting huge mount of
>> small files.
>>
>> before this patch, the timing of [3] should be
>> (the number got from real world):
>> - la window size change order (size: MB):
>> 106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
>> only 0.8MB succeed, 0.8MB also triggers la window to disable.
>> ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
>> runs in worst case.
>> - group chain number: 242
>> ocfs2_claim_suballoc_bits calls for-loop 242 times
>> - each chain has 49 block group
>> ocfs2_search_chain calls while-loop 49 times
>> - each bg has 32256 blocks
>> ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
>> for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
>> (32256/64) (this is not worst value) for timing calucation.
>>
>> the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)
>>
>> In the worst case, user space writes 1MB data will trigger 42M scanning
>> times.
>>
>> under this patch, the timing is '7*242*49 = 83006', reduce by three
>> orders of magnitude.
>>
>> Signed-off-by: Heming Zhao <heming.zhao@suse.com>
>> ---
>> fs/ocfs2/move_extents.c | 6 +-
>> fs/ocfs2/ocfs2_fs.h | 4 +-
>> fs/ocfs2/resize.c | 7 +++
>> fs/ocfs2/suballoc.c | 120 ++++++++++++++++++++++++++++------------
>> fs/ocfs2/suballoc.h | 28 +++++++++-
>> 5 files changed, 122 insertions(+), 43 deletions(-)
>>
>> diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
>> index 1f9ed117e78b..6b9753b8c6fb 100644
>> --- a/fs/ocfs2/move_extents.c
>> +++ b/fs/ocfs2/move_extents.c
>> @@ -576,6 +576,7 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>> u32 move_max_hop = ocfs2_blocks_to_clusters(inode->i_sb,
>> context->range->me_threshold);
>> u64 phys_blkno, new_phys_blkno;
>> + struct ocfs2_suballoc_result res = {0};
>>
>> phys_blkno = ocfs2_clusters_to_blocks(inode->i_sb, phys_cpos);
>>
>> @@ -684,8 +685,9 @@ static int ocfs2_move_extent(struct ocfs2_move_extents_context *context,
>> goto out_commit;
>> }
>>
>> - ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh,
>> - goal_bit, len);
>> + res.sr_bit_offset = goal_bit;
>> + res.sr_bits = len;
>> + ret = ocfs2_block_group_set_bits(handle, gb_inode, gd, gd_bh, &res, 0);
>> if (ret) {
>> ocfs2_rollback_alloc_dinode_counts(gb_inode, gb_bh, len,
>> le16_to_cpu(gd->bg_chain));
>> diff --git a/fs/ocfs2/ocfs2_fs.h b/fs/ocfs2/ocfs2_fs.h
>> index 7aebdbf5cc0a..eae18d772e93 100644
>> --- a/fs/ocfs2/ocfs2_fs.h
>> +++ b/fs/ocfs2/ocfs2_fs.h
>> @@ -883,14 +883,14 @@ struct ocfs2_group_desc
>> __le16 bg_free_bits_count; /* Free bits count */
>> __le16 bg_chain; /* What chain I am in. */
>> /*10*/ __le32 bg_generation;
>> - __le32 bg_reserved1;
>> + __le32 bg_contig_free_bits; /* max contig free bits length */
>> __le64 bg_next_group; /* Next group in my list, in
>> blocks */
>> /*20*/ __le64 bg_parent_dinode; /* dinode which owns me, in
>> blocks */
>> __le64 bg_blkno; /* Offset on disk, in blocks */
>> /*30*/ struct ocfs2_block_check bg_check; /* Error checking */
>> - __le64 bg_reserved2;
>> + __le64 bg_reserved1;
>> /*40*/ union {
>> DECLARE_FLEX_ARRAY(__u8, bg_bitmap);
>> struct {
>> diff --git a/fs/ocfs2/resize.c b/fs/ocfs2/resize.c
>> index d65d43c61857..ab30518d5f04 100644
>> --- a/fs/ocfs2/resize.c
>> +++ b/fs/ocfs2/resize.c
>> @@ -91,6 +91,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>> u16 cl_bpc = le16_to_cpu(cl->cl_bpc);
>> u16 cl_cpg = le16_to_cpu(cl->cl_cpg);
>> u16 old_bg_clusters;
>> + u32 contig_bits = 0, old_bg_contig_free_bits;
>>
>> trace_ocfs2_update_last_group_and_inode(new_clusters,
>> first_new_cluster);
>> @@ -122,6 +123,11 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>> le16_add_cpu(&group->bg_free_bits_count, -1 * backups);
>> }
>>
>> + ocfs2_find_max_contig_free_bits(group->bg_bitmap,
>> + group->bg_bits, 0, &contig_bits);
>> + old_bg_contig_free_bits = group->bg_contig_free_bits;
>> + group->bg_contig_free_bits = cpu_to_le32(contig_bits);
>> +
>> ocfs2_journal_dirty(handle, group_bh);
>>
>> /* update the inode accordingly. */
>> @@ -160,6 +166,7 @@ static int ocfs2_update_last_group_and_inode(handle_t *handle,
>> le16_add_cpu(&group->bg_free_bits_count, backups);
>> le16_add_cpu(&group->bg_bits, -1 * num_bits);
>> le16_add_cpu(&group->bg_free_bits_count, -1 * num_bits);
>> + group->bg_contig_free_bits = old_bg_contig_free_bits;
>> }
>> out:
>> if (ret)
>> diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
>> index 166c8918c825..df149c3b0eed 100644
>> --- a/fs/ocfs2/suballoc.c
>> +++ b/fs/ocfs2/suballoc.c
>> @@ -37,21 +37,6 @@
>>
>> #define OCFS2_MAX_TO_STEAL 1024
>>
>> -struct ocfs2_suballoc_result {
>> - u64 sr_bg_blkno; /* The bg we allocated from. Set
>> - to 0 when a block group is
>> - contiguous. */
>> - u64 sr_bg_stable_blkno; /*
>> - * Doesn't change, always
>> - * set to target block
>> - * group descriptor
>> - * block.
>> - */
>> - u64 sr_blkno; /* The first allocated block */
>> - unsigned int sr_bit_offset; /* The bit in the bg */
>> - unsigned int sr_bits; /* How many bits we claimed */
>> -};
>> -
>> static u64 ocfs2_group_from_res(struct ocfs2_suballoc_result *res)
>> {
>> if (res->sr_blkno == 0)
>> @@ -1272,6 +1257,25 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
>> return ret;
>> }
>>
>> +/* caller should initialize contig_bits */
>> +void ocfs2_find_max_contig_free_bits(void *bitmap,
>> + unsigned int total_bits, int start,
>> + unsigned int *contig_bits)
>> +{
>> + int offset, free_bits;
>> +
>> + while (start < total_bits) {
>> + offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start);
>> + if (offset == total_bits)
>> + break;
>> +
>> + start = ocfs2_find_next_bit(bitmap, total_bits, offset);
>> + free_bits = start - offset;
>> + if (*contig_bits < free_bits)
>> + *contig_bits = free_bits;
>> + }
>
> Or we can just return the contig_bits. Something like:
> unsigned int ocfs2_find_max_contig_free_bits(void *bitmap,
> unsigned int total_bits, int start)
> {
> int offset, start;
> unsigned int contig_bits = 0;
>
> while () {
> ...
> }
>
> return contig_bits;
> }
the input *contig_bits has two jobs:
- return the max contiguous free bits of the bitmap
- if *contig_bits is not ZERO, it keeps the max-contig-bits
from the last search. When the input 'start' is not zero
(meaning the scan starts from the middle of the bitmap), the
max-contig-bits of the latter part may be smaller than that of the
first part. In this case, the function should keep & return the
input *contig_bits.
the view of ocfs2_find_max_contig_free_bits():
front part | latter part
[-------------+------------------] bitmap
0 'start' total_bits
the call stack:
ocfs2_search_one_group
+ ac->ac_group_search
| ocfs2_cluster_group_search
| ocfs2_block_group_find_clear_bits
| + scan area [0, res->sr_bit_offset + res->sr_bits - 1]
| | /* zhm: save the second max-contig-bits.
| | * if it doesn't exist, the value is 0. */
| + res->sr_max_contig_bits = prev_best_size;
|
+ ocfs2_block_group_set_bits
ocfs2_find_max_contig_free_bits(...,
start, /* zhm: may not ZERO, help to speed up the search */
res->sr_max_contig_bits) /* zhm: the optimal value may have been reached */
So we should keep ocfs2_find_max_contig_free_bits() with a return type of 'void'.
-Heming
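For comparison, here is a hedged userspace model of the v2 shape Joseph proposed: the helper returns the longest free run in [start, total_bits), and the caller, not the helper, merges it with the best run seen so far. The bit helpers are simplified stand-ins for ocfs2_find_next_zero_bit()/ocfs2_find_next_bit(), not kernel code.

```
#include <stdio.h>

/* find the next bit with value 'want' (0 or 1); returns 'total' if none */
static unsigned int find_next(const unsigned char *bm, unsigned int total,
			      unsigned int start, int want)
{
	for (; start < total; start++) {
		int set = !!(bm[start / 8] & (1u << (start % 8)));
		if (set == want)
			return start;
	}
	return total;
}

/* v2 shape: return the longest zero run in [start, total) */
static unsigned int find_max_contig_free_bits(const unsigned char *bm,
					      unsigned int total,
					      unsigned int start)
{
	unsigned int offset, contig = 0;

	while (start < total) {
		offset = find_next(bm, total, start, 0);
		if (offset == total)
			break;
		start = find_next(bm, total, offset, 1);
		if (start - offset > contig)
			contig = start - offset;
	}
	return contig;
}

int main(void)
{
	/* bits 0-7 used, 8-10 free, bit 11 used, 12-15 free */
	unsigned char bm[] = { 0xff, 0x08 };
	unsigned int max = 3;	/* best run already known from the front part */
	unsigned int contig;

	/* rescan only the latter part, then merge in the caller */
	contig = find_max_contig_free_bits(bm, 16, 8);
	if (contig > max)
		max = contig;
	printf("max contig free bits: %u\n", max);	/* prints 4 */
	return 0;
}
```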
On Tue, Mar 12, 2024 at 03:55:46PM -0400, Alexander Aring wrote:
> Then you can reply to the mail with:
>
> Tested-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>
>
> that would help to bring the fix upstream.

Sure, replied to the other thread.

> I see, I currently have plans/ideas to add namespace support into
> kernel DLM that might help you by extending your testing. The general
> idea would be that you can do something like:
>
> ...
> ip netns exec node1 mount /dev/fake_shared_block /mnt/node1
> ip netns exec node2 mount /dev/fake_shared_block /mnt/node2
> ip netns exec node3 mount /dev/fake_shared_block /mnt/node3
> ...

Yeah, that would definitely be cool. And thanks for the pointers, I will
definitely try to use namespaces for some simpler corosync tests first.

> It also requires a ocfs2-tools tool at a specific location:
>
> /sbin/ocfs2_hb_ctl
>
> maybe this is your issue why it wasn't working. My kernel was
> complaining about it because it was missing in /sbin/

That binary is already in the Debian package. I think the error fsck
reported was something like "Could not create domain".

-- 
Valentin
Hi,
On Tue, Mar 12, 2024 at 3:55 PM Alexander Aring <aahringo@redhat.com> wrote:
>
> Hi,
>
> On Tue, Mar 12, 2024 at 2:37 PM Valentin Vidić
> <vvidic@valentin-vidic.from.hr> wrote:
> >
> > On Tue, Mar 12, 2024 at 01:55:54PM -0400, Alexander Aring wrote:
> > > I was able to reproduce the issue and sent a fix out.
> >
> > Yay, thanks a lot for testing, I can confirm the patches fix the problem
> > for me!
> >
>
> Then you can reply to the mail with:
>
> Tested-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>
>
> that would help to bring the fix upstream.
>
> > > If you want to extend your debian testing using different approach
> > > (not libdlm) this can be done by:
> > >
> > > mount -t ocfs2_dlmfs none /dlm
> >
> > I tried that but fsck than fails to make a lock, so I think ocfs2_dlmfs
> > is only relevant when used with o2cb cluster type instead of corosync.
> >
>
> Yes, I think you are right. It is a replacement for
> corosync/dlm_controld handling. It seems I had both setup working at
> the same time, but it still uses kernel DLM locking.
It also requires a ocfs2-tools tool at a specific location:
/sbin/ocfs2_hb_ctl
maybe this is your issue why it wasn't working. My kernel was
complaining about it because it was missing in /sbin/
- Alex
Hi,

On Tue, Mar 12, 2024 at 2:37 PM Valentin Vidić
<vvidic@valentin-vidic.from.hr> wrote:
>
> On Tue, Mar 12, 2024 at 01:55:54PM -0400, Alexander Aring wrote:
> > I was able to reproduce the issue and sent a fix out.
>
> Yay, thanks a lot for testing, I can confirm the patches fix the problem
> for me!
>

Then you can reply to the mail with:

Tested-by: Valentin Vidić <vvidic@valentin-vidic.from.hr>

that would help to bring the fix upstream.

> > If you want to extend your debian testing using different approach
> > (not libdlm) this can be done by:
> >
> > mount -t ocfs2_dlmfs none /dlm
>
> I tried that but fsck than fails to make a lock, so I think ocfs2_dlmfs
> is only relevant when used with o2cb cluster type instead of corosync.
>

Yes, I think you are right. It is a replacement for
corosync/dlm_controld handling. It seems I had both setup working at
the same time, but it still uses kernel DLM locking.

> > I am curious about your testing at debian cluster, do you also test gfs2?
>
> Yes, a similar test for gfs2 would be this:
>
> https://salsa.debian.org/ha-team/gfs2-utils/-/blob/master/debian/tests/corosync?ref_type=heads
>
> It is not really a cluster test since it only runs on one KVM
> machine. But as a package test it works well because even for
> 1-node cluster all the code (userspace and kernel) still needs
> to work correctly.
>

I see, I currently have plans/ideas to add namespace support into
kernel DLM that might help you by extending your testing. The general
idea would be that you can do something like:

...
ip netns exec node1 mount /dev/fake_shared_block /mnt/node1
ip netns exec node2 mount /dev/fake_shared_block /mnt/node2
ip netns exec node3 mount /dev/fake_shared_block /mnt/node3
...

I hope that somehow explains the idea behind it. Setup your cluster
networking with namespaces, e.g. selftests of Linux kernel is doing it
[0] and the DLM lockspaces are separated by net namespaces*. It
"should" run like a cluster with several machines, it is only to do
more testing and we don't need to synchronize tests over the network.
However it will _not_ replace real cluster testing but it is a
starting point to make testing "simpler". Somewhere is even a big fat
"software" switch you can run wireshark on it to see all cluster
traffic going on.

* In theory I am only abusing net namespaces for separate DLM
lockspaces, but I don't want to introduce another kind of namespace
now. However net namespaces fit here because it is necessary for a per
node networking setup anyway.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/icmp.sh