ocfs2-devel.lists.linux.dev archive mirror
* [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests
@ 2024-04-02  1:46 Su Yue
  2024-04-02  1:46 ` [PATCH v2 1/4] ocfs2: return real error code in ocfs2_dio_wr_get_block Su Yue
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Su Yue @ 2024-04-02  1:46 UTC (permalink / raw)
  To: ocfs2-devel; +Cc: joseph.qi, akpm, Su Yue

The patchset is to fix some wrong behavior of ocfs2 exposed
by fstests.

Patches 1 and 2 address AIO+DIO vs. hole punching (generic/300).

Patch 3 fixes an inode link count mismatch after power failure
(generic/040, 041, 104, 107, 336).

Patch 4 fixes a wrong atime with the relatime mount option (generic/192).

Changelog:
v2:
  - Fix typos and amend commit message about the functions called
  by ocfs2_dio_wr_get_block in patch 1.
  - Add rvb to patch 2,3,4.
  
Su Yue (4):
  ocfs2: return real error code in ocfs2_dio_wr_get_block
  ocfs2: fix races between hole punching and AIO+DIO
  ocfs2: update inode fsync transaction id in ocfs2_unlink and
    ocfs2_link
  ocfs2: use coarse time for new created files

 fs/ocfs2/aops.c  | 2 --
 fs/ocfs2/file.c  | 2 ++
 fs/ocfs2/namei.c | 4 +++-
 3 files changed, 5 insertions(+), 3 deletions(-)

-- 
2.44.0



* [PATCH v2 1/4] ocfs2: return real error code in ocfs2_dio_wr_get_block
  2024-04-02  1:46 [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Su Yue
@ 2024-04-02  1:46 ` Su Yue
  2024-04-02  1:51   ` Joseph Qi
  2024-04-02  1:46 ` [PATCH v2 2/4] ocfs2: fix races between hole punching and AIO+DIO Su Yue
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 8+ messages in thread
From: Su Yue @ 2024-04-02  1:46 UTC (permalink / raw)
  To: ocfs2-devel; +Cc: joseph.qi, akpm, Su Yue

ocfs2_dio_wr_get_block always returns -EIO on error.
However, some programs expect the real error code from direct I/O.
For example, tools like fio treat -ENOSPC as an expected code while
running stress jobs, and quota tools expect -EDQUOT when the disk quota
is exceeded.

-EIO is too strong a return code in the dio path.
The caller of ocfs2_dio_wr_get_block is __blockdev_direct_IO, which is
widely used and handles error codes well. I have checked the functions
called by ocfs2_dio_wr_get_block, and their return codes look good and
clear, so I think it is safe to let ocfs2_dio_wr_get_block return the
real error code.
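
As a rough illustration of the change, the sketch below contrasts the old and
new exit paths and shows how a caller could tell an expected limit apart from
a real failure. It is standalone C, not kernel code; the names
dio_get_block_exit_old, dio_get_block_exit_new, dio_verdict and
classify_dio_result are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <errno.h>

/* Exit path before the patch: every failure is flattened to -EIO. */
static int dio_get_block_exit_old(int ret)
{
	if (ret < 0)
		ret = -EIO;
	return ret;
}

/* Exit path after the patch: the callee's error code is passed through. */
static int dio_get_block_exit_new(int ret)
{
	return ret;
}

/* Hypothetical caller-side policy, loosely modelled on a stress tool. */
enum dio_verdict { DIO_OK, DIO_EXPECTED_LIMIT, DIO_FAILURE };

static enum dio_verdict classify_dio_result(int ret)
{
	assert(ret <= 0);	/* kernel style: 0 or a negative errno */
	switch (ret) {
	case 0:
		return DIO_OK;
	case -ENOSPC:
	case -EDQUOT:
		return DIO_EXPECTED_LIMIT;	/* disk full or over quota */
	default:
		return DIO_FAILURE;		/* includes -EIO */
	}
}
```

With the old exit path, -ENOSPC and -EDQUOT both reach the caller as -EIO, so
a policy like classify_dio_result can only report a failure.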

Signed-off-by: Su Yue <glass.su@suse.com>
---
 fs/ocfs2/aops.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index b82185075de7..f0467d3b3c88 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -2283,8 +2283,6 @@ static int ocfs2_dio_wr_get_block(struct inode *inode, sector_t iblock,
 	ocfs2_inode_unlock(inode, 1);
 	brelse(di_bh);
 out:
-	if (ret < 0)
-		ret = -EIO;
 	return ret;
 }
 
-- 
2.44.0



* [PATCH v2 2/4] ocfs2: fix races between hole punching and AIO+DIO
  2024-04-02  1:46 [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Su Yue
  2024-04-02  1:46 ` [PATCH v2 1/4] ocfs2: return real error code in ocfs2_dio_wr_get_block Su Yue
@ 2024-04-02  1:46 ` Su Yue
  2024-04-02  1:46 ` [PATCH v2 3/4] ocfs2: update inode fsync transaction id in ocfs2_unlink and ocfs2_link Su Yue
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Su Yue @ 2024-04-02  1:46 UTC (permalink / raw)
  To: ocfs2-devel; +Cc: joseph.qi, akpm, Su Yue

After commit "ocfs2: return real error code in ocfs2_dio_wr_get_block",
fstests generic/300 went from always failing to sometimes failing:

========================================================================
[  473.293420 ] run fstests generic/300

[  475.296983 ] JBD2: Ignoring recovery information on journal
[  475.302473 ] ocfs2: Mounting device (253,1) on (node local, slot 0)
with ordered data mode.
[  494.290998 ] OCFS2: ERROR (device dm-1): ocfs2_change_extent_flag:
Owner 5668 has an extent at cpos 78723 which can no longer be found
[  494.291609 ] On-disk corruption discovered. Please run fsck.ocfs2
once the filesystem is unmounted.
[  494.292018 ] OCFS2: File system is now read-only.
[  494.292224 ] (kworker/19:11,2628,19):ocfs2_mark_extent_written:5272
ERROR: status = -30
[  494.292602 ] (kworker/19:11,2628,19):ocfs2_dio_end_io_write:2374
ERROR: status = -3
fio: io_u error on file /mnt/scratch/racer: Read-only file system: write
offset=460849152, buflen=131072
=========================================================================

In __blockdev_direct_IO, ocfs2_dio_wr_get_block is called to add
unwritten extents to a list. The extents are also inserted into the
extent tree in ocfs2_write_begin_nolock. Then another thread calls
fallocate to punch a hole at one of the unwritten extents. The extent at
cpos is removed by ocfs2_remove_extent(). In the end-io worker thread,
ocfs2_search_extent_list finds that there is no such extent at the cpos.

    T1                        T2                T3
                              inode lock
                                ...
                                insert extents
                                ...
                              inode unlock
ocfs2_fallocate
 __ocfs2_change_file_space
  inode lock
  lock ip_alloc_sem
  ocfs2_remove_inode_range inode
   ocfs2_remove_btree_range
    ocfs2_remove_extent
    ^---remove the extent at cpos 78723
  ...
  unlock ip_alloc_sem
  inode unlock
                                       ocfs2_dio_end_io
                                        ocfs2_dio_end_io_write
                                         lock ip_alloc_sem
                                         ocfs2_mark_extent_written
                                          ocfs2_change_extent_flag
                                           ocfs2_search_extent_list
                                           ^---failed to find extent
                                          ...
                                          unlock ip_alloc_sem

Most filesystems do not allow fallocate to race with AIO+DIO, so fix
this by waiting for all in-flight dio before fallocate/punch_hole, as
ext4 does.
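
The ordering problem can be sketched with a single-threaded toy model.
This is standalone C, not kernel code: struct toy_inode and the toy_*
helpers are hypothetical, and -EAGAIN merely stands in for "the caller
would sleep here", which is what a real wait would do.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define TOY_CLUSTERS 8

/* Toy stand-in for an inode: an extent map plus an in-flight dio counter. */
struct toy_inode {
	bool extent_present[TOY_CLUSTERS];
	bool extent_written[TOY_CLUSTERS];
	int dio_in_flight;
};

/* Submission side: add an unwritten extent and account for the dio. */
static void toy_dio_submit(struct toy_inode *ino, unsigned int cpos)
{
	assert(cpos < TOY_CLUSTERS);
	ino->extent_present[cpos] = true;
	ino->extent_written[cpos] = false;
	ino->dio_in_flight++;
}

/* Completion side: mark the extent written, or fail if it has vanished. */
static int toy_dio_end_io(struct toy_inode *ino, unsigned int cpos)
{
	assert(cpos < TOY_CLUSTERS);
	ino->dio_in_flight--;
	if (!ino->extent_present[cpos])
		return -EROFS;	/* mirrors the "status = -30" in the log */
	ino->extent_written[cpos] = true;
	return 0;
}

/*
 * Hole punch. With wait_for_dio set, the punch may not proceed while dio
 * is in flight; -EAGAIN here stands for "the caller would sleep".
 */
static int toy_punch_hole(struct toy_inode *ino, unsigned int cpos,
			  bool wait_for_dio)
{
	assert(cpos < TOY_CLUSTERS);
	if (wait_for_dio && ino->dio_in_flight > 0)
		return -EAGAIN;
	ino->extent_present[cpos] = false;
	return 0;
}
```

Without the wait, a punch between toy_dio_submit and toy_dio_end_io makes
the completion fail; with it, the punch is held back until the completion
has marked the extent written.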

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Su Yue <glass.su@suse.com>
---
 fs/ocfs2/file.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 0da8e7bd3261..ccc57038a977 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1936,6 +1936,8 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode,
 
 	inode_lock(inode);
 
+	/* Wait all existing dio workers, newcomers will block on i_rwsem */
+	inode_dio_wait(inode);
 	/*
 	 * This prevents concurrent writes on other nodes
 	 */
-- 
2.44.0



* [PATCH v2 3/4] ocfs2: update inode fsync transaction id in ocfs2_unlink and ocfs2_link
  2024-04-02  1:46 [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Su Yue
  2024-04-02  1:46 ` [PATCH v2 1/4] ocfs2: return real error code in ocfs2_dio_wr_get_block Su Yue
  2024-04-02  1:46 ` [PATCH v2 2/4] ocfs2: fix races between hole punching and AIO+DIO Su Yue
@ 2024-04-02  1:46 ` Su Yue
  2024-04-02  1:46 ` [PATCH v2 4/4] ocfs2: use coarse time for new created files Su Yue
  2024-04-04  1:51 ` [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Andrew Morton
  4 siblings, 0 replies; 8+ messages in thread
From: Su Yue @ 2024-04-02  1:46 UTC (permalink / raw)
  To: ocfs2-devel; +Cc: joseph.qi, akpm, Su Yue

The transaction id should be updated in ocfs2_unlink and ocfs2_link.
Otherwise, the inode link count will be wrong after journal replay, even
if fsync was called before the power failure:
=======================================================================
$ touch testdir/bar
$ ln testdir/bar testdir/bar_link
$ fsync testdir/bar
$ stat -c %h $SCRATCH_MNT/testdir/bar
1
$ stat -c %h $SCRATCH_MNT/testdir/bar
1
=======================================================================
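
A toy model of why the recorded transaction id matters, assuming that
fsync only forces the journal up to the id stored in the inode. This is
standalone C, not kernel code; struct toy_fs and the toy_* helpers are
hypothetical and greatly simplified.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_TID 8

/* Toy journal: nlink_at[tid] is the link count as of transaction tid. */
struct toy_fs {
	unsigned int last_tid;		/* newest transaction started */
	unsigned int committed_tid;	/* newest transaction on stable storage */
	unsigned int inode_sync_tid;	/* what fsync on the inode waits for */
	int nlink_at[TOY_MAX_TID];
};

/* Run one transaction that sets the link count. */
static void toy_set_nlink(struct toy_fs *fs, int nlink, bool update_sync_tid)
{
	assert(fs->last_tid + 1 < TOY_MAX_TID);
	fs->last_tid++;
	fs->nlink_at[fs->last_tid] = nlink;
	if (update_sync_tid)
		fs->inode_sync_tid = fs->last_tid;
}

/* fsync forces the journal only up to the tid recorded in the inode. */
static void toy_fsync(struct toy_fs *fs)
{
	if (fs->inode_sync_tid > fs->committed_tid)
		fs->committed_tid = fs->inode_sync_tid;
}

/* Power failure plus replay: only committed transactions survive. */
static int toy_nlink_after_replay(const struct toy_fs *fs)
{
	return fs->nlink_at[fs->committed_tid];
}
```

If the link transaction does not update inode_sync_tid, toy_fsync commits
only the creation, and replay yields a link count of 1 instead of 2.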

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Su Yue <glass.su@suse.com>
---
 fs/ocfs2/namei.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 9221a33f917b..55c9d90caaaf 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -797,6 +797,7 @@ static int ocfs2_link(struct dentry *old_dentry,
 	ocfs2_set_links_count(fe, inode->i_nlink);
 	fe->i_ctime = cpu_to_le64(inode_get_ctime_sec(inode));
 	fe->i_ctime_nsec = cpu_to_le32(inode_get_ctime_nsec(inode));
+	ocfs2_update_inode_fsync_trans(handle, inode, 0);
 	ocfs2_journal_dirty(handle, fe_bh);
 
 	err = ocfs2_add_entry(handle, dentry, inode,
@@ -993,6 +994,7 @@ static int ocfs2_unlink(struct inode *dir,
 		drop_nlink(inode);
 	drop_nlink(inode);
 	ocfs2_set_links_count(fe, inode->i_nlink);
+	ocfs2_update_inode_fsync_trans(handle, inode, 0);
 	ocfs2_journal_dirty(handle, fe_bh);
 
 	inode_set_mtime_to_ts(dir, inode_set_ctime_current(dir));
-- 
2.44.0



* [PATCH v2 4/4] ocfs2: use coarse time for new created files
  2024-04-02  1:46 [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Su Yue
                   ` (2 preceding siblings ...)
  2024-04-02  1:46 ` [PATCH v2 3/4] ocfs2: update inode fsync transaction id in ocfs2_unlink and ocfs2_link Su Yue
@ 2024-04-02  1:46 ` Su Yue
  2024-04-04  1:51 ` [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Andrew Morton
  4 siblings, 0 replies; 8+ messages in thread
From: Su Yue @ 2024-04-02  1:46 UTC (permalink / raw)
  To: ocfs2-devel; +Cc: joseph.qi, akpm, Su Yue

The default atime-related mount option is '-o relatime',
which means a file's atime should be updated if atime <= ctime
or atime <= mtime. atime should be updated in the
following scenario, but it is not:
==========================================================
$ rm /mnt/testfile;
$ echo test > /mnt/testfile
$ stat -c "%X %Y %Z" /mnt/testfile
1711881646 1711881646 1711881646
$ sleep 5
$ cat /mnt/testfile > /dev/null
$ stat -c "%X %Y %Z" /mnt/testfile
1711881646 1711881646 1711881646
==========================================================

The atime in the test is not updated because ocfs2 calls
ktime_get_real_ts64() in __ocfs2_mknod_locked during file creation.
Then inode_set_ctime_current() is called in ocfs2_acl_set_mode();
it calls ktime_get_coarse_real_ts64() to get the current time.
ktime_get_real_ts64() is more accurate than ktime_get_coarse_real_ts64().
On my test box, the ctime set by ktime_get_coarse_real_ts64() was
less than the time from ktime_get_real_ts64(), even though the ctime
was set later. As a result, the ctime of the new inode is smaller than
its atime.

The call trace is like:

ocfs2_create
  ocfs2_mknod
    __ocfs2_mknod_locked
    ....

      ktime_get_real_ts64 <------- set atime,ctime,mtime, more accurate
      ocfs2_populate_inode
    ...
    ocfs2_init_acl
      ocfs2_acl_set_mode
        inode_set_ctime_current
          current_time
            ktime_get_coarse_real_ts64 <-------less accurate

ocfs2_file_read_iter
  ocfs2_inode_lock_atime
    ocfs2_should_update_atime
      atime <= ctime ? <-------- false, ctime < atime due to accuracy

So call ktime_get_coarse_real_ts64 here to set the inode times from the
coarse clock when creating new files. This may lower the accuracy of file
times, but it is not a big deal since we already use coarse time in other
places such as ocfs2_update_inode_atime and inode_set_ctime_current.
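
The inversion can be reproduced with a small standalone sketch. It is not
kernel code: the toy_* helpers are hypothetical, and the coarse clock is
modelled simply as the fine time rounded down to the last tick, with the
tick length chosen by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Three-way comparison: <0, 0, >0, like the kernel's timespec64_compare(). */
static int toy_ts_compare(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/* Simplified coarse clock: the fine time rounded down to the last tick. */
static struct timespec toy_coarse(struct timespec fine, long tick_ns)
{
	assert(tick_ns > 0);
	fine.tv_nsec -= fine.tv_nsec % tick_ns;
	return fine;
}

/* The ctime half of the check in the call trace: atime <= ctime ? */
static bool toy_atime_le_ctime(const struct timespec *at,
			       const struct timespec *ct)
{
	return toy_ts_compare(at, ct) <= 0;
}
```

For instance, with an assumed 4 ms tick, an atime taken from the fine clock
at .126500000 followed by a coarse ctime taken at .127000000 gives a ctime
of .124000000, so the check is false. Taking both from the coarse clock
makes them equal and the check true.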

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Su Yue <glass.su@suse.com>
---
 fs/ocfs2/namei.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 55c9d90caaaf..4d1ea8703fcd 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -566,7 +566,7 @@ static int __ocfs2_mknod_locked(struct inode *dir,
 	fe->i_last_eb_blk = 0;
 	strcpy(fe->i_signature, OCFS2_INODE_SIGNATURE);
 	fe->i_flags |= cpu_to_le32(OCFS2_VALID_FL);
-	ktime_get_real_ts64(&ts);
+	ktime_get_coarse_real_ts64(&ts);
 	fe->i_atime = fe->i_ctime = fe->i_mtime =
 		cpu_to_le64(ts.tv_sec);
 	fe->i_mtime_nsec = fe->i_ctime_nsec = fe->i_atime_nsec =
-- 
2.44.0



* Re: [PATCH v2 1/4] ocfs2: return real error code in ocfs2_dio_wr_get_block
  2024-04-02  1:46 ` [PATCH v2 1/4] ocfs2: return real error code in ocfs2_dio_wr_get_block Su Yue
@ 2024-04-02  1:51   ` Joseph Qi
  0 siblings, 0 replies; 8+ messages in thread
From: Joseph Qi @ 2024-04-02  1:51 UTC (permalink / raw)
  To: Su Yue, ocfs2-devel; +Cc: akpm



On 4/2/24 9:46 AM, Su Yue wrote:
> ocfs2_dio_wr_get_block always returns -EIO in case of errors.
> However, some programs expect right exit codes while doing dio.
> For example, tools like fio treat -ENOSPC as expected code while
> doing stress jobs. And quota tools expect -EDQUOT when disk quota
> exceeds.
> 
> -EIO is too strong return code in the dio path.
> The caller of ocfs2_dio_wr_get_block is __blockdev_direct_IO which is
> widely used and it handles error codes well. I have checked functions
> called by ocfs2_dio_wr_get_block and their return codes look good and
> clear. So I think it's safe to let ocfs2_dio_wr_get_block return real
> error code.
> 
> Signed-off-by: Su Yue <glass.su@suse.com>

Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
> ---
>  fs/ocfs2/aops.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
> index b82185075de7..f0467d3b3c88 100644
> --- a/fs/ocfs2/aops.c
> +++ b/fs/ocfs2/aops.c
> @@ -2283,8 +2283,6 @@ static int ocfs2_dio_wr_get_block(struct inode *inode, sector_t iblock,
>  	ocfs2_inode_unlock(inode, 1);
>  	brelse(di_bh);
>  out:
> -	if (ret < 0)
> -		ret = -EIO;
>  	return ret;
>  }
>  


* Re: [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests
  2024-04-02  1:46 [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Su Yue
                   ` (3 preceding siblings ...)
  2024-04-02  1:46 ` [PATCH v2 4/4] ocfs2: use coarse time for new created files Su Yue
@ 2024-04-04  1:51 ` Andrew Morton
  2024-04-04  5:54   ` Su Yue
  4 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2024-04-04  1:51 UTC (permalink / raw)
  To: Su Yue; +Cc: ocfs2-devel, joseph.qi

On Tue,  2 Apr 2024 09:46:47 +0800 Su Yue <glass.su@suse.com> wrote:

> The patchset is to fix some wrong behavior of ocfs2 exposed
> by fstests.

Thanks.  We should consider which of these fixes should be backported
into -stable kernels.  For that we should provide, for each patch:

- A description of the userspace-visible impact of the bug and

- A suitable Fixes: target to tell -stable maintainers how far back
these fixes are needed.

Please could we give some consideration to these matters?



* Re: [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests
  2024-04-04  1:51 ` [PATCH v2 0/4] ocfs2 bugs fixes exposed by fstests Andrew Morton
@ 2024-04-04  5:54   ` Su Yue
  0 siblings, 0 replies; 8+ messages in thread
From: Su Yue @ 2024-04-04  5:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: ocfs2-devel, Joseph Qi



> On Apr 4, 2024, at 09:51, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> On Tue,  2 Apr 2024 09:46:47 +0800 Su Yue <glass.su@suse.com> wrote:
> 
>> The patchset is to fix some wrong behavior of ocfs2 exposed
>> by fstests.
> 
> Thanks.  We should consider which of these fixes should be backported
> into -stable kernels.  For that we should provide, for each patch:
> 
> - A description of the userspace-visible impact of the bug and
> 
Yeah. I should elaborate more in the cover letter.

> - A suitable Fixes: target to tell -stable maintainers how far back
> these fixes are needed.
> 
Necessary indeed.

> Please could we give some consideration to these matters?
> 
Sure. I will address these in the next version, after my vacation.


