linux-fsdevel.vger.kernel.org archive mirror
* Problem with direct IO
@ 2021-10-13  1:46 Zhengyuan Liu
  2021-10-18  1:09 ` Zhengyuan Liu
  2021-10-20 17:37 ` Jan Kara
  0 siblings, 2 replies; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-13  1:46 UTC (permalink / raw)
  To: viro, akpm, tytso
  Cc: linux-fsdevel, linux-mm, mysql, linux-ext4, 刘云,
	Zhengyuan Liu

Hi, all

we are encountering the following MySQL crash while importing tables:

    2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
    fsync() returned EIO, aborting.
    2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
    Assertion failure: ut0ut.cc:555 thread 281472996733168

At the same time, dmesg showed the following message:

    [ 4328.838972] Page cache invalidation failure on direct I/O.
    Possible data corruption due to collision with buffered I/O!
    [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
    625 Comm: kworker/42:1

At first we suspected that MySQL was operating on the file with direct IO
and buffered IO interleaved, but after some checking we found it only does
direct IO, via AIO. The problem comes from the direct-IO path
(__generic_file_write_iter) itself.

ssize_t __generic_file_write_iter()
{
...
        if (iocb->ki_flags & IOCB_DIRECT) {
                loff_t pos, endbyte;

                written = generic_file_direct_write(iocb, from);
                /*
                 * If the write stopped short of completing, fall back to
                 * buffered writes.  Some filesystems do this for writes to
                 * holes, for example.  For DAX files, a buffered write will
                 * not succeed (even if it did, DAX does not handle dirty
                 * page-cache pages correctly).
                 */
                if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
                        goto out;

                status = generic_perform_write(file, from, pos = iocb->ki_pos);
...
}

From the code snippet above we can see that direct IO can fall back to
buffered IO under certain conditions, so even though MySQL issues only
direct IO, it can interleave with buffered IO when the fallback occurs. I
have no idea why the filesystem (ext3) failed the direct IO here, but it is
strange that __generic_file_write_iter makes direct IO fall back to
buffered IO; that seems to break the semantics of direct IO.
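
If the kernel instead returned the short count to user space, an
application could decide for itself how to finish the write. A minimal
user-space sketch of such a retry loop (my own illustration, not MySQL or
kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Complete a possibly-short write in user space: keep calling pwrite()
 * until every byte has been submitted, instead of relying on the kernel
 * to silently finish the tail through the page cache.
 */
static ssize_t write_all(int fd, const void *buf, size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = pwrite(fd, (const char *)buf + done,
				   len - done, off + done);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* retry on signal */
			return -1;		/* real error: surface it */
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

With the fallback removed, an O_DIRECT caller would see the short count
(or an error) and could choose whether completing through the page cache
is acceptable, rather than the kernel deciding silently.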

The reproduction environment is:
Platform:  Kunpeng 920 (arm64)
Kernel: V5.15-rc
PAGESIZE: 64K
Mysql:  V8.0
Innodb_page_size: default(16K)

Thanks,

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Problem with direct IO
  2021-10-13  1:46 Problem with direct IO Zhengyuan Liu
@ 2021-10-18  1:09 ` Zhengyuan Liu
  2021-10-18 18:43   ` Andrew Morton
  2021-10-20 17:37 ` Jan Kara
  1 sibling, 1 reply; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-18  1:09 UTC (permalink / raw)
  To: viro, akpm, tytso
  Cc: linux-fsdevel, linux-mm, mysql, linux-ext4, 刘云,
	Zhengyuan Liu

Ping.

I think this problem is serious and others are likely to encounter it as
well.


On Wed, Oct 13, 2021 at 9:46 AM Zhengyuan Liu
<liuzhengyuang521@gmail.com> wrote:
>
> Hi, all
>
> we are encounting following Mysql crash problem while importing tables :
>
>     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>     fsync() returned EIO, aborting.
>     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>     Assertion failure: ut0ut.cc:555 thread 281472996733168
>
> At the same time , we found dmesg had following message:
>
>     [ 4328.838972] Page cache invalidation failure on direct I/O.
>     Possible data corruption due to collision with buffered I/O!
>     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>     625 Comm: kworker/42:1
>
> Firstly, we doubled Mysql has operating the file with direct IO and
> buffered IO interlaced, but after some checking we found it did only
> do direct IO using aio. The problem is exactly from direct-io
> interface (__generic_file_write_iter) itself.
>
> ssize_t __generic_file_write_iter()
> {
> ...
>         if (iocb->ki_flags & IOCB_DIRECT) {
>                 loff_t pos, endbyte;
>
>                 written = generic_file_direct_write(iocb, from);
>                 /*
>                  * If the write stopped short of completing, fall back to
>                  * buffered writes.  Some filesystems do this for writes to
>                  * holes, for example.  For DAX files, a buffered write will
>                  * not succeed (even if it did, DAX does not handle dirty
>                  * page-cache pages correctly).
>                  */
>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>                         goto out;
>
>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> ...
> }
>
> From above code snippet we can see that direct io could fall back to
> buffered IO under certain conditions, so even Mysql only did direct IO
> it could interleave with buffered IO when fall back occurred. I have
> no idea why FS(ext3) failed the direct IO currently, but it is strange
> __generic_file_write_iter make direct IO fall back to buffered IO, it
> seems  breaking the semantics of direct IO.
>
> The reproduced  environment is:
> Platform:  Kunpeng 920 (arm64)
> Kernel: V5.15-rc
> PAGESIZE: 64K
> Mysql:  V8.0
> Innodb_page_size: default(16K)
>
> Thanks,


* Re: Problem with direct IO
  2021-10-18  1:09 ` Zhengyuan Liu
@ 2021-10-18 18:43   ` Andrew Morton
  2021-10-19  3:39     ` Zhengyuan Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2021-10-18 18:43 UTC (permalink / raw)
  To: Zhengyuan Liu
  Cc: viro, tytso, linux-fsdevel, linux-mm, mysql, linux-ext4,
	刘云,
	Zhengyuan Liu

On Mon, 18 Oct 2021 09:09:06 +0800 Zhengyuan Liu <liuzhengyuang521@gmail.com> wrote:

> Ping.
> 
> I think this problem is serious and someone may  also encounter it in
> the future.
> 
> 
> On Wed, Oct 13, 2021 at 9:46 AM Zhengyuan Liu
> <liuzhengyuang521@gmail.com> wrote:
> >
> > Hi, all
> >
> > we are encounting following Mysql crash problem while importing tables :
> >
> >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> >     fsync() returned EIO, aborting.
> >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> >
> > At the same time , we found dmesg had following message:
> >
> >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> >     Possible data corruption due to collision with buffered I/O!
> >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> >     625 Comm: kworker/42:1
> >
> > Firstly, we doubled Mysql has operating the file with direct IO and
> > buffered IO interlaced, but after some checking we found it did only
> > do direct IO using aio. The problem is exactly from direct-io
> > interface (__generic_file_write_iter) itself.
> >
> > ssize_t __generic_file_write_iter()
> > {
> > ...
> >         if (iocb->ki_flags & IOCB_DIRECT) {
> >                 loff_t pos, endbyte;
> >
> >                 written = generic_file_direct_write(iocb, from);
> >                 /*
> >                  * If the write stopped short of completing, fall back to
> >                  * buffered writes.  Some filesystems do this for writes to
> >                  * holes, for example.  For DAX files, a buffered write will
> >                  * not succeed (even if it did, DAX does not handle dirty
> >                  * page-cache pages correctly).
> >                  */
> >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> >                         goto out;
> >
> >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > ...
> > }
> >
> > From above code snippet we can see that direct io could fall back to
> > buffered IO under certain conditions, so even Mysql only did direct IO
> > it could interleave with buffered IO when fall back occurred. I have
> > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > seems  breaking the semantics of direct IO.

That makes sense.

> > The reproduced  environment is:
> > Platform:  Kunpeng 920 (arm64)
> > Kernel: V5.15-rc
> > PAGESIZE: 64K
> > Mysql:  V8.0
> > Innodb_page_size: default(16K)

This is all fairly mature code, I think.  Do you know if earlier
kernels were OK, and if so which versions?

Thanks.


* Re: Problem with direct IO
  2021-10-18 18:43   ` Andrew Morton
@ 2021-10-19  3:39     ` Zhengyuan Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-19  3:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: viro, tytso, linux-fsdevel, linux-mm, mysql, linux-ext4,
	刘云,
	Zhengyuan Liu

On Tue, Oct 19, 2021 at 2:43 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 18 Oct 2021 09:09:06 +0800 Zhengyuan Liu <liuzhengyuang521@gmail.com> wrote:
>
> > Ping.
> >
> > I think this problem is serious and someone may  also encounter it in
> > the future.
> >
> >
> > On Wed, Oct 13, 2021 at 9:46 AM Zhengyuan Liu
> > <liuzhengyuang521@gmail.com> wrote:
> > >
> > > Hi, all
> > >
> > > we are encounting following Mysql crash problem while importing tables :
> > >
> > >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> > >     fsync() returned EIO, aborting.
> > >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> > >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> > >
> > > At the same time , we found dmesg had following message:
> > >
> > >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> > >     Possible data corruption due to collision with buffered I/O!
> > >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> > >     625 Comm: kworker/42:1
> > >
> > > Firstly, we doubled Mysql has operating the file with direct IO and
> > > buffered IO interlaced, but after some checking we found it did only
> > > do direct IO using aio. The problem is exactly from direct-io
> > > interface (__generic_file_write_iter) itself.
> > >
> > > ssize_t __generic_file_write_iter()
> > > {
> > > ...
> > >         if (iocb->ki_flags & IOCB_DIRECT) {
> > >                 loff_t pos, endbyte;
> > >
> > >                 written = generic_file_direct_write(iocb, from);
> > >                 /*
> > >                  * If the write stopped short of completing, fall back to
> > >                  * buffered writes.  Some filesystems do this for writes to
> > >                  * holes, for example.  For DAX files, a buffered write will
> > >                  * not succeed (even if it did, DAX does not handle dirty
> > >                  * page-cache pages correctly).
> > >                  */
> > >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > >                         goto out;
> > >
> > >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > ...
> > > }
> > >
> > > From above code snippet we can see that direct io could fall back to
> > > buffered IO under certain conditions, so even Mysql only did direct IO
> > > it could interleave with buffered IO when fall back occurred. I have
> > > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > > seems  breaking the semantics of direct IO.
>
> That makes sense.
>
> > > The reproduced  environment is:
> > > Platform:  Kunpeng 920 (arm64)
> > > Kernel: V5.15-rc
> > > PAGESIZE: 64K
> > > Mysql:  V8.0
> > > Innodb_page_size: default(16K)
>
> This is all fairly mature code, I think.  Do you know if earlier
> kernels were OK, and if so which versions?

We have tested v4.18 and v4.19 and the problem is still there; versions
before v4.12 don't support arm64 well, so we couldn't test those.

I think this problem has something to do with the page size: if we change
the kernel page size from 64K to 4K, or just set innodb_page_size to 64K,
we can no longer reproduce it. Typically the kernel page size and FS block
size are both 4K; if the database uses an IO unit of 4K or more, IOs do
not interleave within a single page-cache page, since each IO occupies one
or more whole pages. That means the problem is hard to trigger on x86 or
other platforms using a 4K page size. Things change with the 64K page size
on arm64: if the database uses a smaller IO unit (in our MySQL case, 16K
direct IO), two IOs can share one page-cache page, and if one of them
falls back to buffered IO it can trigger the problem. For example, AIO
gets two direct IOs that share the same page-cache page; it dispatches the
first one to storage and begins processing the second before the first
completes. If the second one falls back to buffered IO, its data is copied
into the page cache and the page is marked dirty. When the first IO
completes, it checks and tries to invalidate its page-cache range; if the
page is dirty, the invalidation fails and the problem occurs.
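
To make the page-sharing claim concrete, here is a small arithmetic
sketch (my own illustration, not kernel code) of when two IOs hit the
same page-cache index:

```c
#include <assert.h>

/* Page-cache index of a byte offset: index = offset >> page_shift. */
static long page_index(long long offset, int page_shift)
{
	return (long)(offset >> page_shift);
}

/* Do two IOs of @len bytes at @off1 and @off2 touch a common page? */
static int ios_share_page(long long off1, long long off2,
			  long len, int page_shift)
{
	long a_first = page_index(off1, page_shift);
	long a_last  = page_index(off1 + len - 1, page_shift);
	long b_first = page_index(off2, page_shift);
	long b_last  = page_index(off2 + len - 1, page_shift);

	/* do ranges [a_first, a_last] and [b_first, b_last] intersect? */
	return a_first <= b_last && b_first <= a_last;
}
```

With page_shift = 16 (64K pages) two adjacent 16K IOs both map to page
index 0, while with page_shift = 12 (4K pages) they touch disjoint
indices, which matches why we only see the collision on the 64K-page
arm64 kernel.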

If my analysis isn't correct please point it out, thanks.


* Re: Problem with direct IO
  2021-10-13  1:46 Problem with direct IO Zhengyuan Liu
  2021-10-18  1:09 ` Zhengyuan Liu
@ 2021-10-20 17:37 ` Jan Kara
  2021-10-21  2:21   ` Zhengyuan Liu
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kara @ 2021-10-20 17:37 UTC (permalink / raw)
  To: Zhengyuan Liu
  Cc: viro, akpm, tytso, linux-fsdevel, linux-mm, mysql, linux-ext4,
	刘云,
	Zhengyuan Liu

On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> Hi, all
> 
> we are encounting following Mysql crash problem while importing tables :
> 
>     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>     fsync() returned EIO, aborting.
>     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>     Assertion failure: ut0ut.cc:555 thread 281472996733168
> 
> At the same time , we found dmesg had following message:
> 
>     [ 4328.838972] Page cache invalidation failure on direct I/O.
>     Possible data corruption due to collision with buffered I/O!
>     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>     625 Comm: kworker/42:1
> 
> Firstly, we doubled Mysql has operating the file with direct IO and
> buffered IO interlaced, but after some checking we found it did only
> do direct IO using aio. The problem is exactly from direct-io
> interface (__generic_file_write_iter) itself.
> 
> ssize_t __generic_file_write_iter()
> {
> ...
>         if (iocb->ki_flags & IOCB_DIRECT) {
>                 loff_t pos, endbyte;
> 
>                 written = generic_file_direct_write(iocb, from);
>                 /*
>                  * If the write stopped short of completing, fall back to
>                  * buffered writes.  Some filesystems do this for writes to
>                  * holes, for example.  For DAX files, a buffered write will
>                  * not succeed (even if it did, DAX does not handle dirty
>                  * page-cache pages correctly).
>                  */
>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>                         goto out;
> 
>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> ...
> }
> 
> From above code snippet we can see that direct io could fall back to
> buffered IO under certain conditions, so even Mysql only did direct IO
> it could interleave with buffered IO when fall back occurred. I have
> no idea why FS(ext3) failed the direct IO currently, but it is strange
> __generic_file_write_iter make direct IO fall back to buffered IO, it
> seems  breaking the semantics of direct IO.
> 
> The reproduced  environment is:
> Platform:  Kunpeng 920 (arm64)
> Kernel: V5.15-rc
> PAGESIZE: 64K
> Mysql:  V8.0
> Innodb_page_size: default(16K)

Thanks for the report. I agree this should not happen. How hard is this to
reproduce? Any idea whether the fallback to buffered IO happens because
iomap_dio_rw() returns -ENOTBLK or because it returns a short write?

Can you post output of "dumpe2fs -h <device>" for the filesystem where the
problem happens? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Problem with direct IO
  2021-10-20 17:37 ` Jan Kara
@ 2021-10-21  2:21   ` Zhengyuan Liu
  2021-10-21  8:03     ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-21  2:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: viro, Andrew Morton, tytso, linux-fsdevel, linux-mm, linux-ext4,
	刘云,
	Zhengyuan Liu

On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > Hi, all
> >
> > we are encounting following Mysql crash problem while importing tables :
> >
> >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> >     fsync() returned EIO, aborting.
> >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> >
> > At the same time , we found dmesg had following message:
> >
> >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> >     Possible data corruption due to collision with buffered I/O!
> >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> >     625 Comm: kworker/42:1
> >
> > Firstly, we doubled Mysql has operating the file with direct IO and
> > buffered IO interlaced, but after some checking we found it did only
> > do direct IO using aio. The problem is exactly from direct-io
> > interface (__generic_file_write_iter) itself.
> >
> > ssize_t __generic_file_write_iter()
> > {
> > ...
> >         if (iocb->ki_flags & IOCB_DIRECT) {
> >                 loff_t pos, endbyte;
> >
> >                 written = generic_file_direct_write(iocb, from);
> >                 /*
> >                  * If the write stopped short of completing, fall back to
> >                  * buffered writes.  Some filesystems do this for writes to
> >                  * holes, for example.  For DAX files, a buffered write will
> >                  * not succeed (even if it did, DAX does not handle dirty
> >                  * page-cache pages correctly).
> >                  */
> >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> >                         goto out;
> >
> >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > ...
> > }
> >
> > From above code snippet we can see that direct io could fall back to
> > buffered IO under certain conditions, so even Mysql only did direct IO
> > it could interleave with buffered IO when fall back occurred. I have
> > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > seems  breaking the semantics of direct IO.
> >
> > The reproduced  environment is:
> > Platform:  Kunpeng 920 (arm64)
> > Kernel: V5.15-rc
> > PAGESIZE: 64K
> > Mysql:  V8.0
> > Innodb_page_size: default(16K)
>
> Thanks for report. I agree this should not happen. How hard is this to
> reproduce? Any idea whether the fallback to buffered IO happens because
> iomap_dio_rw() returns -ENOTBLK or because it returns short write?

It is easy to reproduce in my test environment; as I said in my previous
reply to Andrew, the problem is related to the kernel page size.

> Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> problem happens? Thanks!

Sure, the output is:

# dumpe2fs -h /dev/sda3
dumpe2fs 1.45.3 (14-Jul-2019)
Filesystem volume name:   <none>
Last mounted on:          /data
Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery sparse_super large_file
Filesystem flags:         unsigned_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              11034624
Block count:              44138240
Reserved block count:     2206912
Free blocks:              43168100
Free inodes:              11034613
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1013
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Thu Oct 21 09:42:03 2021
Last mount time:          Thu Oct 21 09:43:36 2021
Last write time:          Thu Oct 21 09:43:36 2021
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Oct 21 09:42:03 2021
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      a7b04e61-1209-496d-ab9d-a51009b51ddb
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x00000002
Journal start:            1

BTW, we have also tested ext4 and XFS and didn't see the direct-write
fallback there.

Thanks,


* Re: Problem with direct IO
  2021-10-21  2:21   ` Zhengyuan Liu
@ 2021-10-21  8:03     ` Jan Kara
  2021-10-21 12:11       ` Zhengyuan Liu
       [not found]       ` <61712B10.2060408@huawei.com>
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Kara @ 2021-10-21  8:03 UTC (permalink / raw)
  To: Zhengyuan Liu
  Cc: Jan Kara, viro, Andrew Morton, tytso, linux-fsdevel, linux-mm,
	linux-ext4, 刘云,
	Zhengyuan Liu

On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
> > On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > > we are encounting following Mysql crash problem while importing tables :
> > >
> > >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> > >     fsync() returned EIO, aborting.
> > >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> > >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> > >
> > > At the same time , we found dmesg had following message:
> > >
> > >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> > >     Possible data corruption due to collision with buffered I/O!
> > >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> > >     625 Comm: kworker/42:1
> > >
> > > Firstly, we doubled Mysql has operating the file with direct IO and
> > > buffered IO interlaced, but after some checking we found it did only
> > > do direct IO using aio. The problem is exactly from direct-io
> > > interface (__generic_file_write_iter) itself.
> > >
> > > ssize_t __generic_file_write_iter()
> > > {
> > > ...
> > >         if (iocb->ki_flags & IOCB_DIRECT) {
> > >                 loff_t pos, endbyte;
> > >
> > >                 written = generic_file_direct_write(iocb, from);
> > >                 /*
> > >                  * If the write stopped short of completing, fall back to
> > >                  * buffered writes.  Some filesystems do this for writes to
> > >                  * holes, for example.  For DAX files, a buffered write will
> > >                  * not succeed (even if it did, DAX does not handle dirty
> > >                  * page-cache pages correctly).
> > >                  */
> > >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > >                         goto out;
> > >
> > >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > ...
> > > }
> > >
> > > From above code snippet we can see that direct io could fall back to
> > > buffered IO under certain conditions, so even Mysql only did direct IO
> > > it could interleave with buffered IO when fall back occurred. I have
> > > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > > seems  breaking the semantics of direct IO.
> > >
> > > The reproduced  environment is:
> > > Platform:  Kunpeng 920 (arm64)
> > > Kernel: V5.15-rc
> > > PAGESIZE: 64K
> > > Mysql:  V8.0
> > > Innodb_page_size: default(16K)
> >
> > Thanks for report. I agree this should not happen. How hard is this to
> > reproduce? Any idea whether the fallback to buffered IO happens because
> > iomap_dio_rw() returns -ENOTBLK or because it returns short write?
> 
> It is easy to reproduce in my test environment, as I said in the previous
> email replied to Andrew this problem is related to kernel page size.

Ok, can you share a reproducer?

> > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > problem happens? Thanks!
> 
> Sure, the output is:
> 
> # dumpe2fs -h /dev/sda3
> dumpe2fs 1.45.3 (14-Jul-2019)
> Filesystem volume name:   <none>
> Last mounted on:          /data
> Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index
> filetype needs_recovery sparse_super large_file

Thanks for the data. OK, a filesystem without extents. Does your test by
any chance try to do direct IO to a hole in a file? Because that is not
(and never was) supported without extents. Also the fact that you don't see
the problem with ext4 (which means extents support) would be pointing in
that direction.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Problem with direct IO
  2021-10-21  8:03     ` Jan Kara
@ 2021-10-21 12:11       ` Zhengyuan Liu
  2021-10-22  9:31         ` Jan Kara
       [not found]       ` <61712B10.2060408@huawei.com>
  1 sibling, 1 reply; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-21 12:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: viro, Andrew Morton, tytso, linux-fsdevel, linux-mm, linux-ext4,
	刘云,
	Zhengyuan Liu

On Thu, Oct 21, 2021 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
> > On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
> > > On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > > > we are encounting following Mysql crash problem while importing tables :
> > > >
> > > >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> > > >     fsync() returned EIO, aborting.
> > > >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> > > >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> > > >
> > > > At the same time , we found dmesg had following message:
> > > >
> > > >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> > > >     Possible data corruption due to collision with buffered I/O!
> > > >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> > > >     625 Comm: kworker/42:1
> > > >
> > > > Firstly, we doubled Mysql has operating the file with direct IO and
> > > > buffered IO interlaced, but after some checking we found it did only
> > > > do direct IO using aio. The problem is exactly from direct-io
> > > > interface (__generic_file_write_iter) itself.
> > > >
> > > > ssize_t __generic_file_write_iter()
> > > > {
> > > > ...
> > > >         if (iocb->ki_flags & IOCB_DIRECT) {
> > > >                 loff_t pos, endbyte;
> > > >
> > > >                 written = generic_file_direct_write(iocb, from);
> > > >                 /*
> > > >                  * If the write stopped short of completing, fall back to
> > > >                  * buffered writes.  Some filesystems do this for writes to
> > > >                  * holes, for example.  For DAX files, a buffered write will
> > > >                  * not succeed (even if it did, DAX does not handle dirty
> > > >                  * page-cache pages correctly).
> > > >                  */
> > > >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > > >                         goto out;
> > > >
> > > >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > > ...
> > > > }
> > > >
> > > > From above code snippet we can see that direct io could fall back to
> > > > buffered IO under certain conditions, so even Mysql only did direct IO
> > > > it could interleave with buffered IO when fall back occurred. I have
> > > > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > > > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > > > seems  breaking the semantics of direct IO.
> > > >
> > > > The reproduced  environment is:
> > > > Platform:  Kunpeng 920 (arm64)
> > > > Kernel: V5.15-rc
> > > > PAGESIZE: 64K
> > > > Mysql:  V8.0
> > > > Innodb_page_size: default(16K)
> > >
> > > Thanks for report. I agree this should not happen. How hard is this to
> > > reproduce? Any idea whether the fallback to buffered IO happens because
> > > iomap_dio_rw() returns -ENOTBLK or because it returns short write?
> >
> > It is easy to reproduce in my test environment, as I said in the previous
> > email replied to Andrew this problem is related to kernel page size.
>
> Ok, can you share a reproducer?

I don't have a simple test case; the full procedure shown below is
somewhat involved.

1. Prepare the MySQL installation environment
a. Prepare an SSD partition (at least 100G) as the MySQL data partition,
format it as ext3 and mount it at /data
# mkfs.ext3 /dev/sdb1
# mount /dev/sdb1 /data
b. Create the MySQL user and group
# groupadd mysql
# useradd -g mysql mysql
c. Create the MySQL directories
# mkdir -p /data/mysql
# cd /data/mysql
# mkdir data tmp run log

2. Install MySQL
a. Download mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar from
https://downloads.mysql.com/archives/community/
b. Install MySQL
# tar -xvf mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar
# yum install openssl openssl-devel
# rpm -ivh mysql-community-common-8.0.25-1.el8.aarch64.rpm \
    mysql-community-client-plugins-8.0.25-1.el8.aarch64.rpm \
    mysql-community-libs-8.0.25-1.el8.aarch64.rpm \
    mysql-community-client-8.0.25-1.el8.aarch64.rpm \
    mysql-community-server-8.0.25-1.el8.aarch64.rpm \
    mysql-community-devel-8.0.25-1.el8.aarch64.rpm

3. Configure Mysql
a. # chown mysql:mysql /etc/my.cnf
b. # vim /etc/my.cnf
innodb_flush_method = O_DIRECT
default-storage-engine=INNODB
datadir=/data/mysql/data
socket=/data/mysql/run/mysql.sock
tmpdir=/data/mysql/tmp
log-error=/data/mysql/log/mysqld.log
pid-file=/data/mysql/run/mysqld.pid
port=3306
user=mysql
c. Initialize MySQL (the problem may already reproduce at this stage)
# mysqld --defaults-file=/etc/my.cnf --initialize
d. Start MySQL
# mysqld --defaults-file=/etc/my.cnf &
e. Log into MySQL
# mysql -uroot -p -S /data/mysql/run/mysql.sock
The temporary password is printed during step 3.c.
f. Configure access
mysql> alter user 'root'@'localhost' identified by "123456";
mysql> create user 'root'@'%' identified by '123456';
mysql> grant all privileges on *.* to 'root'@'%'; flush privileges;
mysql> create database sysbench;

4. Use sysbench to test MySQL
a. Install sysbench from https://github.com/akopytov/sysbench/archive/master.zip
b. Use the following script to reproduce the problem (it may take dozens
of minutes):
    while true ; do
      sysbench /usr/local/share/sysbench/oltp_write_only.lua \
        --table-size=1000000 --tables=100 \
        --threads=32 --db-driver=mysql --mysql-db=sysbench \
        --mysql-host=127.0.0.1 --mysql-port=3306 \
        --mysql-user=root --mysql-password=123456 \
        --mysql-socket=/var/lib/mysql/mysql.sock prepare

      sleep 5
      sysbench /usr/local/share/sysbench/oltp_write_only.lua \
        --table-size=1000000 --tables=100 \
        --threads=32 --db-driver=mysql --mysql-db=sysbench \
        --mysql-host=127.0.0.1 --mysql-port=3306 \
        --mysql-user=root --mysql-password=123456 \
        --mysql-socket=/var/lib/mysql/mysql.sock cleanup

      sleep 5
    done

If you can't reproduce it, we can provide a remote environment for you, or
connect to your machine to set up a reproduction environment.

> > > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > > problem happens? Thanks!
> >
> > Sure, the output is:
> >
> > # dumpe2fs -h /dev/sda3
> > dumpe2fs 1.45.3 (14-Jul-2019)
> > Filesystem volume name:   <none>
> > Last mounted on:          /data
> > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > Filesystem magic number:  0xEF53
> > Filesystem revision #:    1 (dynamic)
> > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > filetype needs_recovery sparse_super large_file
>
> Thanks for the data. OK, a filesystem without extents. Does your test by
> any chance try to do direct IO to a hole in a file? Because that is not
> (and never was) supported without extents. Also the fact that you don't see
> the problem with ext4 (which means extents support) would be pointing in
> that direction.

I am not sure whether it tries to do direct IO to a hole; is there
any way to check?
If you have a simple test to reproduce this, please let us know; we would be glad to try.
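
(For reference, one generic way to check whether a file contains holes,
independent of this thread's specifics, is to compare the file's apparent
size with the space actually allocated for it; a sparse file reports fewer
allocated blocks than its size implies. This is only an illustrative sketch;
the temp file and sizes are arbitrary, and `filefrag -v <file>` would show
the actual extent map in more detail.)

```shell
# Sketch: detect holes by comparing apparent size with allocated space.
f=$(mktemp)
# Write one 4K block at offset 60K, leaving a hole before it.
dd if=/dev/zero of="$f" bs=4096 seek=15 count=1 conv=notrunc 2>/dev/null
apparent=$(stat -c %s "$f")                                # logical size in bytes
allocated=$(( $(stat -c %b "$f") * $(stat -c %B "$f") ))   # bytes actually allocated
if [ "$allocated" -lt "$apparent" ]; then
    echo "sparse (has holes)"
else
    echo "fully allocated"
fi
rm -f "$f"
```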

Thanks,

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Problem with direct IO
       [not found]       ` <61712B10.2060408@huawei.com>
@ 2021-10-21 12:20         ` Zhengyuan Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-21 12:20 UTC (permalink / raw)
  To: yebin
  Cc: Jan Kara, viro, Andrew Morton, tytso, linux-fsdevel, linux-mm,
	linux-ext4, 刘云,
	Zhengyuan Liu

On Thu, Oct 21, 2021 at 4:55 PM yebin <yebin10@huawei.com> wrote:
>
>
>
> On 2021/10/21 16:03, Jan Kara wrote:
>
> On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
>
> On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
>
> we are encounting following Mysql crash problem while importing tables :
>
>     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
>     fsync() returned EIO, aborting.
>     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
>     Assertion failure: ut0ut.cc:555 thread 281472996733168
>
> At the same time , we found dmesg had following message:
>
>     [ 4328.838972] Page cache invalidation failure on direct I/O.
>     Possible data corruption due to collision with buffered I/O!
>     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
>     625 Comm: kworker/42:1
>
> Firstly, we doubled Mysql has operating the file with direct IO and
> buffered IO interlaced, but after some checking we found it did only
> do direct IO using aio. The problem is exactly from direct-io
> interface (__generic_file_write_iter) itself.
>
> ssize_t __generic_file_write_iter()
> {
> ...
>         if (iocb->ki_flags & IOCB_DIRECT) {
>                 loff_t pos, endbyte;
>
>                 written = generic_file_direct_write(iocb, from);
>                 /*
>                  * If the write stopped short of completing, fall back to
>                  * buffered writes.  Some filesystems do this for writes to
>                  * holes, for example.  For DAX files, a buffered write will
>                  * not succeed (even if it did, DAX does not handle dirty
>                  * page-cache pages correctly).
>                  */
>                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
>                         goto out;
>
>                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> ...
> }
>
> From above code snippet we can see that direct io could fall back to
> buffered IO under certain conditions, so even Mysql only did direct IO
> it could interleave with buffered IO when fall back occurred. I have
> no idea why FS(ext3) failed the direct IO currently, but it is strange
> __generic_file_write_iter make direct IO fall back to buffered IO, it
> seems  breaking the semantics of direct IO.
>
> The reproduced  environment is:
> Platform:  Kunpeng 920 (arm64)
> Kernel: V5.15-rc
> PAGESIZE: 64K
> Mysql:  V8.0
> Innodb_page_size: default(16K)
>
> Thanks for report. I agree this should not happen. How hard is this to
> reproduce? Any idea whether the fallback to buffered IO happens because
> iomap_dio_rw() returns -ENOTBLK or because it returns short write?
>
> It is easy to reproduce in my test environment, as I said in the previous
> email replied to Andrew this problem is related to kernel page size.
>
> Ok, can you share a reproducer?
>
> Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> problem happens? Thanks!
>
> Sure, the output is:
>
> # dumpe2fs -h /dev/sda3
> dumpe2fs 1.45.3 (14-Jul-2019)
> Filesystem volume name:   <none>
> Last mounted on:          /data
> Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index
> filetype needs_recovery sparse_super large_file
>
> Thanks for the data. OK, a filesystem without extents. Does your test by
> any chance try to do direct IO to a hole in a file? Because that is not
> (and never was) supported without extents. Also the fact that you don't see
> the problem with ext4 (which means extents support) would be pointing in
> that direction.
>
> Honza
>
>
> I tried to reproduce this issue as follows; maybe this is just one scenario:
> step1:  Modify  kernel code, add delay in dio_complete and dio_bio_end_aio
> step2:  mkfs.ext4 /dev/sda
> step3:  mount /dev/sda  /home/test
> step4:  fio -filename=/home/test/test  -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
> step5:  fio -filename=/home/test/test  -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
> step6:  fio -filename=/home/test/test  -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w

Hi, yebin

Thanks for your test case, but it is not the same problem. The scenario you
described here is something the kernel is designed to handle, since you are
intentionally interleaving buffered IO and direct IO. Please look at these
two patches:

332391a ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
5a9d929 ("iomap: report collisions between directio and buffered
writes to userspace")
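
For comparison, the collision those patches warn about can be provoked
deliberately from userspace. This is only an illustrative sketch (the file
name and sizes are arbitrary, and O_DIRECT support depends on the
filesystem; tmpfs, for example, rejects it):

```shell
# Sketch: touch the same file region through the page cache and with
# O_DIRECT. If the direct write's completion cannot invalidate the
# overlapping cached pages, the kernel logs the "Page cache invalidation
# failure" warning seen in this thread.
f=./dio-collision-demo
dd if=/dev/zero of="$f" bs=4096 count=4 2>/dev/null        # buffered write: pages land in the page cache
if dd if=/dev/zero of="$f" bs=4096 count=1 oflag=direct conv=notrunc 2>/dev/null; then
    echo "direct write issued"                             # kernel must now invalidate the cached pages
else
    echo "O_DIRECT unsupported here"
fi
rm -f "$f"
```

In the reproducer above, the mdelay() calls inserted into dio_complete()
and dio_bio_end_aio() simply widen the window between AIO completion and
page-cache invalidation, which is why the warning fires reliably there.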

>
> Kernel modification base on v4.19:
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> +#include <linux/delay.h>
>
> /**
> @@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
>          * end_io() when necessary, otherwise a racing buffer read would cache
>          * zeros from unwritten extents.
>          */
> +       printk("%s: before invalidate page \n", __func__);
> +       mdelay(10 * 1000);
>         if (flags & DIO_COMPLETE_INVALIDATE &&
>             ret > 0 && dio->op == REQ_OP_WRITE &&
>             dio->inode->i_mapping->nrpages) {
> @@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
>                 if (err)
>                         dio_warn_stale_pagecache(dio->iocb->ki_filp);
>         }
> +       printk("%s: end invalidate page \n", __func__);
>
>         inode_dio_end(dio->inode);
>
> @@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
>                  * went in between AIO submission and completion into the
>                  * same region.
>                  */
> +               printk("%s:start to  sleep 5s.... \n", __func__);
> +               mdelay(10 * 1000);
> +               printk("%s: endto  sleep, defer_completion=%d op=%d nrpages=%d\n",
> +                       __func__,dio->defer_completion, dio->op, dio->inode->i_mapping->nrpages);
> +
>                 if (dio->result)
>                         defer_completion = dio->defer_completion ||
>                                            (dio->op == REQ_OP_WRITE &&
>                                             dio->inode->i_mapping->nrpages);
> +               printk("%s: dio=%px defer_completion=%d\n",
> +                       __func__, dio, defer_completion);
>                 if (defer_completion) {
>                         INIT_WORK(&dio->complete_work, dio_aio_complete_work);
>                         queue_work(dio->inode->i_sb->s_dio_done_wq,
>
>
> Reproduce result :
> s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
> test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.7
> Starting 1 thread
> test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
>   write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
>     slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
>     clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
>      lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
>     clat percentiles (nsec):
>      |  1.00th=[ 2320],  5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
>      | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
>      | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
>      | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
>      | 99.99th=[28544]
>   lat (usec)   : 4=50.00%, 10=25.00%, 50=25.00%
>   cpu          : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s (1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec
>
> Disk stats (read/write):
>   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
> =1 -name=test_r_w &write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs
> [1] 2087
> [root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.7
> Starting 1 thread
>
> [root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
> [ 1475.288807] ext4_file_write_iter: unlock inode
> [ 1475.290402] dio_bio_end_aio:start to  sleep 5s....
>
> s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
> test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
> fio-3.7
> Starting 1 thread
> Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
> Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
> [ 1485.292939] dio_bio_end_aio: endto  sleep, defer_completion=0 op=1 nrpages=4
> [ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
> [ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
> Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s][r[ 1485.296539] dio_complete: before invalidate page
> [ 1495.298472] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!     -->detect  buffer io and direct io conflict.
> [ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
> [ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
> [ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
> Jobs: 1 (f=1): [[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
> W(1)][-.-%][r=0K
> [ 1495.316879] dio_complete: end invalidate page
> test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
>   write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
>     slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
>     clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
>      lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
>     clat percentiles (nsec):
>      |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
>      | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
>      | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
>      | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
>      | 99.99th=[12352]
>   lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
>   cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s (1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec
>
> Disk stats (read/write):
>   sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
>
> test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
>   write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
>     slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
>     clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
>      lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
>     clat percentiles (msec):
>      |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
>      | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
>      | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
>      | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
>      | 99.99th=[17113]
>   cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
>   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=1
>
> Run status group 0 (all jobs):
>   WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B (4096B), run=20033-20033msec
>
> Disk stats (read/write):
>   sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%
>
>
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Problem with direct IO
  2021-10-21 12:11       ` Zhengyuan Liu
@ 2021-10-22  9:31         ` Jan Kara
  2021-10-23  2:06           ` Zhengyuan Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2021-10-22  9:31 UTC (permalink / raw)
  To: Zhengyuan Liu
  Cc: Jan Kara, viro, Andrew Morton, tytso, linux-fsdevel, linux-mm,
	linux-ext4, 刘云,
	Zhengyuan Liu

On Thu 21-10-21 20:11:43, Zhengyuan Liu wrote:
> On Thu, Oct 21, 2021 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
> > > On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
> > > > On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > > > > we are encounting following Mysql crash problem while importing tables :
> > > > >
> > > > >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> > > > >     fsync() returned EIO, aborting.
> > > > >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> > > > >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> > > > >
> > > > > At the same time , we found dmesg had following message:
> > > > >
> > > > >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> > > > >     Possible data corruption due to collision with buffered I/O!
> > > > >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> > > > >     625 Comm: kworker/42:1
> > > > >
> > > > > Firstly, we doubled Mysql has operating the file with direct IO and
> > > > > buffered IO interlaced, but after some checking we found it did only
> > > > > do direct IO using aio. The problem is exactly from direct-io
> > > > > interface (__generic_file_write_iter) itself.
> > > > >
> > > > > ssize_t __generic_file_write_iter()
> > > > > {
> > > > > ...
> > > > >         if (iocb->ki_flags & IOCB_DIRECT) {
> > > > >                 loff_t pos, endbyte;
> > > > >
> > > > >                 written = generic_file_direct_write(iocb, from);
> > > > >                 /*
> > > > >                  * If the write stopped short of completing, fall back to
> > > > >                  * buffered writes.  Some filesystems do this for writes to
> > > > >                  * holes, for example.  For DAX files, a buffered write will
> > > > >                  * not succeed (even if it did, DAX does not handle dirty
> > > > >                  * page-cache pages correctly).
> > > > >                  */
> > > > >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > > > >                         goto out;
> > > > >
> > > > >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > > > ...
> > > > > }
> > > > >
> > > > > From above code snippet we can see that direct io could fall back to
> > > > > buffered IO under certain conditions, so even Mysql only did direct IO
> > > > > it could interleave with buffered IO when fall back occurred. I have
> > > > > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > > > > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > > > > seems  breaking the semantics of direct IO.
> > > > >
> > > > > The reproduced  environment is:
> > > > > Platform:  Kunpeng 920 (arm64)
> > > > > Kernel: V5.15-rc
> > > > > PAGESIZE: 64K
> > > > > Mysql:  V8.0
> > > > > Innodb_page_size: default(16K)
> > > >
> > > > Thanks for report. I agree this should not happen. How hard is this to
> > > > reproduce? Any idea whether the fallback to buffered IO happens because
> > > > iomap_dio_rw() returns -ENOTBLK or because it returns short write?
> > >
> > > It is easy to reproduce in my test environment, as I said in the previous
> > > email replied to Andrew this problem is related to kernel page size.
> >
> > Ok, can you share a reproducer?
> 
> I don't have a simple test case to reproduce, the whole procedure shown as
> following is somewhat complex.
> 
> 1. Prepare Mysql installation environment
> a.  Prepare a SSD partition  (at least 100G) as the Mysql data
> partition, format to Ext3 and mount to /data
> # mkfs.ext3 /dev/sdb1
> # mount /dev/sdb1 /data
> b. Create Mysql user and user group
> # groupadd mysql
> # useradd -g mysql mysql
> c. Create Mysql directory
> # mkdir -p /data/mysql
> # cd /data/mysql
> #  mkdir data tmp run log
> 
> 2. Install Mysql
> a. Download mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar from
> https://downloads.mysql.com/archives/community/
> b. Install Mysql
> #  tar -xvf mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar
> # yum install openssl openssl-devel
> # rpm -ivh mysql-community-common-8.0.25-1.el8.aarch64.rpm
> mysql-community-client-plugins-8.0.25-1.el8.aarch64.rpm \
> mysql-community-libs-8.0.25-1.el8.aarch64.rp
> mysql-community-client-8.0.25-1.el8.aarch64.rpm \
>  mysql-community-server-8.0.25-1.el8.aarch64.rpm
> mysql-community-devel-8.0.25-1.el8.aarch64.rpm
> 
> 3. Configure Mysql
> a. # chown mysql:mysql /etc/my.cnf
> b. # vim /etc/my.cnf
> innodb_flush_method = O_DIRECT
> default-storage-engine=INNODB
> datadir=/data/mysql/data
> socket=/data/mysql/run/mysql.sock
> tmpdir=/data/mysql/tmp
> log-error=/data/mysql/log/mysqld.log
> pid-file=/data/mysql/run/mysqld.pid
> port=3306
> user=mysql
> c. initialize Mysql (problem may reproduce at this stage)
> # mysqld --defaults-file=/etc/my.cnf --initialize
> d. Start Mysql
> # mysqld --defaults-file=/etc/my.cnf &
> e. Login into Mysql
> # mysql -uroot -p -S /data/mysql/run/mysql.sock
> You can see the temporary password from step 3.c
> f. Configure access
> mysql> alter user 'root'@'localhost' identified by "123456";
> mysql> create user 'root'@'%' identified by '123456';
> mysql> grant all privileges on *.* to 'root'@'%'; flush privileges;
> mysql> create database sysbench;
> 
> 4. Use sysbench to test Mysql
> a. Install sysbench from https://github.com/akopytov/sysbench/archive/master.zip
> b. Use following script to reproduce problem (may need dozens of minutes)
>     while true ; do
>       sysbench /usr/local/share/sysbench/oltp_write_only.lua
> --table-size=1000000 --tables=100 \
>       --threads=32 --db-driver=mysql --mysql-db=sysbench
> --mysql-host=127.0.0.1 --mysql- port=3306 \
>       --mysql-user=root --mysql-password=123456
> --mysql-socket=/var/lib/mysql/mysql.sock  prepare
> 
>       sleep 5
>       sysbench /usr/local/share/sysbench/oltp_write_only.lua
> --table-size=1000000 --tables=100 \
>       --threads=32 --db-driver=mysql --mysql-db=sysbench
> --mysql-host=127.0.0.1 --mysql- port=3306 \
>       --mysql-user=root --mysql-password=123456
> --mysql-socket=/var/lib/mysql/mysql.sock  cleanup
> 
>       sleep 5
>   done
> 
> If you can't reproduce, we could provide a remote environment for you or
> connect to your machine to build a reproduced environment.

Ah, not that simple. Also, it isn't that easy for me to get an arm64 machine
for experiments. Connecting to your environment would be possible, but
let's try remote debugging for a bit longer ;)

> > > > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > > > problem happens? Thanks!
> > >
> > > Sure, the output is:
> > >
> > > # dumpe2fs -h /dev/sda3
> > > dumpe2fs 1.45.3 (14-Jul-2019)
> > > Filesystem volume name:   <none>
> > > Last mounted on:          /data
> > > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > > Filesystem magic number:  0xEF53
> > > Filesystem revision #:    1 (dynamic)
> > > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > > filetype needs_recovery sparse_super large_file
> >
> > Thanks for the data. OK, a filesystem without extents. Does your test by
> > any chance try to do direct IO to a hole in a file? Because that is not
> > (and never was) supported without extents. Also the fact that you don't see
> > the problem with ext4 (which means extents support) would be pointing in
> > that direction.
> 
> I am not sure if it trys to do direct IO to a hole or not, is there any
> way to check?  If you have a simple test to reproduce please let me know,
> we are glad to try.

Can you enable following tracing?

echo 1 >/sys/kernel/debug/tracing/events/ext4/ext4_ind_map_blocks_exit/enable
echo iomap_dio_rw >/sys/kernel/debug/tracing/set_ftrace_filter
echo "function_graph" >/sys/kernel/debug/tracing/current_tracer

And then gather output from /sys/kernel/debug/tracing/trace_pipe. Once the
problem reproduces, can you get the problematic file name from dmesg, find the
inode number with "stat <filename>", and provide all of that to me? Thanks!
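
To make the collection step concrete, the pieces might be wired together
roughly as follows (this is a sketch: the log path is an example, capturing
trace_pipe needs root, and the mktemp file merely stands in for the file
that dmesg will actually name):

```shell
# Sketch: capture tracing output while the workload runs, then map the
# file named in dmesg to its inode number for matching trace lines.
#
#   cat /sys/kernel/debug/tracing/trace_pipe > /tmp/dio-trace.log &   # needs root; run during reproduction
#
# Once dmesg names the problematic file, its inode number comes from stat:
f=$(mktemp)                     # stand-in for the file reported in dmesg
ino=$(stat -c %i "$f")
echo "inode: $ino"
# Trace lines for that inode could then be filtered with:
#   grep -w "ino $ino" /tmp/dio-trace.log
rm -f "$f"
```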

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Problem with direct IO
  2021-10-22  9:31         ` Jan Kara
@ 2021-10-23  2:06           ` Zhengyuan Liu
  2021-10-25 15:57             ` Jan Kara
  0 siblings, 1 reply; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-23  2:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: viro, Andrew Morton, tytso, linux-fsdevel, linux-mm, linux-ext4,
	刘云,
	Zhengyuan Liu

On Fri, Oct 22, 2021 at 5:31 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 21-10-21 20:11:43, Zhengyuan Liu wrote:
> > On Thu, Oct 21, 2021 at 4:03 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
> > > > On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@suse.cz> wrote:
> > > > > On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > > > > > we are encounting following Mysql crash problem while importing tables :
> > > > > >
> > > > > >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> > > > > >     fsync() returned EIO, aborting.
> > > > > >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> > > > > >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> > > > > >
> > > > > > At the same time , we found dmesg had following message:
> > > > > >
> > > > > >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> > > > > >     Possible data corruption due to collision with buffered I/O!
> > > > > >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> > > > > >     625 Comm: kworker/42:1
> > > > > >
> > > > > > Firstly, we doubled Mysql has operating the file with direct IO and
> > > > > > buffered IO interlaced, but after some checking we found it did only
> > > > > > do direct IO using aio. The problem is exactly from direct-io
> > > > > > interface (__generic_file_write_iter) itself.
> > > > > >
> > > > > > ssize_t __generic_file_write_iter()
> > > > > > {
> > > > > > ...
> > > > > >         if (iocb->ki_flags & IOCB_DIRECT) {
> > > > > >                 loff_t pos, endbyte;
> > > > > >
> > > > > >                 written = generic_file_direct_write(iocb, from);
> > > > > >                 /*
> > > > > >                  * If the write stopped short of completing, fall back to
> > > > > >                  * buffered writes.  Some filesystems do this for writes to
> > > > > >                  * holes, for example.  For DAX files, a buffered write will
> > > > > >                  * not succeed (even if it did, DAX does not handle dirty
> > > > > >                  * page-cache pages correctly).
> > > > > >                  */
> > > > > >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> > > > > >                         goto out;
> > > > > >
> > > > > >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > From above code snippet we can see that direct io could fall back to
> > > > > > buffered IO under certain conditions, so even Mysql only did direct IO
> > > > > > it could interleave with buffered IO when fall back occurred. I have
> > > > > > no idea why FS(ext3) failed the direct IO currently, but it is strange
> > > > > > __generic_file_write_iter make direct IO fall back to buffered IO, it
> > > > > > seems  breaking the semantics of direct IO.
> > > > > >
> > > > > > The reproduced  environment is:
> > > > > > Platform:  Kunpeng 920 (arm64)
> > > > > > Kernel: V5.15-rc
> > > > > > PAGESIZE: 64K
> > > > > > Mysql:  V8.0
> > > > > > Innodb_page_size: default(16K)
> > > > >
> > > > > Thanks for report. I agree this should not happen. How hard is this to
> > > > > reproduce? Any idea whether the fallback to buffered IO happens because
> > > > > iomap_dio_rw() returns -ENOTBLK or because it returns short write?
> > > >
> > > > It is easy to reproduce in my test environment, as I said in the previous
> > > > email replied to Andrew this problem is related to kernel page size.
> > >
> > > Ok, can you share a reproducer?
> >
> > I don't have a simple test case to reproduce, the whole procedure shown as
> > following is somewhat complex.
> >
> > 1. Prepare Mysql installation environment
> > a.  Prepare a SSD partition  (at least 100G) as the Mysql data
> > partition, format to Ext3 and mount to /data
> > # mkfs.ext3 /dev/sdb1
> > # mount /dev/sdb1 /data
> > b. Create Mysql user and user group
> > # groupadd mysql
> > # useradd -g mysql mysql
> > c. Create Mysql directory
> > # mkdir -p /data/mysql
> > # cd /data/mysql
> > #  mkdir data tmp run log
> >
> > 2. Install Mysql
> > a. Download mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar from
> > https://downloads.mysql.com/archives/community/
> > b. Install Mysql
> > #  tar -xvf mysql-8.0.25-1.el8.aarch64.rpm-bundle.tar
> > # yum install openssl openssl-devel
> > # rpm -ivh mysql-community-common-8.0.25-1.el8.aarch64.rpm
> > mysql-community-client-plugins-8.0.25-1.el8.aarch64.rpm \
> > mysql-community-libs-8.0.25-1.el8.aarch64.rp
> > mysql-community-client-8.0.25-1.el8.aarch64.rpm \
> >  mysql-community-server-8.0.25-1.el8.aarch64.rpm
> > mysql-community-devel-8.0.25-1.el8.aarch64.rpm
> >
> > 3. Configure Mysql
> > a. # chown mysql:mysql /etc/my.cnf
> > b. # vim /etc/my.cnf
> > innodb_flush_method = O_DIRECT
> > default-storage-engine=INNODB
> > datadir=/data/mysql/data
> > socket=/data/mysql/run/mysql.sock
> > tmpdir=/data/mysql/tmp
> > log-error=/data/mysql/log/mysqld.log
> > pid-file=/data/mysql/run/mysqld.pid
> > port=3306
> > user=mysql
> > c. initialize Mysql (problem may reproduce at this stage)
> > # mysqld --defaults-file=/etc/my.cnf --initialize
> > d. Start Mysql
> > # mysqld --defaults-file=/etc/my.cnf &
> > e. Login into Mysql
> > # mysql -uroot -p -S /data/mysql/run/mysql.sock
> > You can see the temporary password from step 3.c
> > f. Configure access
> > mysql> alter user 'root'@'localhost' identified by "123456";
> > mysql> create user 'root'@'%' identified by '123456';
> > mysql> grant all privileges on *.* to 'root'@'%'; flush privileges;
> > mysql> create database sysbench;
> >
> > 4. Use sysbench to test Mysql
> > a. Install sysbench from https://github.com/akopytov/sysbench/archive/master.zip
> > b. Use following script to reproduce problem (may need dozens of minutes)
> >     while true ; do
> >       sysbench /usr/local/share/sysbench/oltp_write_only.lua
> > --table-size=1000000 --tables=100 \
> >       --threads=32 --db-driver=mysql --mysql-db=sysbench
> > --mysql-host=127.0.0.1 --mysql- port=3306 \
> >       --mysql-user=root --mysql-password=123456
> > --mysql-socket=/var/lib/mysql/mysql.sock  prepare
> >
> >       sleep 5
> >       sysbench /usr/local/share/sysbench/oltp_write_only.lua
> > --table-size=1000000 --tables=100 \
> >       --threads=32 --db-driver=mysql --mysql-db=sysbench
> > --mysql-host=127.0.0.1 --mysql- port=3306 \
> >       --mysql-user=root --mysql-password=123456
> > --mysql-socket=/var/lib/mysql/mysql.sock  cleanup
> >
> >       sleep 5
> >   done
> >
> > If you can't reproduce, we could provide a remote environment for you or
> > connect to your machine to build a reproduced environment.
>
> Ah, not that simple, also it isn't that easy to get arm64 machine for
> experiments for me. Connecting to your environment would be possible but
> let's try remote debugging for a bit more ;)
>
> > > > > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > > > > problem happens? Thanks!
> > > >
> > > > Sure, the output is:
> > > >
> > > > # dumpe2fs -h /dev/sda3
> > > > dumpe2fs 1.45.3 (14-Jul-2019)
> > > > Filesystem volume name:   <none>
> > > > Last mounted on:          /data
> > > > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > > > Filesystem magic number:  0xEF53
> > > > Filesystem revision #:    1 (dynamic)
> > > > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > > > filetype needs_recovery sparse_super large_file
> > >
> > > Thanks for the data. OK, a filesystem without extents. Does your test by
> > > any chance try to do direct IO to a hole in a file? Because that is not
> > > (and never was) supported without extents. Also the fact that you don't see
> > > the problem with ext4 (which means extents support) would be pointing in
> > > that direction.
> >
> > I am not sure whether it tries to do direct IO to a hole; is there any
> > way to check?  If you have a simple test to reproduce this, please let
> > us know; we would be glad to try it.
>
> Can you enable the following tracing?

Sure, but let's confirm one thing before doing that: ext4 doesn't support
iomap in v4.19, yet we could still reproduce the problem there. Is the
tracing above still necessary, or should it be modified for v4.19?

> echo 1 >/sys/kernel/debug/tracing/events/ext4/ext4_ind_map_blocks_exit/enable
> echo iomap_dio_rw >/sys/kernel/debug/tracing/set_ftrace_filter
> echo "function_graph" >/sys/kernel/debug/tracing/current_tracer
>
> And then gather the output from /sys/kernel/debug/tracing/trace_pipe. Once
> the problem reproduces, can you get the problematic file name from dmesg,
> find the inode number with "stat <filename>", and provide all of that to
> me? Thanks!
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


* Re: Problem with direct IO
  2021-10-23  2:06           ` Zhengyuan Liu
@ 2021-10-25 15:57             ` Jan Kara
  2021-10-28 15:02               ` Zhengyuan Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Jan Kara @ 2021-10-25 15:57 UTC (permalink / raw)
  To: Zhengyuan Liu
  Cc: Jan Kara, viro, Andrew Morton, tytso, linux-fsdevel, linux-mm,
	linux-ext4, 刘云,
	Zhengyuan Liu

On Sat 23-10-21 10:06:24, Zhengyuan Liu wrote:
> On Fri, Oct 22, 2021 at 5:31 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > > > > > problem happens? Thanks!
> > > > >
> > > > > Sure, the output is:
> > > > >
> > > > > # dumpe2fs -h /dev/sda3
> > > > > dumpe2fs 1.45.3 (14-Jul-2019)
> > > > > Filesystem volume name:   <none>
> > > > > Last mounted on:          /data
> > > > > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > > > > Filesystem magic number:  0xEF53
> > > > > Filesystem revision #:    1 (dynamic)
> > > > > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > > > > filetype needs_recovery sparse_super large_file
> > > >
> > > > Thanks for the data. OK, a filesystem without extents. Does your test by
> > > > any chance try to do direct IO to a hole in a file? Because that is not
> > > > (and never was) supported without extents. Also the fact that you don't see
> > > > the problem with ext4 (which means extents support) would be pointing in
> > > > that direction.
> > >
> > > I am not sure whether it tries to do direct IO to a hole; is there any
> > > way to check?  If you have a simple test to reproduce this, please let
> > > us know; we would be glad to try it.
> >
> > Can you enable the following tracing?
> 
> Sure, but let's confirm one thing before doing that: ext4 doesn't support
> iomap in v4.19, yet we could still reproduce the problem there. Is the
> tracing above still necessary, or should it be modified for v4.19?

Well, iomap is just a different generic framework for doing direct IO. The
fact that you can observe the problem both with iomap and the old direct IO
framework is one of the reasons why I think the problem is actually that
the file has holes (unallocated space in the middle).
 
> > echo 1 >/sys/kernel/debug/tracing/events/ext4/ext4_ind_map_blocks_exit/enable
> > echo iomap_dio_rw >/sys/kernel/debug/tracing/set_ftrace_filter
> > echo "function_graph" >/sys/kernel/debug/tracing/current_tracer

If you want to trace a kernel where the ext4 direct IO path is not yet
converted to the iomap framework, you need to replace the tracing of
iomap_dio_rw with:

echo __blockdev_direct_IO >/sys/kernel/debug/tracing/set_ftrace_filter
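Putting the steps together, a sketch of the whole session on such a
pre-iomap kernel (needs root and a mounted debugfs; the log path is
arbitrary):

```shell
#!/bin/sh
# Sketch of the tracing session described above for a pre-iomap kernel.
T=/sys/kernel/debug/tracing

if [ ! -w "$T/current_tracer" ]; then
    echo "tracefs not available (need root and debugfs mounted)"
    exit 0
fi

echo 1 > "$T/events/ext4/ext4_ind_map_blocks_exit/enable"
echo __blockdev_direct_IO > "$T/set_ftrace_filter"
echo function_graph > "$T/current_tracer"

# Collect trace output in the background until the dmesg warning fires.
cat "$T/trace_pipe" > /tmp/dio-trace.log &
CAT_PID=$!

# ... reproduce the workload here; once dmesg reports the file:
#   stat <filename reported in dmesg>   # note the inode number

kill "$CAT_PID" 2>/dev/null
echo "trace saved to /tmp/dio-trace.log"
```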

> > And then gather the output from /sys/kernel/debug/tracing/trace_pipe.
> > Once the problem reproduces, can you get the problematic file name from
> > dmesg, find the inode number with "stat <filename>", and provide all of
> > that to me? Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: Problem with direct IO
  2021-10-25 15:57             ` Jan Kara
@ 2021-10-28 15:02               ` Zhengyuan Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Zhengyuan Liu @ 2021-10-28 15:02 UTC (permalink / raw)
  To: Jan Kara
  Cc: viro, Andrew Morton, tytso, linux-fsdevel, linux-mm, linux-ext4,
	刘云,
	Zhengyuan Liu

On Mon, Oct 25, 2021 at 11:57 PM Jan Kara <jack@suse.cz> wrote:
>
> On Sat 23-10-21 10:06:24, Zhengyuan Liu wrote:
> > On Fri, Oct 22, 2021 at 5:31 PM Jan Kara <jack@suse.cz> wrote:
> > > > > > > Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> > > > > > > problem happens? Thanks!
> > > > > >
> > > > > > Sure, the output is:
> > > > > >
> > > > > > # dumpe2fs -h /dev/sda3
> > > > > > dumpe2fs 1.45.3 (14-Jul-2019)
> > > > > > Filesystem volume name:   <none>
> > > > > > Last mounted on:          /data
> > > > > > Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
> > > > > > Filesystem magic number:  0xEF53
> > > > > > Filesystem revision #:    1 (dynamic)
> > > > > > Filesystem features:      has_journal ext_attr resize_inode dir_index
> > > > > > filetype needs_recovery sparse_super large_file
> > > > >
> > > > > Thanks for the data. OK, a filesystem without extents. Does your test by
> > > > > any chance try to do direct IO to a hole in a file? Because that is not
> > > > > (and never was) supported without extents. Also the fact that you don't see
> > > > > the problem with ext4 (which means extents support) would be pointing in
> > > > > that direction.
> > > >
> > > > I am not sure whether it tries to do direct IO to a hole; is there any
> > > > way to check?  If you have a simple test to reproduce this, please let
> > > > us know; we would be glad to try it.
> > >
> > > Can you enable the following tracing?
> >
> > Sure, but let's confirm one thing before doing that: ext4 doesn't support
> > iomap in v4.19, yet we could still reproduce the problem there. Is the
> > tracing above still necessary, or should it be modified for v4.19?
>
> Well, iomap is just a different generic framework for doing direct IO. The
> fact that you can observe the problem both with iomap and the old direct IO
> framework is one of the reasons why I think the problem is actually that
> the file has holes (unallocated space in the middle).
>
> > > echo 1 >/sys/kernel/debug/tracing/events/ext4/ext4_ind_map_blocks_exit/enable
> > > echo iomap_dio_rw >/sys/kernel/debug/tracing/set_ftrace_filter
> > > echo "function_graph" >/sys/kernel/debug/tracing/current_tracer
>
> If you want to trace a kernel where the ext4 direct IO path is not yet
> converted to the iomap framework, you need to replace the tracing of
> iomap_dio_rw with:
>
> echo __blockdev_direct_IO >/sys/kernel/debug/tracing/set_ftrace_filter

Sorry for the late reply; the problem is hard to reproduce with ftrace
enabled. I'll post here as soon as we get useful data.

> > > And then gather the output from /sys/kernel/debug/tracing/trace_pipe.
> > > Once the problem reproduces, can you get the problematic file name from
> > > dmesg, find the inode number with "stat <filename>", and provide all of
> > > that to me? Thanks!
>


end of thread, other threads:[~2021-10-28 15:03 UTC | newest]

Thread overview: 13+ messages
2021-10-13  1:46 Problem with direct IO Zhengyuan Liu
2021-10-18  1:09 ` Zhengyuan Liu
2021-10-18 18:43   ` Andrew Morton
2021-10-19  3:39     ` Zhengyuan Liu
2021-10-20 17:37 ` Jan Kara
2021-10-21  2:21   ` Zhengyuan Liu
2021-10-21  8:03     ` Jan Kara
2021-10-21 12:11       ` Zhengyuan Liu
2021-10-22  9:31         ` Jan Kara
2021-10-23  2:06           ` Zhengyuan Liu
2021-10-25 15:57             ` Jan Kara
2021-10-28 15:02               ` Zhengyuan Liu
     [not found]       ` <61712B10.2060408@huawei.com>
2021-10-21 12:20         ` Zhengyuan Liu
