regressions.lists.linux.dev archive mirror
From: Thomas Deutschmann <whissi@whissi.de>
To: Song Liu <song@kernel.org>, Vishal Verma <vverma@digitalocean.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>,
	stable@vger.kernel.org, regressions@lists.linux.dev,
	Jens Axboe <axboe@kernel.dk>
Subject: Re: [REGRESSION] v5.17-rc1+: FIFREEZE ioctl system call hangs
Date: Wed, 17 Aug 2022 08:53:46 +0200
Message-ID: <43e678ca-3fc3-6c08-f035-2c31a34dd889@whissi.de>
In-Reply-To: <CAPhsuW5f9QD+gzJ9eBhn5irsHvrsvkWjSnA4MPaHsQjjLMypXg@mail.gmail.com>

Hi,

On 2022-08-17 08:19, Song Liu wrote:
> On Mon, Aug 15, 2022 at 8:46 AM Vishal Verma
> <vverma@digitalocean.com> wrote:
>> 
>> Just saw this. I’m trying to understand whether this happens only
>> on md array or individual nvme drives (without any raid) too? The
>> commit you pointed added REQ_NOWAIT for md based arrays, but if it
>> is happening on individual nvme drives then that could point to
>> something with REQ_NOWAIT I think.
> 
> Agreed with this analysis.

I bisected again, this time testing against a single NVMe device.

I ran the bisect twice and ended up with the same result both times:

 > git bisect start
 > # good: [8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
 > git bisect good 8bb7eca972ad531c9b149c0a51ab43a417385813
 > # bad: [df0cc57e057f18e44dac8e6c18aba47ab53202f9] Linux 5.16
 > git bisect bad df0cc57e057f18e44dac8e6c18aba47ab53202f9
 > # good: [2219b0ceefe835b92a8a74a73fe964aa052742a2] Merge tag 'soc-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
 > git bisect good 2219b0ceefe835b92a8a74a73fe964aa052742a2
 > # good: [206825f50f908771934e1fba2bfc2e1f1138b36a] Merge tag 'mtd/for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
 > git bisect good 206825f50f908771934e1fba2bfc2e1f1138b36a
 > # bad: [4e1fddc98d2585ddd4792b5e44433dcee7ece001] tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows
 > git bisect bad 4e1fddc98d2585ddd4792b5e44433dcee7ece001
 > # good: [dbf49896187fd58c577fa1574a338e4f3672b4b2] Merge branch 'akpm' (patches from Andrew)
 > git bisect good dbf49896187fd58c577fa1574a338e4f3672b4b2
 > # good: [0ecca62beb12eeb13965ed602905c8bf53ac93d0] Merge tag 'ceph-for-5.16-rc1' of git://github.com/ceph/ceph-client
 > git bisect good 0ecca62beb12eeb13965ed602905c8bf53ac93d0
 > # bad: [7d5775d49e4a488bc8a07e5abb2b71a4c28aadbb] Merge tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
 > git bisect bad 7d5775d49e4a488bc8a07e5abb2b71a4c28aadbb
 > # good: [35c8fad4a703fdfa009ed274f80bb64b49314cde] Merge tag 'perf-tools-for-v5.16-2021-11-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
 > git bisect good 35c8fad4a703fdfa009ed274f80bb64b49314cde
 > # good: [6ea45c57dc176dde529ab5d7c4b3f20e52a2bd82] Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
 > git bisect good 6ea45c57dc176dde529ab5d7c4b3f20e52a2bd82
 > # bad: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1
 > git bisect bad fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf
 > # good: [475c3f599582a34e189f047ed3fb7e90a295ea5b] sh: fix READ/WRITE redefinition warnings
 > git bisect good 475c3f599582a34e189f047ed3fb7e90a295ea5b
 > # good: [c3b68c27f58a07130382f3fa6320c3652ad76f15] Merge tag 'for-5.16/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
 > git bisect good c3b68c27f58a07130382f3fa6320c3652ad76f15
 > # good: [4a6b35b3b3f28df81fea931dc77c4c229cbdb5b2] xfs: sync xfs_btree_split macros with userspace libxfs
 > git bisect good 4a6b35b3b3f28df81fea931dc77c4c229cbdb5b2
 > # good: [dee2b702bcf067d7b6b62c18bdd060ff0810a800] kconfig: Add support for -Wimplicit-fallthrough
 > git bisect good dee2b702bcf067d7b6b62c18bdd060ff0810a800
 > # first bad commit: [fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf] Linux 5.16-rc1

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fa55b7dcdc43c1aa1ba12bca9d2dd4318c2a0dbf

...but this doesn't make any sense, right?

However, I cannot reproduce the hang with the preceding commit:
dee2b702bcf0 didn't freeze in any of my 10 test runs.
But with fa55b7dcdc (or any later commit), the system freezes on _every_
test run?!

I checked out 1bd297988b75, which never failed before, changed the
Makefile to PATCHLEVEL=16 and EXTRAVERSION=-rc1, and guess what: it's now
failing, too.

So it sounds like some code changes its behavior when the kernel version
(KV) is >= 5.16-rc1. Is that possible?
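
For illustration only: one way software could change behavior based on
the kernel version is by parsing the release string reported by
`uname -r` and enabling a code path only from a given version on. The
sketch below shows such a check. The function name `kernel_at_least` and
the 5.16 threshold are hypothetical, and nothing in this mail identifies
which component, if any, actually does this.

```sh
# Hypothetical helper: succeed if release string $1 (e.g. "5.16.0-rc1")
# is at least version $2.$3 (major.minor).
kernel_at_least() {
    local release="$1"
    local want_major="$2"
    local want_minor="$3"
    local major rest minor

    major=${release%%.*}        # text before the first dot, e.g. "5"
    rest=${release#*.}          # text after the first dot, e.g. "16.0-rc1"
    minor=${rest%%[!0-9]*}      # leading digits of the rest, e.g. "16"

    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Example: branch on the running kernel (the threshold is an assumption).
if kernel_at_least "$(uname -r)" 5 16; then
    echo "release string is >= 5.16"
else
    echo "release string is < 5.16"
fi
```

Because the check only looks at the release string, an older tree built
with PATCHLEVEL=16 would take the ">= 5.16" branch as well.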

Anyway, I tested v5.10 (with PATCHLEVEL=16 and EXTRAVERSION=-rc1 set),
which worked, so I started another bisect session in which I set the KV
of every kernel to 5.16-rc1.
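
A minimal sketch of the version override described above, assuming GNU
sed and a kernel tree whose top-level Makefile contains the usual
`PATCHLEVEL = ...` and `EXTRAVERSION = ...` lines. The helper name
`set_kv` is hypothetical and not taken from this thread.

```sh
# Hypothetical helper: rewrite the version fields in a kernel Makefile.
#   $1 = path to the top-level Makefile
#   $2 = new PATCHLEVEL (e.g. 16)
#   $3 = new EXTRAVERSION (e.g. -rc1)
set_kv() {
    local makefile="$1"
    local patchlevel="$2"
    local extraversion="$3"

    # Only lines that start with the exact variable name are touched, so
    # VERSION and SUBLEVEL stay as they are.
    sed -i -E \
        -e "s/^PATCHLEVEL[[:space:]]*=.*/PATCHLEVEL = ${patchlevel}/" \
        -e "s/^EXTRAVERSION[[:space:]]*=.*/EXTRAVERSION = ${extraversion}/" \
        "$makefile"
}
```

During a bisect this would be called as `set_kv Makefile 16 -rc1` before
each build. The modified Makefile likely needs to be restored (for
example with `git checkout -- Makefile`) before marking the revision,
since git can refuse to switch commits when that would overwrite local
changes.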

I'll post my findings when this session is complete.


> I am not able to reproduce this on 5.19+ kernel. I have:
> 
> [root@eth50-1 ~]# lsblk
> NAME    MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
> sr0      11:0    1 1024M  0 rom
> vda     253:0    0   32G  0 disk
> ├─vda1  253:1    0    2G  0 part  /boot
> └─vda2  253:2    0   30G  0 part  /
> nvme0n1 259:0    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> nvme2n1 259:1    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> nvme3n1 259:2    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> nvme1n1 259:3    0    4G  0 disk
> └─md0     9:0    0   12G  0 raid5 /root/mnt
> [root@eth50-1 ~]# for x in {1..100} ; do fsfreeze --unfreeze /root/mnt ; fsfreeze --freeze /root/mnt ; done
> 
> Did I miss something?

Well, your reproducer doesn't work. As written in my initial mail,
executing `fsfreeze --freeze...` directly after boot doesn't fail for me
either. The device/array must have seen some I/O to trigger this.

To be more precise:

During my current bisect session (where I set KV to 5.16-rc1 for all 
kernels), I noticed that my 'reproducer' failed:

To trigger the problem, it is not enough to generate arbitrary I/O, for
example by copying some files.

I am using mysqld (MariaDB 10.6.8) to restore ~20GB of SQL dumps, and
somehow this triggers the problem reliably. mysqld is using O_DIRECT
(https://mariadb.com/kb/en/innodb-system-variables/#innodb_flush_method),
so maybe direct I/O is the trigger.

This process usually takes ~620s on the test system where I am
experiencing the problem. After the import, I called `fsfreeze --freeze ...`
against the mount point used by mysqld.
When this command did not return (i.e. fsfreeze was hanging), I marked
the revision as bad.
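
The pass/fail check described above can be sketched as follows. Since a
hanging fsfreeze may not react to signals, the sketch runs the command
in the background and only polls for completion instead of waiting on
it. The function name `classify_freeze`, the marker file, and the
deadline value are assumptions made for illustration.

```sh
# Hypothetical helper: run a command in the background and print "good"
# if it finishes within $1 seconds, or "bad" if it is still running.
classify_freeze() {
    local deadline="$1"
    shift
    local marker waited

    marker=$(mktemp)    # stays empty until the command has returned

    # The subshell records the exit status once the command returns.
    # Its own output goes to /dev/null so callers never block on it.
    ( "$@"; echo "$?" > "$marker" ) >/dev/null 2>&1 &

    waited=0
    while [ "$waited" -lt "$deadline" ] && [ ! -s "$marker" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    if [ -s "$marker" ]; then
        rm -f "$marker"
        echo good
        return 0
    fi

    echo bad
    return 1
}
```

A possible call would be `classify_freeze 60 fsfreeze --freeze /path/to/mountpoint`,
where both the 60 second deadline and the mount point are placeholders.
After a "good" result the filesystem still has to be thawed again with
`fsfreeze --unfreeze`.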

Since setting the KV in all kernels to "5.16-rc1", I noticed that the
import process sometimes stalled: mysqld was still running and responsive
(which is not the case when fsfreeze hangs, for example) and `SHOW
PROCESSLIST` showed the running imports with a still increasing time
counter. However, no data is read or written anymore, although the
fsfreeze command still works when this happens. Anyway, I marked revisions
showing this behavior as bad, too.

I'll post my results when I have finished this bisect session.


-- 
Regards,
Thomas

