* Re: blktests failures with v5.19-rc1 [not found] ` <CAHj4cs9G0WDrnSS6iVZJfgfOcRR0ysJhw+9yqcbqE=_8mkF0zw@mail.gmail.com> @ 2022-06-14 1:09 ` Shinichiro Kawasaki 2022-06-14 2:23 ` Keith Busch 0 siblings, 1 reply; 15+ messages in thread From: Shinichiro Kawasaki @ 2022-06-14 1:09 UTC (permalink / raw) To: Yi Zhang Cc: Keith Busch, Chaitanya Kulkarni, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci (CC+: linux-pci) On Jun 11, 2022 / 16:34, Yi Zhang wrote: > On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: > > > > On Fri, Jun 10, 2022 at 12:25:17PM +0000, Shinichiro Kawasaki wrote: > > > On Jun 10, 2022 / 09:32, Chaitanya Kulkarni wrote: > > > > >> #6: nvme/032: Failed at the first run after system reboot. > > > > >> Used QEMU NVME device as TEST_DEV. > > > > >> > > > > > > > > ofcourse we need to fix this issue but can you also > > > > try it with the real H/W ? > > > > > > Hi Chaitanya, thank you for looking into the failures. I have just run the test > > > case nvme/032 with real NVME device and observed the exactly same symptom as > > > QEMU NVME device. > > > > QEMU is perfectly fine for this test. There's no need to bring in "real" > > hardware for this. > > > > And I am not even sure this is real. I don't know yet why this is showing up > > only now, but this should fix it: > > Hi Keith > > Confirmed the WARNING issue was fixed with the change, here is the log: Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on v5.19-rc2. I took a closer look into this issue and found The deadlock WARN can be recreated with following two commands: # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove And it can be recreated with PCI devices other than NVME controller, such as SCSI controller or VGA controller. Then this is not a storage sub-system issue. I checked function call stacks of the two commands above. As shown below, it looks like ABBA deadlock possibility is detected and warned. echo 1 > /sys/bus/pci/devices/*/rescan kernfs_fop_write_iter kernfs_get_active atomic_inc_unless_nagative rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP) :lock L1 for "rescan" file dev_rescan_store pci_lock_rescan_remove mutex_lock(&pci_rescan_remove_lock) :lock L2 echo 1 > /sys/bus/pci/devices/*/remove kernfs_fop_write_iter remove_store pci_stop_and_remove_bus_device_locked pci_lock_rescan_remove mutex_lock(&pci_rescan_remove_lock) :lock L2 pci_stop_and_remove_bus_device pci_remove_bus_device device_del device_remove_attrs sysfs_remove_attrs sysfs_remove_groups sysfs_remove_group remove_files .... iterates for pci device sysfs files including "rescan" file? kernfs_remove_by_name_ns __kernfs_remove kernfs_drain rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP): lock L1 It looks for me that the deadlock possibility exists for real, even though the race between rescan operation and remove operation is really rare use case. I would like to hear comments on the guess above. I take the liberty to CC this to linux-pci list. Is this an issue to fix? -- Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-14 1:09 ` blktests failures with v5.19-rc1 Shinichiro Kawasaki @ 2022-06-14 2:23 ` Keith Busch 2022-06-14 2:38 ` Chaitanya Kulkarni 0 siblings, 1 reply; 15+ messages in thread From: Keith Busch @ 2022-06-14 2:23 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Yi Zhang, Chaitanya Kulkarni, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote: > (CC+: linux-pci) > On Jun 11, 2022 / 16:34, Yi Zhang wrote: > > On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: > > > > > > And I am not even sure this is real. I don't know yet why this is showing up > > > only now, but this should fix it: > > > > Hi Keith > > > > Confirmed the WARNING issue was fixed with the change, here is the log: > > Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to > dev_attr_dev_rescan avoids the fix, on v5.19-rc2. > > I took a closer look into this issue and found The deadlock WARN can be > recreated with following two commands: > > # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan > # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove > > And it can be recreated with PCI devices other than NVME controller, such as > SCSI controller or VGA controller. Then this is not a storage sub-system issue. > > I checked function call stacks of the two commands above. As shown below, it > looks like ABBA deadlock possibility is detected and warned. Yeah, I was mistaken on this report, so my proposal to suppress the warning is definitely not right. If I run both 'echo' commands in parallel, I see it deadlock frequently. I'm not familiar enough with this code to any good ideas on how to fix, but I agree this is a generic pci issue. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-14 2:23 ` Keith Busch @ 2022-06-14 2:38 ` Chaitanya Kulkarni 2022-06-14 4:00 ` Shinichiro Kawasaki 0 siblings, 1 reply; 15+ messages in thread From: Chaitanya Kulkarni @ 2022-06-14 2:38 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci Shinichiro, On 6/13/22 19:23, Keith Busch wrote: > On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote: >> (CC+: linux-pci) >> On Jun 11, 2022 / 16:34, Yi Zhang wrote: >>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: >>>> >>>> And I am not even sure this is real. I don't know yet why this is showing up >>>> only now, but this should fix it: >>> >>> Hi Keith >>> >>> Confirmed the WARNING issue was fixed with the change, here is the log: >> >> Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to >> dev_attr_dev_rescan avoids the fix, on v5.19-rc2. >> >> I took a closer look into this issue and found The deadlock WARN can be >> recreated with following two commands: >> >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove >> >> And it can be recreated with PCI devices other than NVME controller, such as >> SCSI controller or VGA controller. Then this is not a storage sub-system issue. >> >> I checked function call stacks of the two commands above. As shown below, it >> looks like ABBA deadlock possibility is detected and warned. > > Yeah, I was mistaken on this report, so my proposal to suppress the warning is > definitely not right. If I run both 'echo' commands in parallel, I see it > deadlock frequently. I'm not familiar enough with this code to any good ideas > on how to fix, but I agree this is a generic pci issue. I think it is worth adding a testcase to blktests to make sure these future releases will test this. -ck ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-14 2:38 ` Chaitanya Kulkarni @ 2022-06-14 4:00 ` Shinichiro Kawasaki 2022-06-15 19:47 ` Bjorn Helgaas 0 siblings, 1 reply; 15+ messages in thread From: Shinichiro Kawasaki @ 2022-06-14 4:00 UTC (permalink / raw) To: Chaitanya Kulkarni Cc: Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote: > Shinichiro, > > On 6/13/22 19:23, Keith Busch wrote: > > On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote: > >> (CC+: linux-pci) > >> On Jun 11, 2022 / 16:34, Yi Zhang wrote: > >>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: > >>>> > >>>> And I am not even sure this is real. I don't know yet why this is showing up > >>>> only now, but this should fix it: > >>> > >>> Hi Keith > >>> > >>> Confirmed the WARNING issue was fixed with the change, here is the log: > >> > >> Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to > >> dev_attr_dev_rescan avoids the fix, on v5.19-rc2. > >> > >> I took a closer look into this issue and found The deadlock WARN can be > >> recreated with following two commands: > >> > >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan > >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove > >> > >> And it can be recreated with PCI devices other than NVME controller, such as > >> SCSI controller or VGA controller. Then this is not a storage sub-system issue. > >> > >> I checked function call stacks of the two commands above. As shown below, it > >> looks like ABBA deadlock possibility is detected and warned. > > > > Yeah, I was mistaken on this report, so my proposal to suppress the warning is > > definitely not right. If I run both 'echo' commands in parallel, I see it > > deadlock frequently. I'm not familiar enough with this code to any good ideas > > on how to fix, but I agree this is a generic pci issue. > > I think it is worth adding a testcase to blktests to make sure these > future releases will test this. Yeah, this WARN is confusing for us then it would be valuable to test by blktests not to repeat it. One point I wonder is: which test group the test case will it fall in? The nvme group could be the group to add, probably. Another point I wonder is other kernel test suite than blktests. Don't we have more appropriate test suite to check PCI device rescan/remove race ? Such a test sounds more like a PCI bus sub-system test than block/storage test. Having said that, still I think the test case is valuable for block/storage. Unless anyone opposes, I'm open for the patch to add it. -- Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-14 4:00 ` Shinichiro Kawasaki @ 2022-06-15 19:47 ` Bjorn Helgaas 2022-06-15 22:01 ` Chaitanya Kulkarni 2022-06-15 23:16 ` Keith Busch 0 siblings, 2 replies; 15+ messages in thread From: Bjorn Helgaas @ 2022-06-15 19:47 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote: > > Shinichiro, > > > > On 6/13/22 19:23, Keith Busch wrote: > > > On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote: > > >> (CC+: linux-pci) > > >> On Jun 11, 2022 / 16:34, Yi Zhang wrote: > > >>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: > > >>>> > > >>>> And I am not even sure this is real. I don't know yet why > > >>>> this is showing up only now, but this should fix it: > > >>> > > >>> Hi Keith > > >>> > > >>> Confirmed the WARNING issue was fixed with the change, here is > > >>> the log: > > >> > > >> Thanks. I also confirmed that Keith's change to add > > >> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on > > >> v5.19-rc2. > > >> > > >> I took a closer look into this issue and found The deadlock > > >> WARN can be recreated with following two commands: > > >> > > >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan > > >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove > > >> > > >> And it can be recreated with PCI devices other than NVME > > >> controller, such as SCSI controller or VGA controller. Then > > >> this is not a storage sub-system issue. > > >> > > >> I checked function call stacks of the two commands above. As > > >> shown below, it looks like ABBA deadlock possibility is > > >> detected and warned. > > > > > > Yeah, I was mistaken on this report, so my proposal to suppress > > > the warning is definitely not right. If I run both 'echo' > > > commands in parallel, I see it deadlock frequently. I'm not > > > familiar enough with this code to any good ideas on how to fix, > > > but I agree this is a generic pci issue. > > > > I think it is worth adding a testcase to blktests to make sure > > these future releases will test this. > > Yeah, this WARN is confusing for us then it would be valuable to > test by blktests not to repeat it. One point I wonder is: which test > group the test case will it fall in? The nvme group could be the > group to add, probably. > > Another point I wonder is other kernel test suite than blktests. > Don't we have more appropriate test suite to check PCI device > rescan/remove race ? Such a test sounds more like a PCI bus > sub-system test than block/storage test. I'm not aware of such a test, but it would be nice to have one. Can you share your qemu config so I can reproduce this locally? Thanks for finding and reporting this! Bjorn ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-15 19:47 ` Bjorn Helgaas @ 2022-06-15 22:01 ` Chaitanya Kulkarni 2022-06-15 23:13 ` Yi Zhang 2022-06-15 23:16 ` Keith Busch 1 sibling, 1 reply; 15+ messages in thread From: Chaitanya Kulkarni @ 2022-06-15 22:01 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci, Bjorn Helgaas On 6/15/22 12:47, Bjorn Helgaas wrote: > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: >> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote: >>> Shinichiro, >>> >>> On 6/13/22 19:23, Keith Busch wrote: >>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote: >>>>> (CC+: linux-pci) >>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote: >>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: >>>>>>> >>>>>>> And I am not even sure this is real. I don't know yet why >>>>>>> this is showing up only now, but this should fix it: >>>>>> >>>>>> Hi Keith >>>>>> >>>>>> Confirmed the WARNING issue was fixed with the change, here is >>>>>> the log: >>>>> >>>>> Thanks. I also confirmed that Keith's change to add >>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on >>>>> v5.19-rc2. >>>>> >>>>> I took a closer look into this issue and found The deadlock >>>>> WARN can be recreated with following two commands: >>>>> >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove >>>>> >>>>> And it can be recreated with PCI devices other than NVME >>>>> controller, such as SCSI controller or VGA controller. Then >>>>> this is not a storage sub-system issue. >>>>> >>>>> I checked function call stacks of the two commands above. As >>>>> shown below, it looks like ABBA deadlock possibility is >>>>> detected and warned. >>>> >>>> Yeah, I was mistaken on this report, so my proposal to suppress >>>> the warning is definitely not right. If I run both 'echo' >>>> commands in parallel, I see it deadlock frequently. I'm not >>>> familiar enough with this code to any good ideas on how to fix, >>>> but I agree this is a generic pci issue. >>> >>> I think it is worth adding a testcase to blktests to make sure >>> these future releases will test this. >> >> Yeah, this WARN is confusing for us then it would be valuable to >> test by blktests not to repeat it. One point I wonder is: which test >> group the test case will it fall in? The nvme group could be the >> group to add, probably. >> since this issue been discovered with nvme rescan and revmoe, it should be added to the nvme category. >> Another point I wonder is other kernel test suite than blktests. >> Don't we have more appropriate test suite to check PCI device >> rescan/remove race ? Such a test sounds more like a PCI bus >> sub-system test than block/storage test. I don't think so we could have caught it long time back, but we clearly did not. > > I'm not aware of such a test, but it would be nice to have one. > > Can you share your qemu config so I can reproduce this locally? > > Thanks for finding and reporting this! > > Bjorn -ck ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-15 22:01 ` Chaitanya Kulkarni @ 2022-06-15 23:13 ` Yi Zhang 2022-06-16 4:42 ` Shinichiro Kawasaki 0 siblings, 1 reply; 15+ messages in thread From: Yi Zhang @ 2022-06-15 23:13 UTC (permalink / raw) To: Chaitanya Kulkarni Cc: Shinichiro Kawasaki, linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci, Bjorn Helgaas On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni <chaitanyak@nvidia.com> wrote: > > On 6/15/22 12:47, Bjorn Helgaas wrote: > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > >> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote: > >>> Shinichiro, > >>> > >>> On 6/13/22 19:23, Keith Busch wrote: > >>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote: > >>>>> (CC+: linux-pci) > >>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote: > >>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote: > >>>>>>> > >>>>>>> And I am not even sure this is real. I don't know yet why > >>>>>>> this is showing up only now, but this should fix it: > >>>>>> > >>>>>> Hi Keith > >>>>>> > >>>>>> Confirmed the WARNING issue was fixed with the change, here is > >>>>>> the log: > >>>>> > >>>>> Thanks. I also confirmed that Keith's change to add > >>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on > >>>>> v5.19-rc2. > >>>>> > >>>>> I took a closer look into this issue and found The deadlock > >>>>> WARN can be recreated with following two commands: > >>>>> > >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan > >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove > >>>>> > >>>>> And it can be recreated with PCI devices other than NVME > >>>>> controller, such as SCSI controller or VGA controller. Then > >>>>> this is not a storage sub-system issue. > >>>>> > >>>>> I checked function call stacks of the two commands above. As > >>>>> shown below, it looks like ABBA deadlock possibility is > >>>>> detected and warned. > >>>> > >>>> Yeah, I was mistaken on this report, so my proposal to suppress > >>>> the warning is definitely not right. If I run both 'echo' > >>>> commands in parallel, I see it deadlock frequently. I'm not > >>>> familiar enough with this code to any good ideas on how to fix, > >>>> but I agree this is a generic pci issue. > >>> > >>> I think it is worth adding a testcase to blktests to make sure > >>> these future releases will test this. > >> > >> Yeah, this WARN is confusing for us then it would be valuable to > >> test by blktests not to repeat it. One point I wonder is: which test > >> group the test case will it fall in? The nvme group could be the > >> group to add, probably. > >> > > since this issue been discovered with nvme rescan and revmoe, > it should be added to the nvme category. We already have nvme/032 which tests nvme rescan/reset/remove and the issue was reported by running this one, do we still need one more? > > >> Another point I wonder is other kernel test suite than blktests. > >> Don't we have more appropriate test suite to check PCI device > >> rescan/remove race ? Such a test sounds more like a PCI bus > >> sub-system test than block/storage test. > > I don't think so we could have caught it long time back, > but we clearly did not. > > > > > I'm not aware of such a test, but it would be nice to have one. > > > > Can you share your qemu config so I can reproduce this locally? > > > > Thanks for finding and reporting this! > > > > Bjorn > > -ck > > -- Best Regards, Yi Zhang ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-15 23:13 ` Yi Zhang @ 2022-06-16 4:42 ` Shinichiro Kawasaki 2022-06-16 17:55 ` Chaitanya Kulkarni 0 siblings, 1 reply; 15+ messages in thread From: Shinichiro Kawasaki @ 2022-06-16 4:42 UTC (permalink / raw) To: Yi Zhang Cc: Chaitanya Kulkarni, linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci, Bjorn Helgaas On Jun 16, 2022 / 07:13, Yi Zhang wrote: > On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni > <chaitanyak@nvidia.com> wrote: > > > > On 6/15/22 12:47, Bjorn Helgaas wrote: > > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > > >> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote: > > >>> Shinichiro, [snip] > > >>> I think it is worth adding a testcase to blktests to make sure > > >>> these future releases will test this. > > >> > > >> Yeah, this WARN is confusing for us then it would be valuable to > > >> test by blktests not to repeat it. One point I wonder is: which test > > >> group the test case will it fall in? The nvme group could be the > > >> group to add, probably. > > >> > > > > since this issue been discovered with nvme rescan and revmoe, > > it should be added to the nvme category. > > We already have nvme/032 which tests nvme rescan/reset/remove and the > issue was reported by running this one, do we still need one more? That is a point. Current nvme/032 checks nvme pci adapter rescan/reset/remove during I/O to catch problems in nvme driver and block layer, but actually it can catch the problem in pci sub-system also. I think Chaitanya's motivation for the new test case is to distinguish those two. If we have the new test case, its code will be similar and duplicated as nvme/032 code. To avoid such duplication, it would be good to improve nvme/032 to have two steps. The 1st step checks that nvme pci adapter rescan/reset/remove without I/O causes no kernel WARN (or any other unexpected kernel messages). Any issue found in this step is reported as a pci sub-system issue. The 2nd step checks nvme pci adapter rescan/reset/remove during I/O, as the current nvme/032 does. With this, we don't need the new test case, but still we can distinguish the problems in nvme/block sub-system and pci sub-system. > > >> Another point I wonder is other kernel test suite than blktests. > > >> Don't we have more appropriate test suite to check PCI device > > >> rescan/remove race ? Such a test sounds more like a PCI bus > > >> sub-system test than block/storage test. > > > > I don't think so we could have caught it long time back, > > but we clearly did not. I see, then it looks that blktests is the test suite to test it. -- Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-16 4:42 ` Shinichiro Kawasaki @ 2022-06-16 17:55 ` Chaitanya Kulkarni 0 siblings, 0 replies; 15+ messages in thread From: Chaitanya Kulkarni @ 2022-06-16 17:55 UTC (permalink / raw) To: Shinichiro Kawasaki, Yi Zhang Cc: linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe, linux-pci, Bjorn Helgaas On 6/15/2022 9:42 PM, Shinichiro Kawasaki wrote: > On Jun 16, 2022 / 07:13, Yi Zhang wrote: >> On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni >> <chaitanyak@nvidia.com> wrote: >>> >>> On 6/15/22 12:47, Bjorn Helgaas wrote: >>>> On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: >>>>> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote: >>>>>> Shinichiro, > > [snip] > >>>>>> I think it is worth adding a testcase to blktests to make sure >>>>>> these future releases will test this. >>>>> >>>>> Yeah, this WARN is confusing for us then it would be valuable to >>>>> test by blktests not to repeat it. One point I wonder is: which test >>>>> group the test case will it fall in? The nvme group could be the >>>>> group to add, probably. >>>>> >>> >>> since this issue been discovered with nvme rescan and revmoe, >>> it should be added to the nvme category. >> >> We already have nvme/032 which tests nvme rescan/reset/remove and the >> issue was reported by running this one, do we still need one more? > > That is a point. Current nvme/032 checks nvme pci adapter rescan/reset/remove > during I/O to catch problems in nvme driver and block layer, but actually it > can catch the problem in pci sub-system also. I think Chaitanya's motivation > for the new test case is to distinguish those two. > Yes, exactly. > If we have the new test case, its code will be similar and duplicated as > nvme/032 code. To avoid such duplication, it would be good to improve nvme/032 > to have two steps. The 1st step checks that nvme pci adapter rescan/reset/remove > without I/O causes no kernel WARN (or any other unexpected kernel messages). Any > issue found in this step is reported as a pci sub-system issue. The 2nd step > checks nvme pci adapter rescan/reset/remove during I/O, as the current nvme/032 > does. With this, we don't need the new test case, but still we can distinguish > the problems in nvme/block sub-system and pci sub-system. > Totally agree with this. >>>>> Another point I wonder is other kernel test suite than blktests. >>>>> Don't we have more appropriate test suite to check PCI device >>>>> rescan/remove race ? Such a test sounds more like a PCI bus >>>>> sub-system test than block/storage test. >>> >>> I don't think so we could have caught it long time back, >>> but we clearly did not. > > I see, then it looks that blktests is the test suite to test it. > -ck ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-15 19:47 ` Bjorn Helgaas 2022-06-15 22:01 ` Chaitanya Kulkarni @ 2022-06-15 23:16 ` Keith Busch 2022-07-19 4:50 ` Shinichiro Kawasaki 1 sibling, 1 reply; 15+ messages in thread From: Keith Busch @ 2022-06-15 23:16 UTC (permalink / raw) To: Bjorn Helgaas Cc: Shinichiro Kawasaki, Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote: > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > > > > Yeah, this WARN is confusing for us then it would be valuable to > > test by blktests not to repeat it. One point I wonder is: which test > > group the test case will it fall in? The nvme group could be the > > group to add, probably. > > > > Another point I wonder is other kernel test suite than blktests. > > Don't we have more appropriate test suite to check PCI device > > rescan/remove race ? Such a test sounds more like a PCI bus > > sub-system test than block/storage test. > > I'm not aware of such a test, but it would be nice to have one. > > Can you share your qemu config so I can reproduce this locally? > > Thanks for finding and reporting this! Hi Bjorn, This ought to be reproducible with any pci device that can be removed. Since we initially observed with nvme, you can try with such a device. A quick way to get one appearing in qemu is to add parameters: -drive id=n,if=none,file=null-co://,format=raw \ -device nvme,serial=foobar,drive=n ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-06-15 23:16 ` Keith Busch @ 2022-07-19 4:50 ` Shinichiro Kawasaki 2022-07-19 22:31 ` Bjorn Helgaas 0 siblings, 1 reply; 15+ messages in thread From: Shinichiro Kawasaki @ 2022-07-19 4:50 UTC (permalink / raw) To: Bjorn Helgaas Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch On Jun 15, 2022 / 17:16, Keith Busch wrote: > On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote: > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > > > > > > Yeah, this WARN is confusing for us then it would be valuable to > > > test by blktests not to repeat it. One point I wonder is: which test > > > group the test case will it fall in? The nvme group could be the > > > group to add, probably. > > > > > > Another point I wonder is other kernel test suite than blktests. > > > Don't we have more appropriate test suite to check PCI device > > > rescan/remove race ? Such a test sounds more like a PCI bus > > > sub-system test than block/storage test. > > > > I'm not aware of such a test, but it would be nice to have one. > > > > Can you share your qemu config so I can reproduce this locally? > > > > Thanks for finding and reporting this! > > Hi Bjorn, > > This ought to be reproducible with any pci device that can be removed. Since we > initially observed with nvme, you can try with such a device. A quick way to > get one appearing in qemu is to add parameters: > > -drive id=n,if=none,file=null-co://,format=raw \ > -device nvme,serial=foobar,drive=n Hello Bjorn, Did you have chance to reproduce the WARN? Recently, it was reported again [1] and getting attention. [1] https://lore.kernel.org/linux-block/4ed3028b-2d2f-8755-fec2-0b9cb5ad42d2@fujitsu.com/ -- Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-07-19 4:50 ` Shinichiro Kawasaki @ 2022-07-19 22:31 ` Bjorn Helgaas 2022-07-20 2:27 ` Shinichiro Kawasaki 0 siblings, 1 reply; 15+ messages in thread From: Bjorn Helgaas @ 2022-07-19 22:31 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch On Tue, Jul 19, 2022 at 04:50:36AM +0000, Shinichiro Kawasaki wrote: > On Jun 15, 2022 / 17:16, Keith Busch wrote: > > On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote: > > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > > > > > > > > Yeah, this WARN is confusing for us then it would be valuable to > > > > test by blktests not to repeat it. One point I wonder is: which test > > > > group the test case will it fall in? The nvme group could be the > > > > group to add, probably. > > > > > > > > Another point I wonder is other kernel test suite than blktests. > > > > Don't we have more appropriate test suite to check PCI device > > > > rescan/remove race ? Such a test sounds more like a PCI bus > > > > sub-system test than block/storage test. > > > > > > I'm not aware of such a test, but it would be nice to have one. > > > > > > Can you share your qemu config so I can reproduce this locally? > > > > > > Thanks for finding and reporting this! > > > > This ought to be reproducible with any pci device that can be > > removed. Since we initially observed with nvme, you can try with > > such a device. A quick way to get one appearing in qemu is to add > > parameters: > > > > -drive id=n,if=none,file=null-co://,format=raw \ > > -device nvme,serial=foobar,drive=n > > Did you have chance to reproduce the WARN? Recently, it was reported > again [1] and getting attention. I have not paid any attention to this yet. From what I can tell, the problem was discovered by a test case (i.e., not reported by a real-world user), it is not a recent regression, we haven't identified a commit that introduced the problem, and we do not have a potential fix for it. Obviously it needs to be fixed and I'm not trying to minimize the problem; I just want to calibrate it against everything else. > [1] https://lore.kernel.org/linux-block/4ed3028b-2d2f-8755-fec2-0b9cb5ad42d2@fujitsu.com/ > > -- > Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-07-19 22:31 ` Bjorn Helgaas @ 2022-07-20 2:27 ` Shinichiro Kawasaki 2022-12-19 11:27 ` Shinichiro Kawasaki 0 siblings, 1 reply; 15+ messages in thread From: Shinichiro Kawasaki @ 2022-07-20 2:27 UTC (permalink / raw) To: Bjorn Helgaas Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch On Jul 19, 2022 / 17:31, Bjorn Helgaas wrote: > On Tue, Jul 19, 2022 at 04:50:36AM +0000, Shinichiro Kawasaki wrote: > > On Jun 15, 2022 / 17:16, Keith Busch wrote: > > > On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote: > > > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote: > > > > > > > > > > Yeah, this WARN is confusing for us then it would be valuable to > > > > > test by blktests not to repeat it. One point I wonder is: which test > > > > > group the test case will it fall in? The nvme group could be the > > > > > group to add, probably. > > > > > > > > > > Another point I wonder is other kernel test suite than blktests. > > > > > Don't we have more appropriate test suite to check PCI device > > > > > rescan/remove race ? Such a test sounds more like a PCI bus > > > > > sub-system test than block/storage test. > > > > > > > > I'm not aware of such a test, but it would be nice to have one. > > > > > > > > Can you share your qemu config so I can reproduce this locally? > > > > > > > > Thanks for finding and reporting this! > > > > > > This ought to be reproducible with any pci device that can be > > > removed. Since we initially observed with nvme, you can try with > > > such a device. A quick way to get one appearing in qemu is to add > > > parameters: > > > > > > -drive id=n,if=none,file=null-co://,format=raw \ > > > -device nvme,serial=foobar,drive=n > > > > Did you have chance to reproduce the WARN? Recently, it was reported > > again [1] and getting attention. > > I have not paid any attention to this yet. From what I can tell, the > problem was discovered by a test case (i.e., not reported by a > real-world user), it is not a recent regression, we haven't identified > a commit that introduced the problem, and we do not have a potential > fix for it. Obviously it needs to be fixed and I'm not trying to > minimize the problem; I just want to calibrate it against everything > else. Bjorn, thank you for sharing your view, which sounds reasonable for me. -- Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-07-20 2:27 ` Shinichiro Kawasaki @ 2022-12-19 11:27 ` Shinichiro Kawasaki 2022-12-29 18:13 ` Bjorn Helgaas 0 siblings, 1 reply; 15+ messages in thread From: Shinichiro Kawasaki @ 2022-12-19 11:27 UTC (permalink / raw) To: Bjorn Helgaas Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch Bjorn, FYI. This lockdep warning disappeared with kernel version v6.1. I bisected and found the commit 2d7f9f8c1815 ("kernfs: Improve kernfs_drain() and always call on removal.") avoided the issue. This commit touches kernfs_drain() and __kernfs_remove(), and modifies the condition to lock kernfs_rwsem. I think it explains why the lockdep disappeared. No longer need to work on this issue :) -- Shin'ichiro Kawasaki ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: blktests failures with v5.19-rc1 2022-12-19 11:27 ` Shinichiro Kawasaki @ 2022-12-29 18:13 ` Bjorn Helgaas 0 siblings, 0 replies; 15+ messages in thread From: Bjorn Helgaas @ 2022-12-29 18:13 UTC (permalink / raw) To: Shinichiro Kawasaki Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme, linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch On Mon, Dec 19, 2022 at 11:27:03AM +0000, Shinichiro Kawasaki wrote: > Bjorn, FYI. This lockdep warning disappeared with kernel version > v6.1. I bisected and found the commit 2d7f9f8c1815 ("kernfs: Improve > kernfs_drain() and always call on removal.") avoided the issue. This > commit touches kernfs_drain() and __kernfs_remove(), and modifies > the condition to lock kernfs_rwsem. I think it explains why the > lockdep disappeared. No longer need to work on this issue :) Thanks for following up on this! Bjorn ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2022-12-29 18:13 UTC | newest] Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20220609235329.4jbz4wr3eg2nmzqa@shindev> [not found] ` <717734c9-c633-fb48-499e-7e3e15113020@nvidia.com> [not found] ` <19d09611-42cc-5a81-d676-c5375865c14c@nvidia.com> [not found] ` <20220610122517.6pt5y63hcosk5mes@shindev> [not found] ` <YqNZiMw+rH5gyZDI@kbusch-mbp.dhcp.thefacebook.com> [not found] ` <CAHj4cs9G0WDrnSS6iVZJfgfOcRR0ysJhw+9yqcbqE=_8mkF0zw@mail.gmail.com> 2022-06-14 1:09 ` blktests failures with v5.19-rc1 Shinichiro Kawasaki 2022-06-14 2:23 ` Keith Busch 2022-06-14 2:38 ` Chaitanya Kulkarni 2022-06-14 4:00 ` Shinichiro Kawasaki 2022-06-15 19:47 ` Bjorn Helgaas 2022-06-15 22:01 ` Chaitanya Kulkarni 2022-06-15 23:13 ` Yi Zhang 2022-06-16 4:42 ` Shinichiro Kawasaki 2022-06-16 17:55 ` Chaitanya Kulkarni 2022-06-15 23:16 ` Keith Busch 2022-07-19 4:50 ` Shinichiro Kawasaki 2022-07-19 22:31 ` Bjorn Helgaas 2022-07-20 2:27 ` Shinichiro Kawasaki 2022-12-19 11:27 ` Shinichiro Kawasaki 2022-12-29 18:13 ` Bjorn Helgaas
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).