linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: blktests failures with v5.19-rc1
       [not found]         ` <CAHj4cs9G0WDrnSS6iVZJfgfOcRR0ysJhw+9yqcbqE=_8mkF0zw@mail.gmail.com>
@ 2022-06-14  1:09           ` Shinichiro Kawasaki
  2022-06-14  2:23             ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-06-14  1:09 UTC (permalink / raw)
  To: Yi Zhang
  Cc: Keith Busch, Chaitanya Kulkarni, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci

(CC+: linux-pci)

On Jun 11, 2022 / 16:34, Yi Zhang wrote:
> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Fri, Jun 10, 2022 at 12:25:17PM +0000, Shinichiro Kawasaki wrote:
> > > On Jun 10, 2022 / 09:32, Chaitanya Kulkarni wrote:
> > > > >> #6: nvme/032: Failed at the first run after system reboot.
> > > > >>                 Used QEMU NVME device as TEST_DEV.
> > > > >>
> > > >
> > > > ofcourse we need to fix this issue but can you also
> > > > try it with the real H/W ?
> > >
> > > Hi Chaitanya, thank you for looking into the failures. I have just run the test
> > > case nvme/032 with real NVME device and observed the exactly same symptom as
> > > QEMU NVME device.
> >
> > QEMU is perfectly fine for this test. There's no need to bring in "real"
> > hardware for this.
> >
> > And I am not even sure this is real. I don't know yet why this is showing up
> > only now, but this should fix it:
> 
> Hi Keith
> 
> Confirmed the WARNING issue was fixed with the change, here is the log:

Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to
dev_attr_dev_rescan avoids the fix, on v5.19-rc2.

I took a closer look into this issue and found The deadlock WARN can be
recreated with following two commands:

# echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
# echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove

And it can be recreated with PCI devices other than NVME controller, such as
SCSI controller or VGA controller. Then this is not a storage sub-system issue.

I checked function call stacks of the two commands above. As shown below, it
looks like ABBA deadlock possibility is detected and warned.

echo 1 > /sys/bus/pci/devices/*/rescan
  kernfs_fop_write_iter
    kernfs_get_active
      atomic_inc_unless_nagative
      rwsem_acquire_read(&kn->dep_map, 0, 1, _RET_IP) :lock L1 for "rescan" file
    dev_rescan_store
      pci_lock_rescan_remove
        mutex_lock(&pci_rescan_remove_lock)           :lock L2

echo 1 > /sys/bus/pci/devices/*/remove
  kernfs_fop_write_iter
    remove_store
      pci_stop_and_remove_bus_device_locked
        pci_lock_rescan_remove
          mutex_lock(&pci_rescan_remove_lock)         :lock L2
        pci_stop_and_remove_bus_device
	  pci_remove_bus_device
	    device_del
	      device_remove_attrs
		sysfs_remove_attrs
		  sysfs_remove_groups
		    sysfs_remove_group
		      remove_files    .... iterates for pci device sysfs files including "rescan" file?
			kernfs_remove_by_name_ns
			  __kernfs_remove
			    kernfs_drain
			      rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP): lock L1

It looks for me that the deadlock possibility exists for real, even though the
race between rescan operation and remove operation is really rare use case.

I would like to hear comments on the guess above. I take the liberty to CC this
to linux-pci list. Is this an issue to fix?

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-14  1:09           ` blktests failures with v5.19-rc1 Shinichiro Kawasaki
@ 2022-06-14  2:23             ` Keith Busch
  2022-06-14  2:38               ` Chaitanya Kulkarni
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2022-06-14  2:23 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Yi Zhang, Chaitanya Kulkarni, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci

On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
> (CC+: linux-pci)
> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
> > On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
> > >
> > > And I am not even sure this is real. I don't know yet why this is showing up
> > > only now, but this should fix it:
> > 
> > Hi Keith
> > 
> > Confirmed the WARNING issue was fixed with the change, here is the log:
> 
> Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to
> dev_attr_dev_rescan avoids the fix, on v5.19-rc2.
> 
> I took a closer look into this issue and found The deadlock WARN can be
> recreated with following two commands:
> 
> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
> 
> And it can be recreated with PCI devices other than NVME controller, such as
> SCSI controller or VGA controller. Then this is not a storage sub-system issue.
> 
> I checked function call stacks of the two commands above. As shown below, it
> looks like ABBA deadlock possibility is detected and warned.

Yeah, I was mistaken on this report, so my proposal to suppress the warning is
definitely not right. If I run both 'echo' commands in parallel, I see it
deadlock frequently. I'm not familiar enough with this code to any good ideas
on how to fix, but I agree this is a generic pci issue.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-14  2:23             ` Keith Busch
@ 2022-06-14  2:38               ` Chaitanya Kulkarni
  2022-06-14  4:00                 ` Shinichiro Kawasaki
  0 siblings, 1 reply; 15+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-14  2:38 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi,
	mstowe, linux-pci

Shinichiro,

On 6/13/22 19:23, Keith Busch wrote:
> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
>> (CC+: linux-pci)
>> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
>>>>
>>>> And I am not even sure this is real. I don't know yet why this is showing up
>>>> only now, but this should fix it:
>>>
>>> Hi Keith
>>>
>>> Confirmed the WARNING issue was fixed with the change, here is the log:
>>
>> Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to
>> dev_attr_dev_rescan avoids the fix, on v5.19-rc2.
>>
>> I took a closer look into this issue and found The deadlock WARN can be
>> recreated with following two commands:
>>
>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
>>
>> And it can be recreated with PCI devices other than NVME controller, such as
>> SCSI controller or VGA controller. Then this is not a storage sub-system issue.
>>
>> I checked function call stacks of the two commands above. As shown below, it
>> looks like ABBA deadlock possibility is detected and warned.
> 
> Yeah, I was mistaken on this report, so my proposal to suppress the warning is
> definitely not right. If I run both 'echo' commands in parallel, I see it
> deadlock frequently. I'm not familiar enough with this code to any good ideas
> on how to fix, but I agree this is a generic pci issue.

I think it is worth adding a testcase to blktests to make sure these
future releases will test this.

-ck



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-14  2:38               ` Chaitanya Kulkarni
@ 2022-06-14  4:00                 ` Shinichiro Kawasaki
  2022-06-15 19:47                   ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-06-14  4:00 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi,
	mstowe, linux-pci

On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
> Shinichiro,
> 
> On 6/13/22 19:23, Keith Busch wrote:
> > On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
> >> (CC+: linux-pci)
> >> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
> >>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
> >>>>
> >>>> And I am not even sure this is real. I don't know yet why this is showing up
> >>>> only now, but this should fix it:
> >>>
> >>> Hi Keith
> >>>
> >>> Confirmed the WARNING issue was fixed with the change, here is the log:
> >>
> >> Thanks. I also confirmed that Keith's change to add __ATTR_IGNORE_LOCKDEP to
> >> dev_attr_dev_rescan avoids the fix, on v5.19-rc2.
> >>
> >> I took a closer look into this issue and found The deadlock WARN can be
> >> recreated with following two commands:
> >>
> >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
> >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
> >>
> >> And it can be recreated with PCI devices other than NVME controller, such as
> >> SCSI controller or VGA controller. Then this is not a storage sub-system issue.
> >>
> >> I checked function call stacks of the two commands above. As shown below, it
> >> looks like ABBA deadlock possibility is detected and warned.
> > 
> > Yeah, I was mistaken on this report, so my proposal to suppress the warning is
> > definitely not right. If I run both 'echo' commands in parallel, I see it
> > deadlock frequently. I'm not familiar enough with this code to any good ideas
> > on how to fix, but I agree this is a generic pci issue.
> 
> I think it is worth adding a testcase to blktests to make sure these
> future releases will test this.

Yeah, this WARN is confusing for us then it would be valuable to test by
blktests not to repeat it. One point I wonder is: which test group the test case
will it fall in? The nvme group could be the group to add, probably.

Another point I wonder is other kernel test suite than blktests. Don't we have
more appropriate test suite to check PCI device rescan/remove race ? Such a test
sounds more like a PCI bus sub-system test than block/storage test.

Having said that, still I think the test case is valuable for block/storage.
Unless anyone opposes, I'm open for the patch to add it.

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-14  4:00                 ` Shinichiro Kawasaki
@ 2022-06-15 19:47                   ` Bjorn Helgaas
  2022-06-15 22:01                     ` Chaitanya Kulkarni
  2022-06-15 23:16                     ` Keith Busch
  0 siblings, 2 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2022-06-15 19:47 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, Keith Busch,
	linux-nvme, linux-scsi, mstowe, linux-pci

On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
> > Shinichiro,
> > 
> > On 6/13/22 19:23, Keith Busch wrote:
> > > On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
> > >> (CC+: linux-pci)
> > >> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
> > >>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
> > >>>>
> > >>>> And I am not even sure this is real. I don't know yet why
> > >>>> this is showing up only now, but this should fix it:
> > >>>
> > >>> Hi Keith
> > >>>
> > >>> Confirmed the WARNING issue was fixed with the change, here is
> > >>> the log:
> > >>
> > >> Thanks. I also confirmed that Keith's change to add
> > >> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on
> > >> v5.19-rc2.
> > >>
> > >> I took a closer look into this issue and found The deadlock
> > >> WARN can be recreated with following two commands:
> > >>
> > >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
> > >> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
> > >>
> > >> And it can be recreated with PCI devices other than NVME
> > >> controller, such as SCSI controller or VGA controller. Then
> > >> this is not a storage sub-system issue.
> > >>
> > >> I checked function call stacks of the two commands above. As
> > >> shown below, it looks like ABBA deadlock possibility is
> > >> detected and warned.
> > > 
> > > Yeah, I was mistaken on this report, so my proposal to suppress
> > > the warning is definitely not right. If I run both 'echo'
> > > commands in parallel, I see it deadlock frequently. I'm not
> > > familiar enough with this code to any good ideas on how to fix,
> > > but I agree this is a generic pci issue.
> > 
> > I think it is worth adding a testcase to blktests to make sure
> > these future releases will test this.
> 
> Yeah, this WARN is confusing for us then it would be valuable to
> test by blktests not to repeat it. One point I wonder is: which test
> group the test case will it fall in? The nvme group could be the
> group to add, probably.
> 
> Another point I wonder is other kernel test suite than blktests.
> Don't we have more appropriate test suite to check PCI device
> rescan/remove race ? Such a test sounds more like a PCI bus
> sub-system test than block/storage test.

I'm not aware of such a test, but it would be nice to have one.

Can you share your qemu config so I can reproduce this locally?

Thanks for finding and reporting this!

Bjorn

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-15 19:47                   ` Bjorn Helgaas
@ 2022-06-15 22:01                     ` Chaitanya Kulkarni
  2022-06-15 23:13                       ` Yi Zhang
  2022-06-15 23:16                     ` Keith Busch
  1 sibling, 1 reply; 15+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-15 22:01 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Yi Zhang, linux-block, Keith Busch, linux-nvme, linux-scsi,
	mstowe, linux-pci, Bjorn Helgaas

On 6/15/22 12:47, Bjorn Helgaas wrote:
> On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
>> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
>>> Shinichiro,
>>>
>>> On 6/13/22 19:23, Keith Busch wrote:
>>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
>>>>> (CC+: linux-pci)
>>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
>>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
>>>>>>>
>>>>>>> And I am not even sure this is real. I don't know yet why
>>>>>>> this is showing up only now, but this should fix it:
>>>>>>
>>>>>> Hi Keith
>>>>>>
>>>>>> Confirmed the WARNING issue was fixed with the change, here is
>>>>>> the log:
>>>>>
>>>>> Thanks. I also confirmed that Keith's change to add
>>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on
>>>>> v5.19-rc2.
>>>>>
>>>>> I took a closer look into this issue and found The deadlock
>>>>> WARN can be recreated with following two commands:
>>>>>
>>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
>>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
>>>>>
>>>>> And it can be recreated with PCI devices other than NVME
>>>>> controller, such as SCSI controller or VGA controller. Then
>>>>> this is not a storage sub-system issue.
>>>>>
>>>>> I checked function call stacks of the two commands above. As
>>>>> shown below, it looks like ABBA deadlock possibility is
>>>>> detected and warned.
>>>>
>>>> Yeah, I was mistaken on this report, so my proposal to suppress
>>>> the warning is definitely not right. If I run both 'echo'
>>>> commands in parallel, I see it deadlock frequently. I'm not
>>>> familiar enough with this code to any good ideas on how to fix,
>>>> but I agree this is a generic pci issue.
>>>
>>> I think it is worth adding a testcase to blktests to make sure
>>> these future releases will test this.
>>
>> Yeah, this WARN is confusing for us then it would be valuable to
>> test by blktests not to repeat it. One point I wonder is: which test
>> group the test case will it fall in? The nvme group could be the
>> group to add, probably.
>>

since this issue been discovered with nvme rescan and revmoe,
it should be added to the nvme category.

>> Another point I wonder is other kernel test suite than blktests.
>> Don't we have more appropriate test suite to check PCI device
>> rescan/remove race ? Such a test sounds more like a PCI bus
>> sub-system test than block/storage test.

I don't think so we could have caught it long time back,
but we clearly did not.

> 
> I'm not aware of such a test, but it would be nice to have one.
> 
> Can you share your qemu config so I can reproduce this locally?
> 
> Thanks for finding and reporting this!
> 
> Bjorn

-ck



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-15 22:01                     ` Chaitanya Kulkarni
@ 2022-06-15 23:13                       ` Yi Zhang
  2022-06-16  4:42                         ` Shinichiro Kawasaki
  0 siblings, 1 reply; 15+ messages in thread
From: Yi Zhang @ 2022-06-15 23:13 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Shinichiro Kawasaki, linux-block, Keith Busch, linux-nvme,
	linux-scsi, mstowe, linux-pci, Bjorn Helgaas

On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni
<chaitanyak@nvidia.com> wrote:
>
> On 6/15/22 12:47, Bjorn Helgaas wrote:
> > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> >> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
> >>> Shinichiro,
> >>>
> >>> On 6/13/22 19:23, Keith Busch wrote:
> >>>> On Tue, Jun 14, 2022 at 01:09:07AM +0000, Shinichiro Kawasaki wrote:
> >>>>> (CC+: linux-pci)
> >>>>> On Jun 11, 2022 / 16:34, Yi Zhang wrote:
> >>>>>> On Fri, Jun 10, 2022 at 10:49 PM Keith Busch <kbusch@kernel.org> wrote:
> >>>>>>>
> >>>>>>> And I am not even sure this is real. I don't know yet why
> >>>>>>> this is showing up only now, but this should fix it:
> >>>>>>
> >>>>>> Hi Keith
> >>>>>>
> >>>>>> Confirmed the WARNING issue was fixed with the change, here is
> >>>>>> the log:
> >>>>>
> >>>>> Thanks. I also confirmed that Keith's change to add
> >>>>> __ATTR_IGNORE_LOCKDEP to dev_attr_dev_rescan avoids the fix, on
> >>>>> v5.19-rc2.
> >>>>>
> >>>>> I took a closer look into this issue and found The deadlock
> >>>>> WARN can be recreated with following two commands:
> >>>>>
> >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/rescan
> >>>>> # echo 1 > /sys/bus/pci/devices/0000\:00\:09.0/remove
> >>>>>
> >>>>> And it can be recreated with PCI devices other than NVME
> >>>>> controller, such as SCSI controller or VGA controller. Then
> >>>>> this is not a storage sub-system issue.
> >>>>>
> >>>>> I checked function call stacks of the two commands above. As
> >>>>> shown below, it looks like ABBA deadlock possibility is
> >>>>> detected and warned.
> >>>>
> >>>> Yeah, I was mistaken on this report, so my proposal to suppress
> >>>> the warning is definitely not right. If I run both 'echo'
> >>>> commands in parallel, I see it deadlock frequently. I'm not
> >>>> familiar enough with this code to any good ideas on how to fix,
> >>>> but I agree this is a generic pci issue.
> >>>
> >>> I think it is worth adding a testcase to blktests to make sure
> >>> these future releases will test this.
> >>
> >> Yeah, this WARN is confusing for us then it would be valuable to
> >> test by blktests not to repeat it. One point I wonder is: which test
> >> group the test case will it fall in? The nvme group could be the
> >> group to add, probably.
> >>
>
> since this issue been discovered with nvme rescan and revmoe,
> it should be added to the nvme category.

We already have nvme/032 which tests nvme rescan/reset/remove and the
issue was reported by running this one, do we still need one more?

>
> >> Another point I wonder is other kernel test suite than blktests.
> >> Don't we have more appropriate test suite to check PCI device
> >> rescan/remove race ? Such a test sounds more like a PCI bus
> >> sub-system test than block/storage test.
>
> I don't think so we could have caught it long time back,
> but we clearly did not.
>
> >
> > I'm not aware of such a test, but it would be nice to have one.
> >
> > Can you share your qemu config so I can reproduce this locally?
> >
> > Thanks for finding and reporting this!
> >
> > Bjorn
>
> -ck
>
>


-- 
Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-15 19:47                   ` Bjorn Helgaas
  2022-06-15 22:01                     ` Chaitanya Kulkarni
@ 2022-06-15 23:16                     ` Keith Busch
  2022-07-19  4:50                       ` Shinichiro Kawasaki
  1 sibling, 1 reply; 15+ messages in thread
From: Keith Busch @ 2022-06-15 23:16 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Shinichiro Kawasaki, Chaitanya Kulkarni, Yi Zhang, linux-block,
	linux-nvme, linux-scsi, mstowe, linux-pci

On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote:
> On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> > 
> > Yeah, this WARN is confusing for us then it would be valuable to
> > test by blktests not to repeat it. One point I wonder is: which test
> > group the test case will it fall in? The nvme group could be the
> > group to add, probably.
> > 
> > Another point I wonder is other kernel test suite than blktests.
> > Don't we have more appropriate test suite to check PCI device
> > rescan/remove race ? Such a test sounds more like a PCI bus
> > sub-system test than block/storage test.
> 
> I'm not aware of such a test, but it would be nice to have one.
> 
> Can you share your qemu config so I can reproduce this locally?
> 
> Thanks for finding and reporting this!

Hi Bjorn,

This ought to be reproducible with any pci device that can be removed. Since we
initially observed with nvme, you can try with such a device. A quick way to
get one appearing in qemu is to add parameters:

        -drive id=n,if=none,file=null-co://,format=raw \
	-device nvme,serial=foobar,drive=n

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-15 23:13                       ` Yi Zhang
@ 2022-06-16  4:42                         ` Shinichiro Kawasaki
  2022-06-16 17:55                           ` Chaitanya Kulkarni
  0 siblings, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-06-16  4:42 UTC (permalink / raw)
  To: Yi Zhang
  Cc: Chaitanya Kulkarni, linux-block, Keith Busch, linux-nvme,
	linux-scsi, mstowe, linux-pci, Bjorn Helgaas

On Jun 16, 2022 / 07:13, Yi Zhang wrote:
> On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni
> <chaitanyak@nvidia.com> wrote:
> >
> > On 6/15/22 12:47, Bjorn Helgaas wrote:
> > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> > >> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
> > >>> Shinichiro,

[snip]

> > >>> I think it is worth adding a testcase to blktests to make sure
> > >>> these future releases will test this.
> > >>
> > >> Yeah, this WARN is confusing for us then it would be valuable to
> > >> test by blktests not to repeat it. One point I wonder is: which test
> > >> group the test case will it fall in? The nvme group could be the
> > >> group to add, probably.
> > >>
> >
> > since this issue been discovered with nvme rescan and revmoe,
> > it should be added to the nvme category.
> 
> We already have nvme/032 which tests nvme rescan/reset/remove and the
> issue was reported by running this one, do we still need one more?

That is a point. Current nvme/032 checks nvme pci adapter rescan/reset/remove
during I/O to catch problems in nvme driver and block layer, but actually it
can catch the problem in pci sub-system also. I think Chaitanya's motivation
for the new test case is to distinguish those two.

If we have the new test case, its code will be similar and duplicated as
nvme/032 code. To avoid such duplication, it would be good to improve nvme/032
to have two steps. The 1st step checks that nvme pci adapter rescan/reset/remove
without I/O causes no kernel WARN (or any other unexpected kernel messages). Any
issue found in this step is reported as a pci sub-system issue. The 2nd step
checks nvme pci adapter rescan/reset/remove during I/O, as the current nvme/032
does. With this, we don't need the new test case, but still we can distinguish
the problems in nvme/block sub-system and pci sub-system.

> > >> Another point I wonder is other kernel test suite than blktests.
> > >> Don't we have more appropriate test suite to check PCI device
> > >> rescan/remove race ? Such a test sounds more like a PCI bus
> > >> sub-system test than block/storage test.
> >
> > I don't think so we could have caught it long time back,
> > but we clearly did not.

I see, then it looks that blktests is the test suite to test it.

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-16  4:42                         ` Shinichiro Kawasaki
@ 2022-06-16 17:55                           ` Chaitanya Kulkarni
  0 siblings, 0 replies; 15+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-16 17:55 UTC (permalink / raw)
  To: Shinichiro Kawasaki, Yi Zhang
  Cc: linux-block, Keith Busch, linux-nvme, linux-scsi, mstowe,
	linux-pci, Bjorn Helgaas

On 6/15/2022 9:42 PM, Shinichiro Kawasaki wrote:
> On Jun 16, 2022 / 07:13, Yi Zhang wrote:
>> On Thu, Jun 16, 2022 at 6:01 AM Chaitanya Kulkarni
>> <chaitanyak@nvidia.com> wrote:
>>>
>>> On 6/15/22 12:47, Bjorn Helgaas wrote:
>>>> On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
>>>>> On Jun 14, 2022 / 02:38, Chaitanya Kulkarni wrote:
>>>>>> Shinichiro,
> 
> [snip]
> 
>>>>>> I think it is worth adding a testcase to blktests to make sure
>>>>>> these future releases will test this.
>>>>>
>>>>> Yeah, this WARN is confusing for us then it would be valuable to
>>>>> test by blktests not to repeat it. One point I wonder is: which test
>>>>> group the test case will it fall in? The nvme group could be the
>>>>> group to add, probably.
>>>>>
>>>
>>> since this issue been discovered with nvme rescan and revmoe,
>>> it should be added to the nvme category.
>>
>> We already have nvme/032 which tests nvme rescan/reset/remove and the
>> issue was reported by running this one, do we still need one more?
> 
> That is a point. Current nvme/032 checks nvme pci adapter rescan/reset/remove
> during I/O to catch problems in nvme driver and block layer, but actually it
> can catch the problem in pci sub-system also. I think Chaitanya's motivation
> for the new test case is to distinguish those two.
> 

Yes, exactly.

> If we have the new test case, its code will be similar and duplicated as
> nvme/032 code. To avoid such duplication, it would be good to improve nvme/032
> to have two steps. The 1st step checks that nvme pci adapter rescan/reset/remove
> without I/O causes no kernel WARN (or any other unexpected kernel messages). Any
> issue found in this step is reported as a pci sub-system issue. The 2nd step
> checks nvme pci adapter rescan/reset/remove during I/O, as the current nvme/032
> does. With this, we don't need the new test case, but still we can distinguish
> the problems in nvme/block sub-system and pci sub-system.
> 

Totally agree with this.

>>>>> Another point I wonder is other kernel test suite than blktests.
>>>>> Don't we have more appropriate test suite to check PCI device
>>>>> rescan/remove race ? Such a test sounds more like a PCI bus
>>>>> sub-system test than block/storage test.
>>>
>>> I don't think so we could have caught it long time back,
>>> but we clearly did not.
> 
> I see, then it looks that blktests is the test suite to test it.
> 

-ck



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-06-15 23:16                     ` Keith Busch
@ 2022-07-19  4:50                       ` Shinichiro Kawasaki
  2022-07-19 22:31                         ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-07-19  4:50 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch

On Jun 15, 2022 / 17:16, Keith Busch wrote:
> On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote:
> > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> > > 
> > > Yeah, this WARN is confusing for us then it would be valuable to
> > > test by blktests not to repeat it. One point I wonder is: which test
> > > group the test case will it fall in? The nvme group could be the
> > > group to add, probably.
> > > 
> > > Another point I wonder is other kernel test suite than blktests.
> > > Don't we have more appropriate test suite to check PCI device
> > > rescan/remove race ? Such a test sounds more like a PCI bus
> > > sub-system test than block/storage test.
> > 
> > I'm not aware of such a test, but it would be nice to have one.
> > 
> > Can you share your qemu config so I can reproduce this locally?
> > 
> > Thanks for finding and reporting this!
> 
> Hi Bjorn,
> 
> This ought to be reproducible with any pci device that can be removed. Since we
> initially observed with nvme, you can try with such a device. A quick way to
> get one appearing in qemu is to add parameters:
> 
>         -drive id=n,if=none,file=null-co://,format=raw \
> 	-device nvme,serial=foobar,drive=n

Hello Bjorn,

Did you have chance to reproduce the WARN? Recently, it was reported again [1]
and getting attention.

[1] https://lore.kernel.org/linux-block/4ed3028b-2d2f-8755-fec2-0b9cb5ad42d2@fujitsu.com/

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-07-19  4:50                       ` Shinichiro Kawasaki
@ 2022-07-19 22:31                         ` Bjorn Helgaas
  2022-07-20  2:27                           ` Shinichiro Kawasaki
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2022-07-19 22:31 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch

On Tue, Jul 19, 2022 at 04:50:36AM +0000, Shinichiro Kawasaki wrote:
> On Jun 15, 2022 / 17:16, Keith Busch wrote:
> > On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote:
> > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> > > > 
> > > > Yeah, this WARN is confusing for us then it would be valuable to
> > > > test by blktests not to repeat it. One point I wonder is: which test
> > > > group the test case will it fall in? The nvme group could be the
> > > > group to add, probably.
> > > > 
> > > > Another point I wonder is other kernel test suite than blktests.
> > > > Don't we have more appropriate test suite to check PCI device
> > > > rescan/remove race ? Such a test sounds more like a PCI bus
> > > > sub-system test than block/storage test.
> > > 
> > > I'm not aware of such a test, but it would be nice to have one.
> > > 
> > > Can you share your qemu config so I can reproduce this locally?
> > > 
> > > Thanks for finding and reporting this!
> > 
> > This ought to be reproducible with any pci device that can be
> > removed. Since we initially observed with nvme, you can try with
> > such a device. A quick way to get one appearing in qemu is to add
> > parameters:
> > 
> >         -drive id=n,if=none,file=null-co://,format=raw \
> > 	-device nvme,serial=foobar,drive=n
> 
> Did you have chance to reproduce the WARN? Recently, it was reported
> again [1] and getting attention.

I have not paid any attention to this yet.  From what I can tell, the
problem was discovered by a test case (i.e., not reported by a
real-world user), it is not a recent regression, we haven't identified
a commit that introduced the problem, and we do not have a potential
fix for it.  Obviously it needs to be fixed and I'm not trying to
minimize the problem; I just want to calibrate it against everything
else.

> [1] https://lore.kernel.org/linux-block/4ed3028b-2d2f-8755-fec2-0b9cb5ad42d2@fujitsu.com/
> 
> -- 
> Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-07-19 22:31                         ` Bjorn Helgaas
@ 2022-07-20  2:27                           ` Shinichiro Kawasaki
  2022-12-19 11:27                             ` Shinichiro Kawasaki
  0 siblings, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-07-20  2:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch

On Jul 19, 2022 / 17:31, Bjorn Helgaas wrote:
> On Tue, Jul 19, 2022 at 04:50:36AM +0000, Shinichiro Kawasaki wrote:
> > On Jun 15, 2022 / 17:16, Keith Busch wrote:
> > > On Wed, Jun 15, 2022 at 02:47:27PM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Jun 14, 2022 at 04:00:45AM +0000, Shinichiro Kawasaki wrote:
> > > > > 
> > > > > Yeah, this WARN is confusing for us then it would be valuable to
> > > > > test by blktests not to repeat it. One point I wonder is: which test
> > > > > group the test case will it fall in? The nvme group could be the
> > > > > group to add, probably.
> > > > > 
> > > > > Another point I wonder is other kernel test suite than blktests.
> > > > > Don't we have more appropriate test suite to check PCI device
> > > > > rescan/remove race ? Such a test sounds more like a PCI bus
> > > > > sub-system test than block/storage test.
> > > > 
> > > > I'm not aware of such a test, but it would be nice to have one.
> > > > 
> > > > Can you share your qemu config so I can reproduce this locally?
> > > > 
> > > > Thanks for finding and reporting this!
> > > 
> > > This ought to be reproducible with any pci device that can be
> > > removed. Since we initially observed with nvme, you can try with
> > > such a device. A quick way to get one appearing in qemu is to add
> > > parameters:
> > > 
> > >         -drive id=n,if=none,file=null-co://,format=raw \
> > > 	-device nvme,serial=foobar,drive=n
> > 
> > Did you have chance to reproduce the WARN? Recently, it was reported
> > again [1] and getting attention.
> 
> I have not paid any attention to this yet.  From what I can tell, the
> problem was discovered by a test case (i.e., not reported by a
> real-world user), it is not a recent regression, we haven't identified
> a commit that introduced the problem, and we do not have a potential
> fix for it.  Obviously it needs to be fixed and I'm not trying to
> minimize the problem; I just want to calibrate it against everything
> else.

Bjorn, thank you for sharing your view, which sounds reasonable for me.

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-07-20  2:27                           ` Shinichiro Kawasaki
@ 2022-12-19 11:27                             ` Shinichiro Kawasaki
  2022-12-29 18:13                               ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Shinichiro Kawasaki @ 2022-12-19 11:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch

Bjorn, FYI. This lockdep warning disappeared with kernel version v6.1. I
bisected and found the commit 2d7f9f8c1815 ("kernfs: Improve kernfs_drain() and
always call on removal.") avoided the issue. This commit touches kernfs_drain()
and __kernfs_remove(), and modifies the condition to lock kernfs_rwsem. I think
it explains why the lockdep disappeared. No longer need to work on this issue :)

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: blktests failures with v5.19-rc1
  2022-12-19 11:27                             ` Shinichiro Kawasaki
@ 2022-12-29 18:13                               ` Bjorn Helgaas
  0 siblings, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2022-12-29 18:13 UTC (permalink / raw)
  To: Shinichiro Kawasaki
  Cc: Chaitanya Kulkarni, Yi Zhang, linux-block, linux-nvme,
	linux-scsi, mstowe, linux-pci, Xiao Yang, Keith Busch

On Mon, Dec 19, 2022 at 11:27:03AM +0000, Shinichiro Kawasaki wrote:
> Bjorn, FYI. This lockdep warning disappeared with kernel version
> v6.1. I bisected and found the commit 2d7f9f8c1815 ("kernfs: Improve
> kernfs_drain() and always call on removal.") avoided the issue. This
> commit touches kernfs_drain() and __kernfs_remove(), and modifies
> the condition to lock kernfs_rwsem. I think it explains why the
> lockdep disappeared. No longer need to work on this issue :)

Thanks for following up on this!

Bjorn

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-12-29 18:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20220609235329.4jbz4wr3eg2nmzqa@shindev>
     [not found] ` <717734c9-c633-fb48-499e-7e3e15113020@nvidia.com>
     [not found]   ` <19d09611-42cc-5a81-d676-c5375865c14c@nvidia.com>
     [not found]     ` <20220610122517.6pt5y63hcosk5mes@shindev>
     [not found]       ` <YqNZiMw+rH5gyZDI@kbusch-mbp.dhcp.thefacebook.com>
     [not found]         ` <CAHj4cs9G0WDrnSS6iVZJfgfOcRR0ysJhw+9yqcbqE=_8mkF0zw@mail.gmail.com>
2022-06-14  1:09           ` blktests failures with v5.19-rc1 Shinichiro Kawasaki
2022-06-14  2:23             ` Keith Busch
2022-06-14  2:38               ` Chaitanya Kulkarni
2022-06-14  4:00                 ` Shinichiro Kawasaki
2022-06-15 19:47                   ` Bjorn Helgaas
2022-06-15 22:01                     ` Chaitanya Kulkarni
2022-06-15 23:13                       ` Yi Zhang
2022-06-16  4:42                         ` Shinichiro Kawasaki
2022-06-16 17:55                           ` Chaitanya Kulkarni
2022-06-15 23:16                     ` Keith Busch
2022-07-19  4:50                       ` Shinichiro Kawasaki
2022-07-19 22:31                         ` Bjorn Helgaas
2022-07-20  2:27                           ` Shinichiro Kawasaki
2022-12-19 11:27                             ` Shinichiro Kawasaki
2022-12-29 18:13                               ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).