linux-arm-msm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [BUG] usb: dwc3: Kernel NULL pointer dereference in dwc3_remove()
       [not found]                 ` <87sg1q1129.fsf@kernel.org>
@ 2021-06-10 15:33                   ` Jack Pham
  2021-06-14  7:37                     ` Sandeep Maheswaram
  0 siblings, 1 reply; 3+ messages in thread
From: Jack Pham @ 2021-06-10 15:33 UTC (permalink / raw)
  To: Felipe Balbi, Sandeep Maheswaram
  Cc: Alexandru Elisei, Greg Kroah-Hartman, p.zabel, linux-usb,
	Linux Kernel Mailing List, arm-mail-list, linux-arm-msm

On Thu, Jun 10, 2021 at 01:11:42PM +0300, Felipe Balbi wrote:
> Jack Pham <jackp@codeaurora.org> writes:
> > On Fri, Jun 04, 2021 at 11:20:12AM +0300, Felipe Balbi wrote:
> >> Jack Pham <jackp@codeaurora.org> writes:
> >> >> >>>> Alexandru Elisei <alexandru.elisei@arm.com> writes:
> >> >> >>>>> I've been able to bisect the panic and the offending commit is 568262bf5492 ("usb:
> >> >> >>>>> dwc3: core: Add shutdown callback for dwc3"). I can provide more diagnostic
> >> >> >>>>> information if needed and I can help test the fix.
> >> >> >>>> if you simply revert that commit in HEAD, does the problem really go
> >> >> >>>> away?
> >> >> >>> Kernel built from commit 324c92e5e0ee, which is the kernel tip today, the panic is
> >> >> >>> there. Reverting the offending commit, 568262bf5492, makes the panic disappear.
> >> >> >> Want to send a revert so I can take it now?
> >> >> >
> >> >> > I can send a revert, but Felipe was asking Sandeep (the commit author) for a fix,
> >> >> > so I'll leave it up to Felipe to decide how to proceed.
> >> >> 
> >> >> I'm okay with a revert. Feel free to add my Acked-by: Felipe Balbi
> >> >> <balbi@kernel.org> or it.
> >> >> 
> >> >> Sandeep, please send a new version that doesn't encounter the same
> >> >> issue. Make sure to test by reloading the driver in a tight loop for
> >> >> several iterations.
> >> >
> >> > This would probably be tricky to test on other "glue" drivers as the
> >> > problem appears to be specific only to dwc3_of_simple.  It looks like
> >> > both dwc3_of_simple and the dwc3 core now (due to 568262bf5492) each
> >> > implement respective .shutdown callbacks. The latter is simply a wrapper
> >> > around dwc3_remove(). And from the panic call stack above we see that
> >> > dwc3_of_simple_shutdown() calls of_platform_depopulate() which will 
> >> > again call dwc3_remove() resulting in the double remove.
> >> >
> >> > So would an alternative approach be to protect against dwc3_remove()
> >> > getting called multiple times? IMO it'd be a bit messy to have to add
> >> 
> >> no, I  don't think so. That sounds like a workaround. We should be able
> >> to guarantee that ->remove() doesn't get called twice using the driver
> >> model properly.
> >
> > Completely fair.  So then having a .shutdown callback that directly calls
> > dwc3_remove() is probably not the right thing to do as it completely
> > bypasses the driver model so if and when the driver core does later
> > release the device from the driver that's how we end up with the double
> > remove.
> 
> yeah, I would agree with that.
> 
> >> > additional checks there to know if it had already been called. So maybe
> >> > avoid it altogether--should dwc3_of_simple_shutdown() just skip calling
> >> > of_platform_depopulate()?
> >> 
> >> I don't know what the idiomatic is nowadays, but at least early on, we
> >> had to call depopulate.
> >
> > So any suggestions on how to fix the original issue Sandeep was trying
> > to fix with 568262bf5492? Maybe implement .shutdown in dwc3_qcom and have
> > it follow what dwc3_of_simple does with of_platform_depopulate()? But
> > then wouldn't other "glues" want/need to follow suit?
> 
> I think we can implement shutdown in core, but we need to careful with
> it. Instead of just blindly calling remove, let's extract the common
> parts to another internal function that both remove and shutdown
> call. debugfs removal should not be part of that generic method :-)

Hi Sandeep,

Upon re-reading your description in 568262bf5492 it sounds like the
original intention of your patch is basically to quiesce the HW so that
it doesn't continue to run after SMMU/IOMMU is disabled right?

If that is the case, couldn't we simply call only dwc3_core_exit_mode()
assuming there is no other requirement to do any other cleanup/teardown
(PHYs, clocks, resets, runtime PM, et al)? This function should do the
bare minimum of stopping the controller in whatever mode (host or
peripheral) it is currently operating in.

> Anything in that generic method should, probably, be idempotent.

Yes we'll need to ensure that dwc3_core_exit_mode() can be called
multiple times without additional side effects. At first glance this
probably means setting dwc->xhci and dwc->gadget to NULL from
dwc3_host_exit() and dwc3_gadget_exit(), respectively.

Thanks,
Jack
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] usb: dwc3: Kernel NULL pointer dereference in dwc3_remove()
  2021-06-10 15:33                   ` [BUG] usb: dwc3: Kernel NULL pointer dereference in dwc3_remove() Jack Pham
@ 2021-06-14  7:37                     ` Sandeep Maheswaram
  2021-06-16  0:13                       ` Jack Pham
  0 siblings, 1 reply; 3+ messages in thread
From: Sandeep Maheswaram @ 2021-06-14  7:37 UTC (permalink / raw)
  To: Jack Pham, Felipe Balbi
  Cc: Alexandru Elisei, Greg Kroah-Hartman, p.zabel, linux-usb,
	Linux Kernel Mailing List, arm-mail-list, linux-arm-msm


On 6/10/2021 9:03 PM, Jack Pham wrote:
> On Thu, Jun 10, 2021 at 01:11:42PM +0300, Felipe Balbi wrote:
>> Jack Pham <jackp@codeaurora.org> writes:
>>> On Fri, Jun 04, 2021 at 11:20:12AM +0300, Felipe Balbi wrote:
>>>> Jack Pham <jackp@codeaurora.org> writes:
>>>>>>>>>> Alexandru Elisei <alexandru.elisei@arm.com> writes:
>>>>>>>>>>> I've been able to bisect the panic and the offending commit is 568262bf5492 ("usb:
>>>>>>>>>>> dwc3: core: Add shutdown callback for dwc3"). I can provide more diagnostic
>>>>>>>>>>> information if needed and I can help test the fix.
>>>>>>>>>> if you simply revert that commit in HEAD, does the problem really go
>>>>>>>>>> away?
>>>>>>>>> Kernel built from commit 324c92e5e0ee, which is the kernel tip today, the panic is
>>>>>>>>> there. Reverting the offending commit, 568262bf5492, makes the panic disappear.
>>>>>>>> Want to send a revert so I can take it now?
>>>>>>> I can send a revert, but Felipe was asking Sandeep (the commit author) for a fix,
>>>>>>> so I'll leave it up to Felipe to decide how to proceed.
>>>>>> I'm okay with a revert. Feel free to add my Acked-by: Felipe Balbi
>>>>>> <balbi@kernel.org> or it.
>>>>>>
>>>>>> Sandeep, please send a new version that doesn't encounter the same
>>>>>> issue. Make sure to test by reloading the driver in a tight loop for
>>>>>> several iterations.
>>>>> This would probably be tricky to test on other "glue" drivers as the
>>>>> problem appears to be specific only to dwc3_of_simple.  It looks like
>>>>> both dwc3_of_simple and the dwc3 core now (due to 568262bf5492) each
>>>>> implement respective .shutdown callbacks. The latter is simply a wrapper
>>>>> around dwc3_remove(). And from the panic call stack above we see that
>>>>> dwc3_of_simple_shutdown() calls of_platform_depopulate() which will
>>>>> again call dwc3_remove() resulting in the double remove.
>>>>>
>>>>> So would an alternative approach be to protect against dwc3_remove()
>>>>> getting called multiple times? IMO it'd be a bit messy to have to add
>>>> no, I  don't think so. That sounds like a workaround. We should be able
>>>> to guarantee that ->remove() doesn't get called twice using the driver
>>>> model properly.
>>> Completely fair.  So then having a .shutdown callback that directly calls
>>> dwc3_remove() is probably not the right thing to do as it completely
>>> bypasses the driver model so if and when the driver core does later
>>> release the device from the driver that's how we end up with the double
>>> remove.
>> yeah, I would agree with that.
>>
>>>>> additional checks there to know if it had already been called. So maybe
>>>>> avoid it altogether--should dwc3_of_simple_shutdown() just skip calling
>>>>> of_platform_depopulate()?
>>>> I don't know what the idiomatic is nowadays, but at least early on, we
>>>> had to call depopulate.
>>> So any suggestions on how to fix the original issue Sandeep was trying
>>> to fix with 568262bf5492? Maybe implement .shutdown in dwc3_qcom and have
>>> it follow what dwc3_of_simple does with of_platform_depopulate()? But
>>> then wouldn't other "glues" want/need to follow suit?
>> I think we can implement shutdown in core, but we need to careful with
>> it. Instead of just blindly calling remove, let's extract the common
>> parts to another internal function that both remove and shutdown
>> call. debugfs removal should not be part of that generic method :-)
> Hi Sandeep,
>
> Upon re-reading your description in 568262bf5492 it sounds like the
> original intention of your patch is basically to quiesce the HW so that
> it doesn't continue to run after SMMU/IOMMU is disabled right?
>
> If that is the case, couldn't we simply call only dwc3_core_exit_mode()
> assuming there is no other requirement to do any other cleanup/teardown
> (PHYs, clocks, resets, runtime PM, et al)? This function should do the
> bare minimum of stopping the controller in whatever mode (host or
> peripheral) it is currently operating in.

Yes that was the intention. I will call only dwc3_core_exit_mode()

and check. Is there any way we can do from dwc3 qcom glue driver to

avoid problems for other glue drivers?


>> Anything in that generic method should, probably, be idempotent.
> Yes we'll need to ensure that dwc3_core_exit_mode() can be called
> multiple times without additional side effects. At first glance this
> probably means setting dwc->xhci and dwc->gadget to NULL from
> dwc3_host_exit() and dwc3_gadget_exit(), respectively.

Ok. Is there any way to test this ?


>
> Thanks,
> Jack

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [BUG] usb: dwc3: Kernel NULL pointer dereference in dwc3_remove()
  2021-06-14  7:37                     ` Sandeep Maheswaram
@ 2021-06-16  0:13                       ` Jack Pham
  0 siblings, 0 replies; 3+ messages in thread
From: Jack Pham @ 2021-06-16  0:13 UTC (permalink / raw)
  To: Sandeep Maheswaram
  Cc: Felipe Balbi, Alexandru Elisei, Greg Kroah-Hartman, p.zabel,
	linux-usb, Linux Kernel Mailing List, arm-mail-list,
	linux-arm-msm

Hi Sandeep,

On Mon, Jun 14, 2021 at 01:07:11PM +0530, Sandeep Maheswaram wrote:
> 
> On 6/10/2021 9:03 PM, Jack Pham wrote:
> > On Thu, Jun 10, 2021 at 01:11:42PM +0300, Felipe Balbi wrote:
> > > Jack Pham <jackp@codeaurora.org> writes:
> > > > On Fri, Jun 04, 2021 at 11:20:12AM +0300, Felipe Balbi wrote:
> > > > > Jack Pham <jackp@codeaurora.org> writes:
> > > > > > > > > > > Alexandru Elisei <alexandru.elisei@arm.com> writes:
> > > > > > > > > > > > I've been able to bisect the panic and the offending commit is 568262bf5492 ("usb:
> > > > > > > > > > > > dwc3: core: Add shutdown callback for dwc3"). I can provide more diagnostic
> > > > > > > > > > > > information if needed and I can help test the fix.
> > > > > > > > > > > if you simply revert that commit in HEAD, does the problem really go
> > > > > > > > > > > away?
> > > > > > > > > > Kernel built from commit 324c92e5e0ee, which is the kernel tip today, the panic is
> > > > > > > > > > there. Reverting the offending commit, 568262bf5492, makes the panic disappear.
> > > > > > > > > Want to send a revert so I can take it now?
> > > > > > > > I can send a revert, but Felipe was asking Sandeep (the commit author) for a fix,
> > > > > > > > so I'll leave it up to Felipe to decide how to proceed.
> > > > > > > I'm okay with a revert. Feel free to add my Acked-by: Felipe Balbi
> > > > > > > <balbi@kernel.org> or it.
> > > > > > > 
> > > > > > > Sandeep, please send a new version that doesn't encounter the same
> > > > > > > issue. Make sure to test by reloading the driver in a tight loop for
> > > > > > > several iterations.
> > > > > > This would probably be tricky to test on other "glue" drivers as the
> > > > > > problem appears to be specific only to dwc3_of_simple.  It looks like
> > > > > > both dwc3_of_simple and the dwc3 core now (due to 568262bf5492) each
> > > > > > implement respective .shutdown callbacks. The latter is simply a wrapper
> > > > > > around dwc3_remove(). And from the panic call stack above we see that
> > > > > > dwc3_of_simple_shutdown() calls of_platform_depopulate() which will
> > > > > > again call dwc3_remove() resulting in the double remove.
> > > > > > 
> > > > > > So would an alternative approach be to protect against dwc3_remove()
> > > > > > getting called multiple times? IMO it'd be a bit messy to have to add
> > > > > no, I  don't think so. That sounds like a workaround. We should be able
> > > > > to guarantee that ->remove() doesn't get called twice using the driver
> > > > > model properly.
> > > > Completely fair.  So then having a .shutdown callback that directly calls
> > > > dwc3_remove() is probably not the right thing to do as it completely
> > > > bypasses the driver model so if and when the driver core does later
> > > > release the device from the driver that's how we end up with the double
> > > > remove.
> > > yeah, I would agree with that.
> > > 
> > > > > > additional checks there to know if it had already been called. So maybe
> > > > > > avoid it altogether--should dwc3_of_simple_shutdown() just skip calling
> > > > > > of_platform_depopulate()?
> > > > > I don't know what the idiomatic is nowadays, but at least early on, we
> > > > > had to call depopulate.
> > > > So any suggestions on how to fix the original issue Sandeep was trying
> > > > to fix with 568262bf5492? Maybe implement .shutdown in dwc3_qcom and have
> > > > it follow what dwc3_of_simple does with of_platform_depopulate()? But
> > > > then wouldn't other "glues" want/need to follow suit?
> > > I think we can implement shutdown in core, but we need to careful with
> > > it. Instead of just blindly calling remove, let's extract the common
> > > parts to another internal function that both remove and shutdown
> > > call. debugfs removal should not be part of that generic method :-)
> > Hi Sandeep,
> > 
> > Upon re-reading your description in 568262bf5492 it sounds like the
> > original intention of your patch is basically to quiesce the HW so that
> > it doesn't continue to run after SMMU/IOMMU is disabled right?
> > 
> > If that is the case, couldn't we simply call only dwc3_core_exit_mode()
> > assuming there is no other requirement to do any other cleanup/teardown
> > (PHYs, clocks, resets, runtime PM, et al)? This function should do the
> > bare minimum of stopping the controller in whatever mode (host or
> > peripheral) it is currently operating in.
> 
> Yes that was the intention. I will call only dwc3_core_exit_mode()
> and check. Is there any way we can do from dwc3 qcom glue driver to
> avoid problems for other glue drivers?

As I mentioned above maybe you could just implement a dwc3_qcom specific
.shutdown callback which mimics what dwc3_of_simple() does by calling
of_platform_depopulate(). This will allow the kernel driver core to
invoke dwc3_remove() rather than calling it directly yourself.

The downside is that if other glue drivers want to follow this they'd
have to duplicate the same logic. But maybe this is a more cautious
approach until we start seeing other drivers needing this generically
within core.c.

> > > Anything in that generic method should, probably, be idempotent.
> > Yes we'll need to ensure that dwc3_core_exit_mode() can be called
> > multiple times without additional side effects. At first glance this
> > probably means setting dwc->xhci and dwc->gadget to NULL from
> > dwc3_host_exit() and dwc3_gadget_exit(), respectively.
> 
> Ok. Is there any way to test this ?

You could implement both the dwc3_qcom_shutdown() as above as well as
adding back dwc3_shutdown() which only does dwc3_core_exit_mode(). Make
sure that even though dwc3_core_exit_mode() gets called twice nothing
bad happens.

Jack
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-06-16  0:14 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <c3c75895-313a-5be7-6421-b32bac741a88@arm.com>
     [not found] ` <87r1hjcvf6.fsf@kernel.org>
     [not found]   ` <70be179c-d36b-de6f-6efc-2888055b1312@arm.com>
     [not found]     ` <YLi/u9J5f+nQO4Cm@kroah.com>
     [not found]       ` <8272121c-ac8a-1565-a047-e3a16dcf13b0@arm.com>
     [not found]         ` <877djbc8xq.fsf@kernel.org>
     [not found]           ` <20210603173632.GA25299@jackp-linux.qualcomm.com>
     [not found]             ` <87mts6avnn.fsf@kernel.org>
     [not found]               ` <20210607180023.GA23045@jackp-linux.qualcomm.com>
     [not found]                 ` <87sg1q1129.fsf@kernel.org>
2021-06-10 15:33                   ` [BUG] usb: dwc3: Kernel NULL pointer dereference in dwc3_remove() Jack Pham
2021-06-14  7:37                     ` Sandeep Maheswaram
2021-06-16  0:13                       ` Jack Pham

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).