All of lore.kernel.org
 help / color / mirror / Atom feed
* chipidea: udc: kernel panic in isr_setup_status_phase
@ 2016-08-23  0:36 Clemens Gruber
  2016-08-24  8:11 ` Peter Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Clemens Gruber @ 2016-08-23  0:36 UTC (permalink / raw)
  To: linux-usb; +Cc: Peter Chen, Greg Kroah-Hartman, linux-kernel

Hi,

I am using an i.MX6Q embedded board, acting as a (ethernet) gadget with
RNDIS function, connected over an USB OTG cable to a PC.
Most of the time it works fine, but in some mysterious circumstances,
a kernel panic occurs, just after attaching the OTG cable, connecting it
to the other machine:

[   54.012989] Unable to handle kernel NULL pointer dereference at virtual address 00000020
[   54.021099] pgd = 80004000
[   54.023816] [00000020] *pgd=00000000
[   54.027422] Internal error: Oops: 817 [#1] PREEMPT SMP ARM
[   54.032915] Modules linked in:
[   54.035998] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc3-00017-g336bc4a #315
[   54.043662] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[   54.050196] task: 80b05f80 task.stack: 80b00000
[   54.054744] PC is at isr_setup_status_phase+0x1c/0x40
[   54.059805] LR is at 0xbe570890
[   54.062957] pc : [<804ac464>]    lr : [<be570890>]    psr: 200e0193
[   54.062957] sp : 80b01e10  ip : be570570  fp : be570890
[   54.074442] r10: be5eeebc  r9 : be570010  r8 : be5eeebc
[   54.079673] r7 : be5708d0  r6 : be5eee80  r5 : be7fcf40  r4 : 00000001
[   54.086206] r3 : be571010  r2 : 804ab368  r1 : 00000000  r0 : be570010
[   54.092742] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
[   54.099972] Control: 10c5387d  Table: 4e34404a  DAC: 00000051
[   54.105723] Process swapper/0 (pid: 0, stack limit = 0x80b00210)
(snip)
[   54.247100] [<804ac464>] (isr_setup_status_phase) from [<804acbbc>] (isr_tr_complete_handler+0x734/0x98c)
[   54.256680] [<804acbbc>] (isr_tr_complete_handler) from [<804acfc0>] (udc_irq+0x1ac/0x318)
[   54.264964] [<804acfc0>] (udc_irq) from [<8018ba28>] (__handle_irq_event_percpu+0x9c/0x128)
[   54.273330] [<8018ba28>] (__handle_irq_event_percpu) from [<8018bae0>] (handle_irq_event_percpu+0x2c/0x7c)
[   54.282995] [<8018bae0>] (handle_irq_event_percpu) from [<8018bb68>] (handle_irq_event+0x38/0x5c)
[   54.291880] [<8018bb68>] (handle_irq_event) from [<8018f2cc>] (handle_fasteoi_irq+0xd0/0x1bc)
[   54.300418] [<8018f2cc>] (handle_fasteoi_irq) from [<8018afb0>] (generic_handle_irq+0x24/0x34)
[   54.309042] [<8018afb0>] (generic_handle_irq) from [<8018b2dc>] (__handle_domain_irq+0x7c/0xec)
[   54.317754] [<8018b2dc>] (__handle_domain_irq) from [<80101524>] (gic_handle_irq+0x38/0x74)
[   54.326119] [<80101524>] (gic_handle_irq) from [<8010ccb0>] (__irq_svc+0x70/0xb0)
(snip)

After looking through the isr_setup_status_phase disassembly, I found
that ci->status must have been NULL and dereferencing it in
ci->status->context = ci; triggered the panic.

The interrupt was a USBINT (UI bit was set) and isr_tr_complete_handler
was called from udc_irq.
In the IMX6DQRM I read about the UI bit: "This bit is also set by the
Host/Device Controller when a short packet is detected." and about
USBERRINT / UEI bit: "This bit is set along with the USBINT bit, if the
TD on which the error interrupt occurred also had its interrupt on
complete (IOC) bit set." (page 5494)

However, we do not check for UEI in udc_irq.
Could this be the cause of this error?
Should we only call isr_tr_complete_handler if UI && !UEI ?

Or would adding a check for ci->status == NULL in isr_setup-status_phase
and returning an error code also be a good idea?

Do you have an idea what's going on there and why ci->status is NULL?

Regards,
Clemens

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-23  0:36 chipidea: udc: kernel panic in isr_setup_status_phase Clemens Gruber
@ 2016-08-24  8:11 ` Peter Chen
  2016-08-25 23:47   ` Clemens Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Chen @ 2016-08-24  8:11 UTC (permalink / raw)
  To: Clemens Gruber; +Cc: linux-usb, Peter Chen, Greg Kroah-Hartman, linux-kernel

On Tue, Aug 23, 2016 at 02:36:30AM +0200, Clemens Gruber wrote:
> Hi,
> 
> I am using an i.MX6Q embedded board, acting as a (ethernet) gadget with
> RNDIS function, connected over an USB OTG cable to a PC.
> Most of the time it works fine, but in some mysterious circumstances,
> a kernel panic occurs, just after attaching the OTG cable, connecting it
> to the other machine:
> 
> [   54.012989] Unable to handle kernel NULL pointer dereference at virtual address 00000020
> [   54.021099] pgd = 80004000
> [   54.023816] [00000020] *pgd=00000000
> [   54.027422] Internal error: Oops: 817 [#1] PREEMPT SMP ARM
> [   54.032915] Modules linked in:
> [   54.035998] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc3-00017-g336bc4a #315
> [   54.043662] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
> [   54.050196] task: 80b05f80 task.stack: 80b00000
> [   54.054744] PC is at isr_setup_status_phase+0x1c/0x40
> [   54.059805] LR is at 0xbe570890
> [   54.062957] pc : [<804ac464>]    lr : [<be570890>]    psr: 200e0193
> [   54.062957] sp : 80b01e10  ip : be570570  fp : be570890
> [   54.074442] r10: be5eeebc  r9 : be570010  r8 : be5eeebc
> [   54.079673] r7 : be5708d0  r6 : be5eee80  r5 : be7fcf40  r4 : 00000001
> [   54.086206] r3 : be571010  r2 : 804ab368  r1 : 00000000  r0 : be570010
> [   54.092742] Flags: nzCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment none
> [   54.099972] Control: 10c5387d  Table: 4e34404a  DAC: 00000051
> [   54.105723] Process swapper/0 (pid: 0, stack limit = 0x80b00210)
> (snip)
> [   54.247100] [<804ac464>] (isr_setup_status_phase) from [<804acbbc>] (isr_tr_complete_handler+0x734/0x98c)
> [   54.256680] [<804acbbc>] (isr_tr_complete_handler) from [<804acfc0>] (udc_irq+0x1ac/0x318)
> [   54.264964] [<804acfc0>] (udc_irq) from [<8018ba28>] (__handle_irq_event_percpu+0x9c/0x128)
> [   54.273330] [<8018ba28>] (__handle_irq_event_percpu) from [<8018bae0>] (handle_irq_event_percpu+0x2c/0x7c)
> [   54.282995] [<8018bae0>] (handle_irq_event_percpu) from [<8018bb68>] (handle_irq_event+0x38/0x5c)
> [   54.291880] [<8018bb68>] (handle_irq_event) from [<8018f2cc>] (handle_fasteoi_irq+0xd0/0x1bc)
> [   54.300418] [<8018f2cc>] (handle_fasteoi_irq) from [<8018afb0>] (generic_handle_irq+0x24/0x34)
> [   54.309042] [<8018afb0>] (generic_handle_irq) from [<8018b2dc>] (__handle_domain_irq+0x7c/0xec)
> [   54.317754] [<8018b2dc>] (__handle_domain_irq) from [<80101524>] (gic_handle_irq+0x38/0x74)
> [   54.326119] [<80101524>] (gic_handle_irq) from [<8010ccb0>] (__irq_svc+0x70/0xb0)
> (snip)
> 
> After looking through the isr_setup_status_phase disassembly, I found
> that ci->status must have been NULL and dereferencing it in
> ci->status->context = ci; triggered the panic.
> 
> The interrupt was a USBINT (UI bit was set) and isr_tr_complete_handler
> was called from udc_irq.
> In the IMX6DQRM I read about the UI bit: "This bit is also set by the
> Host/Device Controller when a short packet is detected." and about
> USBERRINT / UEI bit: "This bit is set along with the USBINT bit, if the
> TD on which the error interrupt occurred also had its interrupt on
> complete (IOC) bit set." (page 5494)
> 
> However, we do not check for UEI in udc_irq.
> Could this be the cause of this error?

UEI is an error interrupt, and software have not handled it, so it will
not affect ci->status.

> Should we only call isr_tr_complete_handler if UI && !UEI ?
> 
> Or would adding a check for ci->status == NULL in isr_setup-status_phase
> and returning an error code also be a good idea?

I agree with that.

> 
> Do you have an idea what's going on there and why ci->status is NULL?
> 

I can't understand it, the only possible is the last disconnect event
(see ci_udc_vbus_session->_gadget_stop_activity) has scheduled very late
due to vbus lowers very slow.

-- 

Best Regards,
Peter Chen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-24  8:11 ` Peter Chen
@ 2016-08-25 23:47   ` Clemens Gruber
  2016-08-26 17:21     ` Peter Chen
  2016-08-26 17:22     ` Peter Chen
  0 siblings, 2 replies; 12+ messages in thread
From: Clemens Gruber @ 2016-08-25 23:47 UTC (permalink / raw)
  To: Peter Chen; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel, Clemens Gruber

On Wed, Aug 24, 2016 at 04:11:02PM +0800, Peter Chen wrote:
> UEI is an error interrupt, and software have not handled it, so it will
> not affect ci->status.
> 
> > Should we only call isr_tr_complete_handler if UI && !UEI ?
> > 
> > Or would adding a check for ci->status == NULL in isr_setup-status_phase
> > and returning an error code also be a good idea?
> 
> I agree with that.

OK I now return -EINVAL if (ci->status == NULL). This does fix the
kernel panic, but the usb0 interface stays down and does not work.
Should I send a patch to avoid the NULL pointer dereference now or after
we found the cause of ci->status being NULL in the first place?

> 
> > 
> > Do you have an idea what's going on there and why ci->status is NULL?
> > 
> 
> I can't understand it, the only possible is the last disconnect event
> (see ci_udc_vbus_session->_gadget_stop_activity) has scheduled very late
> due to vbus lowers very slow.

I now have more information about the two different behaviors.
I added some printk statements..

A) When it does not work:
ci_udc_vbus_session: is_active, gadget_ready=1
ci_udc_pullup: is_on=1
udc_irq: USBi_UI
isr_tr_complete_handler: when calling isr_setup_status_phase at i=8
 isr_setup_status_phase: ci: status is NULL, vbus_active=1, ep0_dir=TX
udc_irq: USBi_UI
isr_setup_packet_handler: USB_REQ_SET_ADDRESS, type=0, ci->status=NULL
 isr_setup_status_phase: ci: status is NULL, vbus_active=1, ep0_dir=RX
(This then repeats a few times, beginning from udc_irq)

B) When it works:
ci_udc_vbus_session: is_active=1 gadget_ready=1
ci_udc_pullup: is_on=1
udc_irq: USBi_SLI
_gadget_stop_activity
udc_irq: USBi_URI
udc_irq: USBi_PCI
udc_irq: USBi_UI
udc_irq: USBi_UI
_gadget_stop_activity
usb_ep_free_request
udc_irq: USBi_UI | USBi_URI
udc_irq: USBi_PCI
isr_setup_packet_handler: USB_REQ_SET_ADDRESS, ci->status is not NULL
udc_irq: USBi_UI
(The above repeats a few times from _gadget_stop_activity to USBi_UI)
(Then USBi_UI occurs many times)
configsfs-gadget gadget: high-speed config #1 ..
(More USBi_UI interrupts)
IPv6: ADDRCONF (NETDEV_CHANGE): usb0: link becomes ready

--

So, both cases are very different and avoiding that NULL pointer
dereference did only fix the kernel panic but not the problem with the
USB gadget not initializing correctly after plugging in.

In A) The USBi_UI interrupts shouldn't arrive that early, I suppose. If
they are the reason why the problem occured, the question is, what
triggered them?

Does the printk output give you more insight into the problem?

--

You mentioned the possibility that vbus lowers too slow, but vbus is
supplied externally by the host and the problem not only occurs when
the cable is plugged out and in again. Also at boot up when there were
no previous disconnect events.
Or did you mean something else with "vbus lowers too slow"?

Do you have any suggestions how to approach this problem further?
Other spots where adding a printk would be helpful to find out what's
causing this?

Regards,
Clemens

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-25 23:47   ` Clemens Gruber
@ 2016-08-26 17:21     ` Peter Chen
  2016-08-28 18:15       ` Clemens Gruber
  2016-08-26 17:22     ` Peter Chen
  1 sibling, 1 reply; 12+ messages in thread
From: Peter Chen @ 2016-08-26 17:21 UTC (permalink / raw)
  To: Clemens Gruber; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Fri, Aug 26, 2016 at 01:47:40AM +0200, Clemens Gruber wrote:
> On Wed, Aug 24, 2016 at 04:11:02PM +0800, Peter Chen wrote:
> > UEI is an error interrupt, and software have not handled it, so it will
> > not affect ci->status.
> > 
> > > Should we only call isr_tr_complete_handler if UI && !UEI ?
> > > 
> > > Or would adding a check for ci->status == NULL in isr_setup-status_phase
> > > and returning an error code also be a good idea?
> > 
> > I agree with that.
> 
> OK I now return -EINVAL if (ci->status == NULL). This does fix the
> kernel panic, but the usb0 interface stays down and does not work.
> Should I send a patch to avoid the NULL pointer dereference now or after
> we found the cause of ci->status being NULL in the first place?

You can send the patch after we find the root cause.

> 
> > 
> > > 
> > > Do you have an idea what's going on there and why ci->status is NULL?
> > > 
> > 
> > I can't understand it, the only possible is the last disconnect event
> > (see ci_udc_vbus_session->_gadget_stop_activity) has scheduled very late
> > due to vbus lowers very slow.
> 
> I now have more information about the two different behaviors.
> I added some printk statements..
> 
> A) When it does not work:
> ci_udc_vbus_session: is_active, gadget_ready=1
> ci_udc_pullup: is_on=1
> udc_irq: USBi_UI

The gadget triggers UI interrupt due to host sends packet.

I really can't understand that, why host does not send bus reset
before sending packet (eg, GET_DESCRIPTOR)? It violates USB spec.

Are you sure the first interrupt is UI when the vbus from off to on?

Peter

> isr_tr_complete_handler: when calling isr_setup_status_phase at i=8
>  isr_setup_status_phase: ci: status is NULL, vbus_active=1, ep0_dir=TX
> udc_irq: USBi_UI
> isr_setup_packet_handler: USB_REQ_SET_ADDRESS, type=0, ci->status=NULL
>  isr_setup_status_phase: ci: status is NULL, vbus_active=1, ep0_dir=RX
> (This then repeats a few times, beginning from udc_irq)
> 
> B) When it works:
> ci_udc_vbus_session: is_active=1 gadget_ready=1
> ci_udc_pullup: is_on=1
> udc_irq: USBi_SLI
> _gadget_stop_activity
> udc_irq: USBi_URI
> udc_irq: USBi_PCI
> udc_irq: USBi_UI
> udc_irq: USBi_UI
> _gadget_stop_activity
> usb_ep_free_request
> udc_irq: USBi_UI | USBi_URI
> udc_irq: USBi_PCI
> isr_setup_packet_handler: USB_REQ_SET_ADDRESS, ci->status is not NULL
> udc_irq: USBi_UI
> (The above repeats a few times from _gadget_stop_activity to USBi_UI)
> (Then USBi_UI occurs many times)
> configsfs-gadget gadget: high-speed config #1 ..
> (More USBi_UI interrupts)
> IPv6: ADDRCONF (NETDEV_CHANGE): usb0: link becomes ready
> 
> --
> 
> So, both cases are very different and avoiding that NULL pointer
> dereference did only fix the kernel panic but not the problem with the
> USB gadget not initializing correctly after plugging in.
> 
> In A) The USBi_UI interrupts shouldn't arrive that early, I suppose. If
> they are the reason why the problem occured, the question is, what
> triggered them?
> 
> Does the printk output give you more insight into the problem?
> 
> --
> 
> You mentioned the possibility that vbus lowers too slow, but vbus is
> supplied externally by the host and the problem not only occurs when
> the cable is plugged out and in again. Also at boot up when there were
> no previous disconnect events.
> Or did you mean something else with "vbus lowers too slow"?
> 
> Do you have any suggestions how to approach this problem further?
> Other spots where adding a printk would be helpful to find out what's
> causing this?
> 
> Regards,
> Clemens

-- 

Best Regards,
Peter Chen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-25 23:47   ` Clemens Gruber
  2016-08-26 17:21     ` Peter Chen
@ 2016-08-26 17:22     ` Peter Chen
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Chen @ 2016-08-26 17:22 UTC (permalink / raw)
  To: Clemens Gruber; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Fri, Aug 26, 2016 at 01:47:40AM +0200, Clemens Gruber wrote:
> On Wed, Aug 24, 2016 at 04:11:02PM +0800, Peter Chen wrote:
> > UEI is an error interrupt, and software have not handled it, so it will
> > not affect ci->status.
> > 
> > > Should we only call isr_tr_complete_handler if UI && !UEI ?
> > > 
> > > Or would adding a check for ci->status == NULL in isr_setup-status_phase
> > > and returning an error code also be a good idea?
> > 
> > I agree with that.
> 
> OK I now return -EINVAL if (ci->status == NULL). This does fix the
> kernel panic, but the usb0 interface stays down and does not work.
> Should I send a patch to avoid the NULL pointer dereference now or after
> we found the cause of ci->status being NULL in the first place?
> 
> > 
> > > 
> > > Do you have an idea what's going on there and why ci->status is NULL?
> > > 
> > 
> > I can't understand it, the only possible is the last disconnect event
> > (see ci_udc_vbus_session->_gadget_stop_activity) has scheduled very late
> > due to vbus lowers very slow.
> 
> I now have more information about the two different behaviors.
> I added some printk statements..
> 
> A) When it does not work:
> ci_udc_vbus_session: is_active, gadget_ready=1
> ci_udc_pullup: is_on=1
> udc_irq: USBi_UI
> isr_tr_complete_handler: when calling isr_setup_status_phase at i=8
>  isr_setup_status_phase: ci: status is NULL, vbus_active=1, ep0_dir=TX
> udc_irq: USBi_UI
> isr_setup_packet_handler: USB_REQ_SET_ADDRESS, type=0, ci->status=NULL
>  isr_setup_status_phase: ci: status is NULL, vbus_active=1, ep0_dir=RX
> (This then repeats a few times, beginning from udc_irq)
> 
> B) When it works:
> ci_udc_vbus_session: is_active=1 gadget_ready=1
> ci_udc_pullup: is_on=1
> udc_irq: USBi_SLI
> _gadget_stop_activity
> udc_irq: USBi_URI
> udc_irq: USBi_PCI
> udc_irq: USBi_UI
> udc_irq: USBi_UI
> _gadget_stop_activity
> usb_ep_free_request
> udc_irq: USBi_UI | USBi_URI
> udc_irq: USBi_PCI
> isr_setup_packet_handler: USB_REQ_SET_ADDRESS, ci->status is not NULL
> udc_irq: USBi_UI
> (The above repeats a few times from _gadget_stop_activity to USBi_UI)
> (Then USBi_UI occurs many times)
> configsfs-gadget gadget: high-speed config #1 ..
> (More USBi_UI interrupts)
> IPv6: ADDRCONF (NETDEV_CHANGE): usb0: link becomes ready
> 
> --
> 
> So, both cases are very different and avoiding that NULL pointer
> dereference did only fix the kernel panic but not the problem with the
> USB gadget not initializing correctly after plugging in.
> 
> In A) The USBi_UI interrupts shouldn't arrive that early, I suppose. If
> they are the reason why the problem occured, the question is, what
> triggered them?
> 
> Does the printk output give you more insight into the problem?
> 
> --
> 
> You mentioned the possibility that vbus lowers too slow, but vbus is
> supplied externally by the host and the problem not only occurs when
> the cable is plugged out and in again. Also at boot up when there were
> no previous disconnect events.
> Or did you mean something else with "vbus lowers too slow"?
> 
> Do you have any suggestions how to approach this problem further?
> Other spots where adding a printk would be helpful to find out what's
> causing this?
> 

Would you show me how to reproduce it, I would like to try it at my
side.

-- 

Best Regards,
Peter Chen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-26 17:21     ` Peter Chen
@ 2016-08-28 18:15       ` Clemens Gruber
  2016-08-29 10:24         ` Peter Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Clemens Gruber @ 2016-08-28 18:15 UTC (permalink / raw)
  To: Peter Chen; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Sat, Aug 27, 2016 at 01:21:52AM +0800, Peter Chen wrote:
> The gadget triggers UI interrupt due to host sends packet.
> 
> I really can't understand that, why host does not send bus reset
> before sending packet (eg, GET_DESCRIPTOR)? It violates USB spec.
> 
> Are you sure the first interrupt is UI when the vbus from off to on?

Yes, if the error is present, the first interrupt is intr=0x1 (USBi_UI)
and then the NULL pointer dereference would occur.
(Also: Checking for ci->status == NULL and avoiding the dereference does
not make the gadget work. It just avoids the kernel panic.)

But I also observed a situation where the first interrupt is intr=0x100
(USBi_SLI) followed by 0x40 (USBi_URI), 0x4 (USBi_PCI) and three times
0x1 (USBi_UI).
After this "g_ether gadget: suspend" appears and the sequence repeats,
starting again with intr=0x100, followed by 0x40, ... until three times
0x1 and the g_ether gadget: suspend message.
On the host, every 500ms a new message with incrementing device number
appears:
usb 1-4: new high-speed USB device number 41 using xhci_hcd
usb 1-4: new high-speed USB device number 42 using xhci_hcd
...

In the case where everything works, it looks like this:
intr=0x100 (USBi_SLI)
intr=0x40 (USBi_URI)
intr=0x4 (USBi_PCI)
intr=0x1 (USBi_UI)
intr=0x1 (USBi_UI)
ci_hdrc ci_hdrc.0: freeing queued_request
intr=0x41 (USBi_URI + USBi_UI)
intr=0x4 (USBi_PCI)
intr=0x1 (USBi_UI) <-- appears 17 times
g_ether gadget: high-speed config #1: CDC Ethernet (EEM)
intr=0x1 (USBi_UI) <-- appears 5 times
IPv6: ADDRCONF(NETDEV_CHANGE): usb0: link becomes ready

--

Do you think this could be a hardware problem? We used the same method
as in the MCIMX6Q-SDB schematics (SPF-27516_C5.pdf) to avoid any current
flow through OTG VBUS to the inside when the board is powered off but a
host PC is still connected via OTG.
So we not just pass the VBUS signal through, there are two MOSFETs,
which prevent that (if the internal 3.3V is low).
Mostly the same logic as in said document on page 11 (top-left area).

Another possibility, I am investigating now, is a ground loop and a
main-supply voltage-dependency, although the whole USB OTG part is
on a completely different supply rail, the GNDs are shared.

I am investigating in all directions at the moment ;-)

--

I also switched to CDC/EEM to make sure it has nothing to do with RNDIS,
and the problem is still present. So the error must be on a lower level.

--

You could try to reproduce it with a MCIMX6Q-SDB and varying the main
supply voltage between minimum and maximum allowed voltage levels. For
example: Plug OTG in once at the minimum and once at the maximum level,
see if it behaves differently.
But this is just one of my desperate theories at the moment..

Regards,
Clemens

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-28 18:15       ` Clemens Gruber
@ 2016-08-29 10:24         ` Peter Chen
  2016-08-30 17:20           ` Clemens Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Chen @ 2016-08-29 10:24 UTC (permalink / raw)
  To: Clemens Gruber; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Sun, Aug 28, 2016 at 08:15:02PM +0200, Clemens Gruber wrote:
> On Sat, Aug 27, 2016 at 01:21:52AM +0800, Peter Chen wrote:
> > The gadget triggers UI interrupt due to host sends packet.
> > 
> > I really can't understand that, why host does not send bus reset
> > before sending packet (eg, GET_DESCRIPTOR)? It violates USB spec.
> > 
> > Are you sure the first interrupt is UI when the vbus from off to on?
> 
> Yes, if the error is present, the first interrupt is intr=0x1 (USBi_UI)
> and then the NULL pointer dereference would occur.
> (Also: Checking for ci->status == NULL and avoiding the dereference does
> not make the gadget work. It just avoids the kernel panic.)
> 
> But I also observed a situation where the first interrupt is intr=0x100
> (USBi_SLI) followed by 0x40 (USBi_URI), 0x4 (USBi_PCI) and three times
> 0x1 (USBi_UI).
> After this "g_ether gadget: suspend" appears and the sequence repeats,
> starting again with intr=0x100, followed by 0x40, ... until three times
> 0x1 and the g_ether gadget: suspend message.
> On the host, every 500ms a new message with incrementing device number
> appears:
> usb 1-4: new high-speed USB device number 41 using xhci_hcd
> usb 1-4: new high-speed USB device number 42 using xhci_hcd
> ...
> 
> In the case where everything works, it looks like this:
> intr=0x100 (USBi_SLI)
> intr=0x40 (USBi_URI)
> intr=0x4 (USBi_PCI)
> intr=0x1 (USBi_UI)
> intr=0x1 (USBi_UI)
> ci_hdrc ci_hdrc.0: freeing queued_request
> intr=0x41 (USBi_URI + USBi_UI)
> intr=0x4 (USBi_PCI)
> intr=0x1 (USBi_UI) <-- appears 17 times
> g_ether gadget: high-speed config #1: CDC Ethernet (EEM)
> intr=0x1 (USBi_UI) <-- appears 5 times
> IPv6: ADDRCONF(NETDEV_CHANGE): usb0: link becomes ready
> 
> --
> 
> Do you think this could be a hardware problem? We used the same method
> as in the MCIMX6Q-SDB schematics (SPF-27516_C5.pdf) to avoid any current
> flow through OTG VBUS to the inside when the board is powered off but a
> host PC is still connected via OTG.
> So we not just pass the VBUS signal through, there are two MOSFETs,
> which prevent that (if the internal 3.3V is low).
> Mostly the same logic as in said document on page 11 (top-left area).
> 
> Another possibility, I am investigating now, is a ground loop and a
> main-supply voltage-dependency, although the whole USB OTG part is
> on a completely different supply rail, the GNDs are shared.
> 
> I am investigating in all directions at the moment ;-)
> 

Would you please measure the voltage of vbus within 1s at below two
conditions:

- Just connect cable
- Just disconnect cable

> 
> I also switched to CDC/EEM to make sure it has nothing to do with RNDIS,
> and the problem is still present. So the error must be on a lower level.
> 
> --
> 
> You could try to reproduce it with a MCIMX6Q-SDB and varying the main
> supply voltage between minimum and maximum allowed voltage levels. For
> example: Plug OTG in once at the minimum and once at the maximum level,
> see if it behaves differently.
> But this is just one of my desperate theories at the moment..
> 

Sorry, I have no equipment which can change the voltage of main supplier
now.

-- 

Best Regards,
Peter Chen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-29 10:24         ` Peter Chen
@ 2016-08-30 17:20           ` Clemens Gruber
  2016-09-02  1:55             ` Peter Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Clemens Gruber @ 2016-08-30 17:20 UTC (permalink / raw)
  To: Peter Chen; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Mon, Aug 29, 2016 at 06:24:03PM +0800, Peter Chen wrote:
> Would you please measure the voltage of vbus within 1s at below two
> conditions:
> 
> - Just connect cable
> - Just disconnect cable

We found out that there was a problem with our hardware design!

But first, here is the VBUS measurement during cable plug-in.
https://s17.postimg.org/8ba3rgl6n/linux_kernel_panic_usb_otg_vbus_is_yellow.jpg
(The yellow signal is USB_OTG_VBUS, please ignore the red one)
The kernel panic occurs where the slip of paper with the arrow is.
The signal looks normal to me, I don't think VBUS was the problem.

The other theory was a GND problem..
After thorough investigation, we discovered that we connected all
SHIELD pins of the USB micro OTG connector to the protective-earth
ground (GND_PE) with a 1 MOhm resistor and a 1 nF cap in parallel.
We changed that to connect the SHIELD pins to the same internal GND
as the USB signals have and also replaced the 1 MOhm resistor with a
100 Ohm  resistor as seen in the MCIMX6Q-SDB schematics.
Since that change, the error did not appear again! :)

The only signals I never checked are the differential high speed signals
(USB_OTG_DN and USB_OTG_DP) because I don't have the necessary equipment
like active differential probes for very high frequency measurements, ..
Maybe with the connection of the shield to PE, the cable acted as an
antenna and interfered with the high speed signals?
(Although they are differential, so the same noise to both should not
be a problem? It is still a mystery to me, what was really causing it)

--

What do you think about still fixing that kernel panic in case something
similar happens again for other users?
If such a situation occurs, should we only avoid the NULL pointer
dereference or print out an error message and stop the gadget driver?

Regards,
Clemens

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-08-30 17:20           ` Clemens Gruber
@ 2016-09-02  1:55             ` Peter Chen
  2016-09-02 16:42               ` Clemens Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Chen @ 2016-09-02  1:55 UTC (permalink / raw)
  To: Clemens Gruber; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Tue, Aug 30, 2016 at 07:20:45PM +0200, Clemens Gruber wrote:
> On Mon, Aug 29, 2016 at 06:24:03PM +0800, Peter Chen wrote:
> > Would you please measure the voltage of vbus within 1s at below two
> > conditions:
> > 
> > - Just connect cable
> > - Just disconnect cable
> 
> We found out that there was a problem with our hardware design!
> 
> But first, here is the VBUS measurement during cable plug-in.
> https://s17.postimg.org/8ba3rgl6n/linux_kernel_panic_usb_otg_vbus_is_yellow.jpg
> (The yellow signal is USB_OTG_VBUS, please ignore the red one)
> The kernel panic occurs where the slip of paper with the arrow is.
> The signal looks normal to me, I don't think VBUS was the problem.
> 

Do you have other 5V to USB_H1_VBUS? USB PHY needs 5V input voltage
as the source for USB LDO (3.0v), either from OTG or Host 1. I suspect
the lower vbus voltage causes the USB LDO voltage less than 3.0v, then
cause the unstable for USB PHY. If possible, you can connect MAIN 5V
(if it exists) directly to USB_H1_VBUS to see if this problem is fixed.

> The other theory was a GND problem..
> After thorough investigation, we discovered that we connected all
> SHIELD pins of the USB micro OTG connector to the protective-earth
> ground (GND_PE) with a 1 MOhm resistor and a 1 nF cap in parallel.
> We changed that to connect the SHIELD pins to the same internal GND
> as the USB signals have and also replaced the 1 MOhm resistor with a
> 100 Ohm  resistor as seen in the MCIMX6Q-SDB schematics.
> Since that change, the error did not appear again! :)

Great. Can you see the sudden lower for vbus again? If it still exists, it may
be GND issue.

> 
> The only signals I never checked are the differential high speed signals
> (USB_OTG_DN and USB_OTG_DP) because I don't have the necessary equipment
> like active differential probes for very high frequency measurements, ..
> Maybe with the connection of the shield to PE, the cable acted as an
> antenna and interfered with the high speed signals?


> (Although they are differential, so the same noise to both should not
> be a problem? It is still a mystery to me, what was really causing it)
> 

The poor signal may cause the controller has unexpected behavior.

> 
> What do you think about still fixing that kernel panic in case something
> similar happens again for other users?
> If such a situation occurs, should we only avoid the NULL pointer
> dereference or print out an error message and stop the gadget driver?
> 

Avoid NULL pointer deference is necessary. Patch is welcome :)

-- 

Best Regards,
Peter Chen

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-09-02  1:55             ` Peter Chen
@ 2016-09-02 16:42               ` Clemens Gruber
  2016-09-05  3:10                 ` Peter Chen
  0 siblings, 1 reply; 12+ messages in thread
From: Clemens Gruber @ 2016-09-02 16:42 UTC (permalink / raw)
  To: Peter Chen; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Fri, Sep 02, 2016 at 09:55:52AM +0800, Peter Chen wrote:
> Do you have other 5V to USB_H1_VBUS? USB PHY needs 5V input voltage
> as the source for USB LDO (3.0v), either from OTG or Host 1. I suspect
> the lower vbus voltage causes the USB LDO voltage less than 3.0v, then
> cause the unstable for USB PHY. If possible, you can connect MAIN 5V
> (if it exists) directly to USB_H1_VBUS to see if this problem is fixed.

Yes, USB_H1_VBUS is supplied from internal 5V and it is accurate (+/-
0.03V). The USB_H_EN signal enables USB_H1_VBUS through a MIC2026-1.

The USB_OTG_VBUS signal from the picture with 4.69V came from the host
PC and is not under our control.

I read in the ReferenceManual that the USBPHY defaults to USB_H1_VBUS if
both USB_OTG_VBUS and USB_H1_VBUS are available.
If we imagine the following situation: USB OTG cable is plugged-in, the
board is powered on, USB_OTG_VBUS is supplied externally and USBPHY uses
it because USB_H1_EN is not high yet and did not enable USB_H1_VBUS.
If the USBPHY started out being supplied by USB_OTG_VBUS and later on,
USB_H1_VBUS comes up, will the USBPHY switch to USB_H1_VBUS immediately
or stay at being supplied by USB_OTG_VBUS as long as that is available?

> Great. Can you see the sudden lower for vbus again? If it still exists, it may
> be GND issue.

Yes it is still visible. Does not seem to be a problem though.

> Avoid NULL pointer deference is necessary. Patch is welcome :)

Yes, but should we also stop the gadget driver and print an error
message? Or just return from the isr and neither report nor stop?

Regards,
Clemens

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-09-02 16:42               ` Clemens Gruber
@ 2016-09-05  3:10                 ` Peter Chen
  2016-09-05 17:24                   ` Clemens Gruber
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Chen @ 2016-09-05  3:10 UTC (permalink / raw)
  To: Clemens Gruber; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Fri, Sep 02, 2016 at 06:42:43PM +0200, Clemens Gruber wrote:
> On Fri, Sep 02, 2016 at 09:55:52AM +0800, Peter Chen wrote:
> > Do you have other 5V to USB_H1_VBUS? USB PHY needs 5V input voltage
> > as the source for USB LDO (3.0v), either from OTG or Host 1. I suspect
> > the lower vbus voltage causes the USB LDO voltage less than 3.0v, then
> > cause the unstable for USB PHY. If possible, you can connect MAIN 5V
> > (if it exists) directly to USB_H1_VBUS to see if this problem is fixed.
> 
> Yes, USB_H1_VBUS is supplied from internal 5V and it is accurate (+/-
> 0.03V). The USB_H_EN signal enables USB_H1_VBUS through a MIC2026-1.
> 
> The USB_OTG_VBUS signal from the picture with 4.69V came from the host
> PC and is not under our control.
> 
> I read in the ReferenceManual that the USBPHY defaults to USB_H1_VBUS if
> both USB_OTG_VBUS and USB_H1_VBUS are available.
> If we imagine the following situation: USB OTG cable is plugged-in, the
> board is powered on, USB_OTG_VBUS is supplied externally and USBPHY uses
> it because USB_H1_EN is not high yet and did not enable USB_H1_VBUS.
> If the USBPHY started out being supplied by USB_OTG_VBUS and later on,
> USB_H1_VBUS comes up, will the USBPHY switch to USB_H1_VBUS immediately
> or stay at being supplied by USB_OTG_VBUS as long as that is available?

The USB LDO will use the higher voltage as input.

> 
> > Great. Can you see the sudden lower for vbus again? If it still exists, it may
> > be GND issue.
> 
> Yes it is still visible. Does not seem to be a problem though.
> 
> > Avoid NULL pointer deference is necessary. Patch is welcome :)
> 
> Yes, but should we also stop the gadget driver and print an error
> message? Or just return from the isr and neither report nor stop?
> 

How about below, it will set halt for device, and host will get stall
from the device.

diff --git a/drivers/usb/chipidea/udc.c b/drivers/usb/chipidea/udc.c
index 0f692fc..3c46ccb 100644
--- a/drivers/usb/chipidea/udc.c
+++ b/drivers/usb/chipidea/udc.c
@@ -946,6 +946,11 @@ static int isr_setup_status_phase(struct ci_hdrc *ci)
 	int retval;
 	struct ci_hw_ep *hwep;
 
+	if (!ci->status) {
+		WARN_ON(1);
+		return -EPIPE;
+	}
+
 	hwep = (ci->ep0_dir == TX) ? ci->ep0out : ci->ep0in;
 	ci->status->context = ci;
 	ci->status->complete = isr_setup_status_complete;

-- 

Best Regards,
Peter Chen

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: chipidea: udc: kernel panic in isr_setup_status_phase
  2016-09-05  3:10                 ` Peter Chen
@ 2016-09-05 17:24                   ` Clemens Gruber
  0 siblings, 0 replies; 12+ messages in thread
From: Clemens Gruber @ 2016-09-05 17:24 UTC (permalink / raw)
  To: Peter Chen; +Cc: linux-usb, Greg Kroah-Hartman, linux-kernel

On Mon, Sep 05, 2016 at 11:10:22AM +0800, Peter Chen wrote:
> How about below, it will set halt for device, and host will get stall
> from the device.
> 
> diff --git a/drivers/usb/chipidea/udc.c b/drivers/usb/chipidea/udc.c
> index 0f692fc..3c46ccb 100644
> --- a/drivers/usb/chipidea/udc.c
> +++ b/drivers/usb/chipidea/udc.c
> @@ -946,6 +946,11 @@ static int isr_setup_status_phase(struct ci_hdrc *ci)
>  	int retval;
>  	struct ci_hw_ep *hwep;
>  
> +	if (!ci->status) {
> +		WARN_ON(1);
> +		return -EPIPE;
> +	}
> +
>  	hwep = (ci->ep0_dir == TX) ? ci->ep0out : ci->ep0in;
>  	ci->status->context = ci;
>  	ci->status->complete = isr_setup_status_complete;
> 

Returning -EPIPE works! I would however suggest to only warn once, as
this otherwise floods the kernel log.

I'll send a patch shortly, also adding a comment for people experiencing
a similar hardware problem.

Thank you very much for your help, Peter!

Regards,
Clemens

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2016-09-05 17:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-23  0:36 chipidea: udc: kernel panic in isr_setup_status_phase Clemens Gruber
2016-08-24  8:11 ` Peter Chen
2016-08-25 23:47   ` Clemens Gruber
2016-08-26 17:21     ` Peter Chen
2016-08-28 18:15       ` Clemens Gruber
2016-08-29 10:24         ` Peter Chen
2016-08-30 17:20           ` Clemens Gruber
2016-09-02  1:55             ` Peter Chen
2016-09-02 16:42               ` Clemens Gruber
2016-09-05  3:10                 ` Peter Chen
2016-09-05 17:24                   ` Clemens Gruber
2016-08-26 17:22     ` Peter Chen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.