On Tue, Dec 10, 2019 at 11:25:35PM +0900, Suwan Kim wrote: > On Mon, Dec 09, 2019 at 03:19:59PM +0100, Marek Marczykowski-Górecki wrote: > > On Mon, Dec 09, 2019 at 03:35:43PM +0900, Suwan Kim wrote: > > > On Mon, Dec 09, 2019 at 04:37:40AM +0100, Marek Marczykowski-Górecki wrote: > > > > On Mon, Dec 09, 2019 at 11:01:30AM +0900, Suwan Kim wrote: > > > > > On Fri, Dec 06, 2019 at 09:57:42PM +0100, Marek Marczykowski-Górecki wrote: > > > > > > [ 212.890519] usb 1-1: recv xbuf, 42 > > > > > > > > > > This message is printed by receive error and before that, driver > > > > > canceled URB transmission. we need to know the exact situation > > > > > before this message. > > > > > > > > I've added some more messages and found recv_size is 0. > > > > > > That is the bug point. "size" is urb->actual_length that means > > > amount of data actually received from device. And "copy" is > > > amount of data received from usbip server. So, in this situation, > > > vhci-hcd received all the data from usbip server even if there > > > are more sg entries left. So, "copy == 0" means vhci-hcd receives > > > all data from the server and we should check "if (copy == 0)" in > > > for_each_sg() loop of usbip_recv_xbuff() to exit the loop and not > > > to add error event. > > > > That makes sense. But I think there is also another issue here: hang in > > case of an error. Here it was EINVAL, but there are probably other > > reasons why usbip_recv can fail, like network error or misbehaving > > server. This definitely should not cause the client to fail this way... > > And also, the actual error code is lost. > > I agree. I have been taking a look at it and trying to reproduce > the same issue on my machine. I guess race condition between hub > irq thread and driver (vhci_urb_enqueue or hcd?) > But I'm not sure... I've tried it some more time and one time I've got a traceback pointing at lock_release() called from vhci_urb_dequeue, not lock_acquire(). This get me thinking it may not be deadlock on a spinlock, but some infinite loop. Looking at the source, I think it's about usb_hcd_flush_endpoint looping indefinitely because vhci_urb_dequeue() exit early on this: spin_lock_irqsave(&vhci->lock, flags); priv = urb->hcpriv; if (!priv) { /* URB was never linked! or will be soon given back by * vhci_rx. */ spin_unlock_irqrestore(&vhci->lock, flags); return -EIDRM; } Adding a print there confirms it. And I think it's because of vhci_recv_ret_submit(): spin_lock_irqsave(&vdev->priv_lock, flags); urb = pickup_urb_and_free_priv(vdev, pdu->base.seqnum); // ****** priv freed here spin_unlock_irqrestore(&vdev->priv_lock, flags); if (!urb) { pr_err("cannot find a urb of seqnum %u max seqnum %d\n", pdu->base.seqnum, atomic_read(&vhci_hcd->seqnum)); usbip_event_add(ud, VDEV_EVENT_ERROR_TCP); return; } /* unpack the pdu to a urb */ usbip_pack_pdu(pdu, urb, USBIP_RET_SUBMIT, 0); /* recv transfer buffer */ if (usbip_recv_xbuff(ud, urb) < 0) // ***** exit early here return; /* recv iso_packet_descriptor */ if (usbip_recv_iso(ud, urb) < 0) return; I'm not really sure what should happen, but I think some cleanup in case of usbip_recv_xbuff() failure is missing. And probably in case of usbip_recv_iso() failure too. -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?