On Tue, Dec 10, 2019 at 11:25:35PM +0900, Suwan Kim wrote:
> On Mon, Dec 09, 2019 at 03:19:59PM +0100, Marek Marczykowski-Górecki wrote:
> > On Mon, Dec 09, 2019 at 03:35:43PM +0900, Suwan Kim wrote:
> > > On Mon, Dec 09, 2019 at 04:37:40AM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Mon, Dec 09, 2019 at 11:01:30AM +0900, Suwan Kim wrote:
> > > > > On Fri, Dec 06, 2019 at 09:57:42PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > [  212.890519] usb 1-1: recv xbuf, 42
> > > > > 
> > > > > This message is printed by receive error and before that, driver
> > > > > canceled URB transmission. we need to know the exact situation
> > > > > before this message.
> > > > 
> > > > I've added some more messages and found recv_size is 0.
> > > 
> > > That is the bug point. "size" is urb->actual_length that means
> > > amount of data actually received from device. And "copy" is
> > > amount of data received from usbip server. So, in this situation,
> > > vhci-hcd received all the data from usbip server even if there
> > > are more sg entries left. So, "copy == 0" means vhci-hcd receives
> > > all data from the server and we should check "if (copy == 0)" in
> > > for_each_sg() loop of usbip_recv_xbuff() to exit the loop and not
> > > to add error event.
> > 
> > That makes sense. But I think there is also another issue here: hang in
> > case of an error. Here it was EINVAL, but there are probably other
> > reasons why usbip_recv can fail, like network error or misbehaving
> > server. This definitely should not cause the client to fail this way...
> > And also, the actual error code is lost.
> 
> I agree. I have been taking a look at it and trying to reproduce
> the same issue on my machine. I guess race condition between hub
> irq thread and driver (vhci_urb_enqueue or hcd?)
> But I'm not sure...

I've tried it some more time and one time I've got a traceback pointing
at lock_release() called from vhci_urb_dequeue, not lock_acquire(). This
get me thinking it may not be deadlock on a spinlock, but some infinite
loop. Looking at the source, I think it's about usb_hcd_flush_endpoint
looping indefinitely because vhci_urb_dequeue() exit early on this:


    spin_lock_irqsave(&vhci->lock, flags);

    priv = urb->hcpriv;
    if (!priv) {
        /* URB was never linked! or will be soon given back by
         * vhci_rx. */
        spin_unlock_irqrestore(&vhci->lock, flags);
        return -EIDRM;
    }

Adding a print there confirms it.

And I think it's because of vhci_recv_ret_submit():


    spin_lock_irqsave(&vdev->priv_lock, flags);
    urb = pickup_urb_and_free_priv(vdev, pdu->base.seqnum);

// ****** priv freed here

    spin_unlock_irqrestore(&vdev->priv_lock, flags);

    if (!urb) {
        pr_err("cannot find a urb of seqnum %u max seqnum %d\n",
            pdu->base.seqnum,
            atomic_read(&vhci_hcd->seqnum));
        usbip_event_add(ud, VDEV_EVENT_ERROR_TCP);
        return;
    }
    
    /* unpack the pdu to a urb */
    usbip_pack_pdu(pdu, urb, USBIP_RET_SUBMIT, 0);

    /* recv transfer buffer */
    if (usbip_recv_xbuff(ud, urb) < 0)
// ***** exit early here
        return;

    /* recv iso_packet_descriptor */
    if (usbip_recv_iso(ud, urb) < 0)
        return;

I'm not really sure what should happen, but I think some cleanup in case
of usbip_recv_xbuff() failure is missing. And probably in case of
usbip_recv_iso() failure too.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?