All of lore.kernel.org
 help / color / mirror / Atom feed
From: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
To: Alan Stern <stern@rowland.harvard.edu>,
	Felipe Balbi <balbi@kernel.org>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: syzbot <syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com>,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	syzkaller-bugs@googlegroups.com,
	Pavel Skripkin <paskripkin@gmail.com>,
	Thierry Escande <thierry.escande@collabora.com>,
	Andrey Konovalov <andreyknvl@gmail.com>
Subject: Re: [syzbot] INFO: task hung in port100_probe
Date: Wed, 20 Oct 2021 22:56:42 +0200	[thread overview]
Message-ID: <b9695fc8-51b5-c61e-0a2f-fec9c2f0bae0@canonical.com> (raw)
In-Reply-To: <20210722144721.GA6592@rowland.harvard.edu>

On 22/07/2021 16:47, Alan Stern wrote:
> On Thu, Jul 22, 2021 at 04:20:10PM +0200, Krzysztof Kozlowski wrote:
>> On 22/06/2021 17:43, syzbot wrote:
>>> Hello,
>>>
>>> syzbot found the following issue on:
>>>
>>> HEAD commit:    fd0aa1a4 Merge tag 'for-linus' of git://git.kernel.org/pub..
>>> git tree:       upstream
>>> console output: https://syzkaller.appspot.com/x/log.txt?x=13e1500c300000
>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=7ca96a2d153c74b0
>>> dashboard link: https://syzkaller.appspot.com/bug?extid=abd2e0dafb481b621869
>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1792e284300000
>>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=13ad9d48300000
>>>
>>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>>> Reported-by: syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com
>>>
>>> INFO: task kworker/0:1:7 blocked for more than 143 seconds.
>>>       Not tainted 5.13.0-rc6-syzkaller #0
>>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>> task:kworker/0:1     state:D stack:25584 pid:    7 ppid:     2 flags:0x00004000
>>> Workqueue: usb_hub_wq hub_event
>>> Call Trace:
>>>  context_switch kernel/sched/core.c:4339 [inline]
>>>  __schedule+0x916/0x23e0 kernel/sched/core.c:5147
>>>  schedule+0xcf/0x270 kernel/sched/core.c:5226
>>>  schedule_timeout+0x1db/0x250 kernel/time/timer.c:1868
>>>  do_wait_for_common kernel/sched/completion.c:85 [inline]
>>>  __wait_for_common kernel/sched/completion.c:106 [inline]
>>>  wait_for_common kernel/sched/completion.c:117 [inline]
>>>  wait_for_completion+0x168/0x270 kernel/sched/completion.c:138
>>>  port100_send_cmd_sync drivers/nfc/port100.c:923 [inline]
>>>  port100_get_command_type_mask drivers/nfc/port100.c:1008 [inline]
>>>  port100_probe+0x9e4/0x1340 drivers/nfc/port100.c:1554
>>>  usb_probe_interface+0x315/0x7f0 drivers/usb/core/driver.c:396
> ...
> 
>> Cc: Thierry, Alan, Andrey,
>>
>> The issue is reproducible immediately on QEMU instance with
>> USB_DUMMY_HCD and USB_RAW_GADGET. I don't know about real port100 NFC
>> device.
>>
>> I spent some time looking into this and have no clue, except that it
>> looks like an effect of a race condition.
>>
>> 1. When using syskaller reproducer against one USB device (In the C
>> reproducer change the loop in main() to use procid=0) - issue does not
>> happen.
>>
>> 2. With two threads or more talking to separate Dummy USB devices, the
>> issue appears. The more of them, the better...
>>
>> 3. The reported problem is in missing complete. The correct flow is like:
>> port100_probe()
>> port100_get_command_type_mask()
>> port100_send_cmd_sync()
>> port100_send_cmd_async()
>> port100_submit_urb_for_ack()
>> port100_send_complete()
>> [   63.363863] port100 2-1:0.0: NFC: Urb failure (status -71)
>> port100_recv_ack()
>> [   63.369942] port100 2-1:0.0: NFC: Urb failure (status -71)
>>
>> and schedule_work() which completes and unblocks port100_send_cmd_sync
>>
>> However in the failing case (hung task) the port100_recv_ack() is never
>> called. It looks like USB core / HCD / gadget does not send the Ack/URB
>> complete.
>>
>> I don't know why. The port100 NFC driver code looks OK, except it is not
>> prepared for missing ack/urb so it waits indefinitely. I could try to
>> convert it to wait_for_completion_timeout() but it won't be trivial and
>> more important - I am not sure if this is the problem. Somehow the ACK
>> with Urb failure is not sent back to the port100 device. Therefore I am
>> guessing that the race condition is somwhere in USB stack, not in
>> port100 driver.
>>
>> The lockdep and other testing tools did not find anything here.
>>
>> Anyone hints where the issue could be?
> 
> Here's what I wrote earlier: "It looks like the problem stems from the fact 
> that port100_send_frame_async() submits two URBs, but 
> port100_send_cmd_sync() only waits for one of them to complete.  The other 
> URB may then still be active when the driver tries to reuse it."
> 
> Of course, there may be more than one problem, so we may not be talking 
> about the same thing.

Hi Alan, Felipe, Greg and others,

This is an old issue reported by syzkaller for NFC port100 driver [1].
There is something similar for pn533 [2].

I was looking at it some time ago, took a break and now I am trying to
fix it again. Without success.

The issue is reproducible via USB gadget on QEMU, not on real HW. I
looked and debugged the code and I think previously mentioned
double-URB-submit is not the reason here. Or I miss how the USB works
(which is quite probable...).

1. The port100 driver calls port100_send_cmd_sync() which eventually
goes to port100_send_frame_async(). After it, it waits for "sync"
completion.

2. In port100_send_frame_async(), driver indeed first submits "out_urb"
which quite fast is being processed by dummy_hcd with "no ep configured"
and -EPROTO.

3. Then (or sometimes before -EPROTO response from (2) above) the
port100_send_frame_async() submits "in_urb" via
port100_submit_urb_for_ack() and waits for its completion. Completion of
"in_urb" (or the "ack") in port100_recv_ack() would schedule work to
complete the (1) above - the sync completion.

4. Usually, when reproducer works fine (does not trigger issue), the
dummy_timer() from gadget responds with the same "no ep configured for
urb" for this "in_urb" (3). This completes "in_urb", which eventually
completes (1) and probe finishes with error. Error is expected, because
it's random junk-gadget...

The syzkaller reproducer fails if >1 of threads are running these usb
gadgets.  When this happens, no "in_urb" completion happens. No this
"ack" port100_recv_ack().

I added some debugs and simply dummy_hcd dummy_timer() is woken up on
enqueuing in_urb and then is looping crazy on a previous URB (some older
URB, coming from before port100 driver probe started). The dummy_timer()
loop never reaches the second "in_urb" to process it, I think.

The pn533 NFC driver has similar design, but I have now really doubts it
is a NFC driver issue. Instead an issue in dummy gadget HCD is somehow
triggered by the reproducer.

Reproduction - just follow [1] or [2]. Eventually I slightly tweaked the
code and put here:
https://github.com/krzk/tools/tree/master/tests-var/nfc/port100_probe
$ make
$ sudo ./port100_probe


[1] https://syzkaller.appspot.com/bug?extid=abd2e0dafb481b621869
[2] https://syzkaller.appspot.com/bug?extid=1dc8b460d6d48d7ef9ca


Best regards,
Krzysztof

  parent reply	other threads:[~2021-10-20 20:56 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-22 15:43 [syzbot] INFO: task hung in port100_probe syzbot
2021-06-22 16:07 ` Pavel Skripkin
2021-06-22 16:21   ` syzbot
2021-07-22 14:20 ` Krzysztof Kozlowski
2021-07-22 14:23   ` Krzysztof Kozlowski
2021-07-22 14:47   ` Alan Stern
2021-07-23  9:05     ` Krzysztof Kozlowski
2021-07-23 13:07       ` Alan Stern
2021-10-20 20:56     ` Krzysztof Kozlowski [this message]
2021-10-20 22:05       ` Alan Stern
2021-10-25 14:57         ` Krzysztof Kozlowski
2021-10-25 16:22           ` Alan Stern
2021-10-25 17:13             ` Krzysztof Kozlowski
2021-10-25 18:54               ` Alan Stern
2022-03-09 19:33 ` Pavel Skripkin
2022-03-09 19:56   ` syzbot
     [not found] <20220310084247.1148-1-hdanton@sina.com>
2022-03-10 14:22 ` syzbot
     [not found] ` <20220311053751.1226-1-hdanton@sina.com>
2022-03-11 19:17   ` Pavel Skripkin
2022-03-11 19:18     ` syzbot
2022-03-11 19:19       ` Pavel Skripkin
2022-03-11 19:32         ` syzbot
     [not found]   ` <20220312005624.1310-1-hdanton@sina.com>
2022-03-12 10:36     ` Pavel Skripkin
     [not found]     ` <20220312115854.1399-1-hdanton@sina.com>
2022-03-12 12:44       ` Pavel Skripkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b9695fc8-51b5-c61e-0a2f-fec9c2f0bae0@canonical.com \
    --to=krzysztof.kozlowski@canonical.com \
    --cc=andreyknvl@gmail.com \
    --cc=balbi@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=paskripkin@gmail.com \
    --cc=stern@rowland.harvard.edu \
    --cc=syzbot+abd2e0dafb481b621869@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    --cc=thierry.escande@collabora.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.