Hi,

Alan Stern writes:
> On Tue, 20 Sep 2016, Felipe Balbi wrote:
>
>> And here's trace output (complete, scroll to bottom). It seems to me
>> like the thread didn't wake up at all. It didn't even try to
>> execute. I'll add some more traces and try to get better information
>> about what's going on.
>>
>>
>>
>> # tracer: nop
>> #
>> # entries-in-buffer/entries-written: 2865/2865   #P:4
>> #
>> #                              _-----=> irqs-off
>> #                             / _----=> need-resched
>> #                            | / _---=> hardirq/softirq
>> #                            || / _--=> preempt-depth
>> #                            ||| /     delay
>> #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
>> #              | |       |   ||||       |         |
>
> Skipping to the end...
>
>>      irq/17-dwc3-2522  [002] d...    43.504199: bulk_out_complete: 0, 31/31
>>     file-storage-2521  [001] ....    43.504202: fsg_main_thread: get_next_command -> 0
>>     file-storage-2521  [001] ....    43.504288: fsg_main_thread: do_scsi_command -> 0
>>     file-storage-2521  [001] ....    43.504295: fsg_main_thread: finish_reply -> 0
>>     file-storage-2521  [001] ....    43.504298: fsg_main_thread: send_status -> 0
>>      irq/17-dwc3-2522  [002] d...    43.504347: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504351: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504434: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504438: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504535: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504539: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504618: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504703: bulk_in_complete: 0, 16384/16384
>>      irq/17-dwc3-2522  [002] d...    43.504794: bulk_in_complete: 0, 13/13
>>      irq/17-dwc3-2522  [002] d...    43.504797: bulk_out_complete: 0, 31/31
>
> Like you say, it appears that the thread didn't get woken up at all.
> But this is inconsistent with your earlier results. On Sep. 9, you
> posted a message that ended with these lines:
>
>>      irq/17-dwc3-3579  [003] d..1 21167.729666: bulk_out_complete: compl: bh ffff880111e6aac0 state 1
>>    file-storage-3578  [002] .... 21167.729670: fsg_main_thread: next: bh ffff880111e6aac0 state 1
>
> This indicates that in the earlier test, the thread did start running
> and get_next_command should have returned.
>
> The trace you posted after this one doesn't seem to show anything new,
> as far as I can tell.
>
> So I still can't tell what's happening. Maybe the patch below will
> help. It concentrates on the critical area.

Sorry for the long delay, I finally have more information on this.

All this time I was doing something that I never considered to matter:
I've been running host and peripheral on the same machine.

Now that I have tracepoints on xHCI as well, I could see that these 30
seconds of "nothing" are actually full of xHCI activity, and that for
the duration of those 30 seconds the preempt depth on the CPU that
(eventually) queues a request on dwc3 is always >= 1 (sometimes 2, most
of the time 1).

My conclusion from that is that xHCI (or usbcore ?!?) locks the CPU and
g_mass_storage is spinning for over 30 seconds, at which point
storage.ko (the host-side class driver) dequeues the request.

I'll see if I can capture a fresh trace with both xHCI and dwc3 with
this happening, but probably not today (testing stuff for -rc).

--
balbi
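
P.S. In case it helps anyone reproduce the preempt-depth observation: below
is a minimal sketch of a debug helper that dumps the raw preempt_count so it
can be correlated with the preempt-depth column ftrace prints for the same
CPU. The helper name (dbg_preempt_depth) and where you would call it from
(e.g. just before the request gets queued on dwc3, or from fsg_main_thread)
are purely illustrative, not something from this thread or from any posted
patch.

/*
 * Illustrative debug helper (hypothetical, not from this thread): log the
 * raw preempt_count with a tag so it can be matched against the
 * preempt-depth column in the ftrace output for the same CPU.
 */
#include <linux/preempt.h>
#include <linux/kernel.h>

static inline void dbg_preempt_depth(const char *tag)
{
	/* PREEMPT_MASK extracts the preemption-disable depth (low byte) */
	trace_printk("%s: preempt_count=0x%x depth=%d\n",
		     tag, preempt_count(), preempt_count() & PREEMPT_MASK);
}

Sprinkling a call like this at the two ends (host-side interrupt path and
gadget-side queuing path) should show whether the nonzero depth really spans
the whole 30-second window, rather than just the instants the trace happens
to sample.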