[PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command
@ 2016-12-05  7:51 Baolin Wang
  2016-12-05  7:51 ` [PATCH 2/2] usb: host: xhci: Handle the right timeout command Baolin Wang
  2016-12-05 14:08 ` [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command Mathias Nyman
  0 siblings, 2 replies; 32+ messages in thread
From: Baolin Wang @ 2016-12-05  7:51 UTC (permalink / raw)
  To: mathias.nyman, gregkh
  Cc: baolu.lu, linux-usb, linux-kernel, broonie, baolin.wang

When current command was supposed to be aborted, host will free the command
in handle_cmd_completion() function. But it might be still referenced by
xhci->current_cmd, which need to set NULL.

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
---
This patch is based on Lu Baolu's new fix patch:
usb: xhci: fix possible wild pointer
https://www.spinics.net/lists/linux-usb/msg150219.html
---
 drivers/usb/host/xhci-ring.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 62dd1c7..9965a4c 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1362,8 +1362,11 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
 	 */
 	if (cmd_comp_code == COMP_CMD_ABORT) {
 		xhci->cmd_ring_state = CMD_RING_STATE_STOPPED;
-		if (cmd->status == COMP_CMD_ABORT)
+		if (cmd->status == COMP_CMD_ABORT) {
+			if (xhci->current_cmd == cmd)
+				xhci->current_cmd = NULL;
 			goto event_handled;
+		}
 	}
 
 	cmd_type = TRB_FIELD_TO_TYPE(le32_to_cpu(cmd_trb->generic.field[3]));
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-05  7:51 [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command Baolin Wang
@ 2016-12-05  7:51 ` Baolin Wang
  2016-12-12 15:52   ` Mathias Nyman
  2016-12-05 14:08 ` [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command Mathias Nyman
  1 sibling, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-05  7:51 UTC (permalink / raw)
  To: mathias.nyman, gregkh
  Cc: baolu.lu, linux-usb, linux-kernel, broonie, baolin.wang

If a command event is found on the event ring during an interrupt,
we need to stop the command timer with del_timer(). Since del_timer()
can fail if the timer is running and waiting on the xHCI lock, then
it maybe get the wrong timeout command in xhci_handle_command_timeout()
if host fetched a new command and updated the xhci->current_cmd in
handle_cmd_completion(). For this situation, we need a way to signal
to the command timer that everything is fine and it should exit.

We should introduce a counter (xhci->current_cmd_pending) for the number
of pending commands. If we need to cancel the command timer and del_timer()
succeeds, we decrement the number of pending commands. If del_timer() fails,
we leave the number of pending commands alone.

For handling timeout command, in xhci_handle_command_timeout() we will check
the counter after decrementing it, if the counter (xhci->current_cmd_pending)
is 0, which means xhci->current_cmd is the right timeout command. If the
counter (xhci->current_cmd_pending) is greater than 0, which means current
timeout command has been handled by host and host has fetched new command as
xhci->current_cmd, then just return and wait for new current command.

Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
---
 drivers/usb/host/xhci-ring.c |   29 ++++++++++++++++++++++++++++-
 drivers/usb/host/xhci.h      |    1 +
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 9965a4c..edc9ac2 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1253,6 +1253,7 @@ static void xhci_handle_stopped_cmd_ring(struct xhci_hcd *xhci,
 	if ((xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) &&
 	    !(xhci->xhc_state & XHCI_STATE_DYING)) {
 		xhci->current_cmd = cur_cmd;
+		xhci->current_cmd_pending++;
 		mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
 		xhci_ring_cmd_db(xhci);
 	}
@@ -1269,11 +1270,27 @@ void xhci_handle_command_timeout(unsigned long data)
 	xhci = (struct xhci_hcd *) data;
 
 	spin_lock_irqsave(&xhci->lock, flags);
+	xhci->current_cmd_pending--;
+
 	if (!xhci->current_cmd) {
 		spin_unlock_irqrestore(&xhci->lock, flags);
 		return;
 	}
 
+	/*
+	 * If the current_cmd_pending is 0, which means current command is
+	 * timeout.
+	 *
+	 * If the current_cmd_pending is greater than 0, which means current
+	 * timeout command has been handled by host and host has fetched new
+	 * command as xhci->current_cmd, then just return and wait for new
+	 * current command.
+	 */
+	if (xhci->current_cmd_pending > 0) {
+		spin_unlock_irqrestore(&xhci->lock, flags);
+		return;
+	}
+
 	if (xhci->current_cmd->status == COMP_CMD_ABORT)
 		second_timeout = true;
 	xhci->current_cmd->status = COMP_CMD_ABORT;
@@ -1282,6 +1299,8 @@ void xhci_handle_command_timeout(unsigned long data)
 	hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
 	if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
 	    (hw_ring_state & CMD_RING_RUNNING))  {
+		/* Will add command timer again to wait for abort event */
+		xhci->current_cmd_pending++;
 		spin_unlock_irqrestore(&xhci->lock, flags);
 		xhci_dbg(xhci, "Command timeout\n");
 		ret = xhci_abort_cmd_ring(xhci);
@@ -1336,7 +1355,13 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
 
 	cmd = list_entry(xhci->cmd_list.next, struct xhci_command, cmd_list);
 
-	del_timer(&xhci->cmd_timer);
+	/*
+	 * If the command timer is running on another CPU, we don't decrement
+	 * current_cmd_pending, since we didn't successfully stop the command
+	 * timer.
+	 */
+	if (del_timer(&xhci->cmd_timer))
+		xhci->current_cmd_pending--;
 
 	trace_xhci_cmd_completion(cmd_trb, (struct xhci_generic_trb *) event);
 
@@ -1427,6 +1452,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
 	if (cmd->cmd_list.next != &xhci->cmd_list) {
 		xhci->current_cmd = list_entry(cmd->cmd_list.next,
 					       struct xhci_command, cmd_list);
+		xhci->current_cmd_pending++;
 		mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
 	} else if (xhci->current_cmd == cmd) {
 		xhci->current_cmd = NULL;
@@ -3927,6 +3953,7 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd,
 	if (xhci->cmd_list.next == &cmd->cmd_list &&
 	    !timer_pending(&xhci->cmd_timer)) {
 		xhci->current_cmd = cmd;
+		xhci->current_cmd_pending++;
 		mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
 	}
 
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 9dbaacf..5d81257 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1567,6 +1567,7 @@ struct xhci_hcd {
 	unsigned int		cmd_ring_reserved_trbs;
 	struct timer_list	cmd_timer;
 	struct xhci_command	*current_cmd;
+	u32			current_cmd_pending;
 	struct xhci_ring	*event_ring;
 	struct xhci_erst	erst;
 	/* Scratchpad */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command
  2016-12-05  7:51 [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command Baolin Wang
  2016-12-05  7:51 ` [PATCH 2/2] usb: host: xhci: Handle the right timeout command Baolin Wang
@ 2016-12-05 14:08 ` Mathias Nyman
  1 sibling, 0 replies; 32+ messages in thread
From: Mathias Nyman @ 2016-12-05 14:08 UTC (permalink / raw)
  To: Baolin Wang, mathias.nyman, gregkh
  Cc: baolu.lu, linux-usb, linux-kernel, broonie

On 05.12.2016 09:51, Baolin Wang wrote:
> When current command was supposed to be aborted, host will free the command
> in handle_cmd_completion() function. But it might be still referenced by
> xhci->current_cmd, which need to set NULL.
>
> Signed-off-by: Baolin Wang <baolin.wang@linaro.org>
> ---
> This patch is based on Lu Baolu's new fix patch:
> usb: xhci: fix possible wild pointer
> https://www.spinics.net/lists/linux-usb/msg150219.html
> ---
>   drivers/usb/host/xhci-ring.c |    5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index 62dd1c7..9965a4c 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1362,8 +1362,11 @@ static void handle_cmd_completion(struct xhci_hcd *xhci,
>   	 */
>   	if (cmd_comp_code == COMP_CMD_ABORT) {
>   		xhci->cmd_ring_state = CMD_RING_STATE_STOPPED;
> -		if (cmd->status == COMP_CMD_ABORT)
> +		if (cmd->status == COMP_CMD_ABORT) {
> +			if (xhci->current_cmd == cmd)
> +				xhci->current_cmd = NULL;
>   			goto event_handled;
> +		}
>   	}
>
>   	cmd_type = TRB_FIELD_TO_TYPE(le32_to_cpu(cmd_trb->generic.field[3]));
>

True, thanks

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-05  7:51 ` [PATCH 2/2] usb: host: xhci: Handle the right timeout command Baolin Wang
@ 2016-12-12 15:52   ` Mathias Nyman
  2016-12-13  3:21     ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Mathias Nyman @ 2016-12-12 15:52 UTC (permalink / raw)
  To: Baolin Wang, mathias.nyman
  Cc: gregkh, baolu.lu, linux-usb, linux-kernel, broonie

On 05.12.2016 09:51, Baolin Wang wrote:
> If a command event is found on the event ring during an interrupt,
> we need to stop the command timer with del_timer(). Since del_timer()
> can fail if the timer is running and waiting on the xHCI lock, then
> it maybe get the wrong timeout command in xhci_handle_command_timeout()
> if host fetched a new command and updated the xhci->current_cmd in
> handle_cmd_completion(). For this situation, we need a way to signal
> to the command timer that everything is fine and it should exit.

Ah, right, this could actually happen.
  
>
> We should introduce a counter (xhci->current_cmd_pending) for the number
> of pending commands. If we need to cancel the command timer and del_timer()
> succeeds, we decrement the number of pending commands. If del_timer() fails,
> we leave the number of pending commands alone.
>
> For handling timeout command, in xhci_handle_command_timeout() we will check
> the counter after decrementing it, if the counter (xhci->current_cmd_pending)
> is 0, which means xhci->current_cmd is the right timeout command. If the
> counter (xhci->current_cmd_pending) is greater than 0, which means current
> timeout command has been handled by host and host has fetched new command as
> xhci->current_cmd, then just return and wait for new current command.

A counter like this could work.

Writing the abort bit can generate either ABORT+STOP, or just STOP
event, this seems to cover both.

quick check, case 1: timeout and cmd completion at the same time.

cpu1					cpu2

queue_command(first), p++ (=1)
queue_command(more),
--completion irq fires--		-- timer times out at same time--
handle_cmd_completion()			handle_cmd_timeout(),)
lock(xhci_lock	)			spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
cur_cmd = list_next(), p++ (=2)
unlock(xhci_lock)			
					lock(xhci_lock)
					p-- (=1)
					if (p > 0), exit
OK works

case 2: normal timeout case with ABORT+STOP, no race.

cpu1					cpu2

queue_command(first), p++ (=1)
queue_command(more),
			 		handle_cmd_timeout()
					p-- (P=0), don't exit
					mod_timer(), p++ (P=1)
					write_abort_bit()
handle_cmd_comletion(ABORT)
del_timer(), ok, p-- (p = 0)
handle_cmd_completion(STOP)				
del_timer(), fail, (P=0)
handle_stopped_cmd_ring()
cur_cmd = list_next(), p++ (=1)
mod_timer()

OK, works, and same for just STOP case, with the only difference that
during handle_cmd_completion(STOP) p is decremented (p--)
  				
So unless there is a way to find out if cur_cmd is valid in command timeout
in command timeout with the help of existing flags and lists this would be a working
solution.

-Mathias  					

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-12 15:52   ` Mathias Nyman
@ 2016-12-13  3:21     ` Baolin Wang
  2016-12-19 10:33       ` Mathias Nyman
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-13  3:21 UTC (permalink / raw)
  To: Mathias Nyman; +Cc: mathias.nyman, Greg KH, Lu Baolu, USB, LKML, Mark Brown

Hi Mathias,

On 12 December 2016 at 23:52, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> On 05.12.2016 09:51, Baolin Wang wrote:
>>
>> If a command event is found on the event ring during an interrupt,
>> we need to stop the command timer with del_timer(). Since del_timer()
>> can fail if the timer is running and waiting on the xHCI lock, then
>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>> if host fetched a new command and updated the xhci->current_cmd in
>> handle_cmd_completion(). For this situation, we need a way to signal
>> to the command timer that everything is fine and it should exit.
>
>
> Ah, right, this could actually happen.
>
>>
>>
>> We should introduce a counter (xhci->current_cmd_pending) for the number
>> of pending commands. If we need to cancel the command timer and
>> del_timer()
>> succeeds, we decrement the number of pending commands. If del_timer()
>> fails,
>> we leave the number of pending commands alone.
>>
>> For handling timeout command, in xhci_handle_command_timeout() we will
>> check
>> the counter after decrementing it, if the counter
>> (xhci->current_cmd_pending)
>> is 0, which means xhci->current_cmd is the right timeout command. If the
>> counter (xhci->current_cmd_pending) is greater than 0, which means current
>> timeout command has been handled by host and host has fetched new command
>> as
>> xhci->current_cmd, then just return and wait for new current command.
>
>
> A counter like this could work.
>
> Writing the abort bit can generate either ABORT+STOP, or just STOP
> event, this seems to cover both.
>
> quick check, case 1: timeout and cmd completion at the same time.
>
> cpu1                                    cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
> --completion irq fires--                -- timer times out at same time--
> handle_cmd_completion()                 handle_cmd_timeout(),)
> lock(xhci_lock  )                       spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> cur_cmd = list_next(), p++ (=2)
> unlock(xhci_lock)
>                                         lock(xhci_lock)
>                                         p-- (=1)
>                                         if (p > 0), exit
> OK works
>
> case 2: normal timeout case with ABORT+STOP, no race.
>
> cpu1                                    cpu2
>
> queue_command(first), p++ (=1)
> queue_command(more),
>                                         handle_cmd_timeout()
>                                         p-- (P=0), don't exit
>                                         mod_timer(), p++ (P=1)
>                                         write_abort_bit()
> handle_cmd_comletion(ABORT)
> del_timer(), ok, p-- (p = 0)
> handle_cmd_completion(STOP)
> del_timer(), fail, (P=0)
> handle_stopped_cmd_ring()
> cur_cmd = list_next(), p++ (=1)
> mod_timer()
>
> OK, works, and same for just STOP case, with the only difference that
> during handle_cmd_completion(STOP) p is decremented (p--)

Yes, that's the cases what I want to handle, thanks for your explicit
explanation.

>
> So unless there is a way to find out if cur_cmd is valid in command timeout
> in command timeout with the help of existing flags and lists this would be a
> working
> solution.
>
> -Mathias
>



-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-13  3:21     ` Baolin Wang
@ 2016-12-19 10:33       ` Mathias Nyman
  2016-12-19 11:34         ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Mathias Nyman @ 2016-12-19 10:33 UTC (permalink / raw)
  To: Baolin Wang
  Cc: mathias.nyman, Greg KH, Lu Baolu, USB, LKML, Mark Brown, Lu, Baolu

On 13.12.2016 05:21, Baolin Wang wrote:
> Hi Mathias,
>
> On 12 December 2016 at 23:52, Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>
>>> If a command event is found on the event ring during an interrupt,
>>> we need to stop the command timer with del_timer(). Since del_timer()
>>> can fail if the timer is running and waiting on the xHCI lock, then
>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>> if host fetched a new command and updated the xhci->current_cmd in
>>> handle_cmd_completion(). For this situation, we need a way to signal
>>> to the command timer that everything is fine and it should exit.
>>
>>
>> Ah, right, this could actually happen.
>>
>>>
>>>
>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>> of pending commands. If we need to cancel the command timer and
>>> del_timer()
>>> succeeds, we decrement the number of pending commands. If del_timer()
>>> fails,
>>> we leave the number of pending commands alone.
>>>
>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>> check
>>> the counter after decrementing it, if the counter
>>> (xhci->current_cmd_pending)
>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>> counter (xhci->current_cmd_pending) is greater than 0, which means current
>>> timeout command has been handled by host and host has fetched new command
>>> as
>>> xhci->current_cmd, then just return and wait for new current command.
>>
>>
>> A counter like this could work.
>>
>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>> event, this seems to cover both.
>>
>> quick check, case 1: timeout and cmd completion at the same time.
>>
>> cpu1                                    cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>> --completion irq fires--                -- timer times out at same time--
>> handle_cmd_completion()                 handle_cmd_timeout(),)
>> lock(xhci_lock  )                       spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> cur_cmd = list_next(), p++ (=2)
>> unlock(xhci_lock)
>>                                          lock(xhci_lock)
>>                                          p-- (=1)
>>                                          if (p > 0), exit
>> OK works
>>
>> case 2: normal timeout case with ABORT+STOP, no race.
>>
>> cpu1                                    cpu2
>>
>> queue_command(first), p++ (=1)
>> queue_command(more),
>>                                          handle_cmd_timeout()
>>                                          p-- (P=0), don't exit
>>                                          mod_timer(), p++ (P=1)
>>                                          write_abort_bit()
>> handle_cmd_comletion(ABORT)
>> del_timer(), ok, p-- (p = 0)
>> handle_cmd_completion(STOP)
>> del_timer(), fail, (P=0)
>> handle_stopped_cmd_ring()
>> cur_cmd = list_next(), p++ (=1)
>> mod_timer()
>>
>> OK, works, and same for just STOP case, with the only difference that
>> during handle_cmd_completion(STOP) p is decremented (p--)
>
> Yes, that's the cases what I want to handle, thanks for your explicit
> explanation.
>

Gave this some more thought over the weekend, and this implementation
doesn't solve the case when the last command times out and races with the
completion handler:

cpu1                                    cpu2

queue_command(first), p++ (=1)
--completion irq fires--		-- timer times out at same time--
handle_cmd_completion()			handle_cmd_timeout(),)
lock(xhci_lock )			spin_on(xhci_lock)
del_timer() fail, p (=1, nochange)
no more commands, P (=1, nochange)
unlock(xhci_lock)
					lock(xhci_lock)
					p-- (=0)
					p == 0, continue, even if we should not.
                                           
For this we still need to rely on checking cur_cmd == NULL in the timeout function.
(Baolus patch sets it to NULL if there are no more commands pending)

And then we could replace the whole counter with a simple check if the timeout timer
is pending in the timeout function:

xhci_handle_command_timeout()
	lock()
	if (!cur_cmd || timer_pending(timeout_timer)) {
		unlock();
		return;
	}

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-19 10:33       ` Mathias Nyman
@ 2016-12-19 11:34         ` Baolin Wang
  2016-12-19 12:13           ` Mathias Nyman
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-19 11:34 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: mathias.nyman, Greg KH, Lu Baolu, USB, LKML, Mark Brown, Lu, Baolu

Hi Mathias,

On 19 December 2016 at 18:33, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> On 13.12.2016 05:21, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 12 December 2016 at 23:52, Mathias Nyman
>> <mathias.nyman@linux.intel.com> wrote:
>>>
>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>
>>>>
>>>> If a command event is found on the event ring during an interrupt,
>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>> to the command timer that everything is fine and it should exit.
>>>
>>>
>>>
>>> Ah, right, this could actually happen.
>>>
>>>>
>>>>
>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>> of pending commands. If we need to cancel the command timer and
>>>> del_timer()
>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>> fails,
>>>> we leave the number of pending commands alone.
>>>>
>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>> check
>>>> the counter after decrementing it, if the counter
>>>> (xhci->current_cmd_pending)
>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>> current
>>>> timeout command has been handled by host and host has fetched new
>>>> command
>>>> as
>>>> xhci->current_cmd, then just return and wait for new current command.
>>>
>>>
>>>
>>> A counter like this could work.
>>>
>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>> event, this seems to cover both.
>>>
>>> quick check, case 1: timeout and cmd completion at the same time.
>>>
>>> cpu1                                    cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>> --completion irq fires--                -- timer times out at same time--
>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> cur_cmd = list_next(), p++ (=2)
>>> unlock(xhci_lock)
>>>                                          lock(xhci_lock)
>>>                                          p-- (=1)
>>>                                          if (p > 0), exit
>>> OK works
>>>
>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>
>>> cpu1                                    cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> queue_command(more),
>>>                                          handle_cmd_timeout()
>>>                                          p-- (P=0), don't exit
>>>                                          mod_timer(), p++ (P=1)
>>>                                          write_abort_bit()
>>> handle_cmd_comletion(ABORT)
>>> del_timer(), ok, p-- (p = 0)
>>> handle_cmd_completion(STOP)
>>> del_timer(), fail, (P=0)
>>> handle_stopped_cmd_ring()
>>> cur_cmd = list_next(), p++ (=1)
>>> mod_timer()
>>>
>>> OK, works, and same for just STOP case, with the only difference that
>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>
>>
>> Yes, that's the cases what I want to handle, thanks for your explicit
>> explanation.
>>
>
> Gave this some more thought over the weekend, and this implementation
> doesn't solve the case when the last command times out and races with the
> completion handler:
>
> cpu1                                    cpu2
>
> queue_command(first), p++ (=1)
> --completion irq fires--                -- timer times out at same time--
> handle_cmd_completion()                 handle_cmd_timeout(),)
> lock(xhci_lock )                        spin_on(xhci_lock)
> del_timer() fail, p (=1, nochange)
> no more commands, P (=1, nochange)
> unlock(xhci_lock)
>                                         lock(xhci_lock)
>                                         p-- (=0)
>                                         p == 0, continue, even if we should
> not.
>                                           For this we still need to rely on
> checking cur_cmd == NULL in the timeout function.
> (Baolus patch sets it to NULL if there are no more commands pending)

As I pointed out in patch 1 of this patchset, this patchset is based
on Lu Baolu's new fix patch:
usb: xhci: fix possible wild pointer
https://www.spinics.net/lists/linux-usb/msg150219.html

After applying Baolu's patch, after decrement the counter, we will
check the xhci->cur_command if is NULL. So in this situation:
cpu1                                    cpu2

 queue_command(first), p++ (=1)
 --completion irq fires--                -- timer times out at same time--
 handle_cmd_completion()                 handle_cmd_timeout(),)
 lock(xhci_lock )                        spin_on(xhci_lock)
 del_timer() fail, p (=1, nochange)
 no more commands, P (=1, nochange)
 unlock(xhci_lock)
                                         lock(xhci_lock)
                                         p-- (=0)
                                         no current command, return
                                         if (!xhci->current_cmd) {
                                              unlock(xhci_lock);
                                              return;
                                         }

It can work.

>
> And then we could replace the whole counter with a simple check if the
> timeout timer
> is pending in the timeout function:
>
> xhci_handle_command_timeout()
>         lock()
>         if (!cur_cmd || timer_pending(timeout_timer)) {
>                 unlock();
>                 return;
>         }
>
> -Mathias
>



-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-19 11:34         ` Baolin Wang
@ 2016-12-19 12:13           ` Mathias Nyman
  2016-12-20  3:23             ` Baolin Wang
  2016-12-20  4:29             ` Lu Baolu
  0 siblings, 2 replies; 32+ messages in thread
From: Mathias Nyman @ 2016-12-19 12:13 UTC (permalink / raw)
  To: Baolin Wang, Mathias Nyman
  Cc: Greg KH, Lu Baolu, USB, LKML, Mark Brown, Lu, Baolu

On 19.12.2016 13:34, Baolin Wang wrote:
> Hi Mathias,
>
> On 19 December 2016 at 18:33, Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>
>>> Hi Mathias,
>>>
>>> On 12 December 2016 at 23:52, Mathias Nyman
>>> <mathias.nyman@linux.intel.com> wrote:
>>>>
>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> If a command event is found on the event ring during an interrupt,
>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>> to the command timer that everything is fine and it should exit.
>>>>
>>>>
>>>>
>>>> Ah, right, this could actually happen.
>>>>
>>>>>
>>>>>
>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>> of pending commands. If we need to cancel the command timer and
>>>>> del_timer()
>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>> fails,
>>>>> we leave the number of pending commands alone.
>>>>>
>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>> check
>>>>> the counter after decrementing it, if the counter
>>>>> (xhci->current_cmd_pending)
>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>> current
>>>>> timeout command has been handled by host and host has fetched new
>>>>> command
>>>>> as
>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>
>>>>
>>>>
>>>> A counter like this could work.
>>>>
>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>> event, this seems to cover both.
>>>>
>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>
>>>> cpu1                                    cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> queue_command(more),
>>>> --completion irq fires--                -- timer times out at same time--
>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>> del_timer() fail, p (=1, nochange)
>>>> cur_cmd = list_next(), p++ (=2)
>>>> unlock(xhci_lock)
>>>>                                           lock(xhci_lock)
>>>>                                           p-- (=1)
>>>>                                           if (p > 0), exit
>>>> OK works
>>>>
>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>
>>>> cpu1                                    cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> queue_command(more),
>>>>                                           handle_cmd_timeout()
>>>>                                           p-- (P=0), don't exit
>>>>                                           mod_timer(), p++ (P=1)
>>>>                                           write_abort_bit()
>>>> handle_cmd_comletion(ABORT)
>>>> del_timer(), ok, p-- (p = 0)
>>>> handle_cmd_completion(STOP)
>>>> del_timer(), fail, (P=0)
>>>> handle_stopped_cmd_ring()
>>>> cur_cmd = list_next(), p++ (=1)
>>>> mod_timer()
>>>>
>>>> OK, works, and same for just STOP case, with the only difference that
>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>
>>>
>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>> explanation.
>>>
>>
>> Gave this some more thought over the weekend, and this implementation
>> doesn't solve the case when the last command times out and races with the
>> completion handler:
>>
>> cpu1                                    cpu2
>>
>> queue_command(first), p++ (=1)
>> --completion irq fires--                -- timer times out at same time--
>> handle_cmd_completion()                 handle_cmd_timeout(),)
>> lock(xhci_lock )                        spin_on(xhci_lock)
>> del_timer() fail, p (=1, nochange)
>> no more commands, P (=1, nochange)
>> unlock(xhci_lock)
>>                                          lock(xhci_lock)
>>                                          p-- (=0)
>>                                          p == 0, continue, even if we should
>> not.
>>                                            For this we still need to rely on
>> checking cur_cmd == NULL in the timeout function.
>> (Baolus patch sets it to NULL if there are no more commands pending)
>
> As I pointed out in patch 1 of this patchset, this patchset is based
> on Lu Baolu's new fix patch:
> usb: xhci: fix possible wild pointer
> https://www.spinics.net/lists/linux-usb/msg150219.html
>
> After applying Baolu's patch, after decrement the counter, we will
> check the xhci->cur_command if is NULL. So in this situation:
> cpu1                                    cpu2
>
>   queue_command(first), p++ (=1)
>   --completion irq fires--                -- timer times out at same time--
>   handle_cmd_completion()                 handle_cmd_timeout(),)
>   lock(xhci_lock )                        spin_on(xhci_lock)
>   del_timer() fail, p (=1, nochange)
>   no more commands, P (=1, nochange)
>   unlock(xhci_lock)
>                                           lock(xhci_lock)
>                                           p-- (=0)
>                                           no current command, return
>                                           if (!xhci->current_cmd) {
>                                                unlock(xhci_lock);
>                                                return;
>                                           }
>
> It can work.

Yes,

What I wanted to say is that as it relies on Baolus patch for that one case
it seems that patch 2/2 can be replaced by a single line change:

if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))

Right?

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-19 12:13           ` Mathias Nyman
@ 2016-12-20  3:23             ` Baolin Wang
  2016-12-20  4:29             ` Lu Baolu
  1 sibling, 0 replies; 32+ messages in thread
From: Baolin Wang @ 2016-12-20  3:23 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Mathias Nyman, Greg KH, Lu Baolu, USB, LKML, Mark Brown, Lu, Baolu

On 19 December 2016 at 20:13, Mathias Nyman <mathias.nyman@intel.com> wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>> <mathias.nyman@linux.intel.com> wrote:
>>>
>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>
>>>>
>>>> Hi Mathias,
>>>>
>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>
>>>>>
>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>> it maybe get the wrong timeout command in
>>>>>> xhci_handle_command_timeout()
>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>> to the command timer that everything is fine and it should exit.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Ah, right, this could actually happen.
>>>>>
>>>>>>
>>>>>>
>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the
>>>>>> number
>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>> del_timer()
>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>> fails,
>>>>>> we leave the number of pending commands alone.
>>>>>>
>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>> check
>>>>>> the counter after decrementing it, if the counter
>>>>>> (xhci->current_cmd_pending)
>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If
>>>>>> the
>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>> current
>>>>>> timeout command has been handled by host and host has fetched new
>>>>>> command
>>>>>> as
>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> A counter like this could work.
>>>>>
>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>> event, this seems to cover both.
>>>>>
>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>
>>>>> cpu1                                    cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>> --completion irq fires--                -- timer times out at same
>>>>> time--
>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> cur_cmd = list_next(), p++ (=2)
>>>>> unlock(xhci_lock)
>>>>>                                           lock(xhci_lock)
>>>>>                                           p-- (=1)
>>>>>                                           if (p > 0), exit
>>>>> OK works
>>>>>
>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>
>>>>> cpu1                                    cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>>                                           handle_cmd_timeout()
>>>>>                                           p-- (P=0), don't exit
>>>>>                                           mod_timer(), p++ (P=1)
>>>>>                                           write_abort_bit()
>>>>> handle_cmd_comletion(ABORT)
>>>>> del_timer(), ok, p-- (p = 0)
>>>>> handle_cmd_completion(STOP)
>>>>> del_timer(), fail, (P=0)
>>>>> handle_stopped_cmd_ring()
>>>>> cur_cmd = list_next(), p++ (=1)
>>>>> mod_timer()
>>>>>
>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>
>>>>
>>>>
>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>> explanation.
>>>>
>>>
>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with the
>>> completion handler:
>>>
>>> cpu1                                    cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires--                -- timer times out at same time--
>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>>                                          lock(xhci_lock)
>>>                                          p-- (=0)
>>>                                          p == 0, continue, even if we
>>> should
>>> not.
>>>                                            For this we still need to rely
>>> on
>>> checking cur_cmd == NULL in the timeout function.
>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>
>>
>> As I pointed out in patch 1 of this patchset, this patchset is based
>> on Lu Baolu's new fix patch:
>> usb: xhci: fix possible wild pointer
>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>
>> After applying Baolu's patch, after decrement the counter, we will
>> check the xhci->cur_command if is NULL. So in this situation:
>> cpu1                                    cpu2
>>
>>   queue_command(first), p++ (=1)
>>   --completion irq fires--                -- timer times out at same
>> time--
>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>   del_timer() fail, p (=1, nochange)
>>   no more commands, P (=1, nochange)
>>   unlock(xhci_lock)
>>                                           lock(xhci_lock)
>>                                           p-- (=0)
>>                                           no current command, return
>>                                           if (!xhci->current_cmd) {
>>                                                unlock(xhci_lock);
>>                                                return;
>>                                           }
>>
>> It can work.
>
>
> Yes,
>
> What I wanted to say is that as it relies on Baolus patch for that one case
> it seems that patch 2/2 can be replaced by a single line change:
>
> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>
> Right?

After checking the code again, I am agree with you. I will resend the
patch with fixing this issue. Thanks.

-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-19 12:13           ` Mathias Nyman
  2016-12-20  3:23             ` Baolin Wang
@ 2016-12-20  4:29             ` Lu Baolu
  2016-12-20  6:06               ` Baolin Wang
  1 sibling, 1 reply; 32+ messages in thread
From: Lu Baolu @ 2016-12-20  4:29 UTC (permalink / raw)
  To: Mathias Nyman, Baolin Wang, Mathias Nyman
  Cc: Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi Mathias,

On 12/19/2016 08:13 PM, Mathias Nyman wrote:
> On 19.12.2016 13:34, Baolin Wang wrote:
>> Hi Mathias,
>>
>> On 19 December 2016 at 18:33, Mathias Nyman
>> <mathias.nyman@linux.intel.com> wrote:
>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>
>>>> Hi Mathias,
>>>>
>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>
>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>> to the command timer that everything is fine and it should exit.
>>>>>
>>>>>
>>>>>
>>>>> Ah, right, this could actually happen.
>>>>>
>>>>>>
>>>>>>
>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>> del_timer()
>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>> fails,
>>>>>> we leave the number of pending commands alone.
>>>>>>
>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>> check
>>>>>> the counter after decrementing it, if the counter
>>>>>> (xhci->current_cmd_pending)
>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>> current
>>>>>> timeout command has been handled by host and host has fetched new
>>>>>> command
>>>>>> as
>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>
>>>>>
>>>>>
>>>>> A counter like this could work.
>>>>>
>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>> event, this seems to cover both.
>>>>>
>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>
>>>>> cpu1                                    cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>> --completion irq fires--                -- timer times out at same time--
>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> cur_cmd = list_next(), p++ (=2)
>>>>> unlock(xhci_lock)
>>>>>                                           lock(xhci_lock)
>>>>>                                           p-- (=1)
>>>>>                                           if (p > 0), exit
>>>>> OK works
>>>>>
>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>
>>>>> cpu1                                    cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> queue_command(more),
>>>>>                                           handle_cmd_timeout()
>>>>>                                           p-- (P=0), don't exit
>>>>>                                           mod_timer(), p++ (P=1)
>>>>>                                           write_abort_bit()
>>>>> handle_cmd_comletion(ABORT)
>>>>> del_timer(), ok, p-- (p = 0)
>>>>> handle_cmd_completion(STOP)
>>>>> del_timer(), fail, (P=0)
>>>>> handle_stopped_cmd_ring()
>>>>> cur_cmd = list_next(), p++ (=1)
>>>>> mod_timer()
>>>>>
>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>
>>>>
>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>> explanation.
>>>>
>>>
>>> Gave this some more thought over the weekend, and this implementation
>>> doesn't solve the case when the last command times out and races with the
>>> completion handler:
>>>
>>> cpu1                                    cpu2
>>>
>>> queue_command(first), p++ (=1)
>>> --completion irq fires--                -- timer times out at same time--
>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>> del_timer() fail, p (=1, nochange)
>>> no more commands, P (=1, nochange)
>>> unlock(xhci_lock)
>>>                                          lock(xhci_lock)
>>>                                          p-- (=0)
>>>                                          p == 0, continue, even if we should
>>> not.
>>>                                            For this we still need to rely on
>>> checking cur_cmd == NULL in the timeout function.
>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>
>> As I pointed out in patch 1 of this patchset, this patchset is based
>> on Lu Baolu's new fix patch:
>> usb: xhci: fix possible wild pointer
>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>
>> After applying Baolu's patch, after decrement the counter, we will
>> check the xhci->cur_command if is NULL. So in this situation:
>> cpu1                                    cpu2
>>
>>   queue_command(first), p++ (=1)
>>   --completion irq fires--                -- timer times out at same time--
>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>   del_timer() fail, p (=1, nochange)
>>   no more commands, P (=1, nochange)
>>   unlock(xhci_lock)
>>                                           lock(xhci_lock)
>>                                           p-- (=0)
>>                                           no current command, return
>>                                           if (!xhci->current_cmd) {
>>                                                unlock(xhci_lock);
>>                                                return;
>>                                           }
>>
>> It can work.
>
> Yes,
>
> What I wanted to say is that as it relies on Baolus patch for that one case
> it seems that patch 2/2 can be replaced by a single line change:
>
> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>
> Right?
>
> -Mathias
>

It seems that the watch dog algorithm for command queue becomes
more and more complicated and hard for maintain. I am also seeing
another case where a command may lose the chance to be tracked by
the watch dog timer.

Say,

queue_command(the only command in queue)
  - completion irq fires--                - timer times out at same time--      - another command enqueue--
  - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)
  - del_timer() fail
  - free the command and
    set current_cmd to NULL
  - unlock(xhci_lock)
                                                                                                                - lock(xhci_lock)
                                                                                                                - queue_command()(timer will
                                                                                                                   not rescheduled since the timer
                                                                                                                   is pending)
                                                     - lock(xhci_lock)
                                                     - no current command
                                                     - return

As the result, the later command isn't under track of the watch dog.
If hardware fails to response to this command, kernel will hang in
the thread which is waiting for the completion of the command.

I can write a patch to fix this and cc stable kernel as well. For long
term, in order to make it simple and easy to maintain, how about
allocating a watch dog timer for each command? It could be part
of the command structure and be managed just like the life cycle
of a command structure.

I can write a patch for review and discussion, if you think this
change is possible.

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20  4:29             ` Lu Baolu
@ 2016-12-20  6:06               ` Baolin Wang
  2016-12-20  6:39                 ` Lu Baolu
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-20  6:06 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Mathias Nyman, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi,

On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> Hi Mathias,
>
> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>> On 19.12.2016 13:34, Baolin Wang wrote:
>>> Hi Mathias,
>>>
>>> On 19 December 2016 at 18:33, Mathias Nyman
>>> <mathias.nyman@linux.intel.com> wrote:
>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>
>>>>> Hi Mathias,
>>>>>
>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>
>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>
>>>>>>>
>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Ah, right, this could actually happen.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>> del_timer()
>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>> fails,
>>>>>>> we leave the number of pending commands alone.
>>>>>>>
>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>> check
>>>>>>> the counter after decrementing it, if the counter
>>>>>>> (xhci->current_cmd_pending)
>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>> current
>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>> command
>>>>>>> as
>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>
>>>>>>
>>>>>>
>>>>>> A counter like this could work.
>>>>>>
>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>> event, this seems to cover both.
>>>>>>
>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>
>>>>>> cpu1                                    cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> queue_command(more),
>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>>> del_timer() fail, p (=1, nochange)
>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>> unlock(xhci_lock)
>>>>>>                                           lock(xhci_lock)
>>>>>>                                           p-- (=1)
>>>>>>                                           if (p > 0), exit
>>>>>> OK works
>>>>>>
>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>
>>>>>> cpu1                                    cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> queue_command(more),
>>>>>>                                           handle_cmd_timeout()
>>>>>>                                           p-- (P=0), don't exit
>>>>>>                                           mod_timer(), p++ (P=1)
>>>>>>                                           write_abort_bit()
>>>>>> handle_cmd_comletion(ABORT)
>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>> handle_cmd_completion(STOP)
>>>>>> del_timer(), fail, (P=0)
>>>>>> handle_stopped_cmd_ring()
>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>> mod_timer()
>>>>>>
>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>
>>>>>
>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>> explanation.
>>>>>
>>>>
>>>> Gave this some more thought over the weekend, and this implementation
>>>> doesn't solve the case when the last command times out and races with the
>>>> completion handler:
>>>>
>>>> cpu1                                    cpu2
>>>>
>>>> queue_command(first), p++ (=1)
>>>> --completion irq fires--                -- timer times out at same time--
>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>>> del_timer() fail, p (=1, nochange)
>>>> no more commands, P (=1, nochange)
>>>> unlock(xhci_lock)
>>>>                                          lock(xhci_lock)
>>>>                                          p-- (=0)
>>>>                                          p == 0, continue, even if we should
>>>> not.
>>>>                                            For this we still need to rely on
>>>> checking cur_cmd == NULL in the timeout function.
>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>
>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>> on Lu Baolu's new fix patch:
>>> usb: xhci: fix possible wild pointer
>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>
>>> After applying Baolu's patch, after decrement the counter, we will
>>> check the xhci->cur_command if is NULL. So in this situation:
>>> cpu1                                    cpu2
>>>
>>>   queue_command(first), p++ (=1)
>>>   --completion irq fires--                -- timer times out at same time--
>>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>>   del_timer() fail, p (=1, nochange)
>>>   no more commands, P (=1, nochange)
>>>   unlock(xhci_lock)
>>>                                           lock(xhci_lock)
>>>                                           p-- (=0)
>>>                                           no current command, return
>>>                                           if (!xhci->current_cmd) {
>>>                                                unlock(xhci_lock);
>>>                                                return;
>>>                                           }
>>>
>>> It can work.
>>
>> Yes,
>>
>> What I wanted to say is that as it relies on Baolus patch for that one case
>> it seems that patch 2/2 can be replaced by a single line change:
>>
>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>
>> Right?
>>
>> -Mathias
>>
>
> It seems that the watch dog algorithm for command queue becomes
> more and more complicated and hard for maintain. I am also seeing
> another case where a command may lose the chance to be tracked by
> the watch dog timer.
>
> Say,
>
> queue_command(the only command in queue)
>   - completion irq fires--                - timer times out at same time--      - another command enqueue--
>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)
>   - del_timer() fail
>   - free the command and
>     set current_cmd to NULL
>   - unlock(xhci_lock)
>                                                                                                                 - lock(xhci_lock)
>                                                                                                                 - queue_command()(timer will
>                                                                                                                    not rescheduled since the timer
>                                                                                                                    is pending)

In this case, since the command timer was fired and you did not re-add
the command timer, why here timer is pending? Maybe I missed
something? Thanks.

>                                                      - lock(xhci_lock)
>                                                      - no current command
>                                                      - return
>
> As the result, the later command isn't under track of the watch dog.
> If hardware fails to response to this command, kernel will hang in
> the thread which is waiting for the completion of the command.
>
> I can write a patch to fix this and cc stable kernel as well. For long
> term, in order to make it simple and easy to maintain, how about
> allocating a watch dog timer for each command? It could be part
> of the command structure and be managed just like the life cycle
> of a command structure.
>
> I can write a patch for review and discussion, if you think this
> change is possible.
>
> Best regards,
> Lu Baolu



-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20  6:06               ` Baolin Wang
@ 2016-12-20  6:39                 ` Lu Baolu
  2016-12-20  6:46                   ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Lu Baolu @ 2016-12-20  6:39 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Mathias Nyman, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi,

On 12/20/2016 02:06 PM, Baolin Wang wrote:
> Hi,
>
> On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>> Hi Mathias,
>>
>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>> Hi Mathias,
>>>>
>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>> Hi Mathias,
>>>>>>
>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>
>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>
>>>>>>>
>>>>>>> Ah, right, this could actually happen.
>>>>>>>
>>>>>>>>
>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>> del_timer()
>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>> fails,
>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>
>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>> check
>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>> current
>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>> command
>>>>>>>> as
>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>
>>>>>>>
>>>>>>> A counter like this could work.
>>>>>>>
>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>> event, this seems to cover both.
>>>>>>>
>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>
>>>>>>> cpu1                                    cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> queue_command(more),
>>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>> unlock(xhci_lock)
>>>>>>>                                           lock(xhci_lock)
>>>>>>>                                           p-- (=1)
>>>>>>>                                           if (p > 0), exit
>>>>>>> OK works
>>>>>>>
>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>
>>>>>>> cpu1                                    cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> queue_command(more),
>>>>>>>                                           handle_cmd_timeout()
>>>>>>>                                           p-- (P=0), don't exit
>>>>>>>                                           mod_timer(), p++ (P=1)
>>>>>>>                                           write_abort_bit()
>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>> handle_cmd_completion(STOP)
>>>>>>> del_timer(), fail, (P=0)
>>>>>>> handle_stopped_cmd_ring()
>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>> mod_timer()
>>>>>>>
>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>
>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>> explanation.
>>>>>>
>>>>> Gave this some more thought over the weekend, and this implementation
>>>>> doesn't solve the case when the last command times out and races with the
>>>>> completion handler:
>>>>>
>>>>> cpu1                                    cpu2
>>>>>
>>>>> queue_command(first), p++ (=1)
>>>>> --completion irq fires--                -- timer times out at same time--
>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>>>> del_timer() fail, p (=1, nochange)
>>>>> no more commands, P (=1, nochange)
>>>>> unlock(xhci_lock)
>>>>>                                          lock(xhci_lock)
>>>>>                                          p-- (=0)
>>>>>                                          p == 0, continue, even if we should
>>>>> not.
>>>>>                                            For this we still need to rely on
>>>>> checking cur_cmd == NULL in the timeout function.
>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>> on Lu Baolu's new fix patch:
>>>> usb: xhci: fix possible wild pointer
>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>
>>>> After applying Baolu's patch, after decrement the counter, we will
>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>> cpu1                                    cpu2
>>>>
>>>>   queue_command(first), p++ (=1)
>>>>   --completion irq fires--                -- timer times out at same time--
>>>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>>>   del_timer() fail, p (=1, nochange)
>>>>   no more commands, P (=1, nochange)
>>>>   unlock(xhci_lock)
>>>>                                           lock(xhci_lock)
>>>>                                           p-- (=0)
>>>>                                           no current command, return
>>>>                                           if (!xhci->current_cmd) {
>>>>                                                unlock(xhci_lock);
>>>>                                                return;
>>>>                                           }
>>>>
>>>> It can work.
>>> Yes,
>>>
>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>> it seems that patch 2/2 can be replaced by a single line change:
>>>
>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>
>>> Right?
>>>
>>> -Mathias
>>>
>> It seems that the watch dog algorithm for command queue becomes
>> more and more complicated and hard for maintain. I am also seeing
>> another case where a command may lose the chance to be tracked by
>> the watch dog timer.
>>
>> Say,
>>
>> queue_command(the only command in queue)
>>   - completion irq fires--                - timer times out at same time--      - another command enqueue--
>>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)
>>   - del_timer() fail
>>   - free the command and
>>     set current_cmd to NULL
>>   - unlock(xhci_lock)
>>                                                                                                                 - lock(xhci_lock)
>>                                                                                                                 - queue_command()(timer will
>>                                                                                                                    not rescheduled since the timer
>>                                                                                                                    is pending)
> In this case, since the command timer was fired and you did not re-add
> the command timer, why here timer is pending? Maybe I missed
> something? Thanks.

In queue_command(),

        /* if there are no other commands queued we start the timeout timer */
        if (list_is_singular(&xhci->cmd_list) &&
            !timer_pending(&xhci->cmd_timer)) {
                xhci->current_cmd = cmd;
                mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
        }

timer_pending() will return true if the timer is fired, but the function is still
running on another CPU. Do I understand it right?

Best regards,
Lu Baolu

>>                                                      - lock(xhci_lock)
>>                                                      - no current command
>>                                                      - return
>>
>> As the result, the later command isn't under track of the watch dog.
>> If hardware fails to response to this command, kernel will hang in
>> the thread which is waiting for the completion of the command.
>>
>> I can write a patch to fix this and cc stable kernel as well. For long
>> term, in order to make it simple and easy to maintain, how about
>> allocating a watch dog timer for each command? It could be part
>> of the command structure and be managed just like the life cycle
>> of a command structure.
>>
>> I can write a patch for review and discussion, if you think this
>> change is possible.
>>
>> Best regards,
>> Lu Baolu
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20  6:39                 ` Lu Baolu
@ 2016-12-20  6:46                   ` Baolin Wang
  2016-12-20  7:18                     ` Lu Baolu
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-20  6:46 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Mathias Nyman, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

On 20 December 2016 at 14:39, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> Hi,
>
> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>> Hi,
>>
>> On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>> Hi Mathias,
>>>
>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>>> Hi Mathias,
>>>>>
>>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>>> Hi Mathias,
>>>>>>>
>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>>
>>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>>
>>>>>>>>
>>>>>>>> Ah, right, this could actually happen.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>>> del_timer()
>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>>> fails,
>>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>>
>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>>> check
>>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>>> current
>>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>>> command
>>>>>>>>> as
>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>>
>>>>>>>>
>>>>>>>> A counter like this could work.
>>>>>>>>
>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>>> event, this seems to cover both.
>>>>>>>>
>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>>
>>>>>>>> cpu1                                    cpu2
>>>>>>>>
>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>> queue_command(more),
>>>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>>> unlock(xhci_lock)
>>>>>>>>                                           lock(xhci_lock)
>>>>>>>>                                           p-- (=1)
>>>>>>>>                                           if (p > 0), exit
>>>>>>>> OK works
>>>>>>>>
>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>>
>>>>>>>> cpu1                                    cpu2
>>>>>>>>
>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>> queue_command(more),
>>>>>>>>                                           handle_cmd_timeout()
>>>>>>>>                                           p-- (P=0), don't exit
>>>>>>>>                                           mod_timer(), p++ (P=1)
>>>>>>>>                                           write_abort_bit()
>>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>>> handle_cmd_completion(STOP)
>>>>>>>> del_timer(), fail, (P=0)
>>>>>>>> handle_stopped_cmd_ring()
>>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>>> mod_timer()
>>>>>>>>
>>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>>
>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>>> explanation.
>>>>>>>
>>>>>> Gave this some more thought over the weekend, and this implementation
>>>>>> doesn't solve the case when the last command times out and races with the
>>>>>> completion handler:
>>>>>>
>>>>>> cpu1                                    cpu2
>>>>>>
>>>>>> queue_command(first), p++ (=1)
>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>>>>> del_timer() fail, p (=1, nochange)
>>>>>> no more commands, P (=1, nochange)
>>>>>> unlock(xhci_lock)
>>>>>>                                          lock(xhci_lock)
>>>>>>                                          p-- (=0)
>>>>>>                                          p == 0, continue, even if we should
>>>>>> not.
>>>>>>                                            For this we still need to rely on
>>>>>> checking cur_cmd == NULL in the timeout function.
>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>>> on Lu Baolu's new fix patch:
>>>>> usb: xhci: fix possible wild pointer
>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>>
>>>>> After applying Baolu's patch, after decrement the counter, we will
>>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>>> cpu1                                    cpu2
>>>>>
>>>>>   queue_command(first), p++ (=1)
>>>>>   --completion irq fires--                -- timer times out at same time--
>>>>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>>>>   del_timer() fail, p (=1, nochange)
>>>>>   no more commands, P (=1, nochange)
>>>>>   unlock(xhci_lock)
>>>>>                                           lock(xhci_lock)
>>>>>                                           p-- (=0)
>>>>>                                           no current command, return
>>>>>                                           if (!xhci->current_cmd) {
>>>>>                                                unlock(xhci_lock);
>>>>>                                                return;
>>>>>                                           }
>>>>>
>>>>> It can work.
>>>> Yes,
>>>>
>>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>>> it seems that patch 2/2 can be replaced by a single line change:
>>>>
>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>>
>>>> Right?
>>>>
>>>> -Mathias
>>>>
>>> It seems that the watch dog algorithm for command queue becomes
>>> more and more complicated and hard for maintain. I am also seeing
>>> another case where a command may lose the chance to be tracked by
>>> the watch dog timer.
>>>
>>> Say,
>>>
>>> queue_command(the only command in queue)
>>>   - completion irq fires--                - timer times out at same time--      - another command enqueue--
>>>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)
>>>   - del_timer() fail
>>>   - free the command and
>>>     set current_cmd to NULL
>>>   - unlock(xhci_lock)
>>>                                                                                                                 - lock(xhci_lock)
>>>                                                                                                                 - queue_command()(timer will
>>>                                                                                                                    not rescheduled since the timer
>>>                                                                                                                    is pending)
>> In this case, since the command timer was fired and you did not re-add
>> the command timer, why here timer is pending? Maybe I missed
>> something? Thanks.
>
> In queue_command(),
>
>         /* if there are no other commands queued we start the timeout timer */
>         if (list_is_singular(&xhci->cmd_list) &&
>             !timer_pending(&xhci->cmd_timer)) {
>                 xhci->current_cmd = cmd;
>                 mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>         }
>
> timer_pending() will return true if the timer is fired, but the function is still
> running on another CPU. Do I understand it right?

>From my understanding, if the timer was fired, no matter the timeout
function is running or finished, timer_pending() will return false.
Please correct me if I made mistakes. Thanks.

>
> Best regards,
> Lu Baolu
>
>>>                                                      - lock(xhci_lock)
>>>                                                      - no current command
>>>                                                      - return
>>>
>>> As the result, the later command isn't under track of the watch dog.
>>> If hardware fails to response to this command, kernel will hang in
>>> the thread which is waiting for the completion of the command.
>>>
>>> I can write a patch to fix this and cc stable kernel as well. For long
>>> term, in order to make it simple and easy to maintain, how about
>>> allocating a watch dog timer for each command? It could be part
>>> of the command structure and be managed just like the life cycle
>>> of a command structure.
>>>
>>> I can write a patch for review and discussion, if you think this
>>> change is possible.
>>>
>>> Best regards,
>>> Lu Baolu
>>
>>
>



-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20  6:46                   ` Baolin Wang
@ 2016-12-20  7:18                     ` Lu Baolu
  2016-12-20  7:30                       ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Lu Baolu @ 2016-12-20  7:18 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Mathias Nyman, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi,

On 12/20/2016 02:46 PM, Baolin Wang wrote:
> On 20 December 2016 at 14:39, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>> Hi,
>>
>> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>>> Hi,
>>>
>>> On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>> Hi Mathias,
>>>>
>>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>>>> Hi Mathias,
>>>>>>
>>>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>>>> Hi Mathias,
>>>>>>>>
>>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>>>
>>>>>>>>> Ah, right, this could actually happen.
>>>>>>>>>
>>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>>>> del_timer()
>>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>>>> fails,
>>>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>>>
>>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>>>> check
>>>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>>>> current
>>>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>>>> command
>>>>>>>>>> as
>>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>>>
>>>>>>>>> A counter like this could work.
>>>>>>>>>
>>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>>>> event, this seems to cover both.
>>>>>>>>>
>>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>>>
>>>>>>>>> cpu1                                    cpu2
>>>>>>>>>
>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>> queue_command(more),
>>>>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>>>> unlock(xhci_lock)
>>>>>>>>>                                           lock(xhci_lock)
>>>>>>>>>                                           p-- (=1)
>>>>>>>>>                                           if (p > 0), exit
>>>>>>>>> OK works
>>>>>>>>>
>>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>>>
>>>>>>>>> cpu1                                    cpu2
>>>>>>>>>
>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>> queue_command(more),
>>>>>>>>>                                           handle_cmd_timeout()
>>>>>>>>>                                           p-- (P=0), don't exit
>>>>>>>>>                                           mod_timer(), p++ (P=1)
>>>>>>>>>                                           write_abort_bit()
>>>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>>>> handle_cmd_completion(STOP)
>>>>>>>>> del_timer(), fail, (P=0)
>>>>>>>>> handle_stopped_cmd_ring()
>>>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>>>> mod_timer()
>>>>>>>>>
>>>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>>>> explanation.
>>>>>>>>
>>>>>>> Gave this some more thought over the weekend, and this implementation
>>>>>>> doesn't solve the case when the last command times out and races with the
>>>>>>> completion handler:
>>>>>>>
>>>>>>> cpu1                                    cpu2
>>>>>>>
>>>>>>> queue_command(first), p++ (=1)
>>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>> no more commands, P (=1, nochange)
>>>>>>> unlock(xhci_lock)
>>>>>>>                                          lock(xhci_lock)
>>>>>>>                                          p-- (=0)
>>>>>>>                                          p == 0, continue, even if we should
>>>>>>> not.
>>>>>>>                                            For this we still need to rely on
>>>>>>> checking cur_cmd == NULL in the timeout function.
>>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>>>> on Lu Baolu's new fix patch:
>>>>>> usb: xhci: fix possible wild pointer
>>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>>>
>>>>>> After applying Baolu's patch, after decrement the counter, we will
>>>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>>>> cpu1                                    cpu2
>>>>>>
>>>>>>   queue_command(first), p++ (=1)
>>>>>>   --completion irq fires--                -- timer times out at same time--
>>>>>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>>>>>   del_timer() fail, p (=1, nochange)
>>>>>>   no more commands, P (=1, nochange)
>>>>>>   unlock(xhci_lock)
>>>>>>                                           lock(xhci_lock)
>>>>>>                                           p-- (=0)
>>>>>>                                           no current command, return
>>>>>>                                           if (!xhci->current_cmd) {
>>>>>>                                                unlock(xhci_lock);
>>>>>>                                                return;
>>>>>>                                           }
>>>>>>
>>>>>> It can work.
>>>>> Yes,
>>>>>
>>>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>>>> it seems that patch 2/2 can be replaced by a single line change:
>>>>>
>>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>>>
>>>>> Right?
>>>>>
>>>>> -Mathias
>>>>>
>>>> It seems that the watch dog algorithm for command queue becomes
>>>> more and more complicated and hard for maintain. I am also seeing
>>>> another case where a command may lose the chance to be tracked by
>>>> the watch dog timer.
>>>>
>>>> Say,
>>>>
>>>> queue_command(the only command in queue)
>>>>   - completion irq fires--                - timer times out at same time--      - another command enqueue--
>>>>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)
>>>>   - del_timer() fail
>>>>   - free the command and
>>>>     set current_cmd to NULL
>>>>   - unlock(xhci_lock)
>>>>                                                                                                                 - lock(xhci_lock)
>>>>                                                                                                                 - queue_command()(timer will
>>>>                                                                                                                    not rescheduled since the timer
>>>>                                                                                                                    is pending)
>>> In this case, since the command timer was fired and you did not re-add
>>> the command timer, why here timer is pending? Maybe I missed
>>> something? Thanks.
>> In queue_command(),
>>
>>         /* if there are no other commands queued we start the timeout timer */
>>         if (list_is_singular(&xhci->cmd_list) &&
>>             !timer_pending(&xhci->cmd_timer)) {
>>                 xhci->current_cmd = cmd;
>>                 mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>>         }
>>
>> timer_pending() will return true if the timer is fired, but the function is still
>> running on another CPU. Do I understand it right?
> >From my understanding, if the timer was fired, no matter the timeout
> function is running or finished, timer_pending() will return false.
> Please correct me if I made mistakes. Thanks.

Just looked into kernel/time/timer.c. You are right.

The pending is cleared in expire_timers() before call the timer function.

Base on this fact, we don't need to check timer_pending() at all in below code.

        /* if there are no other commands queued we start the timeout timer */
        if (xhci->cmd_list.next == &cmd->cmd_list &&
            !timer_pending(&xhci->cmd_timer)) {
                xhci->current_cmd = cmd;
                mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
        }


Best regards,
Lu Baolu

>
>> Best regards,
>> Lu Baolu
>>
>>>>                                                      - lock(xhci_lock)
>>>>                                                      - no current command
>>>>                                                      - return
>>>>
>>>> As the result, the later command isn't under track of the watch dog.
>>>> If hardware fails to response to this command, kernel will hang in
>>>> the thread which is waiting for the completion of the command.
>>>>
>>>> I can write a patch to fix this and cc stable kernel as well. For long
>>>> term, in order to make it simple and easy to maintain, how about
>>>> allocating a watch dog timer for each command? It could be part
>>>> of the command structure and be managed just like the life cycle
>>>> of a command structure.
>>>>
>>>> I can write a patch for review and discussion, if you think this
>>>> change is possible.
>>>>
>>>> Best regards,
>>>> Lu Baolu
>>>
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20  7:18                     ` Lu Baolu
@ 2016-12-20  7:30                       ` Baolin Wang
  2016-12-20 15:13                         ` Mathias Nyman
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-20  7:30 UTC (permalink / raw)
  To: Lu Baolu
  Cc: Mathias Nyman, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi,

On 20 December 2016 at 15:18, Lu Baolu <baolu.lu@linux.intel.com> wrote:
> Hi,
>
> On 12/20/2016 02:46 PM, Baolin Wang wrote:
>> On 20 December 2016 at 14:39, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>> Hi,
>>>
>>> On 12/20/2016 02:06 PM, Baolin Wang wrote:
>>>> Hi,
>>>>
>>>> On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote:
>>>>> Hi Mathias,
>>>>>
>>>>> On 12/19/2016 08:13 PM, Mathias Nyman wrote:
>>>>>> On 19.12.2016 13:34, Baolin Wang wrote:
>>>>>>> Hi Mathias,
>>>>>>>
>>>>>>> On 19 December 2016 at 18:33, Mathias Nyman
>>>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>>> On 13.12.2016 05:21, Baolin Wang wrote:
>>>>>>>>> Hi Mathias,
>>>>>>>>>
>>>>>>>>> On 12 December 2016 at 23:52, Mathias Nyman
>>>>>>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote:
>>>>>>>>>>> If a command event is found on the event ring during an interrupt,
>>>>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer()
>>>>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then
>>>>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout()
>>>>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in
>>>>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal
>>>>>>>>>>> to the command timer that everything is fine and it should exit.
>>>>>>>>>>
>>>>>>>>>> Ah, right, this could actually happen.
>>>>>>>>>>
>>>>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number
>>>>>>>>>>> of pending commands. If we need to cancel the command timer and
>>>>>>>>>>> del_timer()
>>>>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer()
>>>>>>>>>>> fails,
>>>>>>>>>>> we leave the number of pending commands alone.
>>>>>>>>>>>
>>>>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will
>>>>>>>>>>> check
>>>>>>>>>>> the counter after decrementing it, if the counter
>>>>>>>>>>> (xhci->current_cmd_pending)
>>>>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the
>>>>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means
>>>>>>>>>>> current
>>>>>>>>>>> timeout command has been handled by host and host has fetched new
>>>>>>>>>>> command
>>>>>>>>>>> as
>>>>>>>>>>> xhci->current_cmd, then just return and wait for new current command.
>>>>>>>>>>
>>>>>>>>>> A counter like this could work.
>>>>>>>>>>
>>>>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP
>>>>>>>>>> event, this seems to cover both.
>>>>>>>>>>
>>>>>>>>>> quick check, case 1: timeout and cmd completion at the same time.
>>>>>>>>>>
>>>>>>>>>> cpu1                                    cpu2
>>>>>>>>>>
>>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>>> queue_command(more),
>>>>>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>>>>> lock(xhci_lock  )                       spin_on(xhci_lock)
>>>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>>>> cur_cmd = list_next(), p++ (=2)
>>>>>>>>>> unlock(xhci_lock)
>>>>>>>>>>                                           lock(xhci_lock)
>>>>>>>>>>                                           p-- (=1)
>>>>>>>>>>                                           if (p > 0), exit
>>>>>>>>>> OK works
>>>>>>>>>>
>>>>>>>>>> case 2: normal timeout case with ABORT+STOP, no race.
>>>>>>>>>>
>>>>>>>>>> cpu1                                    cpu2
>>>>>>>>>>
>>>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>>>> queue_command(more),
>>>>>>>>>>                                           handle_cmd_timeout()
>>>>>>>>>>                                           p-- (P=0), don't exit
>>>>>>>>>>                                           mod_timer(), p++ (P=1)
>>>>>>>>>>                                           write_abort_bit()
>>>>>>>>>> handle_cmd_comletion(ABORT)
>>>>>>>>>> del_timer(), ok, p-- (p = 0)
>>>>>>>>>> handle_cmd_completion(STOP)
>>>>>>>>>> del_timer(), fail, (P=0)
>>>>>>>>>> handle_stopped_cmd_ring()
>>>>>>>>>> cur_cmd = list_next(), p++ (=1)
>>>>>>>>>> mod_timer()
>>>>>>>>>>
>>>>>>>>>> OK, works, and same for just STOP case, with the only difference that
>>>>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--)
>>>>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit
>>>>>>>>> explanation.
>>>>>>>>>
>>>>>>>> Gave this some more thought over the weekend, and this implementation
>>>>>>>> doesn't solve the case when the last command times out and races with the
>>>>>>>> completion handler:
>>>>>>>>
>>>>>>>> cpu1                                    cpu2
>>>>>>>>
>>>>>>>> queue_command(first), p++ (=1)
>>>>>>>> --completion irq fires--                -- timer times out at same time--
>>>>>>>> handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>>> lock(xhci_lock )                        spin_on(xhci_lock)
>>>>>>>> del_timer() fail, p (=1, nochange)
>>>>>>>> no more commands, P (=1, nochange)
>>>>>>>> unlock(xhci_lock)
>>>>>>>>                                          lock(xhci_lock)
>>>>>>>>                                          p-- (=0)
>>>>>>>>                                          p == 0, continue, even if we should
>>>>>>>> not.
>>>>>>>>                                            For this we still need to rely on
>>>>>>>> checking cur_cmd == NULL in the timeout function.
>>>>>>>> (Baolus patch sets it to NULL if there are no more commands pending)
>>>>>>> As I pointed out in patch 1 of this patchset, this patchset is based
>>>>>>> on Lu Baolu's new fix patch:
>>>>>>> usb: xhci: fix possible wild pointer
>>>>>>> https://www.spinics.net/lists/linux-usb/msg150219.html
>>>>>>>
>>>>>>> After applying Baolu's patch, after decrement the counter, we will
>>>>>>> check the xhci->cur_command if is NULL. So in this situation:
>>>>>>> cpu1                                    cpu2
>>>>>>>
>>>>>>>   queue_command(first), p++ (=1)
>>>>>>>   --completion irq fires--                -- timer times out at same time--
>>>>>>>   handle_cmd_completion()                 handle_cmd_timeout(),)
>>>>>>>   lock(xhci_lock )                        spin_on(xhci_lock)
>>>>>>>   del_timer() fail, p (=1, nochange)
>>>>>>>   no more commands, P (=1, nochange)
>>>>>>>   unlock(xhci_lock)
>>>>>>>                                           lock(xhci_lock)
>>>>>>>                                           p-- (=0)
>>>>>>>                                           no current command, return
>>>>>>>                                           if (!xhci->current_cmd) {
>>>>>>>                                                unlock(xhci_lock);
>>>>>>>                                                return;
>>>>>>>                                           }
>>>>>>>
>>>>>>> It can work.
>>>>>> Yes,
>>>>>>
>>>>>> What I wanted to say is that as it relies on Baolus patch for that one case
>>>>>> it seems that patch 2/2 can be replaced by a single line change:
>>>>>>
>>>>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer))
>>>>>>
>>>>>> Right?
>>>>>>
>>>>>> -Mathias
>>>>>>
>>>>> It seems that the watch dog algorithm for command queue becomes
>>>>> more and more complicated and hard for maintain. I am also seeing
>>>>> another case where a command may lose the chance to be tracked by
>>>>> the watch dog timer.
>>>>>
>>>>> Say,
>>>>>
>>>>> queue_command(the only command in queue)
>>>>>   - completion irq fires--                - timer times out at same time--      - another command enqueue--
>>>>>   - lock(xhci_lock )                         - spin_on(xhci_lock)                           - spin_on(xhci_lock)
>>>>>   - del_timer() fail
>>>>>   - free the command and
>>>>>     set current_cmd to NULL
>>>>>   - unlock(xhci_lock)
>>>>>                                                                                                                 - lock(xhci_lock)
>>>>>                                                                                                                 - queue_command()(timer will
>>>>>                                                                                                                    not rescheduled since the timer
>>>>>                                                                                                                    is pending)
>>>> In this case, since the command timer was fired and you did not re-add
>>>> the command timer, why here timer is pending? Maybe I missed
>>>> something? Thanks.
>>> In queue_command(),
>>>
>>>         /* if there are no other commands queued we start the timeout timer */
>>>         if (list_is_singular(&xhci->cmd_list) &&
>>>             !timer_pending(&xhci->cmd_timer)) {
>>>                 xhci->current_cmd = cmd;
>>>                 mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>>>         }
>>>
>>> timer_pending() will return true if the timer is fired, but the function is still
>>> running on another CPU. Do I understand it right?
>> >From my understanding, if the timer was fired, no matter the timeout
>> function is running or finished, timer_pending() will return false.
>> Please correct me if I made mistakes. Thanks.
>
> Just looked into kernel/time/timer.c. You are right.
>
> The pending is cleared in expire_timers() before call the timer function.
>
> Base on this fact, we don't need to check timer_pending() at all in below code.

Indeed, I think so too.

>
>         /* if there are no other commands queued we start the timeout timer */
>         if (xhci->cmd_list.next == &cmd->cmd_list &&
>             !timer_pending(&xhci->cmd_timer)) {
>                 xhci->current_cmd = cmd;
>                 mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
>         }
>
>
> Best regards,
> Lu Baolu
>
>>
>>> Best regards,
>>> Lu Baolu
>>>
>>>>>                                                      - lock(xhci_lock)
>>>>>                                                      - no current command
>>>>>                                                      - return
>>>>>
>>>>> As the result, the later command isn't under track of the watch dog.
>>>>> If hardware fails to response to this command, kernel will hang in
>>>>> the thread which is waiting for the completion of the command.
>>>>>
>>>>> I can write a patch to fix this and cc stable kernel as well. For long
>>>>> term, in order to make it simple and easy to maintain, how about
>>>>> allocating a watch dog timer for each command? It could be part
>>>>> of the command structure and be managed just like the life cycle
>>>>> of a command structure.
>>>>>
>>>>> I can write a patch for review and discussion, if you think this
>>>>> change is possible.
>>>>>
>>>>> Best regards,
>>>>> Lu Baolu
>>>>
>>
>>
>



-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20  7:30                       ` Baolin Wang
@ 2016-12-20 15:13                         ` Mathias Nyman
  2016-12-21  2:22                           ` Baolin Wang
                                             ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Mathias Nyman @ 2016-12-20 15:13 UTC (permalink / raw)
  To: Baolin Wang, Lu Baolu
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and  make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20 15:13                         ` Mathias Nyman
@ 2016-12-21  2:22                           ` Baolin Wang
  2016-12-21 13:00                             ` Mathias Nyman
  2016-12-21  6:17                           ` Lu Baolu
  2016-12-21  6:57                           ` Lu Baolu
  2 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-21  2:22 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Lu Baolu, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi Mathias,

On 20 December 2016 at 23:13, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
> timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my life
> easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of
> counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
> command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.

How about applying below patch in your 'timeout_race_fixes' branch?
[PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
https://lkml.org/lkml/2016/12/13/94

>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races

I will try to test these patches.

-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20 15:13                         ` Mathias Nyman
  2016-12-21  2:22                           ` Baolin Wang
@ 2016-12-21  6:17                           ` Lu Baolu
  2016-12-21 12:48                             ` Mathias Nyman
  2016-12-21  6:57                           ` Lu Baolu
  2 siblings, 1 reply; 32+ messages in thread
From: Lu Baolu @ 2016-12-21  6:17 UTC (permalink / raw)
  To: Mathias Nyman, Baolin Wang
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi Mathias,

I have some comments for the implementation of xhci_abort_cmd_ring() below.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my life easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.
>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races
>
> -Mathias
>
>

Below is the latest code. I put my comments in line.

 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
 323 {
 324         u64 temp_64;
 325         int ret;
 326
 327         xhci_dbg(xhci, "Abort command ring\n");
 328
 329         reinit_completion(&xhci->cmd_ring_stop_completion);
 330
 331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
 332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
 333                         &xhci->op_regs->cmd_ring);

We should hold xhci->lock when we are modifying xhci registers
at runtime.


 334
 335         /* Section 4.6.1.2 of xHCI 1.0 spec says software should
 336          * time the completion od all xHCI commands, including

s/od/of/g

 337          * the Command Abort operation. If software doesn't see
 338          * CRR negated in a timely manner (e.g. longer than 5
 339          * seconds), then it should assume that the there are
 340          * larger problems with the xHC and assert HCRST.
 341          */
 342         ret = xhci_handshake(&xhci->op_regs->cmd_ring,
 343                         CMD_RING_RUNNING, 0, 5 * 1000 * 1000);
 344         if (ret < 0) {
 345                 /* we are about to kill xhci, give it one more chance */

The retry of setting CMD_RING_ABORT is not necessary according to
previous discussion. We have cleaned code for second try in
xhci_handle_command_timeout(). Need to clean up here as well.

 346                 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
 347                               &xhci->op_regs->cmd_ring);
 348                 udelay(1000);
 349                 ret = xhci_handshake(&xhci->op_regs->cmd_ring,
 350                                      CMD_RING_RUNNING, 0, 3 * 1000 * 1000);
 351                 if (ret < 0) {
 352                         xhci_err(xhci, "Stopped the command ring failed, "
 353                                  "maybe the host is dead\n");
 354                         xhci->xhc_state |= XHCI_STATE_DYING;
 355                         xhci_halt(xhci);
 356                         return -ESHUTDOWN;
 357                 }
 358         }
 359         /*
 360          * Writing the CMD_RING_ABORT bit should cause a cmd completion event,
 361          * however on some host hw the CMD_RING_RUNNING bit is correctly cleared
 362          * but the completion event in never sent. Wait 2 secs (arbitrary

s/in never sent/is never sent/g

 363          * number) to handle those cases after negation of CMD_RING_RUNNING.
 364          */

This should be implemented with a quirk bit. It's not common for all host controllers.

 365         if (!wait_for_completion_timeout(&xhci->cmd_ring_stop_completion,
 366                                          2 * HZ)) {
 367                 xhci_dbg(xhci, "No stop event for abort, ring start fail?\n");
 368                 xhci_cleanup_command_queue(xhci);
 369         } else {
 370                 unsigned long flags;
 371
 372                 spin_lock_irqsave(&xhci->lock, flags);
 373                 xhci_handle_stopped_cmd_ring(xhci, xhci_next_queued_cmd(xhci));
 374                 spin_unlock_irqrestore(&xhci->lock, flags);
 375         }
 376         return 0;
 377 }

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-20 15:13                         ` Mathias Nyman
  2016-12-21  2:22                           ` Baolin Wang
  2016-12-21  6:17                           ` Lu Baolu
@ 2016-12-21  6:57                           ` Lu Baolu
  2016-12-21 12:57                             ` Mathias Nyman
  2 siblings, 1 reply; 32+ messages in thread
From: Lu Baolu @ 2016-12-21  6:57 UTC (permalink / raw)
  To: Mathias Nyman, Baolin Wang
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi Mathias,

I have some comments for the implementation of
xhci_handle_command_timeout() as well.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
> On 20.12.2016 09:30, Baolin Wang wrote:
> ...
>
> Alright, I gathered all current work related to xhci races and timeouts
> and put them into a branch:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>
> Its based on 4.9
> It includes a few other patches just to avoid conflicts and  make my life easier
>
> Interesting patches are:
>
> ee4eb91 xhci: remove unnecessary check for pending timer
> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
> 4f2535f xhci: Handle command completion and timeout race
> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
> 529a5a0 usb: xhci: fix possible wild pointer
> 4766555 xhci: Fix race related to abort operation
> de834a3 xhci: Use delayed_work instead of timer for command timeout
> 69973b8 Linux 4.9
>
> The fixes for command queue races will go to usb-linus and stable, the
> reworks for stop ep watchdog timer will go to usb-next.
>
> Still completely untested, (well it compiles)
>
> Felipe gave instructions how to modify dwc3 driver to timeout on address
> devicecommands to test these, I'll try to set that up.
>
> All additional testing is welcome, especially if you can trigger timeouts
> and races
>
> -Mathias
>
>

I post the code below and add my comments in line.

1276 void xhci_handle_command_timeout(struct work_struct *work)
1277 {
1278         struct xhci_hcd *xhci;
1279         int ret;
1280         unsigned long flags;
1281         u64 hw_ring_state;
1282
1283         xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
1284
1285         spin_lock_irqsave(&xhci->lock, flags);
1286
1287         /*
1288          * If timeout work is pending, or current_cmd is NULL, it means we
1289          * raced with command completion. Command is handled so just return.
1290          */
1291         if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
1292                 spin_unlock_irqrestore(&xhci->lock, flags);
1293                 return;
1294         }
1295         /* mark this command to be cancelled */
1296         xhci->current_cmd->status = COMP_CMD_ABORT;
1297
1298         /* Make sure command ring is running before aborting it */
1299         hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
1300         if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
1301             (hw_ring_state & CMD_RING_RUNNING))  {
1302                 /* Prevent new doorbell, and start command abort */
1303                 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
1304                 spin_unlock_irqrestore(&xhci->lock, flags);
1305                 xhci_dbg(xhci, "Command timeout\n");
1306                 ret = xhci_abort_cmd_ring(xhci);
1307                 if (unlikely(ret == -ESHUTDOWN)) {
1308                         xhci_err(xhci, "Abort command ring failed\n");
1309                         xhci_cleanup_command_queue(xhci);
1310                         usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
1311                         xhci_dbg(xhci, "xHCI host controller is dead.\n");
1312                 }
1313                 return;
1314         }
1315
1316         /* host removed. Bail out */
1317         if (xhci->xhc_state & XHCI_STATE_REMOVING) {
1318                 spin_unlock_irqrestore(&xhci->lock, flags);
1319                 xhci_dbg(xhci, "host removed, ring start fail?\n");
1320                 xhci_cleanup_command_queue(xhci);
1321                 return;
1322         }

I think this part of code should be moved up to line 1295.

1323
1324         /* command timeout on stopped ring, ring can't be aborted */
1325         xhci_dbg(xhci, "Command timeout on stopped ring\n");
1326         xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
1327         spin_unlock_irqrestore(&xhci->lock, flags);

This part of code is tricky. I have no idea about in which case should this
code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
here, right?

1328         return;
1329 }

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21  6:17                           ` Lu Baolu
@ 2016-12-21 12:48                             ` Mathias Nyman
  2016-12-21 14:10                               ` OGAWA Hirofumi
  2016-12-22  1:43                               ` Lu Baolu
  0 siblings, 2 replies; 32+ messages in thread
From: Mathias Nyman @ 2016-12-21 12:48 UTC (permalink / raw)
  To: Lu Baolu, Baolin Wang
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu,
	Alan Stern, OGAWA Hirofumi

On 21.12.2016 08:17, Lu Baolu wrote:
> Hi Mathias,
>
> I have some comments for the implementation of xhci_abort_cmd_ring() below.
>
> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>> On 20.12.2016 09:30, Baolin Wang wrote:
>> ...
>>
>> Alright, I gathered all current work related to xhci races and timeouts
>> and put them into a branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>
>> Its based on 4.9
>> It includes a few other patches just to avoid conflicts and  make my life easier
>>
>> Interesting patches are:
>>
>> ee4eb91 xhci: remove unnecessary check for pending timer
>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>> 4f2535f xhci: Handle command completion and timeout race
>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>> 529a5a0 usb: xhci: fix possible wild pointer
>> 4766555 xhci: Fix race related to abort operation
>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>> 69973b8 Linux 4.9
>>
>> The fixes for command queue races will go to usb-linus and stable, the
>> reworks for stop ep watchdog timer will go to usb-next.
>>
>> Still completely untested, (well it compiles)
>>
>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>> devicecommands to test these, I'll try to set that up.
>>
>> All additional testing is welcome, especially if you can trigger timeouts
>> and races
>>
>> -Mathias
>>
>>
>
> Below is the latest code. I put my comments in line.
>
>   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>   323 {
>   324         u64 temp_64;
>   325         int ret;
>   326
>   327         xhci_dbg(xhci, "Abort command ring\n");
>   328
>   329         reinit_completion(&xhci->cmd_ring_stop_completion);
>   330
>   331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>   332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>   333                         &xhci->op_regs->cmd_ring);
>
> We should hold xhci->lock when we are modifying xhci registers
> at runtime.
>

Makes sense, but we need to unlock it before sleeping or waiting for completion.
I need to look into that in more detail.

But this was an issue already before these changes.

> The retry of setting CMD_RING_ABORT is not necessary according to
> previous discussion. We have cleaned code for second try in
> xhci_handle_command_timeout(). Need to clean up here as well.
>

Yes it can be cleaned up as well, but the two cases are a bit different.
The cleaned up one was about command ring not starting again after it was stopped.

This second try is a workaround for what we thought was the command ring failing
to stop in the first place, but is most likely due to the race that OGAWA Hirofumi
fixed.  It races if the stop command ring interrupt happens between writing the abort
bit and polling for the ring stopped bit. The interrupt hander may start the command
ring again, and we would believe we failed to stop it in the first place.

This race could probably be fixed by just extending the lock (and preventing
interrupts) to cover both writing the abort bit and polling for the command ring
running bit, as you pointed out here previously.

But then again I really like OGAWA Hiroumi's solution that separates the
command ring stopping from aborting commands and restarting the ring.

The current way of always restarting the command ring as a response to
a stop command ring command really limits its usage.

So, with this in mind most reasonable would be to
1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
3. remove unnecessary second abort try as a separate patch, send to usb-next
4. remove polling for the Command ring running (CRR), waiting for completion
    is enough, if completion times out then we can check CRR. for usb-next
    
I'll fix the typos these patches would introduce. Fixing old typos can be done as separate
patches later.

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21  6:57                           ` Lu Baolu
@ 2016-12-21 12:57                             ` Mathias Nyman
  2016-12-22  1:39                               ` Lu Baolu
  0 siblings, 1 reply; 32+ messages in thread
From: Mathias Nyman @ 2016-12-21 12:57 UTC (permalink / raw)
  To: Lu Baolu, Baolin Wang
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

On 21.12.2016 08:57, Lu Baolu wrote:
> Hi Mathias,
>
> I have some comments for the implementation of
> xhci_handle_command_timeout() as well.
>
> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>> On 20.12.2016 09:30, Baolin Wang wrote:
>> ...
>>
>> Alright, I gathered all current work related to xhci races and timeouts
>> and put them into a branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>
>> Its based on 4.9
>> It includes a few other patches just to avoid conflicts and  make my life easier
>>
>> Interesting patches are:
>>
>> ee4eb91 xhci: remove unnecessary check for pending timer
>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>> 4f2535f xhci: Handle command completion and timeout race
>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>> 529a5a0 usb: xhci: fix possible wild pointer
>> 4766555 xhci: Fix race related to abort operation
>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>> 69973b8 Linux 4.9
>>
>> The fixes for command queue races will go to usb-linus and stable, the
>> reworks for stop ep watchdog timer will go to usb-next.
>>
>> Still completely untested, (well it compiles)
>>
>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>> devicecommands to test these, I'll try to set that up.
>>
>> All additional testing is welcome, especially if you can trigger timeouts
>> and races
>>
>> -Mathias
>>
>>
>
> I post the code below and add my comments in line.
>
> 1276 void xhci_handle_command_timeout(struct work_struct *work)
> 1277 {
> 1278         struct xhci_hcd *xhci;
> 1279         int ret;
> 1280         unsigned long flags;
> 1281         u64 hw_ring_state;
> 1282
> 1283         xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
> 1284
> 1285         spin_lock_irqsave(&xhci->lock, flags);
> 1286
> 1287         /*
> 1288          * If timeout work is pending, or current_cmd is NULL, it means we
> 1289          * raced with command completion. Command is handled so just return.
> 1290          */
> 1291         if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
> 1292                 spin_unlock_irqrestore(&xhci->lock, flags);
> 1293                 return;
> 1294         }
> 1295         /* mark this command to be cancelled */
> 1296         xhci->current_cmd->status = COMP_CMD_ABORT;
> 1297
> 1298         /* Make sure command ring is running before aborting it */
> 1299         hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
> 1300         if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
> 1301             (hw_ring_state & CMD_RING_RUNNING))  {
> 1302                 /* Prevent new doorbell, and start command abort */
> 1303                 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
> 1304                 spin_unlock_irqrestore(&xhci->lock, flags);
> 1305                 xhci_dbg(xhci, "Command timeout\n");
> 1306                 ret = xhci_abort_cmd_ring(xhci);
> 1307                 if (unlikely(ret == -ESHUTDOWN)) {
> 1308                         xhci_err(xhci, "Abort command ring failed\n");
> 1309                         xhci_cleanup_command_queue(xhci);
> 1310                         usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
> 1311                         xhci_dbg(xhci, "xHCI host controller is dead.\n");
> 1312                 }
> 1313                 return;
> 1314         }
> 1315
> 1316         /* host removed. Bail out */
> 1317         if (xhci->xhc_state & XHCI_STATE_REMOVING) {
> 1318                 spin_unlock_irqrestore(&xhci->lock, flags);
> 1319                 xhci_dbg(xhci, "host removed, ring start fail?\n");
> 1320                 xhci_cleanup_command_queue(xhci);
> 1321                 return;
> 1322         }
>
> I think this part of code should be moved up to line 1295.

The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
I'm working on that.

Basically we want XHCI_STATE_REMOVING to mean that  all devices are going,
away and driver will be removed. Don't bother with re-calculating available
bandwidths after every device removal, but do use xhci hardware to disable
devices cleanly etc.

XHCI_STATE_DYING should mean hardware is not working/responding. Don't
bother writing any registers or queuing anything. Just return all
pending and cancelled URBs, notify core we died, and free all allocated memory.

>
> 1323
> 1324         /* command timeout on stopped ring, ring can't be aborted */
> 1325         xhci_dbg(xhci, "Command timeout on stopped ring\n");
> 1326         xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> 1327         spin_unlock_irqrestore(&xhci->lock, flags);
>
> This part of code is tricky. I have no idea about in which case should this
> code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
> here, right?
>

This isn't changed it these patches.

It will remove the aborted commands and restart the ring. It's useful if we
want to abort a command but command ring was not running. (if for some
unkown reason it was stopped, or forgot to restart.

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21  2:22                           ` Baolin Wang
@ 2016-12-21 13:00                             ` Mathias Nyman
  2016-12-27  3:07                               ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Mathias Nyman @ 2016-12-21 13:00 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Lu Baolu, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

On 21.12.2016 04:22, Baolin Wang wrote:
> Hi Mathias,
>
> On 20 December 2016 at 23:13, Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>> On 20.12.2016 09:30, Baolin Wang wrote:
>> ...
>>
>> Alright, I gathered all current work related to xhci races and timeouts
>> and put them into a branch:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>> timeout_race_fixes
>>
>> Its based on 4.9
>> It includes a few other patches just to avoid conflicts and  make my life
>> easier
>>
>> Interesting patches are:
>>
>> ee4eb91 xhci: remove unnecessary check for pending timer
>> 0cba67d xhci: detect stop endpoint race using pending timer instead of
>> counter.
>> 4f2535f xhci: Handle command completion and timeout race
>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
>> command
>> 529a5a0 usb: xhci: fix possible wild pointer
>> 4766555 xhci: Fix race related to abort operation
>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>> 69973b8 Linux 4.9
>>
>> The fixes for command queue races will go to usb-linus and stable, the
>> reworks for stop ep watchdog timer will go to usb-next.
>
> How about applying below patch in your 'timeout_race_fixes' branch?
> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout
> https://lkml.org/lkml/2016/12/13/94
>

usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
which will cleanup the command queue. But I need to look into that

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 12:48                             ` Mathias Nyman
@ 2016-12-21 14:10                               ` OGAWA Hirofumi
  2016-12-21 15:04                                 ` Mathias Nyman
  2016-12-22  1:43                               ` Lu Baolu
  1 sibling, 1 reply; 32+ messages in thread
From: OGAWA Hirofumi @ 2016-12-21 14:10 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Lu Baolu, Baolin Wang, Mathias Nyman, Greg KH, USB, LKML,
	Mark Brown, Lu, Baolu, Alan Stern

Mathias Nyman <mathias.nyman@linux.intel.com> writes:

>> Below is the latest code. I put my comments in line.
>>
>>   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>   323 {
>>   324         u64 temp_64;
>>   325         int ret;
>>   326
>>   327         xhci_dbg(xhci, "Abort command ring\n");
>>   328
>>   329         reinit_completion(&xhci->cmd_ring_stop_completion);
>>   330
>>   331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>>   332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>>   333                         &xhci->op_regs->cmd_ring);
>>
>> We should hold xhci->lock when we are modifying xhci registers
>> at runtime.
>>
>
> Makes sense, but we need to unlock it before sleeping or waiting for
> completion.  I need to look into that in more detail.
>
> But this was an issue already before these changes.

We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
is for taking lock for register though, I guess it should be enough just
lock around of read=>write of ->cmd_ring if need lock.

[Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]

> But then again I really like OGAWA Hiroumi's solution that separates the
> command ring stopping from aborting commands and restarting the ring.
>
> The current way of always restarting the command ring as a response to
> a stop command ring command really limits its usage.
>
> So, with this in mind most reasonable would be to
> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
> 3. remove unnecessary second abort try as a separate patch, send to usb-next
> 4. remove polling for the Command ring running (CRR), waiting for completion
>     is enough, if completion times out then we can check CRR. for usb-next

I think we should check both of CRR and even of stop completion. Because
CRR and stop completion was not same time (both can be first).

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 14:10                               ` OGAWA Hirofumi
@ 2016-12-21 15:04                                 ` Mathias Nyman
  2016-12-21 15:18                                   ` OGAWA Hirofumi
  0 siblings, 1 reply; 32+ messages in thread
From: Mathias Nyman @ 2016-12-21 15:04 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: Lu Baolu, Baolin Wang, Mathias Nyman, Greg KH, USB, LKML,
	Mark Brown, Lu, Baolu, Alan Stern

On 21.12.2016 16:10, OGAWA Hirofumi wrote:
> Mathias Nyman <mathias.nyman@linux.intel.com> writes:
>
>>> Below is the latest code. I put my comments in line.
>>>
>>>    322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>>    323 {
>>>    324         u64 temp_64;
>>>    325         int ret;
>>>    326
>>>    327         xhci_dbg(xhci, "Abort command ring\n");
>>>    328
>>>    329         reinit_completion(&xhci->cmd_ring_stop_completion);
>>>    330
>>>    331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>>>    332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>>>    333                         &xhci->op_regs->cmd_ring);
>>>
>>> We should hold xhci->lock when we are modifying xhci registers
>>> at runtime.
>>>
>>
>> Makes sense, but we need to unlock it before sleeping or waiting for
>> completion.  I need to look into that in more detail.
>>
>> But this was an issue already before these changes.
>
> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
> is for taking lock for register though, I guess it should be enough just
> lock around of read=>write of ->cmd_ring if need lock.

After your patch it should be enough to have the lock only while
reading and writing the cmd_ring register.

If we want a locking fix that applies more easily to older stable releases before
your change then the lock needs to cover set CMD_RING_STATE_ABORT, read cmd_reg,
write cmd_reg and busiloop checking CRR bit.  Otherwise the stop cmd ring interrupt
handler may restart the ring just before we start checing CRR. The stop cmd ring
interrupt will set the CMD_RING_STATE_ABORTED to CMD_RING_STATE_RUNNING so
ring will really restart in the interrupt handler.

>
> [Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.]
>
>> But then again I really like OGAWA Hiroumi's solution that separates the
>> command ring stopping from aborting commands and restarting the ring.
>>
>> The current way of always restarting the command ring as a response to
>> a stop command ring command really limits its usage.
>>
>> So, with this in mind most reasonable would be to
>> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
>> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
>> 3. remove unnecessary second abort try as a separate patch, send to usb-next
>> 4. remove polling for the Command ring running (CRR), waiting for completion
>>      is enough, if completion times out then we can check CRR. for usb-next
>
> I think we should check both of CRR and even of stop completion. Because
> CRR and stop completion was not same time (both can be first).

We can keep both, maybe change the order and do the busylooping-checking after
waiting for completion, but thats a optimization for usb-next sometimes later.

Thanks

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 15:04                                 ` Mathias Nyman
@ 2016-12-21 15:18                                   ` OGAWA Hirofumi
  2016-12-22  1:46                                     ` Lu Baolu
  0 siblings, 1 reply; 32+ messages in thread
From: OGAWA Hirofumi @ 2016-12-21 15:18 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Lu Baolu, Baolin Wang, Mathias Nyman, Greg KH, USB, LKML,
	Mark Brown, Lu, Baolu, Alan Stern

Mathias Nyman <mathias.nyman@linux.intel.com> writes:

>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>> is for taking lock for register though, I guess it should be enough just
>> lock around of read=>write of ->cmd_ring if need lock.
>
> After your patch it should be enough to have the lock only while
> reading and writing the cmd_ring register.
>
> If we want a locking fix that applies more easily to older stable
> releases before your change then the lock needs to cover set
> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
> checking CRR bit.  Otherwise the stop cmd ring interrupt handler may
> restart the ring just before we start checing CRR. The stop cmd ring
> interrupt will set the CMD_RING_STATE_ABORTED to
> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
> handler.

Just for record (no chance to make patch I myself for now, sorry), while
checking locking slightly, I noticed unrelated missing locking.

	xhci_cleanup_command_queue()

We are calling it without locking, but we need to lock for accessing list.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 12:57                             ` Mathias Nyman
@ 2016-12-22  1:39                               ` Lu Baolu
  0 siblings, 0 replies; 32+ messages in thread
From: Lu Baolu @ 2016-12-22  1:39 UTC (permalink / raw)
  To: Mathias Nyman, Baolin Wang
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi,

On 12/21/2016 08:57 PM, Mathias Nyman wrote:
> On 21.12.2016 08:57, Lu Baolu wrote:
>> Hi Mathias,
>>
>> I have some comments for the implementation of
>> xhci_handle_command_timeout() as well.
>>
>> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and  make my life easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>> Still completely untested, (well it compiles)
>>>
>>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>>> devicecommands to test these, I'll try to set that up.
>>>
>>> All additional testing is welcome, especially if you can trigger timeouts
>>> and races
>>>
>>> -Mathias
>>>
>>>
>>
>> I post the code below and add my comments in line.
>>
>> 1276 void xhci_handle_command_timeout(struct work_struct *work)
>> 1277 {
>> 1278         struct xhci_hcd *xhci;
>> 1279         int ret;
>> 1280         unsigned long flags;
>> 1281         u64 hw_ring_state;
>> 1282
>> 1283         xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
>> 1284
>> 1285         spin_lock_irqsave(&xhci->lock, flags);
>> 1286
>> 1287         /*
>> 1288          * If timeout work is pending, or current_cmd is NULL, it means we
>> 1289          * raced with command completion. Command is handled so just return.
>> 1290          */
>> 1291         if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
>> 1292                 spin_unlock_irqrestore(&xhci->lock, flags);
>> 1293                 return;
>> 1294         }
>> 1295         /* mark this command to be cancelled */
>> 1296         xhci->current_cmd->status = COMP_CMD_ABORT;
>> 1297
>> 1298         /* Make sure command ring is running before aborting it */
>> 1299         hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>> 1300         if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
>> 1301             (hw_ring_state & CMD_RING_RUNNING))  {
>> 1302                 /* Prevent new doorbell, and start command abort */
>> 1303                 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
>> 1304                 spin_unlock_irqrestore(&xhci->lock, flags);
>> 1305                 xhci_dbg(xhci, "Command timeout\n");
>> 1306                 ret = xhci_abort_cmd_ring(xhci);
>> 1307                 if (unlikely(ret == -ESHUTDOWN)) {
>> 1308                         xhci_err(xhci, "Abort command ring failed\n");
>> 1309                         xhci_cleanup_command_queue(xhci);
>> 1310                         usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
>> 1311                         xhci_dbg(xhci, "xHCI host controller is dead.\n");
>> 1312                 }
>> 1313                 return;
>> 1314         }
>> 1315
>> 1316         /* host removed. Bail out */
>> 1317         if (xhci->xhc_state & XHCI_STATE_REMOVING) {
>> 1318                 spin_unlock_irqrestore(&xhci->lock, flags);
>> 1319                 xhci_dbg(xhci, "host removed, ring start fail?\n");
>> 1320                 xhci_cleanup_command_queue(xhci);
>> 1321                 return;
>> 1322         }
>>
>> I think this part of code should be moved up to line 1295.
>
> The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
> I'm working on that.
>
> Basically we want XHCI_STATE_REMOVING to mean that  all devices are going,
> away and driver will be removed. Don't bother with re-calculating available
> bandwidths after every device removal, but do use xhci hardware to disable
> devices cleanly etc.
>
> XHCI_STATE_DYING should mean hardware is not working/responding. Don't
> bother writing any registers or queuing anything. Just return all
> pending and cancelled URBs, notify core we died, and free all allocated memory.

Okay, thanks for the information.

>
>>
>> 1323
>> 1324         /* command timeout on stopped ring, ring can't be aborted */
>> 1325         xhci_dbg(xhci, "Command timeout on stopped ring\n");
>> 1326         xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
>> 1327         spin_unlock_irqrestore(&xhci->lock, flags);
>>
>> This part of code is tricky. I have no idea about in which case should this
>> code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
>> here, right?
>>
>
> This isn't changed it these patches.
>
> It will remove the aborted commands and restart the ring. It's useful if we
> want to abort a command but command ring was not running. (if for some
> unkown reason it was stopped, or forgot to restart.

Make sense.

So how about put a warning (instead of a debug message which will normally
be ignored) here?

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 12:48                             ` Mathias Nyman
  2016-12-21 14:10                               ` OGAWA Hirofumi
@ 2016-12-22  1:43                               ` Lu Baolu
  1 sibling, 0 replies; 32+ messages in thread
From: Lu Baolu @ 2016-12-22  1:43 UTC (permalink / raw)
  To: Mathias Nyman, Baolin Wang
  Cc: Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu,
	Alan Stern, OGAWA Hirofumi

Hi,

On 12/21/2016 08:48 PM, Mathias Nyman wrote:
> On 21.12.2016 08:17, Lu Baolu wrote:
>> Hi Mathias,
>>
>> I have some comments for the implementation of xhci_abort_cmd_ring() below.
>>
>> On 12/20/2016 11:13 PM, Mathias Nyman wrote:
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and  make my life easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>> Still completely untested, (well it compiles)
>>>
>>> Felipe gave instructions how to modify dwc3 driver to timeout on address
>>> devicecommands to test these, I'll try to set that up.
>>>
>>> All additional testing is welcome, especially if you can trigger timeouts
>>> and races
>>>
>>> -Mathias
>>>
>>>
>>
>> Below is the latest code. I put my comments in line.
>>
>>   322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
>>   323 {
>>   324         u64 temp_64;
>>   325         int ret;
>>   326
>>   327         xhci_dbg(xhci, "Abort command ring\n");
>>   328
>>   329         reinit_completion(&xhci->cmd_ring_stop_completion);
>>   330
>>   331         temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
>>   332         xhci_write_64(xhci, temp_64 | CMD_RING_ABORT,
>>   333                         &xhci->op_regs->cmd_ring);
>>
>> We should hold xhci->lock when we are modifying xhci registers
>> at runtime.
>>
>
> Makes sense, but we need to unlock it before sleeping or waiting for completion.
> I need to look into that in more detail.
>
> But this was an issue already before these changes.
>
>> The retry of setting CMD_RING_ABORT is not necessary according to
>> previous discussion. We have cleaned code for second try in
>> xhci_handle_command_timeout(). Need to clean up here as well.
>>
>
> Yes it can be cleaned up as well, but the two cases are a bit different.
> The cleaned up one was about command ring not starting again after it was stopped.
>
> This second try is a workaround for what we thought was the command ring failing
> to stop in the first place, but is most likely due to the race that OGAWA Hirofumi
> fixed.  It races if the stop command ring interrupt happens between writing the abort
> bit and polling for the ring stopped bit. The interrupt hander may start the command
> ring again, and we would believe we failed to stop it in the first place.
>
> This race could probably be fixed by just extending the lock (and preventing
> interrupts) to cover both writing the abort bit and polling for the command ring
> running bit, as you pointed out here previously.
>
> But then again I really like OGAWA Hiroumi's solution that separates the
> command ring stopping from aborting commands and restarting the ring.
>
> The current way of always restarting the command ring as a response to
> a stop command ring command really limits its usage.
>
> So, with this in mind most reasonable would be to
> 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable
> 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only
> 3. remove unnecessary second abort try as a separate patch, send to usb-next
> 4. remove polling for the Command ring running (CRR), waiting for completion
>    is enough, if completion times out then we can check CRR. for usb-next
>    I'll fix the typos these patches would introduce. Fixing old typos can be done as separate
> patches later.

This is exactly the same as what I am thinking of. I will submit the patches later.

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 15:18                                   ` OGAWA Hirofumi
@ 2016-12-22  1:46                                     ` Lu Baolu
  2016-12-23 12:54                                       ` Mathias Nyman
  0 siblings, 1 reply; 32+ messages in thread
From: Lu Baolu @ 2016-12-22  1:46 UTC (permalink / raw)
  To: OGAWA Hirofumi, Mathias Nyman
  Cc: Baolin Wang, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu,
	Baolu, Alan Stern

Hi,

On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:
> Mathias Nyman <mathias.nyman@linux.intel.com> writes:
>
>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>>> is for taking lock for register though, I guess it should be enough just
>>> lock around of read=>write of ->cmd_ring if need lock.
>> After your patch it should be enough to have the lock only while
>> reading and writing the cmd_ring register.
>>
>> If we want a locking fix that applies more easily to older stable
>> releases before your change then the lock needs to cover set
>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
>> checking CRR bit.  Otherwise the stop cmd ring interrupt handler may
>> restart the ring just before we start checing CRR. The stop cmd ring
>> interrupt will set the CMD_RING_STATE_ABORTED to
>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
>> handler.
> Just for record (no chance to make patch I myself for now, sorry), while
> checking locking slightly, I noticed unrelated missing locking.
>
> 	xhci_cleanup_command_queue()
>
> We are calling it without locking, but we need to lock for accessing list.

Yeah. I can make the patch.

Best regards,
Lu Baolu

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-22  1:46                                     ` Lu Baolu
@ 2016-12-23 12:54                                       ` Mathias Nyman
  0 siblings, 0 replies; 32+ messages in thread
From: Mathias Nyman @ 2016-12-23 12:54 UTC (permalink / raw)
  To: Lu Baolu, OGAWA Hirofumi
  Cc: Baolin Wang, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu,
	Baolu, Alan Stern

On 22.12.2016 03:46, Lu Baolu wrote:
> Hi,
>
> On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote:
>> Mathias Nyman <mathias.nyman@linux.intel.com> writes:
>>
>>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what
>>>> is for taking lock for register though, I guess it should be enough just
>>>> lock around of read=>write of ->cmd_ring if need lock.
>>> After your patch it should be enough to have the lock only while
>>> reading and writing the cmd_ring register.
>>>
>>> If we want a locking fix that applies more easily to older stable
>>> releases before your change then the lock needs to cover set
>>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop
>>> checking CRR bit.  Otherwise the stop cmd ring interrupt handler may
>>> restart the ring just before we start checing CRR. The stop cmd ring
>>> interrupt will set the CMD_RING_STATE_ABORTED to
>>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt
>>> handler.
>> Just for record (no chance to make patch I myself for now, sorry), while
>> checking locking slightly, I noticed unrelated missing locking.
>>
>> 	xhci_cleanup_command_queue()
>>
>> We are calling it without locking, but we need to lock for accessing list.
>
> Yeah. I can make the patch.
>

Force updated timeout_race_fixes branch.

I'll be out for a few days during xmas, continue testing after that

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-21 13:00                             ` Mathias Nyman
@ 2016-12-27  3:07                               ` Baolin Wang
  2017-01-02 14:57                                 ` Mathias Nyman
  0 siblings, 1 reply; 32+ messages in thread
From: Baolin Wang @ 2016-12-27  3:07 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Lu Baolu, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

Hi,

On 21 December 2016 at 21:00, Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
> On 21.12.2016 04:22, Baolin Wang wrote:
>>
>> Hi Mathias,
>>
>> On 20 December 2016 at 23:13, Mathias Nyman
>> <mathias.nyman@linux.intel.com> wrote:
>>>
>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>> ...
>>>
>>> Alright, I gathered all current work related to xhci races and timeouts
>>> and put them into a branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>>> timeout_race_fixes
>>>
>>> Its based on 4.9
>>> It includes a few other patches just to avoid conflicts and  make my life
>>> easier
>>>
>>> Interesting patches are:
>>>
>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of
>>> counter.
>>> 4f2535f xhci: Handle command completion and timeout race
>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
>>> command
>>> 529a5a0 usb: xhci: fix possible wild pointer
>>> 4766555 xhci: Fix race related to abort operation
>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>> 69973b8 Linux 4.9
>>>
>>> The fixes for command queue races will go to usb-linus and stable, the
>>> reworks for stop ep watchdog timer will go to usb-next.
>>
>>
>> How about applying below patch in your 'timeout_race_fixes' branch?
>> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is
>> timeout
>> https://lkml.org/lkml/2016/12/13/94
>>
>
> usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
> which will cleanup the command queue. But I need to look into that

usb_hc_died() did not call xhci_mem_cleanup() to clean up command
queue, it just disconnects all children devices attached on the dying
hub. I did not find where it will clean up the command queue when
issuing usb_hc_died(). Also before issuing usb_hc_died() in
xhci_handle_command_timeout(), we will call
xhci_cleanup_command_queue(). I think it should same as in
xhci_stop_endpoint_command_watchdog().

-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2016-12-27  3:07                               ` Baolin Wang
@ 2017-01-02 14:57                                 ` Mathias Nyman
  2017-01-03  6:20                                   ` Baolin Wang
  0 siblings, 1 reply; 32+ messages in thread
From: Mathias Nyman @ 2017-01-02 14:57 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Lu Baolu, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

On 27.12.2016 05:07, Baolin Wang wrote:
> Hi,
>
> On 21 December 2016 at 21:00, Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>> On 21.12.2016 04:22, Baolin Wang wrote:
>>>
>>> Hi Mathias,
>>>
>>> On 20 December 2016 at 23:13, Mathias Nyman
>>> <mathias.nyman@linux.intel.com> wrote:
>>>>
>>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>>> ...
>>>>
>>>> Alright, I gathered all current work related to xhci races and timeouts
>>>> and put them into a branch:
>>>>
>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>>>> timeout_race_fixes
>>>>
>>>> Its based on 4.9
>>>> It includes a few other patches just to avoid conflicts and  make my life
>>>> easier
>>>>
>>>> Interesting patches are:
>>>>
>>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of
>>>> counter.
>>>> 4f2535f xhci: Handle command completion and timeout race
>>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
>>>> command
>>>> 529a5a0 usb: xhci: fix possible wild pointer
>>>> 4766555 xhci: Fix race related to abort operation
>>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>>> 69973b8 Linux 4.9
>>>>
>>>> The fixes for command queue races will go to usb-linus and stable, the
>>>> reworks for stop ep watchdog timer will go to usb-next.
>>>
>>>
>>> How about applying below patch in your 'timeout_race_fixes' branch?
>>> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is
>>> timeout
>>> https://lkml.org/lkml/2016/12/13/94
>>>
>>
>> usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
>> which will cleanup the command queue. But I need to look into that
>
> usb_hc_died() did not call xhci_mem_cleanup() to clean up command
> queue, it just disconnects all children devices attached on the dying
> hub. I did not find where it will clean up the command queue when
> issuing usb_hc_died(). Also before issuing usb_hc_died() in
> xhci_handle_command_timeout(), we will call
> xhci_cleanup_command_queue(). I think it should same as in
> xhci_stop_endpoint_command_watchdog().
>

You're right, it doesn't call xhci_mem_cleanup.
Need to look at this after getting first series of fixes to usb-linus

-Mathias

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command
  2017-01-02 14:57                                 ` Mathias Nyman
@ 2017-01-03  6:20                                   ` Baolin Wang
  0 siblings, 0 replies; 32+ messages in thread
From: Baolin Wang @ 2017-01-03  6:20 UTC (permalink / raw)
  To: Mathias Nyman
  Cc: Lu Baolu, Mathias Nyman, Greg KH, USB, LKML, Mark Brown, Lu, Baolu

On 2 January 2017 at 22:57, Mathias Nyman <mathias.nyman@linux.intel.com> wrote:
> On 27.12.2016 05:07, Baolin Wang wrote:
>>
>> Hi,
>>
>> On 21 December 2016 at 21:00, Mathias Nyman
>> <mathias.nyman@linux.intel.com> wrote:
>>>
>>> On 21.12.2016 04:22, Baolin Wang wrote:
>>>>
>>>>
>>>> Hi Mathias,
>>>>
>>>> On 20 December 2016 at 23:13, Mathias Nyman
>>>> <mathias.nyman@linux.intel.com> wrote:
>>>>>
>>>>>
>>>>> On 20.12.2016 09:30, Baolin Wang wrote:
>>>>> ...
>>>>>
>>>>> Alright, I gathered all current work related to xhci races and timeouts
>>>>> and put them into a branch:
>>>>>
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
>>>>> timeout_race_fixes
>>>>>
>>>>> Its based on 4.9
>>>>> It includes a few other patches just to avoid conflicts and  make my
>>>>> life
>>>>> easier
>>>>>
>>>>> Interesting patches are:
>>>>>
>>>>> ee4eb91 xhci: remove unnecessary check for pending timer
>>>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of
>>>>> counter.
>>>>> 4f2535f xhci: Handle command completion and timeout race
>>>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort
>>>>> command
>>>>> 529a5a0 usb: xhci: fix possible wild pointer
>>>>> 4766555 xhci: Fix race related to abort operation
>>>>> de834a3 xhci: Use delayed_work instead of timer for command timeout
>>>>> 69973b8 Linux 4.9
>>>>>
>>>>> The fixes for command queue races will go to usb-linus and stable, the
>>>>> reworks for stop ep watchdog timer will go to usb-next.
>>>>
>>>>
>>>>
>>>> How about applying below patch in your 'timeout_race_fixes' branch?
>>>> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is
>>>> timeout
>>>> https://lkml.org/lkml/2016/12/13/94
>>>>
>>>
>>> usb_hc_died() should eventyally end up calling xhci_mem_cleanup()
>>> which will cleanup the command queue. But I need to look into that
>>
>>
>> usb_hc_died() did not call xhci_mem_cleanup() to clean up command
>> queue, it just disconnects all children devices attached on the dying
>> hub. I did not find where it will clean up the command queue when
>> issuing usb_hc_died(). Also before issuing usb_hc_died() in
>> xhci_handle_command_timeout(), we will call
>> xhci_cleanup_command_queue(). I think it should same as in
>> xhci_stop_endpoint_command_watchdog().
>>
>
> You're right, it doesn't call xhci_mem_cleanup.
> Need to look at this after getting first series of fixes to usb-linus

Thanks.

-- 
Baolin.wang
Best Regards

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2017-01-03  6:20 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-05  7:51 [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command Baolin Wang
2016-12-05  7:51 ` [PATCH 2/2] usb: host: xhci: Handle the right timeout command Baolin Wang
2016-12-12 15:52   ` Mathias Nyman
2016-12-13  3:21     ` Baolin Wang
2016-12-19 10:33       ` Mathias Nyman
2016-12-19 11:34         ` Baolin Wang
2016-12-19 12:13           ` Mathias Nyman
2016-12-20  3:23             ` Baolin Wang
2016-12-20  4:29             ` Lu Baolu
2016-12-20  6:06               ` Baolin Wang
2016-12-20  6:39                 ` Lu Baolu
2016-12-20  6:46                   ` Baolin Wang
2016-12-20  7:18                     ` Lu Baolu
2016-12-20  7:30                       ` Baolin Wang
2016-12-20 15:13                         ` Mathias Nyman
2016-12-21  2:22                           ` Baolin Wang
2016-12-21 13:00                             ` Mathias Nyman
2016-12-27  3:07                               ` Baolin Wang
2017-01-02 14:57                                 ` Mathias Nyman
2017-01-03  6:20                                   ` Baolin Wang
2016-12-21  6:17                           ` Lu Baolu
2016-12-21 12:48                             ` Mathias Nyman
2016-12-21 14:10                               ` OGAWA Hirofumi
2016-12-21 15:04                                 ` Mathias Nyman
2016-12-21 15:18                                   ` OGAWA Hirofumi
2016-12-22  1:46                                     ` Lu Baolu
2016-12-23 12:54                                       ` Mathias Nyman
2016-12-22  1:43                               ` Lu Baolu
2016-12-21  6:57                           ` Lu Baolu
2016-12-21 12:57                             ` Mathias Nyman
2016-12-22  1:39                               ` Lu Baolu
2016-12-05 14:08 ` [PATCH 1/2] usb: host: xhci: Fix possible wild pointer when handling abort command Mathias Nyman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).