All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
@ 2013-06-09 15:18 Ming Lei
  2013-06-09 15:18 ` [RFC PATCH 1/4] USB: HCD: support " Ming Lei
                   ` (4 more replies)
  0 siblings, 5 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-09 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

The patchset supports to run giveback of URB in tasklet context, so that
DMA unmapping/mapping on transfer buffer and compelte() callback can be
run with interrupt enabled, then time of HCD interrupt handler can be
saved much. Also this approach may simplify HCD since HCD lock needn't
be released any more before calling usb_hcd_giveback_urb().

The patchset enables the mechanism on EHCI HCD.

Patch 3/4 improves interrupt qh unlink on EHCI HCD when the mechanism
is introduced.

In the commit log of patch 4/4, detailed test result on three machines
(ARM A9/A15 dual core, X86) are provided about transfer performance and
ehci irq handling time. From the result, no transfer performance loss
is found and ehci irq handling time can drop much with the patchset.

 drivers/usb/core/hcd.c            |  170 +++++++++++++++++++++++++++++++------
 drivers/usb/host/ehci-fsl.c       |    2 +-
 drivers/usb/host/ehci-grlib.c     |    2 +-
 drivers/usb/host/ehci-hcd.c       |   18 ++--
 drivers/usb/host/ehci-hub.c       |    1 +
 drivers/usb/host/ehci-mv.c        |    2 +-
 drivers/usb/host/ehci-octeon.c    |    2 +-
 drivers/usb/host/ehci-pmcmsp.c    |    2 +-
 drivers/usb/host/ehci-ppc-of.c    |    2 +-
 drivers/usb/host/ehci-ps3.c       |    2 +-
 drivers/usb/host/ehci-q.c         |    6 +-
 drivers/usb/host/ehci-sched.c     |   60 ++++++++++++-
 drivers/usb/host/ehci-sead3.c     |    2 +-
 drivers/usb/host/ehci-sh.c        |    2 +-
 drivers/usb/host/ehci-tegra.c     |    2 +-
 drivers/usb/host/ehci-tilegx.c    |    2 +-
 drivers/usb/host/ehci-timer.c     |   43 ++++++++++
 drivers/usb/host/ehci-w90x900.c   |    2 +-
 drivers/usb/host/ehci-xilinx-of.c |    2 +-
 drivers/usb/host/ehci.h           |    3 +
 include/linux/usb/hcd.h           |   16 ++++
 21 files changed, 295 insertions(+), 48 deletions(-)


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-09 15:18 [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context Ming Lei
@ 2013-06-09 15:18 ` Ming Lei
  2013-06-09 15:58   ` Alan Stern
  2013-06-10  8:43   ` Oliver Neukum
  2013-06-09 15:18 ` [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback " Ming Lei
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-09 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

This patch implements the mechanism of giveback of URB in
tasklet context, so that hardware interrupt handling time for
usb host controller can be saved much, and HCD interrupt handling
can be simplified.

Motivations:

1), on some arch(such as ARM), DMA mapping/unmapping is a bit
time-consuming, for example: when accessing usb mass storage
via EHCI on pandaboard, the common length of transfer buffer is 120KB,
the time consumed on DMA unmapping may reach hundreds of microseconds;
even on A15 based box, the time is still about scores of microseconds

2), on some arch, reading DMA coherent memoery is very time-consuming,
the most common example is usb video class driver[1]

3), driver's complete() callback may do much things which is driver
specific, so the time is consumed unnecessarily in hardware irq context.

4), running driver's complete() callback in hardware irq context causes
that host controller driver has to release its lock in interrupt handler,
so reacquiring the lock after return may busy wait a while and increase
interrupt handling time. More seriously, releasing the HCD lock makes
HCD becoming quite complicated to deal with introduced races.

So the patch proposes to run giveback of URB in tasklet context, then
time consumed in HCD irq handling doesn't depend on drivers' complete and
DMA mapping/unmapping any more, also we can simplify HCD since the HCD
lock isn't needed to be released during irq handling.

The patch should be reasonable and doable:

1), for drivers, they don't care if the complete() is called in hard irq
context or softirq context

2), the biggest change is the situation in which usb_submit_urb() is called
in complete() callback, so the introduced tasklet schedule delay might be a
con, but it shouldn't be a big deal:

	- control/bulk asynchronous transfer isn't sensitive to schedule
	  delay

	- the patch schedules giveback of periodic URBs using
	  tasklet_hi_schedule, so the introduced delay should be very
	  small

	- use percpu tasklset for each HCD to schedule giveback of URB,
	  which are running as much as parallel

	- for ISOC transfer, generally, drivers submit several URBs
	  concurrently to avoid interrupt delay, so it is OK with the
	  little schedule delay.

	- for interrupt transfer, generally, drivers only submit one URB
	  at the same time, but interrupt transfer is often used in event
	  report, polling, ... situations, and a little delay should be OK.

Considered that HCDs may optimize on submitting URB in complete(), the
patch may cause the optimization not working, so introduces one flag to mark
if the HCD supports to run giveback URB in tasklet context. When all HCDs
are ready, the flag can be removed.

[1], http://marc.info/?t=136438111600010&r=1&w=2

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/core/hcd.c  |  170 ++++++++++++++++++++++++++++++++++++++++-------
 include/linux/usb/hcd.h |   16 +++++
 2 files changed, 162 insertions(+), 24 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 3d27d87..dc319e5 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -676,9 +676,11 @@ error:
 	 * Avoiding calls to local_irq_disable/enable makes the code
 	 * RT-friendly.
 	 */
-	spin_unlock(&hcd_root_hub_lock);
+	if (!hcd_giveback_urb_in_bh(hcd))
+		spin_unlock(&hcd_root_hub_lock);
 	usb_hcd_giveback_urb(hcd, urb, status);
-	spin_lock(&hcd_root_hub_lock);
+	if (!hcd_giveback_urb_in_bh(hcd))
+		spin_lock(&hcd_root_hub_lock);
 
 	spin_unlock_irq(&hcd_root_hub_lock);
 	return 0;
@@ -719,9 +721,11 @@ void usb_hcd_poll_rh_status(struct usb_hcd *hcd)
 			memcpy(urb->transfer_buffer, buffer, length);
 
 			usb_hcd_unlink_urb_from_ep(hcd, urb);
-			spin_unlock(&hcd_root_hub_lock);
+			if (!hcd_giveback_urb_in_bh(hcd))
+				spin_unlock(&hcd_root_hub_lock);
 			usb_hcd_giveback_urb(hcd, urb, 0);
-			spin_lock(&hcd_root_hub_lock);
+			if (!hcd_giveback_urb_in_bh(hcd))
+				spin_lock(&hcd_root_hub_lock);
 		} else {
 			length = 0;
 			set_bit(HCD_FLAG_POLL_PENDING, &hcd->flags);
@@ -812,9 +816,11 @@ static int usb_rh_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 			hcd->status_urb = NULL;
 			usb_hcd_unlink_urb_from_ep(hcd, urb);
 
-			spin_unlock(&hcd_root_hub_lock);
+			if (!hcd_giveback_urb_in_bh(hcd))
+				spin_unlock(&hcd_root_hub_lock);
 			usb_hcd_giveback_urb(hcd, urb, status);
-			spin_lock(&hcd_root_hub_lock);
+			if (!hcd_giveback_urb_in_bh(hcd))
+				spin_lock(&hcd_root_hub_lock);
 		}
 	}
  done:
@@ -1627,25 +1633,11 @@ int usb_hcd_unlink_urb (struct urb *urb, int status)
 
 /*-------------------------------------------------------------------------*/
 
-/**
- * usb_hcd_giveback_urb - return URB from HCD to device driver
- * @hcd: host controller returning the URB
- * @urb: urb being returned to the USB device driver.
- * @status: completion status code for the URB.
- * Context: in_interrupt()
- *
- * This hands the URB from HCD to its USB device driver, using its
- * completion function.  The HCD has freed all per-urb resources
- * (and is done using urb->hcpriv).  It also released all HCD locks;
- * the device driver won't cause problems if it frees, modifies,
- * or resubmits this URB.
- *
- * If @urb was unlinked, the value of @status will be overridden by
- * @urb->unlinked.  Erroneous short transfers are detected in case
- * the HCD hasn't checked for them.
- */
-void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
+static void __usb_hcd_giveback_urb(struct urb *urb)
 {
+	struct usb_hcd *hcd = bus_to_hcd(urb->dev->bus);
+	int status = urb->status;
+
 	urb->hcpriv = NULL;
 	if (unlikely(urb->unlinked))
 		status = urb->unlinked;
@@ -1668,6 +1660,68 @@ void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
 		wake_up (&usb_kill_urb_queue);
 	usb_put_urb (urb);
 }
+
+static void usb_giveback_urb_bh(unsigned long param)
+{
+	struct giveback_urb_bh *bh = (struct giveback_urb_bh *)param;
+	unsigned long flags;
+	struct list_head local_list;
+
+	spin_lock_irqsave(&bh->lock, flags);
+	list_replace_init(&bh->head, &local_list);
+	spin_unlock_irqrestore(&bh->lock, flags);
+
+	while (!list_empty(&local_list)) {
+		struct urb *urb;
+
+		urb = list_entry(local_list.next, struct urb, urb_list);
+		list_del_init(&urb->urb_list);
+		__usb_hcd_giveback_urb(urb);
+	}
+}
+
+/**
+ * usb_hcd_giveback_urb - return URB from HCD to device driver
+ * @hcd: host controller returning the URB
+ * @urb: urb being returned to the USB device driver.
+ * @status: completion status code for the URB.
+ * Context: in_interrupt()
+ *
+ * This hands the URB from HCD to its USB device driver, using its
+ * completion function.  The HCD has freed all per-urb resources
+ * (and is done using urb->hcpriv).  It also released all HCD locks;
+ * the device driver won't cause problems if it frees, modifies,
+ * or resubmits this URB.
+ *
+ * If @urb was unlinked, the value of @status will be overridden by
+ * @urb->unlinked.  Erroneous short transfers are detected in case
+ * the HCD hasn't checked for them.
+ */
+void usb_hcd_giveback_urb(struct usb_hcd *hcd, struct urb *urb, int status)
+{
+	struct giveback_urb_bh *bh = __this_cpu_ptr(hcd->async_bh);
+	bool async = 1;
+
+	urb->status = status;
+	if (!hcd_giveback_urb_in_bh(hcd)) {
+		__usb_hcd_giveback_urb(urb);
+		return;
+	}
+
+	if (usb_pipeisoc(urb->pipe) || usb_pipeint(urb->pipe)) {
+		bh = __this_cpu_ptr(hcd->periodic_bh);
+		async = 0;
+	}
+
+	spin_lock(&bh->lock);
+	list_add_tail(&urb->urb_list, &bh->head);
+	spin_unlock(&bh->lock);
+
+	if (async)
+		tasklet_schedule(&bh->bh);
+	else
+		tasklet_hi_schedule(&bh->bh);
+}
 EXPORT_SYMBOL_GPL(usb_hcd_giveback_urb);
 
 /*-------------------------------------------------------------------------*/
@@ -2288,6 +2342,59 @@ EXPORT_SYMBOL_GPL (usb_hc_died);
 
 /*-------------------------------------------------------------------------*/
 
+static void __init_giveback_urb_bh(struct giveback_urb_bh __percpu *bh)
+{
+	int i;
+
+	for_each_possible_cpu(i) {
+		struct giveback_urb_bh *pbh = per_cpu_ptr(bh, i);
+		spin_lock_init(&pbh->lock);
+		INIT_LIST_HEAD(&pbh->head);
+		tasklet_init(&pbh->bh, usb_giveback_urb_bh, (unsigned long)pbh);
+	}
+}
+
+static int init_giveback_urb_bh(struct usb_hcd *hcd)
+{
+	if (!hcd_giveback_urb_in_bh(hcd))
+		return 0;
+
+	hcd->periodic_bh = alloc_percpu(typeof(*hcd->periodic_bh));
+	if (!hcd->periodic_bh)
+		return -ENOMEM;
+
+	hcd->async_bh = alloc_percpu(typeof(*hcd->async_bh));
+	if (!hcd->async_bh) {
+		free_percpu(hcd->periodic_bh);
+		return -ENOMEM;
+	}
+
+	__init_giveback_urb_bh(hcd->periodic_bh);
+	__init_giveback_urb_bh(hcd->async_bh);
+
+	return 0;
+}
+
+static void __exit_giveback_urb_bh(struct giveback_urb_bh __percpu *bh)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		tasklet_kill(&per_cpu_ptr(bh, i)->bh);
+}
+
+static void exit_giveback_urb_bh(struct usb_hcd *hcd)
+{
+	if (!hcd_giveback_urb_in_bh(hcd))
+		return;
+
+	__exit_giveback_urb_bh(hcd->periodic_bh);
+	__exit_giveback_urb_bh(hcd->async_bh);
+
+	free_percpu(hcd->periodic_bh);
+	free_percpu(hcd->async_bh);
+}
+
 /**
  * usb_create_shared_hcd - create and initialize an HCD structure
  * @driver: HC driver that will use this hcd
@@ -2551,6 +2658,16 @@ int usb_add_hcd(struct usb_hcd *hcd,
 			&& device_can_wakeup(&hcd->self.root_hub->dev))
 		dev_dbg(hcd->self.controller, "supports USB remote wakeup\n");
 
+	if (usb_hcd_is_primary_hcd(hcd)) {
+		retval = init_giveback_urb_bh(hcd);
+		if (retval)
+			goto err_init_giveback_bh;
+	} else {
+		/* share tasklet handling with primary hcd */
+		hcd->async_bh = hcd->primary_hcd->async_bh;
+		hcd->periodic_bh = hcd->primary_hcd->periodic_bh;
+	}
+
 	/* enable irqs just before we start the controller,
 	 * if the BIOS provides legacy PCI irqs.
 	 */
@@ -2614,6 +2731,8 @@ err_hcd_driver_start:
 	if (usb_hcd_is_primary_hcd(hcd) && hcd->irq > 0)
 		free_irq(irqnum, hcd);
 err_request_irq:
+	exit_giveback_urb_bh(hcd);
+err_init_giveback_bh:
 err_hcd_driver_setup:
 err_set_rh_speed:
 	usb_put_dev(hcd->self.root_hub);
@@ -2659,6 +2778,9 @@ void usb_remove_hcd(struct usb_hcd *hcd)
 	usb_disconnect(&rhdev);		/* Sets rhdev to NULL */
 	mutex_unlock(&usb_bus_list_lock);
 
+	if (usb_hcd_is_primary_hcd(hcd))
+		exit_giveback_urb_bh(hcd);
+
 	/* Prevent any more root-hub status calls from the timer.
 	 * The HCD might still restart the timer (if a port status change
 	 * interrupt occurs), but usb_hcd_poll_rh_status() won't invoke
diff --git a/include/linux/usb/hcd.h b/include/linux/usb/hcd.h
index f5f5c7d..4e4e174 100644
--- a/include/linux/usb/hcd.h
+++ b/include/linux/usb/hcd.h
@@ -22,6 +22,7 @@
 #ifdef __KERNEL__
 
 #include <linux/rwsem.h>
+#include <linux/interrupt.h>
 
 #define MAX_TOPO_LEVEL		6
 
@@ -67,6 +68,12 @@
 
 /*-------------------------------------------------------------------------*/
 
+struct giveback_urb_bh {
+	spinlock_t lock;
+	struct list_head  head;
+	struct tasklet_struct bh;
+};
+
 struct usb_hcd {
 
 	/*
@@ -139,6 +146,9 @@ struct usb_hcd {
 	resource_size_t		rsrc_len;	/* memory/io resource length */
 	unsigned		power_budget;	/* in mA, 0 = no limit */
 
+	struct giveback_urb_bh __percpu *periodic_bh;
+	struct giveback_urb_bh __percpu *async_bh;
+
 	/* bandwidth_mutex should be taken before adding or removing
 	 * any new bus bandwidth constraints:
 	 *   1. Before adding a configuration for a new device.
@@ -220,6 +230,7 @@ struct hc_driver {
 #define	HCD_USB2	0x0020		/* USB 2.0 */
 #define	HCD_USB3	0x0040		/* USB 3.0 */
 #define	HCD_MASK	0x0070
+#define	HCD_BH		0x0100		/* URB complete in BH context */
 
 	/* called to init HCD and root hub */
 	int	(*reset) (struct usb_hcd *hcd);
@@ -360,6 +371,11 @@ struct hc_driver {
 	int	(*find_raw_port_number)(struct usb_hcd *, int);
 };
 
+static inline int hcd_giveback_urb_in_bh(struct usb_hcd *hcd)
+{
+	return hcd->driver->flags & HCD_BH;
+}
+
 extern int usb_hcd_link_urb_to_ep(struct usb_hcd *hcd, struct urb *urb);
 extern int usb_hcd_check_unlink_urb(struct usb_hcd *hcd, struct urb *urb,
 		int status);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback in tasklet context
  2013-06-09 15:18 [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context Ming Lei
  2013-06-09 15:18 ` [RFC PATCH 1/4] USB: HCD: support " Ming Lei
@ 2013-06-09 15:18 ` Ming Lei
  2013-06-09 16:06   ` Alan Stern
  2013-06-09 15:18 ` [RFC PATCH 3/4] USB: EHCI: improve interrupt qh unlink Ming Lei
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-09 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

If HCD_BH is set for HC driver's flags, URB giveback will be
done in tasklet context instead of interrupt context, so the
ehci->lock needn't to be released any more before calling
usb_hcd_giveback_urb().

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/host/ehci-q.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index d34b399..0387a81 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -283,9 +283,11 @@ __acquires(ehci->lock)
 
 	/* complete() can reenter this HCD */
 	usb_hcd_unlink_urb_from_ep(ehci_to_hcd(ehci), urb);
-	spin_unlock (&ehci->lock);
+	if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)))
+		spin_unlock(&ehci->lock);
 	usb_hcd_giveback_urb(ehci_to_hcd(ehci), urb, status);
-	spin_lock (&ehci->lock);
+	if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)))
+		spin_lock(&ehci->lock);
 }
 
 static int qh_schedule (struct ehci_hcd *ehci, struct ehci_qh *qh);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH 3/4] USB: EHCI: improve interrupt qh unlink
  2013-06-09 15:18 [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context Ming Lei
  2013-06-09 15:18 ` [RFC PATCH 1/4] USB: HCD: support " Ming Lei
  2013-06-09 15:18 ` [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback " Ming Lei
@ 2013-06-09 15:18 ` Ming Lei
  2013-06-09 15:18 ` [RFC PATCH 4/4] USB: EHCI: support running URB giveback in tasklet context Ming Lei
  2013-06-09 15:48 ` [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB " Alan Stern
  4 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-09 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

Given interrupt URB will be resubmitted from tasklet context which
is scheduled by ehci hardware interrupt handler, and commonly only
one interrupt URB is scheduled on qh, so the qh may be unlinked
immediately once qh_completions() returns from ehci_irq(), then
the intr URB to be resubmitted in complete() can only be scheduled
and linked to hardware until the qh unlink is completed.

This patch applied the QH_STATE_UNLINK_WAIT(not used by interrupt qh)
state to improve this above situation, and the qh will wait for 5
milliseconds before being unlinked from hardware, if one URB is submitted
during the period, the qh is move out of unlink wait state and the
interrupt transfer can be scheduled immediately, not like before: the
transfer is only linked to hardware until previous unlink is completed.

Only enable the improvement in case that HCD supports to run
giveback of URB in tasklet context.

Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/host/ehci-hcd.c   |   16 ++++++++---
 drivers/usb/host/ehci-hub.c   |    1 +
 drivers/usb/host/ehci-sched.c |   60 ++++++++++++++++++++++++++++++++++++++---
 drivers/usb/host/ehci-timer.c |   43 +++++++++++++++++++++++++++++
 drivers/usb/host/ehci.h       |    3 +++
 5 files changed, 115 insertions(+), 8 deletions(-)

diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 246e124..0ab82de 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -304,7 +304,9 @@ static void end_unlink_async(struct ehci_hcd *ehci);
 static void unlink_empty_async(struct ehci_hcd *ehci);
 static void unlink_empty_async_suspended(struct ehci_hcd *ehci);
 static void ehci_work(struct ehci_hcd *ehci);
-static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh);
+static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh,
+			      bool wait);
+static void start_do_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh);
 static void end_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh);
 
 #include "ehci-timer.c"
@@ -484,6 +486,7 @@ static int ehci_init(struct usb_hcd *hcd)
 	ehci->periodic_size = DEFAULT_I_TDPS;
 	INIT_LIST_HEAD(&ehci->async_unlink);
 	INIT_LIST_HEAD(&ehci->async_idle);
+	INIT_LIST_HEAD(&ehci->intr_unlink_wait);
 	INIT_LIST_HEAD(&ehci->intr_unlink);
 	INIT_LIST_HEAD(&ehci->intr_qh_list);
 	INIT_LIST_HEAD(&ehci->cached_itd_list);
@@ -908,15 +911,20 @@ static int ehci_urb_dequeue(struct usb_hcd *hcd, struct urb *urb, int status)
 		switch (qh->qh_state) {
 		case QH_STATE_LINKED:
 			if (usb_pipetype(urb->pipe) == PIPE_INTERRUPT)
-				start_unlink_intr(ehci, qh);
+				start_unlink_intr(ehci, qh, false);
 			else
 				start_unlink_async(ehci, qh);
 			break;
 		case QH_STATE_COMPLETING:
 			qh->dequeue_during_giveback = 1;
 			break;
-		case QH_STATE_UNLINK:
 		case QH_STATE_UNLINK_WAIT:
+			if (usb_pipetype(urb->pipe) == PIPE_INTERRUPT) {
+				cancel_unlink_wait_intr(ehci, qh);
+				start_unlink_intr(ehci, qh, false);
+			}
+			break;
+		case QH_STATE_UNLINK:
 			/* already started */
 			break;
 		case QH_STATE_IDLE:
@@ -1042,7 +1050,7 @@ ehci_endpoint_reset(struct usb_hcd *hcd, struct usb_host_endpoint *ep)
 			if (eptype == USB_ENDPOINT_XFER_BULK)
 				start_unlink_async(ehci, qh);
 			else
-				start_unlink_intr(ehci, qh);
+				start_unlink_intr(ehci, qh, false);
 		}
 	}
 	spin_unlock_irqrestore(&ehci->lock, flags);
diff --git a/drivers/usb/host/ehci-hub.c b/drivers/usb/host/ehci-hub.c
index b2f6450..4104c66 100644
--- a/drivers/usb/host/ehci-hub.c
+++ b/drivers/usb/host/ehci-hub.c
@@ -345,6 +345,7 @@ static int ehci_bus_suspend (struct usb_hcd *hcd)
 
 	end_unlink_async(ehci);
 	unlink_empty_async_suspended(ehci);
+	ehci_handle_start_intr_unlinks(ehci);
 	ehci_handle_intr_unlinks(ehci);
 	end_free_itds(ehci);
 
diff --git a/drivers/usb/host/ehci-sched.c b/drivers/usb/host/ehci-sched.c
index acff5b8..d132c6f 100644
--- a/drivers/usb/host/ehci-sched.c
+++ b/drivers/usb/host/ehci-sched.c
@@ -601,10 +601,10 @@ static void qh_unlink_periodic(struct ehci_hcd *ehci, struct ehci_qh *qh)
 	list_del(&qh->intr_node);
 }
 
-static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
+static void start_do_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
 {
-	/* If the QH isn't linked then there's nothing we can do. */
-	if (qh->qh_state != QH_STATE_LINKED)
+	/* If the QH isn't unlink wait then there's nothing we can do. */
+	if (qh->qh_state != QH_STATE_UNLINK_WAIT)
 		return;
 
 	qh_unlink_periodic (ehci, qh);
@@ -632,6 +632,55 @@ static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
 	}
 }
 
+static void start_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh,
+			      bool wait)
+{
+	/* If the QH isn't linked then there's nothing we can do. */
+	if (qh->qh_state != QH_STATE_LINKED)
+		return;
+
+	/*
+	 * It is common only one intr URB is scheduled on one qh, and given
+	 * complete() is run in tasklet context, introduce QH_STATE_UNLINK_WAIT
+	 * to avoid unlink qh too early.
+	 */
+	qh->qh_state = QH_STATE_UNLINK_WAIT;
+
+	/* bypass the optimization without URB giveback in tasklet */
+	if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)) || !wait) {
+		start_do_unlink_intr(ehci, qh);
+		return;
+	}
+
+	qh->unlink_cycle = ehci->intr_unlink_wait_cycle;
+
+	/* New entries go at the end of the intr_unlink_wait list */
+	list_add_tail(&qh->unlink_node, &ehci->intr_unlink_wait);
+
+	if (ehci->rh_state < EHCI_RH_RUNNING)
+		ehci_handle_start_intr_unlinks(ehci);
+	else if (ehci->intr_unlink_wait.next == &qh->unlink_node) {
+		ehci_enable_event(ehci, EHCI_HRTIMER_START_UNLINK_INTR, true);
+		++ehci->intr_unlink_wait_cycle;
+	}
+}
+
+/* must be called with holding ehci->lock */
+static void cancel_unlink_wait_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
+{
+	/* If the QH isn't waited for unlink then do nothing */
+	if (qh->qh_state != QH_STATE_UNLINK_WAIT)
+		return;
+
+	/* restored to linked state */
+	list_del(&qh->unlink_node);
+	qh->qh_state = QH_STATE_LINKED;
+
+	/* avoid unnecessary CPU wakeup */
+	if (list_empty(&ehci->intr_unlink_wait))
+		ehci_disable_event(ehci, EHCI_HRTIMER_START_UNLINK_INTR);
+}
+
 static void end_unlink_intr(struct ehci_hcd *ehci, struct ehci_qh *qh)
 {
 	struct ehci_qh_hw	*hw = qh->hw;
@@ -876,6 +925,9 @@ static int intr_submit (
 			goto done;
 	}
 
+	/* put back qh from unlink wait list */
+	cancel_unlink_wait_intr(ehci, qh);
+
 	/* then queue the urb's tds to the qh */
 	qh = qh_append_tds(ehci, urb, qtd_list, epnum, &urb->ep->hcpriv);
 	BUG_ON (qh == NULL);
@@ -921,7 +973,7 @@ static void scan_intr(struct ehci_hcd *ehci)
 			temp = qh_completions(ehci, qh);
 			if (unlikely(temp || (list_empty(&qh->qtd_list) &&
 					qh->qh_state == QH_STATE_LINKED)))
-				start_unlink_intr(ehci, qh);
+				start_unlink_intr(ehci, qh, true);
 		}
 	}
 }
diff --git a/drivers/usb/host/ehci-timer.c b/drivers/usb/host/ehci-timer.c
index 11e5b32..c2320f3 100644
--- a/drivers/usb/host/ehci-timer.c
+++ b/drivers/usb/host/ehci-timer.c
@@ -72,6 +72,7 @@ static unsigned event_delays_ns[] = {
 	1 * NSEC_PER_MSEC,	/* EHCI_HRTIMER_POLL_DEAD */
 	1125 * NSEC_PER_USEC,	/* EHCI_HRTIMER_UNLINK_INTR */
 	2 * NSEC_PER_MSEC,	/* EHCI_HRTIMER_FREE_ITDS */
+	5 * NSEC_PER_MSEC,	/* EHCI_HRTIMER_START_UNLINK_INTR */
 	6 * NSEC_PER_MSEC,	/* EHCI_HRTIMER_ASYNC_UNLINKS */
 	10 * NSEC_PER_MSEC,	/* EHCI_HRTIMER_IAA_WATCHDOG */
 	10 * NSEC_PER_MSEC,	/* EHCI_HRTIMER_DISABLE_PERIODIC */
@@ -98,6 +99,17 @@ static void ehci_enable_event(struct ehci_hcd *ehci, unsigned event,
 	}
 }
 
+/* Warning: don't call this function from hrtimer handler context */
+static void ehci_disable_event(struct ehci_hcd *ehci, unsigned event)
+{
+	if (!hcd_giveback_urb_in_bh(ehci_to_hcd(ehci)))
+		return;
+
+	ehci->enabled_hrtimer_events &= ~(1 << event);
+	if (!ehci->enabled_hrtimer_events)
+		hrtimer_cancel(&ehci->hrtimer);
+}
+
 
 /* Poll the STS_ASS status bit; see when it agrees with CMD_ASE */
 static void ehci_poll_ASS(struct ehci_hcd *ehci)
@@ -215,6 +227,36 @@ static void ehci_handle_controller_death(struct ehci_hcd *ehci)
 	/* Not in process context, so don't try to reset the controller */
 }
 
+/* start to unlink interrupt QHs  */
+static void ehci_handle_start_intr_unlinks(struct ehci_hcd *ehci)
+{
+	bool		stopped = (ehci->rh_state < EHCI_RH_RUNNING);
+
+	/*
+	 * Process all the QHs on the intr_unlink list that were added
+	 * before the current unlink cycle began.  The list is in
+	 * temporal order, so stop when we reach the first entry in the
+	 * current cycle.  But if the root hub isn't running then
+	 * process all the QHs on the list.
+	 */
+	while (!list_empty(&ehci->intr_unlink_wait)) {
+		struct ehci_qh	*qh;
+
+		qh = list_first_entry(&ehci->intr_unlink_wait,
+				      struct ehci_qh, unlink_node);
+		if (!stopped &&
+		    qh->unlink_cycle == ehci->intr_unlink_wait_cycle)
+			break;
+		list_del(&qh->unlink_node);
+		start_do_unlink_intr(ehci, qh);
+	}
+
+	/* Handle remaining entries later */
+	if (!list_empty(&ehci->intr_unlink_wait)) {
+		ehci_enable_event(ehci, EHCI_HRTIMER_START_UNLINK_INTR, true);
+		++ehci->intr_unlink_wait_cycle;
+	}
+}
 
 /* Handle unlinked interrupt QHs once they are gone from the hardware */
 static void ehci_handle_intr_unlinks(struct ehci_hcd *ehci)
@@ -363,6 +405,7 @@ static void (*event_handlers[])(struct ehci_hcd *) = {
 	ehci_handle_controller_death,	/* EHCI_HRTIMER_POLL_DEAD */
 	ehci_handle_intr_unlinks,	/* EHCI_HRTIMER_UNLINK_INTR */
 	end_free_itds,			/* EHCI_HRTIMER_FREE_ITDS */
+	ehci_handle_start_intr_unlinks,	/* EHCI_HRTIMER_START_UNLINK_INTR */
 	unlink_empty_async,		/* EHCI_HRTIMER_ASYNC_UNLINKS */
 	ehci_iaa_watchdog,		/* EHCI_HRTIMER_IAA_WATCHDOG */
 	ehci_disable_PSE,		/* EHCI_HRTIMER_DISABLE_PERIODIC */
diff --git a/drivers/usb/host/ehci.h b/drivers/usb/host/ehci.h
index 7c978b2..1d2c164 100644
--- a/drivers/usb/host/ehci.h
+++ b/drivers/usb/host/ehci.h
@@ -88,6 +88,7 @@ enum ehci_hrtimer_event {
 	EHCI_HRTIMER_POLL_DEAD,		/* Wait for dead controller to stop */
 	EHCI_HRTIMER_UNLINK_INTR,	/* Wait for interrupt QH unlink */
 	EHCI_HRTIMER_FREE_ITDS,		/* Wait for unused iTDs and siTDs */
+	EHCI_HRTIMER_START_UNLINK_INTR, /* wait for new intr schedule */
 	EHCI_HRTIMER_ASYNC_UNLINKS,	/* Unlink empty async QHs */
 	EHCI_HRTIMER_IAA_WATCHDOG,	/* Handle lost IAA interrupts */
 	EHCI_HRTIMER_DISABLE_PERIODIC,	/* Wait to disable periodic sched */
@@ -143,6 +144,8 @@ struct ehci_hcd {			/* one per controller */
 	unsigned		i_thresh;	/* uframes HC might cache */
 
 	union ehci_shadow	*pshadow;	/* mirror hw periodic table */
+	struct list_head	intr_unlink_wait;
+	unsigned		intr_unlink_wait_cycle;
 	struct list_head	intr_unlink;
 	unsigned		intr_unlink_cycle;
 	unsigned		now_frame;	/* frame from HC hardware */
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH 4/4] USB: EHCI: support running URB giveback in tasklet context
  2013-06-09 15:18 [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context Ming Lei
                   ` (2 preceding siblings ...)
  2013-06-09 15:18 ` [RFC PATCH 3/4] USB: EHCI: improve interrupt qh unlink Ming Lei
@ 2013-06-09 15:18 ` Ming Lei
  2013-06-09 15:48 ` [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB " Alan Stern
  4 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-09 15:18 UTC (permalink / raw)
  To: linux-arm-kernel

Both 4 transfers can work well on EHCI HCD after switching to run
URB giveback in tasklet context, so mark all HCD drivers to support
it.

>From below test results on 3 machines(2 ARM and one x86), time
consumed by EHCI interrupt handler droped much without performance
loss.


1 test description
1.1 mass storage performance test:
- run below command 10 times and compute the average performance

    dd if=/dev/sdN iflag=direct of=/dev/null bs=200M count=1

- two usb mass storage device:
A: sandisk extreme USB 3.0 16G(used in test case 1 & case 2) 
B: kingston DataTraveler G2 4GB(only used in test case 2)

1.2 uvc function test:
- run one simple capture program in the below link
   
   http://kernel.ubuntu.com/~ming/up/capture.c

- capture format 640*480 and results in High Bandwidth mode on the
uvc device: Z-Star 0x0ac8/0x3450

- on T410(x86) laptop, also use guvcview to watch video capture/playback

1.3 about test2 and test4
- both two devices involved are tested concurrently by above test items

1.4 how to compute irq time(the time consumed by ehci_irq)
- use trace points of irq:irq_handler_entry and irq:irq_handler_exit

1.5 kernel
3.10.0-rc3-next-20130528

1.6 test machines
Pandaboard A1: ARM CortexA9 dural core
Arndale board: ARM CortexA15 dural core
T410: i5 CPU 2.67GHz quad core

2 test result
2.1 test case1: single mass storage device performance test
--------------------------------------------------------------------
		upstream 		| patched
		perf(MB/s)+irq time(us)	| perf(MB/s)+irq time(us)
--------------------------------------------------------------------
Pandaboard A1:  25.690(avg:144,max:818)	| 25.620(avg:14, max:77)
Arndale board:  29.700(avg:29, max:123)	| 29.700(avg:9,  max:48)
T410: 		34.420(avg:20, max:199*)| 34.440(avg:12, max:157)
---------------------------------------------------------------------

2.2 test case2: two mass storage devices' performance test
--------------------------------------------------------------------
		upstream 			| patched
		perf(MB/s)+irq time(us)		| perf(MB/s)+irq time(us)
--------------------------------------------------------------------
Pandaboard A1:  15.830/15.530(avg:158,max:1139)	| 16.490/16.180(avg:15, max:135)
Arndale board:  17.370/16.260(avg:30 max:222)	| 17.450/16.240(avg:10, max:92)
T410: 		21.300/19.900(avg:18 max:243)	| 21.230/19.920(avg:12, max:152)
---------------------------------------------------------------------

2.3 test case3: one uvc streaming test
- uvc device works well(on x86, luvcview can be used too and has
same result with uvc capture)
--------------------------------------------------------------------
		upstream 		| patched
		irq time(us)		| irq time(us)
--------------------------------------------------------------------
Pandaboard A1:  (avg:455, max:941)	| (avg:33, max:45)
Arndale board:  (avg:316, max:616)	| (avg:19, max:27)
T410: 		(avg:39,  max:164)	| (avg:10, max:142)
---------------------------------------------------------------------

2.4 test case4: one uvc streaming plus one mass storage device test
--------------------------------------------------------------------
		upstream 		| patched
		perf(MB/s)+irq time(us)	| perf(MB/s)+irq time(us)
--------------------------------------------------------------------
Pandaboard A1:  20.340(avg:258,max:1575)| 20.230(avg:24, max:100)
Arndale board:  23.570(avg:120,max:683)	| 23.560(avg:14, max:61)
T410: 		28.260(avg:27, max:178)	| 28.230(avg:13, max:155)
---------------------------------------------------------------------

* On T410, sometimes read ehci status register in ehci_irq takes more
than 100us, and the problem has been reported on the link:

	http://marc.info/?t=137065867300001&r=1&w=2


Cc: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
---
 drivers/usb/host/ehci-fsl.c       |    2 +-
 drivers/usb/host/ehci-grlib.c     |    2 +-
 drivers/usb/host/ehci-hcd.c       |    2 +-
 drivers/usb/host/ehci-mv.c        |    2 +-
 drivers/usb/host/ehci-octeon.c    |    2 +-
 drivers/usb/host/ehci-pmcmsp.c    |    2 +-
 drivers/usb/host/ehci-ppc-of.c    |    2 +-
 drivers/usb/host/ehci-ps3.c       |    2 +-
 drivers/usb/host/ehci-sead3.c     |    2 +-
 drivers/usb/host/ehci-sh.c        |    2 +-
 drivers/usb/host/ehci-tegra.c     |    2 +-
 drivers/usb/host/ehci-tilegx.c    |    2 +-
 drivers/usb/host/ehci-w90x900.c   |    2 +-
 drivers/usb/host/ehci-xilinx-of.c |    2 +-
 14 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/usb/host/ehci-fsl.c b/drivers/usb/host/ehci-fsl.c
index bd831ec..330274a 100644
--- a/drivers/usb/host/ehci-fsl.c
+++ b/drivers/usb/host/ehci-fsl.c
@@ -669,7 +669,7 @@ static const struct hc_driver ehci_fsl_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq = ehci_irq,
-	.flags = HCD_USB2 | HCD_MEMORY,
+	.flags = HCD_USB2 | HCD_MEMORY | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-grlib.c b/drivers/usb/host/ehci-grlib.c
index 5d75de9..c6048ee 100644
--- a/drivers/usb/host/ehci-grlib.c
+++ b/drivers/usb/host/ehci-grlib.c
@@ -43,7 +43,7 @@ static const struct hc_driver ehci_grlib_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-hcd.c b/drivers/usb/host/ehci-hcd.c
index 0ab82de..d856c5e 100644
--- a/drivers/usb/host/ehci-hcd.c
+++ b/drivers/usb/host/ehci-hcd.c
@@ -1174,7 +1174,7 @@ static const struct hc_driver ehci_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq =			ehci_irq,
-	.flags =		HCD_MEMORY | HCD_USB2,
+	.flags =		HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-mv.c b/drivers/usb/host/ehci-mv.c
index 915c2db..ce18a36 100644
--- a/drivers/usb/host/ehci-mv.c
+++ b/drivers/usb/host/ehci-mv.c
@@ -96,7 +96,7 @@ static const struct hc_driver mv_ehci_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq = ehci_irq,
-	.flags = HCD_MEMORY | HCD_USB2,
+	.flags = HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-octeon.c b/drivers/usb/host/ehci-octeon.c
index 45cc001..ab0397e 100644
--- a/drivers/usb/host/ehci-octeon.c
+++ b/drivers/usb/host/ehci-octeon.c
@@ -51,7 +51,7 @@ static const struct hc_driver ehci_octeon_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-pmcmsp.c b/drivers/usb/host/ehci-pmcmsp.c
index 363890e..3ab32a2 100644
--- a/drivers/usb/host/ehci-pmcmsp.c
+++ b/drivers/usb/host/ehci-pmcmsp.c
@@ -286,7 +286,7 @@ static const struct hc_driver ehci_msp_hc_driver = {
 #else
 	.irq =			ehci_irq,
 #endif
-	.flags =		HCD_MEMORY | HCD_USB2,
+	.flags =		HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-ppc-of.c b/drivers/usb/host/ehci-ppc-of.c
index 56dc732..da95a31 100644
--- a/drivers/usb/host/ehci-ppc-of.c
+++ b/drivers/usb/host/ehci-ppc-of.c
@@ -28,7 +28,7 @@ static const struct hc_driver ehci_ppc_of_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
index fd98377..8188542 100644
--- a/drivers/usb/host/ehci-ps3.c
+++ b/drivers/usb/host/ehci-ps3.c
@@ -71,7 +71,7 @@ static const struct hc_driver ps3_ehci_hc_driver = {
 	.product_desc		= "PS3 EHCI Host Controller",
 	.hcd_priv_size		= sizeof(struct ehci_hcd),
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 	.reset			= ps3_ehci_hc_reset,
 	.start			= ehci_run,
 	.stop			= ehci_stop,
diff --git a/drivers/usb/host/ehci-sead3.c b/drivers/usb/host/ehci-sead3.c
index b2de52d..8a73449 100644
--- a/drivers/usb/host/ehci-sead3.c
+++ b/drivers/usb/host/ehci-sead3.c
@@ -55,7 +55,7 @@ const struct hc_driver ehci_sead3_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-sh.c b/drivers/usb/host/ehci-sh.c
index c4c0ee9..1691f8e 100644
--- a/drivers/usb/host/ehci-sh.c
+++ b/drivers/usb/host/ehci-sh.c
@@ -36,7 +36,7 @@ static const struct hc_driver ehci_sh_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq				= ehci_irq,
-	.flags				= HCD_USB2 | HCD_MEMORY,
+	.flags				= HCD_USB2 | HCD_MEMORY | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-tegra.c b/drivers/usb/host/ehci-tegra.c
index 59d111b..c23db21 100644
--- a/drivers/usb/host/ehci-tegra.c
+++ b/drivers/usb/host/ehci-tegra.c
@@ -377,7 +377,7 @@ static const struct hc_driver tegra_ehci_hc_driver = {
 	.description		= hcd_name,
 	.product_desc		= "Tegra EHCI Host Controller",
 	.hcd_priv_size		= sizeof(struct ehci_hcd),
-	.flags			= HCD_USB2 | HCD_MEMORY,
+	.flags			= HCD_USB2 | HCD_MEMORY | HCD_BH,
 
 	/* standard ehci functions */
 	.irq			= ehci_irq,
diff --git a/drivers/usb/host/ehci-tilegx.c b/drivers/usb/host/ehci-tilegx.c
index d72b292..204d3b6 100644
--- a/drivers/usb/host/ehci-tilegx.c
+++ b/drivers/usb/host/ehci-tilegx.c
@@ -61,7 +61,7 @@ static const struct hc_driver ehci_tilegx_hc_driver = {
 	 * Generic hardware linkage.
 	 */
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * Basic lifecycle operations.
diff --git a/drivers/usb/host/ehci-w90x900.c b/drivers/usb/host/ehci-w90x900.c
index 59e0e24..1c370df 100644
--- a/drivers/usb/host/ehci-w90x900.c
+++ b/drivers/usb/host/ehci-w90x900.c
@@ -108,7 +108,7 @@ static const struct hc_driver ehci_w90x900_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq = ehci_irq,
-	.flags = HCD_USB2|HCD_MEMORY,
+	.flags = HCD_USB2|HCD_MEMORY|HCD_BH,
 
 	/*
 	 * basic lifecycle operations
diff --git a/drivers/usb/host/ehci-xilinx-of.c b/drivers/usb/host/ehci-xilinx-of.c
index d845e3b..bb96a68 100644
--- a/drivers/usb/host/ehci-xilinx-of.c
+++ b/drivers/usb/host/ehci-xilinx-of.c
@@ -79,7 +79,7 @@ static const struct hc_driver ehci_xilinx_of_hc_driver = {
 	 * generic hardware linkage
 	 */
 	.irq			= ehci_irq,
-	.flags			= HCD_MEMORY | HCD_USB2,
+	.flags			= HCD_MEMORY | HCD_USB2 | HCD_BH,
 
 	/*
 	 * basic lifecycle operations
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-09 15:18 [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context Ming Lei
                   ` (3 preceding siblings ...)
  2013-06-09 15:18 ` [RFC PATCH 4/4] USB: EHCI: support running URB giveback in tasklet context Ming Lei
@ 2013-06-09 15:48 ` Alan Stern
  2013-06-10  6:05   ` Ming Lei
  4 siblings, 1 reply; 66+ messages in thread
From: Alan Stern @ 2013-06-09 15:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 9 Jun 2013, Ming Lei wrote:

> Hi,
> 
> The patchset supports to run giveback of URB in tasklet context, so that
> DMA unmapping/mapping on transfer buffer and compelte() callback can be
> run with interrupt enabled, then time of HCD interrupt handler can be
> saved much. Also this approach may simplify HCD since HCD lock needn't
> be released any more before calling usb_hcd_giveback_urb().

That's a lot of material.  However, you haven't specifically said
_what_ you want to accomplish.  It sounds like you're trying to
minimize the amount of time spent in hardirq context, but I can't be
sure that is really what you want.

If it is, this is not a good way to do it.  A better way to minimize
the time spent in hardirq context is to use a threaded interrupt
handler.  But why do you want to minimize this time in the first place?  
The CPU will be still have to use at least as much time as before; the
only difference is that it will be in softirq or in a high-priority
thread instead of in hardirq.

Also, I don't see how you can avoid releasing the HCD's private lock
before calling usb_hcd_giveback_urb().  When the completion handler
tries to resubmit the URB, the HCD's enqueue routine will try to
reacquire the private lock.  If it didn't release the lock earlier, it
will hang.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-09 15:18 ` [RFC PATCH 1/4] USB: HCD: support " Ming Lei
@ 2013-06-09 15:58   ` Alan Stern
  2013-06-10  8:12     ` Ming Lei
  2013-06-10  8:43   ` Oliver Neukum
  1 sibling, 1 reply; 66+ messages in thread
From: Alan Stern @ 2013-06-09 15:58 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 9 Jun 2013, Ming Lei wrote:

> This patch implements the mechanism of giveback of URB in
> tasklet context, so that hardware interrupt handling time for
> usb host controller can be saved much, and HCD interrupt handling
> can be simplified.

Why do you want to minimize the hardware interrupt handling time?  The 
total interrupt handling time will actually be increased by your 
change; the only advantage is that much of the work will not be in 
hardirq context.

Also, I suspect that this will make HCD interrupt handling _more_ 
complicated, not less.

> Motivations:
> 
> 1), on some arch(such as ARM), DMA mapping/unmapping is a bit
> time-consuming, for example: when accessing usb mass storage
> via EHCI on pandaboard, the common length of transfer buffer is 120KB,
> the time consumed on DMA unmapping may reach hundreds of microseconds;
> even on A15 based box, the time is still about scores of microseconds

DMA mapping/unmapping will remain just as time-consuming as before.  
This patch won't change that.

> 2), on some arch, reading DMA coherent memoery is very time-consuming,
> the most common example is usb video class driver[1]

Reading DMA-coherent memory will remain just as time-consuming as 
before.

> 3), driver's complete() callback may do much things which is driver
> specific, so the time is consumed unnecessarily in hardware irq context.

With this patch, the time is consumed in softirq context.  What is the 
advantage?

> 4), running driver's complete() callback in hardware irq context causes
> that host controller driver has to release its lock in interrupt handler,
> so reacquiring the lock after return may busy wait a while and increase
> interrupt handling time. More seriously, releasing the HCD lock makes
> HCD becoming quite complicated to deal with introduced races.

There is no choice; the lock _has_ to be released before giving back 
completed URBs.  Therefore the races are unavoidable, as is the 
busy-wait.

> So the patch proposes to run giveback of URB in tasklet context, then
> time consumed in HCD irq handling doesn't depend on drivers' complete and
> DMA mapping/unmapping any more, also we can simplify HCD since the HCD
> lock isn't needed to be released during irq handling.

I'm not convinced.  In particular, I'm not convinced that this is 
better than using a threaded interrupt handler.

> The patch should be reasonable and doable:
> 
> 1), for drivers, they don't care if the complete() is called in hard irq
> context or softirq context

True.

> 2), the biggest change is the situation in which usb_submit_urb() is called
> in complete() callback, so the introduced tasklet schedule delay might be a
> con, but it shouldn't be a big deal:
> 
> 	- control/bulk asynchronous transfer isn't sensitive to schedule
> 	  delay

Mass-storage device access (which uses bulk async transfers) certainly
_is_ sensitive to scheduling delays.

> 	- the patch schedules giveback of periodic URBs using
> 	  tasklet_hi_schedule, so the introduced delay should be very
> 	  small
> 
> 	- use percpu tasklset for each HCD to schedule giveback of URB,
> 	  which are running as much as parallel

That might be an advantage; it depends on the work load.  But if the 
tasklet ends up running on a different CPU from the task that submitted 
the URB originally, it would be less of an advantage.

> 	- for ISOC transfer, generally, drivers submit several URBs
> 	  concurrently to avoid interrupt delay, so it is OK with the
> 	  little schedule delay.

This varies.  Some drivers use very low overheads for isochronous 
transfers.  A delay of even one millisecond might be too long for them.

> 	- for interrupt transfer, generally, drivers only submit one URB
> 	  at the same time, but interrupt transfer is often used in event
> 	  report, polling, ... situations, and a little delay should be OK.

Agreed.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback in tasklet context
  2013-06-09 15:18 ` [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback " Ming Lei
@ 2013-06-09 16:06   ` Alan Stern
  2013-06-10  9:10     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Alan Stern @ 2013-06-09 16:06 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, 9 Jun 2013, Ming Lei wrote:

> If HCD_BH is set for HC driver's flags, URB giveback will be
> done in tasklet context instead of interrupt context, so the
> ehci->lock needn't to be released any more before calling
> usb_hcd_giveback_urb().

Ah, this is the reason why you don't need to release the private lock.  
I'm not sure that this will save much time overall.

With the existing code, the main reason for lock contention would be if
(for example) CPU 2 submitted or cancelled an URB while the giveback
was occurring on CPU 1.  Then CPU 1 would be forced to wait while CPU 2 
finished its submission or cancellation.

With this patch, it's the other way around.  CPU 2 would be forced to 
wait while CPU 1 does all the rest of the work in its interrupt 
handler.  The total time spent holding the lock will be the same; it 
will just be distributed differently among the CPUs.  Hence it is not 
at all clear that this change will save any time.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-09 15:48 ` [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB " Alan Stern
@ 2013-06-10  6:05   ` Ming Lei
  2013-06-10 14:03     ` Alan Stern
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-10  6:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 9, 2013 at 11:48 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 9 Jun 2013, Ming Lei wrote:
>
>> Hi,
>>
>> The patchset supports to run giveback of URB in tasklet context, so that
>> DMA unmapping/mapping on transfer buffer and compelte() callback can be
>> run with interrupt enabled, then time of HCD interrupt handler can be
>> saved much. Also this approach may simplify HCD since HCD lock needn't
>> be released any more before calling usb_hcd_giveback_urb().
>
> That's a lot of material.  However, you haven't specifically said
> _what_ you want to accomplish.  It sounds like you're trying to

All IRQs are disabled(locally) in hard interrupt context, but they are
enabled in softirq context, this patchset can decrease the time a lot.

> minimize the amount of time spent in hardirq context, but I can't be
> sure that is really what you want.
>
> If it is, this is not a good way to do it.  A better way to minimize
> the time spent in hardirq context is to use a threaded interrupt
> handler.  But why do you want to minimize this time in the first place?

Process context switch may have more cost than switching to softirq,
then USB performance may be effected, that is why I choose to use
tasklet.  Also now there is no any sleep in URB giveback, so it isn't
necessary to introduce process to handle the giveback or completion.

> The CPU will be still have to use at least as much time as before; the
> only difference is that it will be in softirq or in a high-priority
> thread instead of in hardirq.

Yes, as I said above, but we can allow other IRQs to be served in
handling URB giveback in softirq context, that is the value of softirq/tasklet.
Of course, system may have more important things to do during the time.
Generally speaking, for one driver, hard interrupt handler should always
consume less time as possible.

>
> Also, I don't see how you can avoid releasing the HCD's private lock
> before calling usb_hcd_giveback_urb().  When the completion handler

Please see patch 1/4, :-)

> tries to resubmit the URB, the HCD's enqueue routine will try to
> reacquire the private lock.  If it didn't release the lock earlier, it

Yes, but with the patchset, resubmitting is called in taskelet context,
instead of hardirq context, so there isn't any recursion any more.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-09 15:58   ` Alan Stern
@ 2013-06-10  8:12     ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-10  8:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Sun, Jun 9, 2013 at 11:58 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 9 Jun 2013, Ming Lei wrote:
>
>> This patch implements the mechanism of giveback of URB in
>> tasklet context, so that hardware interrupt handling time for
>> usb host controller can be saved much, and HCD interrupt handling
>> can be simplified.
>
> Why do you want to minimize the hardware interrupt handling time?  The
> total interrupt handling time will actually be increased by your
> change; the only advantage is that much of the work will not be in
> hardirq context.

Disabling interrupts too long will cause system to respond slowly, for
example timer interrupt may be handled very lately.  That is why tasklet/
softirq is introduced to split driver's interrupt handler into two parts: the
urgent thing is done quickly in hard irq context with IRQs disabled, and
the remaining is handled in softirq/tasklet part with IRQs enabled.

>
> Also, I suspect that this will make HCD interrupt handling _more_
> complicated, not less.

If HCD's lock wasn't released before calling usb_hcd_giveback_urb(),
HCD interrupt handling would have been a bit simpler.

>
>> Motivations:
>>
>> 1), on some arch(such as ARM), DMA mapping/unmapping is a bit
>> time-consuming, for example: when accessing usb mass storage
>> via EHCI on pandaboard, the common length of transfer buffer is 120KB,
>> the time consumed on DMA unmapping may reach hundreds of microseconds;
>> even on A15 based box, the time is still about scores of microseconds
>
> DMA mapping/unmapping will remain just as time-consuming as before.
> This patch won't change that.

Yes, it is platform dependent, and I have tried to improve it, but not succeed.

But, is there any good reason to do time-consuming DMA mapping/
unmapping with all IRQs disabled to make system response slowly?

>
>> 2), on some arch, reading DMA coherent memoery is very time-consuming,
>> the most common example is usb video class driver[1]
>
> Reading DMA-coherent memory will remain just as time-consuming as
> before.

Same with above.

>
>> 3), driver's complete() callback may do much things which is driver
>> specific, so the time is consumed unnecessarily in hardware irq context.
>
> With this patch, the time is consumed in softirq context.  What is the
> advantage?

IRQs are enabled in softirq context so that system can respond events
which might require to be handled very quickly.

>
>> 4), running driver's complete() callback in hardware irq context causes
>> that host controller driver has to release its lock in interrupt handler,
>> so reacquiring the lock after return may busy wait a while and increase
>> interrupt handling time. More seriously, releasing the HCD lock makes
>> HCD becoming quite complicated to deal with introduced races.
>
> There is no choice; the lock _has_ to be released before giving back
> completed URBs.  Therefore the races are unavoidable, as is the
> busy-wait.

Could you explain it in a bit detail why there is no choice?

IMO, the patchset can make it possible.

>
>> So the patch proposes to run giveback of URB in tasklet context, then
>> time consumed in HCD irq handling doesn't depend on drivers' complete and
>> DMA mapping/unmapping any more, also we can simplify HCD since the HCD
>> lock isn't needed to be released during irq handling.
>
> I'm not convinced.  In particular, I'm not convinced that this is
> better than using a threaded interrupt handler.

Threaded interrupt handler should be OK but with more context switch
cost, so URB giveback may be handled a bit lately and performance
might be affected. Also ISOC transfer for video/audio applicaton should
be handled with as less latency as possible to provide good user
experience, thread handler may be delayed a bit too long to meet the
demand.

Also there is no any sleep in usb_hcd_giveback_urb(), so tasklet
should be better.

>
>> The patch should be reasonable and doable:
>>
>> 1), for drivers, they don't care if the complete() is called in hard irq
>> context or softirq context
>
> True.
>
>> 2), the biggest change is the situation in which usb_submit_urb() is called
>> in complete() callback, so the introduced tasklet schedule delay might be a
>> con, but it shouldn't be a big deal:
>>
>>       - control/bulk asynchronous transfer isn't sensitive to schedule
>>         delay
>
> Mass-storage device access (which uses bulk async transfers) certainly
> _is_ sensitive to scheduling delays.
>
>>       - the patch schedules giveback of periodic URBs using
>>         tasklet_hi_schedule, so the introduced delay should be very
>>         small
>>
>>       - use percpu tasklset for each HCD to schedule giveback of URB,
>>         which are running as much as parallel
>
> That might be an advantage; it depends on the work load.  But if the
> tasklet ends up running on a different CPU from the task that submitted
> the URB originally, it would be less of an advantage.
>
>>       - for ISOC transfer, generally, drivers submit several URBs
>>         concurrently to avoid interrupt delay, so it is OK with the
>>         little schedule delay.
>
> This varies.  Some drivers use very low overheads for isochronous
> transfers.  A delay of even one millisecond might be too long for them.
>
>>       - for interrupt transfer, generally, drivers only submit one URB
>>         at the same time, but interrupt transfer is often used in event
>>         report, polling, ... situations, and a little delay should be OK.
>
> Agreed.
>
> Alan Stern
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-09 15:18 ` [RFC PATCH 1/4] USB: HCD: support " Ming Lei
  2013-06-09 15:58   ` Alan Stern
@ 2013-06-10  8:43   ` Oliver Neukum
  2013-06-10  9:23     ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-10  8:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Sunday 09 June 2013 23:18:28 Ming Lei wrote:
> 2), the biggest change is the situation in which usb_submit_urb() is called
> in complete() callback, so the introduced tasklet schedule delay might be a
> con, but it shouldn't be a big deal:
> 
>         - control/bulk asynchronous transfer isn't sensitive to schedule
>           delay

That is debatable.Missing a frame boundary is expensive because the increased
latency then translates into lower throughput.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback in tasklet context
  2013-06-09 16:06   ` Alan Stern
@ 2013-06-10  9:10     ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-10  9:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 10, 2013 at 12:06 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Sun, 9 Jun 2013, Ming Lei wrote:
>
>> If HCD_BH is set for HC driver's flags, URB giveback will be
>> done in tasklet context instead of interrupt context, so the
>> ehci->lock needn't to be released any more before calling
>> usb_hcd_giveback_urb().
>
> Ah, this is the reason why you don't need to release the private lock.
> I'm not sure that this will save much time overall.

I did't mention this 3/4 patch can save much time. In fact, without this
patch, handling URB giveback still can work. But releasing the
private lock can make hard irq handler become random long,
since reacquiring the lock may busy wait quite a while until other
CPUs released the lock.

>
> With the existing code, the main reason for lock contention would be if
> (for example) CPU 2 submitted or cancelled an URB while the giveback
> was occurring on CPU 1.  Then CPU 1 would be forced to wait while CPU 2
> finished its submission or cancellation.
>
> With this patch, it's the other way around.  CPU 2 would be forced to
> wait while CPU 1 does all the rest of the work in its interrupt
> handler.  The total time spent holding the lock will be the same; it
> will just be distributed differently among the CPUs.  Hence it is not
> at all clear that this change will save any time.

Generally the interrupt handler without URB giveback consumes not
much time, and the time consumed is about 10~20us on fast machine,
and <40us on slow machine averagely per my tests in 4/4 commit log.

You may ague the situation would become worse if there are many
URBs to be handled in ehci_irq() on CPU1, but I think it should be
very very rare to happen attributed to this patchset: HCD hard interrupt
handler takes much less time than general HCD irq interval(for example,
in EHCI, generally there won't have many URBs to be completed in one
uFrame, and the URBs completed in next uFrame will interrupt CPUs
125us later)

Also we may introduce another lock/flags to let CPU1 handle CPU2's
interrupts and let CPU2 return IRQ_HANDLED immediately, but I am
not sure if it is necessary.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-10  8:43   ` Oliver Neukum
@ 2013-06-10  9:23     ` Ming Lei
  2013-06-10  9:31       ` Oliver Neukum
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-10  9:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 10, 2013 at 4:43 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Sunday 09 June 2013 23:18:28 Ming Lei wrote:
>> 2), the biggest change is the situation in which usb_submit_urb() is called
>> in complete() callback, so the introduced tasklet schedule delay might be a
>> con, but it shouldn't be a big deal:
>>
>>         - control/bulk asynchronous transfer isn't sensitive to schedule
>>           delay
>
> That is debatable.Missing a frame boundary is expensive because the increased
> latency then translates into lower throughput.

Suppose so, considered that bulk transfer will do large data block transfer, and
the extra frame or uFrame doesn't matter over the whole transfer time.

Also the tasklet function will be scheduled once the hard interrupt handler
completes, and the delay is often several microseconds or smaller, which
has a very low probability to miss frame/uframe boundary.

Even with submitting URBs in hardware interrupt handler, there is still the
interrupt handling delay, isn't there? (So disabling interrupt too
long is really
very bad, :-))

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-10  9:23     ` Ming Lei
@ 2013-06-10  9:31       ` Oliver Neukum
  2013-06-10  9:51         ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-10  9:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday 10 June 2013 17:23:46 Ming Lei wrote:
> On Mon, Jun 10, 2013 at 4:43 PM, Oliver Neukum <oliver@neukum.org> wrote:
> > On Sunday 09 June 2013 23:18:28 Ming Lei wrote:
> >> 2), the biggest change is the situation in which usb_submit_urb() is called
> >> in complete() callback, so the introduced tasklet schedule delay might be a
> >> con, but it shouldn't be a big deal:
> >>
> >>         - control/bulk asynchronous transfer isn't sensitive to schedule
> >>           delay
> >
> > That is debatable.Missing a frame boundary is expensive because the increased
> > latency then translates into lower throughput.
> 
> Suppose so, considered that bulk transfer will do large data block transfer, and
> the extra frame or uFrame doesn't matter over the whole transfer time.

That is not true for all use cases. Networking looks vulnerable.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 1/4] USB: HCD: support giveback of URB in tasklet context
  2013-06-10  9:31       ` Oliver Neukum
@ 2013-06-10  9:51         ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-10  9:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 10, 2013 at 5:31 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Monday 10 June 2013 17:23:46 Ming Lei wrote:
>> On Mon, Jun 10, 2013 at 4:43 PM, Oliver Neukum <oliver@neukum.org> wrote:
>> > On Sunday 09 June 2013 23:18:28 Ming Lei wrote:
>> >> 2), the biggest change is the situation in which usb_submit_urb() is called
>> >> in complete() callback, so the introduced tasklet schedule delay might be a
>> >> con, but it shouldn't be a big deal:
>> >>
>> >>         - control/bulk asynchronous transfer isn't sensitive to schedule
>> >>           delay
>> >
>> > That is debatable.Missing a frame boundary is expensive because the increased
>> > latency then translates into lower throughput.
>>
>> Suppose so, considered that bulk transfer will do large data block transfer, and
>> the extra frame or uFrame doesn't matter over the whole transfer time.
>
> That is not true for all use cases. Networking looks vulnerable.

> That is debatable.Missing a frame boundary is expensive because the increased
> latency then translates into lower throughput.

Missing uframe/frame boundary doesn't cause lower throughput since network
devices use bulk transfer, which is scheduled in hw aync queue and there should
always have pending transfers in the queue when submitting bulk URB to the
queue.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10  6:05   ` Ming Lei
@ 2013-06-10 14:03     ` Alan Stern
  2013-06-10 14:12       ` Oliver Neukum
  2013-06-10 15:37       ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-10 14:03 UTC (permalink / raw)
  To: linux-arm-kernel

[Thomas and Steve, please see the question below.]

On Mon, 10 Jun 2013, Ming Lei wrote:

> On Sun, Jun 9, 2013 at 11:48 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Sun, 9 Jun 2013, Ming Lei wrote:
> >
> >> Hi,
> >>
> >> The patchset supports to run giveback of URB in tasklet context, so that
> >> DMA unmapping/mapping on transfer buffer and compelte() callback can be
> >> run with interrupt enabled, then time of HCD interrupt handler can be
> >> saved much. Also this approach may simplify HCD since HCD lock needn't
> >> be released any more before calling usb_hcd_giveback_urb().
> >
> > That's a lot of material.  However, you haven't specifically said
> > _what_ you want to accomplish.  It sounds like you're trying to
> 
> All IRQs are disabled(locally) in hard interrupt context, but they are
> enabled in softirq context, this patchset can decrease the time a lot.

That isn't clear.  It is documented that USB completion handlers are 
called with local interrupts disabled.  An IRQ arriving before the 
tasklet starts up might therefore be serviced during the short interval 
before the tasklet disables local interrupts, but if that happens it 
would mean that the completion handler would be delayed.

In effect, you are prioritizing other IRQs higher than USB completion
handlers.  Nevertheless, the total time spent with interrupts disabled
will remain the same.

(There's one other side effect that perhaps you haven't considered.  
When multiple URBs are addressed to the same endpoint, their completion 
handlers are called in order of URB completion, which is the same as 
the order of URB submission unless an URB is cancelled.  By delegating 
the completion handlers to tasklets, and particularly by using per-CPU 
tasklets, you may violate this behavior.)

> > minimize the amount of time spent in hardirq context, but I can't be
> > sure that is really what you want.
> >
> > If it is, this is not a good way to do it.  A better way to minimize
> > the time spent in hardirq context is to use a threaded interrupt
> > handler.  But why do you want to minimize this time in the first place?
> 
> Process context switch may have more cost than switching to softirq,
> then USB performance may be effected, that is why I choose to use
> tasklet.  Also now there is no any sleep in URB giveback, so it isn't
> necessary to introduce process to handle the giveback or completion.

Thomas and Steve, can you comment on this?  I do not have a good 
understanding of the relative benefits/drawbacks of tasklets vs. 
threaded interrupt handlers.

> > The CPU will be still have to use at least as much time as before; the
> > only difference is that it will be in softirq or in a high-priority
> > thread instead of in hardirq.
> 
> Yes, as I said above, but we can allow other IRQs to be served in
> handling URB giveback in softirq context, that is the value of softirq/tasklet.
> Of course, system may have more important things to do during the time.
> Generally speaking, for one driver, hard interrupt handler should always
> consume less time as possible.

As I understand it, the tradeoff between hardirq and softirq is
generally unimportant.  It does matter a lot in realtime settings --
but the RT kernel effectively makes _every_ interrupt handler threaded,
so it won't benefit from this change.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 14:03     ` Alan Stern
@ 2013-06-10 14:12       ` Oliver Neukum
  2013-06-10 15:33         ` Alan Stern
  2013-06-10 15:37       ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-10 14:12 UTC (permalink / raw)
  To: linux-arm-kernel

On Monday 10 June 2013 10:03:00 Alan Stern wrote:
> [Thomas and Steve, please see the question below.]
> 
> On Mon, 10 Jun 2013, Ming Lei wrote:
> 

> That isn't clear.  It is documented that USB completion handlers are 
> called with local interrupts disabled.  An IRQ arriving before the 
> tasklet starts up might therefore be serviced during the short interval 
> before the tasklet disables local interrupts, but if that happens it 
> would mean that the completion handler would be delayed.

That is what tasklets do by definition, isn't it?
 
> In effect, you are prioritizing other IRQs higher than USB completion
> handlers.  Nevertheless, the total time spent with interrupts disabled
> will remain the same.

It pobably slightly increases. You have colder caches twice.
And potentially you swich CPUs.
 
> (There's one other side effect that perhaps you haven't considered.  
> When multiple URBs are addressed to the same endpoint, their completion 
> handlers are called in order of URB completion, which is the same as 
> the order of URB submission unless an URB is cancelled.  By delegating 
> the completion handlers to tasklets, and particularly by using per-CPU 
> tasklets, you may violate this behavior.)

This is quite serious. It mustn't happen.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 14:12       ` Oliver Neukum
@ 2013-06-10 15:33         ` Alan Stern
  0 siblings, 0 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-10 15:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 10 Jun 2013, Oliver Neukum wrote:

> On Monday 10 June 2013 10:03:00 Alan Stern wrote:
> > [Thomas and Steve, please see the question below.]
> > 
> > On Mon, 10 Jun 2013, Ming Lei wrote:
> > 
> 
> > That isn't clear.  It is documented that USB completion handlers are 
> > called with local interrupts disabled.  An IRQ arriving before the 
> > tasklet starts up might therefore be serviced during the short interval 
> > before the tasklet disables local interrupts, but if that happens it 
> > would mean that the completion handler would be delayed.
> 
> That is what tasklets do by definition, isn't it?

Yes.  Although tasklets in general have no reason to leave interrupts 
disabled.

> > In effect, you are prioritizing other IRQs higher than USB completion
> > handlers.  Nevertheless, the total time spent with interrupts disabled
> > will remain the same.
> 
> It pobably slightly increases. You have colder caches twice.
> And potentially you swich CPUs.

Actually, the situation is a little better in one respect, which was
pointed out earlier but not emphasized.  The DMA unmapping can be
deferred to the tasklet, and it can run with interrupts enabled before
the completion hanlder is called.  Since the unmapping sometimes takes
a while to run, this is an advantage.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 14:03     ` Alan Stern
  2013-06-10 14:12       ` Oliver Neukum
@ 2013-06-10 15:37       ` Ming Lei
  2013-06-10 17:36         ` Alan Stern
  1 sibling, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-10 15:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jun 10, 2013 at 10:03 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> [Thomas and Steve, please see the question below.]
>
> On Mon, 10 Jun 2013, Ming Lei wrote:
>
>> On Sun, Jun 9, 2013 at 11:48 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> > On Sun, 9 Jun 2013, Ming Lei wrote:
>> >
>> >> Hi,
>> >>
>> >> The patchset supports to run giveback of URB in tasklet context, so that
>> >> DMA unmapping/mapping on transfer buffer and compelte() callback can be
>> >> run with interrupt enabled, then time of HCD interrupt handler can be
>> >> saved much. Also this approach may simplify HCD since HCD lock needn't
>> >> be released any more before calling usb_hcd_giveback_urb().
>> >
>> > That's a lot of material.  However, you haven't specifically said
>> > _what_ you want to accomplish.  It sounds like you're trying to
>>
>> All IRQs are disabled(locally) in hard interrupt context, but they are
>> enabled in softirq context, this patchset can decrease the time a lot.
>
> That isn't clear.  It is documented that USB completion handlers are
> called with local interrupts disabled.  An IRQ arriving before the

Is there any good reason to do URB giveback with interrupt disabled?

IMO, disabling IRQs isn't necessary for completing URB since the
complete() of URBs for same endpoint is reentrant.

> tasklet starts up might therefore be serviced during the short interval
> before the tasklet disables local interrupts, but if that happens it

Tasklet doesn't disable local interrupts.

> would mean that the completion handler would be delayed.

Yes, the tasklet schedule delay is introduced to URB completion, but
the delay doesn't have much side effect about completing URB.

For async transfer, generally, it should have no effect since there should
have URBs queued on the qh queue before submitting URB.

For periodic transfer, the patch uses high priority tasklet to schedule it.

>
> In effect, you are prioritizing other IRQs higher than USB completion
> handlers.

Both other IRQs and HCD's IRQ are prioritizing the URB completion.

> Nevertheless, the total time spent with interrupts disabled
> will remain the same.

No, the total time spent with interrupts disabled does not remain same,
since no local IRQs are disabled during URB giveback anymore.

>
> (There's one other side effect that perhaps you haven't considered.
> When multiple URBs are addressed to the same endpoint, their completion
> handlers are called in order of URB completion, which is the same as
> the order of URB submission unless an URB is cancelled.  By delegating
> the completion handlers to tasklets, and particularly by using per-CPU
> tasklets, you may violate this behavior.)

Even though URB giveback is handled by hard interrupt handler, it still can't
guarantee that 100%,  please see below:

- interrupt for URB A comes in CPU 0, and handled, then HCD private
lock is released, and usb_hcd_giveback_urb() is called to complete URB A

- interrupt for URB B comes in CPU 1 just before CPU0 releases HCD private
lock, so CPU1 holds the private lock when CPU 0 releases the lock and
handles the HCD interrupt.

So now just like two CPUs are racing, if CPU0 wins, URB A is completed
first, otherwise URB B is completed first.

With introducing completing URB in percpu tasklet, the situation is
just very similar.

On ehci hcd, it can make sure that by always letting CPU0 complete all URBs
(use ehci->scanning/ehci->need_rescan flag), but the patch can change percpu
tasklet to tasklet to meet the demand.

>
>> > minimize the amount of time spent in hardirq context, but I can't be
>> > sure that is really what you want.
>> >
>> > If it is, this is not a good way to do it.  A better way to minimize
>> > the time spent in hardirq context is to use a threaded interrupt
>> > handler.  But why do you want to minimize this time in the first place?
>>
>> Process context switch may have more cost than switching to softirq,
>> then USB performance may be effected, that is why I choose to use
>> tasklet.  Also now there is no any sleep in URB giveback, so it isn't
>> necessary to introduce process to handle the giveback or completion.
>
> Thomas and Steve, can you comment on this?  I do not have a good
> understanding of the relative benefits/drawbacks of tasklets vs.
> threaded interrupt handlers.
>
>> > The CPU will be still have to use at least as much time as before; the
>> > only difference is that it will be in softirq or in a high-priority
>> > thread instead of in hardirq.
>>
>> Yes, as I said above, but we can allow other IRQs to be served in
>> handling URB giveback in softirq context, that is the value of softirq/tasklet.
>> Of course, system may have more important things to do during the time.
>> Generally speaking, for one driver, hard interrupt handler should always
>> consume less time as possible.
>
> As I understand it, the tradeoff between hardirq and softirq is
> generally unimportant.  It does matter a lot in realtime settings --

I don't think so, and obviously softirq allows IRQs to be served quickly,
otherwise why is softirq and tasklet introduced? and why so many drivers
are using tasklet/softirq?

> but the RT kernel effectively makes _every_ interrupt handler threaded,
> so it won't benefit from this change.

We still need mainline kernel to benefit the change, :-)

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 15:37       ` Ming Lei
@ 2013-06-10 17:36         ` Alan Stern
  2013-06-10 18:52           ` Steven Rostedt
                             ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-10 17:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 10 Jun 2013, Ming Lei wrote:

> > That isn't clear.  It is documented that USB completion handlers are
> > called with local interrupts disabled.  An IRQ arriving before the
> 
> Is there any good reason to do URB giveback with interrupt disabled?

That's a good question.  Offhand I can't think of any drivers that rely
on this -- although there certainly are places where callback routines
use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
because they know that interrupts are already disabled.

However, changing a documented API is not something to be done lightly.  
You would have to audit _every_ USB driver to make sure no trouble
would arise.

> IMO, disabling IRQs isn't necessary for completing URB since the
> complete() of URBs for same endpoint is reentrant.

Whether it is _necessary_ isn't the point.  It is _documented_, and 
therefore it cannot be changed easily.

> > tasklet starts up might therefore be serviced during the short interval
> > before the tasklet disables local interrupts, but if that happens it
> 
> Tasklet doesn't disable local interrupts.

It is documented that interrupts will be disabled while the completion 
handler runs.  Therefore the tasklet _must_ disable local interrupts.

> > would mean that the completion handler would be delayed.
> 
> Yes, the tasklet schedule delay is introduced to URB completion, but
> the delay doesn't have much side effect about completing URB.

What about delays that occur because a separate interrupt is serviced
before the URB's completion handler is called?  With the current code
that can't happen, but with your scheme it can.

> For async transfer, generally, it should have no effect since there should
> have URBs queued on the qh queue before submitting URB.

That's not how the Bulk-Only mass-storage protocol works.  There are
times when the protocol _requires_ bulk packets not to be submitted
until a particular URB has completed.

Since mass-storage is an important use case for USB, its requirements 
should not be ignored.

> > In effect, you are prioritizing other IRQs higher than USB completion
> > handlers.
> 
> Both other IRQs and HCD's IRQ are prioritizing the URB completion.

I don't understand that sentence.  "Prioritizing" means "assigning a 
priority to".  IRQs and HCDs don't assign priorities to anything; 
priorities are assigned by human programmers.

> > Nevertheless, the total time spent with interrupts disabled
> > will remain the same.
> 
> No, the total time spent with interrupts disabled does not remain same,
> since no local IRQs are disabled during URB giveback anymore.

That is a bug.  Local IRQs must be disabled during URB giveback.

> > (There's one other side effect that perhaps you haven't considered.
> > When multiple URBs are addressed to the same endpoint, their completion
> > handlers are called in order of URB completion, which is the same as
> > the order of URB submission unless an URB is cancelled.  By delegating
> > the completion handlers to tasklets, and particularly by using per-CPU
> > tasklets, you may violate this behavior.)
> 
> Even though URB giveback is handled by hard interrupt handler, it still can't
> guarantee that 100%,  please see below:
> 
> - interrupt for URB A comes in CPU 0, and handled, then HCD private
> lock is released, and usb_hcd_giveback_urb() is called to complete URB A
> 
> - interrupt for URB B comes in CPU 1 just before CPU0 releases HCD private
> lock, so CPU1 holds the private lock when CPU 0 releases the lock and
> handles the HCD interrupt.
> 
> So now just like two CPUs are racing, if CPU0 wins, URB A is completed
> first, otherwise URB B is completed first.

Although you didn't say so, I assume you mean that A and B are URBs for 
the same endpoint.  This means they use the same host controller and 
hence they use the same IRQ line.

When an interrupt is handled, it is not possible for a second interrupt
to arrive on the same IRQ line before the handler for the first 
interrupt returns.  The kernel disables the IRQ line when the first 
interrupt occurs, and keeps it disabled until all the handlers are 
finished.  Therefore the scenario you described cannot occur.

> With introducing completing URB in percpu tasklet, the situation is
> just very similar.

No, it isn't, for the reason given above.

> On ehci hcd, it can make sure that by always letting CPU0 complete all URBs
> (use ehci->scanning/ehci->need_rescan flag), but the patch can change percpu
> tasklet to tasklet to meet the demand.

Or perhaps you could assign each endpoint to a particular tasklet.

> > As I understand it, the tradeoff between hardirq and softirq is
> > generally unimportant.  It does matter a lot in realtime settings --
> 
> I don't think so, and obviously softirq allows IRQs to be served quickly,
> otherwise why is softirq and tasklet introduced? and why so many drivers
> are using tasklet/softirq?

This is part of what I would like to hear Thomas and Steve comment on.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 17:36         ` Alan Stern
@ 2013-06-10 18:52           ` Steven Rostedt
  2013-06-10 20:47             ` Alan Stern
  2013-06-10 20:51           ` Alan Stern
  2013-06-11  5:40           ` Ming Lei
  2 siblings, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2013-06-10 18:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2013-06-10 at 13:36 -0400, Alan Stern wrote:

> > I don't think so, and obviously softirq allows IRQs to be served quickly,
> > otherwise why is softirq and tasklet introduced?

Because of history. In the old days there were top halves (hard irq) and
bottom halves (like a softirq). The thing about a bottom half was, no
two could run at the same time. That is, if one CPU was running a bottom
half, another CPU would have to wait for that one to complete before it
could run any bottom half. The killer here is that it didn't matter if
it was the same bottom half or not! Microsoft/Mindcraft was kind enough
to point out this inefficiency with some benchmarks showing how horrible
Linux ran on a 4 way box. Thanks to them, softirqs and tasklets were
born :-)

A softirq does the rest of the interrupt work with interrupts enabled.

A tasklet is run from softirq context, but the difference between a
softirq and tasklet is that the same tasklet can not run on two
different CPUs at the same time. It acts similar to a task, as a task
can not run on two different CPUs at the same time either, hence the
name "tasklet". But tasklets are better than bottom halves because it
allows for different tasklets to run on different CPUs.



>  and why so many drivers
> > are using tasklet/softirq?

Because it's easy to set up and device driver authors don't know any
better ;-). Note, a lot of drivers are now using work queues today,
which run as a task.

Yes, there's a little more overhead with a task than for a softirq, but
the problem with softirqs and tasklets is that they can't be preempted,
and they are more important than all tasks on the system. If you have a
task that is critical, it must yield for your softirq. Almost!

That is, even if you have a softirq, it may not run in irq context at
all. If you get too many softirqs at a time (one comes while another one
is running), it will defer the processing to the ksoftirq thread. This
not only runs as a task, but also runs as SCHED_OTHER, and must yield to
other tasks like everyone else too.

Thus, adding it as a softirq does not guarantee that it will be
processed quickly. It just means that most of the time it will prevent
anything else from happening while your "most important handler in the
world" is running.

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 18:52           ` Steven Rostedt
@ 2013-06-10 20:47             ` Alan Stern
  2013-06-10 20:54               ` Steven Rostedt
  0 siblings, 1 reply; 66+ messages in thread
From: Alan Stern @ 2013-06-10 20:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 10 Jun 2013, Steven Rostedt wrote:

> >  and why so many drivers
> > > are using tasklet/softirq?
> 
> Because it's easy to set up and device driver authors don't know any
> better ;-). Note, a lot of drivers are now using work queues today,
> which run as a task.
> 
> Yes, there's a little more overhead with a task than for a softirq, but
> the problem with softirqs and tasklets is that they can't be preempted,
> and they are more important than all tasks on the system. If you have a
> task that is critical, it must yield for your softirq. Almost!
> 
> That is, even if you have a softirq, it may not run in irq context at
> all. If you get too many softirqs at a time (one comes while another one
> is running), it will defer the processing to the ksoftirq thread. This
> not only runs as a task, but also runs as SCHED_OTHER, and must yield to
> other tasks like everyone else too.
> 
> Thus, adding it as a softirq does not guarantee that it will be
> processed quickly. It just means that most of the time it will prevent
> anything else from happening while your "most important handler in the
> world" is running.

>From this, it sounds like you generally advise using threaded interrupt 
handlers rather than tasklets/softirqs.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 17:36         ` Alan Stern
  2013-06-10 18:52           ` Steven Rostedt
@ 2013-06-10 20:51           ` Alan Stern
  2013-06-11  6:19             ` Ming Lei
  2013-06-11  5:40           ` Ming Lei
  2 siblings, 1 reply; 66+ messages in thread
From: Alan Stern @ 2013-06-10 20:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 10 Jun 2013, Alan Stern wrote:

> > Tasklet doesn't disable local interrupts.
> 
> It is documented that interrupts will be disabled while the completion 
> handler runs.  Therefore the tasklet _must_ disable local interrupts.

You know, it may be that you can get most of the advantages you want by 
enabling local interrupts around the call to unmap_urb_for_dma() in 
usb_hcd_giveback_urb().

This may be a little dangerous, though, because it is possible for an
URB to be given back at the time it is submitted.  Drivers may not 
expect interrupts to get enabled temporarily when they call 
usb_submit_urb().

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 20:47             ` Alan Stern
@ 2013-06-10 20:54               ` Steven Rostedt
  2013-06-11  8:40                 ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2013-06-10 20:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, 2013-06-10 at 16:47 -0400, Alan Stern wrote:
> On Mon, 10 Jun 2013, Steven Rostedt wrote:
> 
> > >  and why so many drivers
> > > > are using tasklet/softirq?
> > 
> > Because it's easy to set up and device driver authors don't know any
> > better ;-). Note, a lot of drivers are now using work queues today,
> > which run as a task.
> > 
> > Yes, there's a little more overhead with a task than for a softirq, but
> > the problem with softirqs and tasklets is that they can't be preempted,
> > and they are more important than all tasks on the system. If you have a
> > task that is critical, it must yield for your softirq. Almost!
> > 
> > That is, even if you have a softirq, it may not run in irq context at
> > all. If you get too many softirqs at a time (one comes while another one
> > is running), it will defer the processing to the ksoftirq thread. This
> > not only runs as a task, but also runs as SCHED_OTHER, and must yield to
> > other tasks like everyone else too.
> > 
> > Thus, adding it as a softirq does not guarantee that it will be
> > processed quickly. It just means that most of the time it will prevent
> > anything else from happening while your "most important handler in the
> > world" is running.
> 
> >From this, it sounds like you generally advise using threaded interrupt 
> handlers rather than tasklets/softirqs.

Yes, there's plenty of benefits for them, and I highly doubt that any
USB device would even notice the difference between a softirq and a
thread for response time latencies.

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 17:36         ` Alan Stern
  2013-06-10 18:52           ` Steven Rostedt
  2013-06-10 20:51           ` Alan Stern
@ 2013-06-11  5:40           ` Ming Lei
  2013-06-11  7:18             ` Oliver Neukum
  2013-06-11 19:10             ` Alan Stern
  2 siblings, 2 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-11  5:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 11, 2013 at 1:36 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Mon, 10 Jun 2013, Ming Lei wrote:
>
>> > That isn't clear.  It is documented that USB completion handlers are
>> > called with local interrupts disabled.  An IRQ arriving before the
>>
>> Is there any good reason to do URB giveback with interrupt disabled?
>
> That's a good question.  Offhand I can't think of any drivers that rely
> on this -- although there certainly are places where callback routines
> use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
> because they know that interrupts are already disabled.

Complete() may take long time, for example in UVC's driver, URB's
complete() may take more than 3 ms, as reported in below link:

      http://marc.info/?t=136438111600010&r=1&w=2

Also complete() is provided by interface driver, and it does many driver
specific works, so it isn't good to disable interrupt only for one driver.

If complete() callback runs in one tasklet context, spin_lock() inside
complete() is enough, just like hard irq, the tasklet itself is disabled
during complete(), if the percpu tasklet is converted to single tasklet.
So no problem when the lock is to protect shared resources between
complete().

When the lock is to protect shared resources between complete() and
non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
context, which is enough to prevent tasklet from being run on the CPU,
so no problem for this situation too.

When all HCDs support to run URB giveback in tasklet context, we can
change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
in other places, the spin_lock_irq*() can be changed to spin_lock_bh().

>
> However, changing a documented API is not something to be done lightly.
> You would have to audit _every_ USB driver to make sure no trouble
> would arise.

OK, I will write patch to request for changing the API if the improvement
on the situation isn't objected.

In fact, at least now, the change on API is only about document change, and
no drivers' change is required, isn't it?  We can describe that URB->complete()
might run in hard irq or softirq context, depends on HCD_BH flag of
hcd->driver->flags.

>
>> IMO, disabling IRQs isn't necessary for completing URB since the
>> complete() of URBs for same endpoint is reentrant.
>
> Whether it is _necessary_ isn't the point.  It is _documented_, and
> therefore it cannot be changed easily.

Same with above.

>
>> > tasklet starts up might therefore be serviced during the short interval
>> > before the tasklet disables local interrupts, but if that happens it
>>
>> Tasklet doesn't disable local interrupts.
>
> It is documented that interrupts will be disabled while the completion
> handler runs.  Therefore the tasklet _must_ disable local interrupts.

Same with above, we can change the document which itself shouldn't
have been the only reason for blocking the change, :-)

>> > would mean that the completion handler would be delayed.
>>
>> Yes, the tasklet schedule delay is introduced to URB completion, but
>> the delay doesn't have much side effect about completing URB.
>
> What about delays that occur because a separate interrupt is serviced
> before the URB's completion handler is called?  With the current code
> that can't happen, but with your scheme it can.

Even with current code, one HCD's interrupt handling still may delay
the handling for another HCD interrupt handling for quite long, that is
the motivation to decrease the interrupt handling time of USB HCD, ;-)
And finally, USB HCDs themselves will benefit from the change, won't they?

Also it shouldn't be a problem since most of interrupt handlers
takes very less time.

>> For async transfer, generally, it should have no effect since there should
>> have URBs queued on the qh queue before submitting URB.
>
> That's not how the Bulk-Only mass-storage protocol works.  There are
> times when the protocol _requires_ bulk packets not to be submitted
> until a particular URB has completed.

Yes, I agree. But mass-storage driver itself is very fragile, it will perform
badly if CPU has a higher load.  And it is better to submit DATA/CSW
transfer in previous transfer's complete(), IMO.

Compared with "usb-storage" task's schedule latency, the tasklet
schedule delay should be lower at most of situations since the tasklet
is scheduled inside irq handler.

>
> Since mass-storage is an important use case for USB, its requirements
> should not be ignored.

That is why I did many tests on mass storage case, and from the results
in commit log of patch 4/4, looks no performance loss is involved with
the change on usb storage.

>
>> > In effect, you are prioritizing other IRQs higher than USB completion
>> > handlers.
>>
>> Both other IRQs and HCD's IRQ are prioritizing the URB completion.
>
> I don't understand that sentence.  "Prioritizing" means "assigning a
> priority to".  IRQs and HCDs don't assign priorities to anything;
> priorities are assigned by human programmers.

Sorry, prioritizing should have been "prioritized".

>> > Nevertheless, the total time spent with interrupts disabled
>> > will remain the same.
>>
>> No, the total time spent with interrupts disabled does not remain same,
>> since no local IRQs are disabled during URB giveback anymore.
>
> That is a bug.  Local IRQs must be disabled during URB giveback.

I guess it is still the document API. If so, please see my above explanation,
and I will request to change it.

>> > (There's one other side effect that perhaps you haven't considered.
>> > When multiple URBs are addressed to the same endpoint, their completion
>> > handlers are called in order of URB completion, which is the same as
>> > the order of URB submission unless an URB is cancelled.  By delegating
>> > the completion handlers to tasklets, and particularly by using per-CPU
>> > tasklets, you may violate this behavior.)
>>
>> Even though URB giveback is handled by hard interrupt handler, it still can't
>> guarantee that 100%,  please see below:
>>
>> - interrupt for URB A comes in CPU 0, and handled, then HCD private
>> lock is released, and usb_hcd_giveback_urb() is called to complete URB A
>>
>> - interrupt for URB B comes in CPU 1 just before CPU0 releases HCD private
>> lock, so CPU1 holds the private lock when CPU 0 releases the lock and
>> handles the HCD interrupt.
>>
>> So now just like two CPUs are racing, if CPU0 wins, URB A is completed
>> first, otherwise URB B is completed first.
>
> Although you didn't say so, I assume you mean that A and B are URBs for
> the same endpoint.  This means they use the same host controller and

Exactly.

> hence they use the same IRQ line.
>
> When an interrupt is handled, it is not possible for a second interrupt
> to arrive on the same IRQ line before the handler for the first
> interrupt returns.  The kernel disables the IRQ line when the first
> interrupt occurs, and keeps it disabled until all the handlers are
> finished.  Therefore the scenario you described cannot occur.

OK, thanks for pointing it out.

So I decided to replace the percpu tasklet with one single tasklet,
which can avoid the problem.

Also the tasklet running in CPU0 can handle the work which should
have been done by the same tasket scheduled in CPU1,  so we can
avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
in CPU1.

>> With introducing completing URB in percpu tasklet, the situation is
>> just very similar.
>
> No, it isn't, for the reason given above.

Right, but the single tasklet may avoid the problem.

>
>> On ehci hcd, it can make sure that by always letting CPU0 complete all URBs
>> (use ehci->scanning/ehci->need_rescan flag), but the patch can change percpu
>> tasklet to tasklet to meet the demand.
>
> Or perhaps you could assign each endpoint to a particular tasklet.

That still need changing percpu tasklet to single tasklet, since per-endpoint
tasklet can't be distributed on each CPU.

>> > As I understand it, the tradeoff between hardirq and softirq is
>> > generally unimportant.  It does matter a lot in realtime settings --
>>
>> I don't think so, and obviously softirq allows IRQs to be served quickly,
>> otherwise why is softirq and tasklet introduced? and why so many drivers
>> are using tasklet/softirq?
>
> This is part of what I would like to hear Thomas and Steve comment on.

Me too, :-)

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 20:51           ` Alan Stern
@ 2013-06-11  6:19             ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-11  6:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 11, 2013 at 4:51 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Mon, 10 Jun 2013, Alan Stern wrote:
>
>> > Tasklet doesn't disable local interrupts.
>>
>> It is documented that interrupts will be disabled while the completion
>> handler runs.  Therefore the tasklet _must_ disable local interrupts.
>
> You know, it may be that you can get most of the advantages you want by
> enabling local interrupts around the call to unmap_urb_for_dma() in
> usb_hcd_giveback_urb().

No, please don't enable IRQs inside interrupt handler, and you will
get warning dump.

Except for unmap_urb_for_dma() in usb_hcd_giveback_urb(),  I hope
map_urb_for_dma() inside complete() can benefit from the change too,
since map_urb_for_dma() is same time-consuming with unmap_urb_for_dma().

Also I hope complete() can be run with interrupt enabled since
driver's complete() may take long time on some ARCH, as I mentioned
in my last mail. And it isn't good to disable interrupt only for one interface
driver, looks I still don't get real good explanation about disabling
IRQs here, :-)

>
> This may be a little dangerous, though, because it is possible for an
> URB to be given back at the time it is submitted.  Drivers may not
> expect interrupts to get enabled temporarily when they call
> usb_submit_urb().

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11  5:40           ` Ming Lei
@ 2013-06-11  7:18             ` Oliver Neukum
  2013-06-11  8:14               ` Ming Lei
  2013-06-11 19:10             ` Alan Stern
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-11  7:18 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 11 June 2013 13:40:20 Ming Lei wrote:
> On Tue, Jun 11, 2013 at 1:36 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> > On Mon, 10 Jun 2013, Ming Lei wrote:

> If complete() callback runs in one tasklet context, spin_lock() inside
> complete() is enough, just like hard irq, the tasklet itself is disabled
> during complete(), if the percpu tasklet is converted to single tasklet.
> So no problem when the lock is to protect shared resources between
> complete().

We also have exclusion between complete() and other contexts, i.e. timers.

> When the lock is to protect shared resources between complete() and
> non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
> context, which is enough to prevent tasklet from being run on the CPU,
> so no problem for this situation too.
> 
> When all HCDs support to run URB giveback in tasklet context, we can
> change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
> in other places, the spin_lock_irq*() can be changed to spin_lock_bh().

Even now we cannot guarantee that all calls to complete() are in irq.
There is the case of HCD hotunplug and other cases, like timeouts.
They will have to be verified.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11  7:18             ` Oliver Neukum
@ 2013-06-11  8:14               ` Ming Lei
  2013-06-11  8:49                 ` Oliver Neukum
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-11  8:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 11, 2013 at 3:18 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Tuesday 11 June 2013 13:40:20 Ming Lei wrote:
>> On Tue, Jun 11, 2013 at 1:36 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> > On Mon, 10 Jun 2013, Ming Lei wrote:
>
>> If complete() callback runs in one tasklet context, spin_lock() inside
>> complete() is enough, just like hard irq, the tasklet itself is disabled
>> during complete(), if the percpu tasklet is converted to single tasklet.
>> So no problem when the lock is to protect shared resources between
>> complete().
>
> We also have exclusion between complete() and other contexts, i.e. timers.

The patch also converts the complete() called in timers to tasklet context too.

So could you let me know if that eases your concern? If not, could you explain
your concern about other contexts in a bit detail?

>
>> When the lock is to protect shared resources between complete() and
>> non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
>> context, which is enough to prevent tasklet from being run on the CPU,
>> so no problem for this situation too.
>>
>> When all HCDs support to run URB giveback in tasklet context, we can
>> change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
>> in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
>
> Even now we cannot guarantee that all calls to complete() are in irq.
> There is the case of HCD hotunplug and other cases, like timeouts.
> They will have to be verified.

All URBs are completed via usb_hcd_giveback_urb(), so there should
be no differences between these cases and the common one about
introducing the patchset.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-10 20:54               ` Steven Rostedt
@ 2013-06-11  8:40                 ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-11  8:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 11, 2013 at 4:54 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Mon, 2013-06-10 at 16:47 -0400, Alan Stern wrote:
>> On Mon, 10 Jun 2013, Steven Rostedt wrote:
>>
>> > >  and why so many drivers
>> > > > are using tasklet/softirq?
>> >
>> > Because it's easy to set up and device driver authors don't know any
>> > better ;-). Note, a lot of drivers are now using work queues today,
>> > which run as a task.
>> >
>> > Yes, there's a little more overhead with a task than for a softirq, but
>> > the problem with softirqs and tasklets is that they can't be preempted,
>> > and they are more important than all tasks on the system. If you have a
>> > task that is critical, it must yield for your softirq. Almost!
>> >
>> > That is, even if you have a softirq, it may not run in irq context at
>> > all. If you get too many softirqs at a time (one comes while another one
>> > is running), it will defer the processing to the ksoftirq thread. This
>> > not only runs as a task, but also runs as SCHED_OTHER, and must yield to
>> > other tasks like everyone else too.
>> >
>> > Thus, adding it as a softirq does not guarantee that it will be
>> > processed quickly. It just means that most of the time it will prevent
>> > anything else from happening while your "most important handler in the
>> > world" is running.
>>
>> >From this, it sounds like you generally advise using threaded interrupt
>> handlers rather than tasklets/softirqs.
>
> Yes, there's plenty of benefits for them, and I highly doubt that any
> USB device would even notice the difference between a softirq and a
> thread for response time latencies.

Steven, thanks for your input.

IMO, about the problem, the most important point between threaded interrupt
handler and tasklet is the latency.

For USB video/audio application, the data need to be handled very quickly.
Also new transfer need to be scheduled without much latency. I am wondering
if interrupt thread handler can respond quickly enough if there is high load on
CPUs.

For USB mass storage driver, it still need to be handled quickly.

If tasklet is taken, tasklet_schedule() is always called from hard interrupt
context, so most of time the latency should be better than interrupt thread
handler latency since tasklet handler is called just after irq handler exits in
the situation.  Also we can avoid to do tasklet_schedule when the tasklet
is being handled.

I will compare, collect and post out some latency data later for the sake of
choosing which one to handle URB giveback.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11  8:14               ` Ming Lei
@ 2013-06-11  8:49                 ` Oliver Neukum
  2013-06-11  9:27                   ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-11  8:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 11 June 2013 16:14:25 Ming Lei wrote:
> On Tue, Jun 11, 2013 at 3:18 PM, Oliver Neukum <oliver@neukum.org> wrote:
> > On Tuesday 11 June 2013 13:40:20 Ming Lei wrote:
> >> On Tue, Jun 11, 2013 at 1:36 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >> > On Mon, 10 Jun 2013, Ming Lei wrote:
> >
> >> If complete() callback runs in one tasklet context, spin_lock() inside
> >> complete() is enough, just like hard irq, the tasklet itself is disabled
> >> during complete(), if the percpu tasklet is converted to single tasklet.
> >> So no problem when the lock is to protect shared resources between
> >> complete().
> >
> > We also have exclusion between complete() and other contexts, i.e. timers.
> 
> The patch also converts the complete() called in timers to tasklet context too.
> 
> So could you let me know if that eases your concern? If not, could you explain
> your concern about other contexts in a bit detail?

The driver itself may have submitted a timer and race against it.
What locking do you need in complete() and a timer to lock against
each other?

> > Even now we cannot guarantee that all calls to complete() are in irq.
> > There is the case of HCD hotunplug and other cases, like timeouts.
> > They will have to be verified.
> 
> All URBs are completed via usb_hcd_giveback_urb(), so there should
> be no differences between these cases and the common one about
> introducing the patchset.

But it makes no sense to go to a tasklet when you are already in task context.
In those cases you need to do something, essentially blocking the tasklet.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11  8:49                 ` Oliver Neukum
@ 2013-06-11  9:27                   ` Ming Lei
  2013-06-11 10:51                     ` Oliver Neukum
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-11  9:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 11, 2013 at 4:49 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Tuesday 11 June 2013 16:14:25 Ming Lei wrote:
>> On Tue, Jun 11, 2013 at 3:18 PM, Oliver Neukum <oliver@neukum.org> wrote:
>> > On Tuesday 11 June 2013 13:40:20 Ming Lei wrote:
>> >> On Tue, Jun 11, 2013 at 1:36 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> >> > On Mon, 10 Jun 2013, Ming Lei wrote:
>> >
>> >> If complete() callback runs in one tasklet context, spin_lock() inside
>> >> complete() is enough, just like hard irq, the tasklet itself is disabled
>> >> during complete(), if the percpu tasklet is converted to single tasklet.
>> >> So no problem when the lock is to protect shared resources between
>> >> complete().
>> >
>> > We also have exclusion between complete() and other contexts, i.e. timers.
>>
>> The patch also converts the complete() called in timers to tasklet context too.
>>
>> So could you let me know if that eases your concern? If not, could you explain
>> your concern about other contexts in a bit detail?
>
> The driver itself may have submitted a timer and race against it.
> What locking do you need in complete() and a timer to lock against
> each other?

Good catch.

The problem will come if only spin_lock() is called inside complete(),
I will check main USB drivers in tree to see if there is such use case.

>
>> > Even now we cannot guarantee that all calls to complete() are in irq.
>> > There is the case of HCD hotunplug and other cases, like timeouts.
>> > They will have to be verified.
>>
>> All URBs are completed via usb_hcd_giveback_urb(), so there should
>> be no differences between these cases and the common one about
>> introducing the patchset.
>
> But it makes no sense to go to a tasklet when you are already in task context.
> In those cases you need to do something, essentially blocking the tasklet.

At least now, always doing complete() in tasklet handler can simplify
implementation since these cases aren't in hot path.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11  9:27                   ` Ming Lei
@ 2013-06-11 10:51                     ` Oliver Neukum
  2013-06-11 11:19                       ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-11 10:51 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 11 June 2013 17:27:28 Ming Lei wrote:
> On Tue, Jun 11, 2013 at 4:49 PM, Oliver Neukum <oliver@neukum.org> wrote:
> > On Tuesday 11 June 2013 16:14:25 Ming Lei wrote:

> > The driver itself may have submitted a timer and race against it.
> > What locking do you need in complete() and a timer to lock against
> > each other?
> 
> Good catch.
> 
> The problem will come if only spin_lock() is called inside complete(),
> I will check main USB drivers in tree to see if there is such use case.

All network drivers race against timeout. I think they just unlink the URB,
but there's a lot of them.

> > But it makes no sense to go to a tasklet when you are already in task context.
> > In those cases you need to do something, essentially blocking the tasklet.
> 
> At least now, always doing complete() in tasklet handler can simplify
> implementation since these cases aren't in hot path.

Well, I am afraid this is not simply the case. These cases are partially
synchronous. For example you need to make sure all calls to complete()
are finished before you disconnect a HCD itself. The same applies to a device
being disconnected.

It the same area, what happens if an URB is unlinked between the irq handler
and the tasklet?

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11 10:51                     ` Oliver Neukum
@ 2013-06-11 11:19                       ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-11 11:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jun 11, 2013 at 6:51 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Tuesday 11 June 2013 17:27:28 Ming Lei wrote:
>> On Tue, Jun 11, 2013 at 4:49 PM, Oliver Neukum <oliver@neukum.org> wrote:
>> > On Tuesday 11 June 2013 16:14:25 Ming Lei wrote:
>
>> > The driver itself may have submitted a timer and race against it.
>> > What locking do you need in complete() and a timer to lock against
>> > each other?
>>
>> Good catch.
>>
>> The problem will come if only spin_lock() is called inside complete(),
>> I will check main USB drivers in tree to see if there is such use case.
>
> All network drivers race against timeout. I think they just unlink the URB,
> but there's a lot of them.

usbnet is fine since no spin_lock() is used in complete() and the same lock
is acquired in timer handler.

As far as I think of, the problem only exists in the below situation:

complete():
     spin_lock(&A)
     ...
     spin_unlock(&A)

timer handler():
    spin_lock(&A)
     ....
    spin_unlock(&A)

And we need to replace spin_lock() in complete() with spin_lock_irqsave()
if the change is going to be introduced.

>
>> > But it makes no sense to go to a tasklet when you are already in task context.
>> > In those cases you need to do something, essentially blocking the tasklet.
>>
>> At least now, always doing complete() in tasklet handler can simplify
>> implementation since these cases aren't in hot path.
>
> Well, I am afraid this is not simply the case. These cases are partially
> synchronous. For example you need to make sure all calls to complete()
> are finished before you disconnect a HCD itself. The same applies to a device
> being disconnected.

That is fine since the coming giveback() in tasklet context will drop the URB's
reference count(urb->use_count) finally if the scheduled tasklet can't
be dropped.
(looks tasklet schedule can make sure that)

>
> It the same area, what happens if an URB is unlinked between the irq handler
> and the tasklet?

The unlink will return failure since the URB isn't in queue of ep->urb_list.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11  5:40           ` Ming Lei
  2013-06-11  7:18             ` Oliver Neukum
@ 2013-06-11 19:10             ` Alan Stern
  2013-06-12  2:43               ` Ming Lei
  2013-06-12  9:11               ` Oliver Neukum
  1 sibling, 2 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-11 19:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 11 Jun 2013, Ming Lei wrote:

> >> Is there any good reason to do URB giveback with interrupt disabled?
> >
> > That's a good question.  Offhand I can't think of any drivers that rely
> > on this -- although there certainly are places where callback routines
> > use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
> > because they know that interrupts are already disabled.
> 
> Complete() may take long time, for example in UVC's driver, URB's
> complete() may take more than 3 ms, as reported in below link:
> 
>       http://marc.info/?t=136438111600010&r=1&w=2
> 
> Also complete() is provided by interface driver, and it does many driver
> specific works, so it isn't good to disable interrupt only for one driver.

You probably mean "it isn't good to disable interrupts for the sake of
only one driver".  As things stand now, we disable interrupts for all 
callbacks in all drivers.

> If complete() callback runs in one tasklet context, spin_lock() inside
> complete() is enough, just like hard irq, the tasklet itself is disabled
> during complete(), if the percpu tasklet is converted to single tasklet.
> So no problem when the lock is to protect shared resources between
> complete().
> 
> When the lock is to protect shared resources between complete() and
> non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
> context, which is enough to prevent tasklet from being run on the CPU,
> so no problem for this situation too.
> 
> When all HCDs support to run URB giveback in tasklet context, we can
> change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
> in other places, the spin_lock_irq*() can be changed to spin_lock_bh().

I don't follow your reasoning.  Consider the following situation, where
the same spinlock is used in both a URB completion handler and an
interrupt handler:

	URB completes
	A tasklet calls the completion handler, with interrupts enabled
	The completion handler does
		spin_lock(&lock);
	An interrupt occurs
	The interrupt handler does
		spin_lock(&lock);	// Deadlock!

In order to prevent this from happening, you would have to change the
spin_lock() call in the completion handler to spin_lock_irqsave().  
Furthermore, you will have to audit every USB driver to make sure that
all the completion handlers get fixed.

> > However, changing a documented API is not something to be done lightly.
> > You would have to audit _every_ USB driver to make sure no trouble
> > would arise.
> 
> OK, I will write patch to request for changing the API if the improvement
> on the situation isn't objected.

I don't mind.  But don't be surprised if you end up breaking some
drivers.

> In fact, at least now, the change on API is only about document change, and
> no drivers' change is required, isn't it?  We can describe that URB->complete()
> might run in hard irq or softirq context, depends on HCD_BH flag of
> hcd->driver->flags.

The drivers _do_ need to be changed, as explained above.  And that
explanation was only one failure scenario.  There may be others.

> Even with current code, one HCD's interrupt handling still may delay
> the handling for another HCD interrupt handling for quite long, that is
> the motivation to decrease the interrupt handling time of USB HCD, ;-)
> And finally, USB HCDs themselves will benefit from the change, won't they?

How will they benefit?  Merely from not releasing their private locks?  
That involves a fairly small amount of code.

> >> For async transfer, generally, it should have no effect since there should
> >> have URBs queued on the qh queue before submitting URB.
> >
> > That's not how the Bulk-Only mass-storage protocol works.  There are
> > times when the protocol _requires_ bulk packets not to be submitted
> > until a particular URB has completed.
> 
> Yes, I agree. But mass-storage driver itself is very fragile, it will perform
> badly if CPU has a higher load.

Why does that make it fragile?  _Every_ driver and program will behave
worse when the system is under a heavy load.

>  And it is better to submit DATA/CSW
> transfer in previous transfer's complete(), IMO.

Actually, it would be even better to submit DATA _before_ the CBW
completes, and to submit the CSW before submitting DATA-OUT.  
Unfortunately, the design of the SG library makes it impossible to
submit the CSW during DATA-IN completion.

> Compared with "usb-storage" task's schedule latency, the tasklet
> schedule delay should be lower at most of situations since the tasklet
> is scheduled inside irq handler.

True.  But the two latencies add.

> So I decided to replace the percpu tasklet with one single tasklet,
> which can avoid the problem.

Hmmm.  What happens when there is more than one host controller?  The 
tasklet will give back only one URB at a time, and it won't run on more 
than one CPU at a time.  The existing drivers allow URBs to be given 
back on multiple CPUs simultaneously.

> Also the tasklet running in CPU0 can handle the work which should
> have been done by the same tasket scheduled in CPU1,  so we can
> avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
> in CPU1.

Or you could have a separate tasklet for each host controller.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11 19:10             ` Alan Stern
@ 2013-06-12  2:43               ` Ming Lei
  2013-06-12  6:41                 ` Ming Lei
                                   ` (2 more replies)
  2013-06-12  9:11               ` Oliver Neukum
  1 sibling, 3 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-12  2:43 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 12, 2013 at 3:10 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Tue, 11 Jun 2013, Ming Lei wrote:
>
>> >> Is there any good reason to do URB giveback with interrupt disabled?
>> >
>> > That's a good question.  Offhand I can't think of any drivers that rely
>> > on this -- although there certainly are places where callback routines
>> > use spin_lock() rather than spin_lock_irq() or spin_lock_irqsave(),
>> > because they know that interrupts are already disabled.
>>
>> Complete() may take long time, for example in UVC's driver, URB's
>> complete() may take more than 3 ms, as reported in below link:
>>
>>       http://marc.info/?t=136438111600010&r=1&w=2
>>
>> Also complete() is provided by interface driver, and it does many driver
>> specific works, so it isn't good to disable interrupt only for one driver.
>
> You probably mean "it isn't good to disable interrupts for the sake of
> only one driver".  As things stand now, we disable interrupts for all
> callbacks in all drivers.
>
>> If complete() callback runs in one tasklet context, spin_lock() inside
>> complete() is enough, just like hard irq, the tasklet itself is disabled
>> during complete(), if the percpu tasklet is converted to single tasklet.
>> So no problem when the lock is to protect shared resources between
>> complete().
>>
>> When the lock is to protect shared resources between complete() and
>> non-IRQ context, currently spin_lock_irqsave() is used in non-IRQ
>> context, which is enough to prevent tasklet from being run on the CPU,
>> so no problem for this situation too.
>>
>> When all HCDs support to run URB giveback in tasklet context, we can
>> change all spin_lock_irq*() to spin_lock() in drivers URB->complete(), and
>> in other places, the spin_lock_irq*() can be changed to spin_lock_bh().
>
> I don't follow your reasoning.  Consider the following situation, where
> the same spinlock is used in both a URB completion handler and an
> interrupt handler:
>
>         URB completes
>         A tasklet calls the completion handler, with interrupts enabled
>         The completion handler does
>                 spin_lock(&lock);
>         An interrupt occurs
>         The interrupt handler does
>                 spin_lock(&lock);       // Deadlock!

If you mean the interrupt handler is HCD irq handler of the device, no
such problem since all complete() will run in tasklet context.

If not, I am wondering why one USB driver need register another hard
interrupt handler? Could you share such examples? I understand the
case only exists with timer handler as discussed with Oliver, don't I?

>
> In order to prevent this from happening, you would have to change the
> spin_lock() call in the completion handler to spin_lock_irqsave().
> Furthermore, you will have to audit every USB driver to make sure that
> all the completion handlers get fixed.

I have written one script to search usb drivers which may use timers and
spin_lock() at the same time, about 30 drivers are found, and looks we
can fix them.

>
>> > However, changing a documented API is not something to be done lightly.
>> > You would have to audit _every_ USB driver to make sure no trouble
>> > would arise.
>>
>> OK, I will write patch to request for changing the API if the improvement
>> on the situation isn't objected.
>
> I don't mind.  But don't be surprised if you end up breaking some
> drivers.

Sure, we should change these drivers before changing the API.

>
>> In fact, at least now, the change on API is only about document change, and
>> no drivers' change is required, isn't it?  We can describe that URB->complete()
>> might run in hard irq or softirq context, depends on HCD_BH flag of
>> hcd->driver->flags.
>
> The drivers _do_ need to be changed, as explained above.  And that
> explanation was only one failure scenario.  There may be others.

OK, I'd like to know what the others are? :-)

>
>> Even with current code, one HCD's interrupt handling still may delay
>> the handling for another HCD interrupt handling for quite long, that is
>> the motivation to decrease the interrupt handling time of USB HCD, ;-)
>> And finally, USB HCDs themselves will benefit from the change, won't they?
>
> How will they benefit?  Merely from not releasing their private locks?

The interrupt of one HCD will be responded lately since another HCD's
interrupt handler takes very long time. With the change, the problem can
be avoided.

> That involves a fairly small amount of code.

Maybe no much code, but :

1) Inside interrupt handler no submitting and unlinking requests from
drivers can happen any more, so interrupt handler should have became
simple.

2), Also the race between irq handler and related timer handler needn't
to be considered because the private lock can't be released in irq handler.

>
>> >> For async transfer, generally, it should have no effect since there should
>> >> have URBs queued on the qh queue before submitting URB.
>> >
>> > That's not how the Bulk-Only mass-storage protocol works.  There are
>> > times when the protocol _requires_ bulk packets not to be submitted
>> > until a particular URB has completed.
>>
>> Yes, I agree. But mass-storage driver itself is very fragile, it will perform
>> badly if CPU has a higher load.
>
> Why does that make it fragile?  _Every_ driver and program will behave
> worse when the system is under a heavy load.

Tasklet and interrupt handler may be scheduled with less latency than
task scheduling latency under high load.

>>  And it is better to submit DATA/CSW
>> transfer in previous transfer's complete(), IMO.
>
> Actually, it would be even better to submit DATA _before_ the CBW
> completes, and to submit the CSW before submitting DATA-OUT.
> Unfortunately, the design of the SG library makes it impossible to
> submit the CSW during DATA-IN completion.
>
>> Compared with "usb-storage" task's schedule latency, the tasklet
>> schedule delay should be lower at most of situations since the tasklet
>> is scheduled inside irq handler.
>
> True.  But the two latencies add.

I think it should be one tasklet latency added, but what is the other one?

>
>> So I decided to replace the percpu tasklet with one single tasklet,
>> which can avoid the problem.
>
> Hmmm.  What happens when there is more than one host controller?  The
> tasklet will give back only one URB at a time, and it won't run on more
> than one CPU at a time.  The existing drivers allow URBs to be given
> back on multiple CPUs simultaneously.

Sorry for not describing it explicitly, I mean single tasklet is per HCD
since current percpu tasklet in this patchset is per HCD already.

>
>> Also the tasklet running in CPU0 can handle the work which should
>> have been done by the same tasket scheduled in CPU1,  so we can
>> avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
>> in CPU1.
>
> Or you could have a separate tasklet for each host controller.

Yes, but I will compare tasklet with interrupt threaded handler first.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12  2:43               ` Ming Lei
@ 2013-06-12  6:41                 ` Ming Lei
  2013-06-12  7:45                 ` Thomas Gleixner
  2013-06-12 14:35                 ` Alan Stern
  2 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-12  6:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 12, 2013 at 10:43 AM, Ming Lei <ming.lei@canonical.com> wrote:
> On Wed, Jun 12, 2013 at 3:10 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>>
>> I don't follow your reasoning.  Consider the following situation, where
>> the same spinlock is used in both a URB completion handler and an
>> interrupt handler:
>>
>>         URB completes
>>         A tasklet calls the completion handler, with interrupts enabled
>>         The completion handler does
>>                 spin_lock(&lock);
>>         An interrupt occurs
>>         The interrupt handler does
>>                 spin_lock(&lock);       // Deadlock!
>
> If you mean the interrupt handler is HCD irq handler of the device, no
> such problem since all complete() will run in tasklet context.
>
> If not, I am wondering why one USB driver need register another hard
> interrupt handler? Could you share such examples? I understand the
> case only exists with timer handler as discussed with Oliver, don't I?

In fact, timer funtion is still run in softirq context, so the deadlock won't
be caused with acquiring same lock(spin_lock) in both timer function
and complete() from tasklet.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12  2:43               ` Ming Lei
  2013-06-12  6:41                 ` Ming Lei
@ 2013-06-12  7:45                 ` Thomas Gleixner
  2013-06-13  2:25                   ` Ming Lei
  2013-06-12 14:35                 ` Alan Stern
  2 siblings, 1 reply; 66+ messages in thread
From: Thomas Gleixner @ 2013-06-12  7:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 12 Jun 2013, Ming Lei wrote:
> On Wed, Jun 12, 2013 at 3:10 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >> Also the tasklet running in CPU0 can handle the work which should
> >> have been done by the same tasket scheduled in CPU1,  so we can
> >> avoid busy-waitting in CPU1 which may be introduced by tasklet_schedule()
> >> in CPU1.
> >
> > Or you could have a separate tasklet for each host controller.
> 
> Yes, but I will compare tasklet with interrupt threaded handler first.

Yes, please. I read through the thread and I really recommend that you
get rid of the tasklet. tasklets are a complete disaster by design and
we really should make efforts to get rid of the last users instead of
trying to work around their semantical shortcomings.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-11 19:10             ` Alan Stern
  2013-06-12  2:43               ` Ming Lei
@ 2013-06-12  9:11               ` Oliver Neukum
  2013-06-12 10:11                 ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-12  9:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Tuesday 11 June 2013 15:10:03 Alan Stern wrote:
> In order to prevent this from happening, you would have to change the
> spin_lock() call in the completion handler to spin_lock_irqsave().  
> Furthermore, you will have to audit every USB driver to make sure that
> all the completion handlers get fixed.

Yes. However, it can be done mechanically. And we know only
the handlers for complete need to be fixed.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12  9:11               ` Oliver Neukum
@ 2013-06-12 10:11                 ` Ming Lei
  2013-06-12 10:19                   ` Oliver Neukum
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-12 10:11 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 12, 2013 at 5:11 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Tuesday 11 June 2013 15:10:03 Alan Stern wrote:
>> In order to prevent this from happening, you would have to change the
>> spin_lock() call in the completion handler to spin_lock_irqsave().
>> Furthermore, you will have to audit every USB driver to make sure that
>> all the completion handlers get fixed.
>
> Yes. However, it can be done mechanically. And we know only
> the handlers for complete need to be fixed.

I am wondering if the change is needed since timer function is still
run in softirq context instead of hard irq.

Looks Alan concerned that one USB interface driver may have another
hard interrupt handler involved. Is there such kind of USB driver/device
in tree?

Thanks,
--
Ming

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12 10:11                 ` Ming Lei
@ 2013-06-12 10:19                   ` Oliver Neukum
  2013-06-12 11:33                     ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Oliver Neukum @ 2013-06-12 10:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 12 June 2013 18:11:34 Ming Lei wrote:
> On Wed, Jun 12, 2013 at 5:11 PM, Oliver Neukum <oliver@neukum.org> wrote:
> > On Tuesday 11 June 2013 15:10:03 Alan Stern wrote:
> >> In order to prevent this from happening, you would have to change the
> >> spin_lock() call in the completion handler to spin_lock_irqsave().
> >> Furthermore, you will have to audit every USB driver to make sure that
> >> all the completion handlers get fixed.
> >
> > Yes. However, it can be done mechanically. And we know only
> > the handlers for complete need to be fixed.
> 
> I am wondering if the change is needed since timer function is still
> run in softirq context instead of hard irq.
> 
> Looks Alan concerned that one USB interface driver may have another
> hard interrupt handler involved. Is there such kind of USB driver/device
> in tree?

No. Suppose a USB network driver.
The complete() handler is written on the assumption that interrupts are off.
So it takes a spinlock from the network subsystem. It does so with spin_lock()

Other network drivers also take the lock. And they may take it from an IRQ handler.
If such an IRQ interrupts the tasklet complete() is running in, the CPU will deadlock.

The danger is not interrupt handlers in the same driver but IRQ handlers of _other_
drivers (PCI, ...) a lock is shared with.

You need to go through all USB drivers and change every spin_lock() that goes
for a lock that is exported to a spin_lock_irqsave()

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12 10:19                   ` Oliver Neukum
@ 2013-06-12 11:33                     ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-12 11:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 12, 2013 at 6:19 PM, Oliver Neukum <oliver@neukum.org> wrote:
> On Wednesday 12 June 2013 18:11:34 Ming Lei wrote:
>> On Wed, Jun 12, 2013 at 5:11 PM, Oliver Neukum <oliver@neukum.org> wrote:
>> > On Tuesday 11 June 2013 15:10:03 Alan Stern wrote:
>> >> In order to prevent this from happening, you would have to change the
>> >> spin_lock() call in the completion handler to spin_lock_irqsave().
>> >> Furthermore, you will have to audit every USB driver to make sure that
>> >> all the completion handlers get fixed.
>> >
>> > Yes. However, it can be done mechanically. And we know only
>> > the handlers for complete need to be fixed.
>>
>> I am wondering if the change is needed since timer function is still
>> run in softirq context instead of hard irq.
>>
>> Looks Alan concerned that one USB interface driver may have another
>> hard interrupt handler involved. Is there such kind of USB driver/device
>> in tree?
>
> No. Suppose a USB network driver.
> The complete() handler is written on the assumption that interrupts are off.
> So it takes a spinlock from the network subsystem. It does so with spin_lock()

Looks I misses the case, but IMO, it might not be very common, generally
subsystem provides API for drivers to do something(handle rx/report tx) inside
complete(), and seldom exports subsystem-wide lock directly to drivers.
API has no context info, and should have called spin_lock_irqrestore().

>
> Other network drivers also take the lock. And they may take it from an IRQ handler.
> If such an IRQ interrupts the tasklet complete() is running in, the CPU will deadlock.

Looks no usbnet drivers hold subsystem(network) locks in its complete().
Both the locks held are per device/per skb list. In usbnet_bh(), there
is one, eg.
netif_rx(), which is a API and disables local IRQ.

>
> The danger is not interrupt handlers in the same driver but IRQ handlers of _other_
> drivers (PCI, ...) a lock is shared with.

Right, so generally drivers which spin_lock(subsystem lock) in complete()
might be affected.

>
> You need to go through all USB drivers and change every spin_lock() that goes
> for a lock that is exported to a spin_lock_irqsave()

Yes, that is the safest way.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12  2:43               ` Ming Lei
  2013-06-12  6:41                 ` Ming Lei
  2013-06-12  7:45                 ` Thomas Gleixner
@ 2013-06-12 14:35                 ` Alan Stern
  2013-06-12 15:10                   ` Oliver Neukum
  2013-06-13  3:41                   ` Ming Lei
  2 siblings, 2 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-12 14:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, 12 Jun 2013, Ming Lei wrote:

> If not, I am wondering why one USB driver need register another hard
> interrupt handler? Could you share such examples? I understand the
> case only exists with timer handler as discussed with Oliver, don't I?

Here's another possibility.  A USB driver has a private lock, and it
acquires this lock in its completion handler.  But it also acquires
this lock in other routines, and those other routines may be called by
other drivers or subsystems in interrupt context.

> > The drivers _do_ need to be changed, as explained above.  And that
> > explanation was only one failure scenario.  There may be others.
> 
> OK, I'd like to know what the others are? :-)

Me too.  :-)

> >> And finally, USB HCDs themselves will benefit from the change, won't they?
> >
> > How will they benefit?  Merely from not releasing their private locks?
> 
> The interrupt of one HCD will be responded lately since another HCD's
> interrupt handler takes very long time. With the change, the problem can
> be avoided.

For reasonably sophisticated host controllers like EHCI, the delay in
responding to an interrupt doesn't matter much.  But for simpler host
controllers (for example, those using PIO instead of DMA) it matters
more, so those other HCDs will benefit more from the conversion.

> >> Compared with "usb-storage" task's schedule latency, the tasklet
> >> schedule delay should be lower at most of situations since the tasklet
> >> is scheduled inside irq handler.
> >
> > True.  But the two latencies add.
> 
> I think it should be one tasklet latency added, but what is the other one?

I meant that the latency in tasklet scheduling has to be added to the 
latency in scheduling the usb-storage thread.

> > Or you could have a separate tasklet for each host controller.
> 
> Yes, but I will compare tasklet with interrupt threaded handler first.

Switching to a threaded interrupt handler offers more possibilities.  
Instead of moving just the givebacks to the thread, we could put almost 
all the processing there.  (At least, we can for ehci-hcd.  Other HCDs 
may not be able to do this.)

At this point we are talking about multiple changes:

	Moving the givebacks to a tasklet or threaded handler.

	Changing the USB completion handlers so that they can be called 
	with interrupts enabled.	

	Allowing the tasklet/thread to run with interrupts enabled.

	Moving more of the HCD processing into the tasklet/thread.

Maybe other things too.  In principle, the first and second items can 
be done simultaneously.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12 14:35                 ` Alan Stern
@ 2013-06-12 15:10                   ` Oliver Neukum
  2013-06-13  3:41                   ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Oliver Neukum @ 2013-06-12 15:10 UTC (permalink / raw)
  To: linux-arm-kernel

On Wednesday 12 June 2013 10:35:44 Alan Stern wrote:
> On Wed, 12 Jun 2013, Ming Lei wrote:
> 
> > If not, I am wondering why one USB driver need register another hard
> > interrupt handler? Could you share such examples? I understand the
> > case only exists with timer handler as discussed with Oliver, don't I?
> 
> Here's another possibility.  A USB driver has a private lock, and it
> acquires this lock in its completion handler.  But it also acquires
> this lock in other routines, and those other routines may be called by
> other drivers or subsystems in interrupt context.

That is fatal. But again mechanically using _irqsave in complete()
does the job. No other place can be affected.

	Regards
		Oliver

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12  7:45                 ` Thomas Gleixner
@ 2013-06-13  2:25                   ` Ming Lei
  2013-06-13 14:54                     ` Alan Stern
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-13  2:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 12, 2013 at 3:45 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 12 Jun 2013, Ming Lei wrote:
>
> Yes, please. I read through the thread and I really recommend that you
> get rid of the tasklet. tasklets are a complete disaster by design and
> we really should make efforts to get rid of the last users instead of
> trying to work around their semantical shortcomings.

The attached patch001 supports tasklet in HCD and patch002 implements
interrupt threaded handler, then if USB_HCD_THREADED_IRQ
is set, interrupt threaded handler is used to do URB giveback, otherwise
tasklet is used to do that.

The other 3 patches(2/4, 3/4, 4/4) aren't changed now, if anyone want to try it.
(the patch 2 is only for comparing performance difference, and welcome to
review/comment it so that we can get accurate test result to make decision)

Follows the mass storage test result(average over 10 times test) with
tasklet /interrupt threaded handler/hard interrupt handler under  same
environment of lenovo T410(x86), which means the test is switched by
reinserting module of usbcore or ehci-hcd without changing other things
in the machine.

- do below tests 10 times and figure out the average speed

          dd if=/dev/sdN of=/dev/null iflag=direct bs=200M 1

          device: sandisk extreme USB 3.0 16G, host: Lenovo T410 EHCI

- using interrupt threaded handler(default)
        33.440 MB/sec

- using tasklet(#undef USB_HCD_THREADED_IRQ)
        34.29 MB/sec

- using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
        34.260 MB/s


So looks usb mass storage performance loss can be observed with
interrupt threaded handler because one mass storage read/write sectors
requires at least 3 interrupts which wake up usb-storage thread 3 times
(each interrupt wakeup the usb-storage each time), introducing irq threaded
handler will make 2 threads to be waken up about 6 times for one read/write.

I think usb mass storage transfer handler need to be rewritten, otherwise
it may become worsen after using irq threaded handler in USB 3.0.(the
above device can reach >120MB/sec with hardware handler or tasklet handler,
which means about ~3K interrupts/sec, so ~6K contexts switch in case of
using irq threaded handler)

So how about supporting tasklet first, then convert to interrupt
threaded handler
after usb mass storage transfer is rewritten without performance loss?
(rewriting
usb mass storage transfer handler may need some time and work since storage
stability/correctness is extremely important, :-)

Also another problem with irq threaded handler is that there is no sort of
tasklet_schedule() interface to wakeup the thread handler manually, so
I have to use work to schedule some URB giveback from drivers(root hub
transfer, unlink),  even though that isn't a big deal but will cause code a bit
much/complicated, :-)

Thanks,
--
Ming Lei
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-USB-support-interrupt-threaded-handler.patch
Type: application/octet-stream
Size: 5752 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20130613/50f04eb1/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-USB-HCD-support-giveback-of-URB-in-tasklet-context.patch
Type: application/octet-stream
Size: 12423 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20130613/50f04eb1/attachment-0003.obj>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-12 14:35                 ` Alan Stern
  2013-06-12 15:10                   ` Oliver Neukum
@ 2013-06-13  3:41                   ` Ming Lei
  2013-06-13 15:05                     ` Alan Stern
  1 sibling, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-13  3:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jun 12, 2013 at 10:35 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Wed, 12 Jun 2013, Ming Lei wrote:
>
>> If not, I am wondering why one USB driver need register another hard
>> interrupt handler? Could you share such examples? I understand the
>> case only exists with timer handler as discussed with Oliver, don't I?
>
> Here's another possibility.  A USB driver has a private lock, and it
> acquires this lock in its completion handler.  But it also acquires
> this lock in other routines, and those other routines may be called by
> other drivers or subsystems in interrupt context.

OK, so the problem is more complicated than I imaged at start, :-(

>
>> > The drivers _do_ need to be changed, as explained above.  And that
>> > explanation was only one failure scenario.  There may be others.
>>
>> OK, I'd like to know what the others are? :-)
>
> Me too.  :-)
>
>> >> And finally, USB HCDs themselves will benefit from the change, won't they?
>> >
>> > How will they benefit?  Merely from not releasing their private locks?
>>
>> The interrupt of one HCD will be responded lately since another HCD's
>> interrupt handler takes very long time. With the change, the problem can
>> be avoided.
>
> For reasonably sophisticated host controllers like EHCI, the delay in
> responding to an interrupt doesn't matter much.  But for simpler host

EHCI might still benefit from the change: suppose uvc video is playing on
one EHCI bus, and mass storage's performance over another EHCI may
decrease because of the IRQ latency introduced.

> controllers (for example, those using PIO instead of DMA) it matters
> more, so those other HCDs will benefit more from the conversion.

Indeed, musb and ehci/ohci are coexisted on OMAP3/4/5.

>> >> Compared with "usb-storage" task's schedule latency, the tasklet
>> >> schedule delay should be lower at most of situations since the tasklet
>> >> is scheduled inside irq handler.
>> >
>> > True.  But the two latencies add.
>>
>> I think it should be one tasklet latency added, but what is the other one?
>
> I meant that the latency in tasklet scheduling has to be added to the
> latency in scheduling the usb-storage thread.
>
>> > Or you could have a separate tasklet for each host controller.
>>
>> Yes, but I will compare tasklet with interrupt threaded handler first.
>
> Switching to a threaded interrupt handler offers more possibilities.
> Instead of moving just the givebacks to the thread, we could put almost
> all the processing there.  (At least, we can for ehci-hcd.  Other HCDs
> may not be able to do this.)
>
> At this point we are talking about multiple changes:
>
>         Moving the givebacks to a tasklet or threaded handler.
>
>         Changing the USB completion handlers so that they can be called
>         with interrupts enabled.
>
>         Allowing the tasklet/thread to run with interrupts enabled.
>
>         Moving more of the HCD processing into the tasklet/thread.
>
> Maybe other things too.  In principle, the first and second items can
> be done simultaneously.

Very good summery, we can push the 1st change with disabling local IRQ
when calling complete(), so that at least DMA unmapping can be done
with IRQ enabled, and at the same time do the API change, which
should be a bit slow but the mechanical way proposed by Oliver may
be OK.

The 3rd item is easy once the 1st two items are completed.

For the 4th one, it might be a long term thing, since refactoring
HCD interrupt handler is a bit complicated. Also, when the 1st
three items are completed, hard interrupt handler takes less time,
often less than 20us at average about EHCI, so I am wondering
if the 4th is worthy.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13  2:25                   ` Ming Lei
@ 2013-06-13 14:54                     ` Alan Stern
  2013-06-13 18:47                       ` Greg Kroah-Hartman
  2013-06-14  2:03                       ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-13 14:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 13 Jun 2013, Ming Lei wrote:

> - using interrupt threaded handler(default)
>         33.440 MB/sec
> 
> - using tasklet(#undef USB_HCD_THREADED_IRQ)
>         34.29 MB/sec
> 
> - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
>         34.260 MB/s
> 
> 
> So looks usb mass storage performance loss can be observed with
> interrupt threaded handler because one mass storage read/write sectors
> requires at least 3 interrupts which wake up usb-storage thread 3 times
> (each interrupt wakeup the usb-storage each time), introducing irq threaded
> handler will make 2 threads to be waken up about 6 times for one read/write.
> 
> I think usb mass storage transfer handler need to be rewritten, otherwise
> it may become worsen after using irq threaded handler in USB 3.0.(the
> above device can reach >120MB/sec with hardware handler or tasklet handler,
> which means about ~3K interrupts/sec, so ~6K contexts switch in case of
> using irq threaded handler)
> 
> So how about supporting tasklet first, then convert to interrupt
> threaded handler
> after usb mass storage transfer is rewritten without performance loss?
> (rewriting
> usb mass storage transfer handler may need some time and work since storage
> stability/correctness is extremely important, :-)

Maybe we should simply copy what the networking people do.  They are 
very concerned about performance and latency; whatever technique they 
use should be good for USB too.

> Also another problem with irq threaded handler is that there is no sort of
> tasklet_schedule() interface to wakeup the thread handler manually, so
> I have to use work to schedule some URB giveback from drivers(root hub
> transfer, unlink),  even though that isn't a big deal but will cause code a bit
> much/complicated, :-)

Yes, I was going to bring that up.

Thomas, sometimes we need the IRQ handler thread to do some work even
though an interrupt hasn't occurred.  Is there an API for this, or can 
one be added?

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13  3:41                   ` Ming Lei
@ 2013-06-13 15:05                     ` Alan Stern
  0 siblings, 0 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-13 15:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 13 Jun 2013, Ming Lei wrote:

> > For reasonably sophisticated host controllers like EHCI, the delay in
> > responding to an interrupt doesn't matter much.  But for simpler host
> 
> EHCI might still benefit from the change: suppose uvc video is playing on
> one EHCI bus, and mass storage's performance over another EHCI may
> decrease because of the IRQ latency introduced.

If multiple CPUs are involved, then we could see an improvement in the 
case where the two EHCI controllers share the same IRQ line.  With 
different IRQs, though, everything would run in parallel on different 
CPUs, so there is no advantage.

> > At this point we are talking about multiple changes:
> >
> >         Moving the givebacks to a tasklet or threaded handler.
> >
> >         Changing the USB completion handlers so that they can be called
> >         with interrupts enabled.
> >
> >         Allowing the tasklet/thread to run with interrupts enabled.
> >
> >         Moving more of the HCD processing into the tasklet/thread.
> >
> > Maybe other things too.  In principle, the first and second items can
> > be done simultaneously.
> 
> Very good summery, we can push the 1st change with disabling local IRQ
> when calling complete(), so that at least DMA unmapping can be done
> with IRQ enabled, and at the same time do the API change, which
> should be a bit slow but the mechanical way proposed by Oliver may
> be OK.
> 
> The 3rd item is easy once the 1st two items are completed.
> 
> For the 4th one, it might be a long term thing, since refactoring
> HCD interrupt handler is a bit complicated. Also, when the 1st
> three items are completed, hard interrupt handler takes less time,
> often less than 20us at average about EHCI, so I am wondering
> if the 4th is worthy.

Yes, I don't know about that.  If the time spent in the hardirq
processing is reasonably short then moving it to the threaded handler
won't help very much.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 14:54                     ` Alan Stern
@ 2013-06-13 18:47                       ` Greg Kroah-Hartman
  2013-06-13 19:41                         ` Alan Stern
  2013-06-14  2:03                       ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Greg Kroah-Hartman @ 2013-06-13 18:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
> On Thu, 13 Jun 2013, Ming Lei wrote:
> 
> > - using interrupt threaded handler(default)
> >         33.440 MB/sec
> > 
> > - using tasklet(#undef USB_HCD_THREADED_IRQ)
> >         34.29 MB/sec
> > 
> > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
> >         34.260 MB/s
> > 
> > 
> > So looks usb mass storage performance loss can be observed with
> > interrupt threaded handler because one mass storage read/write sectors
> > requires at least 3 interrupts which wake up usb-storage thread 3 times
> > (each interrupt wakeup the usb-storage each time), introducing irq threaded
> > handler will make 2 threads to be waken up about 6 times for one read/write.
> > 
> > I think usb mass storage transfer handler need to be rewritten, otherwise
> > it may become worsen after using irq threaded handler in USB 3.0.(the
> > above device can reach >120MB/sec with hardware handler or tasklet handler,
> > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
> > using irq threaded handler)
> > 
> > So how about supporting tasklet first, then convert to interrupt
> > threaded handler
> > after usb mass storage transfer is rewritten without performance loss?
> > (rewriting
> > usb mass storage transfer handler may need some time and work since storage
> > stability/correctness is extremely important, :-)
> 
> Maybe we should simply copy what the networking people do.  They are 
> very concerned about performance and latency; whatever technique they 
> use should be good for USB too.

Yes, but for "old-style" usb-storage, is this really a big deal?  We are
still easily hitting the "line-speed" of USB for usb-storage with simple
machines, the bottlenecks that I'm seeing are in the devices themselves,
and then in the USB wire speed.

Once hardware comes out that uses USB streams, and we get device support
for the UAS protocol, then we might have a need to change things, but at
this point in time, for the "old" driver, I think we are fine.

Unless someone has a workload / benchmark that shows otherwise?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 18:47                       ` Greg Kroah-Hartman
@ 2013-06-13 19:41                         ` Alan Stern
  2013-06-13 20:08                           ` Steven Rostedt
                                             ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-13 19:41 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 13 Jun 2013, Greg Kroah-Hartman wrote:

> On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
> > On Thu, 13 Jun 2013, Ming Lei wrote:
> > 
> > > - using interrupt threaded handler(default)
> > >         33.440 MB/sec
> > > 
> > > - using tasklet(#undef USB_HCD_THREADED_IRQ)
> > >         34.29 MB/sec
> > > 
> > > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
> > >         34.260 MB/s
> > > 
> > > 
> > > So looks usb mass storage performance loss can be observed with
> > > interrupt threaded handler because one mass storage read/write sectors
> > > requires at least 3 interrupts which wake up usb-storage thread 3 times
> > > (each interrupt wakeup the usb-storage each time), introducing irq threaded
> > > handler will make 2 threads to be waken up about 6 times for one read/write.
> > > 
> > > I think usb mass storage transfer handler need to be rewritten, otherwise
> > > it may become worsen after using irq threaded handler in USB 3.0.(the
> > > above device can reach >120MB/sec with hardware handler or tasklet handler,
> > > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
> > > using irq threaded handler)
> > > 
> > > So how about supporting tasklet first, then convert to interrupt
> > > threaded handler
> > > after usb mass storage transfer is rewritten without performance loss?
> > > (rewriting
> > > usb mass storage transfer handler may need some time and work since storage
> > > stability/correctness is extremely important, :-)
> > 
> > Maybe we should simply copy what the networking people do.  They are 
> > very concerned about performance and latency; whatever technique they 
> > use should be good for USB too.
> 
> Yes, but for "old-style" usb-storage, is this really a big deal?  We are
> still easily hitting the "line-speed" of USB for usb-storage with simple
> machines, the bottlenecks that I'm seeing are in the devices themselves,
> and then in the USB wire speed.

What about with USB-3 storage devices?  Many of them still use the
bulk-only transport instead of UAS.  They may push the limits up.

> Once hardware comes out that uses USB streams, and we get device support
> for the UAS protocol, then we might have a need to change things, but at
> this point in time, for the "old" driver, I think we are fine.
> 
> Unless someone has a workload / benchmark that shows otherwise?

The test results above show a 2.4% degradation for threaded interrupts 
as compared to tasklets.  That's in addition to the bottlenecks caused 
by the device; no doubt it would be worse for a faster device.  This 
result calls into question the benefits of threaded interrupts.

The main reason for moving away from the current scheme is to reduce
latency for other interrupt handlers.  Ming gave two examples of slow
USB code that runs in hardirq context now; with his change they would
run in softirq context and therefore wouldn't delay other interrupts so
much.  (Interrupt latency is hard to measure, however.)

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 19:41                         ` Alan Stern
@ 2013-06-13 20:08                           ` Steven Rostedt
  2013-06-13 21:09                             ` Alan Stern
  2013-06-14  0:35                           ` Greg Kroah-Hartman
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2013-06-13 20:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2013-06-13 at 15:41 -0400, Alan Stern wrote:

> The test results above show a 2.4% degradation for threaded interrupts 
> as compared to tasklets.  That's in addition to the bottlenecks caused 
> by the device; no doubt it would be worse for a faster device.  This 
> result calls into question the benefits of threaded interrupts.

That's because it was written like a top half and not a full blown
interrupt. I just looked at the patch, and saw this:

+#ifndef USB_HCD_THREADED_IRQ
        if (sched) {
                if (async)
                        tasklet_schedule(&bh->bh);
                else
                        tasklet_hi_schedule(&bh->bh);
        }
+#else
+       if (sched)
+               schedule_work(&hcd->periodic_bh->work);
+#endif

What is this? The work isn't done by an interrupt thread, but by work queues!

The point of the interrupt thread is that you do *all* the work that
needs to be done when an interrupt comes in. You don't need to delay the
work.

If you just treat a threaded interrupt like a real interrupt and push
off work to something else, then yes, it will degrade performance.

If you go the threaded interrupt route, you need to rethink the
paradigm. There's no reason that the interrupt handler needs to be fast
like it needs to be in true interrupt context. The handler can now use
mutexes, and other full features that currently only threads benefit
from. It should improve locking issues, and can serialize things if
needed.

All this patch did was to switch the main irq to a thread and make a
bottom half into a work queue.

Why couldn't you just do:

	if (sched)
		usb_giveback_urb_bh(bh);

?

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 20:08                           ` Steven Rostedt
@ 2013-06-13 21:09                             ` Alan Stern
  2013-06-13 22:24                               ` Steven Rostedt
  2013-06-14  1:27                               ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-13 21:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 13 Jun 2013, Steven Rostedt wrote:

> On Thu, 2013-06-13 at 15:41 -0400, Alan Stern wrote:
> 
> > The test results above show a 2.4% degradation for threaded interrupts 
> > as compared to tasklets.  That's in addition to the bottlenecks caused 
> > by the device; no doubt it would be worse for a faster device.  This 
> > result calls into question the benefits of threaded interrupts.
> 
> That's because it was written like a top half and not a full blown
> interrupt. I just looked at the patch, and saw this:
> 
> +#ifndef USB_HCD_THREADED_IRQ
>         if (sched) {
>                 if (async)
>                         tasklet_schedule(&bh->bh);
>                 else
>                         tasklet_hi_schedule(&bh->bh);
>         }
> +#else
> +       if (sched)
> +               schedule_work(&hcd->periodic_bh->work);
> +#endif
> 
> What is this? The work isn't done by an interrupt thread, but by work queues!

You don't understand the patch.  Most of the time, sched will be 0 
here and hence the work queue won't be involved.

Yes, part of the work is done by a work queue rather than the interrupt 
thread.  But it is an unimportant part, the part that involves 
transfers to root hubs or transfers that were cancelled.  These things 
can complete without any interrupt occurring, so they can't be handled 
by the interrupt thread.  However, they are the abnormal case; the 
transfers we care about are not to root hubs and they do complete 
normally.

> The point of the interrupt thread is that you do *all* the work that
> needs to be done when an interrupt comes in. You don't need to delay the
> work.

You've got it backward.  The patch doesn't leave part of the work 
undone when an interrupt occurs.  Rather it's the other way around -- 
sometimes work needs to be done when there isn't any interrupt.  This 
could happen in a timer callback, or it could happen as a direct result 
of a function call.

Since there doesn't seem to be any way to invoke the interrupt thread 
in the absence of an interrupt, Ming pushed the job off to a work 
queue.

> If you just treat a threaded interrupt like a real interrupt and push
> off work to something else, then yes, it will degrade performance.
> 
> If you go the threaded interrupt route, you need to rethink the
> paradigm. There's no reason that the interrupt handler needs to be fast
> like it needs to be in true interrupt context. The handler can now use
> mutexes, and other full features that currently only threads benefit
> from. It should improve locking issues, and can serialize things if
> needed.
> 
> All this patch did was to switch the main irq to a thread and make a
> bottom half into a work queue.

In case it's not clear, the code you quoted above is part of the 
interrupt handler, not part of the thread.

> Why couldn't you just do:
> 
> 	if (sched)
> 		usb_giveback_urb_bh(bh);
> 
> ?

Because usb_giveback_urb_bh() is supposed to run in the context of the
tasklet or interrupt thread or work queue, not in the context of the
interrupt handler.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 21:09                             ` Alan Stern
@ 2013-06-13 22:24                               ` Steven Rostedt
  2013-06-13 23:08                                 ` Alan Stern
  2013-06-14  1:27                               ` Ming Lei
  1 sibling, 1 reply; 66+ messages in thread
From: Steven Rostedt @ 2013-06-13 22:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2013-06-13 at 17:09 -0400, Alan Stern wrote:
> On Thu, 13 Jun 2013, Steven Rostedt wrote:
> 
> > On Thu, 2013-06-13 at 15:41 -0400, Alan Stern wrote:
> > 
> > > The test results above show a 2.4% degradation for threaded interrupts 
> > > as compared to tasklets.  That's in addition to the bottlenecks caused 
> > > by the device; no doubt it would be worse for a faster device.  This 
> > > result calls into question the benefits of threaded interrupts.
> > 
> > That's because it was written like a top half and not a full blown
> > interrupt. I just looked at the patch, and saw this:
> > 
> > +#ifndef USB_HCD_THREADED_IRQ
> >         if (sched) {
> >                 if (async)
> >                         tasklet_schedule(&bh->bh);
> >                 else
> >                         tasklet_hi_schedule(&bh->bh);
> >         }
> > +#else
> > +       if (sched)
> > +               schedule_work(&hcd->periodic_bh->work);
> > +#endif
> > 
> > What is this? The work isn't done by an interrupt thread, but by work queues!
> 
> You don't understand the patch.

Hey, I'll admit that I don't understand how USB works ;-)

I also only looked at the second patch without applying it. Thus, a lot
of the changes were out of context for me.


>   Most of the time, sched will be 0 
> here and hence the work queue won't be involved.

OK, I only compared that current tasklets were being commented out, and
that's pretty much all I had to go on.

> 
> Yes, part of the work is done by a work queue rather than the interrupt 
> thread.  But it is an unimportant part, the part that involves 
> transfers to root hubs or transfers that were cancelled.  These things 
> can complete without any interrupt occurring, so they can't be handled 
> by the interrupt thread.  However, they are the abnormal case; the 
> transfers we care about are not to root hubs and they do complete 
> normally.
> 
> > The point of the interrupt thread is that you do *all* the work that
> > needs to be done when an interrupt comes in. You don't need to delay the
> > work.
> 
> You've got it backward.  The patch doesn't leave part of the work 
> undone when an interrupt occurs.  Rather it's the other way around -- 
> sometimes work needs to be done when there isn't any interrupt.  This 
> could happen in a timer callback, or it could happen as a direct result 
> of a function call.
> 
> Since there doesn't seem to be any way to invoke the interrupt thread 
> in the absence of an interrupt, Ming pushed the job off to a work 
> queue.
> 
> > If you just treat a threaded interrupt like a real interrupt and push
> > off work to something else, then yes, it will degrade performance.
> > 
> > If you go the threaded interrupt route, you need to rethink the
> > paradigm. There's no reason that the interrupt handler needs to be fast
> > like it needs to be in true interrupt context. The handler can now use
> > mutexes, and other full features that currently only threads benefit
> > from. It should improve locking issues, and can serialize things if
> > needed.
> > 
> > All this patch did was to switch the main irq to a thread and make a
> > bottom half into a work queue.
> 
> In case it's not clear, the code you quoted above is part of the 
> interrupt handler, not part of the thread.

Got it.

> 
> > Why couldn't you just do:
> > 
> > 	if (sched)
> > 		usb_giveback_urb_bh(bh);
> > 
> > ?
> 
> Because usb_giveback_urb_bh() is supposed to run in the context of the
> tasklet or interrupt thread or work queue, not in the context of the
> interrupt handler.

I only took a quick look at the second patch. I'm now looking at both
patches applied to the code. I didn't realize this was called from the
top half.

Usually the top half for threaded interrupts is used just to quite the
interrupt line. Either by acknowledging the interrupt or by disabling
the device from sending more interrupts till the bottom half (thread)
can run. This looks to be doing a bit more than that.

I'll look a bit deeper at the patch, but this still doesn't look like a
typical threaded interrupt usage.

-- Steve

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 22:24                               ` Steven Rostedt
@ 2013-06-13 23:08                                 ` Alan Stern
  0 siblings, 0 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-13 23:08 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 13 Jun 2013, Steven Rostedt wrote:

> I only took a quick look at the second patch. I'm now looking at both
> patches applied to the code. I didn't realize this was called from the
> top half.
> 
> Usually the top half for threaded interrupts is used just to quite the
> interrupt line. Either by acknowledging the interrupt or by disabling
> the device from sending more interrupts till the bottom half (thread)
> can run. This looks to be doing a bit more than that.

Yes, it does.  Ming left all the host controller processing in the top 
half.  The bottom half merely handles the completion callbacks.

One of the things we discussed during this email thread was the
possibility of moving _all_ the work into the threaded handler.  
That's not as easy to do, though; it requires significant modification
of the host controller driver.  And each controller driver would need
its own modifications, whereas with Ming's approach only the core needs
to be changed.

> I'll look a bit deeper at the patch, but this still doesn't look like a
> typical threaded interrupt usage.

I'll agree with that; it isn't typical.  Ming claims that the work
remaining in the top half often takes less than 20 us on average in his
tests.  Depending on the particular device, the work in the bottom half
can be very quick (little more than waking up a thread in the case of
usb-storage) or quite slow (perhaps on the order of a ms or more for
the UVC video driver on some ARM platforms).

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 19:41                         ` Alan Stern
  2013-06-13 20:08                           ` Steven Rostedt
@ 2013-06-14  0:35                           ` Greg Kroah-Hartman
  2013-06-14  1:53                             ` Ming Lei
  2013-06-14  1:37                           ` Ming Lei
  2013-06-14  3:24                           ` Ming Lei
  3 siblings, 1 reply; 66+ messages in thread
From: Greg Kroah-Hartman @ 2013-06-14  0:35 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 13, 2013 at 03:41:17PM -0400, Alan Stern wrote:
> On Thu, 13 Jun 2013, Greg Kroah-Hartman wrote:
> 
> > On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
> > > On Thu, 13 Jun 2013, Ming Lei wrote:
> > > 
> > > > - using interrupt threaded handler(default)
> > > >         33.440 MB/sec
> > > > 
> > > > - using tasklet(#undef USB_HCD_THREADED_IRQ)
> > > >         34.29 MB/sec
> > > > 
> > > > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
> > > >         34.260 MB/s
> > > > 
> > > > 
> > > > So looks usb mass storage performance loss can be observed with
> > > > interrupt threaded handler because one mass storage read/write sectors
> > > > requires at least 3 interrupts which wake up usb-storage thread 3 times
> > > > (each interrupt wakeup the usb-storage each time), introducing irq threaded
> > > > handler will make 2 threads to be waken up about 6 times for one read/write.
> > > > 
> > > > I think usb mass storage transfer handler need to be rewritten, otherwise
> > > > it may become worsen after using irq threaded handler in USB 3.0.(the
> > > > above device can reach >120MB/sec with hardware handler or tasklet handler,
> > > > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
> > > > using irq threaded handler)
> > > > 
> > > > So how about supporting tasklet first, then convert to interrupt
> > > > threaded handler
> > > > after usb mass storage transfer is rewritten without performance loss?
> > > > (rewriting
> > > > usb mass storage transfer handler may need some time and work since storage
> > > > stability/correctness is extremely important, :-)
> > > 
> > > Maybe we should simply copy what the networking people do.  They are 
> > > very concerned about performance and latency; whatever technique they 
> > > use should be good for USB too.
> > 
> > Yes, but for "old-style" usb-storage, is this really a big deal?  We are
> > still easily hitting the "line-speed" of USB for usb-storage with simple
> > machines, the bottlenecks that I'm seeing are in the devices themselves,
> > and then in the USB wire speed.
> 
> What about with USB-3 storage devices?  Many of them still use the
> bulk-only transport instead of UAS.  They may push the limits up.

Are they really?  Have we seen that happen yet?  With the number's I've
seen published, we are easily serving up enough data to keep the pipe
full, but that all depends on your CPU / host controller.

> > Once hardware comes out that uses USB streams, and we get device support
> > for the UAS protocol, then we might have a need to change things, but at
> > this point in time, for the "old" driver, I think we are fine.
> > 
> > Unless someone has a workload / benchmark that shows otherwise?
> 
> The test results above show a 2.4% degradation for threaded interrupts 
> as compared to tasklets.  That's in addition to the bottlenecks caused 
> by the device; no doubt it would be worse for a faster device.  This 
> result calls into question the benefits of threaded interrupts.
> 
> The main reason for moving away from the current scheme is to reduce
> latency for other interrupt handlers.  Ming gave two examples of slow
> USB code that runs in hardirq context now; with his change they would
> run in softirq context and therefore wouldn't delay other interrupts so
> much.  (Interrupt latency is hard to measure, however.)

Yes, I know that people keep wanting to worry about latency issues, and
the best answer for them has always been, "don't use USB." :)

You suffer throughput issues with predicitable latency dependancies, so
we need to be careful we don't slow down the 99% of the systems out
there that do not care about this at all.

greg k-h

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 21:09                             ` Alan Stern
  2013-06-13 22:24                               ` Steven Rostedt
@ 2013-06-14  1:27                               ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-14  1:27 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 5:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 13 Jun 2013, Steven Rostedt wrote:
>
>> On Thu, 2013-06-13 at 15:41 -0400, Alan Stern wrote:
>>
>> > The test results above show a 2.4% degradation for threaded interrupts
>> > as compared to tasklets.  That's in addition to the bottlenecks caused
>> > by the device; no doubt it would be worse for a faster device.  This
>> > result calls into question the benefits of threaded interrupts.
>>
>> That's because it was written like a top half and not a full blown
>> interrupt. I just looked at the patch, and saw this:
>>
>> +#ifndef USB_HCD_THREADED_IRQ
>>         if (sched) {
>>                 if (async)
>>                         tasklet_schedule(&bh->bh);
>>                 else
>>                         tasklet_hi_schedule(&bh->bh);
>>         }
>> +#else
>> +       if (sched)
>> +               schedule_work(&hcd->periodic_bh->work);
>> +#endif
>>
>> What is this? The work isn't done by an interrupt thread, but by work queues!
>
> You don't understand the patch.  Most of the time, sched will be 0
> here and hence the work queue won't be involved.
>
> Yes, part of the work is done by a work queue rather than the interrupt
> thread.  But it is an unimportant part, the part that involves
> transfers to root hubs or transfers that were cancelled.  These things
> can complete without any interrupt occurring, so they can't be handled
> by the interrupt thread.  However, they are the abnormal case; the
> transfers we care about are not to root hubs and they do complete
> normally.
>
>> The point of the interrupt thread is that you do *all* the work that
>> needs to be done when an interrupt comes in. You don't need to delay the
>> work.
>
> You've got it backward.  The patch doesn't leave part of the work
> undone when an interrupt occurs.  Rather it's the other way around --
> sometimes work needs to be done when there isn't any interrupt.  This
> could happen in a timer callback, or it could happen as a direct result
> of a function call.
>
> Since there doesn't seem to be any way to invoke the interrupt thread
> in the absence of an interrupt, Ming pushed the job off to a work
> queue.
>
>> If you just treat a threaded interrupt like a real interrupt and push
>> off work to something else, then yes, it will degrade performance.
>>
>> If you go the threaded interrupt route, you need to rethink the
>> paradigm. There's no reason that the interrupt handler needs to be fast
>> like it needs to be in true interrupt context. The handler can now use
>> mutexes, and other full features that currently only threads benefit
>> from. It should improve locking issues, and can serialize things if
>> needed.
>>
>> All this patch did was to switch the main irq to a thread and make a
>> bottom half into a work queue.
>
> In case it's not clear, the code you quoted above is part of the
> interrupt handler, not part of the thread.
>
>> Why couldn't you just do:
>>
>>       if (sched)
>>               usb_giveback_urb_bh(bh);
>>
>> ?
>
> Because usb_giveback_urb_bh() is supposed to run in the context of the
> tasklet or interrupt thread or work queue, not in the context of the
> interrupt handler.

Exactly, the code should have been the below shape:

        #ifndef USB_HCD_THREADED_IRQ
              ......
        #else
               if (!in_irq()) {
                      bh = hcd->periodic_bh;
                      sched = 1;
              }
        #endif

Then it will be quite clear.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 19:41                         ` Alan Stern
  2013-06-13 20:08                           ` Steven Rostedt
  2013-06-14  0:35                           ` Greg Kroah-Hartman
@ 2013-06-14  1:37                           ` Ming Lei
  2013-06-14  3:24                           ` Ming Lei
  3 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-14  1:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 3:41 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 13 Jun 2013, Greg Kroah-Hartman wrote:
>
>> On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
>> > On Thu, 13 Jun 2013, Ming Lei wrote:
>> >
>> > > - using interrupt threaded handler(default)
>> > >         33.440 MB/sec
>> > >
>> > > - using tasklet(#undef USB_HCD_THREADED_IRQ)
>> > >         34.29 MB/sec
>> > >
>> > > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
>> > >         34.260 MB/s
>> > >
>> > >
>> > > So looks usb mass storage performance loss can be observed with
>> > > interrupt threaded handler because one mass storage read/write sectors
>> > > requires at least 3 interrupts which wake up usb-storage thread 3 times
>> > > (each interrupt wakeup the usb-storage each time), introducing irq threaded
>> > > handler will make 2 threads to be waken up about 6 times for one read/write.
>> > >
>> > > I think usb mass storage transfer handler need to be rewritten, otherwise
>> > > it may become worsen after using irq threaded handler in USB 3.0.(the
>> > > above device can reach >120MB/sec with hardware handler or tasklet handler,
>> > > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
>> > > using irq threaded handler)
>> > >
>> > > So how about supporting tasklet first, then convert to interrupt
>> > > threaded handler
>> > > after usb mass storage transfer is rewritten without performance loss?
>> > > (rewriting
>> > > usb mass storage transfer handler may need some time and work since storage
>> > > stability/correctness is extremely important, :-)
>> >
>> > Maybe we should simply copy what the networking people do.  They are
>> > very concerned about performance and latency; whatever technique they
>> > use should be good for USB too.
>>
>> Yes, but for "old-style" usb-storage, is this really a big deal?  We are
>> still easily hitting the "line-speed" of USB for usb-storage with simple
>> machines, the bottlenecks that I'm seeing are in the devices themselves,
>> and then in the USB wire speed.
>
> What about with USB-3 storage devices?  Many of them still use the
> bulk-only transport instead of UAS.  They may push the limits up.

Exactly, my test device(sandisk extreme USB 3.0 16G, 0781:5580) is very
popular, which is faster than most USB 3.0 pendrive in market, but the device
is bulk-only, and no UAS support, so I guess most of the USB 3.0 pendrive in
market still may not support UAS.

>
>> Once hardware comes out that uses USB streams, and we get device support
>> for the UAS protocol, then we might have a need to change things, but at
>> this point in time, for the "old" driver, I think we are fine.
>>
>> Unless someone has a workload / benchmark that shows otherwise?
>
> The test results above show a 2.4% degradation for threaded interrupts
> as compared to tasklets.  That's in addition to the bottlenecks caused
> by the device; no doubt it would be worse for a faster device.  This
> result calls into question the benefits of threaded interrupts.

If I enable HCD_BH in xhci driver and enable requst_threaded_irq in xhci driver,
the degradation becomes >10%, see below test on the same device connected to
xhci-hcd:

[tom at board]$ps -ax | grep xhci
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
 4896 pts/1    S+     0:00 grep --color=auto xhci
[tom at board]$sudo ./ds-msg /dev/sdb 400M 1 4
No. 0, time 121 MB
No. 1, time 122 MB
No. 2, time 124 MB
No. 3, time 122 MB
count=4, total=489 ms, average=122.250 MB
[tom at board]$
[tom at board]$ps -ax | grep xhci
Warning: bad ps syntax, perhaps a bogus '-'? See http://procps.sf.net/faq.html
 6037 ?        S      0:00 [irq/42-xhci_hcd]
 6038 ?        S      0:00 [irq/43-xhci_hcd]
 6039 ?        S      0:00 [irq/44-xhci_hcd]
 6040 ?        S      0:00 [irq/45-xhci_hcd]
 6041 ?        S      0:00 [irq/46-xhci_hcd]
 6304 pts/1    S+     0:00 grep --color=auto xhci
[tom at board]$
[tom at board]$
[tom at board]$sudo ./ds-msg /dev/sdb 400M 1 4
No. 0, time 107 MB
No. 1, time 108 MB
No. 2, time 108 MB
No. 3, time 109 MB
count=4, total=432 ms, average=108.000 MB

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14  0:35                           ` Greg Kroah-Hartman
@ 2013-06-14  1:53                             ` Ming Lei
  2013-06-14  6:05                               ` Greg Kroah-Hartman
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-14  1:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 8:35 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Thu, Jun 13, 2013 at 03:41:17PM -0400, Alan Stern wrote:
>> On Thu, 13 Jun 2013, Greg Kroah-Hartman wrote:
>>
>> > On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
>> > > On Thu, 13 Jun 2013, Ming Lei wrote:
>> > >
>> > > > - using interrupt threaded handler(default)
>> > > >         33.440 MB/sec
>> > > >
>> > > > - using tasklet(#undef USB_HCD_THREADED_IRQ)
>> > > >         34.29 MB/sec
>> > > >
>> > > > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
>> > > >         34.260 MB/s
>> > > >
>> > > >
>> > > > So looks usb mass storage performance loss can be observed with
>> > > > interrupt threaded handler because one mass storage read/write sectors
>> > > > requires at least 3 interrupts which wake up usb-storage thread 3 times
>> > > > (each interrupt wakeup the usb-storage each time), introducing irq threaded
>> > > > handler will make 2 threads to be waken up about 6 times for one read/write.
>> > > >
>> > > > I think usb mass storage transfer handler need to be rewritten, otherwise
>> > > > it may become worsen after using irq threaded handler in USB 3.0.(the
>> > > > above device can reach >120MB/sec with hardware handler or tasklet handler,
>> > > > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
>> > > > using irq threaded handler)
>> > > >
>> > > > So how about supporting tasklet first, then convert to interrupt
>> > > > threaded handler
>> > > > after usb mass storage transfer is rewritten without performance loss?
>> > > > (rewriting
>> > > > usb mass storage transfer handler may need some time and work since storage
>> > > > stability/correctness is extremely important, :-)
>> > >
>> > > Maybe we should simply copy what the networking people do.  They are
>> > > very concerned about performance and latency; whatever technique they
>> > > use should be good for USB too.
>> >
>> > Yes, but for "old-style" usb-storage, is this really a big deal?  We are
>> > still easily hitting the "line-speed" of USB for usb-storage with simple
>> > machines, the bottlenecks that I'm seeing are in the devices themselves,
>> > and then in the USB wire speed.
>>
>> What about with USB-3 storage devices?  Many of them still use the
>> bulk-only transport instead of UAS.  They may push the limits up.
>
> Are they really?  Have we seen that happen yet?  With the number's I've

Yes, the device I am testing is bulk-only, no uas support , and it is very
popular in market.

> seen published, we are easily serving up enough data to keep the pipe
> full, but that all depends on your CPU / host controller.
>
>> > Once hardware comes out that uses USB streams, and we get device support
>> > for the UAS protocol, then we might have a need to change things, but at
>> > this point in time, for the "old" driver, I think we are fine.
>> >
>> > Unless someone has a workload / benchmark that shows otherwise?
>>
>> The test results above show a 2.4% degradation for threaded interrupts
>> as compared to tasklets.  That's in addition to the bottlenecks caused
>> by the device; no doubt it would be worse for a faster device.  This
>> result calls into question the benefits of threaded interrupts.
>>
>> The main reason for moving away from the current scheme is to reduce
>> latency for other interrupt handlers.  Ming gave two examples of slow
>> USB code that runs in hardirq context now; with his change they would
>> run in softirq context and therefore wouldn't delay other interrupts so
>> much.  (Interrupt latency is hard to measure, however.)
>
> Yes, I know that people keep wanting to worry about latency issues, and
> the best answer for them has always been, "don't use USB." :)

I think we can do it better, why don't do it? :-)

> You suffer throughput issues with predicitable latency dependancies, so

This patchset don't cause throughout degradation but decrease latency much,
also has other advantages.

> we need to be careful we don't slow down the 99% of the systems out
> there that do not care about this at all.

Considered great amount of ARM devices in market, I think we need to
consider the problem on these devices, :-)

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 14:54                     ` Alan Stern
  2013-06-13 18:47                       ` Greg Kroah-Hartman
@ 2013-06-14  2:03                       ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-14  2:03 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jun 13, 2013 at 10:54 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Maybe we should simply copy what the networking people do.  They are
> very concerned about performance and latency; whatever technique they
> use should be good for USB too.

Most of net devices don't use interrupt threaded handler, looks we need to
copy that first, :-)

$git grep -n request_threaded_irq drivers/net/ | wc -l
9

$git grep -n request_threaded_irq drivers/net/wireless | wc -l
5

Also 5 of 9 using interrupt threaded handler are wireless network devices, which
are a bit slow, and only one ethernet driver uses it.

So I plan to use tasklet first, then we can switch to interrupt threaded handler
in the future if mass storage devices are OK with it.

If no one objects to use tasklet, I will post out the v1 patch for review later.

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-13 19:41                         ` Alan Stern
                                             ` (2 preceding siblings ...)
  2013-06-14  1:37                           ` Ming Lei
@ 2013-06-14  3:24                           ` Ming Lei
  2013-06-14 14:56                             ` Alan Stern
  3 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-14  3:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 3:41 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> The main reason for moving away from the current scheme is to reduce
> latency for other interrupt handlers.  Ming gave two examples of slow
> USB code that runs in hardirq context now; with his change they would
> run in softirq context and therefore wouldn't delay other interrupts so
> much.  (Interrupt latency is hard to measure, however.)

With the two trace points of irq_handler_entry and irq_handler_exit, the
interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
One simple script can figure out the average/maximum latency for one irq
handler, like I did in 4/4.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14  1:53                             ` Ming Lei
@ 2013-06-14  6:05                               ` Greg Kroah-Hartman
  2013-06-14 10:05                                 ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Greg Kroah-Hartman @ 2013-06-14  6:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 09:53:52AM +0800, Ming Lei wrote:
> On Fri, Jun 14, 2013 at 8:35 AM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Thu, Jun 13, 2013 at 03:41:17PM -0400, Alan Stern wrote:
> >> On Thu, 13 Jun 2013, Greg Kroah-Hartman wrote:
> >>
> >> > On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
> >> > > On Thu, 13 Jun 2013, Ming Lei wrote:
> >> > >
> >> > > > - using interrupt threaded handler(default)
> >> > > >         33.440 MB/sec
> >> > > >
> >> > > > - using tasklet(#undef USB_HCD_THREADED_IRQ)
> >> > > >         34.29 MB/sec
> >> > > >
> >> > > > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
> >> > > >         34.260 MB/s
> >> > > >
> >> > > >
> >> > > > So looks usb mass storage performance loss can be observed with
> >> > > > interrupt threaded handler because one mass storage read/write sectors
> >> > > > requires at least 3 interrupts which wake up usb-storage thread 3 times
> >> > > > (each interrupt wakeup the usb-storage each time), introducing irq threaded
> >> > > > handler will make 2 threads to be waken up about 6 times for one read/write.
> >> > > >
> >> > > > I think usb mass storage transfer handler need to be rewritten, otherwise
> >> > > > it may become worsen after using irq threaded handler in USB 3.0.(the
> >> > > > above device can reach >120MB/sec with hardware handler or tasklet handler,
> >> > > > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
> >> > > > using irq threaded handler)
> >> > > >
> >> > > > So how about supporting tasklet first, then convert to interrupt
> >> > > > threaded handler
> >> > > > after usb mass storage transfer is rewritten without performance loss?
> >> > > > (rewriting
> >> > > > usb mass storage transfer handler may need some time and work since storage
> >> > > > stability/correctness is extremely important, :-)
> >> > >
> >> > > Maybe we should simply copy what the networking people do.  They are
> >> > > very concerned about performance and latency; whatever technique they
> >> > > use should be good for USB too.
> >> >
> >> > Yes, but for "old-style" usb-storage, is this really a big deal?  We are
> >> > still easily hitting the "line-speed" of USB for usb-storage with simple
> >> > machines, the bottlenecks that I'm seeing are in the devices themselves,
> >> > and then in the USB wire speed.
> >>
> >> What about with USB-3 storage devices?  Many of them still use the
> >> bulk-only transport instead of UAS.  They may push the limits up.
> >
> > Are they really?  Have we seen that happen yet?  With the number's I've
> 
> Yes, the device I am testing is bulk-only, no uas support , and it is very
> popular in market.
> 
> > seen published, we are easily serving up enough data to keep the pipe
> > full, but that all depends on your CPU / host controller.
> >
> >> > Once hardware comes out that uses USB streams, and we get device support
> >> > for the UAS protocol, then we might have a need to change things, but at
> >> > this point in time, for the "old" driver, I think we are fine.
> >> >
> >> > Unless someone has a workload / benchmark that shows otherwise?
> >>
> >> The test results above show a 2.4% degradation for threaded interrupts
> >> as compared to tasklets.  That's in addition to the bottlenecks caused
> >> by the device; no doubt it would be worse for a faster device.  This
> >> result calls into question the benefits of threaded interrupts.
> >>
> >> The main reason for moving away from the current scheme is to reduce
> >> latency for other interrupt handlers.  Ming gave two examples of slow
> >> USB code that runs in hardirq context now; with his change they would
> >> run in softirq context and therefore wouldn't delay other interrupts so
> >> much.  (Interrupt latency is hard to measure, however.)
> >
> > Yes, I know that people keep wanting to worry about latency issues, and
> > the best answer for them has always been, "don't use USB." :)
> 
> I think we can do it better, why don't do it? :-)

Because of other issues, that have been brought up here already.

But if you can do it without affecting others, that's fine.

> > You suffer throughput issues with predicitable latency dependancies, so
> 
> This patchset don't cause throughout degradation but decrease latency much,
> also has other advantages.

Like what?

> > we need to be careful we don't slow down the 99% of the systems out
> > there that do not care about this at all.
> 
> Considered great amount of ARM devices in market, I think we need to
> consider the problem on these devices, :-)

Is it a problem on those devices?  I think they have host controller
issues that are way bigger problems than this device driver, right?

greg k-h

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14  6:05                               ` Greg Kroah-Hartman
@ 2013-06-14 10:05                                 ` Ming Lei
  0 siblings, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-14 10:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 2:05 PM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Fri, Jun 14, 2013 at 09:53:52AM +0800, Ming Lei wrote:
>> On Fri, Jun 14, 2013 at 8:35 AM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Thu, Jun 13, 2013 at 03:41:17PM -0400, Alan Stern wrote:
>> >> On Thu, 13 Jun 2013, Greg Kroah-Hartman wrote:
>> >>
>> >> > On Thu, Jun 13, 2013 at 10:54:13AM -0400, Alan Stern wrote:
>> >> > > On Thu, 13 Jun 2013, Ming Lei wrote:
>> >> > >
>> >> > > > - using interrupt threaded handler(default)
>> >> > > >         33.440 MB/sec
>> >> > > >
>> >> > > > - using tasklet(#undef USB_HCD_THREADED_IRQ)
>> >> > > >         34.29 MB/sec
>> >> > > >
>> >> > > > - using hard interrupt handler(by removing HCD_BH in ehci-hcd.c )
>> >> > > >         34.260 MB/s
>> >> > > >
>> >> > > >
>> >> > > > So looks usb mass storage performance loss can be observed with
>> >> > > > interrupt threaded handler because one mass storage read/write sectors
>> >> > > > requires at least 3 interrupts which wake up usb-storage thread 3 times
>> >> > > > (each interrupt wakeup the usb-storage each time), introducing irq threaded
>> >> > > > handler will make 2 threads to be waken up about 6 times for one read/write.
>> >> > > >
>> >> > > > I think usb mass storage transfer handler need to be rewritten, otherwise
>> >> > > > it may become worsen after using irq threaded handler in USB 3.0.(the
>> >> > > > above device can reach >120MB/sec with hardware handler or tasklet handler,
>> >> > > > which means about ~3K interrupts/sec, so ~6K contexts switch in case of
>> >> > > > using irq threaded handler)
>> >> > > >
>> >> > > > So how about supporting tasklet first, then convert to interrupt
>> >> > > > threaded handler
>> >> > > > after usb mass storage transfer is rewritten without performance loss?
>> >> > > > (rewriting
>> >> > > > usb mass storage transfer handler may need some time and work since storage
>> >> > > > stability/correctness is extremely important, :-)
>> >> > >
>> >> > > Maybe we should simply copy what the networking people do.  They are
>> >> > > very concerned about performance and latency; whatever technique they
>> >> > > use should be good for USB too.
>> >> >
>> >> > Yes, but for "old-style" usb-storage, is this really a big deal?  We are
>> >> > still easily hitting the "line-speed" of USB for usb-storage with simple
>> >> > machines, the bottlenecks that I'm seeing are in the devices themselves,
>> >> > and then in the USB wire speed.
>> >>
>> >> What about with USB-3 storage devices?  Many of them still use the
>> >> bulk-only transport instead of UAS.  They may push the limits up.
>> >
>> > Are they really?  Have we seen that happen yet?  With the number's I've
>>
>> Yes, the device I am testing is bulk-only, no uas support , and it is very
>> popular in market.
>>
>> > seen published, we are easily serving up enough data to keep the pipe
>> > full, but that all depends on your CPU / host controller.
>> >
>> >> > Once hardware comes out that uses USB streams, and we get device support
>> >> > for the UAS protocol, then we might have a need to change things, but at
>> >> > this point in time, for the "old" driver, I think we are fine.
>> >> >
>> >> > Unless someone has a workload / benchmark that shows otherwise?
>> >>
>> >> The test results above show a 2.4% degradation for threaded interrupts
>> >> as compared to tasklets.  That's in addition to the bottlenecks caused
>> >> by the device; no doubt it would be worse for a faster device.  This
>> >> result calls into question the benefits of threaded interrupts.
>> >>
>> >> The main reason for moving away from the current scheme is to reduce
>> >> latency for other interrupt handlers.  Ming gave two examples of slow
>> >> USB code that runs in hardirq context now; with his change they would
>> >> run in softirq context and therefore wouldn't delay other interrupts so
>> >> much.  (Interrupt latency is hard to measure, however.)
>> >
>> > Yes, I know that people keep wanting to worry about latency issues, and
>> > the best answer for them has always been, "don't use USB." :)
>>
>> I think we can do it better, why don't do it? :-)
>
> Because of other issues, that have been brought up here already.
>
> But if you can do it without affecting others, that's fine.

It won't affect others. As discussed in the thread, we can do the change
with following steps:

1), move URB giveback from hard irq handler to tasklet, but call complete()
with local IRQs disabled;

2), cleanup current drivers which call 'spin_lock' in compele() by
replacing spin_lock() with spin_lock_irqrestore(), and change the API of
complete() callback by claiming it called with local IRQs enabled, but bh is
disabled.

3), enable local IRQs before calling complete().

There is no good reason for complete() to run with interrupt disabled, as
discussed, so the API change and driver cleanup still makes sense.

In fact, only the below two cases requires the 2nd step's change:

- drivers->complete() acquire a subsystem wide lock which may be called
in another hard irq context, and the subsystem wide lock is acquired by
spin_lock() in complete()

- drivers->complete() hold a private lock with spin_lock() but may export
APIs to let other drivers acquire the private lock in its interrupt handler.

Looks both two cases aren't very common.

>
>> > You suffer throughput issues with predicitable latency dependancies, so
>>
>> This patchset don't cause throughout degradation but decrease latency much,
>> also has other advantages.
>
> Like what?

For example, the HCD private lock needn't to be released inside irq handler,
and HCD interrupt handling may be simplified a bit:

1) Inside interrupt handler no submitting and unlinking requests from
drivers can happen any more, so interrupt handler should have became
simple.

2) Also the race between irq handler and related timer handler needn't
to be considered because the private lock can't be released in irq handler.

>
>> > we need to be careful we don't slow down the 99% of the systems out
>> > there that do not care about this at all.
>>
>> Considered great amount of ARM devices in market, I think we need to
>> consider the problem on these devices, :-)
>
> Is it a problem on those devices?  I think they have host controller
> issues that are way bigger problems than this device driver, right?

Disabling IRQs for long time(e.g, several milliseconds) is always bad
since the CPU can't respond any events during the period, so USB
application may affect whole system response.

Generally the problem is arch related, not only about the host controller
(which can't have very big fifo to hold lots of packets)

- DMA mapping/unmapping(only cache clean/invalidate is involved) is
time-consuming, especially when the driver's transfer buffer is very big
(eg. mass storage, some usbnet devices, ...)

- Accessing(often reading) DMA coherent buffer is very slow(someone
complained[1] in linux-usb list that memcpy() in uvc complete() may take more
than 3ms, then later packets won't be moved to buffer from hw fifo in irq
handler until the memcpy is completed, and cause packet loss)

[1], http://marc.info/?l=linux-usb&m=136438101304211&w=2

Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14  3:24                           ` Ming Lei
@ 2013-06-14 14:56                             ` Alan Stern
  2013-06-14 15:15                               ` Ming Lei
  0 siblings, 1 reply; 66+ messages in thread
From: Alan Stern @ 2013-06-14 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 14 Jun 2013, Ming Lei wrote:

> On Fri, Jun 14, 2013 at 3:41 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > The main reason for moving away from the current scheme is to reduce
> > latency for other interrupt handlers.  Ming gave two examples of slow
> > USB code that runs in hardirq context now; with his change they would
> > run in softirq context and therefore wouldn't delay other interrupts so
> > much.  (Interrupt latency is hard to measure, however.)
> 
> With the two trace points of irq_handler_entry and irq_handler_exit, the
> interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
> One simple script can figure out the average/maximum latency for one irq
> handler, like I did in 4/4.

But that doesn't measure the time between when the IRQ request is 
issued and when irq_handler_entry runs.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14 14:56                             ` Alan Stern
@ 2013-06-14 15:15                               ` Ming Lei
  2013-06-14 20:23                                 ` Alan Stern
  0 siblings, 1 reply; 66+ messages in thread
From: Ming Lei @ 2013-06-14 15:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 14, 2013 at 10:56 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Fri, 14 Jun 2013, Ming Lei wrote:
>
>> On Fri, Jun 14, 2013 at 3:41 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> >
>> > The main reason for moving away from the current scheme is to reduce
>> > latency for other interrupt handlers.  Ming gave two examples of slow
>> > USB code that runs in hardirq context now; with his change they would
>> > run in softirq context and therefore wouldn't delay other interrupts so
>> > much.  (Interrupt latency is hard to measure, however.)
>>
>> With the two trace points of irq_handler_entry and irq_handler_exit, the
>> interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
>> One simple script can figure out the average/maximum latency for one irq
>> handler, like I did in 4/4.
>
> But that doesn't measure the time between when the IRQ request is
> issued and when irq_handler_entry runs.

That might be hard to measure, also I am wondering if the time can be
measured only by software, but these patches only focus on the time
between HCD irq entry and irq exit.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14 15:15                               ` Ming Lei
@ 2013-06-14 20:23                                 ` Alan Stern
  2013-06-14 21:09                                   ` Thomas Gleixner
  2013-06-15  1:49                                   ` Ming Lei
  0 siblings, 2 replies; 66+ messages in thread
From: Alan Stern @ 2013-06-14 20:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 14 Jun 2013, Ming Lei wrote:

> >> With the two trace points of irq_handler_entry and irq_handler_exit, the
> >> interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
> >> One simple script can figure out the average/maximum latency for one irq
> >> handler, like I did in 4/4.
> >
> > But that doesn't measure the time between when the IRQ request is
> > issued and when irq_handler_entry runs.
> 
> That might be hard to measure, also I am wondering if the time can be
> measured only by software, but these patches only focus on the time
> between HCD irq entry and irq exit.

Not entirely.  On a UP system, leaving interrupts disabled for a long
time (which is what we do now) increases the delay between when the IRQ
is raised and when it is serviced.  On an SMP system, a long-running 
interrupt handler will delay servicing a different device that shares 
the same IRQ line.

Alan Stern

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14 20:23                                 ` Alan Stern
@ 2013-06-14 21:09                                   ` Thomas Gleixner
  2013-06-15  1:49                                   ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Thomas Gleixner @ 2013-06-14 21:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 14 Jun 2013, Alan Stern wrote:

> On Fri, 14 Jun 2013, Ming Lei wrote:
> 
> > >> With the two trace points of irq_handler_entry and irq_handler_exit, the
> > >> interrupt latency(or the time taken by hard irq handler) isn't hard to measure.
> > >> One simple script can figure out the average/maximum latency for one irq
> > >> handler, like I did in 4/4.
> > >
> > > But that doesn't measure the time between when the IRQ request is
> > > issued and when irq_handler_entry runs.
> > 
> > That might be hard to measure, also I am wondering if the time can be
> > measured only by software, but these patches only focus on the time
> > between HCD irq entry and irq exit.
> 
> Not entirely.  On a UP system, leaving interrupts disabled for a long
> time (which is what we do now) increases the delay between when the IRQ
> is raised and when it is serviced.  On an SMP system, a long-running 
> interrupt handler will delay servicing a different device that shares 
> the same IRQ line.
 
And on UP it delays ALL other interrupts. I've seen 500us+ caused by
the USB interrupt handlers...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context
  2013-06-14 20:23                                 ` Alan Stern
  2013-06-14 21:09                                   ` Thomas Gleixner
@ 2013-06-15  1:49                                   ` Ming Lei
  1 sibling, 0 replies; 66+ messages in thread
From: Ming Lei @ 2013-06-15  1:49 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Jun 15, 2013 at 4:23 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>
> Not entirely.  On a UP system, leaving interrupts disabled for a long
> time (which is what we do now) increases the delay between when the IRQ
> is raised and when it is serviced.  On an SMP system, a long-running

Yes, I mean the HCD IRQ handling time is already too long, and it
isn't affected by the time between raising and servicing the IRQ.

> interrupt handler will delay servicing a different device that shares
> the same IRQ line.

Not always so.

Currently, ARM can only set one irq line to be served by one of CPU
at the same time for power saving, which still results in above situation.
In fact, without irq-balance, on ARM, all IRQs are handled by CPU0 only.

On Sat, Jun 15, 2013 at 5:09 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> And on UP it delays ALL other interrupts. I've seen 500us+ caused by
> the USB interrupt handlers...

On SMP the above case may be worse than UP, when the same completion
things(from hw view) happen on one of CPU, the releasing & reacquiring HCD
private lock in interrupt handler may cause longer time.


Thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2013-06-15  1:49 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-06-09 15:18 [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB in tasklet context Ming Lei
2013-06-09 15:18 ` [RFC PATCH 1/4] USB: HCD: support " Ming Lei
2013-06-09 15:58   ` Alan Stern
2013-06-10  8:12     ` Ming Lei
2013-06-10  8:43   ` Oliver Neukum
2013-06-10  9:23     ` Ming Lei
2013-06-10  9:31       ` Oliver Neukum
2013-06-10  9:51         ` Ming Lei
2013-06-09 15:18 ` [RFC PATCH 2/4] USB: EHCI: don't release ehci->lock if URB giveback " Ming Lei
2013-06-09 16:06   ` Alan Stern
2013-06-10  9:10     ` Ming Lei
2013-06-09 15:18 ` [RFC PATCH 3/4] USB: EHCI: improve interrupt qh unlink Ming Lei
2013-06-09 15:18 ` [RFC PATCH 4/4] USB: EHCI: support running URB giveback in tasklet context Ming Lei
2013-06-09 15:48 ` [RFC PATCH 0/4] USB: HCD/EHCI: giveback of URB " Alan Stern
2013-06-10  6:05   ` Ming Lei
2013-06-10 14:03     ` Alan Stern
2013-06-10 14:12       ` Oliver Neukum
2013-06-10 15:33         ` Alan Stern
2013-06-10 15:37       ` Ming Lei
2013-06-10 17:36         ` Alan Stern
2013-06-10 18:52           ` Steven Rostedt
2013-06-10 20:47             ` Alan Stern
2013-06-10 20:54               ` Steven Rostedt
2013-06-11  8:40                 ` Ming Lei
2013-06-10 20:51           ` Alan Stern
2013-06-11  6:19             ` Ming Lei
2013-06-11  5:40           ` Ming Lei
2013-06-11  7:18             ` Oliver Neukum
2013-06-11  8:14               ` Ming Lei
2013-06-11  8:49                 ` Oliver Neukum
2013-06-11  9:27                   ` Ming Lei
2013-06-11 10:51                     ` Oliver Neukum
2013-06-11 11:19                       ` Ming Lei
2013-06-11 19:10             ` Alan Stern
2013-06-12  2:43               ` Ming Lei
2013-06-12  6:41                 ` Ming Lei
2013-06-12  7:45                 ` Thomas Gleixner
2013-06-13  2:25                   ` Ming Lei
2013-06-13 14:54                     ` Alan Stern
2013-06-13 18:47                       ` Greg Kroah-Hartman
2013-06-13 19:41                         ` Alan Stern
2013-06-13 20:08                           ` Steven Rostedt
2013-06-13 21:09                             ` Alan Stern
2013-06-13 22:24                               ` Steven Rostedt
2013-06-13 23:08                                 ` Alan Stern
2013-06-14  1:27                               ` Ming Lei
2013-06-14  0:35                           ` Greg Kroah-Hartman
2013-06-14  1:53                             ` Ming Lei
2013-06-14  6:05                               ` Greg Kroah-Hartman
2013-06-14 10:05                                 ` Ming Lei
2013-06-14  1:37                           ` Ming Lei
2013-06-14  3:24                           ` Ming Lei
2013-06-14 14:56                             ` Alan Stern
2013-06-14 15:15                               ` Ming Lei
2013-06-14 20:23                                 ` Alan Stern
2013-06-14 21:09                                   ` Thomas Gleixner
2013-06-15  1:49                                   ` Ming Lei
2013-06-14  2:03                       ` Ming Lei
2013-06-12 14:35                 ` Alan Stern
2013-06-12 15:10                   ` Oliver Neukum
2013-06-13  3:41                   ` Ming Lei
2013-06-13 15:05                     ` Alan Stern
2013-06-12  9:11               ` Oliver Neukum
2013-06-12 10:11                 ` Ming Lei
2013-06-12 10:19                   ` Oliver Neukum
2013-06-12 11:33                     ` Ming Lei

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.