linux-arm-msm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
@ 2021-05-19  7:49 Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 1/5] usb: gadget: udc: core: Introduce check_config to verify USB configuration Wesley Cheng
                   ` (5 more replies)
  0 siblings, 6 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-05-19  7:49 UTC (permalink / raw)
  To: balbi, gregkh, agross, bjorn.andersson, robh+dt
  Cc: linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, Wesley Cheng

Changes in V9:
 - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
   add the property by default from the kernel.

Changes in V8:
 - Rebased to usb-testing
 - Using devm_kzalloc for adding txfifo property in dwc3-qcom
 - Removed DWC3 QCOM ACPI property for enabling the txfifo resize

Changes in V7:
 - Added a new property tx-fifo-max-num for limiting how much fifo space the
   resizing logic can allocate for endpoints with large burst values.  This
   can differ across platforms, and tie in closely with overall system latency.
 - Added recommended checks for DWC32.
 - Added changes to set the tx-fifo-resize property from dwc3-qcom by default
   instead of modifying the current DTSI files.
 - Added comments on all APIs/variables introduced.
 - Updated the DWC3 YAML to include a better description of the tx-fifo-resize
   property and added an entry for tx-fifo-max-num.

Changes in V6:
 - Rebased patches to usb-testing.
 - Renamed to PATCH series instead of RFC.
 - Checking for fs_descriptors instead of ss_descriptors for determining the
   endpoint count for a particular configuration.
 - Re-ordered patch series to fix patch dependencies.

Changes in V5:
 - Added check_config() logic, which is used to communicate the number of EPs
   used in a particular configuration.  Based on this, the DWC3 gadget driver
   has the ability to know the maximum number of eps utilized in all configs.
   This helps reduce unnecessary allocation to unused eps, and will catch fifo
   allocation issues at bind() time.
 - Fixed variable declaration to single line per variable, and reverse xmas.
 - Created a helper for fifo clearing, which is used by ep0.c

Changes in V4:
 - Removed struct dwc3* as an argument for dwc3_gadget_resize_tx_fifos()
 - Removed WARN_ON(1) in case we run out of fifo space
 
Changes in V3:
 - Removed "Reviewed-by" tags
 - Renamed series back to RFC
 - Modified logic to ensure that fifo_size is reset if we pass the minimum
   threshold.  Tested with binding multiple FDs requesting 6 FIFOs.

Changes in V2:
 - Modified TXFIFO resizing logic to ensure that each EP is reserved a
   FIFO.
 - Removed dev_dbg() prints and fixed typos from patches
 - Added some more description on the dt-bindings commit message

Currently, there is no functionality to allow for resizing the TXFIFOs, and
relying on the HW default setting for the TXFIFO depth.  In most cases, the
HW default is probably sufficient, but for USB compositions that contain
multiple functions that require EP bursting, the default settings
might not be enough.  Also to note, the current SW will assign an EP to a
function driver w/o checking to see if the TXFIFO size for that particular
EP is large enough. (this is a problem if there are multiple HW defined
values for the TXFIFO size)

It is mentioned in the SNPS databook that a minimum of TX FIFO depth = 3
is required for an EP that supports bursting.  Otherwise, there may be
frequent occurences of bursts ending.  For high bandwidth functions,
such as data tethering (protocols that support data aggregation), mass
storage, and media transfer protocol (over FFS), the bMaxBurst value can be
large, and a bigger TXFIFO depth may prove to be beneficial in terms of USB
throughput. (which can be associated to system access latency, etc...)  It
allows for a more consistent burst of traffic, w/o any interruptions, as
data is readily available in the FIFO.

With testing done using the mass storage function driver, the results show
that with a larger TXFIFO depth, the bandwidth increased significantly.

Test Parameters:
 - Platform: Qualcomm SM8150
 - bMaxBurst = 6
 - USB req size = 256kB
 - Num of USB reqs = 16
 - USB Speed = Super-Speed
 - Function Driver: Mass Storage (w/ ramdisk)
 - Test Application: CrystalDiskMark

Results:

TXFIFO Depth = 3 max packets

Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x     | 
Read      |9 loops    | 193.60
	  |           | 195.86
          |           | 184.77
          |           | 193.60
-------------------------------------------

TXFIFO Depth = 6 max packets

Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x     | 
Read      |9 loops    | 287.35
	  |           | 304.94
          |           | 289.64
          |           | 293.61
-------------------------------------------

Wesley Cheng (5):
  usb: gadget: udc: core: Introduce check_config to verify USB
    configuration
  usb: gadget: configfs: Check USB configuration before adding
  usb: dwc3: Resize TX FIFOs to meet EP bursting requirements
  usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default
  dt-bindings: usb: dwc3: Update dwc3 TX fifo properties

 .../devicetree/bindings/usb/snps,dwc3.yaml         |  15 +-
 drivers/usb/dwc3/core.c                            |   9 +
 drivers/usb/dwc3/core.h                            |  15 ++
 drivers/usb/dwc3/dwc3-qcom.c                       |   9 +
 drivers/usb/dwc3/ep0.c                             |   2 +
 drivers/usb/dwc3/gadget.c                          | 212 +++++++++++++++++++++
 drivers/usb/gadget/configfs.c                      |  22 +++
 drivers/usb/gadget/udc/core.c                      |  25 +++
 include/linux/usb/gadget.h                         |   5 +
 9 files changed, 312 insertions(+), 2 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v9 1/5] usb: gadget: udc: core: Introduce check_config to verify USB configuration
  2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
@ 2021-05-19  7:49 ` Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 2/5] usb: gadget: configfs: Check USB configuration before adding Wesley Cheng
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-05-19  7:49 UTC (permalink / raw)
  To: balbi, gregkh, agross, bjorn.andersson, robh+dt
  Cc: linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, Wesley Cheng

Some UDCs may have constraints on how many high bandwidth endpoints it can
support in a certain configuration.  This API allows for the composite
driver to pass down the total number of endpoints to the UDC so it can verify
it has the required resources to support the configuration.

Signed-off-by: Wesley Cheng <wcheng@codeaurora.org>
---
 drivers/usb/gadget/udc/core.c | 25 +++++++++++++++++++++++++
 include/linux/usb/gadget.h    |  5 +++++
 2 files changed, 30 insertions(+)

diff --git a/drivers/usb/gadget/udc/core.c b/drivers/usb/gadget/udc/core.c
index 493ff93..e67fd93 100644
--- a/drivers/usb/gadget/udc/core.c
+++ b/drivers/usb/gadget/udc/core.c
@@ -1003,6 +1003,31 @@ int usb_gadget_ep_match_desc(struct usb_gadget *gadget,
 }
 EXPORT_SYMBOL_GPL(usb_gadget_ep_match_desc);
 
+/**
+ * usb_gadget_check_config - checks if the UDC can support the number of eps
+ * @gadget: controller to check the USB configuration
+ * @ep_map: bitmap of endpoints being requested by a USB configuration
+ *
+ * Ensure that a UDC is able to support the number of endpoints within a USB
+ * configuration, and that there are no resource limitations to support all
+ * requested eps.
+ *
+ * Returns zero on success, else a negative errno.
+ */
+int usb_gadget_check_config(struct usb_gadget *gadget, unsigned long ep_map)
+{
+	int ret = 0;
+
+	if (!gadget->ops->check_config)
+		goto out;
+
+	ret = gadget->ops->check_config(gadget, ep_map);
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(usb_gadget_check_config);
+
 /* ------------------------------------------------------------------------- */
 
 static void usb_gadget_state_work(struct work_struct *work)
diff --git a/include/linux/usb/gadget.h b/include/linux/usb/gadget.h
index ee04ef2..9fb69eb 100644
--- a/include/linux/usb/gadget.h
+++ b/include/linux/usb/gadget.h
@@ -328,6 +328,7 @@ struct usb_gadget_ops {
 	struct usb_ep *(*match_ep)(struct usb_gadget *,
 			struct usb_endpoint_descriptor *,
 			struct usb_ss_ep_comp_descriptor *);
+	int	(*check_config)(struct usb_gadget *gadget, unsigned long ep_map);
 };
 
 /**
@@ -607,6 +608,7 @@ int usb_gadget_connect(struct usb_gadget *gadget);
 int usb_gadget_disconnect(struct usb_gadget *gadget);
 int usb_gadget_deactivate(struct usb_gadget *gadget);
 int usb_gadget_activate(struct usb_gadget *gadget);
+int usb_gadget_check_config(struct usb_gadget *gadget, unsigned long ep_map);
 #else
 static inline int usb_gadget_frame_number(struct usb_gadget *gadget)
 { return 0; }
@@ -630,6 +632,9 @@ static inline int usb_gadget_deactivate(struct usb_gadget *gadget)
 { return 0; }
 static inline int usb_gadget_activate(struct usb_gadget *gadget)
 { return 0; }
+static inline int usb_gadget_check_config(struct usb_gadget *gadget,
+					  unsigned long ep_map)
+{ return 0; }
 #endif /* CONFIG_USB_GADGET */
 
 /*-------------------------------------------------------------------------*/
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 2/5] usb: gadget: configfs: Check USB configuration before adding
  2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 1/5] usb: gadget: udc: core: Introduce check_config to verify USB configuration Wesley Cheng
@ 2021-05-19  7:49 ` Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 3/5] usb: dwc3: Resize TX FIFOs to meet EP bursting requirements Wesley Cheng
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-05-19  7:49 UTC (permalink / raw)
  To: balbi, gregkh, agross, bjorn.andersson, robh+dt
  Cc: linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, Wesley Cheng

Ensure that the USB gadget is able to support the configuration being
added based on the number of endpoints required from all interfaces.  This
is for accounting for any bandwidth or space limitations.

Signed-off-by: Wesley Cheng <wcheng@codeaurora.org>
---
 drivers/usb/gadget/configfs.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/usb/gadget/configfs.c b/drivers/usb/gadget/configfs.c
index 15a607c..76b9983 100644
--- a/drivers/usb/gadget/configfs.c
+++ b/drivers/usb/gadget/configfs.c
@@ -1374,6 +1374,7 @@ static int configfs_composite_bind(struct usb_gadget *gadget,
 		struct usb_function *f;
 		struct usb_function *tmp;
 		struct gadget_config_name *cn;
+		unsigned long ep_map = 0;
 
 		if (gadget_is_otg(gadget))
 			c->descriptors = otg_desc;
@@ -1403,7 +1404,28 @@ static int configfs_composite_bind(struct usb_gadget *gadget,
 				list_add(&f->list, &cfg->func_list);
 				goto err_purge_funcs;
 			}
+			if (f->fs_descriptors) {
+				struct usb_descriptor_header **d;
+
+				d = f->fs_descriptors;
+				for (; *d; ++d) {
+					struct usb_endpoint_descriptor *ep;
+					int addr;
+
+					if ((*d)->bDescriptorType != USB_DT_ENDPOINT)
+						continue;
+
+					ep = (struct usb_endpoint_descriptor *)*d;
+					addr = ((ep->bEndpointAddress & 0x80) >> 3) |
+						(ep->bEndpointAddress & 0x0f);
+					set_bit(addr, &ep_map);
+				}
+			}
 		}
+		ret = usb_gadget_check_config(cdev->gadget, ep_map);
+		if (ret)
+			goto err_purge_funcs;
+
 		usb_ep_autoconfig_reset(cdev->gadget);
 	}
 	if (cdev->use_os_string) {
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 3/5] usb: dwc3: Resize TX FIFOs to meet EP bursting requirements
  2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 1/5] usb: gadget: udc: core: Introduce check_config to verify USB configuration Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 2/5] usb: gadget: configfs: Check USB configuration before adding Wesley Cheng
@ 2021-05-19  7:49 ` Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 4/5] usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default Wesley Cheng
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-05-19  7:49 UTC (permalink / raw)
  To: balbi, gregkh, agross, bjorn.andersson, robh+dt
  Cc: linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, Wesley Cheng

Some devices have USB compositions which may require multiple endpoints
that support EP bursting.  HW defined TX FIFO sizes may not always be
sufficient for these compositions.  By utilizing flexible TX FIFO
allocation, this allows for endpoints to request the required FIFO depth to
achieve higher bandwidth.  With some higher bMaxBurst configurations, using
a larger TX FIFO size results in better TX throughput.

By introducing the check_config() callback, the resizing logic can fetch
the maximum number of endpoints used in the USB composition (can contain
multiple configurations), which helps ensure that the resizing logic can
fulfill the configuration(s), or return an error to the gadget layer
otherwise during bind time.

Signed-off-by: Wesley Cheng <wcheng@codeaurora.org>
---
 drivers/usb/dwc3/core.c   |   9 ++
 drivers/usb/dwc3/core.h   |  15 ++++
 drivers/usb/dwc3/ep0.c    |   2 +
 drivers/usb/dwc3/gadget.c | 212 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 238 insertions(+)

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index b6e53d8..74c0543 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -1267,6 +1267,7 @@ static void dwc3_get_properties(struct dwc3 *dwc)
 	u8			rx_max_burst_prd;
 	u8			tx_thr_num_pkt_prd;
 	u8			tx_max_burst_prd;
+	u8			tx_fifo_resize_max_num;
 	const char		*usb_psy_name;
 	int			ret;
 
@@ -1282,6 +1283,8 @@ static void dwc3_get_properties(struct dwc3 *dwc)
 	 */
 	hird_threshold = 12;
 
+	tx_fifo_resize_max_num = 6;
+
 	dwc->maximum_speed = usb_get_maximum_speed(dev);
 	dwc->max_ssp_rate = usb_get_maximum_ssp_rate(dev);
 	dwc->dr_mode = usb_get_dr_mode(dev);
@@ -1325,6 +1328,10 @@ static void dwc3_get_properties(struct dwc3 *dwc)
 				&tx_thr_num_pkt_prd);
 	device_property_read_u8(dev, "snps,tx-max-burst-prd",
 				&tx_max_burst_prd);
+	dwc->do_fifo_resize = device_property_read_bool(dev,
+							"tx-fifo-resize");
+	device_property_read_u8(dev, "tx-fifo-max-num",
+				&tx_fifo_resize_max_num);
 
 	dwc->disable_scramble_quirk = device_property_read_bool(dev,
 				"snps,disable_scramble_quirk");
@@ -1390,6 +1397,8 @@ static void dwc3_get_properties(struct dwc3 *dwc)
 	dwc->tx_max_burst_prd = tx_max_burst_prd;
 
 	dwc->imod_interval = 0;
+
+	dwc->tx_fifo_resize_max_num = tx_fifo_resize_max_num;
 }
 
 /* check whether the core supports IMOD */
diff --git a/drivers/usb/dwc3/core.h b/drivers/usb/dwc3/core.h
index b1e875c..84d4682 100644
--- a/drivers/usb/dwc3/core.h
+++ b/drivers/usb/dwc3/core.h
@@ -1023,6 +1023,7 @@ struct dwc3_scratchpad_array {
  * @rx_max_burst_prd: max periodic ESS receive burst size
  * @tx_thr_num_pkt_prd: periodic ESS transmit packet count
  * @tx_max_burst_prd: max periodic ESS transmit burst size
+ * @tx_fifo_resize_max_num: max number of fifos allocated during txfifo resize
  * @hsphy_interface: "utmi" or "ulpi"
  * @connected: true when we're connected to a host, false otherwise
  * @delayed_status: true when gadget driver asks for delayed status
@@ -1079,6 +1080,11 @@ struct dwc3_scratchpad_array {
  * @dis_split_quirk: set to disable split boundary.
  * @imod_interval: set the interrupt moderation interval in 250ns
  *			increments or 0 to disable.
+ * @max_cfg_eps: current max number of IN eps used across all USB configs.
+ * @last_fifo_depth: last fifo depth used to determine next fifo ram start
+ *		     address.
+ * @num_ep_resized: carries the current number endpoints which have had its tx
+ *		    fifo resized.
  */
 struct dwc3 {
 	struct work_struct	drd_work;
@@ -1234,6 +1240,7 @@ struct dwc3 {
 	u8			rx_max_burst_prd;
 	u8			tx_thr_num_pkt_prd;
 	u8			tx_max_burst_prd;
+	u8			tx_fifo_resize_max_num;
 
 	const char		*hsphy_interface;
 
@@ -1247,6 +1254,7 @@ struct dwc3 {
 	unsigned		is_utmi_l1_suspend:1;
 	unsigned		is_fpga:1;
 	unsigned		pending_events:1;
+	unsigned		do_fifo_resize:1;
 	unsigned		pullups_connected:1;
 	unsigned		setup_packet_pending:1;
 	unsigned		three_stage_setup:1;
@@ -1282,6 +1290,10 @@ struct dwc3 {
 	unsigned		dis_split_quirk:1;
 
 	u16			imod_interval;
+
+	int			max_cfg_eps;
+	int			last_fifo_depth;
+	int			num_ep_resized;
 };
 
 #define INCRX_BURST_MODE 0
@@ -1513,6 +1525,7 @@ int dwc3_send_gadget_ep_cmd(struct dwc3_ep *dep, unsigned int cmd,
 		struct dwc3_gadget_ep_cmd_params *params);
 int dwc3_send_gadget_generic_command(struct dwc3 *dwc, unsigned int cmd,
 		u32 param);
+void dwc3_gadget_clear_tx_fifos(struct dwc3 *dwc);
 #else
 static inline int dwc3_gadget_init(struct dwc3 *dwc)
 { return 0; }
@@ -1532,6 +1545,8 @@ static inline int dwc3_send_gadget_ep_cmd(struct dwc3_ep *dep, unsigned int cmd,
 static inline int dwc3_send_gadget_generic_command(struct dwc3 *dwc,
 		int cmd, u32 param)
 { return 0; }
+static inline void dwc3_gadget_clear_tx_fifos(struct dwc3 *dwc)
+{ }
 #endif
 
 #if IS_ENABLED(CONFIG_USB_DWC3_DUAL_ROLE)
diff --git a/drivers/usb/dwc3/ep0.c b/drivers/usb/dwc3/ep0.c
index 8b668ef..4f216bd 100644
--- a/drivers/usb/dwc3/ep0.c
+++ b/drivers/usb/dwc3/ep0.c
@@ -616,6 +616,8 @@ static int dwc3_ep0_set_config(struct dwc3 *dwc, struct usb_ctrlrequest *ctrl)
 		return -EINVAL;
 
 	case USB_STATE_ADDRESS:
+		dwc3_gadget_clear_tx_fifos(dwc);
+
 		ret = dwc3_ep0_delegate_req(dwc, ctrl);
 		/* if the cfg matches and the cfg is non zero */
 		if (cfg && (!ret || (ret == USB_GADGET_DELAYED_STATUS))) {
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index dd80e5c..e2f4dad 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -632,6 +632,179 @@ static void dwc3_stop_active_transfer(struct dwc3_ep *dep, bool force,
 		bool interrupt);
 
 /**
+ * dwc3_gadget_calc_tx_fifo_size - calculates the txfifo size value
+ * @dwc: pointer to the DWC3 context
+ * @nfifos: number of fifos to calculate for
+ *
+ * Calculates the size value based on the equation below:
+ *
+ * fifo_size = mult * ((max_packet + mdwidth)/mdwidth + 1) + 1
+ *
+ * The max packet size is set to 1024, as the txfifo requirements mainly apply
+ * to super speed USB use cases.  However, it is safe to overestimate the fifo
+ * allocations for other scenarios, i.e. high speed USB.
+ */
+static int dwc3_gadget_calc_tx_fifo_size(struct dwc3 *dwc, int mult)
+{
+	int max_packet = 1024;
+	int fifo_size;
+	int mdwidth;
+
+	mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
+
+	/* MDWIDTH is represented in bits, we need it in bytes */
+	mdwidth >>= 3;
+
+	fifo_size = mult * ((max_packet + mdwidth) / mdwidth) + 1;
+	return fifo_size;
+}
+
+/**
+ * dwc3_gadget_clear_tx_fifo_size - Clears txfifo allocation
+ * @dwc: pointer to the DWC3 context
+ *
+ * Iterates through all the endpoint registers and clears the previous txfifo
+ * allocations.
+ */
+void dwc3_gadget_clear_tx_fifos(struct dwc3 *dwc)
+{
+	struct dwc3_ep *dep;
+	int fifo_depth;
+	int size;
+	int num;
+
+	if (!dwc->do_fifo_resize)
+		return;
+
+	/* Read ep0IN related TXFIFO size */
+	dep = dwc->eps[1];
+	size = dwc3_readl(dwc->regs, DWC3_GTXFIFOSIZ(0));
+	if (DWC3_IP_IS(DWC3))
+		fifo_depth = DWC3_GTXFIFOSIZ_TXFDEP(size);
+	else
+		fifo_depth = DWC31_GTXFIFOSIZ_TXFDEP(size);
+
+	dwc->last_fifo_depth = fifo_depth;
+	/* Clear existing TXFIFO for all IN eps except ep0 */
+	for (num = 3; num < min_t(int, dwc->num_eps, DWC3_ENDPOINTS_NUM);
+	     num += 2) {
+		dep = dwc->eps[num];
+		/* Don't change TXFRAMNUM on usb31 version */
+		size = DWC3_IP_IS(DWC3) ? 0 :
+			dwc3_readl(dwc->regs, DWC3_GTXFIFOSIZ(num >> 1)) &
+				   DWC31_GTXFIFOSIZ_TXFRAMNUM;
+
+		dwc3_writel(dwc->regs, DWC3_GTXFIFOSIZ(num >> 1), size);
+	}
+	dwc->num_ep_resized = 0;
+}
+
+/*
+ * dwc3_gadget_resize_tx_fifos - reallocate fifo spaces for current use-case
+ * @dwc: pointer to our context structure
+ *
+ * This function will a best effort FIFO allocation in order
+ * to improve FIFO usage and throughput, while still allowing
+ * us to enable as many endpoints as possible.
+ *
+ * Keep in mind that this operation will be highly dependent
+ * on the configured size for RAM1 - which contains TxFifo -,
+ * the amount of endpoints enabled on coreConsultant tool, and
+ * the width of the Master Bus.
+ *
+ * In general, FIFO depths are represented with the following equation:
+ *
+ * fifo_size = mult * ((max_packet + mdwidth)/mdwidth + 1) + 1
+ *
+ * Conversions can be done to the equation to derive the number of packets that
+ * will fit to a particular FIFO size value.
+ */
+static int dwc3_gadget_resize_tx_fifos(struct dwc3_ep *dep)
+{
+	struct dwc3 *dwc = dep->dwc;
+	int fifo_0_start;
+	int ram1_depth;
+	int fifo_size;
+	int min_depth;
+	int num_in_ep;
+	int remaining;
+	int num_fifos = 1;
+	int fifo;
+	int tmp;
+
+	if (!dwc->do_fifo_resize)
+		return 0;
+
+	/* resize IN endpoints except ep0 */
+	if (!usb_endpoint_dir_in(dep->endpoint.desc) || dep->number <= 1)
+		return 0;
+
+	ram1_depth = DWC3_RAM1_DEPTH(dwc->hwparams.hwparams7);
+
+	if ((dep->endpoint.maxburst > 1 &&
+	     usb_endpoint_xfer_bulk(dep->endpoint.desc)) ||
+	    usb_endpoint_xfer_isoc(dep->endpoint.desc))
+		num_fifos = 3;
+
+	if (dep->endpoint.maxburst > 6 &&
+	    usb_endpoint_xfer_bulk(dep->endpoint.desc) && DWC3_IP_IS(DWC31))
+		num_fifos = dwc->tx_fifo_resize_max_num;
+
+	/* FIFO size for a single buffer */
+	fifo = dwc3_gadget_calc_tx_fifo_size(dwc, 1);
+
+	/* Calculate the number of remaining EPs w/o any FIFO */
+	num_in_ep = dwc->max_cfg_eps;
+	num_in_ep -= dwc->num_ep_resized;
+
+	/* Reserve at least one FIFO for the number of IN EPs */
+	min_depth = num_in_ep * (fifo + 1);
+	remaining = ram1_depth - min_depth - dwc->last_fifo_depth;
+	remaining = max_t(int, 0, remaining);
+	/*
+	 * We've already reserved 1 FIFO per EP, so check what we can fit in
+	 * addition to it.  If there is not enough remaining space, allocate
+	 * all the remaining space to the EP.
+	 */
+	fifo_size = (num_fifos - 1) * fifo;
+	if (remaining < fifo_size)
+		fifo_size = remaining;
+
+	fifo_size += fifo;
+	/* Last increment according to the TX FIFO size equation */
+	fifo_size++;
+
+	/* Check if TXFIFOs start at non-zero addr */
+	tmp = dwc3_readl(dwc->regs, DWC3_GTXFIFOSIZ(0));
+	fifo_0_start = DWC3_GTXFIFOSIZ_TXFSTADDR(tmp);
+
+	fifo_size |= (fifo_0_start + (dwc->last_fifo_depth << 16));
+	if (DWC3_IP_IS(DWC3))
+		dwc->last_fifo_depth += DWC3_GTXFIFOSIZ_TXFDEP(fifo_size);
+	else
+		dwc->last_fifo_depth += DWC31_GTXFIFOSIZ_TXFDEP(fifo_size);
+
+	/* Check fifo size allocation doesn't exceed available RAM size. */
+	if (dwc->last_fifo_depth >= ram1_depth) {
+		dev_err(dwc->dev, "Fifosize(%d) > RAM size(%d) %s depth:%d\n",
+			dwc->last_fifo_depth, ram1_depth,
+			dep->endpoint.name, fifo_size);
+		if (DWC3_IP_IS(DWC3))
+			fifo_size = DWC3_GTXFIFOSIZ_TXFDEP(fifo_size);
+		else
+			fifo_size = DWC31_GTXFIFOSIZ_TXFDEP(fifo_size);
+
+		dwc->last_fifo_depth -= fifo_size;
+		return -ENOMEM;
+	}
+
+	dwc3_writel(dwc->regs, DWC3_GTXFIFOSIZ(dep->number >> 1), fifo_size);
+	dwc->num_ep_resized++;
+
+	return 0;
+}
+
+/**
  * __dwc3_gadget_ep_enable - initializes a hw endpoint
  * @dep: endpoint to be initialized
  * @action: one of INIT, MODIFY or RESTORE
@@ -648,6 +821,10 @@ static int __dwc3_gadget_ep_enable(struct dwc3_ep *dep, unsigned int action)
 	int			ret;
 
 	if (!(dep->flags & DWC3_EP_ENABLED)) {
+		ret = dwc3_gadget_resize_tx_fifos(dep);
+		if (ret)
+			return ret;
+
 		ret = dwc3_gadget_start_config(dep);
 		if (ret)
 			return ret;
@@ -2492,6 +2669,7 @@ static int dwc3_gadget_stop(struct usb_gadget *g)
 
 	spin_lock_irqsave(&dwc->lock, flags);
 	dwc->gadget_driver	= NULL;
+	dwc->max_cfg_eps = 0;
 	spin_unlock_irqrestore(&dwc->lock, flags);
 
 	free_irq(dwc->irq_gadget, dwc->ev_buf);
@@ -2579,6 +2757,39 @@ static int dwc3_gadget_vbus_draw(struct usb_gadget *g, unsigned int mA)
 	return ret;
 }
 
+static int dwc3_gadget_check_config(struct usb_gadget *g, unsigned long ep_map)
+{
+	struct dwc3 *dwc = gadget_to_dwc(g);
+	unsigned long in_ep_map;
+	int fifo_size = 0;
+	int ram1_depth;
+	int ep_num;
+
+	if (!dwc->do_fifo_resize)
+		return 0;
+
+	/* Only interested in the IN endpoints */
+	in_ep_map = ep_map >> 16;
+	ep_num = hweight_long(in_ep_map);
+
+	if (ep_num <= dwc->max_cfg_eps)
+		return 0;
+
+	/* Update the max number of eps in the composition */
+	dwc->max_cfg_eps = ep_num;
+
+	fifo_size = dwc3_gadget_calc_tx_fifo_size(dwc, dwc->max_cfg_eps);
+	/* Based on the equation, increment by one for every ep */
+	fifo_size += dwc->max_cfg_eps;
+
+	/* Check if we can fit a single fifo per endpoint */
+	ram1_depth = DWC3_RAM1_DEPTH(dwc->hwparams.hwparams7);
+	if (fifo_size > ram1_depth)
+		return -ENOMEM;
+
+	return 0;
+}
+
 static const struct usb_gadget_ops dwc3_gadget_ops = {
 	.get_frame		= dwc3_gadget_get_frame,
 	.wakeup			= dwc3_gadget_wakeup,
@@ -2590,6 +2801,7 @@ static const struct usb_gadget_ops dwc3_gadget_ops = {
 	.udc_set_ssp_rate	= dwc3_gadget_set_ssp_rate,
 	.get_config_params	= dwc3_gadget_config_params,
 	.vbus_draw		= dwc3_gadget_vbus_draw,
+	.check_config		= dwc3_gadget_check_config,
 };
 
 /* -------------------------------------------------------------------------- */
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 4/5] usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default
  2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
                   ` (2 preceding siblings ...)
  2021-05-19  7:49 ` [PATCH v9 3/5] usb: dwc3: Resize TX FIFOs to meet EP bursting requirements Wesley Cheng
@ 2021-05-19  7:49 ` Wesley Cheng
  2021-05-19  7:49 ` [PATCH v9 5/5] dt-bindings: usb: dwc3: Update dwc3 TX fifo properties Wesley Cheng
  2021-06-04 11:54 ` [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Greg KH
  5 siblings, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-05-19  7:49 UTC (permalink / raw)
  To: balbi, gregkh, agross, bjorn.andersson, robh+dt
  Cc: linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, Wesley Cheng

In order to take advantage of the TX fifo resizing logic, manually add
these properties to the DWC3 child node by default.  This will allow
the DWC3 gadget to resize the TX fifos for the IN endpoints, which
help with performance.

Signed-off-by: Wesley Cheng <wcheng@codeaurora.org>
---
 drivers/usb/dwc3/dwc3-qcom.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/usb/dwc3/dwc3-qcom.c b/drivers/usb/dwc3/dwc3-qcom.c
index 49e6ca9..44e0eaa1 100644
--- a/drivers/usb/dwc3/dwc3-qcom.c
+++ b/drivers/usb/dwc3/dwc3-qcom.c
@@ -645,6 +645,7 @@ static int dwc3_qcom_of_register_core(struct platform_device *pdev)
 	struct dwc3_qcom	*qcom = platform_get_drvdata(pdev);
 	struct device_node	*np = pdev->dev.of_node, *dwc3_np;
 	struct device		*dev = &pdev->dev;
+	struct property		*prop;
 	int			ret;
 
 	dwc3_np = of_get_compatible_child(np, "snps,dwc3");
@@ -653,6 +654,14 @@ static int dwc3_qcom_of_register_core(struct platform_device *pdev)
 		return -ENODEV;
 	}
 
+	prop = devm_kzalloc(dev, sizeof(*prop), GFP_KERNEL);
+	if (prop) {
+		prop->name = "tx-fifo-resize";
+		ret = of_add_property(dwc3_np, prop);
+		if (ret < 0)
+			dev_info(dev, "unable to add tx-fifo-resize prop\n");
+	}
+
 	ret = of_platform_populate(np, NULL, NULL, dev);
 	if (ret) {
 		dev_err(dev, "failed to register dwc3 core - %d\n", ret);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH v9 5/5] dt-bindings: usb: dwc3: Update dwc3 TX fifo properties
  2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
                   ` (3 preceding siblings ...)
  2021-05-19  7:49 ` [PATCH v9 4/5] usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default Wesley Cheng
@ 2021-05-19  7:49 ` Wesley Cheng
  2021-06-04 11:54 ` [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Greg KH
  5 siblings, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-05-19  7:49 UTC (permalink / raw)
  To: balbi, gregkh, agross, bjorn.andersson, robh+dt
  Cc: linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, Wesley Cheng

Update the tx-fifo-resize property with a better description, while
adding the tx-fifo-max-num, which is a new parameter allowing
adjustments for the maximum number of packets the txfifo resizing logic
can account for while resizing the endpoints.

Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wesley Cheng <wcheng@codeaurora.org>
---
 Documentation/devicetree/bindings/usb/snps,dwc3.yaml | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/usb/snps,dwc3.yaml b/Documentation/devicetree/bindings/usb/snps,dwc3.yaml
index 41416fb..078fb78 100644
--- a/Documentation/devicetree/bindings/usb/snps,dwc3.yaml
+++ b/Documentation/devicetree/bindings/usb/snps,dwc3.yaml
@@ -289,10 +289,21 @@ properties:
     maximum: 16
 
   tx-fifo-resize:
-    description: Determines if the FIFO *has* to be reallocated
-    deprecated: true
+    description: Determines if the TX fifos can be dynamically resized depending
+      on the number of IN endpoints used and if bursting is supported.  This
+      may help improve bandwidth on platforms with higher system latencies, as
+      increased fifo space allows for the controller to prefetch data into its
+      internal memory.
     type: boolean
 
+  tx-fifo-max-num:
+    description: Specifies the max number of packets the txfifo resizing logic
+      can account for when higher endpoint bursting is used. (bMaxBurst > 6) The
+      higher the number, the more fifo space the txfifo resizing logic will
+      allocate for that endpoint.
+    $ref: /schemas/types.yaml#/definitions/uint8
+    minimum: 3
+
   snps,incr-burst-type-adjustment:
     description:
       Value for INCR burst type of GSBUSCFG0 register, undefined length INCR
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
                   ` (4 preceding siblings ...)
  2021-05-19  7:49 ` [PATCH v9 5/5] dt-bindings: usb: dwc3: Update dwc3 TX fifo properties Wesley Cheng
@ 2021-06-04 11:54 ` Greg KH
  2021-06-04 14:18   ` Felipe Balbi
  2021-06-07 16:04   ` Jack Pham
  5 siblings, 2 replies; 31+ messages in thread
From: Greg KH @ 2021-06-04 11:54 UTC (permalink / raw)
  To: Wesley Cheng
  Cc: balbi, agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
> Changes in V9:
>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>    add the property by default from the kernel.

This patch series has one build failure and one warning added:

drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
  653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
      |                                ~~~~~~~~~~~~~^~~~~~~~~~
      |                                             |
      |                                             u32 {aka unsigned int}
In file included from drivers/usb/dwc3/debug.h:14,
                 from drivers/usb/dwc3/gadget.c:25:
drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
 1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
      |                                ~~~~~~~~~~~~~^~~


drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
  660 |                 ret = of_add_property(dwc3_np, prop);
      |                       ^~~~~~~~~~~~~~~
      |                       of_get_property


How did you test these?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-04 11:54 ` [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Greg KH
@ 2021-06-04 14:18   ` Felipe Balbi
  2021-06-04 14:36     ` Greg KH
  2021-06-07 16:04   ` Jack Pham
  1 sibling, 1 reply; 31+ messages in thread
From: Felipe Balbi @ 2021-06-04 14:18 UTC (permalink / raw)
  To: Greg KH, Wesley Cheng
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

[-- Attachment #1: Type: text/plain, Size: 2777 bytes --]


Hi,

Greg KH <gregkh@linuxfoundation.org> writes:
> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>> Changes in V9:
>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>    add the property by default from the kernel.
>
> This patch series has one build failure and one warning added:
>
> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>       |                                             |
>       |                                             u32 {aka unsigned int}
> In file included from drivers/usb/dwc3/debug.h:14,
>                  from drivers/usb/dwc3/gadget.c:25:
> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>       |                                ~~~~~~~~~~~~~^~~
>
>
> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>   660 |                 ret = of_add_property(dwc3_np, prop);
>       |                       ^~~~~~~~~~~~~~~
>       |                       of_get_property
>
>
> How did you test these?

to be honest, I don't think these should go in (apart from the build
failure) because it's likely to break instantiations of the core with
differing FIFO sizes. Some instantiations even have some endpoints with
dedicated functionality that requires the default FIFO size configured
during coreConsultant instantiation. I know of at OMAP5 and some Intel
implementations which have dedicated endpoints for processor tracing.

With OMAP5, these endpoints are configured at the top of the available
endpoints, which means that if a gadget driver gets loaded and takes
over most of the FIFO space because of this resizing, processor tracing
will have a hard time running. That being said, processor tracing isn't
supported in upstream at this moment.

I still think this may cause other places to break down. The promise the
databook makes is that increasing the FIFO size over 2x wMaxPacketSize
should bring little to no benefit, if we're not maintaining that, I
wonder if the problem is with some of the BUSCFG registers instead,
where we configure interconnect bursting and the like.

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 511 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-04 14:18   ` Felipe Balbi
@ 2021-06-04 14:36     ` Greg KH
  2021-06-08  5:44       ` Wesley Cheng
  0 siblings, 1 reply; 31+ messages in thread
From: Greg KH @ 2021-06-04 14:36 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Wesley Cheng, agross, bjorn.andersson, robh+dt, linux-usb,
	linux-kernel, linux-arm-msm, devicetree, jackp, Thinh.Nguyen

On Fri, Jun 04, 2021 at 05:18:16PM +0300, Felipe Balbi wrote:
> 
> Hi,
> 
> Greg KH <gregkh@linuxfoundation.org> writes:
> > On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
> >> Changes in V9:
> >>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
> >>    add the property by default from the kernel.
> >
> > This patch series has one build failure and one warning added:
> >
> > drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
> > drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
> >   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
> >       |                                ~~~~~~~~~~~~~^~~~~~~~~~
> >       |                                             |
> >       |                                             u32 {aka unsigned int}
> > In file included from drivers/usb/dwc3/debug.h:14,
> >                  from drivers/usb/dwc3/gadget.c:25:
> > drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
> >  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
> >       |                                ~~~~~~~~~~~~~^~~
> >
> >
> > drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
> > drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
> >   660 |                 ret = of_add_property(dwc3_np, prop);
> >       |                       ^~~~~~~~~~~~~~~
> >       |                       of_get_property
> >
> >
> > How did you test these?
> 
> to be honest, I don't think these should go in (apart from the build
> failure) because it's likely to break instantiations of the core with
> differing FIFO sizes. Some instantiations even have some endpoints with
> dedicated functionality that requires the default FIFO size configured
> during coreConsultant instantiation. I know of at OMAP5 and some Intel
> implementations which have dedicated endpoints for processor tracing.
> 
> With OMAP5, these endpoints are configured at the top of the available
> endpoints, which means that if a gadget driver gets loaded and takes
> over most of the FIFO space because of this resizing, processor tracing
> will have a hard time running. That being said, processor tracing isn't
> supported in upstream at this moment.
> 
> I still think this may cause other places to break down. The promise the
> databook makes is that increasing the FIFO size over 2x wMaxPacketSize
> should bring little to no benefit, if we're not maintaining that, I
> wonder if the problem is with some of the BUSCFG registers instead,
> where we configure interconnect bursting and the like.

Good points.

Wesley, what kind of testing have you done on this on different devices?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-04 11:54 ` [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Greg KH
  2021-06-04 14:18   ` Felipe Balbi
@ 2021-06-07 16:04   ` Jack Pham
  2021-06-08  5:07     ` Wesley Cheng
  1 sibling, 1 reply; 31+ messages in thread
From: Jack Pham @ 2021-06-07 16:04 UTC (permalink / raw)
  To: Greg KH
  Cc: Wesley Cheng, balbi, agross, bjorn.andersson, robh+dt, linux-usb,
	linux-kernel, linux-arm-msm, devicetree, Thinh.Nguyen

Hey Wesley,

On Fri, Jun 04, 2021 at 01:54:48PM +0200, Greg KH wrote:
> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
> > Changes in V9:
> >  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
> >    add the property by default from the kernel.
> 
> This patch series has one build failure and one warning added:
> 
> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>       |                                             |
>       |                                             u32 {aka unsigned int}
> In file included from drivers/usb/dwc3/debug.h:14,
>                  from drivers/usb/dwc3/gadget.c:25:
> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>       |                                ~~~~~~~~~~~~~^~~

I'm guessing you were previously using the DWC3_MDWIDTH macro which
operated on the 'hwparams0' reg value directly, but probably had to
switch it to the dwc3_mdwidth() inline function that Thinh had replaced
it with recently. Forgot to compile-test I bet? :)

> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>   660 |                 ret = of_add_property(dwc3_np, prop);
>       |                       ^~~~~~~~~~~~~~~
>       |                       of_get_property

Scratched my head on this one a bit, since 'of_add_property' is clearly
declared in <linux/of.h> which dwc3-qcom.c directly includes. Then I
looked closer and saw the declaration only in case of #ifdef CONFIG_OF
and noticed it doesn't have a corresponding no-op static inline
definition in the case of !CONFIG_OF. Again I'm guessing here that Greg
must have built on a non-OF config.  We should probably include a patch
that adds the stub.

Thanks,
Jack
-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-07 16:04   ` Jack Pham
@ 2021-06-08  5:07     ` Wesley Cheng
  0 siblings, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-06-08  5:07 UTC (permalink / raw)
  To: Jack Pham, Greg KH
  Cc: balbi, agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, Thinh.Nguyen

Hi Jack,

On 6/7/2021 9:04 AM, Jack Pham wrote:
> Hey Wesley,
> 
> On Fri, Jun 04, 2021 at 01:54:48PM +0200, Greg KH wrote:
>> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>>> Changes in V9:
>>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>>    add the property by default from the kernel.
>>
>> This patch series has one build failure and one warning added:
>>
>> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
>> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>>       |                                             |
>>       |                                             u32 {aka unsigned int}
>> In file included from drivers/usb/dwc3/debug.h:14,
>>                  from drivers/usb/dwc3/gadget.c:25:
>> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>>       |                                ~~~~~~~~~~~~~^~~
> 
> I'm guessing you were previously using the DWC3_MDWIDTH macro which
> operated on the 'hwparams0' reg value directly, but probably had to
> switch it to the dwc3_mdwidth() inline function that Thinh had replaced
> it with recently. Forgot to compile-test I bet? :)
> 
Ah, looks like that's the case.  I tried this on our internal branches,
which didn't have Thinh's change, which is probably why it worked in the
first place.  Will fix this.

>> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
>> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>>   660 |                 ret = of_add_property(dwc3_np, prop);
>>       |                       ^~~~~~~~~~~~~~~
>>       |                       of_get_property
> 
> Scratched my head on this one a bit, since 'of_add_property' is clearly
> declared in <linux/of.h> which dwc3-qcom.c directly includes. Then I
> looked closer and saw the declaration only in case of #ifdef CONFIG_OF
> and noticed it doesn't have a corresponding no-op static inline
> definition in the case of !CONFIG_OF. Again I'm guessing here that Greg
> must have built on a non-OF config.  We should probably include a patch
> that adds the stub.
> 

Nice catch, will add the stub.

Thanks
Wesley Cheng

> Thanks,
> Jack
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-04 14:36     ` Greg KH
@ 2021-06-08  5:44       ` Wesley Cheng
  2021-06-10  9:20         ` Felipe Balbi
  0 siblings, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-08  5:44 UTC (permalink / raw)
  To: Greg KH, Felipe Balbi
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

Hi Greg/Felipe,

On 6/4/2021 7:36 AM, Greg KH wrote:
> On Fri, Jun 04, 2021 at 05:18:16PM +0300, Felipe Balbi wrote:
>>
>> Hi,
>>
>> Greg KH <gregkh@linuxfoundation.org> writes:
>>> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>>>> Changes in V9:
>>>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>>>    add the property by default from the kernel.
>>>
>>> This patch series has one build failure and one warning added:
>>>
>>> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
>>> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>>>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>>>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>>>       |                                             |
>>>       |                                             u32 {aka unsigned int}
>>> In file included from drivers/usb/dwc3/debug.h:14,
>>>                  from drivers/usb/dwc3/gadget.c:25:
>>> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>>>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>>>       |                                ~~~~~~~~~~~~~^~~
>>>
>>>
>>> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
>>> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>>>   660 |                 ret = of_add_property(dwc3_np, prop);
>>>       |                       ^~~~~~~~~~~~~~~
>>>       |                       of_get_property
>>>
>>>
>>> How did you test these?

I ran these changes on our internal branches, which were probably
missing some of the recent changes done to the DWC3 drivers.  Will fix
the above compile errors and re-submit.

In regards to how much these changes have been tested, we've been
maintaining the TX FIFO resize logic downstream for a few years already,
so its being used in end products.  We also verify this with our
internal testing, which has certain benchmarks we need to meet.

>>
>> to be honest, I don't think these should go in (apart from the build
>> failure) because it's likely to break instantiations of the core with
>> differing FIFO sizes. Some instantiations even have some endpoints with
>> dedicated functionality that requires the default FIFO size configured
>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>> implementations which have dedicated endpoints for processor tracing.
>>
>> With OMAP5, these endpoints are configured at the top of the available
>> endpoints, which means that if a gadget driver gets loaded and takes
>> over most of the FIFO space because of this resizing, processor tracing
>> will have a hard time running. That being said, processor tracing isn't
>> supported in upstream at this moment.
>>

I agree that the application of this logic may differ between vendors,
hence why I wanted to keep this controllable by the DT property, so that
for those which do not support this use case can leave it disabled.  The
logic is there to ensure that for a given USB configuration, for each EP
it would have at least 1 TX FIFO.  For USB configurations which don't
utilize all available IN EPs, it would allow re-allocation of internal
memory to EPs which will actually be in use.

>> I still think this may cause other places to break down. The promise the
>> databook makes is that increasing the FIFO size over 2x wMaxPacketSize
>> should bring little to no benefit, if we're not maintaining that, I
>> wonder if the problem is with some of the BUSCFG registers instead,
>> where we configure interconnect bursting and the like.
> 

I've been referring mainly to the DWC3 programming guide for
recommendations on how to improve USB performance in:
Section 3.3.5 System Bus Features to Improve USB Performance

At least when I ran the initial profiling, adjusting the RX/TX
thresholds brought little to no benefits.  Even in some of the examples,
they have diagrams showing a TXFIFO size of 6 max packets (Figure 3-5).
 I think its difficult to say that the TX fifo resizing won't help in
systems with limited, or shared resources where the bus latencies would
be somewhat larger.  By adjusting the TX FIFO size, the controller would
be able to fetch more data from system memory into the memory within the
controller, leading to less frequent end of bursts, etc... as data is
readily available.

In terms of adjusting the AXI/AHB bursting, I would think the bandwidth
increase would eventually be constrained based on your system's design.
 We don't touch the GSBUSCFG registers, and leave them as is based off
the recommendations from the HW designers.

> Good points.
> 
> Wesley, what kind of testing have you done on this on different devices?
> 

As mentioned above, these changes are currently present on end user
devices for the past few years, so its been through a lot of testing :).

Thanks
Wesley Cheng


-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-08  5:44       ` Wesley Cheng
@ 2021-06-10  9:20         ` Felipe Balbi
  2021-06-10 10:03           ` Greg KH
  2021-06-10 18:15           ` Wesley Cheng
  0 siblings, 2 replies; 31+ messages in thread
From: Felipe Balbi @ 2021-06-10  9:20 UTC (permalink / raw)
  To: Wesley Cheng, Greg KH
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

[-- Attachment #1: Type: text/plain, Size: 6886 bytes --]


Hi,

Wesley Cheng <wcheng@codeaurora.org> writes:

> Hi Greg/Felipe,
>
> On 6/4/2021 7:36 AM, Greg KH wrote:
>> On Fri, Jun 04, 2021 at 05:18:16PM +0300, Felipe Balbi wrote:
>>>
>>> Hi,
>>>
>>> Greg KH <gregkh@linuxfoundation.org> writes:
>>>> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>>>>> Changes in V9:
>>>>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>>>>    add the property by default from the kernel.
>>>>
>>>> This patch series has one build failure and one warning added:
>>>>
>>>> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
>>>> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>>>>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>>>>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>>>>       |                                             |
>>>>       |                                             u32 {aka unsigned int}
>>>> In file included from drivers/usb/dwc3/debug.h:14,
>>>>                  from drivers/usb/dwc3/gadget.c:25:
>>>> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>>>>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>>>>       |                                ~~~~~~~~~~~~~^~~
>>>>
>>>>
>>>> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
>>>> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>>>>   660 |                 ret = of_add_property(dwc3_np, prop);
>>>>       |                       ^~~~~~~~~~~~~~~
>>>>       |                       of_get_property
>>>>
>>>>
>>>> How did you test these?
>
> I ran these changes on our internal branches, which were probably
> missing some of the recent changes done to the DWC3 drivers.  Will fix
> the above compile errors and re-submit.
>
> In regards to how much these changes have been tested, we've been
> maintaining the TX FIFO resize logic downstream for a few years already,
> so its being used in end products.  We also verify this with our
> internal testing, which has certain benchmarks we need to meet.

the problem with that is that you *know* which gadget is running
there. You know everyone of those is going to run the android
gadget. In a sense, all those multiple products are testing the same
exact use case :-)

>>> to be honest, I don't think these should go in (apart from the build
>>> failure) because it's likely to break instantiations of the core with
>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>> dedicated functionality that requires the default FIFO size configured
>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>> implementations which have dedicated endpoints for processor tracing.
>>>
>>> With OMAP5, these endpoints are configured at the top of the available
>>> endpoints, which means that if a gadget driver gets loaded and takes
>>> over most of the FIFO space because of this resizing, processor tracing
>>> will have a hard time running. That being said, processor tracing isn't
>>> supported in upstream at this moment.
>>>
>
> I agree that the application of this logic may differ between vendors,
> hence why I wanted to keep this controllable by the DT property, so that
> for those which do not support this use case can leave it disabled.  The
> logic is there to ensure that for a given USB configuration, for each EP
> it would have at least 1 TX FIFO.  For USB configurations which don't
> utilize all available IN EPs, it would allow re-allocation of internal
> memory to EPs which will actually be in use.

The feature ends up being all-or-nothing, then :-) It sounds like we can
be a little nicer in this regard.

>>> I still think this may cause other places to break down. The promise the
>>> databook makes is that increasing the FIFO size over 2x wMaxPacketSize
>>> should bring little to no benefit, if we're not maintaining that, I
>>> wonder if the problem is with some of the BUSCFG registers instead,
>>> where we configure interconnect bursting and the like.
>> 
>
> I've been referring mainly to the DWC3 programming guide for
> recommendations on how to improve USB performance in:
> Section 3.3.5 System Bus Features to Improve USB Performance

dwc3 or dwc3.1? Either way, since I left Intel I don't have access to
the databook anymore. I have to trust what you guys are telling me and,
based on the description so far, I don't think we're doing the right
thing (yet).

It would be nice if other users would test this patchset with different
gadget drivers and different platforms to have some confidence that
we're limiting possible regressions.

I would like for Thinh to comment from Synopsys side here.

> At least when I ran the initial profiling, adjusting the RX/TX
> thresholds brought little to no benefits.  Even in some of the examples,

right, the FIFO sizes shouldn't help much. At least that's what Paul
told me several years ago. Thinh, has the recommendation changed?

> they have diagrams showing a TXFIFO size of 6 max packets (Figure 3-5).
>  I think its difficult to say that the TX fifo resizing won't help in
> systems with limited, or shared resources where the bus latencies would
> be somewhat larger.  By adjusting the TX FIFO size, the controller would
> be able to fetch more data from system memory into the memory within the
> controller, leading to less frequent end of bursts, etc... as data is
> readily available.
>
> In terms of adjusting the AXI/AHB bursting, I would think the bandwidth
> increase would eventually be constrained based on your system's design.
>  We don't touch the GSBUSCFG registers, and leave them as is based off
> the recommendations from the HW designers.

Right, I want to touch those as little as possible too :-) However, to
illustrate, the only reason I implemented FIFO resizing was because
OMAP5 ES1 had TX FIFOs that were smaller than a full USB3 packet. HW
Designer's recommendation can be bogus too ;-)

>> Good points.
>> 
>> Wesley, what kind of testing have you done on this on different devices?
>> 
>
> As mentioned above, these changes are currently present on end user
> devices for the past few years, so its been through a lot of testing :).

all with the same gadget driver. Also, who uses USB on android devices
these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
:-)

I guess only developers are using USB during development to flash dev
images heh.

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 511 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-10  9:20         ` Felipe Balbi
@ 2021-06-10 10:03           ` Greg KH
  2021-06-10 10:16             ` Felipe Balbi
  2021-06-10 18:15           ` Wesley Cheng
  1 sibling, 1 reply; 31+ messages in thread
From: Greg KH @ 2021-06-10 10:03 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Wesley Cheng, agross, bjorn.andersson, robh+dt, linux-usb,
	linux-kernel, linux-arm-msm, devicetree, jackp, Thinh.Nguyen

On Thu, Jun 10, 2021 at 12:20:00PM +0300, Felipe Balbi wrote:
> > As mentioned above, these changes are currently present on end user
> > devices for the past few years, so its been through a lot of testing :).
> 
> all with the same gadget driver. Also, who uses USB on android devices
> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
> :-)

I used to think that too, until I saw some of the new crazy designs
where lots of SoC connections internally to the device run on USB.  Also
look at the USB offload stuff as one example of how the voice sound path
goes through the USB controller on the SoC on the latest Samsung Galaxy
phones that are shipping now :(

There's also devices with the modem/network connection going over USB,
along with other device types as well.  Android Auto is crazy with
almost everything hooked up directly with a USB connection to the host
system running Linux.

> I guess only developers are using USB during development to flash dev
> images heh.

I wish, we are reaching the point where the stability of the overall
Android system depends on how well the USB controller works.  We are a
product of our success...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-10 10:03           ` Greg KH
@ 2021-06-10 10:16             ` Felipe Balbi
  0 siblings, 0 replies; 31+ messages in thread
From: Felipe Balbi @ 2021-06-10 10:16 UTC (permalink / raw)
  To: Greg KH
  Cc: Wesley Cheng, agross, bjorn.andersson, robh+dt, linux-usb,
	linux-kernel, linux-arm-msm, devicetree, jackp, Thinh.Nguyen

[-- Attachment #1: Type: text/plain, Size: 1829 bytes --]


Hi,

Greg KH <gregkh@linuxfoundation.org> writes:
> On Thu, Jun 10, 2021 at 12:20:00PM +0300, Felipe Balbi wrote:
>> > As mentioned above, these changes are currently present on end user
>> > devices for the past few years, so its been through a lot of testing :).
>> 
>> all with the same gadget driver. Also, who uses USB on android devices
>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
>> :-)
>
> I used to think that too, until I saw some of the new crazy designs
> where lots of SoC connections internally to the device run on USB.  Also
> look at the USB offload stuff as one example of how the voice sound path
> goes through the USB controller on the SoC on the latest Samsung Galaxy
> phones that are shipping now :(

yeah, that's one reason NOT to touch the FIFO sizes :-) OMAP5 has, as
mentioned before, processor trace offload via USB too. If we modify the
FIFO configuration set by the HW designer we risk loosing those features.

> There's also devices with the modem/network connection going over USB,
> along with other device types as well.  Android Auto is crazy with

yeah, and there will be more coming. USB Debug class is already
integrated in some SoCs, that gives us jtag-like access over USB.

> almost everything hooked up directly with a USB connection to the host
> system running Linux.

that's running against USB host, though, right? Android is the host, not
the gadget :-)

The FIFO sizes here are for the gadget side.

>> I guess only developers are using USB during development to flash dev
>> images heh.
>
> I wish, we are reaching the point where the stability of the overall
> Android system depends on how well the USB controller works.  We are a
> product of our success...
>
> thanks,
>
> greg k-h

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 511 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-10  9:20         ` Felipe Balbi
  2021-06-10 10:03           ` Greg KH
@ 2021-06-10 18:15           ` Wesley Cheng
  2021-06-11  6:29             ` Felipe Balbi
  1 sibling, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-10 18:15 UTC (permalink / raw)
  To: Felipe Balbi, Greg KH
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

Hi Felipe,

On 6/10/2021 2:20 AM, Felipe Balbi wrote:
> 
> Hi,
> 
> Wesley Cheng <wcheng@codeaurora.org> writes:
> 
>> Hi Greg/Felipe,
>>
>> On 6/4/2021 7:36 AM, Greg KH wrote:
>>> On Fri, Jun 04, 2021 at 05:18:16PM +0300, Felipe Balbi wrote:
>>>>
>>>> Hi,
>>>>
>>>> Greg KH <gregkh@linuxfoundation.org> writes:
>>>>> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>>>>>> Changes in V9:
>>>>>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>>>>>    add the property by default from the kernel.
>>>>>
>>>>> This patch series has one build failure and one warning added:
>>>>>
>>>>> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
>>>>> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>>>>>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>>>>>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>>>>>       |                                             |
>>>>>       |                                             u32 {aka unsigned int}
>>>>> In file included from drivers/usb/dwc3/debug.h:14,
>>>>>                  from drivers/usb/dwc3/gadget.c:25:
>>>>> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>>>>>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>>>>>       |                                ~~~~~~~~~~~~~^~~
>>>>>
>>>>>
>>>>> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
>>>>> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>>>>>   660 |                 ret = of_add_property(dwc3_np, prop);
>>>>>       |                       ^~~~~~~~~~~~~~~
>>>>>       |                       of_get_property
>>>>>
>>>>>
>>>>> How did you test these?
>>
>> I ran these changes on our internal branches, which were probably
>> missing some of the recent changes done to the DWC3 drivers.  Will fix
>> the above compile errors and re-submit.
>>
>> In regards to how much these changes have been tested, we've been
>> maintaining the TX FIFO resize logic downstream for a few years already,
>> so its being used in end products.  We also verify this with our
>> internal testing, which has certain benchmarks we need to meet.
> 
> the problem with that is that you *know* which gadget is running
> there. You know everyone of those is going to run the android
> gadget. In a sense, all those multiple products are testing the same
> exact use case :-)
> 

Mmmm, the USB gadget has changed from since we've implemented it, such
as going from Android gadget to Configfs.  Don't forget, we do have
other business segments that use this feature in other configurations as
well :).

>>>> to be honest, I don't think these should go in (apart from the build
>>>> failure) because it's likely to break instantiations of the core with
>>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>>> dedicated functionality that requires the default FIFO size configured
>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>>> implementations which have dedicated endpoints for processor tracing.
>>>>
>>>> With OMAP5, these endpoints are configured at the top of the available
>>>> endpoints, which means that if a gadget driver gets loaded and takes
>>>> over most of the FIFO space because of this resizing, processor tracing
>>>> will have a hard time running. That being said, processor tracing isn't
>>>> supported in upstream at this moment.
>>>>
>>
>> I agree that the application of this logic may differ between vendors,
>> hence why I wanted to keep this controllable by the DT property, so that
>> for those which do not support this use case can leave it disabled.  The
>> logic is there to ensure that for a given USB configuration, for each EP
>> it would have at least 1 TX FIFO.  For USB configurations which don't
>> utilize all available IN EPs, it would allow re-allocation of internal
>> memory to EPs which will actually be in use.
> 
> The feature ends up being all-or-nothing, then :-) It sounds like we can
> be a little nicer in this regard.
> 

Don't get me wrong, I think once those features become available
upstream, we can improve the logic.  From what I remember when looking
at Andy Shevchenko's Github, the Intel tracer downstream changes were
just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
was the change which ended up upstream for the Intel tracer then we
could improve the logic to avoid re-sizing those particular EPs.
However, I'm not sure how the changes would look like in the end, so I
would like to wait later down the line to include that :).

>>>> I still think this may cause other places to break down. The promise the
>>>> databook makes is that increasing the FIFO size over 2x wMaxPacketSize
>>>> should bring little to no benefit, if we're not maintaining that, I
>>>> wonder if the problem is with some of the BUSCFG registers instead,
>>>> where we configure interconnect bursting and the like.
>>>
>>
>> I've been referring mainly to the DWC3 programming guide for
>> recommendations on how to improve USB performance in:
>> Section 3.3.5 System Bus Features to Improve USB Performance
> 
> dwc3 or dwc3.1? Either way, since I left Intel I don't have access to
> the databook anymore. I have to trust what you guys are telling me and,
> based on the description so far, I don't think we're doing the right
> thing (yet).
> 

Ah, I see.  DWC3.1 and DWC3 both have that USB performance section.  I
can explain some of the points I made with a bit more detail.  I thought
you still had access to it.

> It would be nice if other users would test this patchset with different
> gadget drivers and different platforms to have some confidence that
> we're limiting possible regressions.
> 
> I would like for Thinh to comment from Synopsys side here.
> 
>> At least when I ran the initial profiling, adjusting the RX/TX
>> thresholds brought little to no benefits.  Even in some of the examples,
> 
> right, the FIFO sizes shouldn't help much. At least that's what Paul
> told me several years ago. Thinh, has the recommendation changed?
> 

So when I mention the RX/TX thresholds, this is different than the FIFO
resize.  The RX/TX threshold is used by the controller to determine when
to send or receive data based on the number of available FIFOs.  So for
the TX case, if we set the TX threshold, the controller will not start
transmitting data over the link after X amount of packets are copied to
the TXFIFO.  So for example, a TXFIFO size of 6 w/ a TX threshold of 3,
means that the controller will wait for 3 FIFO slots to be filled before
it sends the data.  So as you can see, with our configuration of TX FIFO
size of 2 and TX threshold of 1, this would really be not beneficial to
us, because we can only change the TX threshold to 2 at max, and at
least in my observations, once we have to go out to system memory to
fetch the next data packet, that latency takes enough time for the
controller to end the current burst.

>> they have diagrams showing a TXFIFO size of 6 max packets (Figure 3-5).
>>  I think its difficult to say that the TX fifo resizing won't help in
>> systems with limited, or shared resources where the bus latencies would
>> be somewhat larger.  By adjusting the TX FIFO size, the controller would
>> be able to fetch more data from system memory into the memory within the
>> controller, leading to less frequent end of bursts, etc... as data is
>> readily available.
>>
>> In terms of adjusting the AXI/AHB bursting, I would think the bandwidth
>> increase would eventually be constrained based on your system's design.
>>  We don't touch the GSBUSCFG registers, and leave them as is based off
>> the recommendations from the HW designers.
> 
> Right, I want to touch those as little as possible too :-) However, to
> illustrate, the only reason I implemented FIFO resizing was because
> OMAP5 ES1 had TX FIFOs that were smaller than a full USB3 packet. HW
> Designer's recommendation can be bogus too ;-)
> 

Haha...true, we question their designs only when there's something
clearly wrong, but the AXI/AHB settings look good.  :)

>>> Good points.
>>>
>>> Wesley, what kind of testing have you done on this on different devices?
>>>
>>
>> As mentioned above, these changes are currently present on end user
>> devices for the past few years, so its been through a lot of testing :).
> 
> all with the same gadget driver. Also, who uses USB on android devices
> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
> :-)
> 
> I guess only developers are using USB during development to flash dev
> images heh.
> 

I used to be a customer facing engineer, so honestly I did see some
really interesting and crazy designs.  Again, we do have non-Android
products that use the same code, and it has been working in there for a
few years as well.  The TXFIFO sizing really has helped with multimedia
use cases, which use isoc endpoints, since esp. in those lower end CPU
chips where latencies across the system are much larger, and a missed
ISOC interval leads to a pop in your ear.

Thanks
Wesley Cheng

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-10 18:15           ` Wesley Cheng
@ 2021-06-11  6:29             ` Felipe Balbi
  2021-06-11  8:43               ` Wesley Cheng
  0 siblings, 1 reply; 31+ messages in thread
From: Felipe Balbi @ 2021-06-11  6:29 UTC (permalink / raw)
  To: Wesley Cheng, Greg KH
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

[-- Attachment #1: Type: text/plain, Size: 11546 bytes --]


Hi,

Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>> Greg KH <gregkh@linuxfoundation.org> writes:
>>>>>> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>>>>>>> Changes in V9:
>>>>>>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>>>>>>    add the property by default from the kernel.
>>>>>>
>>>>>> This patch series has one build failure and one warning added:
>>>>>>
>>>>>> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
>>>>>> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>>>>>>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>>>>>>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>>>>>>       |                                             |
>>>>>>       |                                             u32 {aka unsigned int}
>>>>>> In file included from drivers/usb/dwc3/debug.h:14,
>>>>>>                  from drivers/usb/dwc3/gadget.c:25:
>>>>>> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>>>>>>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>>>>>>       |                                ~~~~~~~~~~~~~^~~
>>>>>>
>>>>>>
>>>>>> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
>>>>>> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>>>>>>   660 |                 ret = of_add_property(dwc3_np, prop);
>>>>>>       |                       ^~~~~~~~~~~~~~~
>>>>>>       |                       of_get_property
>>>>>>
>>>>>>
>>>>>> How did you test these?
>>>
>>> I ran these changes on our internal branches, which were probably
>>> missing some of the recent changes done to the DWC3 drivers.  Will fix
>>> the above compile errors and re-submit.
>>>
>>> In regards to how much these changes have been tested, we've been
>>> maintaining the TX FIFO resize logic downstream for a few years already,
>>> so its being used in end products.  We also verify this with our
>>> internal testing, which has certain benchmarks we need to meet.
>> 
>> the problem with that is that you *know* which gadget is running
>> there. You know everyone of those is going to run the android
>> gadget. In a sense, all those multiple products are testing the same
>> exact use case :-)
>> 
>
> Mmmm, the USB gadget has changed from since we've implemented it, such
> as going from Android gadget to Configfs.  Don't forget, we do have
> other business segments that use this feature in other configurations as
> well :).

:)

>>>>> to be honest, I don't think these should go in (apart from the build
>>>>> failure) because it's likely to break instantiations of the core with
>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>>>> dedicated functionality that requires the default FIFO size configured
>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>>>> implementations which have dedicated endpoints for processor tracing.
>>>>>
>>>>> With OMAP5, these endpoints are configured at the top of the available
>>>>> endpoints, which means that if a gadget driver gets loaded and takes
>>>>> over most of the FIFO space because of this resizing, processor tracing
>>>>> will have a hard time running. That being said, processor tracing isn't
>>>>> supported in upstream at this moment.
>>>>>
>>>
>>> I agree that the application of this logic may differ between vendors,
>>> hence why I wanted to keep this controllable by the DT property, so that
>>> for those which do not support this use case can leave it disabled.  The
>>> logic is there to ensure that for a given USB configuration, for each EP
>>> it would have at least 1 TX FIFO.  For USB configurations which don't
>>> utilize all available IN EPs, it would allow re-allocation of internal
>>> memory to EPs which will actually be in use.
>> 
>> The feature ends up being all-or-nothing, then :-) It sounds like we can
>> be a little nicer in this regard.
>> 
>
> Don't get me wrong, I think once those features become available
> upstream, we can improve the logic.  From what I remember when looking

sure, I support that. But I want to make sure the first cut isn't likely
to break things left and right :)

Hence, let's at least get more testing.

> at Andy Shevchenko's Github, the Intel tracer downstream changes were
> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that

right, that's the reason why we introduced the endpoint feature
flags. The end goal was that the UDC would be able to have custom
feature flags paired with ->validate_endpoint() or whatever before
allowing it to be enabled. Then the UDC driver could tell UDC core to
skip that endpoint on that particular platform without interefering with
everything else.

Of course, we still need to figure out a way to abstract the different
dwc3 instantiations.

> was the change which ended up upstream for the Intel tracer then we
> could improve the logic to avoid re-sizing those particular EPs.

The problem then, just as I mentioned in the previous paragraph, will be
coming up with a solution that's elegant and works for all different
instantiations of dwc3 (or musb, cdns3, etc).

> However, I'm not sure how the changes would look like in the end, so I
> would like to wait later down the line to include that :).

Fair enough, I agree. Can we get some more testing of $subject, though?
Did you test $subject with upstream too? Which gadget drivers did you
use? How did you test

>>>>> I still think this may cause other places to break down. The promise the
>>>>> databook makes is that increasing the FIFO size over 2x wMaxPacketSize
>>>>> should bring little to no benefit, if we're not maintaining that, I
>>>>> wonder if the problem is with some of the BUSCFG registers instead,
>>>>> where we configure interconnect bursting and the like.
>>>>
>>>
>>> I've been referring mainly to the DWC3 programming guide for
>>> recommendations on how to improve USB performance in:
>>> Section 3.3.5 System Bus Features to Improve USB Performance
>> 
>> dwc3 or dwc3.1? Either way, since I left Intel I don't have access to
>> the databook anymore. I have to trust what you guys are telling me and,
>> based on the description so far, I don't think we're doing the right
>> thing (yet).
>> 
>
> Ah, I see.  DWC3.1 and DWC3 both have that USB performance section.  I
> can explain some of the points I made with a bit more detail.  I thought
> you still had access to it.

I wish :)

If Synopsys wants to give me access for the databook, I would not mind :-)

>> It would be nice if other users would test this patchset with different
>> gadget drivers and different platforms to have some confidence that
>> we're limiting possible regressions.
>> 
>> I would like for Thinh to comment from Synopsys side here.
>> 
>>> At least when I ran the initial profiling, adjusting the RX/TX
>>> thresholds brought little to no benefits.  Even in some of the examples,
>> 
>> right, the FIFO sizes shouldn't help much. At least that's what Paul
>> told me several years ago. Thinh, has the recommendation changed?
>> 
>
> So when I mention the RX/TX thresholds, this is different than the FIFO
> resize.  The RX/TX threshold is used by the controller to determine when
> to send or receive data based on the number of available FIFOs.  So for

oh right, I remember now :-

> the TX case, if we set the TX threshold, the controller will not start
> transmitting data over the link after X amount of packets are copied to
> the TXFIFO.  So for example, a TXFIFO size of 6 w/ a TX threshold of 3,
> means that the controller will wait for 3 FIFO slots to be filled before
> it sends the data.  So as you can see, with our configuration of TX FIFO

yeah, makes sense.

> size of 2 and TX threshold of 1, this would really be not beneficial to
> us, because we can only change the TX threshold to 2 at max, and at
> least in my observations, once we have to go out to system memory to
> fetch the next data packet, that latency takes enough time for the
> controller to end the current burst.

What I noticed with g_mass_storage is that we can amortize the cost of
fetching data from memory, with a deeper request queue. Whenever I
test(ed) g_mass_storage, I was doing so with 250 requests. And that was
enough to give me very good performance. Never had to poke at TX FIFO
resizing. Did you try something like this too?

I feel that allocating more requests is a far simpler and more generic
method that changing FIFO sizes :)

>>> they have diagrams showing a TXFIFO size of 6 max packets (Figure 3-5).
>>>  I think its difficult to say that the TX fifo resizing won't help in
>>> systems with limited, or shared resources where the bus latencies would
>>> be somewhat larger.  By adjusting the TX FIFO size, the controller would
>>> be able to fetch more data from system memory into the memory within the
>>> controller, leading to less frequent end of bursts, etc... as data is
>>> readily available.
>>>
>>> In terms of adjusting the AXI/AHB bursting, I would think the bandwidth
>>> increase would eventually be constrained based on your system's design.
>>>  We don't touch the GSBUSCFG registers, and leave them as is based off
>>> the recommendations from the HW designers.
>> 
>> Right, I want to touch those as little as possible too :-) However, to
>> illustrate, the only reason I implemented FIFO resizing was because
>> OMAP5 ES1 had TX FIFOs that were smaller than a full USB3 packet. HW
>> Designer's recommendation can be bogus too ;-)
>> 
>
> Haha...true, we question their designs only when there's something
> clearly wrong, but the AXI/AHB settings look good.  :)

:)

>>>> Good points.
>>>>
>>>> Wesley, what kind of testing have you done on this on different devices?
>>>>
>>>
>>> As mentioned above, these changes are currently present on end user
>>> devices for the past few years, so its been through a lot of testing :).
>> 
>> all with the same gadget driver. Also, who uses USB on android devices
>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
>> :-)
>> 
>> I guess only developers are using USB during development to flash dev
>> images heh.
>> 
>
> I used to be a customer facing engineer, so honestly I did see some
> really interesting and crazy designs.  Again, we do have non-Android
> products that use the same code, and it has been working in there for a
> few years as well.  The TXFIFO sizing really has helped with multimedia
> use cases, which use isoc endpoints, since esp. in those lower end CPU
> chips where latencies across the system are much larger, and a missed
> ISOC interval leads to a pop in your ear.

This is good background information. Thanks for bringing this
up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
knowing if a deeper request queue would also help here.

Remember dwc3 can accomodate 255 requests + link for each endpoint. If
our gadget driver uses a low number of requests, we're never really
using the TRB ring in our benefit.

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 511 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-11  6:29             ` Felipe Balbi
@ 2021-06-11  8:43               ` Wesley Cheng
  2021-06-11 13:00                 ` Felipe Balbi
  0 siblings, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-11  8:43 UTC (permalink / raw)
  To: Felipe Balbi, Greg KH
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen

Hi Felipe,

On 6/10/2021 11:29 PM, Felipe Balbi wrote:
> 
> Hi,
> 
> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>> Greg KH <gregkh@linuxfoundation.org> writes:
>>>>>>> On Wed, May 19, 2021 at 12:49:16AM -0700, Wesley Cheng wrote:
>>>>>>>> Changes in V9:
>>>>>>>>  - Fixed incorrect patch in series.  Removed changes in DTSI, as dwc3-qcom will
>>>>>>>>    add the property by default from the kernel.
>>>>>>>
>>>>>>> This patch series has one build failure and one warning added:
>>>>>>>
>>>>>>> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
>>>>>>> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>>>>>>>   653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
>>>>>>>       |                                ~~~~~~~~~~~~~^~~~~~~~~~
>>>>>>>       |                                             |
>>>>>>>       |                                             u32 {aka unsigned int}
>>>>>>> In file included from drivers/usb/dwc3/debug.h:14,
>>>>>>>                  from drivers/usb/dwc3/gadget.c:25:
>>>>>>> drivers/usb/dwc3/core.h:1493:45: note: expected ‘struct dwc3 *’ but argument is of type ‘u32’ {aka ‘unsigned int’}
>>>>>>>  1493 | static inline u32 dwc3_mdwidth(struct dwc3 *dwc)
>>>>>>>       |                                ~~~~~~~~~~~~~^~~
>>>>>>>
>>>>>>>
>>>>>>> drivers/usb/dwc3/dwc3-qcom.c: In function ‘dwc3_qcom_of_register_core’:
>>>>>>> drivers/usb/dwc3/dwc3-qcom.c:660:23: error: implicit declaration of function ‘of_add_property’; did you mean ‘of_get_property’? [-Werror=implicit-function-declaration]
>>>>>>>   660 |                 ret = of_add_property(dwc3_np, prop);
>>>>>>>       |                       ^~~~~~~~~~~~~~~
>>>>>>>       |                       of_get_property
>>>>>>>
>>>>>>>
>>>>>>> How did you test these?
>>>>
>>>> I ran these changes on our internal branches, which were probably
>>>> missing some of the recent changes done to the DWC3 drivers.  Will fix
>>>> the above compile errors and re-submit.
>>>>
>>>> In regards to how much these changes have been tested, we've been
>>>> maintaining the TX FIFO resize logic downstream for a few years already,
>>>> so its being used in end products.  We also verify this with our
>>>> internal testing, which has certain benchmarks we need to meet.
>>>
>>> the problem with that is that you *know* which gadget is running
>>> there. You know everyone of those is going to run the android
>>> gadget. In a sense, all those multiple products are testing the same
>>> exact use case :-)
>>>
>>
>> Mmmm, the USB gadget has changed from since we've implemented it, such
>> as going from Android gadget to Configfs.  Don't forget, we do have
>> other business segments that use this feature in other configurations as
>> well :).
> 
> :)
> 
>>>>>> to be honest, I don't think these should go in (apart from the build
>>>>>> failure) because it's likely to break instantiations of the core with
>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>>>>> dedicated functionality that requires the default FIFO size configured
>>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>>>>> implementations which have dedicated endpoints for processor tracing.
>>>>>>
>>>>>> With OMAP5, these endpoints are configured at the top of the available
>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
>>>>>> over most of the FIFO space because of this resizing, processor tracing
>>>>>> will have a hard time running. That being said, processor tracing isn't
>>>>>> supported in upstream at this moment.
>>>>>>
>>>>
>>>> I agree that the application of this logic may differ between vendors,
>>>> hence why I wanted to keep this controllable by the DT property, so that
>>>> for those which do not support this use case can leave it disabled.  The
>>>> logic is there to ensure that for a given USB configuration, for each EP
>>>> it would have at least 1 TX FIFO.  For USB configurations which don't
>>>> utilize all available IN EPs, it would allow re-allocation of internal
>>>> memory to EPs which will actually be in use.
>>>
>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
>>> be a little nicer in this regard.
>>>
>>
>> Don't get me wrong, I think once those features become available
>> upstream, we can improve the logic.  From what I remember when looking
> 
> sure, I support that. But I want to make sure the first cut isn't likely
> to break things left and right :)
> 
> Hence, let's at least get more testing.
> 

Sure, I'd hope that the other users of DWC3 will also see some pretty
big improvements on the TX path with this.

>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
>> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
> 
> right, that's the reason why we introduced the endpoint feature
> flags. The end goal was that the UDC would be able to have custom
> feature flags paired with ->validate_endpoint() or whatever before
> allowing it to be enabled. Then the UDC driver could tell UDC core to
> skip that endpoint on that particular platform without interefering with
> everything else.
> 
> Of course, we still need to figure out a way to abstract the different
> dwc3 instantiations.
> 
>> was the change which ended up upstream for the Intel tracer then we
>> could improve the logic to avoid re-sizing those particular EPs.
> 
> The problem then, just as I mentioned in the previous paragraph, will be
> coming up with a solution that's elegant and works for all different
> instantiations of dwc3 (or musb, cdns3, etc).
> 

Well, at least for the TX FIFO resizing logic, we'd only be needing to
focus on the DWC3 implementation.

You bring up another good topic that I'll eventually needing to be
taking a look at, which is a nice way we can handle vendor specific
endpoints and how they can co-exist with other "normal" endpoints.  We
have a few special HW eps as well, which we try to maintain separately
in our DWC3 vendor driver, but it isn't the most convenient, or most
pretty method :).

>> However, I'm not sure how the changes would look like in the end, so I
>> would like to wait later down the line to include that :).
> 
> Fair enough, I agree. Can we get some more testing of $subject, though?
> Did you test $subject with upstream too? Which gadget drivers did you
> use? How did you test
> 

The results that I included in the cover page was tested with the pure
upstream kernel on our device.  Below was using the ConfigFS gadget w/ a
mass storage only composition.

Test Parameters:
 - Platform: Qualcomm SM8150
 - bMaxBurst = 6
 - USB req size = 256kB
 - Num of USB reqs = 16
 - USB Speed = Super-Speed
 - Function Driver: Mass Storage (w/ ramdisk)
 - Test Application: CrystalDiskMark

Results:

TXFIFO Depth = 3 max packets

Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x     |
Read      |9 loops    | 193.60
	  |           | 195.86
          |           | 184.77
          |           | 193.60
-------------------------------------------

TXFIFO Depth = 6 max packets

Test Case | Data Size | AVG tput (in MB/s)
-------------------------------------------
Sequential|1 GB x     |
Read      |9 loops    | 287.35
	  |           | 304.94
          |           | 289.64
          |           | 293.61
-------------------------------------------

We also have internal numbers which have shown similar improvements as
well.  Those are over networking/tethering interfaces, so testing IPERF
loopback over TCP/UDP.

>>>>>> I still think this may cause other places to break down. The promise the
>>>>>> databook makes is that increasing the FIFO size over 2x wMaxPacketSize
>>>>>> should bring little to no benefit, if we're not maintaining that, I
>>>>>> wonder if the problem is with some of the BUSCFG registers instead,
>>>>>> where we configure interconnect bursting and the like.
>>>>>
>>>>
>>>> I've been referring mainly to the DWC3 programming guide for
>>>> recommendations on how to improve USB performance in:
>>>> Section 3.3.5 System Bus Features to Improve USB Performance
>>>
>>> dwc3 or dwc3.1? Either way, since I left Intel I don't have access to
>>> the databook anymore. I have to trust what you guys are telling me and,
>>> based on the description so far, I don't think we're doing the right
>>> thing (yet).
>>>
>>
>> Ah, I see.  DWC3.1 and DWC3 both have that USB performance section.  I
>> can explain some of the points I made with a bit more detail.  I thought
>> you still had access to it.
> 
> I wish :)
> 
> If Synopsys wants to give me access for the databook, I would not mind :-)
> 
>>> It would be nice if other users would test this patchset with different
>>> gadget drivers and different platforms to have some confidence that
>>> we're limiting possible regressions.
>>>
>>> I would like for Thinh to comment from Synopsys side here.
>>>
>>>> At least when I ran the initial profiling, adjusting the RX/TX
>>>> thresholds brought little to no benefits.  Even in some of the examples,
>>>
>>> right, the FIFO sizes shouldn't help much. At least that's what Paul
>>> told me several years ago. Thinh, has the recommendation changed?
>>>
>>
>> So when I mention the RX/TX thresholds, this is different than the FIFO
>> resize.  The RX/TX threshold is used by the controller to determine when
>> to send or receive data based on the number of available FIFOs.  So for
> 
> oh right, I remember now :-
> 
>> the TX case, if we set the TX threshold, the controller will not start
>> transmitting data over the link after X amount of packets are copied to
>> the TXFIFO.  So for example, a TXFIFO size of 6 w/ a TX threshold of 3,
>> means that the controller will wait for 3 FIFO slots to be filled before
>> it sends the data.  So as you can see, with our configuration of TX FIFO
> 
> yeah, makes sense.
> 
>> size of 2 and TX threshold of 1, this would really be not beneficial to
>> us, because we can only change the TX threshold to 2 at max, and at
>> least in my observations, once we have to go out to system memory to
>> fetch the next data packet, that latency takes enough time for the
>> controller to end the current burst.
> 
> What I noticed with g_mass_storage is that we can amortize the cost of
> fetching data from memory, with a deeper request queue. Whenever I
> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
> enough to give me very good performance. Never had to poke at TX FIFO
> resizing. Did you try something like this too?
> 
> I feel that allocating more requests is a far simpler and more generic
> method that changing FIFO sizes :)
> 

I wish I had a USB bus trace handy to show you, which would make it very
clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
by increasing the number of USB requests, that will help if there was a
bottleneck at the SW level where the application/function driver
utilizing the DWC3 was submitting data much faster than the HW was
processing them.

So yes, this method of increasing the # of USB reqs will definitely help
with situations such as HSUSB or in SSUSB when EP bursting isn't used.
The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
bursting.

Now with endpoint bursting, if the function notifies the host that
bursting is supported, when the host sends the ACK for the Data Packet,
it should have a NumP value equal to the bMaxBurst reported in the EP
desc.  If we have a TXFIFO size of 2, then normally what I have seen is
that after 2 data packets, the device issues a NRDY.  So then we'd need
to send an ERDY once data is available within the FIFO, and the same
sequence happens until the USB request is complete.  With this constant
NRDY/ERDY handshake going on, you actually see that the bus is under
utilized.  When we increase an EP's FIFO size, then you'll see constant
bursts for a request, until the request is done, or if the host runs out
of RXFIFO. (ie no interruption [on the USB protocol level] during USB
request data transfer)

>>>> they have diagrams showing a TXFIFO size of 6 max packets (Figure 3-5).
>>>>  I think its difficult to say that the TX fifo resizing won't help in
>>>> systems with limited, or shared resources where the bus latencies would
>>>> be somewhat larger.  By adjusting the TX FIFO size, the controller would
>>>> be able to fetch more data from system memory into the memory within the
>>>> controller, leading to less frequent end of bursts, etc... as data is
>>>> readily available.
>>>>
>>>> In terms of adjusting the AXI/AHB bursting, I would think the bandwidth
>>>> increase would eventually be constrained based on your system's design.
>>>>  We don't touch the GSBUSCFG registers, and leave them as is based off
>>>> the recommendations from the HW designers.
>>>
>>> Right, I want to touch those as little as possible too :-) However, to
>>> illustrate, the only reason I implemented FIFO resizing was because
>>> OMAP5 ES1 had TX FIFOs that were smaller than a full USB3 packet. HW
>>> Designer's recommendation can be bogus too ;-)
>>>
>>
>> Haha...true, we question their designs only when there's something
>> clearly wrong, but the AXI/AHB settings look good.  :)
> 
> :)
> 
>>>>> Good points.
>>>>>
>>>>> Wesley, what kind of testing have you done on this on different devices?
>>>>>
>>>>
>>>> As mentioned above, these changes are currently present on end user
>>>> devices for the past few years, so its been through a lot of testing :).
>>>
>>> all with the same gadget driver. Also, who uses USB on android devices
>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
>>> :-)
>>>
>>> I guess only developers are using USB during development to flash dev
>>> images heh.
>>>
>>
>> I used to be a customer facing engineer, so honestly I did see some
>> really interesting and crazy designs.  Again, we do have non-Android
>> products that use the same code, and it has been working in there for a
>> few years as well.  The TXFIFO sizing really has helped with multimedia
>> use cases, which use isoc endpoints, since esp. in those lower end CPU
>> chips where latencies across the system are much larger, and a missed
>> ISOC interval leads to a pop in your ear.
> 
> This is good background information. Thanks for bringing this
> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
> knowing if a deeper request queue would also help here.
> 
> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
> our gadget driver uses a low number of requests, we're never really
> using the TRB ring in our benefit.
> 

We're actually using both a deeper USB request queue + TX fifo resizing. :).

Thanks
Wesley Cheng

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-11  8:43               ` Wesley Cheng
@ 2021-06-11 13:00                 ` Felipe Balbi
  2021-06-11 13:14                   ` Heikki Krogerus
  2021-07-01  1:08                   ` Wesley Cheng
  0 siblings, 2 replies; 31+ messages in thread
From: Felipe Balbi @ 2021-06-11 13:00 UTC (permalink / raw)
  To: Wesley Cheng, Greg KH, Heikki Krogerus
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen, John Youn

[-- Attachment #1: Type: text/plain, Size: 12194 bytes --]


Hi,

Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>> to be honest, I don't think these should go in (apart from the build
>>>>>>> failure) because it's likely to break instantiations of the core with
>>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>>>>>> dedicated functionality that requires the default FIFO size configured
>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>>>>>> implementations which have dedicated endpoints for processor tracing.
>>>>>>>
>>>>>>> With OMAP5, these endpoints are configured at the top of the available
>>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
>>>>>>> over most of the FIFO space because of this resizing, processor tracing
>>>>>>> will have a hard time running. That being said, processor tracing isn't
>>>>>>> supported in upstream at this moment.
>>>>>>>
>>>>>
>>>>> I agree that the application of this logic may differ between vendors,
>>>>> hence why I wanted to keep this controllable by the DT property, so that
>>>>> for those which do not support this use case can leave it disabled.  The
>>>>> logic is there to ensure that for a given USB configuration, for each EP
>>>>> it would have at least 1 TX FIFO.  For USB configurations which don't
>>>>> utilize all available IN EPs, it would allow re-allocation of internal
>>>>> memory to EPs which will actually be in use.
>>>>
>>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
>>>> be a little nicer in this regard.
>>>>
>>>
>>> Don't get me wrong, I think once those features become available
>>> upstream, we can improve the logic.  From what I remember when looking
>> 
>> sure, I support that. But I want to make sure the first cut isn't likely
>> to break things left and right :)
>> 
>> Hence, let's at least get more testing.
>> 
>
> Sure, I'd hope that the other users of DWC3 will also see some pretty
> big improvements on the TX path with this.

fingers crossed

>>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
>> 
>> right, that's the reason why we introduced the endpoint feature
>> flags. The end goal was that the UDC would be able to have custom
>> feature flags paired with ->validate_endpoint() or whatever before
>> allowing it to be enabled. Then the UDC driver could tell UDC core to
>> skip that endpoint on that particular platform without interefering with
>> everything else.
>> 
>> Of course, we still need to figure out a way to abstract the different
>> dwc3 instantiations.
>> 
>>> was the change which ended up upstream for the Intel tracer then we
>>> could improve the logic to avoid re-sizing those particular EPs.
>> 
>> The problem then, just as I mentioned in the previous paragraph, will be
>> coming up with a solution that's elegant and works for all different
>> instantiations of dwc3 (or musb, cdns3, etc).
>> 
>
> Well, at least for the TX FIFO resizing logic, we'd only be needing to
> focus on the DWC3 implementation.
>
> You bring up another good topic that I'll eventually needing to be
> taking a look at, which is a nice way we can handle vendor specific
> endpoints and how they can co-exist with other "normal" endpoints.  We
> have a few special HW eps as well, which we try to maintain separately
> in our DWC3 vendor driver, but it isn't the most convenient, or most
> pretty method :).

Awesome, as mentioned, the endpoint feature flags were added exactly to
allow for these vendor-specific features :-)

I'm more than happy to help testing now that I finally got our SM8150
Surface Duo device tree accepted by Bjorn ;-)

>>> However, I'm not sure how the changes would look like in the end, so I
>>> would like to wait later down the line to include that :).
>> 
>> Fair enough, I agree. Can we get some more testing of $subject, though?
>> Did you test $subject with upstream too? Which gadget drivers did you
>> use? How did you test
>> 
>
> The results that I included in the cover page was tested with the pure
> upstream kernel on our device.  Below was using the ConfigFS gadget w/ a
> mass storage only composition.
>
> Test Parameters:
>  - Platform: Qualcomm SM8150
>  - bMaxBurst = 6
>  - USB req size = 256kB
>  - Num of USB reqs = 16

do you mind testing with the regular request size (16KiB) and 250
requests? I think we can even do 15 bursts in that case.

>  - USB Speed = Super-Speed
>  - Function Driver: Mass Storage (w/ ramdisk)
>  - Test Application: CrystalDiskMark
>
> Results:
>
> TXFIFO Depth = 3 max packets
>
> Test Case | Data Size | AVG tput (in MB/s)
> -------------------------------------------
> Sequential|1 GB x     |
> Read      |9 loops    | 193.60
>           |           | 195.86
>           |           | 184.77
>           |           | 193.60
> -------------------------------------------
>
> TXFIFO Depth = 6 max packets
>
> Test Case | Data Size | AVG tput (in MB/s)
> -------------------------------------------
> Sequential|1 GB x     |
> Read      |9 loops    | 287.35
> 	    |           | 304.94
>           |           | 289.64
>           |           | 293.61

I remember getting close to 400MiB/sec with Intel platforms without
resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
memory could be failing.

Then again, I never ran with CrystalDiskMark, I was using my own tool
(it's somewhere in github. If you care, I can look up the URL).

> We also have internal numbers which have shown similar improvements as
> well.  Those are over networking/tethering interfaces, so testing IPERF
> loopback over TCP/UDP.

loopback iperf? That would skip the wire, no?

>>> size of 2 and TX threshold of 1, this would really be not beneficial to
>>> us, because we can only change the TX threshold to 2 at max, and at
>>> least in my observations, once we have to go out to system memory to
>>> fetch the next data packet, that latency takes enough time for the
>>> controller to end the current burst.
>> 
>> What I noticed with g_mass_storage is that we can amortize the cost of
>> fetching data from memory, with a deeper request queue. Whenever I
>> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
>> enough to give me very good performance. Never had to poke at TX FIFO
>> resizing. Did you try something like this too?
>> 
>> I feel that allocating more requests is a far simpler and more generic
>> method that changing FIFO sizes :)
>> 
>
> I wish I had a USB bus trace handy to show you, which would make it very
> clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
> by increasing the number of USB requests, that will help if there was a
> bottleneck at the SW level where the application/function driver
> utilizing the DWC3 was submitting data much faster than the HW was
> processing them.
>
> So yes, this method of increasing the # of USB reqs will definitely help
> with situations such as HSUSB or in SSUSB when EP bursting isn't used.
> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
> bursting.

Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
here too. I have clear memories of testing this very scenario of
bursting (using g_mass_storage at the time) because I was curious about
it. Back then, my tests showed no difference in behavior.

It could be nice if Heikki could test Intel parts with and without your
changes on g_mass_storage with 250 requests.

> Now with endpoint bursting, if the function notifies the host that
> bursting is supported, when the host sends the ACK for the Data Packet,
> it should have a NumP value equal to the bMaxBurst reported in the EP

Yes and no. Looking back at the history, we used to configure NUMP based
on bMaxBurst, but it was changed later in commit
4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
problem reported by John Youn.

And now we've come full circle. Because even if I believe more requests
are enough for bursting, NUMP is limited by the RxFIFO size. This ends
up supporting your claim that we need RxFIFO resizing if we want to
squeeze more throughput out of the controller.

However, note that this is about RxFIFO size, not TxFIFO size. In fact,
looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
(emphasis is mine):

	"Number of Packets (NumP). This field is used to indicate the
	number of Data Packet buffers that the **receiver** can
	accept. The value in this field shall be less than or equal to
	the maximum burst size supported by the endpoint as determined
	by the value in the bMaxBurst field in the Endpoint Companion
	Descriptor (refer to Section 9.6.7)."

So, NumP is for the receiver, not the transmitter. Could you clarify
what you mean here?

/me keeps reading

Hmm, table 8-15 tries to clarify:

	"Number of Packets (NumP).

	For an OUT endpoint, refer to Table 8-13 for the description of
	this field.

	For an IN endpoint this field is set by the endpoint to the
	number of packets it can transmit when the host resumes
	transactions to it. This field shall not have a value greater
	than the maximum burst size supported by the endpoint as
	indicated by the value in the bMaxBurst field in the Endpoint
	Companion Descriptor. Note that the value reported in this field
	may be treated by the host as informative only."

However, if I remember correctly (please verify dwc3 databook), NUMP in
DCFG was only for receive buffers. Thin, John, how does dwc3 compute
NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
TxFIFO size?

> desc.  If we have a TXFIFO size of 2, then normally what I have seen is
> that after 2 data packets, the device issues a NRDY.  So then we'd need
> to send an ERDY once data is available within the FIFO, and the same
> sequence happens until the USB request is complete.  With this constant
> NRDY/ERDY handshake going on, you actually see that the bus is under
> utilized.  When we increase an EP's FIFO size, then you'll see constant
> bursts for a request, until the request is done, or if the host runs out
> of RXFIFO. (ie no interruption [on the USB protocol level] during USB
> request data transfer)

Unfortunately I don't have access to a USB sniffer anymore :-(

>>>>>> Good points.
>>>>>>
>>>>>> Wesley, what kind of testing have you done on this on different devices?
>>>>>>
>>>>>
>>>>> As mentioned above, these changes are currently present on end user
>>>>> devices for the past few years, so its been through a lot of testing :).
>>>>
>>>> all with the same gadget driver. Also, who uses USB on android devices
>>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
>>>> :-)
>>>>
>>>> I guess only developers are using USB during development to flash dev
>>>> images heh.
>>>>
>>>
>>> I used to be a customer facing engineer, so honestly I did see some
>>> really interesting and crazy designs.  Again, we do have non-Android
>>> products that use the same code, and it has been working in there for a
>>> few years as well.  The TXFIFO sizing really has helped with multimedia
>>> use cases, which use isoc endpoints, since esp. in those lower end CPU
>>> chips where latencies across the system are much larger, and a missed
>>> ISOC interval leads to a pop in your ear.
>> 
>> This is good background information. Thanks for bringing this
>> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
>> knowing if a deeper request queue would also help here.
>> 
>> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
>> our gadget driver uses a low number of requests, we're never really
>> using the TRB ring in our benefit.
>> 
>
> We're actually using both a deeper USB request queue + TX fifo resizing. :).

okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
behavior.

-- 
balbi

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 511 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-11 13:00                 ` Felipe Balbi
@ 2021-06-11 13:14                   ` Heikki Krogerus
  2021-06-11 13:21                     ` Andy Shevchenko
  2021-07-01  1:08                   ` Wesley Cheng
  1 sibling, 1 reply; 31+ messages in thread
From: Heikki Krogerus @ 2021-06-11 13:14 UTC (permalink / raw)
  To: Felipe Balbi, Andy Shevchenko
  Cc: Wesley Cheng, Greg KH, agross, bjorn.andersson, robh+dt,
	linux-usb, linux-kernel, linux-arm-msm, devicetree, jackp,
	Thinh.Nguyen, John Youn

On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
> 
> Hi,
> 
> Wesley Cheng <wcheng@codeaurora.org> writes:
> >>>>>>> to be honest, I don't think these should go in (apart from the build
> >>>>>>> failure) because it's likely to break instantiations of the core with
> >>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
> >>>>>>> dedicated functionality that requires the default FIFO size configured
> >>>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
> >>>>>>> implementations which have dedicated endpoints for processor tracing.
> >>>>>>>
> >>>>>>> With OMAP5, these endpoints are configured at the top of the available
> >>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
> >>>>>>> over most of the FIFO space because of this resizing, processor tracing
> >>>>>>> will have a hard time running. That being said, processor tracing isn't
> >>>>>>> supported in upstream at this moment.
> >>>>>>>
> >>>>>
> >>>>> I agree that the application of this logic may differ between vendors,
> >>>>> hence why I wanted to keep this controllable by the DT property, so that
> >>>>> for those which do not support this use case can leave it disabled.  The
> >>>>> logic is there to ensure that for a given USB configuration, for each EP
> >>>>> it would have at least 1 TX FIFO.  For USB configurations which don't
> >>>>> utilize all available IN EPs, it would allow re-allocation of internal
> >>>>> memory to EPs which will actually be in use.
> >>>>
> >>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
> >>>> be a little nicer in this regard.
> >>>>
> >>>
> >>> Don't get me wrong, I think once those features become available
> >>> upstream, we can improve the logic.  From what I remember when looking
> >> 
> >> sure, I support that. But I want to make sure the first cut isn't likely
> >> to break things left and right :)
> >> 
> >> Hence, let's at least get more testing.
> >> 
> >
> > Sure, I'd hope that the other users of DWC3 will also see some pretty
> > big improvements on the TX path with this.
> 
> fingers crossed
> 
> >>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
> >>> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
> >> 
> >> right, that's the reason why we introduced the endpoint feature
> >> flags. The end goal was that the UDC would be able to have custom
> >> feature flags paired with ->validate_endpoint() or whatever before
> >> allowing it to be enabled. Then the UDC driver could tell UDC core to
> >> skip that endpoint on that particular platform without interefering with
> >> everything else.
> >> 
> >> Of course, we still need to figure out a way to abstract the different
> >> dwc3 instantiations.
> >> 
> >>> was the change which ended up upstream for the Intel tracer then we
> >>> could improve the logic to avoid re-sizing those particular EPs.
> >> 
> >> The problem then, just as I mentioned in the previous paragraph, will be
> >> coming up with a solution that's elegant and works for all different
> >> instantiations of dwc3 (or musb, cdns3, etc).
> >> 
> >
> > Well, at least for the TX FIFO resizing logic, we'd only be needing to
> > focus on the DWC3 implementation.
> >
> > You bring up another good topic that I'll eventually needing to be
> > taking a look at, which is a nice way we can handle vendor specific
> > endpoints and how they can co-exist with other "normal" endpoints.  We
> > have a few special HW eps as well, which we try to maintain separately
> > in our DWC3 vendor driver, but it isn't the most convenient, or most
> > pretty method :).
> 
> Awesome, as mentioned, the endpoint feature flags were added exactly to
> allow for these vendor-specific features :-)
> 
> I'm more than happy to help testing now that I finally got our SM8150
> Surface Duo device tree accepted by Bjorn ;-)
> 
> >>> However, I'm not sure how the changes would look like in the end, so I
> >>> would like to wait later down the line to include that :).
> >> 
> >> Fair enough, I agree. Can we get some more testing of $subject, though?
> >> Did you test $subject with upstream too? Which gadget drivers did you
> >> use? How did you test
> >> 
> >
> > The results that I included in the cover page was tested with the pure
> > upstream kernel on our device.  Below was using the ConfigFS gadget w/ a
> > mass storage only composition.
> >
> > Test Parameters:
> >  - Platform: Qualcomm SM8150
> >  - bMaxBurst = 6
> >  - USB req size = 256kB
> >  - Num of USB reqs = 16
> 
> do you mind testing with the regular request size (16KiB) and 250
> requests? I think we can even do 15 bursts in that case.
> 
> >  - USB Speed = Super-Speed
> >  - Function Driver: Mass Storage (w/ ramdisk)
> >  - Test Application: CrystalDiskMark
> >
> > Results:
> >
> > TXFIFO Depth = 3 max packets
> >
> > Test Case | Data Size | AVG tput (in MB/s)
> > -------------------------------------------
> > Sequential|1 GB x     |
> > Read      |9 loops    | 193.60
> >           |           | 195.86
> >           |           | 184.77
> >           |           | 193.60
> > -------------------------------------------
> >
> > TXFIFO Depth = 6 max packets
> >
> > Test Case | Data Size | AVG tput (in MB/s)
> > -------------------------------------------
> > Sequential|1 GB x     |
> > Read      |9 loops    | 287.35
> > 	    |           | 304.94
> >           |           | 289.64
> >           |           | 293.61
> 
> I remember getting close to 400MiB/sec with Intel platforms without
> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
> memory could be failing.
> 
> Then again, I never ran with CrystalDiskMark, I was using my own tool
> (it's somewhere in github. If you care, I can look up the URL).
> 
> > We also have internal numbers which have shown similar improvements as
> > well.  Those are over networking/tethering interfaces, so testing IPERF
> > loopback over TCP/UDP.
> 
> loopback iperf? That would skip the wire, no?
> 
> >>> size of 2 and TX threshold of 1, this would really be not beneficial to
> >>> us, because we can only change the TX threshold to 2 at max, and at
> >>> least in my observations, once we have to go out to system memory to
> >>> fetch the next data packet, that latency takes enough time for the
> >>> controller to end the current burst.
> >> 
> >> What I noticed with g_mass_storage is that we can amortize the cost of
> >> fetching data from memory, with a deeper request queue. Whenever I
> >> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
> >> enough to give me very good performance. Never had to poke at TX FIFO
> >> resizing. Did you try something like this too?
> >> 
> >> I feel that allocating more requests is a far simpler and more generic
> >> method that changing FIFO sizes :)
> >> 
> >
> > I wish I had a USB bus trace handy to show you, which would make it very
> > clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
> > by increasing the number of USB requests, that will help if there was a
> > bottleneck at the SW level where the application/function driver
> > utilizing the DWC3 was submitting data much faster than the HW was
> > processing them.
> >
> > So yes, this method of increasing the # of USB reqs will definitely help
> > with situations such as HSUSB or in SSUSB when EP bursting isn't used.
> > The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
> > bursting.
> 
> Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
> here too. I have clear memories of testing this very scenario of
> bursting (using g_mass_storage at the time) because I was curious about
> it. Back then, my tests showed no difference in behavior.
> 
> It could be nice if Heikki could test Intel parts with and without your
> changes on g_mass_storage with 250 requests.

Andy, you have a system at hand that has the DWC3 block enabled,
right? Can you help out here?

thanks,


> > Now with endpoint bursting, if the function notifies the host that
> > bursting is supported, when the host sends the ACK for the Data Packet,
> > it should have a NumP value equal to the bMaxBurst reported in the EP
> 
> Yes and no. Looking back at the history, we used to configure NUMP based
> on bMaxBurst, but it was changed later in commit
> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
> problem reported by John Youn.
> 
> And now we've come full circle. Because even if I believe more requests
> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
> up supporting your claim that we need RxFIFO resizing if we want to
> squeeze more throughput out of the controller.
> 
> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
> (emphasis is mine):
> 
> 	"Number of Packets (NumP). This field is used to indicate the
> 	number of Data Packet buffers that the **receiver** can
> 	accept. The value in this field shall be less than or equal to
> 	the maximum burst size supported by the endpoint as determined
> 	by the value in the bMaxBurst field in the Endpoint Companion
> 	Descriptor (refer to Section 9.6.7)."
> 
> So, NumP is for the receiver, not the transmitter. Could you clarify
> what you mean here?
> 
> /me keeps reading
> 
> Hmm, table 8-15 tries to clarify:
> 
> 	"Number of Packets (NumP).
> 
> 	For an OUT endpoint, refer to Table 8-13 for the description of
> 	this field.
> 
> 	For an IN endpoint this field is set by the endpoint to the
> 	number of packets it can transmit when the host resumes
> 	transactions to it. This field shall not have a value greater
> 	than the maximum burst size supported by the endpoint as
> 	indicated by the value in the bMaxBurst field in the Endpoint
> 	Companion Descriptor. Note that the value reported in this field
> 	may be treated by the host as informative only."
> 
> However, if I remember correctly (please verify dwc3 databook), NUMP in
> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
> NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
> TxFIFO size?
> 
> > desc.  If we have a TXFIFO size of 2, then normally what I have seen is
> > that after 2 data packets, the device issues a NRDY.  So then we'd need
> > to send an ERDY once data is available within the FIFO, and the same
> > sequence happens until the USB request is complete.  With this constant
> > NRDY/ERDY handshake going on, you actually see that the bus is under
> > utilized.  When we increase an EP's FIFO size, then you'll see constant
> > bursts for a request, until the request is done, or if the host runs out
> > of RXFIFO. (ie no interruption [on the USB protocol level] during USB
> > request data transfer)
> 
> Unfortunately I don't have access to a USB sniffer anymore :-(
> 
> >>>>>> Good points.
> >>>>>>
> >>>>>> Wesley, what kind of testing have you done on this on different devices?
> >>>>>>
> >>>>>
> >>>>> As mentioned above, these changes are currently present on end user
> >>>>> devices for the past few years, so its been through a lot of testing :).
> >>>>
> >>>> all with the same gadget driver. Also, who uses USB on android devices
> >>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
> >>>> :-)
> >>>>
> >>>> I guess only developers are using USB during development to flash dev
> >>>> images heh.
> >>>>
> >>>
> >>> I used to be a customer facing engineer, so honestly I did see some
> >>> really interesting and crazy designs.  Again, we do have non-Android
> >>> products that use the same code, and it has been working in there for a
> >>> few years as well.  The TXFIFO sizing really has helped with multimedia
> >>> use cases, which use isoc endpoints, since esp. in those lower end CPU
> >>> chips where latencies across the system are much larger, and a missed
> >>> ISOC interval leads to a pop in your ear.
> >> 
> >> This is good background information. Thanks for bringing this
> >> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
> >> knowing if a deeper request queue would also help here.
> >> 
> >> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
> >> our gadget driver uses a low number of requests, we're never really
> >> using the TRB ring in our benefit.
> >> 
> >
> > We're actually using both a deeper USB request queue + TX fifo resizing. :).
> 
> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
> behavior.

-- 
heikki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-11 13:14                   ` Heikki Krogerus
@ 2021-06-11 13:21                     ` Andy Shevchenko
  2021-06-12 21:27                       ` Ferry Toth
  0 siblings, 1 reply; 31+ messages in thread
From: Andy Shevchenko @ 2021-06-11 13:21 UTC (permalink / raw)
  To: Heikki Krogerus, Ferry Toth
  Cc: Felipe Balbi, Andy Shevchenko, Wesley Cheng, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn

On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
<heikki.krogerus@linux.intel.com> wrote:
>
> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
> >
> > Hi,
> >
> > Wesley Cheng <wcheng@codeaurora.org> writes:
> > >>>>>>> to be honest, I don't think these should go in (apart from the build
> > >>>>>>> failure) because it's likely to break instantiations of the core with
> > >>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
> > >>>>>>> dedicated functionality that requires the default FIFO size configured
> > >>>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
> > >>>>>>> implementations which have dedicated endpoints for processor tracing.
> > >>>>>>>
> > >>>>>>> With OMAP5, these endpoints are configured at the top of the available
> > >>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
> > >>>>>>> over most of the FIFO space because of this resizing, processor tracing
> > >>>>>>> will have a hard time running. That being said, processor tracing isn't
> > >>>>>>> supported in upstream at this moment.
> > >>>>>>>
> > >>>>>
> > >>>>> I agree that the application of this logic may differ between vendors,
> > >>>>> hence why I wanted to keep this controllable by the DT property, so that
> > >>>>> for those which do not support this use case can leave it disabled.  The
> > >>>>> logic is there to ensure that for a given USB configuration, for each EP
> > >>>>> it would have at least 1 TX FIFO.  For USB configurations which don't
> > >>>>> utilize all available IN EPs, it would allow re-allocation of internal
> > >>>>> memory to EPs which will actually be in use.
> > >>>>
> > >>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
> > >>>> be a little nicer in this regard.
> > >>>>
> > >>>
> > >>> Don't get me wrong, I think once those features become available
> > >>> upstream, we can improve the logic.  From what I remember when looking
> > >>
> > >> sure, I support that. But I want to make sure the first cut isn't likely
> > >> to break things left and right :)
> > >>
> > >> Hence, let's at least get more testing.
> > >>
> > >
> > > Sure, I'd hope that the other users of DWC3 will also see some pretty
> > > big improvements on the TX path with this.
> >
> > fingers crossed
> >
> > >>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
> > >>> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
> > >>
> > >> right, that's the reason why we introduced the endpoint feature
> > >> flags. The end goal was that the UDC would be able to have custom
> > >> feature flags paired with ->validate_endpoint() or whatever before
> > >> allowing it to be enabled. Then the UDC driver could tell UDC core to
> > >> skip that endpoint on that particular platform without interefering with
> > >> everything else.
> > >>
> > >> Of course, we still need to figure out a way to abstract the different
> > >> dwc3 instantiations.
> > >>
> > >>> was the change which ended up upstream for the Intel tracer then we
> > >>> could improve the logic to avoid re-sizing those particular EPs.
> > >>
> > >> The problem then, just as I mentioned in the previous paragraph, will be
> > >> coming up with a solution that's elegant and works for all different
> > >> instantiations of dwc3 (or musb, cdns3, etc).
> > >>
> > >
> > > Well, at least for the TX FIFO resizing logic, we'd only be needing to
> > > focus on the DWC3 implementation.
> > >
> > > You bring up another good topic that I'll eventually needing to be
> > > taking a look at, which is a nice way we can handle vendor specific
> > > endpoints and how they can co-exist with other "normal" endpoints.  We
> > > have a few special HW eps as well, which we try to maintain separately
> > > in our DWC3 vendor driver, but it isn't the most convenient, or most
> > > pretty method :).
> >
> > Awesome, as mentioned, the endpoint feature flags were added exactly to
> > allow for these vendor-specific features :-)
> >
> > I'm more than happy to help testing now that I finally got our SM8150
> > Surface Duo device tree accepted by Bjorn ;-)
> >
> > >>> However, I'm not sure how the changes would look like in the end, so I
> > >>> would like to wait later down the line to include that :).
> > >>
> > >> Fair enough, I agree. Can we get some more testing of $subject, though?
> > >> Did you test $subject with upstream too? Which gadget drivers did you
> > >> use? How did you test
> > >>
> > >
> > > The results that I included in the cover page was tested with the pure
> > > upstream kernel on our device.  Below was using the ConfigFS gadget w/ a
> > > mass storage only composition.
> > >
> > > Test Parameters:
> > >  - Platform: Qualcomm SM8150
> > >  - bMaxBurst = 6
> > >  - USB req size = 256kB
> > >  - Num of USB reqs = 16
> >
> > do you mind testing with the regular request size (16KiB) and 250
> > requests? I think we can even do 15 bursts in that case.
> >
> > >  - USB Speed = Super-Speed
> > >  - Function Driver: Mass Storage (w/ ramdisk)
> > >  - Test Application: CrystalDiskMark
> > >
> > > Results:
> > >
> > > TXFIFO Depth = 3 max packets
> > >
> > > Test Case | Data Size | AVG tput (in MB/s)
> > > -------------------------------------------
> > > Sequential|1 GB x     |
> > > Read      |9 loops    | 193.60
> > >           |           | 195.86
> > >           |           | 184.77
> > >           |           | 193.60
> > > -------------------------------------------
> > >
> > > TXFIFO Depth = 6 max packets
> > >
> > > Test Case | Data Size | AVG tput (in MB/s)
> > > -------------------------------------------
> > > Sequential|1 GB x     |
> > > Read      |9 loops    | 287.35
> > >         |           | 304.94
> > >           |           | 289.64
> > >           |           | 293.61
> >
> > I remember getting close to 400MiB/sec with Intel platforms without
> > resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
> > memory could be failing.
> >
> > Then again, I never ran with CrystalDiskMark, I was using my own tool
> > (it's somewhere in github. If you care, I can look up the URL).
> >
> > > We also have internal numbers which have shown similar improvements as
> > > well.  Those are over networking/tethering interfaces, so testing IPERF
> > > loopback over TCP/UDP.
> >
> > loopback iperf? That would skip the wire, no?
> >
> > >>> size of 2 and TX threshold of 1, this would really be not beneficial to
> > >>> us, because we can only change the TX threshold to 2 at max, and at
> > >>> least in my observations, once we have to go out to system memory to
> > >>> fetch the next data packet, that latency takes enough time for the
> > >>> controller to end the current burst.
> > >>
> > >> What I noticed with g_mass_storage is that we can amortize the cost of
> > >> fetching data from memory, with a deeper request queue. Whenever I
> > >> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
> > >> enough to give me very good performance. Never had to poke at TX FIFO
> > >> resizing. Did you try something like this too?
> > >>
> > >> I feel that allocating more requests is a far simpler and more generic
> > >> method that changing FIFO sizes :)
> > >>
> > >
> > > I wish I had a USB bus trace handy to show you, which would make it very
> > > clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
> > > by increasing the number of USB requests, that will help if there was a
> > > bottleneck at the SW level where the application/function driver
> > > utilizing the DWC3 was submitting data much faster than the HW was
> > > processing them.
> > >
> > > So yes, this method of increasing the # of USB reqs will definitely help
> > > with situations such as HSUSB or in SSUSB when EP bursting isn't used.
> > > The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
> > > bursting.
> >
> > Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
> > here too. I have clear memories of testing this very scenario of
> > bursting (using g_mass_storage at the time) because I was curious about
> > it. Back then, my tests showed no difference in behavior.
> >
> > It could be nice if Heikki could test Intel parts with and without your
> > changes on g_mass_storage with 250 requests.
>
> Andy, you have a system at hand that has the DWC3 block enabled,
> right? Can you help out here?

I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
more test cases (I have only one or two) and maybe can help. But I'll
keep this in mind.

> > > Now with endpoint bursting, if the function notifies the host that
> > > bursting is supported, when the host sends the ACK for the Data Packet,
> > > it should have a NumP value equal to the bMaxBurst reported in the EP
> >
> > Yes and no. Looking back at the history, we used to configure NUMP based
> > on bMaxBurst, but it was changed later in commit
> > 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
> > problem reported by John Youn.
> >
> > And now we've come full circle. Because even if I believe more requests
> > are enough for bursting, NUMP is limited by the RxFIFO size. This ends
> > up supporting your claim that we need RxFIFO resizing if we want to
> > squeeze more throughput out of the controller.
> >
> > However, note that this is about RxFIFO size, not TxFIFO size. In fact,
> > looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
> > (emphasis is mine):
> >
> >       "Number of Packets (NumP). This field is used to indicate the
> >       number of Data Packet buffers that the **receiver** can
> >       accept. The value in this field shall be less than or equal to
> >       the maximum burst size supported by the endpoint as determined
> >       by the value in the bMaxBurst field in the Endpoint Companion
> >       Descriptor (refer to Section 9.6.7)."
> >
> > So, NumP is for the receiver, not the transmitter. Could you clarify
> > what you mean here?
> >
> > /me keeps reading
> >
> > Hmm, table 8-15 tries to clarify:
> >
> >       "Number of Packets (NumP).
> >
> >       For an OUT endpoint, refer to Table 8-13 for the description of
> >       this field.
> >
> >       For an IN endpoint this field is set by the endpoint to the
> >       number of packets it can transmit when the host resumes
> >       transactions to it. This field shall not have a value greater
> >       than the maximum burst size supported by the endpoint as
> >       indicated by the value in the bMaxBurst field in the Endpoint
> >       Companion Descriptor. Note that the value reported in this field
> >       may be treated by the host as informative only."
> >
> > However, if I remember correctly (please verify dwc3 databook), NUMP in
> > DCFG was only for receive buffers. Thin, John, how does dwc3 compute
> > NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
> > TxFIFO size?
> >
> > > desc.  If we have a TXFIFO size of 2, then normally what I have seen is
> > > that after 2 data packets, the device issues a NRDY.  So then we'd need
> > > to send an ERDY once data is available within the FIFO, and the same
> > > sequence happens until the USB request is complete.  With this constant
> > > NRDY/ERDY handshake going on, you actually see that the bus is under
> > > utilized.  When we increase an EP's FIFO size, then you'll see constant
> > > bursts for a request, until the request is done, or if the host runs out
> > > of RXFIFO. (ie no interruption [on the USB protocol level] during USB
> > > request data transfer)
> >
> > Unfortunately I don't have access to a USB sniffer anymore :-(
> >
> > >>>>>> Good points.
> > >>>>>>
> > >>>>>> Wesley, what kind of testing have you done on this on different devices?
> > >>>>>>
> > >>>>>
> > >>>>> As mentioned above, these changes are currently present on end user
> > >>>>> devices for the past few years, so its been through a lot of testing :).
> > >>>>
> > >>>> all with the same gadget driver. Also, who uses USB on android devices
> > >>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
> > >>>> :-)
> > >>>>
> > >>>> I guess only developers are using USB during development to flash dev
> > >>>> images heh.
> > >>>>
> > >>>
> > >>> I used to be a customer facing engineer, so honestly I did see some
> > >>> really interesting and crazy designs.  Again, we do have non-Android
> > >>> products that use the same code, and it has been working in there for a
> > >>> few years as well.  The TXFIFO sizing really has helped with multimedia
> > >>> use cases, which use isoc endpoints, since esp. in those lower end CPU
> > >>> chips where latencies across the system are much larger, and a missed
> > >>> ISOC interval leads to a pop in your ear.
> > >>
> > >> This is good background information. Thanks for bringing this
> > >> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
> > >> knowing if a deeper request queue would also help here.
> > >>
> > >> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
> > >> our gadget driver uses a low number of requests, we're never really
> > >> using the TRB ring in our benefit.
> > >>
> > >
> > > We're actually using both a deeper USB request queue + TX fifo resizing. :).
> >
> > okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
> > behavior.
>
> --
> heikki



-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-11 13:21                     ` Andy Shevchenko
@ 2021-06-12 21:27                       ` Ferry Toth
  2021-06-12 21:37                         ` Andy Shevchenko
  2021-06-14 18:58                         ` Wesley Cheng
  0 siblings, 2 replies; 31+ messages in thread
From: Ferry Toth @ 2021-06-12 21:27 UTC (permalink / raw)
  To: Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Wesley Cheng, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn

Hi

Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
> <heikki.krogerus@linux.intel.com> wrote:
>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>> Hi,
>>>
>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>> to be honest, I don't think these should go in (apart from the build
>>>>>>>>>> failure) because it's likely to break instantiations of the core with
>>>>>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>>>>>>>>> dedicated functionality that requires the default FIFO size configured
>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>>>>>>>>> implementations which have dedicated endpoints for processor tracing.
>>>>>>>>>>
>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the available
>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
>>>>>>>>>> over most of the FIFO space because of this resizing, processor tracing
>>>>>>>>>> will have a hard time running. That being said, processor tracing isn't
>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>
>>>>>>>> I agree that the application of this logic may differ between vendors,
>>>>>>>> hence why I wanted to keep this controllable by the DT property, so that
>>>>>>>> for those which do not support this use case can leave it disabled.  The
>>>>>>>> logic is there to ensure that for a given USB configuration, for each EP
>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations which don't
>>>>>>>> utilize all available IN EPs, it would allow re-allocation of internal
>>>>>>>> memory to EPs which will actually be in use.
>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
>>>>>>> be a little nicer in this regard.
>>>>>>>
>>>>>> Don't get me wrong, I think once those features become available
>>>>>> upstream, we can improve the logic.  From what I remember when looking
>>>>> sure, I support that. But I want to make sure the first cut isn't likely
>>>>> to break things left and right :)
>>>>>
>>>>> Hence, let's at least get more testing.
>>>>>
>>>> Sure, I'd hope that the other users of DWC3 will also see some pretty
>>>> big improvements on the TX path with this.
>>> fingers crossed
>>>
>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
>>>>> right, that's the reason why we introduced the endpoint feature
>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>> feature flags paired with ->validate_endpoint() or whatever before
>>>>> allowing it to be enabled. Then the UDC driver could tell UDC core to
>>>>> skip that endpoint on that particular platform without interefering with
>>>>> everything else.
>>>>>
>>>>> Of course, we still need to figure out a way to abstract the different
>>>>> dwc3 instantiations.
>>>>>
>>>>>> was the change which ended up upstream for the Intel tracer then we
>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>> The problem then, just as I mentioned in the previous paragraph, will be
>>>>> coming up with a solution that's elegant and works for all different
>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>
>>>> Well, at least for the TX FIFO resizing logic, we'd only be needing to
>>>> focus on the DWC3 implementation.
>>>>
>>>> You bring up another good topic that I'll eventually needing to be
>>>> taking a look at, which is a nice way we can handle vendor specific
>>>> endpoints and how they can co-exist with other "normal" endpoints.  We
>>>> have a few special HW eps as well, which we try to maintain separately
>>>> in our DWC3 vendor driver, but it isn't the most convenient, or most
>>>> pretty method :).
>>> Awesome, as mentioned, the endpoint feature flags were added exactly to
>>> allow for these vendor-specific features :-)
>>>
>>> I'm more than happy to help testing now that I finally got our SM8150
>>> Surface Duo device tree accepted by Bjorn ;-)
>>>
>>>>>> However, I'm not sure how the changes would look like in the end, so I
>>>>>> would like to wait later down the line to include that :).
>>>>> Fair enough, I agree. Can we get some more testing of $subject, though?
>>>>> Did you test $subject with upstream too? Which gadget drivers did you
>>>>> use? How did you test
>>>>>
>>>> The results that I included in the cover page was tested with the pure
>>>> upstream kernel on our device.  Below was using the ConfigFS gadget w/ a
>>>> mass storage only composition.
>>>>
>>>> Test Parameters:
>>>>   - Platform: Qualcomm SM8150
>>>>   - bMaxBurst = 6
>>>>   - USB req size = 256kB
>>>>   - Num of USB reqs = 16
>>> do you mind testing with the regular request size (16KiB) and 250
>>> requests? I think we can even do 15 bursts in that case.
>>>
>>>>   - USB Speed = Super-Speed
>>>>   - Function Driver: Mass Storage (w/ ramdisk)
>>>>   - Test Application: CrystalDiskMark
>>>>
>>>> Results:
>>>>
>>>> TXFIFO Depth = 3 max packets
>>>>
>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>> -------------------------------------------
>>>> Sequential|1 GB x     |
>>>> Read      |9 loops    | 193.60
>>>>            |           | 195.86
>>>>            |           | 184.77
>>>>            |           | 193.60
>>>> -------------------------------------------
>>>>
>>>> TXFIFO Depth = 6 max packets
>>>>
>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>> -------------------------------------------
>>>> Sequential|1 GB x     |
>>>> Read      |9 loops    | 287.35
>>>>          |           | 304.94
>>>>            |           | 289.64
>>>>            |           | 293.61
>>> I remember getting close to 400MiB/sec with Intel platforms without
>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
>>> memory could be failing.
>>>
>>> Then again, I never ran with CrystalDiskMark, I was using my own tool
>>> (it's somewhere in github. If you care, I can look up the URL).
>>>
>>>> We also have internal numbers which have shown similar improvements as
>>>> well.  Those are over networking/tethering interfaces, so testing IPERF
>>>> loopback over TCP/UDP.
>>> loopback iperf? That would skip the wire, no?
>>>
>>>>>> size of 2 and TX threshold of 1, this would really be not beneficial to
>>>>>> us, because we can only change the TX threshold to 2 at max, and at
>>>>>> least in my observations, once we have to go out to system memory to
>>>>>> fetch the next data packet, that latency takes enough time for the
>>>>>> controller to end the current burst.
>>>>> What I noticed with g_mass_storage is that we can amortize the cost of
>>>>> fetching data from memory, with a deeper request queue. Whenever I
>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
>>>>> enough to give me very good performance. Never had to poke at TX FIFO
>>>>> resizing. Did you try something like this too?
>>>>>
>>>>> I feel that allocating more requests is a far simpler and more generic
>>>>> method that changing FIFO sizes :)
>>>>>
>>>> I wish I had a USB bus trace handy to show you, which would make it very
>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
>>>> by increasing the number of USB requests, that will help if there was a
>>>> bottleneck at the SW level where the application/function driver
>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>> processing them.
>>>>
>>>> So yes, this method of increasing the # of USB reqs will definitely help
>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't used.
>>>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
>>>> bursting.
>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
>>> here too. I have clear memories of testing this very scenario of
>>> bursting (using g_mass_storage at the time) because I was curious about
>>> it. Back then, my tests showed no difference in behavior.
>>>
>>> It could be nice if Heikki could test Intel parts with and without your
>>> changes on g_mass_storage with 250 requests.
>> Andy, you have a system at hand that has the DWC3 block enabled,
>> right? Can you help out here?
> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
> more test cases (I have only one or two) and maybe can help. But I'll
> keep this in mind.

I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches 
apply. Switching between host/gadget works, no connections dropping, no 
errors in dmesg.

In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have 
composite device created from configfs with gser / eem / mass_storage / 
uac2.

Tested with iperf3 performance in host (93.6Mbits/sec) and gadget 
(207Mbits/sec) mode. Compared to v5.10.41 without patches host 
(93.4Mbits/sec) and gadget (198Mbits/sec).

Gadget seems to be a little faster with the patches, but that might also 
be caused  by something else, on v5.10.41 I see the bitrate bouncing 
between 207 and 199.

I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With 
v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.

With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41 
34.7MB/s. This might be limited by Edison's internal eMMC speed (when 
booting U-Boot reads the kernel with 21.4 MiB/s).

>>>> Now with endpoint bursting, if the function notifies the host that
>>>> bursting is supported, when the host sends the ACK for the Data Packet,
>>>> it should have a NumP value equal to the bMaxBurst reported in the EP
>>> Yes and no. Looking back at the history, we used to configure NUMP based
>>> on bMaxBurst, but it was changed later in commit
>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
>>> problem reported by John Youn.
>>>
>>> And now we've come full circle. Because even if I believe more requests
>>> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
>>> up supporting your claim that we need RxFIFO resizing if we want to
>>> squeeze more throughput out of the controller.
>>>
>>> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
>>> (emphasis is mine):
>>>
>>>        "Number of Packets (NumP). This field is used to indicate the
>>>        number of Data Packet buffers that the **receiver** can
>>>        accept. The value in this field shall be less than or equal to
>>>        the maximum burst size supported by the endpoint as determined
>>>        by the value in the bMaxBurst field in the Endpoint Companion
>>>        Descriptor (refer to Section 9.6.7)."
>>>
>>> So, NumP is for the receiver, not the transmitter. Could you clarify
>>> what you mean here?
>>>
>>> /me keeps reading
>>>
>>> Hmm, table 8-15 tries to clarify:
>>>
>>>        "Number of Packets (NumP).
>>>
>>>        For an OUT endpoint, refer to Table 8-13 for the description of
>>>        this field.
>>>
>>>        For an IN endpoint this field is set by the endpoint to the
>>>        number of packets it can transmit when the host resumes
>>>        transactions to it. This field shall not have a value greater
>>>        than the maximum burst size supported by the endpoint as
>>>        indicated by the value in the bMaxBurst field in the Endpoint
>>>        Companion Descriptor. Note that the value reported in this field
>>>        may be treated by the host as informative only."
>>>
>>> However, if I remember correctly (please verify dwc3 databook), NUMP in
>>> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
>>> NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
>>> TxFIFO size?
>>>
>>>> desc.  If we have a TXFIFO size of 2, then normally what I have seen is
>>>> that after 2 data packets, the device issues a NRDY.  So then we'd need
>>>> to send an ERDY once data is available within the FIFO, and the same
>>>> sequence happens until the USB request is complete.  With this constant
>>>> NRDY/ERDY handshake going on, you actually see that the bus is under
>>>> utilized.  When we increase an EP's FIFO size, then you'll see constant
>>>> bursts for a request, until the request is done, or if the host runs out
>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during USB
>>>> request data transfer)
>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>
>>>>>>>>> Good points.
>>>>>>>>>
>>>>>>>>> Wesley, what kind of testing have you done on this on different devices?
>>>>>>>>>
>>>>>>>> As mentioned above, these changes are currently present on end user
>>>>>>>> devices for the past few years, so its been through a lot of testing :).
>>>>>>> all with the same gadget driver. Also, who uses USB on android devices
>>>>>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
>>>>>>> :-)
>>>>>>>
>>>>>>> I guess only developers are using USB during development to flash dev
>>>>>>> images heh.
>>>>>>>
>>>>>> I used to be a customer facing engineer, so honestly I did see some
>>>>>> really interesting and crazy designs.  Again, we do have non-Android
>>>>>> products that use the same code, and it has been working in there for a
>>>>>> few years as well.  The TXFIFO sizing really has helped with multimedia
>>>>>> use cases, which use isoc endpoints, since esp. in those lower end CPU
>>>>>> chips where latencies across the system are much larger, and a missed
>>>>>> ISOC interval leads to a pop in your ear.
>>>>> This is good background information. Thanks for bringing this
>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
>>>>> knowing if a deeper request queue would also help here.
>>>>>
>>>>> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
>>>>> our gadget driver uses a low number of requests, we're never really
>>>>> using the TRB ring in our benefit.
>>>>>
>>>> We're actually using both a deeper USB request queue + TX fifo resizing. :).
>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
>>> behavior.
>> --
>> heikki
>
>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-12 21:27                       ` Ferry Toth
@ 2021-06-12 21:37                         ` Andy Shevchenko
  2021-06-14 18:58                         ` Wesley Cheng
  1 sibling, 0 replies; 31+ messages in thread
From: Andy Shevchenko @ 2021-06-12 21:37 UTC (permalink / raw)
  To: Ferry Toth
  Cc: Heikki Krogerus, Felipe Balbi, Andy Shevchenko, Wesley Cheng,
	Greg KH, Andy Gross, Bjorn Andersson, Rob Herring, USB,
	Linux Kernel Mailing List, linux-arm-msm, devicetree, Jack Pham,
	Thinh Nguyen, John Youn

On Sun, Jun 13, 2021 at 12:27 AM Ferry Toth <fntoth@gmail.com> wrote:
> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
> > On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
> > <heikki.krogerus@linux.intel.com> wrote:
> >> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
> >>> Wesley Cheng <wcheng@codeaurora.org> writes:

...

> >>>>   - USB Speed = Super-Speed
> >>>>   - Function Driver: Mass Storage (w/ ramdisk)
> >>>>   - Test Application: CrystalDiskMark
> >>>>
> >>>> Results:
> >>>>
> >>>> TXFIFO Depth = 3 max packets
> >>>>
> >>>> Test Case | Data Size | AVG tput (in MB/s)
> >>>> -------------------------------------------
> >>>> Sequential|1 GB x     |
> >>>> Read      |9 loops    | 193.60
> >>>>            |           | 195.86
> >>>>            |           | 184.77
> >>>>            |           | 193.60
> >>>> -------------------------------------------
> >>>>
> >>>> TXFIFO Depth = 6 max packets
> >>>>
> >>>> Test Case | Data Size | AVG tput (in MB/s)
> >>>> -------------------------------------------
> >>>> Sequential|1 GB x     |
> >>>> Read      |9 loops    | 287.35
> >>>>          |           | 304.94
> >>>>            |           | 289.64
> >>>>            |           | 293.61
> >>> I remember getting close to 400MiB/sec with Intel platforms without
> >>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
> >>> memory could be failing.
> >>>
> >>> Then again, I never ran with CrystalDiskMark, I was using my own tool
> >>> (it's somewhere in github. If you care, I can look up the URL).
> >>>
> >>>> We also have internal numbers which have shown similar improvements as
> >>>> well.  Those are over networking/tethering interfaces, so testing IPERF
> >>>> loopback over TCP/UDP.
> >>> loopback iperf? That would skip the wire, no?
> >>>
> >>>>>> size of 2 and TX threshold of 1, this would really be not beneficial to
> >>>>>> us, because we can only change the TX threshold to 2 at max, and at
> >>>>>> least in my observations, once we have to go out to system memory to
> >>>>>> fetch the next data packet, that latency takes enough time for the
> >>>>>> controller to end the current burst.
> >>>>> What I noticed with g_mass_storage is that we can amortize the cost of
> >>>>> fetching data from memory, with a deeper request queue. Whenever I
> >>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
> >>>>> enough to give me very good performance. Never had to poke at TX FIFO
> >>>>> resizing. Did you try something like this too?
> >>>>>
> >>>>> I feel that allocating more requests is a far simpler and more generic
> >>>>> method that changing FIFO sizes :)
> >>>>>
> >>>> I wish I had a USB bus trace handy to show you, which would make it very
> >>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
> >>>> by increasing the number of USB requests, that will help if there was a
> >>>> bottleneck at the SW level where the application/function driver
> >>>> utilizing the DWC3 was submitting data much faster than the HW was
> >>>> processing them.
> >>>>
> >>>> So yes, this method of increasing the # of USB reqs will definitely help
> >>>> with situations such as HSUSB or in SSUSB when EP bursting isn't used.
> >>>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
> >>>> bursting.
> >>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
> >>> here too. I have clear memories of testing this very scenario of
> >>> bursting (using g_mass_storage at the time) because I was curious about
> >>> it. Back then, my tests showed no difference in behavior.
> >>>
> >>> It could be nice if Heikki could test Intel parts with and without your
> >>> changes on g_mass_storage with 250 requests.
> >> Andy, you have a system at hand that has the DWC3 block enabled,
> >> right? Can you help out here?
> > I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
> > more test cases (I have only one or two) and maybe can help. But I'll
> > keep this in mind.
>
> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
> apply. Switching between host/gadget works, no connections dropping, no
> errors in dmesg.
>
> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
> composite device created from configfs with gser / eem / mass_storage /
> uac2.
>
> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
> (93.4Mbits/sec) and gadget (198Mbits/sec).
>
> Gadget seems to be a little faster with the patches, but that might also
> be caused  by something else, on v5.10.41 I see the bitrate bouncing
> between 207 and 199.
>
> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With
> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>
> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
> booting U-Boot reads the kernel with 21.4 MiB/s).

Ferry, thank you very much for this information and testing efforts!

> >>>> Now with endpoint bursting, if the function notifies the host that
> >>>> bursting is supported, when the host sends the ACK for the Data Packet,
> >>>> it should have a NumP value equal to the bMaxBurst reported in the EP
> >>> Yes and no. Looking back at the history, we used to configure NUMP based
> >>> on bMaxBurst, but it was changed later in commit
> >>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
> >>> problem reported by John Youn.
> >>>
> >>> And now we've come full circle. Because even if I believe more requests
> >>> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
> >>> up supporting your claim that we need RxFIFO resizing if we want to
> >>> squeeze more throughput out of the controller.
> >>>
> >>> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
> >>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
> >>> (emphasis is mine):
> >>>
> >>>        "Number of Packets (NumP). This field is used to indicate the
> >>>        number of Data Packet buffers that the **receiver** can
> >>>        accept. The value in this field shall be less than or equal to
> >>>        the maximum burst size supported by the endpoint as determined
> >>>        by the value in the bMaxBurst field in the Endpoint Companion
> >>>        Descriptor (refer to Section 9.6.7)."
> >>>
> >>> So, NumP is for the receiver, not the transmitter. Could you clarify
> >>> what you mean here?
> >>>
> >>> /me keeps reading
> >>>
> >>> Hmm, table 8-15 tries to clarify:
> >>>
> >>>        "Number of Packets (NumP).
> >>>
> >>>        For an OUT endpoint, refer to Table 8-13 for the description of
> >>>        this field.
> >>>
> >>>        For an IN endpoint this field is set by the endpoint to the
> >>>        number of packets it can transmit when the host resumes
> >>>        transactions to it. This field shall not have a value greater
> >>>        than the maximum burst size supported by the endpoint as
> >>>        indicated by the value in the bMaxBurst field in the Endpoint
> >>>        Companion Descriptor. Note that the value reported in this field
> >>>        may be treated by the host as informative only."
> >>>
> >>> However, if I remember correctly (please verify dwc3 databook), NUMP in
> >>> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
> >>> NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
> >>> TxFIFO size?
> >>>
> >>>> desc.  If we have a TXFIFO size of 2, then normally what I have seen is
> >>>> that after 2 data packets, the device issues a NRDY.  So then we'd need
> >>>> to send an ERDY once data is available within the FIFO, and the same
> >>>> sequence happens until the USB request is complete.  With this constant
> >>>> NRDY/ERDY handshake going on, you actually see that the bus is under
> >>>> utilized.  When we increase an EP's FIFO size, then you'll see constant
> >>>> bursts for a request, until the request is done, or if the host runs out
> >>>> of RXFIFO. (ie no interruption [on the USB protocol level] during USB
> >>>> request data transfer)
> >>> Unfortunately I don't have access to a USB sniffer anymore :-(
> >>>
> >>>>>>>>> Good points.
> >>>>>>>>>
> >>>>>>>>> Wesley, what kind of testing have you done on this on different devices?
> >>>>>>>>>
> >>>>>>>> As mentioned above, these changes are currently present on end user
> >>>>>>>> devices for the past few years, so its been through a lot of testing :).
> >>>>>>> all with the same gadget driver. Also, who uses USB on android devices
> >>>>>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
> >>>>>>> :-)
> >>>>>>>
> >>>>>>> I guess only developers are using USB during development to flash dev
> >>>>>>> images heh.
> >>>>>>>
> >>>>>> I used to be a customer facing engineer, so honestly I did see some
> >>>>>> really interesting and crazy designs.  Again, we do have non-Android
> >>>>>> products that use the same code, and it has been working in there for a
> >>>>>> few years as well.  The TXFIFO sizing really has helped with multimedia
> >>>>>> use cases, which use isoc endpoints, since esp. in those lower end CPU
> >>>>>> chips where latencies across the system are much larger, and a missed
> >>>>>> ISOC interval leads to a pop in your ear.
> >>>>> This is good background information. Thanks for bringing this
> >>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
> >>>>> knowing if a deeper request queue would also help here.
> >>>>>
> >>>>> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
> >>>>> our gadget driver uses a low number of requests, we're never really
> >>>>> using the TRB ring in our benefit.
> >>>>>
> >>>> We're actually using both a deeper USB request queue + TX fifo resizing. :).
> >>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
> >>> behavior.



-- 
With Best Regards,
Andy Shevchenko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-12 21:27                       ` Ferry Toth
  2021-06-12 21:37                         ` Andy Shevchenko
@ 2021-06-14 18:58                         ` Wesley Cheng
  2021-06-14 19:30                           ` Ferry Toth
  1 sibling, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-14 18:58 UTC (permalink / raw)
  To: Ferry Toth, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn



On 6/12/2021 2:27 PM, Ferry Toth wrote:
> Hi
> 
> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>> <heikki.krogerus@linux.intel.com> wrote:
>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>> Hi,
>>>>
>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>> the build
>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>> core with
>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>> endpoints with
>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>> configured
>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>> some Intel
>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>> tracing.
>>>>>>>>>>>
>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>> available
>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>> and takes
>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>> processor tracing
>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>> tracing isn't
>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>
>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>> vendors,
>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>> property, so that
>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>> disabled.  The
>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>> for each EP
>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations which
>>>>>>>>> don't
>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>> internal
>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>> like we can
>>>>>>>> be a little nicer in this regard.
>>>>>>>>
>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>> looking
>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>> likely
>>>>>> to break things left and right :)
>>>>>>
>>>>>> Hence, let's at least get more testing.
>>>>>>
>>>>> Sure, I'd hope that the other users of DWC3 will also see some pretty
>>>>> big improvements on the TX path with this.
>>>> fingers crossed
>>>>
>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>> were
>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list. 
>>>>>>> If that
>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>> feature flags paired with ->validate_endpoint() or whatever before
>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC core to
>>>>>> skip that endpoint on that particular platform without
>>>>>> interefering with
>>>>>> everything else.
>>>>>>
>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>> different
>>>>>> dwc3 instantiations.
>>>>>>
>>>>>>> was the change which ended up upstream for the Intel tracer then we
>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>> will be
>>>>>> coming up with a solution that's elegant and works for all different
>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>
>>>>> Well, at least for the TX FIFO resizing logic, we'd only be needing to
>>>>> focus on the DWC3 implementation.
>>>>>
>>>>> You bring up another good topic that I'll eventually needing to be
>>>>> taking a look at, which is a nice way we can handle vendor specific
>>>>> endpoints and how they can co-exist with other "normal" endpoints.  We
>>>>> have a few special HW eps as well, which we try to maintain separately
>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or most
>>>>> pretty method :).
>>>> Awesome, as mentioned, the endpoint feature flags were added exactly to
>>>> allow for these vendor-specific features :-)
>>>>
>>>> I'm more than happy to help testing now that I finally got our SM8150
>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>
>>>>>>> However, I'm not sure how the changes would look like in the end,
>>>>>>> so I
>>>>>>> would like to wait later down the line to include that :).
>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>> though?
>>>>>> Did you test $subject with upstream too? Which gadget drivers did you
>>>>>> use? How did you test
>>>>>>
>>>>> The results that I included in the cover page was tested with the pure
>>>>> upstream kernel on our device.  Below was using the ConfigFS gadget
>>>>> w/ a
>>>>> mass storage only composition.
>>>>>
>>>>> Test Parameters:
>>>>>   - Platform: Qualcomm SM8150
>>>>>   - bMaxBurst = 6
>>>>>   - USB req size = 256kB
>>>>>   - Num of USB reqs = 16
>>>> do you mind testing with the regular request size (16KiB) and 250
>>>> requests? I think we can even do 15 bursts in that case.
>>>>
>>>>>   - USB Speed = Super-Speed
>>>>>   - Function Driver: Mass Storage (w/ ramdisk)
>>>>>   - Test Application: CrystalDiskMark
>>>>>
>>>>> Results:
>>>>>
>>>>> TXFIFO Depth = 3 max packets
>>>>>
>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>> -------------------------------------------
>>>>> Sequential|1 GB x     |
>>>>> Read      |9 loops    | 193.60
>>>>>            |           | 195.86
>>>>>            |           | 184.77
>>>>>            |           | 193.60
>>>>> -------------------------------------------
>>>>>
>>>>> TXFIFO Depth = 6 max packets
>>>>>
>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>> -------------------------------------------
>>>>> Sequential|1 GB x     |
>>>>> Read      |9 loops    | 287.35
>>>>>          |           | 304.94
>>>>>            |           | 289.64
>>>>>            |           | 293.61
>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
>>>> memory could be failing.
>>>>
>>>> Then again, I never ran with CrystalDiskMark, I was using my own tool
>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>
>>>>> We also have internal numbers which have shown similar improvements as
>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>> IPERF
>>>>> loopback over TCP/UDP.
>>>> loopback iperf? That would skip the wire, no?
>>>>
>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>> beneficial to
>>>>>>> us, because we can only change the TX threshold to 2 at max, and at
>>>>>>> least in my observations, once we have to go out to system memory to
>>>>>>> fetch the next data packet, that latency takes enough time for the
>>>>>>> controller to end the current burst.
>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>> cost of
>>>>>> fetching data from memory, with a deeper request queue. Whenever I
>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>> that was
>>>>>> enough to give me very good performance. Never had to poke at TX FIFO
>>>>>> resizing. Did you try something like this too?
>>>>>>
>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>> generic
>>>>>> method that changing FIFO sizes :)
>>>>>>
>>>>> I wish I had a USB bus trace handy to show you, which would make it
>>>>> very
>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>> 6.  So
>>>>> by increasing the number of USB requests, that will help if there
>>>>> was a
>>>>> bottleneck at the SW level where the application/function driver
>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>> processing them.
>>>>>
>>>>> So yes, this method of increasing the # of USB reqs will definitely
>>>>> help
>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't used.
>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
>>>>> bursting.
>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>> role
>>>> here too. I have clear memories of testing this very scenario of
>>>> bursting (using g_mass_storage at the time) because I was curious about
>>>> it. Back then, my tests showed no difference in behavior.
>>>>
>>>> It could be nice if Heikki could test Intel parts with and without your
>>>> changes on g_mass_storage with 250 requests.
>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>> right? Can you help out here?
>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>> more test cases (I have only one or two) and maybe can help. But I'll
>> keep this in mind.
> 
> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
> apply. Switching between host/gadget works, no connections dropping, no
> errors in dmesg.
> 
> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
> composite device created from configfs with gser / eem / mass_storage /
> uac2.
> 
> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
> (93.4Mbits/sec) and gadget (198Mbits/sec).
> 
> Gadget seems to be a little faster with the patches, but that might also
> be caused  by something else, on v5.10.41 I see the bitrate bouncing
> between 207 and 199.
> 
> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With
> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
> 
> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
> booting U-Boot reads the kernel with 21.4 MiB/s).
> 
Hi Ferry,

Thanks for the testing.  Just to double check, did you also enable the
property, which enabled the TXFIFO resize feature on the platform?  For
example, for the QCOM SM8150 platform, we're adding the following to our
device tree node:

tx-fifo-resize

If not, then your results at least confirms that w/o the property
present, the changes won't break anything :).  Thanks again for the
initial testing!

Thanks
Wesley Cheng

>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>> Packet,
>>>>> it should have a NumP value equal to the bMaxBurst reported in the EP
>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>> based
>>>> on bMaxBurst, but it was changed later in commit
>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
>>>> problem reported by John Youn.
>>>>
>>>> And now we've come full circle. Because even if I believe more requests
>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>> squeeze more throughput out of the controller.
>>>>
>>>> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
>>>> (emphasis is mine):
>>>>
>>>>        "Number of Packets (NumP). This field is used to indicate the
>>>>        number of Data Packet buffers that the **receiver** can
>>>>        accept. The value in this field shall be less than or equal to
>>>>        the maximum burst size supported by the endpoint as determined
>>>>        by the value in the bMaxBurst field in the Endpoint Companion
>>>>        Descriptor (refer to Section 9.6.7)."
>>>>
>>>> So, NumP is for the receiver, not the transmitter. Could you clarify
>>>> what you mean here?
>>>>
>>>> /me keeps reading
>>>>
>>>> Hmm, table 8-15 tries to clarify:
>>>>
>>>>        "Number of Packets (NumP).
>>>>
>>>>        For an OUT endpoint, refer to Table 8-13 for the description of
>>>>        this field.
>>>>
>>>>        For an IN endpoint this field is set by the endpoint to the
>>>>        number of packets it can transmit when the host resumes
>>>>        transactions to it. This field shall not have a value greater
>>>>        than the maximum burst size supported by the endpoint as
>>>>        indicated by the value in the bMaxBurst field in the Endpoint
>>>>        Companion Descriptor. Note that the value reported in this field
>>>>        may be treated by the host as informative only."
>>>>
>>>> However, if I remember correctly (please verify dwc3 databook), NUMP in
>>>> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>> DCFG.NUMP or
>>>> TxFIFO size?
>>>>
>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>> seen is
>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>> need
>>>>> to send an ERDY once data is available within the FIFO, and the same
>>>>> sequence happens until the USB request is complete.  With this
>>>>> constant
>>>>> NRDY/ERDY handshake going on, you actually see that the bus is under
>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>> constant
>>>>> bursts for a request, until the request is done, or if the host
>>>>> runs out
>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during USB
>>>>> request data transfer)
>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>
>>>>>>>>>> Good points.
>>>>>>>>>>
>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>> different devices?
>>>>>>>>>>
>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>> user
>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>> testing :).
>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>> devices
>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>> Bluetooth, anyway
>>>>>>>> :-)
>>>>>>>>
>>>>>>>> I guess only developers are using USB during development to
>>>>>>>> flash dev
>>>>>>>> images heh.
>>>>>>>>
>>>>>>> I used to be a customer facing engineer, so honestly I did see some
>>>>>>> really interesting and crazy designs.  Again, we do have non-Android
>>>>>>> products that use the same code, and it has been working in there
>>>>>>> for a
>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>> multimedia
>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>> end CPU
>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>> missed
>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>> This is good background information. Thanks for bringing this
>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
>>>>>> knowing if a deeper request queue would also help here.
>>>>>>
>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>> endpoint. If
>>>>>> our gadget driver uses a low number of requests, we're never really
>>>>>> using the TRB ring in our benefit.
>>>>>>
>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>> resizing. :).
>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
>>>> behavior.
>>> -- 
>>> heikki
>>
>>

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-14 18:58                         ` Wesley Cheng
@ 2021-06-14 19:30                           ` Ferry Toth
  2021-06-15  4:22                             ` Wesley Cheng
  0 siblings, 1 reply; 31+ messages in thread
From: Ferry Toth @ 2021-06-14 19:30 UTC (permalink / raw)
  To: Wesley Cheng, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn


Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>
> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>> Hi
>>
>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>> <heikki.krogerus@linux.intel.com> wrote:
>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>> Hi,
>>>>>
>>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>> the build
>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>> core with
>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>> endpoints with
>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>> configured
>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>> some Intel
>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>> tracing.
>>>>>>>>>>>>
>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>> available
>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>> and takes
>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>> processor tracing
>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>
>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>> vendors,
>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>> property, so that
>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>> disabled.  The
>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>> for each EP
>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations which
>>>>>>>>>> don't
>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>> internal
>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>> like we can
>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>
>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>> looking
>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>> likely
>>>>>>> to break things left and right :)
>>>>>>>
>>>>>>> Hence, let's at least get more testing.
>>>>>>>
>>>>>> Sure, I'd hope that the other users of DWC3 will also see some pretty
>>>>>> big improvements on the TX path with this.
>>>>> fingers crossed
>>>>>
>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>> were
>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>> If that
>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>> feature flags paired with ->validate_endpoint() or whatever before
>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC core to
>>>>>>> skip that endpoint on that particular platform without
>>>>>>> interefering with
>>>>>>> everything else.
>>>>>>>
>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>> different
>>>>>>> dwc3 instantiations.
>>>>>>>
>>>>>>>> was the change which ended up upstream for the Intel tracer then we
>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>> will be
>>>>>>> coming up with a solution that's elegant and works for all different
>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>
>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be needing to
>>>>>> focus on the DWC3 implementation.
>>>>>>
>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>> taking a look at, which is a nice way we can handle vendor specific
>>>>>> endpoints and how they can co-exist with other "normal" endpoints.  We
>>>>>> have a few special HW eps as well, which we try to maintain separately
>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or most
>>>>>> pretty method :).
>>>>> Awesome, as mentioned, the endpoint feature flags were added exactly to
>>>>> allow for these vendor-specific features :-)
>>>>>
>>>>> I'm more than happy to help testing now that I finally got our SM8150
>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>
>>>>>>>> However, I'm not sure how the changes would look like in the end,
>>>>>>>> so I
>>>>>>>> would like to wait later down the line to include that :).
>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>> though?
>>>>>>> Did you test $subject with upstream too? Which gadget drivers did you
>>>>>>> use? How did you test
>>>>>>>
>>>>>> The results that I included in the cover page was tested with the pure
>>>>>> upstream kernel on our device.  Below was using the ConfigFS gadget
>>>>>> w/ a
>>>>>> mass storage only composition.
>>>>>>
>>>>>> Test Parameters:
>>>>>>    - Platform: Qualcomm SM8150
>>>>>>    - bMaxBurst = 6
>>>>>>    - USB req size = 256kB
>>>>>>    - Num of USB reqs = 16
>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>
>>>>>>    - USB Speed = Super-Speed
>>>>>>    - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>    - Test Application: CrystalDiskMark
>>>>>>
>>>>>> Results:
>>>>>>
>>>>>> TXFIFO Depth = 3 max packets
>>>>>>
>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>> -------------------------------------------
>>>>>> Sequential|1 GB x     |
>>>>>> Read      |9 loops    | 193.60
>>>>>>             |           | 195.86
>>>>>>             |           | 184.77
>>>>>>             |           | 193.60
>>>>>> -------------------------------------------
>>>>>>
>>>>>> TXFIFO Depth = 6 max packets
>>>>>>
>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>> -------------------------------------------
>>>>>> Sequential|1 GB x     |
>>>>>> Read      |9 loops    | 287.35
>>>>>>           |           | 304.94
>>>>>>             |           | 289.64
>>>>>>             |           | 293.61
>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
>>>>> memory could be failing.
>>>>>
>>>>> Then again, I never ran with CrystalDiskMark, I was using my own tool
>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>
>>>>>> We also have internal numbers which have shown similar improvements as
>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>> IPERF
>>>>>> loopback over TCP/UDP.
>>>>> loopback iperf? That would skip the wire, no?
>>>>>
>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>> beneficial to
>>>>>>>> us, because we can only change the TX threshold to 2 at max, and at
>>>>>>>> least in my observations, once we have to go out to system memory to
>>>>>>>> fetch the next data packet, that latency takes enough time for the
>>>>>>>> controller to end the current burst.
>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>> cost of
>>>>>>> fetching data from memory, with a deeper request queue. Whenever I
>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>> that was
>>>>>>> enough to give me very good performance. Never had to poke at TX FIFO
>>>>>>> resizing. Did you try something like this too?
>>>>>>>
>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>> generic
>>>>>>> method that changing FIFO sizes :)
>>>>>>>
>>>>>> I wish I had a USB bus trace handy to show you, which would make it
>>>>>> very
>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>> 6.  So
>>>>>> by increasing the number of USB requests, that will help if there
>>>>>> was a
>>>>>> bottleneck at the SW level where the application/function driver
>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>> processing them.
>>>>>>
>>>>>> So yes, this method of increasing the # of USB reqs will definitely
>>>>>> help
>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't used.
>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
>>>>>> bursting.
>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>> role
>>>>> here too. I have clear memories of testing this very scenario of
>>>>> bursting (using g_mass_storage at the time) because I was curious about
>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>
>>>>> It could be nice if Heikki could test Intel parts with and without your
>>>>> changes on g_mass_storage with 250 requests.
>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>> right? Can you help out here?
>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>> more test cases (I have only one or two) and maybe can help. But I'll
>>> keep this in mind.
>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>> apply. Switching between host/gadget works, no connections dropping, no
>> errors in dmesg.
>>
>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>> composite device created from configfs with gser / eem / mass_storage /
>> uac2.
>>
>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>
>> Gadget seems to be a little faster with the patches, but that might also
>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>> between 207 and 199.
>>
>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With
>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>
>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>
> Hi Ferry,
>
> Thanks for the testing.  Just to double check, did you also enable the
> property, which enabled the TXFIFO resize feature on the platform?  For
> example, for the QCOM SM8150 platform, we're adding the following to our
> device tree node:
>
> tx-fifo-resize
>
> If not, then your results at least confirms that w/o the property
> present, the changes won't break anything :).  Thanks again for the
> initial testing!

No I didn't. Afaik we don't have a devicetree property to set.

But I'd be happy to test that as well. But where to set the property?

dwc3_pci_mrfld_properties[] in dwc3-pci?

> Thanks
> Wesley Cheng
>
>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>> Packet,
>>>>>> it should have a NumP value equal to the bMaxBurst reported in the EP
>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>> based
>>>>> on bMaxBurst, but it was changed later in commit
>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
>>>>> problem reported by John Youn.
>>>>>
>>>>> And now we've come full circle. Because even if I believe more requests
>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>> squeeze more throughput out of the controller.
>>>>>
>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
>>>>> (emphasis is mine):
>>>>>
>>>>>         "Number of Packets (NumP). This field is used to indicate the
>>>>>         number of Data Packet buffers that the **receiver** can
>>>>>         accept. The value in this field shall be less than or equal to
>>>>>         the maximum burst size supported by the endpoint as determined
>>>>>         by the value in the bMaxBurst field in the Endpoint Companion
>>>>>         Descriptor (refer to Section 9.6.7)."
>>>>>
>>>>> So, NumP is for the receiver, not the transmitter. Could you clarify
>>>>> what you mean here?
>>>>>
>>>>> /me keeps reading
>>>>>
>>>>> Hmm, table 8-15 tries to clarify:
>>>>>
>>>>>         "Number of Packets (NumP).
>>>>>
>>>>>         For an OUT endpoint, refer to Table 8-13 for the description of
>>>>>         this field.
>>>>>
>>>>>         For an IN endpoint this field is set by the endpoint to the
>>>>>         number of packets it can transmit when the host resumes
>>>>>         transactions to it. This field shall not have a value greater
>>>>>         than the maximum burst size supported by the endpoint as
>>>>>         indicated by the value in the bMaxBurst field in the Endpoint
>>>>>         Companion Descriptor. Note that the value reported in this field
>>>>>         may be treated by the host as informative only."
>>>>>
>>>>> However, if I remember correctly (please verify dwc3 databook), NUMP in
>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>> DCFG.NUMP or
>>>>> TxFIFO size?
>>>>>
>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>> seen is
>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>> need
>>>>>> to send an ERDY once data is available within the FIFO, and the same
>>>>>> sequence happens until the USB request is complete.  With this
>>>>>> constant
>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is under
>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>> constant
>>>>>> bursts for a request, until the request is done, or if the host
>>>>>> runs out
>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during USB
>>>>>> request data transfer)
>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>
>>>>>>>>>>> Good points.
>>>>>>>>>>>
>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>> different devices?
>>>>>>>>>>>
>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>> user
>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>> testing :).
>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>> devices
>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>> Bluetooth, anyway
>>>>>>>>> :-)
>>>>>>>>>
>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>> flash dev
>>>>>>>>> images heh.
>>>>>>>>>
>>>>>>>> I used to be a customer facing engineer, so honestly I did see some
>>>>>>>> really interesting and crazy designs.  Again, we do have non-Android
>>>>>>>> products that use the same code, and it has been working in there
>>>>>>>> for a
>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>> multimedia
>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>> end CPU
>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>> missed
>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>> This is good background information. Thanks for bringing this
>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>
>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>> endpoint. If
>>>>>>> our gadget driver uses a low number of requests, we're never really
>>>>>>> using the TRB ring in our benefit.
>>>>>>>
>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>> resizing. :).
>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
>>>>> behavior.
>>>> -- 
>>>> heikki
>>>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-14 19:30                           ` Ferry Toth
@ 2021-06-15  4:22                             ` Wesley Cheng
  2021-06-15 19:53                               ` Ferry Toth
  0 siblings, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-15  4:22 UTC (permalink / raw)
  To: Ferry Toth, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn



On 6/14/2021 12:30 PM, Ferry Toth wrote:
> 
> Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>>
>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>> Hi
>>>
>>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>>> <heikki.krogerus@linux.intel.com> wrote:
>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>>> the build
>>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>>> core with
>>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>>> endpoints with
>>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>>> configured
>>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>>> some Intel
>>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>>> tracing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>>> available
>>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>>> and takes
>>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>>> processor tracing
>>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>>
>>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>>> vendors,
>>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>>> property, so that
>>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>>> disabled.  The
>>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>>> for each EP
>>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations which
>>>>>>>>>>> don't
>>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>>> internal
>>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>>> like we can
>>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>>
>>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>>> looking
>>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>>> likely
>>>>>>>> to break things left and right :)
>>>>>>>>
>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>
>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some
>>>>>>> pretty
>>>>>>> big improvements on the TX path with this.
>>>>>> fingers crossed
>>>>>>
>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>>> were
>>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>>> If that
>>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>>> feature flags paired with ->validate_endpoint() or whatever before
>>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC
>>>>>>>> core to
>>>>>>>> skip that endpoint on that particular platform without
>>>>>>>> interefering with
>>>>>>>> everything else.
>>>>>>>>
>>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>>> different
>>>>>>>> dwc3 instantiations.
>>>>>>>>
>>>>>>>>> was the change which ended up upstream for the Intel tracer
>>>>>>>>> then we
>>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>>> will be
>>>>>>>> coming up with a solution that's elegant and works for all
>>>>>>>> different
>>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>
>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be
>>>>>>> needing to
>>>>>>> focus on the DWC3 implementation.
>>>>>>>
>>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>>> taking a look at, which is a nice way we can handle vendor specific
>>>>>>> endpoints and how they can co-exist with other "normal"
>>>>>>> endpoints.  We
>>>>>>> have a few special HW eps as well, which we try to maintain
>>>>>>> separately
>>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or most
>>>>>>> pretty method :).
>>>>>> Awesome, as mentioned, the endpoint feature flags were added
>>>>>> exactly to
>>>>>> allow for these vendor-specific features :-)
>>>>>>
>>>>>> I'm more than happy to help testing now that I finally got our SM8150
>>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>>
>>>>>>>>> However, I'm not sure how the changes would look like in the end,
>>>>>>>>> so I
>>>>>>>>> would like to wait later down the line to include that :).
>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>>> though?
>>>>>>>> Did you test $subject with upstream too? Which gadget drivers
>>>>>>>> did you
>>>>>>>> use? How did you test
>>>>>>>>
>>>>>>> The results that I included in the cover page was tested with the
>>>>>>> pure
>>>>>>> upstream kernel on our device.  Below was using the ConfigFS gadget
>>>>>>> w/ a
>>>>>>> mass storage only composition.
>>>>>>>
>>>>>>> Test Parameters:
>>>>>>>    - Platform: Qualcomm SM8150
>>>>>>>    - bMaxBurst = 6
>>>>>>>    - USB req size = 256kB
>>>>>>>    - Num of USB reqs = 16
>>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>>
>>>>>>>    - USB Speed = Super-Speed
>>>>>>>    - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>    - Test Application: CrystalDiskMark
>>>>>>>
>>>>>>> Results:
>>>>>>>
>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>
>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>> -------------------------------------------
>>>>>>> Sequential|1 GB x     |
>>>>>>> Read      |9 loops    | 193.60
>>>>>>>             |           | 195.86
>>>>>>>             |           | 184.77
>>>>>>>             |           | 193.60
>>>>>>> -------------------------------------------
>>>>>>>
>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>
>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>> -------------------------------------------
>>>>>>> Sequential|1 GB x     |
>>>>>>> Read      |9 loops    | 287.35
>>>>>>>           |           | 304.94
>>>>>>>             |           | 289.64
>>>>>>>             |           | 293.61
>>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024,
>>>>>> though my
>>>>>> memory could be failing.
>>>>>>
>>>>>> Then again, I never ran with CrystalDiskMark, I was using my own tool
>>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>>
>>>>>>> We also have internal numbers which have shown similar
>>>>>>> improvements as
>>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>>> IPERF
>>>>>>> loopback over TCP/UDP.
>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>
>>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>>> beneficial to
>>>>>>>>> us, because we can only change the TX threshold to 2 at max,
>>>>>>>>> and at
>>>>>>>>> least in my observations, once we have to go out to system
>>>>>>>>> memory to
>>>>>>>>> fetch the next data packet, that latency takes enough time for the
>>>>>>>>> controller to end the current burst.
>>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>>> cost of
>>>>>>>> fetching data from memory, with a deeper request queue. Whenever I
>>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>>> that was
>>>>>>>> enough to give me very good performance. Never had to poke at TX
>>>>>>>> FIFO
>>>>>>>> resizing. Did you try something like this too?
>>>>>>>>
>>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>>> generic
>>>>>>>> method that changing FIFO sizes :)
>>>>>>>>
>>>>>>> I wish I had a USB bus trace handy to show you, which would make it
>>>>>>> very
>>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>>> 6.  So
>>>>>>> by increasing the number of USB requests, that will help if there
>>>>>>> was a
>>>>>>> bottleneck at the SW level where the application/function driver
>>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>>> processing them.
>>>>>>>
>>>>>>> So yes, this method of increasing the # of USB reqs will definitely
>>>>>>> help
>>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't
>>>>>>> used.
>>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
>>>>>>> bursting.
>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>>> role
>>>>>> here too. I have clear memories of testing this very scenario of
>>>>>> bursting (using g_mass_storage at the time) because I was curious
>>>>>> about
>>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>>
>>>>>> It could be nice if Heikki could test Intel parts with and without
>>>>>> your
>>>>>> changes on g_mass_storage with 250 requests.
>>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>>> right? Can you help out here?
>>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>>> more test cases (I have only one or two) and maybe can help. But I'll
>>>> keep this in mind.
>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>>> apply. Switching between host/gadget works, no connections dropping, no
>>> errors in dmesg.
>>>
>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>>> composite device created from configfs with gser / eem / mass_storage /
>>> uac2.
>>>
>>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>>
>>> Gadget seems to be a little faster with the patches, but that might also
>>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>>> between 207 and 199.
>>>
>>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With
>>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>>
>>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>>
>> Hi Ferry,
>>
>> Thanks for the testing.  Just to double check, did you also enable the
>> property, which enabled the TXFIFO resize feature on the platform?  For
>> example, for the QCOM SM8150 platform, we're adding the following to our
>> device tree node:
>>
>> tx-fifo-resize
>>
>> If not, then your results at least confirms that w/o the property
>> present, the changes won't break anything :).  Thanks again for the
>> initial testing!
> 
> No I didn't. Afaik we don't have a devicetree property to set.
> 
> But I'd be happy to test that as well. But where to set the property?
> 
> dwc3_pci_mrfld_properties[] in dwc3-pci?
> 
Hi Ferry,

Not too sure which DWC3 driver is used for the Intel platform, but I
believe that should be the one. (if that's what is normally used)  We'd
just need to add an entry w/ the below:

PROPERTY_ENTRY_BOOL("tx-fifo-resize")

Thanks
Wesley Cheng

>> Thanks
>> Wesley Cheng
>>
>>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>>> Packet,
>>>>>>> it should have a NumP value equal to the bMaxBurst reported in
>>>>>>> the EP
>>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>>> based
>>>>>> on bMaxBurst, but it was changed later in commit
>>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
>>>>>> problem reported by John Youn.
>>>>>>
>>>>>> And now we've come full circle. Because even if I believe more
>>>>>> requests
>>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This
>>>>>> ends
>>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>>> squeeze more throughput out of the controller.
>>>>>>
>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In
>>>>>> fact,
>>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about
>>>>>> NumP
>>>>>> (emphasis is mine):
>>>>>>
>>>>>>         "Number of Packets (NumP). This field is used to indicate the
>>>>>>         number of Data Packet buffers that the **receiver** can
>>>>>>         accept. The value in this field shall be less than or
>>>>>> equal to
>>>>>>         the maximum burst size supported by the endpoint as
>>>>>> determined
>>>>>>         by the value in the bMaxBurst field in the Endpoint Companion
>>>>>>         Descriptor (refer to Section 9.6.7)."
>>>>>>
>>>>>> So, NumP is for the receiver, not the transmitter. Could you clarify
>>>>>> what you mean here?
>>>>>>
>>>>>> /me keeps reading
>>>>>>
>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>
>>>>>>         "Number of Packets (NumP).
>>>>>>
>>>>>>         For an OUT endpoint, refer to Table 8-13 for the
>>>>>> description of
>>>>>>         this field.
>>>>>>
>>>>>>         For an IN endpoint this field is set by the endpoint to the
>>>>>>         number of packets it can transmit when the host resumes
>>>>>>         transactions to it. This field shall not have a value greater
>>>>>>         than the maximum burst size supported by the endpoint as
>>>>>>         indicated by the value in the bMaxBurst field in the Endpoint
>>>>>>         Companion Descriptor. Note that the value reported in this
>>>>>> field
>>>>>>         may be treated by the host as informative only."
>>>>>>
>>>>>> However, if I remember correctly (please verify dwc3 databook),
>>>>>> NUMP in
>>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
>>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>>> DCFG.NUMP or
>>>>>> TxFIFO size?
>>>>>>
>>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>>> seen is
>>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>>> need
>>>>>>> to send an ERDY once data is available within the FIFO, and the same
>>>>>>> sequence happens until the USB request is complete.  With this
>>>>>>> constant
>>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is under
>>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>>> constant
>>>>>>> bursts for a request, until the request is done, or if the host
>>>>>>> runs out
>>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during
>>>>>>> USB
>>>>>>> request data transfer)
>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>
>>>>>>>>>>>> Good points.
>>>>>>>>>>>>
>>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>>> different devices?
>>>>>>>>>>>>
>>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>>> user
>>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>>> testing :).
>>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>>> devices
>>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>>> Bluetooth, anyway
>>>>>>>>>> :-)
>>>>>>>>>>
>>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>>> flash dev
>>>>>>>>>> images heh.
>>>>>>>>>>
>>>>>>>>> I used to be a customer facing engineer, so honestly I did see
>>>>>>>>> some
>>>>>>>>> really interesting and crazy designs.  Again, we do have
>>>>>>>>> non-Android
>>>>>>>>> products that use the same code, and it has been working in there
>>>>>>>>> for a
>>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>>> multimedia
>>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>>> end CPU
>>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>>> missed
>>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>>> This is good background information. Thanks for bringing this
>>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm
>>>>>>>> interested in
>>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>>
>>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>>> endpoint. If
>>>>>>>> our gadget driver uses a low number of requests, we're never really
>>>>>>>> using the TRB ring in our benefit.
>>>>>>>>
>>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>>> resizing. :).
>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX
>>>>>> Burst
>>>>>> behavior.
>>>>> -- 
>>>>> heikki
>>>>

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-15  4:22                             ` Wesley Cheng
@ 2021-06-15 19:53                               ` Ferry Toth
  2021-06-17  4:25                                 ` Wesley Cheng
  0 siblings, 1 reply; 31+ messages in thread
From: Ferry Toth @ 2021-06-15 19:53 UTC (permalink / raw)
  To: Wesley Cheng, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn

Hi

Op 15-06-2021 om 06:22 schreef Wesley Cheng:
>
> On 6/14/2021 12:30 PM, Ferry Toth wrote:
>> Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>>> Hi
>>>>
>>>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>>>> <heikki.krogerus@linux.intel.com> wrote:
>>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>>>> the build
>>>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>>>> core with
>>>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>>>> endpoints with
>>>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>>>> some Intel
>>>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>>>> tracing.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>>>> available
>>>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>>>> and takes
>>>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>>>> processor tracing
>>>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>>>
>>>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>>>> vendors,
>>>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>>>> property, so that
>>>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>>>> disabled.  The
>>>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>>>> for each EP
>>>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations which
>>>>>>>>>>>> don't
>>>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>>>> internal
>>>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>>>> like we can
>>>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>>>
>>>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>>>> looking
>>>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>>>> likely
>>>>>>>>> to break things left and right :)
>>>>>>>>>
>>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>>
>>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some
>>>>>>>> pretty
>>>>>>>> big improvements on the TX path with this.
>>>>>>> fingers crossed
>>>>>>>
>>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>>>> were
>>>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>>>> If that
>>>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>>>> feature flags paired with ->validate_endpoint() or whatever before
>>>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC
>>>>>>>>> core to
>>>>>>>>> skip that endpoint on that particular platform without
>>>>>>>>> interefering with
>>>>>>>>> everything else.
>>>>>>>>>
>>>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>>>> different
>>>>>>>>> dwc3 instantiations.
>>>>>>>>>
>>>>>>>>>> was the change which ended up upstream for the Intel tracer
>>>>>>>>>> then we
>>>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>>>> will be
>>>>>>>>> coming up with a solution that's elegant and works for all
>>>>>>>>> different
>>>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>>
>>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be
>>>>>>>> needing to
>>>>>>>> focus on the DWC3 implementation.
>>>>>>>>
>>>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>>>> taking a look at, which is a nice way we can handle vendor specific
>>>>>>>> endpoints and how they can co-exist with other "normal"
>>>>>>>> endpoints.  We
>>>>>>>> have a few special HW eps as well, which we try to maintain
>>>>>>>> separately
>>>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or most
>>>>>>>> pretty method :).
>>>>>>> Awesome, as mentioned, the endpoint feature flags were added
>>>>>>> exactly to
>>>>>>> allow for these vendor-specific features :-)
>>>>>>>
>>>>>>> I'm more than happy to help testing now that I finally got our SM8150
>>>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>>>
>>>>>>>>>> However, I'm not sure how the changes would look like in the end,
>>>>>>>>>> so I
>>>>>>>>>> would like to wait later down the line to include that :).
>>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>>>> though?
>>>>>>>>> Did you test $subject with upstream too? Which gadget drivers
>>>>>>>>> did you
>>>>>>>>> use? How did you test
>>>>>>>>>
>>>>>>>> The results that I included in the cover page was tested with the
>>>>>>>> pure
>>>>>>>> upstream kernel on our device.  Below was using the ConfigFS gadget
>>>>>>>> w/ a
>>>>>>>> mass storage only composition.
>>>>>>>>
>>>>>>>> Test Parameters:
>>>>>>>>     - Platform: Qualcomm SM8150
>>>>>>>>     - bMaxBurst = 6
>>>>>>>>     - USB req size = 256kB
>>>>>>>>     - Num of USB reqs = 16
>>>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>>>
>>>>>>>>     - USB Speed = Super-Speed
>>>>>>>>     - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>>     - Test Application: CrystalDiskMark
>>>>>>>>
>>>>>>>> Results:
>>>>>>>>
>>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>>
>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>> -------------------------------------------
>>>>>>>> Sequential|1 GB x     |
>>>>>>>> Read      |9 loops    | 193.60
>>>>>>>>              |           | 195.86
>>>>>>>>              |           | 184.77
>>>>>>>>              |           | 193.60
>>>>>>>> -------------------------------------------
>>>>>>>>
>>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>>
>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>> -------------------------------------------
>>>>>>>> Sequential|1 GB x     |
>>>>>>>> Read      |9 loops    | 287.35
>>>>>>>>            |           | 304.94
>>>>>>>>              |           | 289.64
>>>>>>>>              |           | 293.61
>>>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024,
>>>>>>> though my
>>>>>>> memory could be failing.
>>>>>>>
>>>>>>> Then again, I never ran with CrystalDiskMark, I was using my own tool
>>>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>>>
>>>>>>>> We also have internal numbers which have shown similar
>>>>>>>> improvements as
>>>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>>>> IPERF
>>>>>>>> loopback over TCP/UDP.
>>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>>
>>>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>>>> beneficial to
>>>>>>>>>> us, because we can only change the TX threshold to 2 at max,
>>>>>>>>>> and at
>>>>>>>>>> least in my observations, once we have to go out to system
>>>>>>>>>> memory to
>>>>>>>>>> fetch the next data packet, that latency takes enough time for the
>>>>>>>>>> controller to end the current burst.
>>>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>>>> cost of
>>>>>>>>> fetching data from memory, with a deeper request queue. Whenever I
>>>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>>>> that was
>>>>>>>>> enough to give me very good performance. Never had to poke at TX
>>>>>>>>> FIFO
>>>>>>>>> resizing. Did you try something like this too?
>>>>>>>>>
>>>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>>>> generic
>>>>>>>>> method that changing FIFO sizes :)
>>>>>>>>>
>>>>>>>> I wish I had a USB bus trace handy to show you, which would make it
>>>>>>>> very
>>>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>>>> 6.  So
>>>>>>>> by increasing the number of USB requests, that will help if there
>>>>>>>> was a
>>>>>>>> bottleneck at the SW level where the application/function driver
>>>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>>>> processing them.
>>>>>>>>
>>>>>>>> So yes, this method of increasing the # of USB reqs will definitely
>>>>>>>> help
>>>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't
>>>>>>>> used.
>>>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
>>>>>>>> bursting.
>>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>>>> role
>>>>>>> here too. I have clear memories of testing this very scenario of
>>>>>>> bursting (using g_mass_storage at the time) because I was curious
>>>>>>> about
>>>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>>>
>>>>>>> It could be nice if Heikki could test Intel parts with and without
>>>>>>> your
>>>>>>> changes on g_mass_storage with 250 requests.
>>>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>>>> right? Can you help out here?
>>>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>>>> more test cases (I have only one or two) and maybe can help. But I'll
>>>>> keep this in mind.
>>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>>>> apply. Switching between host/gadget works, no connections dropping, no
>>>> errors in dmesg.
>>>>
>>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>>>> composite device created from configfs with gser / eem / mass_storage /
>>>> uac2.
>>>>
>>>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>>>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>>>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>>>
>>>> Gadget seems to be a little faster with the patches, but that might also
>>>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>>>> between 207 and 199.
>>>>
>>>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec. With
>>>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>>>
>>>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>>>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>>>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>>>
>>> Hi Ferry,
>>>
>>> Thanks for the testing.  Just to double check, did you also enable the
>>> property, which enabled the TXFIFO resize feature on the platform?  For
>>> example, for the QCOM SM8150 platform, we're adding the following to our
>>> device tree node:
>>>
>>> tx-fifo-resize
>>>
>>> If not, then your results at least confirms that w/o the property
>>> present, the changes won't break anything :).  Thanks again for the
>>> initial testing!

I applied the patch now to 5.13.0-rc5 + the following:

--- a/drivers/usb/dwc3/dwc3-pci.c
+++ b/drivers/usb/dwc3/dwc3-pci.c
@@ -124,6 +124,7 @@ static const struct property_entry 
dwc3_pci_mrfld_properties[] = {
      PROPERTY_ENTRY_BOOL("snps,dis_u3_susphy_quirk"),
      PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"),
      PROPERTY_ENTRY_BOOL("snps,usb2-gadget-lpm-disable"),
+    PROPERTY_ENTRY_BOOL("tx-fifo-resize"),
      PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"),
      {}
  };

  and when switching to gadget mode unfortunately received the following 
oops:

BUG: unable to handle page fault for address: 00000000202043f2
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 0 PID: 617 Comm: conf-gadget.sh Not tainted 
5.13.0-rc5-edison-acpi-standard #1
Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542 
2015.01.21:18.19.48
RIP: 0010:dwc3_gadget_check_config+0x33/0x80
Code: 59 04 00 00 04 74 61 48 c1 ee 10 48 89 f7 f3 48 0f b8 c7 48 89 c7 
39 81 60 04 00 00 7d 4a 89 81 60 04 00 00 8b 81 08 04 00 00 <81> b8 e8 
03 00 00 32 33 00 00 0f b6 b0 09 04 00 00 75 0d 8b 80 20
RSP: 0018:ffffb5550038fda0 EFLAGS: 00010297
RAX: 000000002020400a RBX: ffffa04502627348 RCX: ffffa04507354028
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000004
RBP: ffffa04508ac0550 R08: ffffa04503a75b2c R09: 0000000000000000
R10: 0000000000000216 R11: 000000000002eba0 R12: ffffa04508ac0550
R13: dead000000000100 R14: ffffa04508ac0600 R15: ffffa04508ac0520
FS:  00007f7471e2f740(0000) GS:ffffa0453e200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000202043f2 CR3: 0000000003f38000 CR4: 00000000001006f0
Call Trace:
  configfs_composite_bind+0x2f4/0x430 [libcomposite]
  udc_bind_to_driver+0x64/0x180
  usb_gadget_probe_driver+0x114/0x150
  gadget_dev_desc_UDC_store+0xbc/0x130 [libcomposite]
  configfs_write_file+0xcd/0x140
  vfs_write+0xbb/0x250
  ksys_write+0x5a/0xd0
  do_syscall_64+0x40/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f7471f1ff53
Code: 8b 15 21 cf 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 
00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
RSP: 002b:00007fffa3dcd328 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7471f1ff53
RDX: 000000000000000c RSI: 00005614d615a770 RDI: 0000000000000001
RBP: 00005614d615a770 R08: 000000000000000a R09: 00007f7471fb20c0
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
R13: 00007f7471fee520 R14: 000000000000000c R15: 00007f7471fee720
Modules linked in: usb_f_uac2 u_audio usb_f_mass_storage usb_f_eem 
u_ether usb_f_serial u_serial libcomposite rfcomm iptable_nat bnep 
snd_sof_nocodec spi_pxa2xx_platform dw_dmac smsc snd_sof_pci_intel_tng 
snd_sof_pci snd_sof_acpi_intel_byt snd_sof_intel_ipc snd_sof_acpi 
smsc95xx snd_sof pwm_lpss_pci pwm_lpss snd_sof_xtensa_dsp 
snd_intel_dspcfg snd_soc_acpi_intel_match snd_soc_acpi dw_dmac_pci 
intel_mrfld_pwrbtn intel_mrfld_adc dw_dmac_core spi_pxa2xx_pci brcmfmac 
brcmutil leds_gpio hci_uart btbcm ti_ads7950 
industrialio_triggered_buffer kfifo_buf ledtrig_timer ledtrig_heartbeat 
mmc_block extcon_intel_mrfld sdhci_pci cqhci sdhci led_class 
intel_soc_pmic_mrfld mmc_core btrfs libcrc32c xor zstd_compress 
zlib_deflate raid6_pq
CR2: 00000000202043f2
---[ end trace 5c11fe50dca92ad4 ]---

>> No I didn't. Afaik we don't have a devicetree property to set.
>>
>> But I'd be happy to test that as well. But where to set the property?
>>
>> dwc3_pci_mrfld_properties[] in dwc3-pci?
>>
> Hi Ferry,
>
> Not too sure which DWC3 driver is used for the Intel platform, but I
> believe that should be the one. (if that's what is normally used)  We'd
> just need to add an entry w/ the below:
>
> PROPERTY_ENTRY_BOOL("tx-fifo-resize")
>
> Thanks
> Wesley Cheng
>
>>> Thanks
>>> Wesley Cheng
>>>
>>>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>>>> Packet,
>>>>>>>> it should have a NumP value equal to the bMaxBurst reported in
>>>>>>>> the EP
>>>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>>>> based
>>>>>>> on bMaxBurst, but it was changed later in commit
>>>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
>>>>>>> problem reported by John Youn.
>>>>>>>
>>>>>>> And now we've come full circle. Because even if I believe more
>>>>>>> requests
>>>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This
>>>>>>> ends
>>>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>>>> squeeze more throughput out of the controller.
>>>>>>>
>>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In
>>>>>>> fact,
>>>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about
>>>>>>> NumP
>>>>>>> (emphasis is mine):
>>>>>>>
>>>>>>>          "Number of Packets (NumP). This field is used to indicate the
>>>>>>>          number of Data Packet buffers that the **receiver** can
>>>>>>>          accept. The value in this field shall be less than or
>>>>>>> equal to
>>>>>>>          the maximum burst size supported by the endpoint as
>>>>>>> determined
>>>>>>>          by the value in the bMaxBurst field in the Endpoint Companion
>>>>>>>          Descriptor (refer to Section 9.6.7)."
>>>>>>>
>>>>>>> So, NumP is for the receiver, not the transmitter. Could you clarify
>>>>>>> what you mean here?
>>>>>>>
>>>>>>> /me keeps reading
>>>>>>>
>>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>>
>>>>>>>          "Number of Packets (NumP).
>>>>>>>
>>>>>>>          For an OUT endpoint, refer to Table 8-13 for the
>>>>>>> description of
>>>>>>>          this field.
>>>>>>>
>>>>>>>          For an IN endpoint this field is set by the endpoint to the
>>>>>>>          number of packets it can transmit when the host resumes
>>>>>>>          transactions to it. This field shall not have a value greater
>>>>>>>          than the maximum burst size supported by the endpoint as
>>>>>>>          indicated by the value in the bMaxBurst field in the Endpoint
>>>>>>>          Companion Descriptor. Note that the value reported in this
>>>>>>> field
>>>>>>>          may be treated by the host as informative only."
>>>>>>>
>>>>>>> However, if I remember correctly (please verify dwc3 databook),
>>>>>>> NUMP in
>>>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
>>>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>>>> DCFG.NUMP or
>>>>>>> TxFIFO size?
>>>>>>>
>>>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>>>> seen is
>>>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>>>> need
>>>>>>>> to send an ERDY once data is available within the FIFO, and the same
>>>>>>>> sequence happens until the USB request is complete.  With this
>>>>>>>> constant
>>>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is under
>>>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>>>> constant
>>>>>>>> bursts for a request, until the request is done, or if the host
>>>>>>>> runs out
>>>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during
>>>>>>>> USB
>>>>>>>> request data transfer)
>>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>>
>>>>>>>>>>>>> Good points.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>>>> different devices?
>>>>>>>>>>>>>
>>>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>>>> user
>>>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>>>> testing :).
>>>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>>>> devices
>>>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>>>> Bluetooth, anyway
>>>>>>>>>>> :-)
>>>>>>>>>>>
>>>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>>>> flash dev
>>>>>>>>>>> images heh.
>>>>>>>>>>>
>>>>>>>>>> I used to be a customer facing engineer, so honestly I did see
>>>>>>>>>> some
>>>>>>>>>> really interesting and crazy designs.  Again, we do have
>>>>>>>>>> non-Android
>>>>>>>>>> products that use the same code, and it has been working in there
>>>>>>>>>> for a
>>>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>>>> multimedia
>>>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>>>> end CPU
>>>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>>>> missed
>>>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>>>> This is good background information. Thanks for bringing this
>>>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm
>>>>>>>>> interested in
>>>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>>>
>>>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>>>> endpoint. If
>>>>>>>>> our gadget driver uses a low number of requests, we're never really
>>>>>>>>> using the TRB ring in our benefit.
>>>>>>>>>
>>>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>>>> resizing. :).
>>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX
>>>>>>> Burst
>>>>>>> behavior.
>>>>>> -- 
>>>>>> heikki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-15 19:53                               ` Ferry Toth
@ 2021-06-17  4:25                                 ` Wesley Cheng
       [not found]                                   ` <fe834dbf-786a-2996-5c4b-1eac92e3ed18@gmail.com>
  0 siblings, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-17  4:25 UTC (permalink / raw)
  To: Ferry Toth, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn



On 6/15/2021 12:53 PM, Ferry Toth wrote:
> Hi
> 
> Op 15-06-2021 om 06:22 schreef Wesley Cheng:
>>
>> On 6/14/2021 12:30 PM, Ferry Toth wrote:
>>> Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>>>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>>>> Hi
>>>>>
>>>>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>>>>> <heikki.krogerus@linux.intel.com> wrote:
>>>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>>>>> the build
>>>>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>>>>> core with
>>>>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>>>>> endpoints with
>>>>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>>>>> some Intel
>>>>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>>>>> tracing.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>>>>> and takes
>>>>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>>>>> processor tracing
>>>>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>>>>> vendors,
>>>>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>>>>> property, so that
>>>>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>>>>> disabled.  The
>>>>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>>>>> for each EP
>>>>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations
>>>>>>>>>>>>> which
>>>>>>>>>>>>> don't
>>>>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>>>>> internal
>>>>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>>>>> like we can
>>>>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>>>>
>>>>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>>>>> looking
>>>>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>>>>> likely
>>>>>>>>>> to break things left and right :)
>>>>>>>>>>
>>>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>>>
>>>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some
>>>>>>>>> pretty
>>>>>>>>> big improvements on the TX path with this.
>>>>>>>> fingers crossed
>>>>>>>>
>>>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>>>>> were
>>>>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>>>>> If that
>>>>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>>>>> feature flags paired with ->validate_endpoint() or whatever
>>>>>>>>>> before
>>>>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC
>>>>>>>>>> core to
>>>>>>>>>> skip that endpoint on that particular platform without
>>>>>>>>>> interefering with
>>>>>>>>>> everything else.
>>>>>>>>>>
>>>>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>>>>> different
>>>>>>>>>> dwc3 instantiations.
>>>>>>>>>>
>>>>>>>>>>> was the change which ended up upstream for the Intel tracer
>>>>>>>>>>> then we
>>>>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>>>>> will be
>>>>>>>>>> coming up with a solution that's elegant and works for all
>>>>>>>>>> different
>>>>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>>>
>>>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be
>>>>>>>>> needing to
>>>>>>>>> focus on the DWC3 implementation.
>>>>>>>>>
>>>>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>>>>> taking a look at, which is a nice way we can handle vendor
>>>>>>>>> specific
>>>>>>>>> endpoints and how they can co-exist with other "normal"
>>>>>>>>> endpoints.  We
>>>>>>>>> have a few special HW eps as well, which we try to maintain
>>>>>>>>> separately
>>>>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or
>>>>>>>>> most
>>>>>>>>> pretty method :).
>>>>>>>> Awesome, as mentioned, the endpoint feature flags were added
>>>>>>>> exactly to
>>>>>>>> allow for these vendor-specific features :-)
>>>>>>>>
>>>>>>>> I'm more than happy to help testing now that I finally got our
>>>>>>>> SM8150
>>>>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>>>>
>>>>>>>>>>> However, I'm not sure how the changes would look like in the
>>>>>>>>>>> end,
>>>>>>>>>>> so I
>>>>>>>>>>> would like to wait later down the line to include that :).
>>>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>>>>> though?
>>>>>>>>>> Did you test $subject with upstream too? Which gadget drivers
>>>>>>>>>> did you
>>>>>>>>>> use? How did you test
>>>>>>>>>>
>>>>>>>>> The results that I included in the cover page was tested with the
>>>>>>>>> pure
>>>>>>>>> upstream kernel on our device.  Below was using the ConfigFS
>>>>>>>>> gadget
>>>>>>>>> w/ a
>>>>>>>>> mass storage only composition.
>>>>>>>>>
>>>>>>>>> Test Parameters:
>>>>>>>>>     - Platform: Qualcomm SM8150
>>>>>>>>>     - bMaxBurst = 6
>>>>>>>>>     - USB req size = 256kB
>>>>>>>>>     - Num of USB reqs = 16
>>>>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>>>>
>>>>>>>>>     - USB Speed = Super-Speed
>>>>>>>>>     - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>>>     - Test Application: CrystalDiskMark
>>>>>>>>>
>>>>>>>>> Results:
>>>>>>>>>
>>>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>>>
>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>> -------------------------------------------
>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>> Read      |9 loops    | 193.60
>>>>>>>>>              |           | 195.86
>>>>>>>>>              |           | 184.77
>>>>>>>>>              |           | 193.60
>>>>>>>>> -------------------------------------------
>>>>>>>>>
>>>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>>>
>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>> -------------------------------------------
>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>> Read      |9 loops    | 287.35
>>>>>>>>>            |           | 304.94
>>>>>>>>>              |           | 289.64
>>>>>>>>>              |           | 293.61
>>>>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024,
>>>>>>>> though my
>>>>>>>> memory could be failing.
>>>>>>>>
>>>>>>>> Then again, I never ran with CrystalDiskMark, I was using my own
>>>>>>>> tool
>>>>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>>>>
>>>>>>>>> We also have internal numbers which have shown similar
>>>>>>>>> improvements as
>>>>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>>>>> IPERF
>>>>>>>>> loopback over TCP/UDP.
>>>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>>>
>>>>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>>>>> beneficial to
>>>>>>>>>>> us, because we can only change the TX threshold to 2 at max,
>>>>>>>>>>> and at
>>>>>>>>>>> least in my observations, once we have to go out to system
>>>>>>>>>>> memory to
>>>>>>>>>>> fetch the next data packet, that latency takes enough time
>>>>>>>>>>> for the
>>>>>>>>>>> controller to end the current burst.
>>>>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>>>>> cost of
>>>>>>>>>> fetching data from memory, with a deeper request queue.
>>>>>>>>>> Whenever I
>>>>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>>>>> that was
>>>>>>>>>> enough to give me very good performance. Never had to poke at TX
>>>>>>>>>> FIFO
>>>>>>>>>> resizing. Did you try something like this too?
>>>>>>>>>>
>>>>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>>>>> generic
>>>>>>>>>> method that changing FIFO sizes :)
>>>>>>>>>>
>>>>>>>>> I wish I had a USB bus trace handy to show you, which would
>>>>>>>>> make it
>>>>>>>>> very
>>>>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>>>>> 6.  So
>>>>>>>>> by increasing the number of USB requests, that will help if there
>>>>>>>>> was a
>>>>>>>>> bottleneck at the SW level where the application/function driver
>>>>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>>>>> processing them.
>>>>>>>>>
>>>>>>>>> So yes, this method of increasing the # of USB reqs will
>>>>>>>>> definitely
>>>>>>>>> help
>>>>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't
>>>>>>>>> used.
>>>>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes
>>>>>>>>> endpoint
>>>>>>>>> bursting.
>>>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>>>>> role
>>>>>>>> here too. I have clear memories of testing this very scenario of
>>>>>>>> bursting (using g_mass_storage at the time) because I was curious
>>>>>>>> about
>>>>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>>>>
>>>>>>>> It could be nice if Heikki could test Intel parts with and without
>>>>>>>> your
>>>>>>>> changes on g_mass_storage with 250 requests.
>>>>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>>>>> right? Can you help out here?
>>>>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>>>>> more test cases (I have only one or two) and maybe can help. But I'll
>>>>>> keep this in mind.
>>>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>>>>> apply. Switching between host/gadget works, no connections
>>>>> dropping, no
>>>>> errors in dmesg.
>>>>>
>>>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>>>>> composite device created from configfs with gser / eem /
>>>>> mass_storage /
>>>>> uac2.
>>>>>
>>>>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>>>>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>>>>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>>>>
>>>>> Gadget seems to be a little faster with the patches, but that might
>>>>> also
>>>>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>>>>> between 207 and 199.
>>>>>
>>>>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec.
>>>>> With
>>>>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>>>>
>>>>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>>>>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>>>>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>>>>
>>>> Hi Ferry,
>>>>
>>>> Thanks for the testing.  Just to double check, did you also enable the
>>>> property, which enabled the TXFIFO resize feature on the platform?  For
>>>> example, for the QCOM SM8150 platform, we're adding the following to
>>>> our
>>>> device tree node:
>>>>
>>>> tx-fifo-resize
>>>>
>>>> If not, then your results at least confirms that w/o the property
>>>> present, the changes won't break anything :).  Thanks again for the
>>>> initial testing!
> 
> I applied the patch now to 5.13.0-rc5 + the following:
> 

Hi Ferry,

Quick question...there was a compile error with the V9 patch series, as
it was using the dwc3_mwidth() incorrectly.  I will update this with the
proper use of the mdwidth, but which patch version did you use?

Thanks
Wesley Cheng

> --- a/drivers/usb/dwc3/dwc3-pci.c
> +++ b/drivers/usb/dwc3/dwc3-pci.c
> @@ -124,6 +124,7 @@ static const struct property_entry
> dwc3_pci_mrfld_properties[] = {
>      PROPERTY_ENTRY_BOOL("snps,dis_u3_susphy_quirk"),
>      PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"),
>      PROPERTY_ENTRY_BOOL("snps,usb2-gadget-lpm-disable"),
> +    PROPERTY_ENTRY_BOOL("tx-fifo-resize"),
>      PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"),
>      {}
>  };
> 
>  and when switching to gadget mode unfortunately received the following
> oops:
> 
> BUG: unable to handle page fault for address: 00000000202043f2
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] SMP PTI
> CPU: 0 PID: 617 Comm: conf-gadget.sh Not tainted
> 5.13.0-rc5-edison-acpi-standard #1
> Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542
> 2015.01.21:18.19.48
> RIP: 0010:dwc3_gadget_check_config+0x33/0x80
> Code: 59 04 00 00 04 74 61 48 c1 ee 10 48 89 f7 f3 48 0f b8 c7 48 89 c7
> 39 81 60 04 00 00 7d 4a 89 81 60 04 00 00 8b 81 08 04 00 00 <81> b8 e8
> 03 00 00 32 33 00 00 0f b6 b0 09 04 00 00 75 0d 8b 80 20
> RSP: 0018:ffffb5550038fda0 EFLAGS: 00010297
> RAX: 000000002020400a RBX: ffffa04502627348 RCX: ffffa04507354028
> RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000004
> RBP: ffffa04508ac0550 R08: ffffa04503a75b2c R09: 0000000000000000
> R10: 0000000000000216 R11: 000000000002eba0 R12: ffffa04508ac0550
> R13: dead000000000100 R14: ffffa04508ac0600 R15: ffffa04508ac0520
> FS:  00007f7471e2f740(0000) GS:ffffa0453e200000(0000)
> knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000202043f2 CR3: 0000000003f38000 CR4: 00000000001006f0
> Call Trace:
>  configfs_composite_bind+0x2f4/0x430 [libcomposite]
>  udc_bind_to_driver+0x64/0x180
>  usb_gadget_probe_driver+0x114/0x150
>  gadget_dev_desc_UDC_store+0xbc/0x130 [libcomposite]
>  configfs_write_file+0xcd/0x140
>  vfs_write+0xbb/0x250
>  ksys_write+0x5a/0xd0
>  do_syscall_64+0x40/0x80
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
> RIP: 0033:0x7f7471f1ff53
> Code: 8b 15 21 cf 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f
> 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
> RSP: 002b:00007fffa3dcd328 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7471f1ff53
> RDX: 000000000000000c RSI: 00005614d615a770 RDI: 0000000000000001
> RBP: 00005614d615a770 R08: 000000000000000a R09: 00007f7471fb20c0
> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
> R13: 00007f7471fee520 R14: 000000000000000c R15: 00007f7471fee720
> Modules linked in: usb_f_uac2 u_audio usb_f_mass_storage usb_f_eem
> u_ether usb_f_serial u_serial libcomposite rfcomm iptable_nat bnep
> snd_sof_nocodec spi_pxa2xx_platform dw_dmac smsc snd_sof_pci_intel_tng
> snd_sof_pci snd_sof_acpi_intel_byt snd_sof_intel_ipc snd_sof_acpi
> smsc95xx snd_sof pwm_lpss_pci pwm_lpss snd_sof_xtensa_dsp
> snd_intel_dspcfg snd_soc_acpi_intel_match snd_soc_acpi dw_dmac_pci
> intel_mrfld_pwrbtn intel_mrfld_adc dw_dmac_core spi_pxa2xx_pci brcmfmac
> brcmutil leds_gpio hci_uart btbcm ti_ads7950
> industrialio_triggered_buffer kfifo_buf ledtrig_timer ledtrig_heartbeat
> mmc_block extcon_intel_mrfld sdhci_pci cqhci sdhci led_class
> intel_soc_pmic_mrfld mmc_core btrfs libcrc32c xor zstd_compress
> zlib_deflate raid6_pq
> CR2: 00000000202043f2
> ---[ end trace 5c11fe50dca92ad4 ]---
> 
>>> No I didn't. Afaik we don't have a devicetree property to set.
>>>
>>> But I'd be happy to test that as well. But where to set the property?
>>>
>>> dwc3_pci_mrfld_properties[] in dwc3-pci?
>>>
>> Hi Ferry,
>>
>> Not too sure which DWC3 driver is used for the Intel platform, but I
>> believe that should be the one. (if that's what is normally used)  We'd
>> just need to add an entry w/ the below:
>>
>> PROPERTY_ENTRY_BOOL("tx-fifo-resize")
>>
>> Thanks
>> Wesley Cheng
>>
>>>> Thanks
>>>> Wesley Cheng
>>>>
>>>>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>>>>> Packet,
>>>>>>>>> it should have a NumP value equal to the bMaxBurst reported in
>>>>>>>>> the EP
>>>>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>>>>> based
>>>>>>>> on bMaxBurst, but it was changed later in commit
>>>>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because
>>>>>>>> of a
>>>>>>>> problem reported by John Youn.
>>>>>>>>
>>>>>>>> And now we've come full circle. Because even if I believe more
>>>>>>>> requests
>>>>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This
>>>>>>>> ends
>>>>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>>>>> squeeze more throughput out of the controller.
>>>>>>>>
>>>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In
>>>>>>>> fact,
>>>>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about
>>>>>>>> NumP
>>>>>>>> (emphasis is mine):
>>>>>>>>
>>>>>>>>          "Number of Packets (NumP). This field is used to
>>>>>>>> indicate the
>>>>>>>>          number of Data Packet buffers that the **receiver** can
>>>>>>>>          accept. The value in this field shall be less than or
>>>>>>>> equal to
>>>>>>>>          the maximum burst size supported by the endpoint as
>>>>>>>> determined
>>>>>>>>          by the value in the bMaxBurst field in the Endpoint
>>>>>>>> Companion
>>>>>>>>          Descriptor (refer to Section 9.6.7)."
>>>>>>>>
>>>>>>>> So, NumP is for the receiver, not the transmitter. Could you
>>>>>>>> clarify
>>>>>>>> what you mean here?
>>>>>>>>
>>>>>>>> /me keeps reading
>>>>>>>>
>>>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>>>
>>>>>>>>          "Number of Packets (NumP).
>>>>>>>>
>>>>>>>>          For an OUT endpoint, refer to Table 8-13 for the
>>>>>>>> description of
>>>>>>>>          this field.
>>>>>>>>
>>>>>>>>          For an IN endpoint this field is set by the endpoint to
>>>>>>>> the
>>>>>>>>          number of packets it can transmit when the host resumes
>>>>>>>>          transactions to it. This field shall not have a value
>>>>>>>> greater
>>>>>>>>          than the maximum burst size supported by the endpoint as
>>>>>>>>          indicated by the value in the bMaxBurst field in the
>>>>>>>> Endpoint
>>>>>>>>          Companion Descriptor. Note that the value reported in this
>>>>>>>> field
>>>>>>>>          may be treated by the host as informative only."
>>>>>>>>
>>>>>>>> However, if I remember correctly (please verify dwc3 databook),
>>>>>>>> NUMP in
>>>>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3
>>>>>>>> compute
>>>>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>>>>> DCFG.NUMP or
>>>>>>>> TxFIFO size?
>>>>>>>>
>>>>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>>>>> seen is
>>>>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>>>>> need
>>>>>>>>> to send an ERDY once data is available within the FIFO, and the
>>>>>>>>> same
>>>>>>>>> sequence happens until the USB request is complete.  With this
>>>>>>>>> constant
>>>>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is
>>>>>>>>> under
>>>>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>>>>> constant
>>>>>>>>> bursts for a request, until the request is done, or if the host
>>>>>>>>> runs out
>>>>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during
>>>>>>>>> USB
>>>>>>>>> request data transfer)
>>>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>>>
>>>>>>>>>>>>>> Good points.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>>>>> different devices?
>>>>>>>>>>>>>>
>>>>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>>>>> user
>>>>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>>>>> testing :).
>>>>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>>>>> devices
>>>>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>>>>> Bluetooth, anyway
>>>>>>>>>>>> :-)
>>>>>>>>>>>>
>>>>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>>>>> flash dev
>>>>>>>>>>>> images heh.
>>>>>>>>>>>>
>>>>>>>>>>> I used to be a customer facing engineer, so honestly I did see
>>>>>>>>>>> some
>>>>>>>>>>> really interesting and crazy designs.  Again, we do have
>>>>>>>>>>> non-Android
>>>>>>>>>>> products that use the same code, and it has been working in
>>>>>>>>>>> there
>>>>>>>>>>> for a
>>>>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>>>>> multimedia
>>>>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>>>>> end CPU
>>>>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>>>>> missed
>>>>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>>>>> This is good background information. Thanks for bringing this
>>>>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm
>>>>>>>>>> interested in
>>>>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>>>>
>>>>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>>>>> endpoint. If
>>>>>>>>>> our gadget driver uses a low number of requests, we're never
>>>>>>>>>> really
>>>>>>>>>> using the TRB ring in our benefit.
>>>>>>>>>>
>>>>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>>>>> resizing. :).
>>>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX
>>>>>>>> Burst
>>>>>>>> behavior.
>>>>>>> -- 
>>>>>>> heikki

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
       [not found]                                   ` <fe834dbf-786a-2996-5c4b-1eac92e3ed18@gmail.com>
@ 2021-06-17  8:30                                     ` Wesley Cheng
  2021-06-17 19:54                                       ` Ferry Toth
  0 siblings, 1 reply; 31+ messages in thread
From: Wesley Cheng @ 2021-06-17  8:30 UTC (permalink / raw)
  To: Ferry Toth, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn



On 6/17/2021 12:47 AM, Ferry Toth wrote:
> Hi
> 
> Op 17-06-2021 om 06:25 schreef Wesley Cheng:
>> On 6/15/2021 12:53 PM, Ferry Toth wrote:
>>> Hi
>>>
>>> Op 15-06-2021 om 06:22 schreef Wesley Cheng:
>>>> On 6/14/2021 12:30 PM, Ferry Toth wrote:
>>>>> Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>>>>>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>>>>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>>>>>>> <heikki.krogerus@linux.intel.com> wrote:
>>>>>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>>>>>>> the build
>>>>>>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>>>>>>> core with
>>>>>>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>>>>>>> endpoints with
>>>>>>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>>>>>>> some Intel
>>>>>>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>>>>>>> tracing.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>>>>>>> and takes
>>>>>>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>>>>>>> processor tracing
>>>>>>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>>>>>>> vendors,
>>>>>>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>>>>>>> property, so that
>>>>>>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>>>>>>> disabled.  The
>>>>>>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>>>>>>> for each EP
>>>>>>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>>>>>>> internal
>>>>>>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>>>>>>> like we can
>>>>>>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>>>>>>> looking
>>>>>>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>>>>>>> likely
>>>>>>>>>>>> to break things left and right :)
>>>>>>>>>>>>
>>>>>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>>>>>
>>>>>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some
>>>>>>>>>>> pretty
>>>>>>>>>>> big improvements on the TX path with this.
>>>>>>>>>> fingers crossed
>>>>>>>>>>
>>>>>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>>>>>>> were
>>>>>>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>>>>>>> If that
>>>>>>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>>>>>>> feature flags paired with ->validate_endpoint() or whatever
>>>>>>>>>>>> before
>>>>>>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC
>>>>>>>>>>>> core to
>>>>>>>>>>>> skip that endpoint on that particular platform without
>>>>>>>>>>>> interefering with
>>>>>>>>>>>> everything else.
>>>>>>>>>>>>
>>>>>>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>>>>>>> different
>>>>>>>>>>>> dwc3 instantiations.
>>>>>>>>>>>>
>>>>>>>>>>>>> was the change which ended up upstream for the Intel tracer
>>>>>>>>>>>>> then we
>>>>>>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>>>>>>> will be
>>>>>>>>>>>> coming up with a solution that's elegant and works for all
>>>>>>>>>>>> different
>>>>>>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>>>>>
>>>>>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be
>>>>>>>>>>> needing to
>>>>>>>>>>> focus on the DWC3 implementation.
>>>>>>>>>>>
>>>>>>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>>>>>>> taking a look at, which is a nice way we can handle vendor
>>>>>>>>>>> specific
>>>>>>>>>>> endpoints and how they can co-exist with other "normal"
>>>>>>>>>>> endpoints.  We
>>>>>>>>>>> have a few special HW eps as well, which we try to maintain
>>>>>>>>>>> separately
>>>>>>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or
>>>>>>>>>>> most
>>>>>>>>>>> pretty method :).
>>>>>>>>>> Awesome, as mentioned, the endpoint feature flags were added
>>>>>>>>>> exactly to
>>>>>>>>>> allow for these vendor-specific features :-)
>>>>>>>>>>
>>>>>>>>>> I'm more than happy to help testing now that I finally got our
>>>>>>>>>> SM8150
>>>>>>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>>>>>>
>>>>>>>>>>>>> However, I'm not sure how the changes would look like in the
>>>>>>>>>>>>> end,
>>>>>>>>>>>>> so I
>>>>>>>>>>>>> would like to wait later down the line to include that :).
>>>>>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>>>>>>> though?
>>>>>>>>>>>> Did you test $subject with upstream too? Which gadget drivers
>>>>>>>>>>>> did you
>>>>>>>>>>>> use? How did you test
>>>>>>>>>>>>
>>>>>>>>>>> The results that I included in the cover page was tested with the
>>>>>>>>>>> pure
>>>>>>>>>>> upstream kernel on our device.  Below was using the ConfigFS
>>>>>>>>>>> gadget
>>>>>>>>>>> w/ a
>>>>>>>>>>> mass storage only composition.
>>>>>>>>>>>
>>>>>>>>>>> Test Parameters:
>>>>>>>>>>>     - Platform: Qualcomm SM8150
>>>>>>>>>>>     - bMaxBurst = 6
>>>>>>>>>>>     - USB req size = 256kB
>>>>>>>>>>>     - Num of USB reqs = 16
>>>>>>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>>>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>>>>>>
>>>>>>>>>>>     - USB Speed = Super-Speed
>>>>>>>>>>>     - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>>>>>     - Test Application: CrystalDiskMark
>>>>>>>>>>>
>>>>>>>>>>> Results:
>>>>>>>>>>>
>>>>>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>>>>>
>>>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>>>> Read      |9 loops    | 193.60
>>>>>>>>>>>              |           | 195.86
>>>>>>>>>>>              |           | 184.77
>>>>>>>>>>>              |           | 193.60
>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>>>>>
>>>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>>>> Read      |9 loops    | 287.35
>>>>>>>>>>>            |           | 304.94
>>>>>>>>>>>              |           | 289.64
>>>>>>>>>>>              |           | 293.61
>>>>>>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>>>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024,
>>>>>>>>>> though my
>>>>>>>>>> memory could be failing.
>>>>>>>>>>
>>>>>>>>>> Then again, I never ran with CrystalDiskMark, I was using my own
>>>>>>>>>> tool
>>>>>>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>>>>>>
>>>>>>>>>>> We also have internal numbers which have shown similar
>>>>>>>>>>> improvements as
>>>>>>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>>>>>>> IPERF
>>>>>>>>>>> loopback over TCP/UDP.
>>>>>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>>>>>
>>>>>>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>>>>>>> beneficial to
>>>>>>>>>>>>> us, because we can only change the TX threshold to 2 at max,
>>>>>>>>>>>>> and at
>>>>>>>>>>>>> least in my observations, once we have to go out to system
>>>>>>>>>>>>> memory to
>>>>>>>>>>>>> fetch the next data packet, that latency takes enough time
>>>>>>>>>>>>> for the
>>>>>>>>>>>>> controller to end the current burst.
>>>>>>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>>>>>>> cost of
>>>>>>>>>>>> fetching data from memory, with a deeper request queue.
>>>>>>>>>>>> Whenever I
>>>>>>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>>>>>>> that was
>>>>>>>>>>>> enough to give me very good performance. Never had to poke at TX
>>>>>>>>>>>> FIFO
>>>>>>>>>>>> resizing. Did you try something like this too?
>>>>>>>>>>>>
>>>>>>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>>>>>>> generic
>>>>>>>>>>>> method that changing FIFO sizes :)
>>>>>>>>>>>>
>>>>>>>>>>> I wish I had a USB bus trace handy to show you, which would
>>>>>>>>>>> make it
>>>>>>>>>>> very
>>>>>>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>>>>>>> 6.  So
>>>>>>>>>>> by increasing the number of USB requests, that will help if there
>>>>>>>>>>> was a
>>>>>>>>>>> bottleneck at the SW level where the application/function driver
>>>>>>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>>>>>>> processing them.
>>>>>>>>>>>
>>>>>>>>>>> So yes, this method of increasing the # of USB reqs will
>>>>>>>>>>> definitely
>>>>>>>>>>> help
>>>>>>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't
>>>>>>>>>>> used.
>>>>>>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes
>>>>>>>>>>> endpoint
>>>>>>>>>>> bursting.
>>>>>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>>>>>>> role
>>>>>>>>>> here too. I have clear memories of testing this very scenario of
>>>>>>>>>> bursting (using g_mass_storage at the time) because I was curious
>>>>>>>>>> about
>>>>>>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>>>>>>
>>>>>>>>>> It could be nice if Heikki could test Intel parts with and without
>>>>>>>>>> your
>>>>>>>>>> changes on g_mass_storage with 250 requests.
>>>>>>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>>>>>>> right? Can you help out here?
>>>>>>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>>>>>>> more test cases (I have only one or two) and maybe can help. But I'll
>>>>>>>> keep this in mind.
>>>>>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>>>>>>> apply. Switching between host/gadget works, no connections
>>>>>>> dropping, no
>>>>>>> errors in dmesg.
>>>>>>>
>>>>>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>>>>>>> composite device created from configfs with gser / eem /
>>>>>>> mass_storage /
>>>>>>> uac2.
>>>>>>>
>>>>>>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>>>>>>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>>>>>>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>>>>>>
>>>>>>> Gadget seems to be a little faster with the patches, but that might
>>>>>>> also
>>>>>>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>>>>>>> between 207 and 199.
>>>>>>>
>>>>>>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec.
>>>>>>> With
>>>>>>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>>>>>>
>>>>>>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>>>>>>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>>>>>>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>>>>>>
>>>>>> Hi Ferry,
>>>>>>
>>>>>> Thanks for the testing.  Just to double check, did you also enable the
>>>>>> property, which enabled the TXFIFO resize feature on the platform?  For
>>>>>> example, for the QCOM SM8150 platform, we're adding the following to
>>>>>> our
>>>>>> device tree node:
>>>>>>
>>>>>> tx-fifo-resize
>>>>>>
>>>>>> If not, then your results at least confirms that w/o the property
>>>>>> present, the changes won't break anything :).  Thanks again for the
>>>>>> initial testing!
>>> I applied the patch now to 5.13.0-rc5 + the following:
>>>
>> Hi Ferry,
>>
>> Quick question...there was a compile error with the V9 patch series, as
>> it was using the dwc3_mwidth() incorrectly.  I will update this with the
>> proper use of the mdwidth, but which patch version did you use?

Hi Ferry,

> The V9 set gets applied to 5.13.0-rc5 by Yocto, if it doesn't apply it
> stops the build. I didn't notice any compile errors, they stop the whole
> build process too. But warnings are ignored. I'll check the logs to be sure.

Ah, ok.  I think the incorrect usage of the API will result as a warning
as seen in Greg's compile log:

drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of
‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
  653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);

This is probably why the page fault occurs, as we're not passing in the
DWC3 struct.  I will send out a V10 shortly after testing it on my device.

Thanks
Wesley Cheng

>> Thanks
>> Wesley Cheng
>>
>>> --- a/drivers/usb/dwc3/dwc3-pci.c
>>> +++ b/drivers/usb/dwc3/dwc3-pci.c
>>> @@ -124,6 +124,7 @@ static const struct property_entry
>>> dwc3_pci_mrfld_properties[] = {
>>>      PROPERTY_ENTRY_BOOL("snps,dis_u3_susphy_quirk"),
>>>      PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"),
>>>      PROPERTY_ENTRY_BOOL("snps,usb2-gadget-lpm-disable"),
>>> +    PROPERTY_ENTRY_BOOL("tx-fifo-resize"),
>>>      PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"),
>>>      {}
>>>  };
>>>
>>>  and when switching to gadget mode unfortunately received the following
>>> oops:
>>>
>>> BUG: unable to handle page fault for address: 00000000202043f2
>>> #PF: supervisor read access in kernel mode
>>> #PF: error_code(0x0000) - not-present page
>>> PGD 0 P4D 0
>>> Oops: 0000 [#1] SMP PTI
>>> CPU: 0 PID: 617 Comm: conf-gadget.sh Not tainted
>>> 5.13.0-rc5-edison-acpi-standard #1
>>> Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542
>>> 2015.01.21:18.19.48
>>> RIP: 0010:dwc3_gadget_check_config+0x33/0x80
>>> Code: 59 04 00 00 04 74 61 48 c1 ee 10 48 89 f7 f3 48 0f b8 c7 48 89 c7
>>> 39 81 60 04 00 00 7d 4a 89 81 60 04 00 00 8b 81 08 04 00 00 <81> b8 e8
>>> 03 00 00 32 33 00 00 0f b6 b0 09 04 00 00 75 0d 8b 80 20
>>> RSP: 0018:ffffb5550038fda0 EFLAGS: 00010297
>>> RAX: 000000002020400a RBX: ffffa04502627348 RCX: ffffa04507354028
>>> RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000004
>>> RBP: ffffa04508ac0550 R08: ffffa04503a75b2c R09: 0000000000000000
>>> R10: 0000000000000216 R11: 000000000002eba0 R12: ffffa04508ac0550
>>> R13: dead000000000100 R14: ffffa04508ac0600 R15: ffffa04508ac0520
>>> FS:  00007f7471e2f740(0000) GS:ffffa0453e200000(0000)
>>> knlGS:0000000000000000
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> CR2: 00000000202043f2 CR3: 0000000003f38000 CR4: 00000000001006f0
>>> Call Trace:
>>>  configfs_composite_bind+0x2f4/0x430 [libcomposite]
>>>  udc_bind_to_driver+0x64/0x180
>>>  usb_gadget_probe_driver+0x114/0x150
>>>  gadget_dev_desc_UDC_store+0xbc/0x130 [libcomposite]
>>>  configfs_write_file+0xcd/0x140
>>>  vfs_write+0xbb/0x250
>>>  ksys_write+0x5a/0xd0
>>>  do_syscall_64+0x40/0x80
>>>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> RIP: 0033:0x7f7471f1ff53
>>> Code: 8b 15 21 cf 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f
>>> 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00
>>> f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
>>> RSP: 002b:00007fffa3dcd328 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7471f1ff53
>>> RDX: 000000000000000c RSI: 00005614d615a770 RDI: 0000000000000001
>>> RBP: 00005614d615a770 R08: 000000000000000a R09: 00007f7471fb20c0
>>> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>>> R13: 00007f7471fee520 R14: 000000000000000c R15: 00007f7471fee720
>>> Modules linked in: usb_f_uac2 u_audio usb_f_mass_storage usb_f_eem
>>> u_ether usb_f_serial u_serial libcomposite rfcomm iptable_nat bnep
>>> snd_sof_nocodec spi_pxa2xx_platform dw_dmac smsc snd_sof_pci_intel_tng
>>> snd_sof_pci snd_sof_acpi_intel_byt snd_sof_intel_ipc snd_sof_acpi
>>> smsc95xx snd_sof pwm_lpss_pci pwm_lpss snd_sof_xtensa_dsp
>>> snd_intel_dspcfg snd_soc_acpi_intel_match snd_soc_acpi dw_dmac_pci
>>> intel_mrfld_pwrbtn intel_mrfld_adc dw_dmac_core spi_pxa2xx_pci brcmfmac
>>> brcmutil leds_gpio hci_uart btbcm ti_ads7950
>>> industrialio_triggered_buffer kfifo_buf ledtrig_timer ledtrig_heartbeat
>>> mmc_block extcon_intel_mrfld sdhci_pci cqhci sdhci led_class
>>> intel_soc_pmic_mrfld mmc_core btrfs libcrc32c xor zstd_compress
>>> zlib_deflate raid6_pq
>>> CR2: 00000000202043f2
>>> ---[ end trace 5c11fe50dca92ad4 ]---
>>>
>>>>> No I didn't. Afaik we don't have a devicetree property to set.
>>>>>
>>>>> But I'd be happy to test that as well. But where to set the property?
>>>>>
>>>>> dwc3_pci_mrfld_properties[] in dwc3-pci?
>>>>>
>>>> Hi Ferry,
>>>>
>>>> Not too sure which DWC3 driver is used for the Intel platform, but I
>>>> believe that should be the one. (if that's what is normally used)  We'd
>>>> just need to add an entry w/ the below:
>>>>
>>>> PROPERTY_ENTRY_BOOL("tx-fifo-resize")
>>>>
>>>> Thanks
>>>> Wesley Cheng
>>>>
>>>>>> Thanks
>>>>>> Wesley Cheng
>>>>>>
>>>>>>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>>>>>>> Packet,
>>>>>>>>>>> it should have a NumP value equal to the bMaxBurst reported in
>>>>>>>>>>> the EP
>>>>>>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>>>>>>> based
>>>>>>>>>> on bMaxBurst, but it was changed later in commit
>>>>>>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because
>>>>>>>>>> of a
>>>>>>>>>> problem reported by John Youn.
>>>>>>>>>>
>>>>>>>>>> And now we've come full circle. Because even if I believe more
>>>>>>>>>> requests
>>>>>>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This
>>>>>>>>>> ends
>>>>>>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>>>>>>> squeeze more throughput out of the controller.
>>>>>>>>>>
>>>>>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In
>>>>>>>>>> fact,
>>>>>>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about
>>>>>>>>>> NumP
>>>>>>>>>> (emphasis is mine):
>>>>>>>>>>
>>>>>>>>>>          "Number of Packets (NumP). This field is used to
>>>>>>>>>> indicate the
>>>>>>>>>>          number of Data Packet buffers that the **receiver** can
>>>>>>>>>>          accept. The value in this field shall be less than or
>>>>>>>>>> equal to
>>>>>>>>>>          the maximum burst size supported by the endpoint as
>>>>>>>>>> determined
>>>>>>>>>>          by the value in the bMaxBurst field in the Endpoint
>>>>>>>>>> Companion
>>>>>>>>>>          Descriptor (refer to Section 9.6.7)."
>>>>>>>>>>
>>>>>>>>>> So, NumP is for the receiver, not the transmitter. Could you
>>>>>>>>>> clarify
>>>>>>>>>> what you mean here?
>>>>>>>>>>
>>>>>>>>>> /me keeps reading
>>>>>>>>>>
>>>>>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>>>>>
>>>>>>>>>>          "Number of Packets (NumP).
>>>>>>>>>>
>>>>>>>>>>          For an OUT endpoint, refer to Table 8-13 for the
>>>>>>>>>> description of
>>>>>>>>>>          this field.
>>>>>>>>>>
>>>>>>>>>>          For an IN endpoint this field is set by the endpoint to
>>>>>>>>>> the
>>>>>>>>>>          number of packets it can transmit when the host resumes
>>>>>>>>>>          transactions to it. This field shall not have a value
>>>>>>>>>> greater
>>>>>>>>>>          than the maximum burst size supported by the endpoint as
>>>>>>>>>>          indicated by the value in the bMaxBurst field in the
>>>>>>>>>> Endpoint
>>>>>>>>>>          Companion Descriptor. Note that the value reported in this
>>>>>>>>>> field
>>>>>>>>>>          may be treated by the host as informative only."
>>>>>>>>>>
>>>>>>>>>> However, if I remember correctly (please verify dwc3 databook),
>>>>>>>>>> NUMP in
>>>>>>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3
>>>>>>>>>> compute
>>>>>>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>>>>>>> DCFG.NUMP or
>>>>>>>>>> TxFIFO size?
>>>>>>>>>>
>>>>>>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>>>>>>> seen is
>>>>>>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>>>>>>> need
>>>>>>>>>>> to send an ERDY once data is available within the FIFO, and the
>>>>>>>>>>> same
>>>>>>>>>>> sequence happens until the USB request is complete.  With this
>>>>>>>>>>> constant
>>>>>>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is
>>>>>>>>>>> under
>>>>>>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>>>>>>> constant
>>>>>>>>>>> bursts for a request, until the request is done, or if the host
>>>>>>>>>>> runs out
>>>>>>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during
>>>>>>>>>>> USB
>>>>>>>>>>> request data transfer)
>>>>>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>>>>>
>>>>>>>>>>>>>>>> Good points.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>>>>>>> different devices?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>>>>>>> testing :).
>>>>>>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>>>>>>> Bluetooth, anyway
>>>>>>>>>>>>>> :-)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>>>>>>> flash dev
>>>>>>>>>>>>>> images heh.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I used to be a customer facing engineer, so honestly I did see
>>>>>>>>>>>>> some
>>>>>>>>>>>>> really interesting and crazy designs.  Again, we do have
>>>>>>>>>>>>> non-Android
>>>>>>>>>>>>> products that use the same code, and it has been working in
>>>>>>>>>>>>> there
>>>>>>>>>>>>> for a
>>>>>>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>>>>>>> multimedia
>>>>>>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>>>>>>> end CPU
>>>>>>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>>>>>>> missed
>>>>>>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>>>>>>> This is good background information. Thanks for bringing this
>>>>>>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm
>>>>>>>>>>>> interested in
>>>>>>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>>>>>>
>>>>>>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>>>>>>> endpoint. If
>>>>>>>>>>>> our gadget driver uses a low number of requests, we're never
>>>>>>>>>>>> really
>>>>>>>>>>>> using the TRB ring in our benefit.
>>>>>>>>>>>>
>>>>>>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>>>>>>> resizing. :).
>>>>>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX
>>>>>>>>>> Burst
>>>>>>>>>> behavior.
>>>>>>>>> -- 
>>>>>>>>> heikki

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-17  8:30                                     ` Wesley Cheng
@ 2021-06-17 19:54                                       ` Ferry Toth
  0 siblings, 0 replies; 31+ messages in thread
From: Ferry Toth @ 2021-06-17 19:54 UTC (permalink / raw)
  To: Wesley Cheng, Andy Shevchenko, Heikki Krogerus
  Cc: Felipe Balbi, Andy Shevchenko, Greg KH, Andy Gross,
	Bjorn Andersson, Rob Herring, USB, Linux Kernel Mailing List,
	linux-arm-msm, devicetree, Jack Pham, Thinh Nguyen, John Youn

Hi

Op 17-06-2021 om 10:30 schreef Wesley Cheng:
>
> On 6/17/2021 12:47 AM, Ferry Toth wrote:
>> Hi
>>
>> Op 17-06-2021 om 06:25 schreef Wesley Cheng:
>>> On 6/15/2021 12:53 PM, Ferry Toth wrote:
>>>> Hi
>>>>
>>>> Op 15-06-2021 om 06:22 schreef Wesley Cheng:
>>>>> On 6/14/2021 12:30 PM, Ferry Toth wrote:
>>>>>> Op 14-06-2021 om 20:58 schreef Wesley Cheng:
>>>>>>> On 6/12/2021 2:27 PM, Ferry Toth wrote:
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> Op 11-06-2021 om 15:21 schreef Andy Shevchenko:
>>>>>>>>> On Fri, Jun 11, 2021 at 4:14 PM Heikki Krogerus
>>>>>>>>> <heikki.krogerus@linux.intel.com> wrote:
>>>>>>>>>> On Fri, Jun 11, 2021 at 04:00:38PM +0300, Felipe Balbi wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>>>>>>>>>>>> to be honest, I don't think these should go in (apart from
>>>>>>>>>>>>>>>>>> the build
>>>>>>>>>>>>>>>>>> failure) because it's likely to break instantiations of the
>>>>>>>>>>>>>>>>>> core with
>>>>>>>>>>>>>>>>>> differing FIFO sizes. Some instantiations even have some
>>>>>>>>>>>>>>>>>> endpoints with
>>>>>>>>>>>>>>>>>> dedicated functionality that requires the default FIFO size
>>>>>>>>>>>>>>>>>> configured
>>>>>>>>>>>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and
>>>>>>>>>>>>>>>>>> some Intel
>>>>>>>>>>>>>>>>>> implementations which have dedicated endpoints for processor
>>>>>>>>>>>>>>>>>> tracing.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> With OMAP5, these endpoints are configured at the top of the
>>>>>>>>>>>>>>>>>> available
>>>>>>>>>>>>>>>>>> endpoints, which means that if a gadget driver gets loaded
>>>>>>>>>>>>>>>>>> and takes
>>>>>>>>>>>>>>>>>> over most of the FIFO space because of this resizing,
>>>>>>>>>>>>>>>>>> processor tracing
>>>>>>>>>>>>>>>>>> will have a hard time running. That being said, processor
>>>>>>>>>>>>>>>>>> tracing isn't
>>>>>>>>>>>>>>>>>> supported in upstream at this moment.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I agree that the application of this logic may differ between
>>>>>>>>>>>>>>>> vendors,
>>>>>>>>>>>>>>>> hence why I wanted to keep this controllable by the DT
>>>>>>>>>>>>>>>> property, so that
>>>>>>>>>>>>>>>> for those which do not support this use case can leave it
>>>>>>>>>>>>>>>> disabled.  The
>>>>>>>>>>>>>>>> logic is there to ensure that for a given USB configuration,
>>>>>>>>>>>>>>>> for each EP
>>>>>>>>>>>>>>>> it would have at least 1 TX FIFO.  For USB configurations
>>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>> don't
>>>>>>>>>>>>>>>> utilize all available IN EPs, it would allow re-allocation of
>>>>>>>>>>>>>>>> internal
>>>>>>>>>>>>>>>> memory to EPs which will actually be in use.
>>>>>>>>>>>>>>> The feature ends up being all-or-nothing, then :-) It sounds
>>>>>>>>>>>>>>> like we can
>>>>>>>>>>>>>>> be a little nicer in this regard.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Don't get me wrong, I think once those features become available
>>>>>>>>>>>>>> upstream, we can improve the logic.  From what I remember when
>>>>>>>>>>>>>> looking
>>>>>>>>>>>>> sure, I support that. But I want to make sure the first cut isn't
>>>>>>>>>>>>> likely
>>>>>>>>>>>>> to break things left and right :)
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hence, let's at least get more testing.
>>>>>>>>>>>>>
>>>>>>>>>>>> Sure, I'd hope that the other users of DWC3 will also see some
>>>>>>>>>>>> pretty
>>>>>>>>>>>> big improvements on the TX path with this.
>>>>>>>>>>> fingers crossed
>>>>>>>>>>>
>>>>>>>>>>>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes
>>>>>>>>>>>>>> were
>>>>>>>>>>>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.
>>>>>>>>>>>>>> If that
>>>>>>>>>>>>> right, that's the reason why we introduced the endpoint feature
>>>>>>>>>>>>> flags. The end goal was that the UDC would be able to have custom
>>>>>>>>>>>>> feature flags paired with ->validate_endpoint() or whatever
>>>>>>>>>>>>> before
>>>>>>>>>>>>> allowing it to be enabled. Then the UDC driver could tell UDC
>>>>>>>>>>>>> core to
>>>>>>>>>>>>> skip that endpoint on that particular platform without
>>>>>>>>>>>>> interefering with
>>>>>>>>>>>>> everything else.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Of course, we still need to figure out a way to abstract the
>>>>>>>>>>>>> different
>>>>>>>>>>>>> dwc3 instantiations.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> was the change which ended up upstream for the Intel tracer
>>>>>>>>>>>>>> then we
>>>>>>>>>>>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>>>>>>>>>>> The problem then, just as I mentioned in the previous paragraph,
>>>>>>>>>>>>> will be
>>>>>>>>>>>>> coming up with a solution that's elegant and works for all
>>>>>>>>>>>>> different
>>>>>>>>>>>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>>>>>>>>>>>
>>>>>>>>>>>> Well, at least for the TX FIFO resizing logic, we'd only be
>>>>>>>>>>>> needing to
>>>>>>>>>>>> focus on the DWC3 implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> You bring up another good topic that I'll eventually needing to be
>>>>>>>>>>>> taking a look at, which is a nice way we can handle vendor
>>>>>>>>>>>> specific
>>>>>>>>>>>> endpoints and how they can co-exist with other "normal"
>>>>>>>>>>>> endpoints.  We
>>>>>>>>>>>> have a few special HW eps as well, which we try to maintain
>>>>>>>>>>>> separately
>>>>>>>>>>>> in our DWC3 vendor driver, but it isn't the most convenient, or
>>>>>>>>>>>> most
>>>>>>>>>>>> pretty method :).
>>>>>>>>>>> Awesome, as mentioned, the endpoint feature flags were added
>>>>>>>>>>> exactly to
>>>>>>>>>>> allow for these vendor-specific features :-)
>>>>>>>>>>>
>>>>>>>>>>> I'm more than happy to help testing now that I finally got our
>>>>>>>>>>> SM8150
>>>>>>>>>>> Surface Duo device tree accepted by Bjorn ;-)
>>>>>>>>>>>
>>>>>>>>>>>>>> However, I'm not sure how the changes would look like in the
>>>>>>>>>>>>>> end,
>>>>>>>>>>>>>> so I
>>>>>>>>>>>>>> would like to wait later down the line to include that :).
>>>>>>>>>>>>> Fair enough, I agree. Can we get some more testing of $subject,
>>>>>>>>>>>>> though?
>>>>>>>>>>>>> Did you test $subject with upstream too? Which gadget drivers
>>>>>>>>>>>>> did you
>>>>>>>>>>>>> use? How did you test
>>>>>>>>>>>>>
>>>>>>>>>>>> The results that I included in the cover page was tested with the
>>>>>>>>>>>> pure
>>>>>>>>>>>> upstream kernel on our device.  Below was using the ConfigFS
>>>>>>>>>>>> gadget
>>>>>>>>>>>> w/ a
>>>>>>>>>>>> mass storage only composition.
>>>>>>>>>>>>
>>>>>>>>>>>> Test Parameters:
>>>>>>>>>>>>      - Platform: Qualcomm SM8150
>>>>>>>>>>>>      - bMaxBurst = 6
>>>>>>>>>>>>      - USB req size = 256kB
>>>>>>>>>>>>      - Num of USB reqs = 16
>>>>>>>>>>> do you mind testing with the regular request size (16KiB) and 250
>>>>>>>>>>> requests? I think we can even do 15 bursts in that case.
>>>>>>>>>>>
>>>>>>>>>>>>      - USB Speed = Super-Speed
>>>>>>>>>>>>      - Function Driver: Mass Storage (w/ ramdisk)
>>>>>>>>>>>>      - Test Application: CrystalDiskMark
>>>>>>>>>>>>
>>>>>>>>>>>> Results:
>>>>>>>>>>>>
>>>>>>>>>>>> TXFIFO Depth = 3 max packets
>>>>>>>>>>>>
>>>>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>>>>> Read      |9 loops    | 193.60
>>>>>>>>>>>>               |           | 195.86
>>>>>>>>>>>>               |           | 184.77
>>>>>>>>>>>>               |           | 193.60
>>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> TXFIFO Depth = 6 max packets
>>>>>>>>>>>>
>>>>>>>>>>>> Test Case | Data Size | AVG tput (in MB/s)
>>>>>>>>>>>> -------------------------------------------
>>>>>>>>>>>> Sequential|1 GB x     |
>>>>>>>>>>>> Read      |9 loops    | 287.35
>>>>>>>>>>>>             |           | 304.94
>>>>>>>>>>>>               |           | 289.64
>>>>>>>>>>>>               |           | 293.61
>>>>>>>>>>> I remember getting close to 400MiB/sec with Intel platforms without
>>>>>>>>>>> resizing FIFOs and I'm sure the FIFO size was set to 2x1024,
>>>>>>>>>>> though my
>>>>>>>>>>> memory could be failing.
>>>>>>>>>>>
>>>>>>>>>>> Then again, I never ran with CrystalDiskMark, I was using my own
>>>>>>>>>>> tool
>>>>>>>>>>> (it's somewhere in github. If you care, I can look up the URL).
>>>>>>>>>>>
>>>>>>>>>>>> We also have internal numbers which have shown similar
>>>>>>>>>>>> improvements as
>>>>>>>>>>>> well.  Those are over networking/tethering interfaces, so testing
>>>>>>>>>>>> IPERF
>>>>>>>>>>>> loopback over TCP/UDP.
>>>>>>>>>>> loopback iperf? That would skip the wire, no?
>>>>>>>>>>>
>>>>>>>>>>>>>> size of 2 and TX threshold of 1, this would really be not
>>>>>>>>>>>>>> beneficial to
>>>>>>>>>>>>>> us, because we can only change the TX threshold to 2 at max,
>>>>>>>>>>>>>> and at
>>>>>>>>>>>>>> least in my observations, once we have to go out to system
>>>>>>>>>>>>>> memory to
>>>>>>>>>>>>>> fetch the next data packet, that latency takes enough time
>>>>>>>>>>>>>> for the
>>>>>>>>>>>>>> controller to end the current burst.
>>>>>>>>>>>>> What I noticed with g_mass_storage is that we can amortize the
>>>>>>>>>>>>> cost of
>>>>>>>>>>>>> fetching data from memory, with a deeper request queue.
>>>>>>>>>>>>> Whenever I
>>>>>>>>>>>>> test(ed) g_mass_storage, I was doing so with 250 requests. And
>>>>>>>>>>>>> that was
>>>>>>>>>>>>> enough to give me very good performance. Never had to poke at TX
>>>>>>>>>>>>> FIFO
>>>>>>>>>>>>> resizing. Did you try something like this too?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I feel that allocating more requests is a far simpler and more
>>>>>>>>>>>>> generic
>>>>>>>>>>>>> method that changing FIFO sizes :)
>>>>>>>>>>>>>
>>>>>>>>>>>> I wish I had a USB bus trace handy to show you, which would
>>>>>>>>>>>> make it
>>>>>>>>>>>> very
>>>>>>>>>>>> clear how the USB bus is currently utilized with TXFIFO size 2 vs
>>>>>>>>>>>> 6.  So
>>>>>>>>>>>> by increasing the number of USB requests, that will help if there
>>>>>>>>>>>> was a
>>>>>>>>>>>> bottleneck at the SW level where the application/function driver
>>>>>>>>>>>> utilizing the DWC3 was submitting data much faster than the HW was
>>>>>>>>>>>> processing them.
>>>>>>>>>>>>
>>>>>>>>>>>> So yes, this method of increasing the # of USB reqs will
>>>>>>>>>>>> definitely
>>>>>>>>>>>> help
>>>>>>>>>>>> with situations such as HSUSB or in SSUSB when EP bursting isn't
>>>>>>>>>>>> used.
>>>>>>>>>>>> The TXFIFO resize comes into play for SSUSB, which utilizes
>>>>>>>>>>>> endpoint
>>>>>>>>>>>> bursting.
>>>>>>>>>>> Hmm, that's not what I remember. Perhaps the TRB cache size plays a
>>>>>>>>>>> role
>>>>>>>>>>> here too. I have clear memories of testing this very scenario of
>>>>>>>>>>> bursting (using g_mass_storage at the time) because I was curious
>>>>>>>>>>> about
>>>>>>>>>>> it. Back then, my tests showed no difference in behavior.
>>>>>>>>>>>
>>>>>>>>>>> It could be nice if Heikki could test Intel parts with and without
>>>>>>>>>>> your
>>>>>>>>>>> changes on g_mass_storage with 250 requests.
>>>>>>>>>> Andy, you have a system at hand that has the DWC3 block enabled,
>>>>>>>>>> right? Can you help out here?
>>>>>>>>> I'm not sure if i will have time soon, I Cc'ed to Ferry who has a few
>>>>>>>>> more test cases (I have only one or two) and maybe can help. But I'll
>>>>>>>>> keep this in mind.
>>>>>>>> I just tested on 5.13.0-rc4 on Intel Edison (x86_64). All 5 patches
>>>>>>>> apply. Switching between host/gadget works, no connections
>>>>>>>> dropping, no
>>>>>>>> errors in dmesg.
>>>>>>>>
>>>>>>>> In host mode I connect a smsc9504 eth+4p hub. In gadget mode I have
>>>>>>>> composite device created from configfs with gser / eem /
>>>>>>>> mass_storage /
>>>>>>>> uac2.
>>>>>>>>
>>>>>>>> Tested with iperf3 performance in host (93.6Mbits/sec) and gadget
>>>>>>>> (207Mbits/sec) mode. Compared to v5.10.41 without patches host
>>>>>>>> (93.4Mbits/sec) and gadget (198Mbits/sec).
>>>>>>>>
>>>>>>>> Gadget seems to be a little faster with the patches, but that might
>>>>>>>> also
>>>>>>>> be caused  by something else, on v5.10.41 I see the bitrate bouncing
>>>>>>>> between 207 and 199.
>>>>>>>>
>>>>>>>> I saw a mention to test iperf3 to self (loopback). 3.09 Gbits/sec.
>>>>>>>> With
>>>>>>>> v5.10.41 3.07Gbits/sec. Not bad for a 500MHz device.
>>>>>>>>
>>>>>>>> With gnome-disks I did a read access benchmark 35.4MB/s, with v5.10.41
>>>>>>>> 34.7MB/s. This might be limited by Edison's internal eMMC speed (when
>>>>>>>> booting U-Boot reads the kernel with 21.4 MiB/s).
>>>>>>>>
>>>>>>> Hi Ferry,
>>>>>>>
>>>>>>> Thanks for the testing.  Just to double check, did you also enable the
>>>>>>> property, which enabled the TXFIFO resize feature on the platform?  For
>>>>>>> example, for the QCOM SM8150 platform, we're adding the following to
>>>>>>> our
>>>>>>> device tree node:
>>>>>>>
>>>>>>> tx-fifo-resize
>>>>>>>
>>>>>>> If not, then your results at least confirms that w/o the property
>>>>>>> present, the changes won't break anything :).  Thanks again for the
>>>>>>> initial testing!
>>>> I applied the patch now to 5.13.0-rc5 + the following:
>>>>
>>> Hi Ferry,
>>>
>>> Quick question...there was a compile error with the V9 patch series, as
>>> it was using the dwc3_mwidth() incorrectly.  I will update this with the
>>> proper use of the mdwidth, but which patch version did you use?
> Hi Ferry,
>
>> The V9 set gets applied to 5.13.0-rc5 by Yocto, if it doesn't apply it
>> stops the build. I didn't notice any compile errors, they stop the whole
>> build process too. But warnings are ignored. I'll check the logs to be sure.
> Ah, ok.  I think the incorrect usage of the API will result as a warning
> as seen in Greg's compile log:
>
> drivers/usb/dwc3/gadget.c: In function ‘dwc3_gadget_calc_tx_fifo_size’:
> drivers/usb/dwc3/gadget.c:653:45: warning: passing argument 1 of
> ‘dwc3_mdwidth’ makes pointer from integer without a cast [-Wint-conversion]
>    653 |         mdwidth = dwc3_mdwidth(dwc->hwparams.hwparams0);
I found exactly this warning in my logs. Sorry for this noise, I was 
watching for an error.
> This is probably why the page fault occurs, as we're not passing in the
> DWC3 struct.  I will send out a V10 shortly after testing it on my device.
>
> Thanks
> Wesley Cheng
>
>>> Thanks
>>> Wesley Cheng
>>>
>>>> --- a/drivers/usb/dwc3/dwc3-pci.c
>>>> +++ b/drivers/usb/dwc3/dwc3-pci.c
>>>> @@ -124,6 +124,7 @@ static const struct property_entry
>>>> dwc3_pci_mrfld_properties[] = {
>>>>       PROPERTY_ENTRY_BOOL("snps,dis_u3_susphy_quirk"),
>>>>       PROPERTY_ENTRY_BOOL("snps,dis_u2_susphy_quirk"),
>>>>       PROPERTY_ENTRY_BOOL("snps,usb2-gadget-lpm-disable"),
>>>> +    PROPERTY_ENTRY_BOOL("tx-fifo-resize"),
>>>>       PROPERTY_ENTRY_BOOL("linux,sysdev_is_parent"),
>>>>       {}
>>>>   };
>>>>
>>>>   and when switching to gadget mode unfortunately received the following
>>>> oops:
>>>>
>>>> BUG: unable to handle page fault for address: 00000000202043f2
>>>> #PF: supervisor read access in kernel mode
>>>> #PF: error_code(0x0000) - not-present page
>>>> PGD 0 P4D 0
>>>> Oops: 0000 [#1] SMP PTI
>>>> CPU: 0 PID: 617 Comm: conf-gadget.sh Not tainted
>>>> 5.13.0-rc5-edison-acpi-standard #1
>>>> Hardware name: Intel Corporation Merrifield/BODEGA BAY, BIOS 542
>>>> 2015.01.21:18.19.48
>>>> RIP: 0010:dwc3_gadget_check_config+0x33/0x80
>>>> Code: 59 04 00 00 04 74 61 48 c1 ee 10 48 89 f7 f3 48 0f b8 c7 48 89 c7
>>>> 39 81 60 04 00 00 7d 4a 89 81 60 04 00 00 8b 81 08 04 00 00 <81> b8 e8
>>>> 03 00 00 32 33 00 00 0f b6 b0 09 04 00 00 75 0d 8b 80 20
>>>> RSP: 0018:ffffb5550038fda0 EFLAGS: 00010297
>>>> RAX: 000000002020400a RBX: ffffa04502627348 RCX: ffffa04507354028
>>>> RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000004
>>>> RBP: ffffa04508ac0550 R08: ffffa04503a75b2c R09: 0000000000000000
>>>> R10: 0000000000000216 R11: 000000000002eba0 R12: ffffa04508ac0550
>>>> R13: dead000000000100 R14: ffffa04508ac0600 R15: ffffa04508ac0520
>>>> FS:  00007f7471e2f740(0000) GS:ffffa0453e200000(0000)
>>>> knlGS:0000000000000000
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> CR2: 00000000202043f2 CR3: 0000000003f38000 CR4: 00000000001006f0
>>>> Call Trace:
>>>>   configfs_composite_bind+0x2f4/0x430 [libcomposite]
>>>>   udc_bind_to_driver+0x64/0x180
>>>>   usb_gadget_probe_driver+0x114/0x150
>>>>   gadget_dev_desc_UDC_store+0xbc/0x130 [libcomposite]
>>>>   configfs_write_file+0xcd/0x140
>>>>   vfs_write+0xbb/0x250
>>>>   ksys_write+0x5a/0xd0
>>>>   do_syscall_64+0x40/0x80
>>>>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>>>> RIP: 0033:0x7f7471f1ff53
>>>> Code: 8b 15 21 cf 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f
>>>> 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00
>>>> f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
>>>> RSP: 002b:00007fffa3dcd328 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>>> RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f7471f1ff53
>>>> RDX: 000000000000000c RSI: 00005614d615a770 RDI: 0000000000000001
>>>> RBP: 00005614d615a770 R08: 000000000000000a R09: 00007f7471fb20c0
>>>> R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
>>>> R13: 00007f7471fee520 R14: 000000000000000c R15: 00007f7471fee720
>>>> Modules linked in: usb_f_uac2 u_audio usb_f_mass_storage usb_f_eem
>>>> u_ether usb_f_serial u_serial libcomposite rfcomm iptable_nat bnep
>>>> snd_sof_nocodec spi_pxa2xx_platform dw_dmac smsc snd_sof_pci_intel_tng
>>>> snd_sof_pci snd_sof_acpi_intel_byt snd_sof_intel_ipc snd_sof_acpi
>>>> smsc95xx snd_sof pwm_lpss_pci pwm_lpss snd_sof_xtensa_dsp
>>>> snd_intel_dspcfg snd_soc_acpi_intel_match snd_soc_acpi dw_dmac_pci
>>>> intel_mrfld_pwrbtn intel_mrfld_adc dw_dmac_core spi_pxa2xx_pci brcmfmac
>>>> brcmutil leds_gpio hci_uart btbcm ti_ads7950
>>>> industrialio_triggered_buffer kfifo_buf ledtrig_timer ledtrig_heartbeat
>>>> mmc_block extcon_intel_mrfld sdhci_pci cqhci sdhci led_class
>>>> intel_soc_pmic_mrfld mmc_core btrfs libcrc32c xor zstd_compress
>>>> zlib_deflate raid6_pq
>>>> CR2: 00000000202043f2
>>>> ---[ end trace 5c11fe50dca92ad4 ]---
>>>>
>>>>>> No I didn't. Afaik we don't have a devicetree property to set.
>>>>>>
>>>>>> But I'd be happy to test that as well. But where to set the property?
>>>>>>
>>>>>> dwc3_pci_mrfld_properties[] in dwc3-pci?
>>>>>>
>>>>> Hi Ferry,
>>>>>
>>>>> Not too sure which DWC3 driver is used for the Intel platform, but I
>>>>> believe that should be the one. (if that's what is normally used)  We'd
>>>>> just need to add an entry w/ the below:
>>>>>
>>>>> PROPERTY_ENTRY_BOOL("tx-fifo-resize")
>>>>>
>>>>> Thanks
>>>>> Wesley Cheng
>>>>>
>>>>>>> Thanks
>>>>>>> Wesley Cheng
>>>>>>>
>>>>>>>>>>>> Now with endpoint bursting, if the function notifies the host that
>>>>>>>>>>>> bursting is supported, when the host sends the ACK for the Data
>>>>>>>>>>>> Packet,
>>>>>>>>>>>> it should have a NumP value equal to the bMaxBurst reported in
>>>>>>>>>>>> the EP
>>>>>>>>>>> Yes and no. Looking back at the history, we used to configure NUMP
>>>>>>>>>>> based
>>>>>>>>>>> on bMaxBurst, but it was changed later in commit
>>>>>>>>>>> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because
>>>>>>>>>>> of a
>>>>>>>>>>> problem reported by John Youn.
>>>>>>>>>>>
>>>>>>>>>>> And now we've come full circle. Because even if I believe more
>>>>>>>>>>> requests
>>>>>>>>>>> are enough for bursting, NUMP is limited by the RxFIFO size. This
>>>>>>>>>>> ends
>>>>>>>>>>> up supporting your claim that we need RxFIFO resizing if we want to
>>>>>>>>>>> squeeze more throughput out of the controller.
>>>>>>>>>>>
>>>>>>>>>>> However, note that this is about RxFIFO size, not TxFIFO size. In
>>>>>>>>>>> fact,
>>>>>>>>>>> looking at Table 8-13 of USB 3.1 r1.0, we read the following about
>>>>>>>>>>> NumP
>>>>>>>>>>> (emphasis is mine):
>>>>>>>>>>>
>>>>>>>>>>>           "Number of Packets (NumP). This field is used to
>>>>>>>>>>> indicate the
>>>>>>>>>>>           number of Data Packet buffers that the **receiver** can
>>>>>>>>>>>           accept. The value in this field shall be less than or
>>>>>>>>>>> equal to
>>>>>>>>>>>           the maximum burst size supported by the endpoint as
>>>>>>>>>>> determined
>>>>>>>>>>>           by the value in the bMaxBurst field in the Endpoint
>>>>>>>>>>> Companion
>>>>>>>>>>>           Descriptor (refer to Section 9.6.7)."
>>>>>>>>>>>
>>>>>>>>>>> So, NumP is for the receiver, not the transmitter. Could you
>>>>>>>>>>> clarify
>>>>>>>>>>> what you mean here?
>>>>>>>>>>>
>>>>>>>>>>> /me keeps reading
>>>>>>>>>>>
>>>>>>>>>>> Hmm, table 8-15 tries to clarify:
>>>>>>>>>>>
>>>>>>>>>>>           "Number of Packets (NumP).
>>>>>>>>>>>
>>>>>>>>>>>           For an OUT endpoint, refer to Table 8-13 for the
>>>>>>>>>>> description of
>>>>>>>>>>>           this field.
>>>>>>>>>>>
>>>>>>>>>>>           For an IN endpoint this field is set by the endpoint to
>>>>>>>>>>> the
>>>>>>>>>>>           number of packets it can transmit when the host resumes
>>>>>>>>>>>           transactions to it. This field shall not have a value
>>>>>>>>>>> greater
>>>>>>>>>>>           than the maximum burst size supported by the endpoint as
>>>>>>>>>>>           indicated by the value in the bMaxBurst field in the
>>>>>>>>>>> Endpoint
>>>>>>>>>>>           Companion Descriptor. Note that the value reported in this
>>>>>>>>>>> field
>>>>>>>>>>>           may be treated by the host as informative only."
>>>>>>>>>>>
>>>>>>>>>>> However, if I remember correctly (please verify dwc3 databook),
>>>>>>>>>>> NUMP in
>>>>>>>>>>> DCFG was only for receive buffers. Thin, John, how does dwc3
>>>>>>>>>>> compute
>>>>>>>>>>> NumP for TX/IN endpoints? Is that computed as a function of
>>>>>>>>>>> DCFG.NUMP or
>>>>>>>>>>> TxFIFO size?
>>>>>>>>>>>
>>>>>>>>>>>> desc.  If we have a TXFIFO size of 2, then normally what I have
>>>>>>>>>>>> seen is
>>>>>>>>>>>> that after 2 data packets, the device issues a NRDY.  So then we'd
>>>>>>>>>>>> need
>>>>>>>>>>>> to send an ERDY once data is available within the FIFO, and the
>>>>>>>>>>>> same
>>>>>>>>>>>> sequence happens until the USB request is complete.  With this
>>>>>>>>>>>> constant
>>>>>>>>>>>> NRDY/ERDY handshake going on, you actually see that the bus is
>>>>>>>>>>>> under
>>>>>>>>>>>> utilized.  When we increase an EP's FIFO size, then you'll see
>>>>>>>>>>>> constant
>>>>>>>>>>>> bursts for a request, until the request is done, or if the host
>>>>>>>>>>>> runs out
>>>>>>>>>>>> of RXFIFO. (ie no interruption [on the USB protocol level] during
>>>>>>>>>>>> USB
>>>>>>>>>>>> request data transfer)
>>>>>>>>>>> Unfortunately I don't have access to a USB sniffer anymore :-(
>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Good points.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Wesley, what kind of testing have you done on this on
>>>>>>>>>>>>>>>>> different devices?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As mentioned above, these changes are currently present on end
>>>>>>>>>>>>>>>> user
>>>>>>>>>>>>>>>> devices for the past few years, so its been through a lot of
>>>>>>>>>>>>>>>> testing :).
>>>>>>>>>>>>>>> all with the same gadget driver. Also, who uses USB on android
>>>>>>>>>>>>>>> devices
>>>>>>>>>>>>>>> these days? Most of the data transfer goes via WiFi or
>>>>>>>>>>>>>>> Bluetooth, anyway
>>>>>>>>>>>>>>> :-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I guess only developers are using USB during development to
>>>>>>>>>>>>>>> flash dev
>>>>>>>>>>>>>>> images heh.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I used to be a customer facing engineer, so honestly I did see
>>>>>>>>>>>>>> some
>>>>>>>>>>>>>> really interesting and crazy designs.  Again, we do have
>>>>>>>>>>>>>> non-Android
>>>>>>>>>>>>>> products that use the same code, and it has been working in
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>> for a
>>>>>>>>>>>>>> few years as well.  The TXFIFO sizing really has helped with
>>>>>>>>>>>>>> multimedia
>>>>>>>>>>>>>> use cases, which use isoc endpoints, since esp. in those lower
>>>>>>>>>>>>>> end CPU
>>>>>>>>>>>>>> chips where latencies across the system are much larger, and a
>>>>>>>>>>>>>> missed
>>>>>>>>>>>>>> ISOC interval leads to a pop in your ear.
>>>>>>>>>>>>> This is good background information. Thanks for bringing this
>>>>>>>>>>>>> up. Admitedly, we still have ISOC issues with dwc3. I'm
>>>>>>>>>>>>> interested in
>>>>>>>>>>>>> knowing if a deeper request queue would also help here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Remember dwc3 can accomodate 255 requests + link for each
>>>>>>>>>>>>> endpoint. If
>>>>>>>>>>>>> our gadget driver uses a low number of requests, we're never
>>>>>>>>>>>>> really
>>>>>>>>>>>>> using the TRB ring in our benefit.
>>>>>>>>>>>>>
>>>>>>>>>>>> We're actually using both a deeper USB request queue + TX fifo
>>>>>>>>>>>> resizing. :).
>>>>>>>>>>> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX
>>>>>>>>>>> Burst
>>>>>>>>>>> behavior.
>>>>>>>>>> -- 
>>>>>>>>>> heikki

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting
  2021-06-11 13:00                 ` Felipe Balbi
  2021-06-11 13:14                   ` Heikki Krogerus
@ 2021-07-01  1:08                   ` Wesley Cheng
  1 sibling, 0 replies; 31+ messages in thread
From: Wesley Cheng @ 2021-07-01  1:08 UTC (permalink / raw)
  To: Felipe Balbi, Greg KH, Heikki Krogerus
  Cc: agross, bjorn.andersson, robh+dt, linux-usb, linux-kernel,
	linux-arm-msm, devicetree, jackp, Thinh.Nguyen, John Youn



On 6/11/2021 6:00 AM, Felipe Balbi wrote:
> 
> Hi,
> 
> Wesley Cheng <wcheng@codeaurora.org> writes:
>>>>>>>> to be honest, I don't think these should go in (apart from the build
>>>>>>>> failure) because it's likely to break instantiations of the core with
>>>>>>>> differing FIFO sizes. Some instantiations even have some endpoints with
>>>>>>>> dedicated functionality that requires the default FIFO size configured
>>>>>>>> during coreConsultant instantiation. I know of at OMAP5 and some Intel
>>>>>>>> implementations which have dedicated endpoints for processor tracing.
>>>>>>>>
>>>>>>>> With OMAP5, these endpoints are configured at the top of the available
>>>>>>>> endpoints, which means that if a gadget driver gets loaded and takes
>>>>>>>> over most of the FIFO space because of this resizing, processor tracing
>>>>>>>> will have a hard time running. That being said, processor tracing isn't
>>>>>>>> supported in upstream at this moment.
>>>>>>>>
>>>>>>
>>>>>> I agree that the application of this logic may differ between vendors,
>>>>>> hence why I wanted to keep this controllable by the DT property, so that
>>>>>> for those which do not support this use case can leave it disabled.  The
>>>>>> logic is there to ensure that for a given USB configuration, for each EP
>>>>>> it would have at least 1 TX FIFO.  For USB configurations which don't
>>>>>> utilize all available IN EPs, it would allow re-allocation of internal
>>>>>> memory to EPs which will actually be in use.
>>>>>
>>>>> The feature ends up being all-or-nothing, then :-) It sounds like we can
>>>>> be a little nicer in this regard.
>>>>>
>>>>
>>>> Don't get me wrong, I think once those features become available
>>>> upstream, we can improve the logic.  From what I remember when looking
>>>
>>> sure, I support that. But I want to make sure the first cut isn't likely
>>> to break things left and right :)
>>>
>>> Hence, let's at least get more testing.
>>>
>>
>> Sure, I'd hope that the other users of DWC3 will also see some pretty
>> big improvements on the TX path with this.

Hi Felipe,

Sorry for the delayed response.  I went into the office to capture a USB
trace to better show you the difference with and without the TXFIFO
resize changes.  Let me address your comments below first before showing
the traces.

> 
> fingers crossed
> 

Unfortunately, based on Ferry's testing, it looks like the Intel HW
platform itself doesn't have a SS capable port.  Although, we did get
some good information from it, as we found that math is different
between controller revisions.

>>>> at Andy Shevchenko's Github, the Intel tracer downstream changes were
>>>> just to remove physical EP1 and 2 from the DWC3 endpoint list.  If that
>>>
>>> right, that's the reason why we introduced the endpoint feature
>>> flags. The end goal was that the UDC would be able to have custom
>>> feature flags paired with ->validate_endpoint() or whatever before
>>> allowing it to be enabled. Then the UDC driver could tell UDC core to
>>> skip that endpoint on that particular platform without interefering with
>>> everything else.
>>>
>>> Of course, we still need to figure out a way to abstract the different
>>> dwc3 instantiations.
>>>
>>>> was the change which ended up upstream for the Intel tracer then we
>>>> could improve the logic to avoid re-sizing those particular EPs.
>>>
>>> The problem then, just as I mentioned in the previous paragraph, will be
>>> coming up with a solution that's elegant and works for all different
>>> instantiations of dwc3 (or musb, cdns3, etc).
>>>
>>
>> Well, at least for the TX FIFO resizing logic, we'd only be needing to
>> focus on the DWC3 implementation.
>>
>> You bring up another good topic that I'll eventually needing to be
>> taking a look at, which is a nice way we can handle vendor specific
>> endpoints and how they can co-exist with other "normal" endpoints.  We
>> have a few special HW eps as well, which we try to maintain separately
>> in our DWC3 vendor driver, but it isn't the most convenient, or most
>> pretty method :).
> 
> Awesome, as mentioned, the endpoint feature flags were added exactly to
> allow for these vendor-specific features :-)
> 
> I'm more than happy to help testing now that I finally got our SM8150
> Surface Duo device tree accepted by Bjorn ;-)
> 
>>>> However, I'm not sure how the changes would look like in the end, so I
>>>> would like to wait later down the line to include that :).
>>>
>>> Fair enough, I agree. Can we get some more testing of $subject, though?
>>> Did you test $subject with upstream too? Which gadget drivers did you
>>> use? How did you test
>>>
>>
>> The results that I included in the cover page was tested with the pure
>> upstream kernel on our device.  Below was using the ConfigFS gadget w/ a
>> mass storage only composition.
>>
>> Test Parameters:
>>  - Platform: Qualcomm SM8150
>>  - bMaxBurst = 6
>>  - USB req size = 256kB
>>  - Num of USB reqs = 16
> 
> do you mind testing with the regular request size (16KiB) and 250
> requests? I think we can even do 15 bursts in that case.
> 

Let's go over the trace.  If you are still convinced this would help
with the particular scenario we're looking at, then I can run this test
with the above.

>>  - USB Speed = Super-Speed
>>  - Function Driver: Mass Storage (w/ ramdisk)
>>  - Test Application: CrystalDiskMark
>>
>> Results:
>>
>> TXFIFO Depth = 3 max packets
>>
>> Test Case | Data Size | AVG tput (in MB/s)
>> -------------------------------------------
>> Sequential|1 GB x     |
>> Read      |9 loops    | 193.60
>>           |           | 195.86
>>           |           | 184.77
>>           |           | 193.60
>> -------------------------------------------
>>
>> TXFIFO Depth = 6 max packets
>>
>> Test Case | Data Size | AVG tput (in MB/s)
>> -------------------------------------------
>> Sequential|1 GB x     |
>> Read      |9 loops    | 287.35
>> 	    |           | 304.94
>>           |           | 289.64
>>           |           | 293.61
> 
> I remember getting close to 400MiB/sec with Intel platforms without
> resizing FIFOs and I'm sure the FIFO size was set to 2x1024, though my
> memory could be failing.
> 
> Then again, I never ran with CrystalDiskMark, I was using my own tool
> (it's somewhere in github. If you care, I can look up the URL).
> 
>> We also have internal numbers which have shown similar improvements as
>> well.  Those are over networking/tethering interfaces, so testing IPERF
>> loopback over TCP/UDP.
> 
> loopback iperf? That would skip the wire, no?
> 

The iperf server (receiver) would be running on the PC, and the iperf
client would be running on our device (transmitter).

>>>> size of 2 and TX threshold of 1, this would really be not beneficial to
>>>> us, because we can only change the TX threshold to 2 at max, and at
>>>> least in my observations, once we have to go out to system memory to
>>>> fetch the next data packet, that latency takes enough time for the
>>>> controller to end the current burst.
>>>
>>> What I noticed with g_mass_storage is that we can amortize the cost of
>>> fetching data from memory, with a deeper request queue. Whenever I
>>> test(ed) g_mass_storage, I was doing so with 250 requests. And that was
>>> enough to give me very good performance. Never had to poke at TX FIFO
>>> resizing. Did you try something like this too?
>>>
>>> I feel that allocating more requests is a far simpler and more generic
>>> method that changing FIFO sizes :)
>>>
>>
>> I wish I had a USB bus trace handy to show you, which would make it very
>> clear how the USB bus is currently utilized with TXFIFO size 2 vs 6.  So
>> by increasing the number of USB requests, that will help if there was a
>> bottleneck at the SW level where the application/function driver
>> utilizing the DWC3 was submitting data much faster than the HW was
>> processing them.
>>
>> So yes, this method of increasing the # of USB reqs will definitely help
>> with situations such as HSUSB or in SSUSB when EP bursting isn't used.
>> The TXFIFO resize comes into play for SSUSB, which utilizes endpoint
>> bursting.
> 
> Hmm, that's not what I remember. Perhaps the TRB cache size plays a role
> here too. I have clear memories of testing this very scenario of
> bursting (using g_mass_storage at the time) because I was curious about
> it. Back then, my tests showed no difference in behavior.
> 
> It could be nice if Heikki could test Intel parts with and without your
> changes on g_mass_storage with 250 requests.
> 
>> Now with endpoint bursting, if the function notifies the host that
>> bursting is supported, when the host sends the ACK for the Data Packet,
>> it should have a NumP value equal to the bMaxBurst reported in the EP
> 
> Yes and no. Looking back at the history, we used to configure NUMP based
> on bMaxBurst, but it was changed later in commit
> 4e99472bc10bda9906526d725ff6d5f27b4ddca1 by yours truly because of a
> problem reported by John Youn.
> 
> And now we've come full circle. Because even if I believe more requests
> are enough for bursting, NUMP is limited by the RxFIFO size. This ends
> up supporting your claim that we need RxFIFO resizing if we want to
> squeeze more throughput out of the controller.
> 
> However, note that this is about RxFIFO size, not TxFIFO size. In fact,
> looking at Table 8-13 of USB 3.1 r1.0, we read the following about NumP
> (emphasis is mine):
> 
> 	"Number of Packets (NumP). This field is used to indicate the
> 	number of Data Packet buffers that the **receiver** can
> 	accept. The value in this field shall be less than or equal to
> 	the maximum burst size supported by the endpoint as determined
> 	by the value in the bMaxBurst field in the Endpoint Companion
> 	Descriptor (refer to Section 9.6.7)."
> 
> So, NumP is for the receiver, not the transmitter. Could you clarify
> what you mean here?
> 
> /me keeps reading
> 
> Hmm, table 8-15 tries to clarify:
> 
> 	"Number of Packets (NumP).
> 
> 	For an OUT endpoint, refer to Table 8-13 for the description of
> 	this field.
> 
> 	For an IN endpoint this field is set by the endpoint to the
> 	number of packets it can transmit when the host resumes
> 	transactions to it. This field shall not have a value greater
> 	than the maximum burst size supported by the endpoint as
> 	indicated by the value in the bMaxBurst field in the Endpoint
> 	Companion Descriptor. Note that the value reported in this field
> 	may be treated by the host as informative only."
> 
> However, if I remember correctly (please verify dwc3 databook), NUMP in
> DCFG was only for receive buffers. Thin, John, how does dwc3 compute
> NumP for TX/IN endpoints? Is that computed as a function of DCFG.NUMP or
> TxFIFO size?
> 

Sorry for confusing you.  So you are right about NumP being applicable
to the receiver path, and the PC USB host controller also will have its
own RXFIFO, which for the IN direction correlates to the NumP value
being sent by the host within the ACK.  The point I was trying to make
is that, the bus utilization is not hampered by the host running out of
RXFIFO (getting ACKs w/ NumP=0), but with the device always pre-maturely
ending the burst.

>> desc.  If we have a TXFIFO size of 2, then normally what I have seen is
>> that after 2 data packets, the device issues a NRDY.  So then we'd need
>> to send an ERDY once data is available within the FIFO, and the same
>> sequence happens until the USB request is complete.  With this constant
>> NRDY/ERDY handshake going on, you actually see that the bus is under
>> utilized.  When we increase an EP's FIFO size, then you'll see constant
>> bursts for a request, until the request is done, or if the host runs out
>> of RXFIFO. (ie no interruption [on the USB protocol level] during USB
>> request data transfer)
> 
> Unfortunately I don't have access to a USB sniffer anymore :-(
> 

So I went ahead and captured USB Lecroy log with the following conditions:
- bMaxBurst value = 6
- TXFIFOSZ = 6 max packets (with resize) / TXFIFOSZ = 1 max packet (w/o)
- Test case = USB tethering w/ IPERF loopback (between PC and device)
- USB request size = 32kB
- Speed = USB3.1 gen 1

Trace Hierarchy:
Transfer
   --> Transaction
      --> Packet

1 transfer has multiple transactions, and 1 transaction has multiple
packets.  This is just how Lecroy does the packet groupings to help make
the traces more readable.  For now, we can focus on the "Packet" entries.

With TXFIFO resize:

Transfer(1062) Left("Left") G1(x1) Bulk(IN) ADDR(15) ENDP(4)
_______| Bytes Transferred(31584) Time Stamp(0 . 400 271 196)
_______|_______________________________________________________________________
Transaction(1768) Left("Left") G1(x1) IN ADDR(15) ENDP(4) Data(31584 bytes)
_______| Time Stamp(0 . 400 271 196)
_______|_______________________________________________________________________L

Packet(268690) Left("Left") Dir G1(x1) TP ACK(1) ADDR(15) ENDP(4)
_______| Dir(In) SeqN(31) NumP(3) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:2)
_______| Duration(40.100 ns) Time(358.000 ns) Time Stamp(0 . 400 271 196)
_______|_______________________________________________________________________R

Packet(268692) Right("Right") Dir G1(x1) DP Data Len(1024) ADDR(15)
_______| ENDP(4) Dir(In) SeqN(31) EoB(N) Stream ID(0x0000) PP(Not Pnd)
_______|   LCW  (Hseq:3) Data(1024 bytes) Duration(  2.117 us) Time(
2.140 us)
_______| Time Stamp(0 . 400 271 554)
_______|_______________________________________________________________________R

Packet(268696) Right("Right") Dir G1(x1) DP Data Len(1024) ADDR(15)
_______| ENDP(4) Dir(In) SeqN(0) EoB(N) Stream ID(0x0000) PP(Not Pnd)
_______|   LCW  (Hseq:4) Data(1024 bytes) Duration(  2.117 us)
Time(262.000 ns)
_______| Time Stamp(0 . 400 273 694)
_______|_______________________________________________________________________L

Packet(268699) Left("Left") Dir G1(x1) TP ACK(1) ADDR(15) ENDP(4)
_______| Dir(In) SeqN(0) NumP(3) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:3)
_______| Duration(40.100 ns) Time(  1.890 us) Time Stamp(0 . 400 273 956)
_______|_______________________________________________________________________R

Packet(268702) Right("Right") Dir G1(x1) DP Data Len(1024) ADDR(15)
_______| ENDP(4) Dir(In) SeqN(1) EoB(N) Stream ID(0x0000) PP(Not Pnd)
_______|   LCW  (Hseq:5) Data(1024 bytes) Duration(  2.117 us)
Time(258.000 ns)
_______| Time Stamp(0 . 400 275 846)
_______|_______________________________________________________________________L

Packet(268705) Left("Left") Dir G1(x1) TP ACK(1) ADDR(15) ENDP(4)
_______| Dir(In) SeqN(1) NumP(3) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:4)
_______| Duration(40.100 ns) Time(  1.918 us) Time Stamp(0 . 400 276 104)
_______|_______________________________________________________________________R

Packet(268708) Right("Right") Dir G1(x1) DP Data Len(1024) ADDR(15)
_______| ENDP(4) Dir(In) SeqN(2) EoB(N) Stream ID(0x0000) PP(Not Pnd)
_______|   LCW  (Hseq:6) Data(1024 bytes) Duration(  2.117 us)
Time(246.000 ns)
_______| Time Stamp(0 . 400 278 022)
_______|_______________________________________________________________________L

Packet(268711) Left("Left") Dir G1(x1) TP ACK(1) ADDR(15) ENDP(4)
_______| Dir(In) SeqN(2) NumP(2) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:5)
_______| Duration(40.100 ns) Time(  1.962 us) Time Stamp(0 . 400 278 268)
_______|_______________________________________________________________________R


Observations:
- Within a transfer, there are no data packets w/ the EoB set to yes.
- Host never runs out of RXFIFO (NumP never reaches 0)
- Packets within a transaction is never interrupted with a NRDY.
(followed by an ERDY from the device to continue the transaction)


==========================================================================

Without TXFIFO resize:

Transfer(1542) Left("Left") G1(x1) Bulk(IN) ADDR(19) ENDP(4)
_______| Bytes Transferred(31584) Time Stamp(0 . 619 833 722)
_______|_______________________________________________________________________
Transaction(7677) Left("Left") G1(x1) IN ADDR(19) ENDP(4) Condition(Flow
Ctrl)
_______| Data(1024 bytes) Time Stamp(0 . 619 833 722)
_______|_______________________________________________________________________L

Packet(415331) Left("Left") Dir G1(x1) TP ACK(1) ADDR(19) ENDP(4)
_______| Dir(In) SeqN(13) NumP(4) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:3)
_______| Duration(40.100 ns) Time(396.000 ns) Time Stamp(0 . 619 833 722)
_______|_______________________________________________________________________R

Packet(415338) Right("Right") Dir G1(x1) DP Data Len(1024) ADDR(19)
_______| ENDP(4) Dir(In) SeqN(13) EoB(Y) Stream ID(0x0000) PP(Not Pnd)
_______|   LCW  (Hseq:5) Data(1024 bytes) Duration(  2.117 us) Time(
4.012 us)
_______| Time Stamp(0 . 619 834 118)
_______|_______________________________________________________________________L

Packet(415349) Left("Left") Dir G1(x1) TP ACK(1) ADDR(19) ENDP(4)
_______| Dir(In) SeqN(14) NumP(0) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:6)
_______| Duration(40.100 ns) Time(460.000 ns) Time Stamp(0 . 619 838 130)
_______|_______________________________________________________________________
Transaction(7681) Right("Right") G1(x1) EP Ready ADDR(19) ENDP(4) Dir(IN)
_______| Time Stamp(0 . 619 838 590)
_______|_______________________________________________________________________R

Packet(415354) Right("Right") Dir G1(x1) TP ERDY(3) ADDR(19) ENDP(4)
_______| Dir(In) NumP(7) Stream ID(0x0000)   LCW  (Hseq:7)
Duration(40.100 ns)
_______| Time(  2.148 us) Time Stamp(0 . 619 838 590)
_______|_______________________________________________________________________
Transaction(7682) Left("Left") G1(x1) IN ADDR(19) ENDP(4) Condition(Flow
Ctrl)
_______| Data(1024 bytes) Time Stamp(0 . 619 840 738)
_______|_______________________________________________________________________L

Packet(415358) Left("Left") Dir G1(x1) TP ACK(1) ADDR(19) ENDP(4)
_______| Dir(In) SeqN(14) NumP(3) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:0)
_______| Duration(40.100 ns) Time(428.000 ns) Time Stamp(0 . 619 840 738)
_______|_______________________________________________________________________R

Packet(415365) Right("Right") Dir G1(x1) DP Data Len(1024) ADDR(19)
_______| ENDP(4) Dir(In) SeqN(14) EoB(Y) Stream ID(0x0000) PP(Not Pnd)
_______|   LCW  (Hseq:1) Data(1024 bytes) Duration(  2.117 us) Time(
3.980 us)
_______| Time Stamp(0 . 619 841 166)
_______|_______________________________________________________________________L

Packet(415376) Left("Left") Dir G1(x1) TP ACK(1) ADDR(19) ENDP(4)
_______| Dir(In) SeqN(15) NumP(0) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:3)
_______| Duration(40.100 ns) Time(444.000 ns) Time Stamp(0 . 619 845 146)
_______|_______________________________________________________________________
Transaction(7683) Right("Right") G1(x1) EP Ready ADDR(19) ENDP(4) Dir(IN)
_______| Time Stamp(0 . 619 845 590)
_______|_______________________________________________________________________R

Packet(415383) Right("Right") Dir G1(x1) TP ERDY(3) ADDR(19) ENDP(4)
_______| Dir(In) NumP(7) Stream ID(0x0000)   LCW  (Hseq:4)
Duration(40.100 ns)
_______| Time(  3.964 us) Time Stamp(0 . 619 845 590)
_______|_______________________________________________________________________
Transaction(7684) Left("Left") G1(x1) IN ADDR(19) ENDP(4) Condition(Flow
Ctrl)
_______| Data(1024 bytes) Time Stamp(0 . 619 849 554)
_______|_______________________________________________________________________L

Packet(415394) Left("Left") Dir G1(x1) TP ACK(1) ADDR(19) ENDP(4)
_______| Dir(In) SeqN(15) NumP(1) Stream ID(0x0000) PP(Pnd)   LCW  (Hseq:6)
_______| Duration(40.100 ns) Time(396.000 ns) Time Stamp(0 . 619 849 554)

Observations:
- Since the current setting has TXFIFO size to only have 1 max packet,
the device sets the EoB to yes during the data packet.
- Within a transfer, there are multiple "transactions" as Lecroy groups
transactions whenever there is a NRDY --> ERDY transition.
- Frequent NRDY --> ERDY transitions coming from the device


I hope this clears up what the TXFIFO resize is actually helping with.
It keeps the burst continuously going w/o having to do a NRDY-->ERDY
handshake. (which is unnecessary overhead)  Even though we allocate more
USB requests from the SW, the SW has no control over how the request is
sent over the link.

Thanks
Wesley Cheng

>>>>>>> Good points.
>>>>>>>
>>>>>>> Wesley, what kind of testing have you done on this on different devices?
>>>>>>>
>>>>>>
>>>>>> As mentioned above, these changes are currently present on end user
>>>>>> devices for the past few years, so its been through a lot of testing :).
>>>>>
>>>>> all with the same gadget driver. Also, who uses USB on android devices
>>>>> these days? Most of the data transfer goes via WiFi or Bluetooth, anyway
>>>>> :-)
>>>>>
>>>>> I guess only developers are using USB during development to flash dev
>>>>> images heh.
>>>>>
>>>>
>>>> I used to be a customer facing engineer, so honestly I did see some
>>>> really interesting and crazy designs.  Again, we do have non-Android
>>>> products that use the same code, and it has been working in there for a
>>>> few years as well.  The TXFIFO sizing really has helped with multimedia
>>>> use cases, which use isoc endpoints, since esp. in those lower end CPU
>>>> chips where latencies across the system are much larger, and a missed
>>>> ISOC interval leads to a pop in your ear.
>>>
>>> This is good background information. Thanks for bringing this
>>> up. Admitedly, we still have ISOC issues with dwc3. I'm interested in
>>> knowing if a deeper request queue would also help here.
>>>
>>> Remember dwc3 can accomodate 255 requests + link for each endpoint. If
>>> our gadget driver uses a low number of requests, we're never really
>>> using the TRB ring in our benefit.
>>>
>>
>> We're actually using both a deeper USB request queue + TX fifo resizing. :).
> 
> okay, great. Let's see what John and/or Thinh respond WRT dwc3 TX Burst
> behavior.
> 

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2021-07-01  1:09 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-19  7:49 [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Wesley Cheng
2021-05-19  7:49 ` [PATCH v9 1/5] usb: gadget: udc: core: Introduce check_config to verify USB configuration Wesley Cheng
2021-05-19  7:49 ` [PATCH v9 2/5] usb: gadget: configfs: Check USB configuration before adding Wesley Cheng
2021-05-19  7:49 ` [PATCH v9 3/5] usb: dwc3: Resize TX FIFOs to meet EP bursting requirements Wesley Cheng
2021-05-19  7:49 ` [PATCH v9 4/5] usb: dwc3: dwc3-qcom: Enable tx-fifo-resize property by default Wesley Cheng
2021-05-19  7:49 ` [PATCH v9 5/5] dt-bindings: usb: dwc3: Update dwc3 TX fifo properties Wesley Cheng
2021-06-04 11:54 ` [PATCH v9 0/5] Re-introduce TX FIFO resize for larger EP bursting Greg KH
2021-06-04 14:18   ` Felipe Balbi
2021-06-04 14:36     ` Greg KH
2021-06-08  5:44       ` Wesley Cheng
2021-06-10  9:20         ` Felipe Balbi
2021-06-10 10:03           ` Greg KH
2021-06-10 10:16             ` Felipe Balbi
2021-06-10 18:15           ` Wesley Cheng
2021-06-11  6:29             ` Felipe Balbi
2021-06-11  8:43               ` Wesley Cheng
2021-06-11 13:00                 ` Felipe Balbi
2021-06-11 13:14                   ` Heikki Krogerus
2021-06-11 13:21                     ` Andy Shevchenko
2021-06-12 21:27                       ` Ferry Toth
2021-06-12 21:37                         ` Andy Shevchenko
2021-06-14 18:58                         ` Wesley Cheng
2021-06-14 19:30                           ` Ferry Toth
2021-06-15  4:22                             ` Wesley Cheng
2021-06-15 19:53                               ` Ferry Toth
2021-06-17  4:25                                 ` Wesley Cheng
     [not found]                                   ` <fe834dbf-786a-2996-5c4b-1eac92e3ed18@gmail.com>
2021-06-17  8:30                                     ` Wesley Cheng
2021-06-17 19:54                                       ` Ferry Toth
2021-07-01  1:08                   ` Wesley Cheng
2021-06-07 16:04   ` Jack Pham
2021-06-08  5:07     ` Wesley Cheng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).