* [PATCH v2 1/3] bus: mhi: core: Add helper API to return number of free TREs
@ 2020-12-10 11:15 Loic Poulain
  2020-12-10 11:15 ` [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core Loic Poulain
  2020-12-10 11:15 ` [PATCH v2 3/3] net: mhi: Add dedicated alloc thread Loic Poulain
  0 siblings, 2 replies; 12+ messages in thread
From: Loic Poulain @ 2020-12-10 11:15 UTC (permalink / raw)
  To: kuba, davem
  Cc: manivannan.sadhasivam, linux-arm-msm, netdev, jhugo, Hemant Kumar

From: Hemant Kumar <hemantk@codeaurora.org>

Introduce the mhi_get_free_desc_count() API to return the number
of TREs available for queueing buffers. MHI clients can use this
API to know beforehand whether the ring is full, without calling
the queue API.

Signed-off-by: Hemant Kumar <hemantk@codeaurora.org>
Reviewed-by: Jeffrey Hugo <jhugo@codeaurora.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
---
 v2: no change

 drivers/bus/mhi/core/main.c | 12 ++++++++++++
 include/linux/mhi.h         |  9 +++++++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 54d9c80..a24ba4f 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -303,6 +303,18 @@ int mhi_destroy_device(struct device *dev, void *data)
 	return 0;
 }
 
+int mhi_get_free_desc_count(struct mhi_device *mhi_dev,
+				enum dma_data_direction dir)
+{
+	struct mhi_controller *mhi_cntrl = mhi_dev->mhi_cntrl;
+	struct mhi_chan *mhi_chan = (dir == DMA_TO_DEVICE) ?
+		mhi_dev->ul_chan : mhi_dev->dl_chan;
+	struct mhi_ring *tre_ring = &mhi_chan->tre_ring;
+
+	return get_nr_avail_ring_elements(mhi_cntrl, tre_ring);
+}
+EXPORT_SYMBOL_GPL(mhi_get_free_desc_count);
+
 void mhi_notify(struct mhi_device *mhi_dev, enum mhi_callback cb_reason)
 {
 	struct mhi_driver *mhi_drv;
diff --git a/include/linux/mhi.h b/include/linux/mhi.h
index 09f786e..25c69a0 100644
--- a/include/linux/mhi.h
+++ b/include/linux/mhi.h
@@ -616,6 +616,15 @@ void mhi_set_mhi_state(struct mhi_controller *mhi_cntrl,
 void mhi_notify(struct mhi_device *mhi_dev, enum mhi_callback cb_reason);
 
 /**
+ * mhi_get_free_desc_count - Get transfer ring length
+ * Get # of TD available to queue buffers
+ * @mhi_dev: Device associated with the channels
+ * @dir: Direction of the channel
+ */
+int mhi_get_free_desc_count(struct mhi_device *mhi_dev,
+				enum dma_data_direction dir);
+
+/**
  * mhi_prepare_for_power_up - Do pre-initialization before power up.
  *                            This is optional, call this before power up if
  *                            the controller does not want bus framework to
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread
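
For illustration, a minimal sketch of how an MHI client could use this new
helper to check for ring space before allocating and queueing a buffer. The
function below is hypothetical and not part of the patch; it only assumes
the mhi_get_free_desc_count() and mhi_queue_skb() prototypes shown in this
series.

    /* Hypothetical client helper: skip the allocation when the DL ring is full. */
    static int example_queue_rx_buf(struct mhi_device *mhi_dev,
                                    struct net_device *ndev, unsigned int size)
    {
            struct sk_buff *skb;
            int err;

            if (!mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE))
                    return -ENOSPC; /* ring full, try again later */

            skb = netdev_alloc_skb(ndev, size);
            if (!skb)
                    return -ENOMEM;

            /* The queue call can still fail for other reasons, so keep checking it. */
            err = mhi_queue_skb(mhi_dev, DMA_FROM_DEVICE, skb, size, MHI_EOT);
            if (err)
                    kfree_skb(skb);

            return err;
    }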

* [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core
  2020-12-10 11:15 [PATCH v2 1/3] bus: mhi: core: Add helper API to return number of free TREs Loic Poulain
@ 2020-12-10 11:15 ` Loic Poulain
  2020-12-11  5:38   ` Manivannan Sadhasivam
  2020-12-10 11:15 ` [PATCH v2 3/3] net: mhi: Add dedicated alloc thread Loic Poulain
  1 sibling, 1 reply; 12+ messages in thread
From: Loic Poulain @ 2020-12-10 11:15 UTC (permalink / raw)
  To: kuba, davem
  Cc: manivannan.sadhasivam, linux-arm-msm, netdev, jhugo, Loic Poulain

The RX queue size can be determined at runtime by retrieving the
number of available transfer descriptors.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
---
 v2: Fixed commit message typo

 drivers/net/mhi_net.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/mhi_net.c b/drivers/net/mhi_net.c
index 8e72d94..0333e07 100644
--- a/drivers/net/mhi_net.c
+++ b/drivers/net/mhi_net.c
@@ -256,9 +256,6 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
 	mhi_netdev->mdev = mhi_dev;
 	SET_NETDEV_DEV(ndev, &mhi_dev->dev);
 
-	/* All MHI net channels have 128 ring elements (at least for now) */
-	mhi_netdev->rx_queue_sz = 128;
-
 	INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);
 	u64_stats_init(&mhi_netdev->stats.rx_syncp);
 	u64_stats_init(&mhi_netdev->stats.tx_syncp);
@@ -268,6 +265,9 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
 	if (err)
 		goto out_err;
 
+	/* Number of transfer descriptors determines size of the queue */
+	mhi_netdev->rx_queue_sz = mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);
+
 	err = register_netdev(ndev);
 	if (err)
 		goto out_err;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v2 3/3] net: mhi: Add dedicated alloc thread
  2020-12-10 11:15 [PATCH v2 1/3] bus: mhi: core: Add helper API to return number of free TREs Loic Poulain
  2020-12-10 11:15 ` [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core Loic Poulain
@ 2020-12-10 11:15 ` Loic Poulain
  2020-12-12 20:55   ` Jakub Kicinski
  1 sibling, 1 reply; 12+ messages in thread
From: Loic Poulain @ 2020-12-10 11:15 UTC (permalink / raw)
  To: kuba, davem
  Cc: manivannan.sadhasivam, linux-arm-msm, netdev, jhugo, Loic Poulain

The buffer allocation for the RX path is currently done by a work executed
in the system workqueue. The work to do is quite simple and consists
mostly of allocating and queueing as many buffers as possible to the MHI
RX channel.

It appears that using a dedicated kthread would be more appropriate to
prevent:
1. RX allocation latency introduced by the system queue
2. Unbounded work execution: the work only returns when the queue is
full, so it can possibly monopolise the workqueue thread on slower systems.

This patch replaces the system work with a simple kthread that loops on
buffer allocation and sleeps when queue is full. Moreover it gets rid
of the local rx_queued variable (to track buffer count), and instead,
relies on the new mhi_get_free_desc_count helper.

After practical testing on an x86_64 machine, this change improves
- Peak throughput (slightly, by a few Mbps)
- Throughput stability when concurrent loads are running (stress)
- CPU usage, fewer CPU cycles dedicated to the task

Below is the powertop output for the RX allocation task before and after
this change, when performing UDP download at 6Gbps. Mostly to highlight
the improvement in terms of CPU usage.

older (system workqueue):
Usage       Events/s    Category       Description
63,2 ms/s     134,0        kWork          mhi_net_rx_refill_work
62,8 ms/s     134,3        kWork          mhi_net_rx_refill_work
60,8 ms/s     141,4        kWork          mhi_net_rx_refill_work

newer (dedicated kthread):
Usage       Events/s    Category       Description
20,7 ms/s     155,6        Process        [PID 3360] [mhi-net-rx]
22,2 ms/s     169,6        Process        [PID 3360] [mhi-net-rx]
22,3 ms/s     150,2        Process        [PID 3360] [mhi-net-rx]

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
---
 v2: add module parameter for changing RX refill level

 drivers/net/mhi_net.c | 110 ++++++++++++++++++++++++++++----------------------
 1 file changed, 62 insertions(+), 48 deletions(-)

diff --git a/drivers/net/mhi_net.c b/drivers/net/mhi_net.c
index 0333e07..bd66d51 100644
--- a/drivers/net/mhi_net.c
+++ b/drivers/net/mhi_net.c
@@ -5,6 +5,7 @@
  */
 
 #include <linux/if_arp.h>
+#include <linux/kthread.h>
 #include <linux/mhi.h>
 #include <linux/mod_devicetable.h>
 #include <linux/module.h>
@@ -16,6 +17,11 @@
 #define MHI_NET_MAX_MTU		0xffff
 #define MHI_NET_DEFAULT_MTU	0x4000
 
+static unsigned int rx_refill_level = 70;
+module_param(rx_refill_level, uint, 0600);
+MODULE_PARM_DESC(rx_refill_level,
+		 "The minimal RX queue level percentage (0 to 100) under which the RX queue must be refilled");
+
 struct mhi_net_stats {
 	u64_stats_t rx_packets;
 	u64_stats_t rx_bytes;
@@ -25,7 +31,6 @@ struct mhi_net_stats {
 	u64_stats_t tx_bytes;
 	u64_stats_t tx_errors;
 	u64_stats_t tx_dropped;
-	atomic_t rx_queued;
 	struct u64_stats_sync tx_syncp;
 	struct u64_stats_sync rx_syncp;
 };
@@ -33,17 +38,66 @@ struct mhi_net_stats {
 struct mhi_net_dev {
 	struct mhi_device *mdev;
 	struct net_device *ndev;
-	struct delayed_work rx_refill;
+	struct task_struct *refill_task;
+	wait_queue_head_t refill_wq;
 	struct mhi_net_stats stats;
 	u32 rx_queue_sz;
+	u32 rx_refill_level;
 };
 
+static int mhi_net_refill_thread(void *data)
+{
+	struct mhi_net_dev *mhi_netdev = data;
+	struct net_device *ndev = mhi_netdev->ndev;
+	struct mhi_device *mdev = mhi_netdev->mdev;
+	int size = READ_ONCE(ndev->mtu);
+	struct sk_buff *skb;
+	int err;
+
+	while (1) {
+		err = wait_event_interruptible(mhi_netdev->refill_wq,
+					       !mhi_queue_is_full(mdev, DMA_FROM_DEVICE)
+					       || kthread_should_stop());
+		if (err || kthread_should_stop())
+			break;
+
+		skb = netdev_alloc_skb(ndev, size);
+		if (unlikely(!skb)) {
+			/* No memory, retry later */
+			schedule_timeout_interruptible(msecs_to_jiffies(250));
+			continue;
+		}
+
+		err = mhi_queue_skb(mdev, DMA_FROM_DEVICE, skb, size, MHI_EOT);
+		if (unlikely(err)) {
+			net_err_ratelimited("%s: Failed to queue RX buf (%d)\n",
+					    ndev->name, err);
+			kfree_skb(skb);
+			break;
+		}
+
+		/* Do not hog the CPU */
+		cond_resched();
+	}
+
+	return 0;
+}
+
 static int mhi_ndo_open(struct net_device *ndev)
 {
 	struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
+	unsigned int qsz = mhi_netdev->rx_queue_sz;
 
-	/* Feed the rx buffer pool */
-	schedule_delayed_work(&mhi_netdev->rx_refill, 0);
+	if (rx_refill_level >= 100)
+		mhi_netdev->rx_refill_level = 1;
+	else
+		mhi_netdev->rx_refill_level = qsz - qsz * rx_refill_level / 100;
+
+	mhi_netdev->refill_task = kthread_run(mhi_net_refill_thread, mhi_netdev,
+					      "mhi-net-rx");
+	if (IS_ERR(mhi_netdev->refill_task)) {
+		return PTR_ERR(mhi_netdev->refill_task);
+	}
 
 	/* Carrier is established via out-of-band channel (e.g. qmi) */
 	netif_carrier_on(ndev);
@@ -57,9 +111,9 @@ static int mhi_ndo_stop(struct net_device *ndev)
 {
 	struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
 
+	kthread_stop(mhi_netdev->refill_task);
 	netif_stop_queue(ndev);
 	netif_carrier_off(ndev);
-	cancel_delayed_work_sync(&mhi_netdev->rx_refill);
 
 	return 0;
 }
@@ -138,9 +192,6 @@ static void mhi_net_dl_callback(struct mhi_device *mhi_dev,
 {
 	struct mhi_net_dev *mhi_netdev = dev_get_drvdata(&mhi_dev->dev);
 	struct sk_buff *skb = mhi_res->buf_addr;
-	int remaining;
-
-	remaining = atomic_dec_return(&mhi_netdev->stats.rx_queued);
 
 	if (unlikely(mhi_res->transaction_status)) {
 		dev_kfree_skb_any(skb);
@@ -163,9 +214,8 @@ static void mhi_net_dl_callback(struct mhi_device *mhi_dev,
 		netif_rx(skb);
 	}
 
-	/* Refill if RX buffers queue becomes low */
-	if (remaining <= mhi_netdev->rx_queue_sz / 2)
-		schedule_delayed_work(&mhi_netdev->rx_refill, 0);
+	if (mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE) >= mhi_netdev->rx_refill_level)
+		wake_up_interruptible(&mhi_netdev->refill_wq);
 }
 
 static void mhi_net_ul_callback(struct mhi_device *mhi_dev,
@@ -200,42 +250,6 @@ static void mhi_net_ul_callback(struct mhi_device *mhi_dev,
 		netif_wake_queue(ndev);
 }
 
-static void mhi_net_rx_refill_work(struct work_struct *work)
-{
-	struct mhi_net_dev *mhi_netdev = container_of(work, struct mhi_net_dev,
-						      rx_refill.work);
-	struct net_device *ndev = mhi_netdev->ndev;
-	struct mhi_device *mdev = mhi_netdev->mdev;
-	int size = READ_ONCE(ndev->mtu);
-	struct sk_buff *skb;
-	int err;
-
-	while (atomic_read(&mhi_netdev->stats.rx_queued) < mhi_netdev->rx_queue_sz) {
-		skb = netdev_alloc_skb(ndev, size);
-		if (unlikely(!skb))
-			break;
-
-		err = mhi_queue_skb(mdev, DMA_FROM_DEVICE, skb, size, MHI_EOT);
-		if (unlikely(err)) {
-			net_err_ratelimited("%s: Failed to queue RX buf (%d)\n",
-					    ndev->name, err);
-			kfree_skb(skb);
-			break;
-		}
-
-		atomic_inc(&mhi_netdev->stats.rx_queued);
-
-		/* Do not hog the CPU if rx buffers are consumed faster than
-		 * queued (unlikely).
-		 */
-		cond_resched();
-	}
-
-	/* If we're still starved of rx buffers, reschedule later */
-	if (unlikely(!atomic_read(&mhi_netdev->stats.rx_queued)))
-		schedule_delayed_work(&mhi_netdev->rx_refill, HZ / 2);
-}
-
 static int mhi_net_probe(struct mhi_device *mhi_dev,
 			 const struct mhi_device_id *id)
 {
@@ -256,7 +270,7 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
 	mhi_netdev->mdev = mhi_dev;
 	SET_NETDEV_DEV(ndev, &mhi_dev->dev);
 
-	INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);
+	init_waitqueue_head(&mhi_netdev->refill_wq);
 	u64_stats_init(&mhi_netdev->stats.rx_syncp);
 	u64_stats_init(&mhi_netdev->stats.tx_syncp);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 12+ messages in thread
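
To make the refill-level arithmetic in mhi_ndo_open() above concrete, here
is the computation with the module parameter default (70) and a 128-element
ring, the size mentioned elsewhere in this thread; the numbers are only an
example.

    /* qsz = 128 ring elements, rx_refill_level (module param) = 70 percent */
    rx_refill_level = qsz - qsz * rx_refill_level / 100;
    /*              = 128 - (128 * 70) / 100
     *              = 128 - 89
     *              = 39 free descriptors
     *
     * The DL callback then wakes the refill thread once
     * mhi_get_free_desc_count() returns 39 or more, i.e. once ring
     * occupancy drops to roughly 70% or below.
     */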

* Re: [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core
  2020-12-10 11:15 ` [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core Loic Poulain
@ 2020-12-11  5:38   ` Manivannan Sadhasivam
  2020-12-11  9:40     ` Loic Poulain
  0 siblings, 1 reply; 12+ messages in thread
From: Manivannan Sadhasivam @ 2020-12-11  5:38 UTC (permalink / raw)
  To: Loic Poulain; +Cc: kuba, davem, linux-arm-msm, netdev, jhugo

On Thu, Dec 10, 2020 at 12:15:50PM +0100, Loic Poulain wrote:
> The RX queue size can be determined at runtime by retrieving the
> number of available transfer descriptors.
> 
> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> ---
>  v2: Fixed commit message typo
> 
>  drivers/net/mhi_net.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/mhi_net.c b/drivers/net/mhi_net.c
> index 8e72d94..0333e07 100644
> --- a/drivers/net/mhi_net.c
> +++ b/drivers/net/mhi_net.c
> @@ -256,9 +256,6 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
>  	mhi_netdev->mdev = mhi_dev;
>  	SET_NETDEV_DEV(ndev, &mhi_dev->dev);
>  
> -	/* All MHI net channels have 128 ring elements (at least for now) */
> -	mhi_netdev->rx_queue_sz = 128;
> -
>  	INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);
>  	u64_stats_init(&mhi_netdev->stats.rx_syncp);
>  	u64_stats_init(&mhi_netdev->stats.tx_syncp);
> @@ -268,6 +265,9 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
>  	if (err)
>  		goto out_err;
>  
> +	/* Number of transfer descriptors determines size of the queue */
> +	mhi_netdev->rx_queue_sz = mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);
> +

This value is not static right? You might need to fetch the count in
mhi_net_rx_refill_work().

Thanks,
Mani

>  	err = register_netdev(ndev);
>  	if (err)
>  		goto out_err;
> -- 
> 2.7.4
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core
  2020-12-11  5:38   ` Manivannan Sadhasivam
@ 2020-12-11  9:40     ` Loic Poulain
  2020-12-11 10:15       ` Manivannan Sadhasivam
  0 siblings, 1 reply; 12+ messages in thread
From: Loic Poulain @ 2020-12-11  9:40 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Jakub Kicinski, David Miller, linux-arm-msm, Network Development,
	Jeffrey Hugo

Hi Mani,

On Fri, 11 Dec 2020 at 06:38, Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Thu, Dec 10, 2020 at 12:15:50PM +0100, Loic Poulain wrote:
> > The RX queue size can be determined at runtime by retrieving the
> > number of available transfer descriptors.
> >
> > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> > ---
> >  v2: Fixed commit message typo
> >
> >  drivers/net/mhi_net.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
[...]
> > -
> >       INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);
> >       u64_stats_init(&mhi_netdev->stats.rx_syncp);
> >       u64_stats_init(&mhi_netdev->stats.tx_syncp);
> > @@ -268,6 +265,9 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
> >       if (err)
> >               goto out_err;
> >
> > +     /* Number of transfer descriptors determines size of the queue */
> > +     mhi_netdev->rx_queue_sz = mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);
> > +
>
> This value is not static right? You might need to fetch the count in
> mhi_net_rx_refill_work().

It is; actually, here the driver is just looking for the total queue size,
which is the number of descriptors at init time. This total queue size
is used later to determine the MHI queue occupancy level.

Regards,
Loic

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core
  2020-12-11  9:40     ` Loic Poulain
@ 2020-12-11 10:15       ` Manivannan Sadhasivam
  2020-12-11 10:40         ` Loic Poulain
  0 siblings, 1 reply; 12+ messages in thread
From: Manivannan Sadhasivam @ 2020-12-11 10:15 UTC (permalink / raw)
  To: Loic Poulain
  Cc: Jakub Kicinski, David Miller, linux-arm-msm, Network Development,
	Jeffrey Hugo

On Fri, Dec 11, 2020 at 10:40:13AM +0100, Loic Poulain wrote:
> Hi Mani,
> 
> On Fri, 11 Dec 2020 at 06:38, Manivannan Sadhasivam
> <manivannan.sadhasivam@linaro.org> wrote:
> >
> > On Thu, Dec 10, 2020 at 12:15:50PM +0100, Loic Poulain wrote:
> > > The RX queue size can be determined at runtime by retrieving the
> > > number of available transfer descriptors.
> > >
> > > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> > > ---
> > >  v2: Fixed commit message typo
> > >
> > >  drivers/net/mhi_net.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> [...]
> > > -
> > >       INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);
> > >       u64_stats_init(&mhi_netdev->stats.rx_syncp);
> > >       u64_stats_init(&mhi_netdev->stats.tx_syncp);
> > > @@ -268,6 +265,9 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
> > >       if (err)
> > >               goto out_err;
> > >
> > > +     /* Number of transfer descriptors determines size of the queue */
> > > +     mhi_netdev->rx_queue_sz = mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);
> > > +
> >
> > This value is not static right? You might need to fetch the count in
> > mhi_net_rx_refill_work().
> 
> It is; actually, here the driver is just looking for the total queue size,
> which is the number of descriptors at init time. This total queue size
> is used later to determine the MHI queue occupancy level.
> 

Right, but what if the size gets increased at runtime (recycled etc.)? We won't
be fully utilizing the ring.

Thanks,
Mani

> Regards,
> Loic

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core
  2020-12-11 10:15       ` Manivannan Sadhasivam
@ 2020-12-11 10:40         ` Loic Poulain
  0 siblings, 0 replies; 12+ messages in thread
From: Loic Poulain @ 2020-12-11 10:40 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: Jakub Kicinski, David Miller, linux-arm-msm, Network Development,
	Jeffrey Hugo

On Fri, 11 Dec 2020 at 11:15, Manivannan Sadhasivam
<manivannan.sadhasivam@linaro.org> wrote:
>
> On Fri, Dec 11, 2020 at 10:40:13AM +0100, Loic Poulain wrote:
> > Hi Mani,
> >
> > On Fri, 11 Dec 2020 at 06:38, Manivannan Sadhasivam
> > <manivannan.sadhasivam@linaro.org> wrote:
> > >
> > > On Thu, Dec 10, 2020 at 12:15:50PM +0100, Loic Poulain wrote:
> > > > The RX queue size can be determined at runtime by retrieving the
> > > > number of available transfer descriptors.
> > > >
> > > > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> > > > ---
> > > >  v2: Fixed commit message typo
> > > >
> > > >  drivers/net/mhi_net.c | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > [...]
> > > > -
> > > >       INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);
> > > >       u64_stats_init(&mhi_netdev->stats.rx_syncp);
> > > >       u64_stats_init(&mhi_netdev->stats.tx_syncp);
> > > > @@ -268,6 +265,9 @@ static int mhi_net_probe(struct mhi_device *mhi_dev,
> > > >       if (err)
> > > >               goto out_err;
> > > >
> > > > +     /* Number of transfer descriptors determines size of the queue */
> > > > +     mhi_netdev->rx_queue_sz = mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);
> > > > +
> > >
> > > This value is not static right? You might need to fetch the count in
> > > mhi_net_rx_refill_work().
> >
> > It is; actually, here the driver is just looking for the total queue size,
> > which is the number of descriptors at init time. This total queue size
> > is used later to determine the MHI queue occupancy level.
> >
>
> Right, but what if the size gets increased at runtime (recycled etc.)? We won't
> be fully utilizing the ring.

The queue size cannot be more than the initial size, which is the
number of elements in the ring (e.g. 128). At runtime, the driver
calls mhi_get_free_desc_count() again in the DL callback to determine the
current number of available slots (e.g. 32). If this value is higher than
a certain limit (e.g. 128 / 2), then we start refilling the MHI RX queue
with fresh buffers to prevent buffer starvation.

Regards,
Loic

^ permalink raw reply	[flat|nested] 12+ messages in thread
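
As a worked example of the mechanism described above (values assumed for
illustration, matching the mail): the probe-time call returns the full ring
size because nothing is queued yet, and the DL callback later compares the
current free count against half of that total. The schedule_refill() name
below is a placeholder for whatever mechanism triggers the refill.

    /* Probe time: the ring is empty, so the free count equals the ring size. */
    mhi_netdev->rx_queue_sz =
            mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);      /* e.g. 128 */

    /* DL callback, at runtime: refill once half of the ring is free. */
    free = mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE);       /* e.g. 32 */
    if (free >= mhi_netdev->rx_queue_sz / 2)                        /* 32 >= 64: false */
            schedule_refill();      /* hypothetical refill trigger */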

* Re: [PATCH v2 3/3] net: mhi: Add dedicated alloc thread
  2020-12-10 11:15 ` [PATCH v2 3/3] net: mhi: Add dedicated alloc thread Loic Poulain
@ 2020-12-12 20:55   ` Jakub Kicinski
  2020-12-14  9:19     ` Loic Poulain
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2020-12-12 20:55 UTC (permalink / raw)
  To: Loic Poulain; +Cc: davem, manivannan.sadhasivam, linux-arm-msm, netdev, jhugo

On Thu, 10 Dec 2020 12:15:51 +0100 Loic Poulain wrote:
> The buffer allocation for the RX path is currently done by a work executed
> in the system workqueue. The work to do is quite simple and consists
> mostly of allocating and queueing as many buffers as possible to the MHI
> RX channel.
> 
> It appears that using a dedicated kthread would be more appropriate to
> prevent
> 1. RX allocation latency introduced by the system queue

System work queue should not add much latency, you can also create your
own workqueue. Did you intend to modify the priority of the thread you
create?

> 2. Unbounded work execution: the work only returns when the queue is
> full, so it can possibly monopolise the workqueue thread on slower systems.

Is this something you observed in practice?

> This patch replaces the system work with a simple kthread that loops on
> buffer allocation and sleeps when queue is full. Moreover it gets rid
> of the local rx_queued variable (to track buffer count), and instead,
> relies on the new mhi_get_free_desc_count helper.

Seems unrelated, should probably be a separate patch.

> After practical testing on an x86_64 machine, this change improves
> - Peak throughput (slightly, by a few Mbps)
> - Throughput stability when concurrent loads are running (stress)
> - CPU usage, fewer CPU cycles dedicated to the task

Do you have an explanation why the CPU cycles are lower?

> Below is the powertop output for the RX allocation task before and after
> this change, when performing UDP download at 6Gbps. Mostly to highlight
> the improvement in terms of CPU usage.
> 
> older (system workqueue):
> Usage       Events/s    Category       Description
> 63,2 ms/s     134,0        kWork          mhi_net_rx_refill_work
> 62,8 ms/s     134,3        kWork          mhi_net_rx_refill_work
> 60,8 ms/s     141,4        kWork          mhi_net_rx_refill_work
> 
> newer (dedicated kthread):
> Usage       Events/s    Category       Description
> 20,7 ms/s     155,6        Process        [PID 3360] [mhi-net-rx]
> 22,2 ms/s     169,6        Process        [PID 3360] [mhi-net-rx]
> 22,3 ms/s     150,2        Process        [PID 3360] [mhi-net-rx]
> 
> Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> ---
>  v2: add module parameter for changing RX refill level

> @@ -16,6 +17,11 @@
>  #define MHI_NET_MAX_MTU		0xffff
>  #define MHI_NET_DEFAULT_MTU	0x4000
>  
> +static unsigned int rx_refill_level = 70;
> +module_param(rx_refill_level, uint, 0600);
> +MODULE_PARM_DESC(rx_refill_level,
> +		 "The minimal RX queue level percentage (0 to 100) under which the RX queue must be refilled");

Sorry you got bad advice in v1 and I didn't catch it. Please avoid
adding module parameters. Many drivers do bulk refill and don't need
any extra parametrization, so I don't see why this one would be special -
if it is, please explain.

>  struct mhi_net_stats {
>  	u64_stats_t rx_packets;
>  	u64_stats_t rx_bytes;
> @@ -25,7 +31,6 @@ struct mhi_net_stats {
>  	u64_stats_t tx_bytes;
>  	u64_stats_t tx_errors;
>  	u64_stats_t tx_dropped;
> -	atomic_t rx_queued;
>  	struct u64_stats_sync tx_syncp;
>  	struct u64_stats_sync rx_syncp;
>  };
> @@ -33,17 +38,66 @@ struct mhi_net_stats {
>  struct mhi_net_dev {
>  	struct mhi_device *mdev;
>  	struct net_device *ndev;
> -	struct delayed_work rx_refill;
> +	struct task_struct *refill_task;
> +	wait_queue_head_t refill_wq;
>  	struct mhi_net_stats stats;
>  	u32 rx_queue_sz;
> +	u32 rx_refill_level;
>  };
>  
> +static int mhi_net_refill_thread(void *data)
> +{
> +	struct mhi_net_dev *mhi_netdev = data;
> +	struct net_device *ndev = mhi_netdev->ndev;
> +	struct mhi_device *mdev = mhi_netdev->mdev;
> +	int size = READ_ONCE(ndev->mtu);
> +	struct sk_buff *skb;
> +	int err;
> +
> +	while (1) {
> +		err = wait_event_interruptible(mhi_netdev->refill_wq,
> +					       !mhi_queue_is_full(mdev, DMA_FROM_DEVICE)
> +					       || kthread_should_stop());
> +		if (err || kthread_should_stop())
> +			break;
> +
> +		skb = netdev_alloc_skb(ndev, size);
> +		if (unlikely(!skb)) {
> +			/* No memory, retry later */
> +			schedule_timeout_interruptible(msecs_to_jiffies(250));

You should have a counter for this, at least for your testing. If this
condition is hit it'll probably have a large impact on the performance.

> +			continue;
> +		}
> +
> +		err = mhi_queue_skb(mdev, DMA_FROM_DEVICE, skb, size, MHI_EOT);
> +		if (unlikely(err)) {
> +			net_err_ratelimited("%s: Failed to queue RX buf (%d)\n",
> +					    ndev->name, err);
> +			kfree_skb(skb);
> +			break;
> +		}
> +
> +		/* Do not hog the CPU */
> +		cond_resched();
> +	}
> +
> +	return 0;
> +}
> +
>  static int mhi_ndo_open(struct net_device *ndev)
>  {
>  	struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> +	unsigned int qsz = mhi_netdev->rx_queue_sz;
>  
> -	/* Feed the rx buffer pool */
> -	schedule_delayed_work(&mhi_netdev->rx_refill, 0);
> +	if (rx_refill_level >= 100)
> +		mhi_netdev->rx_refill_level = 1;
> +	else
> +		mhi_netdev->rx_refill_level = qsz - qsz * rx_refill_level / 100;

So you're switching from 50% fill level to 70%. Are you sure that's not
the reason the performance gets better? Did you experiment with higher
fill levels?

> +	mhi_netdev->refill_task = kthread_run(mhi_net_refill_thread, mhi_netdev,
> +					      "mhi-net-rx");
> +	if (IS_ERR(mhi_netdev->refill_task)) {
> +		return PTR_ERR(mhi_netdev->refill_task);
> +	}
>  
>  	/* Carrier is established via out-of-band channel (e.g. qmi) */
>  	netif_carrier_on(ndev);
> @@ -57,9 +111,9 @@ static int mhi_ndo_stop(struct net_device *ndev)
>  {
>  	struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
>  
> +	kthread_stop(mhi_netdev->refill_task);
>  	netif_stop_queue(ndev);
>  	netif_carrier_off(ndev);
> -	cancel_delayed_work_sync(&mhi_netdev->rx_refill);
>  
>  	return 0;
>  }
> @@ -138,9 +192,6 @@ static void mhi_net_dl_callback(struct mhi_device *mhi_dev,
>  {
>  	struct mhi_net_dev *mhi_netdev = dev_get_drvdata(&mhi_dev->dev);
>  	struct sk_buff *skb = mhi_res->buf_addr;
> -	int remaining;
> -
> -	remaining = atomic_dec_return(&mhi_netdev->stats.rx_queued);
>  
>  	if (unlikely(mhi_res->transaction_status)) {
>  		dev_kfree_skb_any(skb);
> @@ -163,9 +214,8 @@ static void mhi_net_dl_callback(struct mhi_device *mhi_dev,
>  		netif_rx(skb);
>  	}
>  
> -	/* Refill if RX buffers queue becomes low */
> -	if (remaining <= mhi_netdev->rx_queue_sz / 2)
> -		schedule_delayed_work(&mhi_netdev->rx_refill, 0);
> +	if (mhi_get_free_desc_count(mhi_dev, DMA_FROM_DEVICE) >= mhi_netdev->rx_refill_level)
> +		wake_up_interruptible(&mhi_netdev->refill_wq);
>  }
>  
>  static void mhi_net_ul_callback(struct mhi_device *mhi_dev,


^ permalink raw reply	[flat|nested] 12+ messages in thread
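
For reference, a rough sketch of the "create your own workqueue" alternative
mentioned in the review, keeping the existing delayed work item; the rx_wq
field name is made up and this is not code from the series.

    /* probe: a dedicated workqueue instead of the system one */
    mhi_netdev->rx_wq = alloc_workqueue("mhi_net_rx", WQ_UNBOUND, 1);
    if (!mhi_netdev->rx_wq)
            return -ENOMEM;
    INIT_DELAYED_WORK(&mhi_netdev->rx_refill, mhi_net_rx_refill_work);

    /* wherever schedule_delayed_work() was used */
    queue_delayed_work(mhi_netdev->rx_wq, &mhi_netdev->rx_refill, 0);

    /* ndo_stop / remove */
    cancel_delayed_work_sync(&mhi_netdev->rx_refill);
    destroy_workqueue(mhi_netdev->rx_wq);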

* Re: [PATCH v2 3/3] net: mhi: Add dedicated alloc thread
  2020-12-12 20:55   ` Jakub Kicinski
@ 2020-12-14  9:19     ` Loic Poulain
  2020-12-14 19:47       ` Jakub Kicinski
  0 siblings, 1 reply; 12+ messages in thread
From: Loic Poulain @ 2020-12-14  9:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, Manivannan Sadhasivam, linux-arm-msm,
	Network Development, Jeffrey Hugo

Hi Jakub,

On Sat, 12 Dec 2020 at 21:55, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Dec 2020 12:15:51 +0100 Loic Poulain wrote:
> > The buffer allocation for the RX path is currently done by a work executed
> > in the system workqueue. The work to do is quite simple and consists
> > mostly of allocating and queueing as many buffers as possible to the MHI
> > RX channel.
> >
> > It appears that using a dedicated kthread would be more appropriate to
> > prevent
> > 1. RX allocation latency introduced by the system queue
>
> System work queue should not add much latency, you can also create your
> own workqueue. Did you intend to modify the priority of the thread you
> create?

No, and I don't, since I assume there is no reason to prioritize
network over other loads. I've considered the dedicated workqueue, but
since there is only one task to run as a while loop, I thought using a
kthread was more appropriate (and slightly lighter), but I can move to
that solution if you recommend it.

>
> > 2. Unbounded work execution: the work only returns when the queue is
> > full, so it can possibly monopolise the workqueue thread on slower systems.
>
> Is this something you observed in practice?

No, I've just observed that the work duration is inconsistent, queuing from
a few buffers to several hundred. This unbounded behavior makes me
feel that the shared system workqueue is probably not the
right place for it. I've not tested on a slower machine, though.

>
> > This patch replaces the system work with a simple kthread that loops on
> > buffer allocation and sleeps when queue is full. Moreover it gets rid
> > of the local rx_queued variable (to track buffer count), and instead,
> > relies on the new mhi_get_free_desc_count helper.
>
> Seems unrelated, should probably be a separate patch.

I can do that.

>
> > After practical testing on an x86_64 machine, this change improves
> > - Peak throughput (slightly, by a few Mbps)
> > - Throughput stability when concurrent loads are running (stress)
> > - CPU usage, fewer CPU cycles dedicated to the task
>
> Do you have an explanation why the CPU cycles are lower?

For CPU cycles, TBH, not really, this is just observational. Regarding
throughput stability, it's certainly because the work can consume all
its dedicated kthread time.

>
> > Below is the powertop output for the RX allocation task before and after
> > this change, when performing UDP download at 6Gbps. Mostly to highlight
> > the improvement in terms of CPU usage.
> >
> > older (system workqueue):
> > Usage       Events/s    Category       Description
> > 63,2 ms/s     134,0        kWork          mhi_net_rx_refill_work
> > 62,8 ms/s     134,3        kWork          mhi_net_rx_refill_work
> > 60,8 ms/s     141,4        kWork          mhi_net_rx_refill_work
> >
> > newer (dedicated kthread):
> > Usage       Events/s    Category       Description
> > 20,7 ms/s     155,6        Process        [PID 3360] [mhi-net-rx]
> > 22,2 ms/s     169,6        Process        [PID 3360] [mhi-net-rx]
> > 22,3 ms/s     150,2        Process        [PID 3360] [mhi-net-rx]
> >
> > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
> > ---
> >  v2: add module parameter for changing RX refill level
>
> > @@ -16,6 +17,11 @@
> >  #define MHI_NET_MAX_MTU              0xffff
> >  #define MHI_NET_DEFAULT_MTU  0x4000
> >
> > +static unsigned int rx_refill_level = 70;
> > +module_param(rx_refill_level, uint, 0600);
> > +MODULE_PARM_DESC(rx_refill_level,
> > +              "The minimal RX queue level percentage (0 to 100) under which the RX queue must be refilled");
>
> Sorry you got bad advice in v1 and I didn't catch it. Please avoid
> adding module parameters. Many drivers do bulk refill and don't need
> any extra parametrization, so I don't see why this one would be special -
> if it is, please explain.

Ok, going to revert that.

>
> >  struct mhi_net_stats {
> >       u64_stats_t rx_packets;
> >       u64_stats_t rx_bytes;
> > @@ -25,7 +31,6 @@ struct mhi_net_stats {
> >       u64_stats_t tx_bytes;
> >       u64_stats_t tx_errors;
> >       u64_stats_t tx_dropped;
> > -     atomic_t rx_queued;
> >       struct u64_stats_sync tx_syncp;
> >       struct u64_stats_sync rx_syncp;
> >  };
> > @@ -33,17 +38,66 @@ struct mhi_net_stats {
> >  struct mhi_net_dev {
> >       struct mhi_device *mdev;
> >       struct net_device *ndev;
> > -     struct delayed_work rx_refill;
> > +     struct task_struct *refill_task;
> > +     wait_queue_head_t refill_wq;
> >       struct mhi_net_stats stats;
> >       u32 rx_queue_sz;
> > +     u32 rx_refill_level;
> >  };
> >
> > +static int mhi_net_refill_thread(void *data)
> > +{
> > +     struct mhi_net_dev *mhi_netdev = data;
> > +     struct net_device *ndev = mhi_netdev->ndev;
> > +     struct mhi_device *mdev = mhi_netdev->mdev;
> > +     int size = READ_ONCE(ndev->mtu);
> > +     struct sk_buff *skb;
> > +     int err;
> > +
> > +     while (1) {
> > +             err = wait_event_interruptible(mhi_netdev->refill_wq,
> > +                                            !mhi_queue_is_full(mdev, DMA_FROM_DEVICE)
> > +                                            || kthread_should_stop());
> > +             if (err || kthread_should_stop())
> > +                     break;
> > +
> > +             skb = netdev_alloc_skb(ndev, size);
> > +             if (unlikely(!skb)) {
> > +                     /* No memory, retry later */
> > +                     schedule_timeout_interruptible(msecs_to_jiffies(250));
>
> You should have a counter for this, at least for your testing. If this
> condition is hit it'll probably have a large impact on the performance.

Indeed, going to do that; what about a ratelimited error? I assume if
it happens, the system is really in bad shape.

>
> > +                     continue;
> > +             }
> > +
> > +             err = mhi_queue_skb(mdev, DMA_FROM_DEVICE, skb, size, MHI_EOT);
> > +             if (unlikely(err)) {
> > +                     net_err_ratelimited("%s: Failed to queue RX buf (%d)\n",
> > +                                         ndev->name, err);
> > +                     kfree_skb(skb);
> > +                     break;
> > +             }
> > +
> > +             /* Do not hog the CPU */
> > +             cond_resched();
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> >  static int mhi_ndo_open(struct net_device *ndev)
> >  {
> >       struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> > +     unsigned int qsz = mhi_netdev->rx_queue_sz;
> >
> > -     /* Feed the rx buffer pool */
> > -     schedule_delayed_work(&mhi_netdev->rx_refill, 0);
> > +     if (rx_refill_level >= 100)
> > +             mhi_netdev->rx_refill_level = 1;
> > +     else
> > +             mhi_netdev->rx_refill_level = qsz - qsz * rx_refill_level / 100;
>
> So you're switching from 50% fill level to 70%. Are you sure that's not
> the reason the performance gets better? Did you experiment with higher
> fill levels?

No, I've tested both levels with the two solutions; it's just that
after experimenting, high throughput is a bit more stable with 70%. So I
can revert back to 50% to avoid confusion and keep that for a
subsequent change.

Thanks,
Loic

^ permalink raw reply	[flat|nested] 12+ messages in thread
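
A possible shape for the allocation-failure accounting discussed above; the
rx_alloc_failed field is an assumption, not something posted in this thread.

    /* hypothetical new field in struct mhi_net_stats */
    u64_stats_t rx_alloc_failed;

    /* in the refill loop, on netdev_alloc_skb() failure */
    if (unlikely(!skb)) {
            u64_stats_update_begin(&mhi_netdev->stats.rx_syncp);
            u64_stats_inc(&mhi_netdev->stats.rx_alloc_failed);
            u64_stats_update_end(&mhi_netdev->stats.rx_syncp);
            /* no memory, retry later */
            schedule_timeout_interruptible(msecs_to_jiffies(250));
            continue;
    }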

* Re: [PATCH v2 3/3] net: mhi: Add dedicated alloc thread
  2020-12-14  9:19     ` Loic Poulain
@ 2020-12-14 19:47       ` Jakub Kicinski
  2020-12-15 12:09         ` Loic Poulain
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2020-12-14 19:47 UTC (permalink / raw)
  To: Loic Poulain
  Cc: David Miller, Manivannan Sadhasivam, linux-arm-msm,
	Network Development, Jeffrey Hugo

On Mon, 14 Dec 2020 10:19:07 +0100 Loic Poulain wrote:
> On Sat, 12 Dec 2020 at 21:55, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Thu, 10 Dec 2020 12:15:51 +0100 Loic Poulain wrote:  
> > > The buffer allocation for the RX path is currently done by a work executed
> > > in the system workqueue. The work to do is quite simple and consists
> > > mostly of allocating and queueing as many buffers as possible to the MHI
> > > RX channel.
> > >
> > > It appears that using a dedicated kthread would be more appropriate to
> > > prevent
> > > 1. RX allocation latency introduced by the system queue  
> >
> > System work queue should not add much latency, you can also create your
> > own workqueue. Did you intend to modify the priority of the thread you
> > create?  
> 
> No, and I don't, since I assume there is no reason to prioritize
> network over other loads. I've considered the dedicated workqueue, but
> since there is only one task to run as a while loop, I thought using a
> kthread was more appropriate (and slightly lighter), but I can move to
> that solution if you recommend it.

Not sure what to recommend TBH, if thread works better for you that's
fine. I don't understand why the thread would work better, tho. I was
just checking if there is any extra tuning that happens.

> > > 2. Unbounded work execution: the work only returns when the queue is
> > > full, so it can possibly monopolise the workqueue thread on slower systems.
> >
> > Is this something you observed in practice?  
> 
> No, I've just observed that the work duration is inconsistent, queuing from
> a few buffers to several hundred. This unbounded behavior makes me
> feel that the shared system workqueue is probably not the
> right place for it. I've not tested on a slower machine, though.

I think long running work should not be an issue for the cmwq
implementation we have in the kernel.

Several hundred buffers means it's running concurrently with RX, right?
Since the NIC queue is 128 buffers.

> > > This patch replaces the system work with a simple kthread that loops on
> > > buffer allocation and sleeps when queue is full. Moreover it gets rid
> > > of the local rx_queued variable (to track buffer count), and instead,
> > > relies on the new mhi_get_free_desc_count helper.  
> >
> > Seems unrelated, should probably be a separate patch.  
> 
> I can do that.
> 
> >  
> > > After practical testing on an x86_64 machine, this change improves
> > > - Peak throughput (slightly, by a few Mbps)
> > > - Throughput stability when concurrent loads are running (stress)
> > > - CPU usage, fewer CPU cycles dedicated to the task
> >
> > Do you have an explanation why the CPU cycles are lower?  
> 
> For CPU cycles, TBH, not really, this is just observational.

Is the IRQ pinned? I wonder how often the work runs on the same CPU as IRQ
processing and how often the thread does.

> Regarding throughput stability, it's certainly because the work can
> consume all its dedicated kthread time.

Meaning workqueue implementation doesn't get enough CPU? Strange.

> > > Below is the powertop output for the RX allocation task before and
> > > after this change, when performing UDP download at 6Gbps. Mostly
> > > to highlight the improvement in terms of CPU usage.
> > >
> > > older (system workqueue):
> > > Usage       Events/s    Category       Description
> > > 63,2 ms/s     134,0        kWork          mhi_net_rx_refill_work
> > > 62,8 ms/s     134,3        kWork          mhi_net_rx_refill_work
> > > 60,8 ms/s     141,4        kWork          mhi_net_rx_refill_work
> > >
> > > newer (dedicated kthread):
> > > Usage       Events/s    Category       Description
> > > 20,7 ms/s     155,6        Process        [PID 3360] [mhi-net-rx]
> > > 22,2 ms/s     169,6        Process        [PID 3360] [mhi-net-rx]
> > > 22,3 ms/s     150,2        Process        [PID 3360] [mhi-net-rx]
> > >
> > > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>

> > > +             skb = netdev_alloc_skb(ndev, size);
> > > +             if (unlikely(!skb)) {
> > > +                     /* No memory, retry later */
> > > +
> > > schedule_timeout_interruptible(msecs_to_jiffies(250));  
> >
> > You should have a counter for this, at least for your testing. If
> > this condition is hit it'll probably have a large impact on the
> > performance.  
> 
> Indeed, going to do that; what about a ratelimited error? I assume if
> it happens, the system is really in bad shape.

It's not that uncommon to run out of memory for a 2k allocation in an
atomic context (note that netdev_alloc_skb() uses GFP_ATOMIC).
You can add a rate-limited print if you want, tho.

> > > +                     continue;
> > > +             }
> > > +
> > > +             err = mhi_queue_skb(mdev, DMA_FROM_DEVICE, skb,
> > > size, MHI_EOT);
> > > +             if (unlikely(err)) {
> > > +                     net_err_ratelimited("%s: Failed to queue RX
> > > buf (%d)\n",
> > > +                                         ndev->name, err);
> > > +                     kfree_skb(skb);
> > > +                     break;
> > > +             }
> > > +
> > > +             /* Do not hog the CPU */
> > > +             cond_resched();
> > > +     }
> > > +
> > > +     return 0;
> > > +}
> > > +
> > >  static int mhi_ndo_open(struct net_device *ndev)
> > >  {
> > >       struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> > > +     unsigned int qsz = mhi_netdev->rx_queue_sz;
> > >
> > > -     /* Feed the rx buffer pool */
> > > -     schedule_delayed_work(&mhi_netdev->rx_refill, 0);
> > > +     if (rx_refill_level >= 100)
> > > +             mhi_netdev->rx_refill_level = 1;
> > > +     else
> > > +             mhi_netdev->rx_refill_level = qsz - qsz *
> > > rx_refill_level / 100;  
> >
> > So you're switching from 50% fill level to 70%. Are you sure that's
> > not the reason the performance gets better? Did you experiment
> > with higher fill levels?  
> 
> No, I've tested both levels with the two solutions; it's just that
> after experimenting, high throughput is a bit more stable with 70%. So I
> can revert back to 50% to avoid confusion and keep that for a
> subsequent change.

I'm not fussed about that - it would be good tho to have the numbers in
comparisons for the same fill levels. Otherwise comparing workq at 50%
vs thread at 70% is changing two variables at the same time.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v2 3/3] net: mhi: Add dedicated alloc thread
  2020-12-14 19:47       ` Jakub Kicinski
@ 2020-12-15 12:09         ` Loic Poulain
  0 siblings, 0 replies; 12+ messages in thread
From: Loic Poulain @ 2020-12-15 12:09 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David Miller, Manivannan Sadhasivam, linux-arm-msm,
	Network Development, Jeffrey Hugo

Hi Jakub,

On Mon, 14 Dec 2020 at 20:47, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 14 Dec 2020 10:19:07 +0100 Loic Poulain wrote:
> > On Sat, 12 Dec 2020 at 21:55, Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Thu, 10 Dec 2020 12:15:51 +0100 Loic Poulain wrote:
> > > > The buffer allocation for the RX path is currently done by a work executed
> > > > in the system workqueue. The work to do is quite simple and consists
> > > > mostly of allocating and queueing as many buffers as possible to the MHI
> > > > RX channel.
> > > >
> > > > It appears that using a dedicated kthread would be more appropriate to
> > > > prevent
> > > > 1. RX allocation latency introduced by the system queue
> > >
> > > System work queue should not add much latency, you can also create your
> > > own workqueue. Did you intend to modify the priority of the thread you
> > > create?
> >
> > No, and I don't, since I assume there is no reason to prioritize
> > network over other loads. I've considered the dedicated workqueue, but
> > since there is only one task to run as a while loop, I thought using a
> > kthread was more appropriate (and slightly lighter), but I can move to
> > that solution if you recommend it.
>
> Not sure what to recommend TBH, if thread works better for you that's
> fine. I don't understand why the thread would work better, tho. I was
> just checking if there is any extra tuning that happens.
>
> > > > 2. Unbounded work execution: the work only returns when the queue is
> > > > full, so it can possibly monopolise the workqueue thread on slower systems.
> > >
> > > Is this something you observed in practice?
> >
> > No, I've just observed that the work duration is inconsistent, queuing from
> > a few buffers to several hundred. This unbounded behavior makes me
> > feel that the shared system workqueue is probably not the
> > right place for it. I've not tested on a slower machine, though.
>
> I think long running work should not be an issue for the cmwq
> implementation we have in the kernel.
>
> Several hundred buffers means it's running concurrently with RX, right?
> Since the NIC queue is 128 buffers.

Exactly, buffers can be completed by the hardware before we have even
finished filling the MHI ring buffer, which is why the loop can
queue more than 128 buffers.

> > > > This patch replaces the system work with a simple kthread that loops on
> > > > buffer allocation and sleeps when queue is full. Moreover it gets rid
> > > > of the local rx_queued variable (to track buffer count), and instead,
> > > > relies on the new mhi_get_free_desc_count helper.
> > >
> > > Seems unrelated, should probably be a separate patch.
> >
> > I can do that.
> >
> > >
> > > > After practical testing on an x86_64 machine, this change improves
> > > > - Peak throughput (slightly, by a few Mbps)
> > > > - Throughput stability when concurrent loads are running (stress)
> > > > - CPU usage, fewer CPU cycles dedicated to the task
> > >
> > > Do you have an explanation why the CPU cycles are lower?
> >
> > For CPU cycles, TBH, not really, this is just observational.
>
> Is the IRQ pinned? I wonder how often the work runs on the same CPU as IRQ
> processing and how often the thread does.
>
> > Regarding throughput stability, it's certainly because the work can
> > consume all its dedicated kthread time.
>
> Meaning workqueue implementation doesn't get enough CPU? Strange.
>
> > > > Below is the powertop output for the RX allocation task before and
> > > > after this change, when performing UDP download at 6Gbps. Mostly
> > > > to highlight the improvement in terms of CPU usage.
> > > >
> > > > older (system workqueue):
> > > > Usage       Events/s    Category       Description
> > > > 63,2 ms/s     134,0        kWork          mhi_net_rx_refill_work
> > > > 62,8 ms/s     134,3        kWork          mhi_net_rx_refill_work
> > > > 60,8 ms/s     141,4        kWork          mhi_net_rx_refill_work
> > > >
> > > > newer (dedicated kthread):
> > > > Usage       Events/s    Category       Description
> > > > 20,7 ms/s     155,6        Process        [PID 3360] [mhi-net-rx]
> > > > 22,2 ms/s     169,6        Process        [PID 3360] [mhi-net-rx]
> > > > 22,3 ms/s     150,2        Process        [PID 3360] [mhi-net-rx]
> > > >
> > > > Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
>
> > > > +             skb = netdev_alloc_skb(ndev, size);
> > > > +             if (unlikely(!skb)) {
> > > > +                     /* No memory, retry later */
> > > > +
> > > > schedule_timeout_interruptible(msecs_to_jiffies(250));
> > >
> > > You should have a counter for this, at least for your testing. If
> > > this condition is hit it'll probably have a large impact on the
> > > performance.
> >
> > Indeed, going to do that; what about a ratelimited error? I assume if
> > it happens, the system is really in bad shape.
>
> It's not that uncommon to run out of memory for a 2k allocation in an
> atomic context (note that netdev_alloc_skb() uses GFP_ATOMIC).
> You can add a rate-limited print if you want, tho.
>
> > > > +                     continue;
> > > > +             }
> > > > +
> > > > +             err = mhi_queue_skb(mdev, DMA_FROM_DEVICE, skb,
> > > > size, MHI_EOT);
> > > > +             if (unlikely(err)) {
> > > > +                     net_err_ratelimited("%s: Failed to queue RX
> > > > buf (%d)\n",
> > > > +                                         ndev->name, err);
> > > > +                     kfree_skb(skb);
> > > > +                     break;
> > > > +             }
> > > > +
> > > > +             /* Do not hog the CPU */
> > > > +             cond_resched();
> > > > +     }
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > >  static int mhi_ndo_open(struct net_device *ndev)
> > > >  {
> > > >       struct mhi_net_dev *mhi_netdev = netdev_priv(ndev);
> > > > +     unsigned int qsz = mhi_netdev->rx_queue_sz;
> > > >
> > > > -     /* Feed the rx buffer pool */
> > > > -     schedule_delayed_work(&mhi_netdev->rx_refill, 0);
> > > > +     if (rx_refill_level >= 100)
> > > > +             mhi_netdev->rx_refill_level = 1;
> > > > +     else
> > > > +             mhi_netdev->rx_refill_level = qsz - qsz *
> > > > rx_refill_level / 100;
> > >
> > > So you're switching from 50% fill level to 70%. Are you sure that's
> > > not the reason the performance gets better? Did you experiment
> > > with higher fill levels?
> >
> > No, I've tested both levels with the two solutions; it's just that
> > after experimenting, high throughput is a bit more stable with 70%. So I
> > can revert back to 50% to avoid confusion and keep that for a
> > subsequent change.
>
> I'm not fussed about that - it would be good tho to have the numbers in
> comparisons for the same fill levels. Otherwise comparing workq at 50%
> vs thread at 70% is changing two variables at the same time.

Yes, anyway, I'm going to drop the new kthread from the series, and
I'll resubmit once I have consolidated numbers with a proper comparison.

Thanks,
Loic

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v2 1/3] bus: mhi: core: Add helper API to return number of free TREs
  2020-05-22  2:51 [PATCH v2 0/3] user space client interface driver Hemant Kumar
@ 2020-05-22  2:51 ` Hemant Kumar
  0 siblings, 0 replies; 12+ messages in thread
From: Hemant Kumar @ 2020-05-22  2:51 UTC (permalink / raw)
  To: manivannan.sadhasivam
  Cc: linux-arm-msm, linux-kernel, jhugo, bbhatt, Hemant Kumar

Introduce the mhi_get_no_free_descriptors() API to return the number
of TREs available for queueing buffers. MHI clients can use this
API to know beforehand whether the ring is full, without calling
the queue API.

Signed-off-by: Hemant Kumar <hemantk@codeaurora.org>
---
 drivers/bus/mhi/core/main.c | 12 ++++++++++++
 include/linux/mhi.h         |  9 +++++++++
 2 files changed, 21 insertions(+)

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index d25f321..1bd3b1e 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -258,6 +258,18 @@ int mhi_destroy_device(struct device *dev, void *data)
 	return 0;
 }
 
+int mhi_get_no_free_descriptors(struct mhi_device *mhi_dev,
+				enum dma_data_direction dir)
+{
+	struct mhi_controller *mhi_cntrl = mhi_dev->mhi_cntrl;
+	struct mhi_chan *mhi_chan = (dir == DMA_TO_DEVICE) ?
+		mhi_dev->ul_chan : mhi_dev->dl_chan;
+	struct mhi_ring *tre_ring = &mhi_chan->tre_ring;
+
+	return get_nr_avail_ring_elements(mhi_cntrl, tre_ring);
+}
+EXPORT_SYMBOL_GPL(mhi_get_no_free_descriptors);
+
 void mhi_notify(struct mhi_device *mhi_dev, enum mhi_callback cb_reason)
 {
 	struct mhi_driver *mhi_drv;
diff --git a/include/linux/mhi.h b/include/linux/mhi.h
index 6af6bd6..a39b77d 100644
--- a/include/linux/mhi.h
+++ b/include/linux/mhi.h
@@ -602,6 +602,15 @@ void mhi_set_mhi_state(struct mhi_controller *mhi_cntrl,
 void mhi_notify(struct mhi_device *mhi_dev, enum mhi_callback cb_reason);
 
 /**
+ * mhi_get_no_free_descriptors - Get transfer ring length
+ * Get # of TD available to queue buffers
+ * @mhi_dev: Device associated with the channels
+ * @dir: Direction of the channel
+ */
+int mhi_get_no_free_descriptors(struct mhi_device *mhi_dev,
+				enum dma_data_direction dir);
+
+/**
  * mhi_prepare_for_power_up - Do pre-initialization before power up.
  *                            This is optional, call this before power up if
  *                            the controller does not want bus framework to
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


^ permalink raw reply related	[flat|nested] 12+ messages in thread

Thread overview: 12+ messages
2020-12-10 11:15 [PATCH v2 1/3] bus: mhi: core: Add helper API to return number of free TREs Loic Poulain
2020-12-10 11:15 ` [PATCH v2 2/3] net: mhi: Get RX queue size from MHI core Loic Poulain
2020-12-11  5:38   ` Manivannan Sadhasivam
2020-12-11  9:40     ` Loic Poulain
2020-12-11 10:15       ` Manivannan Sadhasivam
2020-12-11 10:40         ` Loic Poulain
2020-12-10 11:15 ` [PATCH v2 3/3] net: mhi: Add dedicated alloc thread Loic Poulain
2020-12-12 20:55   ` Jakub Kicinski
2020-12-14  9:19     ` Loic Poulain
2020-12-14 19:47       ` Jakub Kicinski
2020-12-15 12:09         ` Loic Poulain
  -- strict thread matches above, loose matches on Subject: below --
2020-05-22  2:51 [PATCH v2 0/3] user space client interface driver Hemant Kumar
2020-05-22  2:51 ` [PATCH v2 1/3] bus: mhi: core: Add helper API to return number of free TREs Hemant Kumar
