From: Jakub Kicinski <jakub.kicinski@netronome.com>
To: Zhu Yanjun <yanjun.zhu@oracle.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net
Subject: Re: [PATCH 1/2] forcedeth: add recv cache make nic work steadily
Date: Mon, 8 Jul 2019 13:52:57 -0700	[thread overview]
Message-ID: <20190708135257.18200316@cakuba.netronome.com> (raw)
In-Reply-To: <1562307568-21549-2-git-send-email-yanjun.zhu@oracle.com>

On Fri,  5 Jul 2019 02:19:27 -0400, Zhu Yanjun wrote:
> A recv cache is added. The size of the recv cache is 1000Mb /
> skb_length, i.e. enough skbs to cover 125MB of receive buffers.
> When system memory is low, this recv cache keeps the nic working
> steadily.
> When the nic is up, the recv cache and work queue are created. When
> the nic is down, the recv cache is destroyed and the delayed work
> is canceled.
> When the nic is polled or an rx interrupt is triggered, the rx
> handler takes an skb from the recv cache and then checks the state
> of the cache. If the recv cache is not currently being filled up, a
> work item is queued to refill it.
> When the skb size changes, the old recv cache is destroyed and a
> new one is created.
> When system memory is low, skb allocation fails; the recv cache
> worker keeps retrying until the cache is filled up again. This
> keeps the nic working steadily under memory pressure. Because of
> the recv cache, the performance of the nic is also enhanced.
> 
> CC: Joe Jin <joe.jin@oracle.com>
> CC: Junxiao Bi <junxiao.bi@oracle.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>

Could you tell us a little more about the use case and the system
conditions?

> diff --git a/drivers/net/ethernet/nvidia/forcedeth.c b/drivers/net/ethernet/nvidia/forcedeth.c
> index b327b29..a673005 100644
> --- a/drivers/net/ethernet/nvidia/forcedeth.c
> +++ b/drivers/net/ethernet/nvidia/forcedeth.c
> @@ -674,6 +674,11 @@ struct nv_ethtool_stats {
>  	u64 tx_broadcast;
>  };
>  
> +/* 1000Mb is 125M bytes, 125 * 1024 * 1024 bytes
> + * The length of recv cache is 125M / skb_length
> + */
> +#define RECV_CACHE_LIST_LENGTH		(125 * 1024 * 1024 / np->rx_buf_sz)
> +
>  #define NV_DEV_STATISTICS_V3_COUNT (sizeof(struct nv_ethtool_stats)/sizeof(u64))
>  #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
>  #define NV_DEV_STATISTICS_V1_COUNT (NV_DEV_STATISTICS_V2_COUNT - 6)
> @@ -844,8 +849,18 @@ struct fe_priv {
>  	char name_rx[IFNAMSIZ + 3];       /* -rx    */
>  	char name_tx[IFNAMSIZ + 3];       /* -tx    */
>  	char name_other[IFNAMSIZ + 6];    /* -other */
> +
> +	/* This is to schedule work */
> +	struct delayed_work     recv_cache_work;
> +	/* This list is to store skb queue for recv */
> +	struct sk_buff_head recv_list;
> +	unsigned long nv_recv_list_state;
>  };
>  
> +/* This is recv list state to fill up recv cache */
> +enum recv_list_state {
> +	RECV_LIST_ALLOCATE
> +};
>  /*
>   * Maximum number of loops until we assume that a bit in the irq mask
>   * is stuck. Overridable with module param.
> @@ -1804,7 +1819,11 @@ static int nv_alloc_rx(struct net_device *dev)
>  		less_rx = np->last_rx.orig;
>  
>  	while (np->put_rx.orig != less_rx) {
> -		struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + NV_RX_ALLOC_PAD);
> +		struct sk_buff *skb = skb_dequeue(&np->recv_list);
> +
> +		if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
> +			schedule_delayed_work(&np->recv_cache_work, 0);

Interesting, this seems to be coming up in multiple places recently...

Could you explain why you have your own RECV_LIST_ALLOCATE bit here?
The workqueue implementation itself uses an atomic bit to avoid
scheduling work multiple times:

bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
			   struct delayed_work *dwork, unsigned long delay)
{
	struct work_struct *work = &dwork->work;
	bool ret = false;
	unsigned long flags;

	/* read the comment in __queue_work() */
	local_irq_save(flags);

	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
		__queue_delayed_work(cpu, wq, dwork, delay);
		ret = true;
	}

	local_irq_restore(flags);
	return ret;
}
EXPORT_SYMBOL(queue_delayed_work_on);
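
IOW if the goal is just to avoid queuing the work twice, a plain
schedule_delayed_work() call should be enough (untested sketch):

	while (np->put_rx.orig != less_rx) {
		struct sk_buff *skb = skb_dequeue(&np->recv_list);

		/* schedule_delayed_work() returns false and does
		 * nothing if the work is already pending, so no
		 * driver-private flag is needed here
		 */
		schedule_delayed_work(&np->recv_cache_work, 0);

Unless the bit is meant to also cover the window while the worker is
already running (PENDING is cleared once the work starts executing);
if that's the intent, a comment explaining it would help.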

>  		if (likely(skb)) {
>  			np->put_rx_ctx->skb = skb;
>  			np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
> @@ -1845,7 +1864,11 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
>  		less_rx = np->last_rx.ex;
>  
>  	while (np->put_rx.ex != less_rx) {
> -		struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + NV_RX_ALLOC_PAD);
> +		struct sk_buff *skb = skb_dequeue(&np->recv_list);
> +
> +		if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
> +			schedule_delayed_work(&np->recv_cache_work, 0);

It seems a little heavy to schedule this work on every packet; would
it make sense to do this in nv_napi_poll() instead?
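
E.g. a tiny helper called once per poll (hypothetical name, just a
sketch):

	/* top up the recv cache once per napi poll instead of
	 * once per received packet
	 */
	static void nv_refill_recv_cache(struct fe_priv *np)
	{
		if (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH)
			schedule_delayed_work(&np->recv_cache_work, 0);
	}

called from nv_napi_poll() before re-enabling interrupts.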

>  		if (likely(skb)) {
>  			np->put_rx_ctx->skb = skb;
>  			np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
> @@ -1957,6 +1980,40 @@ static void nv_init_tx(struct net_device *dev)
>  	}
>  }
>  
> +static void nv_init_recv_cache(struct net_device *dev)
> +{
> +	struct fe_priv *np = netdev_priv(dev);
> +
> +	skb_queue_head_init(&np->recv_list);
> +	while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
> +		struct sk_buff *skb = netdev_alloc_skb(dev,
> +				 np->rx_buf_sz + NV_RX_ALLOC_PAD);
> +		/* skb is null. This indicates that memory is not
> +		 * enough.
> +		 */
> +		if (unlikely(!skb)) {
> +			ndelay(3);
> +			continue;

Does this path ever get hit?  Busy-waiting in ndelay() and retrying
the allocation is not the best idea, especially on a non-preempt
kernel.

Also perhaps you should consider using __netdev_alloc_skb() and
passing GFP_KERNEL; that way the system has a chance to go into
memory reclaim (I presume this function can sleep).
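
Something like (untested):

	struct sk_buff *skb = __netdev_alloc_skb(dev,
						 np->rx_buf_sz +
						 NV_RX_ALLOC_PAD,
						 GFP_KERNEL);

With GFP_KERNEL the allocator can enter direct reclaim and sleep,
rather than the driver busy-waiting on a failed allocation.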

> +		}
> +
> +		skb_queue_tail(&np->recv_list, skb);
> +	}
> +}
> +
> +static void nv_destroy_recv_cache(struct net_device *dev)
> +{
> +	struct sk_buff *skb;
> +	struct fe_priv *np = netdev_priv(dev);
> +
> +	cancel_delayed_work_sync(&np->recv_cache_work);
> +	WARN_ON(delayed_work_pending(&np->recv_cache_work));
> +
> +	while ((skb = skb_dequeue(&np->recv_list)))
> +		kfree_skb(skb);

skb_queue_purge()
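
i.e. the dequeue loop and the WARN_ON() below can be replaced with:

	skb_queue_purge(&np->recv_list);

which frees every skb on the list and leaves it empty.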

> +	WARN_ON(skb_queue_len(&np->recv_list));
> +}
> +
>  static int nv_init_ring(struct net_device *dev)
>  {
>  	struct fe_priv *np = netdev_priv(dev);
> @@ -3047,6 +3104,8 @@ static int nv_change_mtu(struct net_device *dev, int new_mtu)
>  		nv_drain_rxtx(dev);
>  		/* reinit driver view of the rx queue */
>  		set_bufsize(dev);
> +		nv_destroy_recv_cache(dev);
> +		nv_init_recv_cache(dev);
>  		if (nv_init_ring(dev)) {
>  			if (!np->in_shutdown)
>  				mod_timer(&np->oom_kick, jiffies + OOM_REFILL);
> @@ -4074,6 +4133,32 @@ static void nv_free_irq(struct net_device *dev)
>  	}
>  }
>  
> +static void nv_recv_cache_worker(struct work_struct *work)
> +{
> +	struct fe_priv *np = container_of(work, struct fe_priv,
> +					  recv_cache_work.work);
> +
> +	set_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state);
> +	while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
> +		struct sk_buff *skb = netdev_alloc_skb(np->dev,
> +				np->rx_buf_sz + NV_RX_ALLOC_PAD);
> +		/* skb is null. This indicates that memory is not
> +		 * enough.
> +		 * When the system memory is not enough, the kernel
> +		 * will compact memory or drop caches. At that time,
> +		 * if memory allocation fails, it had better wait some
> +		 * time for memory.
> +		 */
> +		if (unlikely(!skb)) {
> +			ndelay(3);
> +			continue;

Same comments as for the init function.
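
If you keep a retry at all, note this worker runs in process context,
so you can sleep instead of busy-waiting; ndelay(3) is a 3 ns spin,
which is effectively a no-op.  E.g. (sketch):

	if (unlikely(!skb)) {
		/* let reclaim make progress instead of spinning */
		msleep(1);
		continue;
	}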

> +		}
> +
> +		skb_queue_tail(&np->recv_list, skb);
> +	}
> +	clear_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state);
> +}
> +
>  static void nv_do_nic_poll(struct timer_list *t)
>  {
>  	struct fe_priv *np = from_timer(np, t, nic_poll);
> @@ -4129,6 +4214,8 @@ static void nv_do_nic_poll(struct timer_list *t)
>  			nv_drain_rxtx(dev);
>  			/* reinit driver view of the rx queue */
>  			set_bufsize(dev);
> +			nv_destroy_recv_cache(dev);
> +			nv_init_recv_cache(dev);
>  			if (nv_init_ring(dev)) {
>  				if (!np->in_shutdown)
>  					mod_timer(&np->oom_kick, jiffies + OOM_REFILL);


Thread overview: 8+ messages
2019-07-05  6:19 [PATCH 0/2] forcedeth: recv cache support Zhu Yanjun
2019-07-05  6:19 ` [PATCH 1/2] forcedeth: add recv cache make nic work steadily Zhu Yanjun
2019-07-06  4:14   ` Zhu Yanjun
2019-07-08 20:52   ` Jakub Kicinski [this message]
2019-07-09  7:38     ` Yanjun Zhu
2019-07-05  6:19 ` [PATCH 2/2] forcedeth: disable recv cache by default Zhu Yanjun
2019-07-08 22:23 ` [PATCH 0/2] forcedeth: recv cache support David Miller
2019-07-09  8:28   ` Yanjun Zhu
