From: Yunsheng Lin <linyunsheng@huawei.com>
To: Michal Kubecek <mkubecek@suse.cz>
Cc: Juergen Gross <jgross@suse.com>, Jiri Kosina <jikos@kernel.org>,
	<davem@davemloft.net>, <kuba@kernel.org>, <olteanv@gmail.com>,
	<ast@kernel.org>, <daniel@iogearbox.net>, <andriin@fb.com>,
	<edumazet@google.com>, <weiwan@google.com>,
	<cong.wang@bytedance.com>, <ap420073@gmail.com>,
	<netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linuxarm@openeuler.org>, <mkl@pengutronix.de>,
	<linux-can@vger.kernel.org>, <jhs@mojatatu.com>,
	<xiyou.wangcong@gmail.com>, <jiri@resnulli.us>,
	<andrii@kernel.org>, <kafai@fb.com>, <songliubraving@fb.com>,
	<yhs@fb.com>, <john.fastabend@gmail.com>, <kpsingh@kernel.org>,
	<bpf@vger.kernel.org>, <jonas.bonn@netrounds.com>,
	<pabeni@redhat.com>, <mzhivich@akamai.com>, <johunt@akamai.com>,
	<albcamus@gmail.com>, <kehuan.feng@gmail.com>,
	<a.fatoum@pengutronix.de>, <atenart@kernel.org>,
	<alexander.duyck@gmail.com>
Subject: Re: [PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc
Date: Tue, 20 Apr 2021 09:45:23 +0800
Message-ID: <181cdef1-3596-1b74-0fc5-1ced1203a9a5@huawei.com>
In-Reply-To: <20210419145724.wx6wriaxobo6uruu@lion.mk-sys.cz>

On 2021/4/19 22:57, Michal Kubecek wrote:
> On Mon, Apr 19, 2021 at 10:04:27AM +0800, Yunsheng Lin wrote:
>>>
>>> I tried this patch on top of 5.12-rc7 with real devices. I used two
>>> machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs (2 8-core CPUs
>>> with HT disabled) and 16 Rx/Tx queues, receiver has 48 CPUs (2 12-core
>>> CPUs with HT enabled) and 48 Rx/Tx queues. With multiple TCP streams on
>>> a saturated ethernet, the CPU consumption grows quite a lot:
>>>
>>>     threads     unpatched 5.12-rc7    5.12-rc7 + v3   
>>>       1               25.6%               30.6%
>>>       8               73.1%              241.4%
>>>     128              132.2%             1012.0%
>>>
>>> (The values are normalized to one core, i.e. 100% corresponds to one
>>> fully used logical CPU.) I didn't perform a full statistical evaluation
>>> but the growth is way beyond any statistical fluctuation with one
>>> exception: 8-thread test of patched kernel showed values from 155.5% to
>>> 311.4%. A closer look shows that most of the CPU time was spent in softirq,
>>> and running top in parallel with the test confirms that there are
>>> multiple ksoftirqd threads running at 100% CPU. I had a similar problem
>>> with earlier versions of my patch (work in progress, I still need to
>>> check some corner cases and write commit message explaining the logic)
>>
>> Great, if there is a better idea, maybe share the core idea first so
>> that we both can work on it?
> 
> I'm not sure if it's really better but to minimize the false positives
> and unnecessary calls to __netif_schedule(), I replaced q->seqlock with
> an atomic combination of a "running" flag (which corresponds to current
> seqlock being locked) and a "drainers" count (number of other threads
> going to clean up the qdisc queue). This way we could keep track of them
> and tell reliably whether another thread is going to run a cleanup
> after we leave the qdisc_run() critical section (so that there is no
> need to schedule).
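
If I understand the idea correctly, the combined state would look roughly
like the sketch below. This is only my guess at it -- the names and the
layout are not taken from your WIP patch, and the path that consumes the
drainer count is omitted:

#define QRUN_RUNNING	1	/* a thread is inside qdisc_run() */
#define QRUN_DRAINER	2	/* each pending drainer adds this */

/* try to become the running thread; if someone else already runs,
 * register as a pending drainer instead of spinning
 */
static bool qrun_begin(atomic_t *s)
{
	int old = atomic_fetch_or(QRUN_RUNNING, s);

	if (!(old & QRUN_RUNNING))
		return true;

	atomic_add(QRUN_DRAINER, s);
	return false;
}

/* drop the running bit; a non-zero remainder means another thread
 * registered as a drainer while we ran and will do its own cleanup,
 * so the caller does not need __netif_schedule()
 */
static bool qrun_end_drain_pending(atomic_t *s)
{
	int old = atomic_fetch_andnot(QRUN_RUNNING, s);

	return (old & ~QRUN_RUNNING) != 0;
}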

It seems you are trying to match each skb enqueue with a call to
__qdisc_run() here, which is not reliable once dequeue batching is taken
into account; see try_bulk_dequeue_skb() and try_bulk_dequeue_skb_slow()
in dequeue_skb().
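
For reference, the bulk dequeue looks roughly like this (simplified from
net/sched/sch_generic.c, details vary between kernel versions), so one
holder of q->seqlock can consume skbs enqueued by many other threads in
a single pass:

static void try_bulk_dequeue_skb(struct Qdisc *q, struct sk_buff *skb,
				 const struct netdev_queue *txq,
				 int *packets)
{
	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;

	while (bytelimit > 0) {
		struct sk_buff *nskb = q->dequeue(q);

		if (!nskb)
			break;

		bytelimit -= nskb->len;	/* stay within the BQL budget */
		skb->next = nskb;	/* chain for one driver xmit call */
		skb = nskb;
		(*packets)++;
	}
	skb_mark_not_on_list(skb);
}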

> 
>>> The biggest problem IMHO is that the loop in __qdisc_run() may finish
>>> without rescheduling not only when the qdisc queue is empty but also
>>> when the corresponding device Tx queue is stopped which devices tend to
>>> do whenever they cannot send any more packets out. Thus whenever
>>> __QDISC_STATE_NEED_RESCHEDULE is set with device queue stopped or
>>> frozen, we keep rescheduling the queue cleanup without any chance to
>>> progress or clear __QDISC_STATE_NEED_RESCHEDULE. For this to happen, all
>>> we need is another thread to fail the first spin_trylock() while the device
>>> queue is stopped and qdisc queue not empty.
>>
>> Yes, we could just return false before doing the second spin_trylock() if
>> the netdev queue corresponding to the qdisc is stopped, and when the netdev
>> queue is restarted, __netif_schedule() is called again, see netif_tx_wake_queue().
>>
>> Maybe add a sch_qdisc_stopped() function and do the testing in qdisc_run_begin():
>>
>> if (dont_retry || sch_qdisc_stopped(qdisc))
>> 	return false;
>>
>> bool sch_qdisc_stopped(struct Qdisc *q)
>> {
>> 	const struct netdev_queue *txq = q->dev_queue;
>>
>> 	if (netif_xmit_frozen_or_stopped(txq))
>> 		return true;
>>
>> 	return false;
>> }
>>
>> At least for a qdisc with the TCQ_F_ONETXQUEUE flag set this is doable?
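
(For illustration, the check could sit in the TCQ_F_NOLOCK branch of
qdisc_run_begin() roughly as below; this is only a sketch, and the real
v3 retry logic around __QDISC_STATE_NEED_RESCHEDULE is simplified here.)

	if (!spin_trylock(&qdisc->seqlock)) {
		/* a stopped/frozen tx queue will be rescheduled by
		 * netif_tx_wake_queue() when the driver restarts it,
		 * so give up instead of setting the flag and retrying
		 */
		if (sch_qdisc_stopped(qdisc))
			return false;

		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

		/* second try after publishing the flag */
		if (!spin_trylock(&qdisc->seqlock))
			return false;
	}
	WRITE_ONCE(qdisc->empty, false);
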
> 
> Either this or you can do the check in qdisc_run_end() - when the device
> queue is stopped or frozen, there is no need to schedule as we know it's
> going to be done when the flag is cleared again (and we cannot do
> anything until then anyway).
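
(Something along these lines, I suppose -- again just a sketch of the
TCQ_F_NOLOCK part of qdisc_run_end() on top of the v3 naming, not a
tested patch:)

	if (qdisc->flags & TCQ_F_NOLOCK) {
		spin_unlock(&qdisc->seqlock);

		/* a stopped/frozen tx queue will be rescheduled by
		 * netif_tx_wake_queue() once the driver restarts it,
		 * so only reschedule here when it is not stopped
		 */
		if (test_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state) &&
		    !netif_xmit_frozen_or_stopped(qdisc->dev_queue))
			__netif_schedule(qdisc);
	}
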
> 
>>> Another part of the problem may be that to avoid the race, the logic is
>>> too pessimistic: consider e.g. (dotted lines show "barriers" where
>>> ordering is important):
>>>
>>>     CPU A                            CPU B
>>>     spin_trylock() succeeds
>>>                                      pfifo_fast_enqueue()
>>>     ..................................................................
>>>     skb_array empty, exit loop
>>>                                      first spin_trylock() fails
>>>                                      set __QDISC_STATE_NEED_RESCHEDULE
>>>                                      second spin_trylock() fails
>>>     ..................................................................
>>>     spin_unlock()
>>>     call __netif_schedule()
>>>
>>> When we switch the order of the spin_unlock() on CPU A and the second
>>> spin_trylock() on CPU B (but keep setting __QDISC_STATE_NEED_RESCHEDULE
>>> before CPU A tests it), we end up scheduling a queue cleanup even if
>>> there is already one running. And either of these is quite realistic.
>>
>> But I am not sure whether it is a good thing or a bad thing when the above
>> happens, because it may help to enable tx batching?
> 
> In either of the two scenarios, we call __netif_schedule() to schedule
> a cleanup which does not do anything useful. In the first, the qdisc queue
> is empty, so the scheduled cleanup either finds it empty or finds new
> packets which would have triggered their own cleanup anyway. In the second,
> scheduling a cleanup will not prevent the other thread from doing the
> cleanup it has already started.

The main idea is that the thread holding q->seqlock keeps doing the
dequeuing for as long as possible while other threads are enqueuing skbs,
which seems like it could avoid the above case?
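
For context, the dequeuing side is roughly the loop below (simplified
from __qdisc_run() in net/sched/sch_generic.c), so the seqlock holder
keeps pulling batches until the qdisc is drained or its quota runs out:

void __qdisc_run(struct Qdisc *q)
{
	int quota = dev_tx_weight;
	int packets;

	/* qdisc_restart() dequeues a batch and hands it to the driver */
	while (qdisc_restart(q, &packets)) {
		quota -= packets;
		if (quota <= 0) {
			/* out of budget: defer the rest to softirq */
			__netif_schedule(q);
			break;
		}
	}
}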

> 
>>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>>> index 44991ea..4953430 100644
>>>> --- a/net/sched/sch_generic.c
>>>> +++ b/net/sched/sch_generic.c
>>>> @@ -640,8 +640,10 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  {
>>>>  	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>>  	struct sk_buff *skb = NULL;
>>>> +	bool need_retry = true;
>>>>  	int band;
>>>>  
>>>> +retry:
>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>  		struct skb_array *q = band2list(priv, band);
>>>>  
>>>> @@ -652,6 +654,16 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>>  	}
>>>>  	if (likely(skb)) {
>>>>  		qdisc_update_stats_at_dequeue(qdisc, skb);
>>>> +	} else if (need_retry &&
>>>> +		   test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
>>>> +				      &qdisc->state)) {
>>>> +		/* retry the dequeue after clearing the flag, so that the
>>>> +		 * enqueuing side does not need to call __netif_schedule().
>>>> +		 */
>>>> +		smp_mb__after_atomic();
>>>> +		need_retry = false;
>>>> +
>>>> +		goto retry;
>>>>  	} else {
>>>>  		WRITE_ONCE(qdisc->empty, true);
>>>>  	}
>>>
>>> Does the retry really provide significant improvement? IIUC it only
>>> helps if all of enqueue, first spin_trylock() failure and setting the
>>> __QDISC_STATE_NEED_RESCHEDULE flag happen between us finding the
>>> skb_array empty and checking the flag. If enqueue happens before we
>>> check the array (and flag is set before the check), the retry is
>>> useless. If the flag is set after we check it, we don't catch it (no
>>> matter if the enqueue happened before or after we found skb_array
>>> empty).
>>
>> In order to avoid doing the "set_bit(__QDISC_STATE_MISSED, &qdisc->state)"
>> as much as possible, the __QDISC_STATE_NEED_RESCHEDULE bit needs to stay set
>> for as long as possible, so it is only cleared when the queue is empty.
> 
> This is what I'm worried about. We are trying to address a race
> condition which is extremely rare so we should avoid adding too much
> overhead to the normal use.
> 
>> And it gives about a 5% performance improvement.
> 
> OK then. I thought that it would do an unnecessary dequeue retry much
> more often than prevent an unnecessary __netif_schedule() but I did not
> run any special benchmark to check.

When the netdev tx queue is not stopped:
1. if __QDISC_STATE_NEED_RESCHEDULE is set and there are a lot of skbs to be
   dequeued, it is likely that __netif_schedule() has already been called in
   __qdisc_run(), so the __netif_schedule() called in qdisc_run_end() is a
   no-op.

2. if __QDISC_STATE_NEED_RESCHEDULE is set during the data race this patch is
   trying to fix, then we do need to call __netif_schedule().

3. otherwise __QDISC_STATE_NEED_RESCHEDULE is cleared by the test_and_clear_bit()
   added in pfifo_fast_dequeue().

The main point here is to delay clearing the __QDISC_STATE_NEED_RESCHEDULE bit
as much as possible so that the set_bit() and the second spin_trylock() are
avoided.
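
In other words, with the bit left set, a later loser of the first
spin_trylock() can bail out early. A rough sketch of that fast path in
qdisc_run_begin() (an approximation only, not the exact v3 hunk):

	if (!spin_trylock(&qdisc->seqlock)) {
		/* bit already set by an earlier loser: the seqlock holder
		 * is guaranteed to either retry the dequeue or reschedule,
		 * so we can skip the set_bit() and the second trylock
		 */
		if (test_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state))
			return false;

		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

		if (!spin_trylock(&qdisc->seqlock))
			return false;
	}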

> 
> Michal

