From: Yunsheng Lin <linyunsheng@huawei.com>
To: Hillf Danton <hdanton@sina.com>
Cc: Juergen Gross <jgross@suse.com>, <netdev@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, Jiri Kosina <JKosina@suse.com>
Subject: Re: [PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc
Date: Mon, 12 Apr 2021 20:00:43 +0800
Message-ID: <d7b8a391-0b2f-f0a9-82ed-0609addcadb2@huawei.com>
In-Reply-To: <20210412072856.2046-1-hdanton@sina.com>

On 2021/4/12 15:28, Hillf Danton wrote:
> On Mon, 12 Apr 2021 11:37:24 Yunsheng Lin wrote:
>> On 2021/4/12 11:21, Hillf Danton wrote:
>>> On Mon, 12 Apr 2021 09:24:30  Yunsheng Lin wrote:
>>>> On 2021/4/9 17:09, Hillf Danton wrote:
>>>>> On Fri, 9 Apr 2021 07:31:03  Juergen Gross wrote:
>>>>>> On 25.03.21 04:13, Yunsheng Lin wrote:
>>>>>> I have a setup which is able to reproduce the issue quite reliably:
>>>>>>
>>>>>> In a Xen guest I mount 8 NFS shares and run sysbench fileio on
>>>>>> each of them. The average latency reported by sysbench is well below
>>>>>> 1 msec, but at least once per hour I get latencies in the minute
>>>>>> range.
>>>>>>
>>>>>> With this patch I don't see these high latencies any longer (test
>>>>>> is running for more than 20 hours now).
>>>>>>
>>>>>> So you can add my:
>>>>>>
>>>>>> Tested-by: Juergen Gross <jgross@suse.com>
>>>>>>
>>>>>
>>>>> If retry is allowed in the dequeue method then a simple seqcount can do the
>>>>> work of serializing enqueuer and dequeuer. IIUC it was not attempted last year.
>>>>
>>>> At first glance, I do not think the patch below fixes the data race
>>>
>>> Thanks for taking a look.
>>>
>>>> described in the commit log, as it does not handle the time window
>>>> between dequeuing and q->seqlock releasing, as below:
>>>>
>>> Yes the time window does exist.
>>>
>>>> cpu1 may not see the change to qdisc->pad after pfifo_fast_dequeue(),
>>>> and cpu2 cannot take q->seqlock yet because cpu1 has not
>>>> released q->seqlock.
>>>>
>>> It's now covered by extending the seqcount aperture a bit.
>>>
>>> --- x/net/sched/sch_generic.c
>>> +++ y/net/sched/sch_generic.c
>>> @@ -380,14 +380,23 @@ void __qdisc_run(struct Qdisc *q)
>>>  {
>>>  	int quota = dev_tx_weight;
>>>  	int packets;
>>> +	int seq;
>>> +
>>> +again:
>>> +	seq = READ_ONCE(q->pad);
>>> +	smp_rmb();
>>>  
>>>  	while (qdisc_restart(q, &packets)) {
>>>  		quota -= packets;
>>>  		if (quota <= 0) {
>>>  			__netif_schedule(q);
>>> -			break;
>>> +			return;
>>>  		}
>>>  	}
>>> +
>>> +	smp_rmb();
>>> +	if (seq != READ_ONCE(q->pad))
>>> +		goto again;
>>
>> As I understand it, there is still a time window between the q->pad check
>> above and the release of q->seqlock in qdisc_run_end().
>>
> Then extend the cover across q->seqlock on top of the flag you added.

Yes, the patch below seems to fix the data race described in
the commit log.
Then what is the difference between my patch and your patch below? :)
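
To make sure we are talking about the same window, here is the two-CPU
timeline as I understand it with the __QDISC_STATE_NEED_RESCHEDULE bit
from your patch below (a rough sketch in my own words, not taken
verbatim from the patch):

	/*
	 * cpu1 (holds q->seqlock)            cpu2
	 * -----------------------            ----
	 * __qdisc_run() dequeues and
	 * finds the queue empty
	 *                                     pfifo_fast_enqueue(skb)
	 *                                     set_bit(NEED_RESCHEDULE)
	 *                                     qdisc_run_begin() trylock
	 *                                     fails, cpu2 gives up
	 * qdisc_run_end():
	 *   spin_unlock(&q->seqlock)
	 *   test_and_clear_bit(NEED_RESCHEDULE)
	 *     -> __netif_schedule(q), so the
	 *        skb from cpu2 is not stuck
	 */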

> 
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -36,6 +36,7 @@ struct qdisc_rate_table {
>  enum qdisc_state_t {
>  	__QDISC_STATE_SCHED,
>  	__QDISC_STATE_DEACTIVATED,
> +	__QDISC_STATE_NEED_RESCHEDULE,
>  };
>  
>  struct qdisc_size_table {
> @@ -176,8 +177,13 @@ static inline bool qdisc_run_begin(struc
>  static inline void qdisc_run_end(struct Qdisc *qdisc)
>  {
>  	write_seqcount_end(&qdisc->running);
> -	if (qdisc->flags & TCQ_F_NOLOCK)
> +	if (qdisc->flags & TCQ_F_NOLOCK) {
>  		spin_unlock(&qdisc->seqlock);
> +
> +		if (test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +							&qdisc->state))
> +			__netif_schedule(qdisc);
> +	}
>  }
>  
>  static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -381,13 +381,21 @@ void __qdisc_run(struct Qdisc *q)
>  	int quota = dev_tx_weight;
>  	int packets;
>  
> +	if (q->flags & TCQ_F_NOLOCK)
> +		clear_bit(__QDISC_STATE_NEED_RESCHEDULE, &q->state);
> +again:
>  	while (qdisc_restart(q, &packets)) {
>  		quota -= packets;
>  		if (quota <= 0) {
>  			__netif_schedule(q);
> -			break;
> +			return;
>  		}
>  	}
> +
> +	if (q->flags & TCQ_F_NOLOCK)
> +		if (test_and_clear_bit(__QDISC_STATE_NEED_RESCHEDULE,
> +					&q->state))
> +			goto again;
>  }
>  
>  unsigned long dev_trans_start(struct net_device *dev)
> @@ -632,6 +640,9 @@ static int pfifo_fast_enqueue(struct sk_
>  			return qdisc_drop(skb, qdisc, to_free);
>  	}
>  
> +	if (qdisc->flags & TCQ_F_NOLOCK)
> +		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

Doing set_bit() in pfifo_fast_enqueue() unconditionally does not
seem performance friendly, because it requires exclusive access
to the cache line holding qdisc->state.
Perhaps run some performance tests?
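
If the unconditional write does show up in the numbers, one option is
the usual check-before-set pattern (only a sketch, not something I
have tested):

	/* Hypothetical tweak, not part of the patch above: only dirty
	 * the cache line when the bit is not already set.
	 */
	if ((qdisc->flags & TCQ_F_NOLOCK) &&
	    !test_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state))
		set_bit(__QDISC_STATE_NEED_RESCHEDULE, &qdisc->state);

That still adds a read to the enqueue fast path, so only a benchmark
can tell whether it is worth it.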


> +
>  	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>  	return NET_XMIT_SUCCESS;
>  }
> 
> .
> 


Thread overview: 19+ messages
2021-03-25  3:13 [PATCH net v3] net: sched: fix packet stuck problem for lockless qdisc Yunsheng Lin
2021-04-09  5:31 ` Juergen Gross
2021-04-12  1:04   ` Yunsheng Lin
2021-04-12  8:21     ` Juergen Gross
     [not found] ` <20210409090909.1767-1-hdanton@sina.com>
2021-04-12  1:24   ` Yunsheng Lin
     [not found]   ` <20210412032111.1887-1-hdanton@sina.com>
2021-04-12  3:37     ` Yunsheng Lin
     [not found]     ` <20210412072856.2046-1-hdanton@sina.com>
2021-04-12 12:00       ` Yunsheng Lin [this message]
     [not found]       ` <20210413022129.2203-1-hdanton@sina.com>
2021-04-13  2:56         ` Yunsheng Lin
     [not found]         ` <20210413032620.2259-1-hdanton@sina.com>
2021-04-13  3:34           ` Yunsheng Lin
     [not found]           ` <20210413071241.2325-1-hdanton@sina.com>
2021-04-13  7:57             ` Yunsheng Lin
     [not found]             ` <20210413083352.2424-1-hdanton@sina.com>
2021-04-13  9:03               ` Yunsheng Lin
2021-04-13  9:15                 ` Juergen Gross
2021-04-16 11:46                   ` Jiri Kosina
2021-04-18 22:59 ` Michal Kubecek
2021-04-19  2:04   ` Yunsheng Lin
2021-04-19 12:21     ` [Linuxarm] " Yunsheng Lin
2021-04-19 15:25       ` Michal Kubecek
2021-04-19 14:57     ` Michal Kubecek
2021-04-20  1:45       ` Yunsheng Lin
