From: John Fastabend
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can overrun array bounds
Date: Tue, 2 Jan 2018 13:27:23 -0800
Message-ID: <149cba1c-e51a-e3fa-8105-34b95ad2e582@gmail.com>
References: <20171228035024.14699.69453.stgit@john-Precision-Tower-5810>
 <20180102.115219.1101472320429215260.davem@davemloft.net>
 <20180102190107-mutt-send-email-mst@kernel.org>
 <20180102191329-mutt-send-email-mst@kernel.org>
In-Reply-To: <20180102191329-mutt-send-email-mst@kernel.org>
To: "Michael S. Tsirkin", David Miller
Cc: jakub.kicinski@netronome.com, xiyou.wangcong@gmail.com, jiri@resnulli.us, netdev@vger.kernel.org

On 01/02/2018 09:17 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 02, 2018 at 07:01:33PM +0200, Michael S. Tsirkin wrote:
>> On Tue, Jan 02, 2018 at 11:52:19AM -0500, David Miller wrote:
>>> From: John Fastabend
>>> Date: Wed, 27 Dec 2017 19:50:25 -0800
>>>
>>>> When running consumer and/or producer operations and empty checks in
>>>> parallel it's possible for the empty check to run past the end of the
>>>> array. The scenario occurs when an empty check runs while
>>>> __ptr_ring_discard_one() is in progress: specifically, after the
>>>> consumer_head is incremented but before the (consumer_head >= ring_size)
>>>> check is made and the consumer head is zeroed.
>>>>
>>>> To resolve this, without having to rework how consumer/producer ops
>>>> work on the array, simply add an extra dummy slot to the end of the
>>>> array. Even if we did a rework to avoid the extra slot, it looks
>>>> like the normal-case checks would suffer somewhat, so it is best to
>>>> just allocate an extra pointer.
>>>>
>>>> Reported-by: Jakub Kicinski
>>>> Fixes: c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array")
>>>> Signed-off-by: John Fastabend
>>>
>>> Applied, thanks John.
>>
>> I think that patch is wrong. I'd rather it got reverted.
>
> And replaced with something like the following - still untested, but
> apparently there's some urgency to fix it, so posting for review ASAP.
>

So the above ptr_ring patch is meant for the dequeue() case, not the
peek case. The dequeue case is shown here:

static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
{
	struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
	struct sk_buff *skb = NULL;
	int band;

	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
		struct skb_array *q = band2list(priv, band);

		if (__skb_array_empty(q))
			continue;

		skb = skb_array_consume_bh(q);
	}
	if (likely(skb)) {
		qdisc_qstats_cpu_backlog_dec(qdisc, skb);
		qdisc_bstats_cpu_update(qdisc, skb);
		qdisc_qstats_cpu_qlen_dec(qdisc);
	}
	return skb;
}

In the dequeue case we use the empty check purely as a hint and then do
a proper consume (with locks) if needed. A false negative here means a
consume is happening on another cpu, due to either a reset op or a
dequeue op. (An aside, but I'm evaluating whether we should only allow a
single dequeueing cpu at a time; more below.)

If a reset op caused the false negative it is OK, because we are
flushing the array anyway. If it was a dequeue op it is also OK, because
this core will abort but the running core will continue to dequeue,
avoiding a stall. So I believe false negatives are OK in the above
function.

> John, others, could you pls confirm it's not too bad performance-wise?
> I'll then split it up properly and re-post.
I haven't benchmarked it, but in the dequeue case taking a lock for
every priority, even when it is empty, seems unneeded.

> -->
>
> net: don't misuse ptr_ring_peek
>
> ptr_ring_peek only claims to be safe if the result is never
> dereferenced, which isn't the case for its use in sch_generic.
> Add locked API variants and use the bh one here.
>
> Signed-off-by: Michael S. Tsirkin
>
> ---
> [...]
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>  		struct skb_array *q = band2list(priv, band);
>
> -		skb = __skb_array_peek(q);
> +		skb = skb_array_peek_bh(q);

Ah, I should have added a comment here. For now peek() is only used from
locking qdiscs, so peek and consume/produce operations will never happen
in parallel. In this case we should never hit the false-negative case
with my patch, or the out-of-bounds reference without my patch.

Doing a peek() op without the qdisc lock is a bit problematic anyway:
with the current code another cpu could consume the skb and free it.
Either we ensure a single consumer runs at a time on an array (not
necessarily the same thing as a qdisc) or we just avoid peek operations
in this case. My current plan is to avoid peek() ops altogether; they
seem unnecessary for the types of qdiscs I want to build.

Thanks,
John

>  	}
>
>  	return skb;
>