From mboxrd@z Thu Jan  1 00:00:00 1970
From: John Fastabend <john.fastabend@gmail.com>
Subject: Re: [net-next PATCH] net: ptr_ring: otherwise safe empty checks can
 overrun array bounds
Date: Wed, 3 Jan 2018 09:46:15 -0800
Message-ID: <127b09e0-2acb-f220-d525-e61ad34da794@gmail.com>
References: <20171228035024.14699.69453.stgit@john-Precision-Tower-5810>
 <20180102.115219.1101472320429215260.davem@davemloft.net>
 <20180102190107-mutt-send-email-mst@kernel.org>
 <20180102191329-mutt-send-email-mst@kernel.org>
 <149cba1c-e51a-e3fa-8105-34b95ad2e582@gmail.com>
 <20180103004619-mutt-send-email-mst@kernel.org>
 <457d384f-5cd1-d52d-f6e9-4c9ebb1f8d09@gmail.com>
 <20180103165249-mutt-send-email-mst@kernel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Cc: David Miller <davem@davemloft.net>, jakub.kicinski@netronome.com,
        xiyou.wangcong@gmail.com, jiri@resnulli.us, netdev@vger.kernel.org
To: "Michael S. Tsirkin" <mst@redhat.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-pg0-f54.google.com ([74.125.83.54]:35645 "EHLO
        mail-pg0-f54.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750753AbeACRqa (ORCPT
        <rfc822;netdev@vger.kernel.org>); Wed, 3 Jan 2018 12:46:30 -0500
Received: by mail-pg0-f54.google.com with SMTP id b70so952954pga.2
        for <netdev@vger.kernel.org>; Wed, 03 Jan 2018 09:46:30 -0800 (PST)
In-Reply-To: <20180103165249-mutt-send-email-mst@kernel.org>
Content-Language: en-US
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On 01/03/2018 07:50 AM, Michael S. Tsirkin wrote:
> On Tue, Jan 02, 2018 at 04:25:03PM -0800, John Fastabend wrote:
>>>
>>> More generally, what makes this usage safe?
>>> Is there a way to formalize it at the API level?
>>>
>>
>> Right I think these are good questions. I think the ptr_ring API should
>> allow a peek operation to be used without a lock. The user has to ensure
>> they only use it as a hint and if its dereferenced the user needs to
>> ensure the object is not free'd from some other codepath while it is
>> being dereferenced. The existing API seems to match this.
>>
>> This is how I used it in pfifo_fast expecting the above to be true. The
>> API allows for false negatives which _should_ be OK if the user is expecting
>> this. Alternatively, we could make it false positives if you want and
>> that would also work for me considering this case is hit very rarely.
> 
> By now I'm confused by which are positives and which are negatives :)
> OK so the guarantees we want would be:
> 
> - empty can return false if ring is empty.
>   caller must re-check with consumer lock taken

Correct.

> - if multiple threads call it, only one thread

What is "it" here, __skb_array_empty() presumably.

>   is guaranteed that empty will not return true
>   if ring is non empty.
>   after detecting that ring is not empty,
>   this thread shall ....
> 
The guarantee is even weaker.

  - if __skb_array_empty() returns true, either the
    ring is empty OR a consumer operation is in
    progress.

For pfifo_fast this is OK because some thread is making
progress at this point. Note, even if multiple threads
are calling __skb_array_empty() there is no guarantee
any of them will return false even if the ring has
elements.

> can you help me fill in the blank please?
> 

Did that help?

> 
> 
> 
> 
>>>>> John, others, could you pls confirm it's not too bad performance-wise?
>>>>> I'll then split it up properly and re-post.
>>>>>
>>>>
>>>> I haven't benchmarked it but in the dequeue case taking a lock for
>>>> every priority even when its empty seems unneeded.
>>>
>>> Well it does seem to make the code more comprehensible and more robust.
>>>
>>
>> Its a trade-off between performance and robustness.
>>
>>> But OK -  I was looking at fixing the unlocked empty API to make sure it
>>> actually does what it's supposed to. I posted a draft earlier in this
>>> thread, it needs to be looked at in depth to figure out whether it can
>>> ever give false negatives or positives, and document the results.
>>>
>>>
>>
>> I'll look at it. But I still think keeping a lockless version makes sense
>> for many use cases.
> 
> Fine. Just let's try to document what are the guarantees.
> 

Yep. Guarantees should be (on ptr_ring implementation and similar
on skb_array implementation)

  - if __ptr_ring_empty returns true, the ring may be empty
    user must use operation with consumer lock to guarantee the
    ring is empty. Note, when run with concurrent consumer operations
    it is possible to return true when ring has elements. This
    occurs when ring consumer head rolls over ring size. Also,
    producer operations may put object in the ring concurrently
    making empty check incorrect. To avoid this use locked
    ptr_ring_empty() API.

  - if __ptr_ring_empty returns false, the ring may have elements.
    Possibly, scenario is consumer operation and __ptr_ring_empty
    operation run concurrently causing false to be returned when
    ring is empty. To avoid this use locked ptr_ring_empty() API.

>>>
>>>>> -->
>>>>>
>>>>> net: don't misuse ptr_ring_peek
>>>>>
>>>>> ptr_ring_peek only claims to be safe if the result is never
>>>>> dereferenced, which isn't the case for its use in sch_generic.
>>>>> Add locked API variants and use the bh one here.
>>>>>
>>>>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>>>>>
>>>>> ---
>>>>>
>>>>
>>>> [...]
>>>>
>>>>> --- a/net/sched/sch_generic.c
>>>>> +++ b/net/sched/sch_generic.c
>>>>> @@ -659,7 +659,7 @@ static struct sk_buff *pfifo_fast_peek(struct Qdisc *qdisc)
>>>>>  	for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>>>  		struct skb_array *q = band2list(priv, band);
>>>>>  
>>>>> -		skb = __skb_array_peek(q);
>>>>> +		skb = skb_array_peek_bh(q);
>>>>
>>>> Ah I should have added a comment here. For now peek() is only used from
>>>> locking qdiscs. So peek and consume/produce operations will never happen
>>>> in parallel. In this case we should never hit the false negative case with
>>>> my patch or the out of bounds reference without my patch.
> 
> OK so this is the part I missed. Can you add a comment please?
> 
> 

Yes, working on a documentation patch set to address misleading and missing
comments/documentation in qdisc layer now.

>>>> Doing a peek() op without qdisc lock is a bit problematic anyways. With
>>>> current code another cpu could consume the skb and free it. Either we
>>>> can ensure a single consumer runs at a time on an array (not the same as
>>>> qdisc maybe) or just avoid peek operations in this case. My current plan
>>>> was to just avoid peek() ops altogether, they seem unnecessary with the
>>>> types of qdiscs I want to be build.
>>>>
>>>> Thanks,
>>>> John
>>>
>>> For the lockless qdisc, for net-next, do we need the patch above until you fix that though?
>>>
>>
>> No, I think after this patch (net: ptr_ring: otherwise safe empty checks...) is
>> applied we do not need any additional fixes in net-next. Future work will
>> require the above patch (the one you provided) though so its useful work.
> 
> The one that avoids allocating extra memory?
> 

Let me try summarize again.

We need the extra memory slot patch if we expose the
__ptr_ring_empty/__skb_array_empty API. Otherwise we hit the out
of bounds issues. And for qdisc layer the __skb_array_empty() API
call is useful so we should not remove it.

In addition to the qdisc layer using the __skb_array_empty() API if
we need a peek() API that works without the qdisc lock then your patch,
the one adding the locked peek API, will be needed. It is not needed
though until we have a qdisc that would use it without the lock.

In summary we keep the extra allocation to make the __skb_array_empty() API
usable. And hold off exposing the skb_array_peek()/ptr_ring_peek() until we
have a user for it. We can roll your proposed patch into any series we come
up with for a new/updated qdisc.

> 
>> I'll do another review of the false positive case though to be sure the
>> current code is OK wrt handling false positives and any potential stalls.
> 
> 
> Thanks!
> 
>>>>>  	}
>>>>>  
>>>>>  	return skb;
>>>>>