From: "Jonathan Lemon" <jonathan.lemon@gmail.com>
To: "Jesper Dangaard Brouer" <brouer@redhat.com>
Cc: "Saeed Mahameed" <saeedm@mellanox.com>,
	davem@davemloft.net, kernel-team@fb.com, netdev@vger.kernel.org,
	ilias.apalodimas@linaro.org, "Li RongQing" <lirongqing@baidu.com>
Subject: Re: [PATCH] page_pool: use irqsave/irqrestore to protect ring access.
Date: Tue, 10 Mar 2020 10:52:37 -0700
Message-ID: <67CA3793-636E-4F73-AC80-44D05A6BDB6F@gmail.com>
In-Reply-To: <20200310110412.66b60677@carbon>



On 10 Mar 2020, at 3:04, Jesper Dangaard Brouer wrote:

> On Tue, 10 Mar 2020 02:30:34 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> On Mon, 2020-03-09 at 17:55 -0700, David Miller wrote:
>>> From: Jonathan Lemon <jonathan.lemon@gmail.com>
>>> Date: Mon, 9 Mar 2020 12:49:29 -0700
>>>
>>>> netpoll may be called from IRQ context, which may access the
>>>> page pool ring.  The current _bh variants do not provide sufficient
>>>> protection, so use irqsave/restore instead.
>>>>
>>>> Error observed on a modified mlx4 driver, but the code path exists
>>>> for any driver which calls page_pool_recycle from napi poll.
>
> Netpoll calls into drivers are problematic, nasty and annoying.
> Drivers usually catch netpoll calls by seeing that the NAPI budget
> is zero, and handle the situation inside the driver, e.g. [1][2]
> (even napi_consume_skb catches it this way).
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L3179
> [2] https://github.com/torvalds/linux/blob/master/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c#L1110

This still doesn't seem safe.

Netpoll calls napi poll in hard-IRQ context.
Consider the following path:

  ixgbe_poll()
   ixgbe_for_each_ring()
    ixgbe_clean_tx_irq()    /* before the L3179 check above */
     xdp_return_frame()
      page_pool_put_page( .. napi=0 )

This bypasses the cache and tries to return the page to the ring.
Since in_serving_softirq() is false, it takes the _bh variant, which
blows up; see the sketch below.
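
For context, the ring-return path in net/core/page_pool.c of that era
looks roughly like this (simplified from memory; treat as approximate,
not a verbatim copy):

  /* Simplified sketch of the period ring-return path in
   * net/core/page_pool.c (approximate, not verbatim). */
  static bool __page_pool_recycle_into_ring(struct page_pool *pool,
                                            struct page *page)
  {
          int ret;

          /* BH protection not needed when already serving softirq */
          if (in_serving_softirq())
                  ret = ptr_ring_produce(&pool->ring, page);
          else
                  ret = ptr_ring_produce_bh(&pool->ring, page);

          return ret == 0;
  }

From netpoll's hard-IRQ context in_serving_softirq() is false, so the
_bh variant runs, and the local_bh_enable() on its unlock path trips
exactly the softirq.c:161 warning quoted below.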



>>>> WARNING: CPU: 34 PID: 550248 at /ro/source/kernel/softirq.c:161 __local_bh_enable_ip+0x35/0x50
>>>  ...
>>>> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
>>>
>>> The netpoll stuff always makes the locking more complicated than it
>>> needs to be.  I wonder if there is another way around this issue?
>>>
>>> Because IRQ save/restore is a high cost to pay in this critical path.
>
> Yes, huge NACK from me, this would kill performance. We need to find
> another way.

Sure - checking in_irq() in the free path is another solution.
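
Something like the following at the top of the put path would do it
(hypothetical and untested; __page_pool_return_page() is, as I read
the current source, the existing internal helper that unmaps and
frees the page):

  /* Hypothetical guard, not a tested patch: from hard-IRQ (netpoll)
   * context neither the lockless cache nor the _bh-locked ring is
   * safe, so hand the page straight back to the page allocator. */
  if (in_irq()) {
          __page_pool_return_page(pool, page);
          return;
  }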


>> a printk inside irq context led to this, so maybe it can be avoided ..
>>
>> or instead of checking in_serving_softirq(), change page_pool to
>> check in_interrupt(), which is more powerful, to avoid the ptr_ring
>> locking and the complication with netpoll altogether.
>>
>> I wonder why Jesper picked in_serving_softirq() in the first place.
>> Was there a specific reason, or did he just want it to be as
>> permissive as possible?
>
> I wanted to make it specific that this is optimized for softirq.
>
> I actually think this is a regression we have introduced recently via
> the combined patches:
>
>  304db6cb7610 ("page_pool: refill page when alloc.count of pool is zero")
>  44768decb7c0 ("page_pool: handle page recycle for NUMA_NO_NODE condition")
>
> I forgot that the in_serving_softirq() check was also protecting us
> when getting called from netpoll.  The general idea before was that if
> the in_serving_softirq() check failed, then we fell through to the
> slow path, calling normal alloc_pages_node() (the regular page
> allocator).

I don't think it ever did that - in all cases, the in_serving_softirq()
check guards access to the cache; otherwise it falls through to the ring.

Here, it seems we need a tristate (rough sketch below):
    in_irq:  fall through to the system page allocator
    napi:    use the cache
    else:    use the ring
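
In code, something along these lines (purely illustrative;
recycle_into_cache()/recycle_into_ring() are stand-ins for the
existing cache and ring paths, not real functions):

  /* Illustrative tristate for the free path -- a sketch only. */
  static void page_pool_put_page_tristate(struct page_pool *pool,
                                          struct page *page, bool napi)
  {
          if (in_irq())
                  __page_pool_return_page(pool, page); /* page allocator */
          else if (napi && in_serving_softirq())
                  recycle_into_cache(pool, page);      /* lockless cache */
          else
                  recycle_into_ring(pool, page);       /* ptr_ring, _bh lock */
  }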


However, looking at those two patches again: while they don't address
the current issue, I do agree that they introduce regressions:

   1) Under stress, the system allocator may return pages with
      a different node id.

   2) Since the check is now removed from pool_page_reusable(), the
      foreign page is placed in the cache and persists there until
      it is moved to the ring.

   3) On refill from the ring, if this page is found, processing stops,
      regardless of the other pages in the ring.

These regressions come from trying to avoid checking the page node_id
at free time.  Are there any metrics showing that this check is a
performance issue?  Drivers which do page biasing still have to
perform it in the driver (e.g. mlx4, i40e) regardless.
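
For reference, the free-time check in question was roughly this (my
recollection of pool_page_reusable() before 44768decb7c0; treat as
approximate):

  /* Approximate pre-44768decb7c0 check: only recycle pages that
   * are not pfmemalloc and that match the pool's NUMA node. */
  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
  {
          return !page_is_pfmemalloc(page) &&
                 page_to_nid(page) == pool->p.nid;
  }
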
-- 
Jonathan
