Re: Re: [PATCH v2] mm: page_frag: Warn_on when frag_alloc size is bigger than PAGE_SIZE

From: Alexander Duyck <alexander.duyck@gmail.com>
To: 愚树 <chen45464546@163.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Netdev <netdev@vger.kernel.org>
Subject: Re: Re: [PATCH v2] mm: page_frag: Warn_on when frag_alloc size is bigger than PAGE_SIZE
Date: Wed, 1 Jun 2022 08:04:17 -0700	[thread overview]
Message-ID: <CAKgT0UdoGJ_dG9vZ3aqQhTagCGf_J3H9A8yJbO5mWCgrt6vd4Q@mail.gmail.com> (raw)
In-Reply-To: <3498989.c69f.1811f41186e.Coremail.chen45464546@163.com>

On Wed, Jun 1, 2022 at 5:33 AM 愚树 <chen45464546@163.com> wrote:
>
> At 2022-06-01 01:28:59, "Alexander Duyck" <alexander.duyck@gmail.com> wrote:
> >On Tue, May 31, 2022 at 8:47 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >>
> >> On Tue, 31 May 2022 23:36:22 +0800 Chen Lin wrote:
> >> > At 2022-05-31 22:14:12, "Jakub Kicinski" <kuba@kernel.org> wrote:
> >> > >On Tue, 31 May 2022 22:41:12 +0800 Chen Lin wrote:
> >> > >> The sample code above cannot completely solve the current problem.
> >> > >> For example, when fragsz is greater than PAGE_FRAG_CACHE_MAX_SIZE(32768),
> >> > >> __page_frag_cache_refill will return a memory of only 32768 bytes, so
> >> > >> should we continue to expand the PAGE_FRAG_CACHE_MAX_SIZE? Maybe more
> >> > >> work needs to be done
> >> > >
> >> > >Right, but I can think of two drivers off the top of my head which will
> >> > >allocate <=32k frags but none which will allocate more.
> >> >
> >> > In fact, it is rare to apply for more than one page, so is it necessary to
> >> > change it to support?
> >>
> >> I don't really care if it's supported TBH, but I dislike adding
> >> a branch to the fast path just to catch one or two esoteric bad
> >> callers.
> >>
> >> Maybe you can wrap the check with some debug CONFIG_ so it won't
> >> run on production builds?
> >
> >Also the example used here to define what is triggering the behavior
> >is seriously flawed. The code itself is meant to allow for order0 page
> >reuse, and the 32K page was just an optimization. So the assumption
> >that you could request more than 4k is a bad assumption in the driver
> >that is making this call.
> >
> >So I am in agreement with Kuba. We shouldn't be needing to add code in
> >the fast path to tell users not to shoot themselves in the foot.
> >
> >We already have code in place in __netdev_alloc_skb that is calling
> >the slab allocator if "len > SKB_WITH_OVERHEAD(PAGE_SIZE)". We could
> >probably just add a DEBUG wrapped BUG_ON to capture those cases where
> >a driver is making that mistake with __netdev_alloc_frag_align.
>
> Thanks for the clear explanation.
> The reality is that it is not easy to capture the drivers that make such mistake.
> Because memory corruption usually leads to errors on other unrelated modules.
> Not long ago, we have spent a lot of time and effort to locate a issue that
> occasionally occurs in different kernel modules, and finally find the root cause is
> the improper use of this netdev_alloc_frag interface in DPAA net driver from NXP.
> It's a miserable process.
>
> I also found that some net drivers in the latest Linux version have this issue.
> Like:
> 1. netdev_alloc_frag "len" may larger than PAGE_SIZE
> #elif (PAGE_SIZE >= E1000_RXBUFFER_4096)
>                 adapter->rx_buffer_len = PAGE_SIZE;
> #endif
>
> static unsigned int e1000_frag_len(const struct e1000_adapter *a)
> {
>         return SKB_DATA_ALIGN(a->rx_buffer_len + E1000_HEADROOM) +
>                 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> }
>
> static void *e1000_alloc_frag(const struct e1000_adapter *a)
> {
>         unsigned int len = e1000_frag_len(a);
>         u8 *data = netdev_alloc_frag(len);
> }
> "./drivers/net/ethernet/intel/e1000/e1000_main.c" 5316  --38%--

So there isn't actually a bug in this code. Specifically the code is
split up between two paths. The first code block comes from the jumbo
frames path which creates a fraglist skb and will memcpy the header
out if I recall correctly. The code from the other two functions is
from the non-jumbo frames path which has restricted the length to
MAXIMUM_ETHERNET_VLAN_SIZE.

> 2. netdev_alloc_frag "ring->frag_size" may larger than (4096 * 3)
>
> #define MTK_MAX_LRO_RX_LENGTH           (4096 * 3)
>         if (rx_flag == MTK_RX_FLAGS_HWLRO) {
>                 rx_data_len = MTK_MAX_LRO_RX_LENGTH;
>                 rx_dma_size = MTK_HW_LRO_DMA_SIZE;
>         } else {
>                 rx_data_len = ETH_DATA_LEN;
>                 rx_dma_size = MTK_DMA_SIZE;
>         }
>
>         ring->frag_size = mtk_max_frag_size(rx_data_len);
>
>         for (i = 0; i < rx_dma_size; i++) {
>                 ring->data[i] = netdev_alloc_frag(ring->frag_size);
>                 if (!ring->data[i])
>                         return -ENOMEM;
>         }
> "drivers/net/ethernet/mediatek/mtk_eth_soc.c" 3344  --50%--
>
> I will try to fix these drivers later.

This one I don't know as much about, and it does appear to contain a
bug. What it should be doing is a check before doing the
netdev_alloc_frag call to verify if it is less than 4K then it uses
netdev_alloc_frag, if it is greater then it needs to use alloc_pages.

> Even experienced driver engineers may use this netdev_alloc_frag
> interface incorrectly.
> So I thought it is best to provide some prompt information of usage
> error inside the netdev_alloc_frag, or it's OK to report such mistake
> during system running which may caused by fragsz varies(exceeded page size).
>
> Now, as you and Kuba mentioned earlier, "do not add code in fast path".
>
> Can we just add code to the relatively slow path to capture the mistake
> before it lead to memory corruption?
> Like:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e6f211d..ac60a97 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5580,6 +5580,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
>                 /* reset page count bias and offset to start of new frag */
>                 nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>                 offset = size - fragsz;
> +               BUG_ON(offset < 0);
>         }
>
>         nc->pagecnt_bias--;
>

I think I could be onboard with a patch like this. The test shouldn't
add more than 1 instruction since it is essentially just a jump if
signed test which will be performed after the size - fragsz check.

> Additional, we may modify document to clearly indicate the limits of the
> input parameter fragsz.
> Like:
> diff --git a/Documentation/vm/page_frags.rst b/Documentation/vm/page_frags.rst
> index 7d6f938..61b2805 100644
> --- a/Documentation/vm/page_frags.rst
> +++ b/Documentation/vm/page_frags.rst
> @@ -4,7 +4,7 @@
>  Page fragments
>  ==============
>
> -A page fragment is an arbitrary-length arbitrary-offset area of memory
> +A page fragment is an arbitrary-length(must <= PAGE_SIZE) arbitrary-offset area of memory
>  which resides within a 0 or higher order compound page.

The main thing I would call out about the page fragment is that it
should be less than an order 0 page in size, ideally at least half a
page to allow for reuse even in the case of order 0 pages. Otherwise
it is really an abuse of the interface as it isn't really meant to be
allocating 1 fragment per page since the efficiency will drop pretty
significantly as memory becomes fragmented and it becomes harder to
allocate higher order pages. It would essentially just become
alloc_page with more overhead.