Subject: Re: [PATCH] net/mlx4_en: fix potential use-after-free with dma_unmap_page
To: Sarah Newman , Tariq Toukan , Yishai Hadas
Cc: netdev@vger.kernel.org
References: <1520223607-17560-1-git-send-email-srn@prgmr.com> <1cee1792-52ae-f408-15f0-ee42c1fb2ea3@mellanox.com> <2a726e89-8411-380e-3809-2dd71bceb7a7@prgmr.com>
From: Tariq Toukan
Message-ID: <0b28d134-eec4-4e81-1b26-e05e38fe5605@mellanox.com>
Date: Tue, 6 Mar 2018 18:13:24 +0200
In-Reply-To: <2a726e89-8411-380e-3809-2dd71bceb7a7@prgmr.com>

On 05/03/2018 11:10 PM, Sarah Newman wrote:
> On 03/05/2018 02:09 AM, Tariq Toukan wrote:
>>
>>
>> On 05/03/2018 6:20 AM, Sarah Newman wrote:
>>> Take an additional reference to a page whenever it is placed
>>> into the rx ring and put the page again after running
>>> dma_unmap_page.
>>>
>>> When swiotlb is in use, calling dma_unmap_page means that
>>> the original page mapped with dma_map_page must still be valid,
>>> as swiotlb will copy data from its internal cache back to the
>>> originally requested DMA location.
>>>
>>> When GRO is enabled, before this patch all references to the
>>> original frag may be put and the page freed before dma_unmap_page
>>> in mlx4_en_free_frag is called.
>>>
>>> It is possible there is a path where the use-after-free occurs
>>> even with GRO disabled, but this has not been observed so far.
>>>
>>> The bug can be trivially detected by doing the following:
>>>
>>> * Compile the kernel with DEBUG_PAGEALLOC
>>> * Run the kernel as a Xen Dom0
>>> * Leave GRO enabled on the interface
>>> * Run a 10 second or more test with iperf over the interface.
>>>
>>
>> Hi Sarah, thanks for your patch!
>>
>>> This bug was likely introduced in
>>> commit 4cce66cdd14a ("mlx4_en: map entire pages to increase throughput"),
>>> first part of v3.6.
>>>
>>> It was incidentally fixed in
>>> commit 34db548bfb95 ("mlx4: add page recycling in receive path"),
>>> first part of v4.12.
>>>
>>> This version applies to the v4.9 series.
>>>
>>> Signed-off-by: Sarah Newman
>>> Tested-by: Sarah Newman
>>> ---
>>>   drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 39 +++++++++++++++++++++-------
>>>   drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  3 ++-
>>>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  1 +
>>>   3 files changed, 32 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>>> index bcbb80f..d1fb087 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>>> @@ -80,10 +80,14 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
>>>       page_alloc->page = page;
>>>       page_alloc->dma = dma;
>>>       page_alloc->page_offset = 0;
>>> +    page_alloc->page_owner = true;
>>
>> Do we really need this boolean? I believe the issue can be fixed without it. We need to make sure we hold the correct refcnt at every stage, and
>> maintain symmetry between a flow and its inverse.
>
> The reason this was added is because the page address needs to stay around until after dma_unmap_page is called, and right now setting page to NULL is
> used to indicate that put_page should not be called when frags are freed in mlx4_en_free_frag. So either the code needs to be rearranged so that
> dma_unmap_page is called while page is still set, or some variable needs to be used to indicate whether put_page should be called when the frags are freed.

Rearranging sounds better.

> If dma_unmap_page was called before page was set to NULL, then this variable wouldn't need to be added, yes. Then the call to dma_unmap_page in
> mlx4_en_free_frag would also be contingent on frags[i].page being set.
>
> There are two places where page is set to NULL without calling dma_unmap_page first, mlx4_en_complete_rx_desc and mlx4_en_xmit_frame.

In mlx4_en_xmit_frame we should not unmap; that should be done only upon completion, and it is done in mlx4_en_recycle_tx_desc. In mlx4_en_complete_rx_desc I think it was just a bug.

> Is mlx4_en_complete_rx_desc the only place where a call to dma_unmap_page would need to be added? The other place page is set to NULL without a call
> to dma_unmap_page first is in mlx4_en_xmit_frame, and I believe there is no call to mlx4_en_free_frag if mlx4_en_xmit_frame executes.

Yes, only in mlx4_en_complete_rx_desc, see above.

>>
>> Upon alloc, refcnt is 1. This alloc refcnt should be inverted by a call to put_page. We might want to introduce a page free API (symmetric to
>> mlx4_alloc_pages), that does: dma unmap the page, call put_page, nullify pointer.

> That seems reasonable.

Yes, let's use it in mlx4_en_free_frag (a rough sketch follows below).

>> Once alloced, page refcnt is bumped up by the amount of possible frags populating it, which is (page_size / frag_stride), as you do here.
>>
>>>       /* Not doing get_page() for each frag is a big win
>>>        * on asymetric workloads. Note we can not use atomic_set().
>>>        */
>>> -    page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
>>> +    /* Since the page must be valid until after dma_unmap_page is called,
>>> +     * take an additional reference we would not have otherwise.
>>> +     */
>>> +    page_ref_add(page, page_alloc->page_size / frag_info->frag_stride);
>>>       return 0;
>>>   }
>>>
>>> @@ -105,9 +109,13 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>>>           page_alloc[i].page_offset += frag_info->frag_stride;
>>>
>>>           if (page_alloc[i].page_offset + frag_info->frag_stride <=
>>> -            ring_alloc[i].page_size)
>>> -            continue;
>>> -
>>> +            ring_alloc[i].page_size) {
>>> +            WARN_ON(!page_alloc[i].page);
>>> +            WARN_ON(!page_alloc[i].page_owner);
>>
>> Why WARN before the likely() check?
>> Move after the check, for better performance.

> No particular reason.

Please move them.
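Coming back to the free API idea above, here is a rough, untested sketch of what I have in mind, just so we are talking about the same thing (the helper name and exact placement are only placeholders):

/* Hypothetical helper, symmetric to mlx4_alloc_pages(): unmap the page,
 * drop the reference taken at allocation time, and clear the pointer so
 * the page cannot be touched after this point.
 */
static void mlx4_en_free_page(struct mlx4_en_priv *priv,
			      struct mlx4_en_rx_alloc *page_alloc,
			      const struct mlx4_en_frag_info *frag_info)
{
	dma_unmap_page(priv->ddev, page_alloc->dma,
		       page_alloc->page_size, frag_info->dma_dir);
	put_page(page_alloc->page);
	page_alloc->page = NULL;
}

Then mlx4_en_free_frag() (on the last frag of a page) and mlx4_en_destroy_allocator() could call this helper instead of open-coding the unmap/put/NULL sequence.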
>>
>>> +            if (likely(page_alloc[i].page &&
>>> +                   page_alloc[i].page_owner))
>>> +                continue;
>>> +        }
>>>           if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
>>>                             frag_info, gfp)))
>>>               goto out;
>>> @@ -131,7 +139,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>>>               page = page_alloc[i].page;
>>>               /* Revert changes done by mlx4_alloc_pages */
>>>               page_ref_sub(page, page_alloc[i].page_size /
>>> -                       priv->frag_info[i].frag_stride - 1);
>>> +                       priv->frag_info[i].frag_stride);
>>>               put_page(page);
>>>           }
>>>       }
>>> @@ -146,11 +154,13 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
>>>       u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
>>>
>>> -    if (next_frag_end > frags[i].page_size)
>>> +    if (next_frag_end > frags[i].page_size) {
>>>           dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
>>>                      frag_info->dma_dir);
>>> +        put_page(frags[i].page);
>>> +    }
>>>
>>> -    if (frags[i].page)
>>> +    if (frags[i].page_owner)
>>>           put_page(frags[i].page);
>>>   }
>>>
>>> @@ -184,9 +194,10 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>>>           page = page_alloc->page;
>>>           /* Revert changes done by mlx4_alloc_pages */
>>>           page_ref_sub(page, page_alloc->page_size /
>>> -                   priv->frag_info[i].frag_stride - 1);
>>> +                   priv->frag_info[i].frag_stride);
>>>           put_page(page);
>>>           page_alloc->page = NULL;
>>> +        page_alloc->page_owner = false;
>>>       }
>>>       return -ENOMEM;
>>>   }
>>> @@ -206,12 +217,14 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
>>>
>>>           dma_unmap_page(priv->ddev, page_alloc->dma,
>>>                 page_alloc->page_size, frag_info->dma_dir);
>>> +        put_page(page_alloc->page);
>>
>> for symmetry, i'd move this after the while loop.

> Or use the wrapper function you suggested for dma_unmap_page?

yes.

>>
>>>           while (page_alloc->page_offset + frag_info->frag_stride <
>>>                  page_alloc->page_size) {
>>>               put_page(page_alloc->page);
>>>               page_alloc->page_offset += frag_info->frag_stride;
>>>           }
>>>           page_alloc->page = NULL;
>>> +        page_alloc->page_owner = false;
>>>       }
>>>   }
>>>
>>> @@ -251,6 +264,11 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
>>>       if (ring->page_cache.index > 0) {
>>>           frags[0] = ring->page_cache.buf[--ring->page_cache.index];
>>>           rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
>>> +        WARN_ON(frags[0].page_owner);
>>> +        if (likely(!frags[0].page_owner)) {
>>> +            page_ref_inc(frags[0].page);
>>> +            frags[0].page_owner = true;
>>> +        }
>>
>> Why? If I'm not mistaken, the page is cached with refcnt == 2. No?

> In mlx4_en_deactivate_rx_ring, pages assigned to frames in the page_cache are only put once. If refcnt == 2 when it's inserted, isn't that a memory
> leak? I can confirm one way or another if you haven't already.

I think you're right. But I didn't check this yet.
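If it helps to check quickly, one untested, debug-only way (not for submission) would be to dump the refcount at the moment a page enters the cache, e.g. in mlx4_en_rx_recycle():

	/* Debug aid: report what refcount pages actually carry when they
	 * are pushed into the page cache, to settle the leak question.
	 */
	if (page_ref_count(frame->page) != 1)
		pr_warn_once("mlx4_en: recycled page has refcount %d\n",
			     page_ref_count(frame->page));

A matching check just before the put_page() in mlx4_en_deactivate_rx_ring() would show whether anything is left over when the cache is drained.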
>
>>
>>>           return 0;
>>>       }
>>>
>>> @@ -569,6 +587,7 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
>>>
>>>           dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
>>>                      priv->frag_info[0].dma_dir);
>>> +        WARN_ON(frame->page_owner);
>>>           put_page(frame->page);
>>>       }
>>>       ring->page_cache.index = 0;
>>> @@ -595,7 +614,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>>>           frag_info = &priv->frag_info[nr];
>>>           if (length <= frag_info->frag_prefix_size)
>>>               break;
>>> -        if (unlikely(!frags[nr].page))
>>> +        if (unlikely(!frags[nr].page_owner))
>>>               goto fail;
>>>
>>>           dma = be64_to_cpu(rx_desc->data[nr].addr);
>>> @@ -607,7 +626,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>>>           skb_frag_size_set(&skb_frags_rx[nr], frag_info->frag_size);
>>>           skb_frags_rx[nr].page_offset = frags[nr].page_offset;
>>>           skb->truesize += frag_info->frag_stride;
>>> -        frags[nr].page = NULL;
>>> +        frags[nr].page_owner = false;
>>>       }
>>>       /* Adjust size of last fragment to match actual length */
>>>       if (nr > 0)
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> index e2509bb..25f7f9e 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> @@ -356,6 +356,7 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
>>>           .dma = tx_info->map0_dma,
>>>           .page_offset = 0,
>>>           .page_size = PAGE_SIZE,
>>> +        .page_owner = false,
>>
>> I don't understand why this is needed.

> Not strictly needed but there for clarity.

Let's obsolete it.

>>
>>>       };
>>>
>>>       if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
>>> @@ -1128,7 +1129,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
>>>       dma = frame->dma;
>>>
>>>       tx_info->page = frame->page;
>>> -    frame->page = NULL;
>>> +    frame->page_owner = false;
>>>       tx_info->map0_dma = dma;
>>>       tx_info->map0_byte_count = length;
>>>       tx_info->nr_txbb = nr_txbb;
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
>>> index df0f396..2c9d9a6 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
>>> @@ -261,6 +261,7 @@ struct mlx4_en_rx_alloc {
>>>       dma_addr_t    dma;
>>>       u32        page_offset;
>>>       u32        page_size;
>>> +    bool        page_owner;
>>>   };
>>>
>>>   #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
>>>
>>
>> Thanks,
>> Tariq

> Thanks, Sarah

I have a general question about the process: I don't totally get which branch this patch is targeted at. It touches critical areas in the datapath and should go through regression tests before it is accepted to any branch.

Thanks,
Tariq