Subject: Re: [PATCH] net/mlx4_en: fix potential use-after-free with dma_unmap_page
To: Sarah Newman , Tariq Toukan , Yishai Hadas
Cc: netdev@vger.kernel.org
References: <1520223607-17560-1-git-send-email-srn@prgmr.com> <1cee1792-52ae-f408-15f0-ee42c1fb2ea3@mellanox.com> <2a726e89-8411-380e-3809-2dd71bceb7a7@prgmr.com>
From: Tariq Toukan
Message-ID: <0b28d134-eec4-4e81-1b26-e05e38fe5605@mellanox.com>
Date: Tue, 6 Mar 2018 18:13:24 +0200
In-Reply-To: <2a726e89-8411-380e-3809-2dd71bceb7a7@prgmr.com>

On 05/03/2018 11:10 PM, Sarah Newman wrote:
> On 03/05/2018 02:09 AM, Tariq Toukan wrote:
>>
>>
>> On 05/03/2018 6:20 AM, Sarah Newman wrote:
>>> Take an additional reference to a page whenever it is placed
>>> into the rx ring and put the page again after running
>>> dma_unmap_page.
>>>
>>> When swiotlb is in use, calling dma_unmap_page means that
>>> the original page mapped with dma_map_page must still be valid,
>>> as swiotlb will copy data from its internal cache back to the
>>> originally requested DMA location.
>>>
>>> When GRO is enabled, before this patch all references to the
>>> original frag may be put and the page freed before dma_unmap_page
>>> in mlx4_en_free_frag is called.
>>>
>>> It is possible there is a path where the use-after-free occurs
>>> even with GRO disabled, but this has not been observed so far.
>>>
>>> The bug can be trivially detected by doing the following:
>>>
>>> * Compile the kernel with DEBUG_PAGEALLOC
>>> * Run the kernel as a Xen Dom0
>>> * Leave GRO enabled on the interface
>>> * Run a 10 second or more test with iperf over the interface.
>>>
>>
>> Hi Sarah, thanks for your patch!
>>
>>> This bug was likely introduced in
>>> commit 4cce66cdd14a ("mlx4_en: map entire pages to increase throughput"),
>>> first part of v3.6.
>>>
>>> It was incidentally fixed in
>>> commit 34db548bfb95 ("mlx4: add page recycling in receive path"),
>>> first part of v4.12.
>>>
>>> This version applies to the v4.9 series.
>>>
>>> Signed-off-by: Sarah Newman
>>> Tested-by: Sarah Newman
>>> ---
>>>   drivers/net/ethernet/mellanox/mlx4/en_rx.c   | 39 +++++++++++++++++++++-------
>>>   drivers/net/ethernet/mellanox/mlx4/en_tx.c   |  3 ++-
>>>   drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  1 +
>>>   3 files changed, 32 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>>> index bcbb80f..d1fb087 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
>>> @@ -80,10 +80,14 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
>>>       page_alloc->page = page;
>>>       page_alloc->dma = dma;
>>>       page_alloc->page_offset = 0;
>>> +    page_alloc->page_owner = true;
>>
>> Do we really need this boolean? I believe the issue can be fixed without it. We need to make sure we hold the correct refcnt at every stage, and
>> maintain symmetry between a flow and its inverse.
>
> The reason this was added is because the page address needs to stay around until after dma_unmap_page is called, and right now setting page to NULL is
> used to indicate that put_page should not be called when frags are freed in mlx4_en_free_frag. So either the code needs to be rearranged so that
> dma_unmap_page is called while page is still set, or some variable needs to be used to indicate whether put_page should be called when the frags are freed.

Rearranging sounds better.

> If dma_unmap_page was called before page was set to NULL, then this variable wouldn't need to be added, yes. Then the call to dma_unmap_page in
> mlx4_en_free_frag would also be contingent on frags[i].page being set.
>
> There are two places where page is set to NULL without calling dma_unmap_page first, mlx4_en_complete_rx_desc and mlx4_en_xmit_frame.

In mlx4_en_xmit_frame we should not unmap; that should be done only upon completion, and it is done in mlx4_en_recycle_tx_desc. In mlx4_en_complete_rx_desc I think it was just a bug.

> Is mlx4_en_complete_rx_desc the only place where a call to dma_unmap_page would need to be added? The other place page is set to NULL without a call
> to dma_unmap_page first is in mlx4_en_xmit_frame, and I believe there is no call to mlx4_en_free_frag if mlx4_en_xmit_frame executes.

Yes, only in mlx4_en_complete_rx_desc, see above.

>>
>> Upon alloc, refcnt is 1. This alloc refcnt should be inverted by a call to put_page. We might want to introduce a page free API (symmetric to
>> mlx4_alloc_pages), that does: dma unmap the page, call put_page, nullify pointer.

> That seems reasonable.

Yes, let's use it in mlx4_en_free_frag (a rough sketch follows below).

>> Once alloced, page refcnt is bumped up by the amount of possible frags populating it, which is (page_size / frag_stride), as you do here.
>>
>>>       /* Not doing get_page() for each frag is a big win
>>>        * on asymetric workloads. Note we can not use atomic_set().
>>>        */
>>> -    page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
>>> +    /* Since the page must be valid until after dma_unmap_page is called,
>>> +     * take an additional reference we would not have otherwise.
>>> +     */
>>> +    page_ref_add(page, page_alloc->page_size / frag_info->frag_stride);
>>>       return 0;
>>>   }
>>>
>>> @@ -105,9 +109,13 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>>>           page_alloc[i].page_offset += frag_info->frag_stride;
>>>
>>>           if (page_alloc[i].page_offset + frag_info->frag_stride <=
>>> -            ring_alloc[i].page_size)
>>> -            continue;
>>> -
>>> +            ring_alloc[i].page_size) {
>>> +            WARN_ON(!page_alloc[i].page);
>>> +            WARN_ON(!page_alloc[i].page_owner);
>>
>> Why WARN before the likely() check?
>> Move after the check, for better performance.

> No particular reason.

Please move them.
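Coming back to the free API idea above, here is a rough, untested sketch of what I have in mind, just so we are talking about the same thing (the helper name and exact placement are only placeholders):

/* Hypothetical helper, symmetric to mlx4_alloc_pages(): unmap the page,
 * drop the reference taken at allocation time, and clear the pointer so
 * the page cannot be touched after this point.
 */
static void mlx4_en_free_page(struct mlx4_en_priv *priv,
			      struct mlx4_en_rx_alloc *page_alloc,
			      const struct mlx4_en_frag_info *frag_info)
{
	dma_unmap_page(priv->ddev, page_alloc->dma,
		       page_alloc->page_size, frag_info->dma_dir);
	put_page(page_alloc->page);
	page_alloc->page = NULL;
}

Then mlx4_en_free_frag() (on the last frag of a page) and mlx4_en_destroy_allocator() could call this helper instead of open-coding the unmap/put/NULL sequence.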
>>
>>> +            if (likely(page_alloc[i].page &&
>>> +                   page_alloc[i].page_owner))
>>> +                continue;
>>> +        }
>>>           if (unlikely(mlx4_alloc_pages(priv, &page_alloc[i],
>>>                             frag_info, gfp)))
>>>               goto out;
>>> @@ -131,7 +139,7 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
>>>               page = page_alloc[i].page;
>>>               /* Revert changes done by mlx4_alloc_pages */
>>>               page_ref_sub(page, page_alloc[i].page_size /
>>> -                       priv->frag_info[i].frag_stride - 1);
>>> +                       priv->frag_info[i].frag_stride);
>>>               put_page(page);
>>>           }
>>>       }
>>> @@ -146,11 +154,13 @@ static void mlx4_en_free_frag(struct mlx4_en_priv *priv,
>>>       u32 next_frag_end = frags[i].page_offset + 2 * frag_info->frag_stride;
>>>
>>> -    if (next_frag_end > frags[i].page_size)
>>> +    if (next_frag_end > frags[i].page_size) {
>>>           dma_unmap_page(priv->ddev, frags[i].dma, frags[i].page_size,
>>>                      frag_info->dma_dir);
>>> +        put_page(frags[i].page);
>>> +    }
>>>
>>> -    if (frags[i].page)
>>> +    if (frags[i].page_owner)
>>>           put_page(frags[i].page);
>>>   }
>>>
>>> @@ -184,9 +194,10 @@ static int mlx4_en_init_allocator(struct mlx4_en_priv *priv,
>>>           page = page_alloc->page;
>>>           /* Revert changes done by mlx4_alloc_pages */
>>>           page_ref_sub(page, page_alloc->page_size /
>>> -                   priv->frag_info[i].frag_stride - 1);
>>> +                   priv->frag_info[i].frag_stride);
>>>           put_page(page);
>>>           page_alloc->page = NULL;
>>> +        page_alloc->page_owner = false;
>>>       }
>>>       return -ENOMEM;
>>>   }
>>> @@ -206,12 +217,14 @@ static void mlx4_en_destroy_allocator(struct mlx4_en_priv *priv,
>>>
>>>           dma_unmap_page(priv->ddev, page_alloc->dma,
>>>                 page_alloc->page_size, frag_info->dma_dir);
>>> +        put_page(page_alloc->page);
>>
>> for symmetry, i'd move this after the while loop.

> Or use the wrapper function you suggested for dma_unmap_page?

yes.

>>
>>>           while (page_alloc->page_offset + frag_info->frag_stride <
>>>                  page_alloc->page_size) {
>>>               put_page(page_alloc->page);
>>>               page_alloc->page_offset += frag_info->frag_stride;
>>>           }
>>>           page_alloc->page = NULL;
>>> +        page_alloc->page_owner = false;
>>>       }
>>>   }
>>>
>>> @@ -251,6 +264,11 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
>>>       if (ring->page_cache.index > 0) {
>>>           frags[0] = ring->page_cache.buf[--ring->page_cache.index];
>>>           rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
>>> +        WARN_ON(frags[0].page_owner);
>>> +        if (likely(!frags[0].page_owner)) {
>>> +            page_ref_inc(frags[0].page);
>>> +            frags[0].page_owner = true;
>>> +        }
>>
>> Why? If I'm not mistaken, the page is cached with refcnt == 2. No?

> In mlx4_en_deactivate_rx_ring, pages assigned to frames in the page_cache are only put once. If refcnt == 2 when it's inserted, isn't that a memory
> leak? I can confirm one way or another if you haven't already.

I think you're right. But I didn't check this yet.
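If it helps to check quickly, one untested, debug-only way (not for submission) would be to dump the refcount at the moment a page enters the cache, e.g. in mlx4_en_rx_recycle():

	/* Debug aid: report what refcount pages actually carry when they
	 * are pushed into the page cache, to settle the leak question.
	 */
	if (page_ref_count(frame->page) != 1)
		pr_warn_once("mlx4_en: recycled page has refcount %d\n",
			     page_ref_count(frame->page));

A matching check just before the put_page() in mlx4_en_deactivate_rx_ring() would show whether anything is left over when the cache is drained.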
>
>>
>>>           return 0;
>>>       }
>>>
>>> @@ -569,6 +587,7 @@ void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv,
>>>
>>>           dma_unmap_page(priv->ddev, frame->dma, frame->page_size,
>>>                      priv->frag_info[0].dma_dir);
>>> +        WARN_ON(frame->page_owner);
>>>           put_page(frame->page);
>>>       }
>>>       ring->page_cache.index = 0;
>>> @@ -595,7 +614,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>>>           frag_info = &priv->frag_info[nr];
>>>           if (length <= frag_info->frag_prefix_size)
>>>               break;
>>> -        if (unlikely(!frags[nr].page))
>>> +        if (unlikely(!frags[nr].page_owner))
>>>               goto fail;
>>>
>>>           dma = be64_to_cpu(rx_desc->data[nr].addr);
>>> @@ -607,7 +626,7 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
>>>           skb_frag_size_set(&skb_frags_rx[nr], frag_info->frag_size);
>>>           skb_frags_rx[nr].page_offset = frags[nr].page_offset;
>>>           skb->truesize += frag_info->frag_stride;
>>> -        frags[nr].page = NULL;
>>> +        frags[nr].page_owner = false;
>>>       }
>>>       /* Adjust size of last fragment to match actual length */
>>>       if (nr > 0)
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> index e2509bb..25f7f9e 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
>>> @@ -356,6 +356,7 @@ u32 mlx4_en_recycle_tx_desc(struct mlx4_en_priv *priv,
>>>           .dma = tx_info->map0_dma,
>>>           .page_offset = 0,
>>>           .page_size = PAGE_SIZE,
>>> +        .page_owner = false,
>>
>> I don't understand why this is needed.

> Not strictly needed but there for clarity.

Let's obsolete it.

>>
>>>       };
>>>
>>>       if (!mlx4_en_rx_recycle(ring->recycle_ring, &frame)) {
>>> @@ -1128,7 +1129,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_alloc *frame,
>>>       dma = frame->dma;
>>>
>>>       tx_info->page = frame->page;
>>> -    frame->page = NULL;
>>> +    frame->page_owner = false;
>>>       tx_info->map0_dma = dma;
>>>       tx_info->map0_byte_count = length;
>>>       tx_info->nr_txbb = nr_txbb;
>>> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
>>> index df0f396..2c9d9a6 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
>>> @@ -261,6 +261,7 @@ struct mlx4_en_rx_alloc {
>>>       dma_addr_t    dma;
>>>       u32        page_offset;
>>>       u32        page_size;
>>> +    bool        page_owner;
>>>   };
>>>
>>>   #define MLX4_EN_CACHE_SIZE (2 * NAPI_POLL_WEIGHT)
>>>
>>
>> Thanks,
>> Tariq

> Thanks, Sarah

I have a general question about the process: I don't totally get which branch this patch is targeted at. It touches critical areas in the datapath and should go through regression tests before it is accepted to any branch.

Thanks,
Tariq