From: Kashyap Desai <kashyap.desai@broadcom.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: linux-rdma@vger.kernel.org, leonro@nvidia.com,
	Selvin Xavier <selvin.xavier@broadcom.com>,
	Andrew Gospodarek <andrew.gospodarek@broadcom.com>
Subject: RE: [PATCH rdma-rc v1] RDMA/core: fix sg_to_page mapping for boundary condition
Date: Mon, 22 Aug 2022 19:51:22 +0530	[thread overview]
Message-ID: <2651261c642ca672864c2c6c8e7a9774@mail.gmail.com> (raw)
In-Reply-To: <Yv94fYp8869XZKFU@nvidia.com>

>
> On Fri, Aug 19, 2022 at 03:13:47PM +0530, Kashyap Desai wrote:
> > > > All addresses other than 0xfffffffffff00000 are stale entries.
> > > > Once this kind of incorrect page entry is passed to the RDMA h/w,
> > > > the AMD IOMMU module detects a page fault whenever the h/w tries
> > > > to access addresses which were not actually populated correctly by
> > > > the ib stack.
> > > > The prints below are logged whenever this issue hits.
> > >
> > > I don't understand this. AFAIK on AMD platforms you can't create an
> > > IOVA mapping at -1 like you are saying above, so how is
> > > 0xfffffffffff00000 a valid DMA address?
> >
> > Hi Jason -
> >
> > Let me simplify - consider a case where one SGE has an 8K dma_len and
> > the starting DMA address is <0xffffffffffffe000>.
> > It is expected to produce two page table entries - <0xffffffffffffe000>
> > and <0xfffffffffffff000>.
> > Neither DMA address is mapped at -1. The device exposes
> > dma_mask_bits = 64, so the above two addresses are valid mappings from
> > the IOMMU perspective.
>
> That is not my point.
>
> My point is that 0xFFFFFFFFFFFFFFFF should never be used as a DMA
> address because it invites overflow on any maths, and we are not careful
> about this in the kernel in general.

I am not seeing an address overflow case. It is just that the buffer ends
at "0xffffffffffffffff" and it is a genuine DMA buffer.
So, the worst case scenario is DMA address = 0xfffffffffffffffe and
dma_len = 1 byte. This must be handled as a genuine DMA request.
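To make the wraparound concrete, here is a tiny standalone C example
(values taken from the 8K SGE case discussed in this thread; this is plain
arithmetic, not kernel code):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint64_t dma_addr = 0xffffffffffffe000ULL; /* SGE start */
            uint64_t dma_len  = 8192;                  /* SGE length */

            /* The exclusive "end" wraps to 0 because the buffer ends
             * at the very top of the 64-bit address space. */
            uint64_t end  = dma_addr + dma_len;

            /* The inclusive "last" does not wrap (dma_len >= 1). */
            uint64_t last = dma_addr + (dma_len - 1);

            printf("end  = 0x%016" PRIx64 "\n", end);  /* 0x0000000000000000 */
            printf("last = 0x%016" PRIx64 "\n", last); /* 0xffffffffffffffff */
            return 0;
    }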

>
> > Since end_dma_addr will be zero (in the current code), which is
> > actually not an end address but the potential next address, we will
> > only end up calling set_page() one time.
>
> Which is the math overflow.
Let's take this case:

ib_sg_to_pages(struct ib_mr *mr, struct scatterlist *sgl, int sg_nents,
               unsigned int *sg_offset_p, int (*set_page)(struct ib_mr *, u64))

sg_nents = 1
struct scatterlist {
    unsigned long page_link;
    unsigned int offset;        /* = 0 */
    unsigned int length;        /* = 8192 */
    dma_addr_t dma_address;     /* = 0xffffffffffffe000 */
    unsigned int dma_length;    /* = 8192 */
};

The loop below will run only one time:

        for_each_sg(sgl, sg, sg_nents, i) {

Now we enter the inner loop with dma_addr = page_addr = 0xffffffffffffe000,
and "end_dma_addr = dma_addr + dma_len" is ZERO:

eval 0xffffffffffffe000 + 8192
hexadecimal: 0

                do {
                        ret = set_page(mr, page_addr);
<<< - This callback will be called one time, with page_addr = 0xffffffffffffe000
                        if (unlikely(ret < 0)) {
                                sg_offset = prev_addr - sg_dma_address(sg);
                                mr->length += prev_addr - dma_addr;
                                if (sg_offset_p)
                                        *sg_offset_p = sg_offset;
                                return i || sg_offset ? i : ret;
                        }
                        prev_addr = page_addr;
next_page:
                        page_addr += mr->page_size;
<<< - After one iteration, page_addr = 0xfffffffffffff000
                } while (page_addr < end_dma_addr);
<<< - This loop will break because (0xfffffffffffff000 < 0) is not true.
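In other words, the expected iteration count can be computed from lengths
alone, with no address comparison at all. A sketch (hypothetical helper,
not in the current code):

    /* Number of set_page() calls this SGE should produce, derived from
     * lengths only, so it is immune to address wraparound. For the
     * example above: page_off = 0, dma_len = 8192, page_size = 4096,
     * giving npages = 2, while the overflowed comparison yields 1.
     */
    u64 page_off = dma_addr & (mr->page_size - 1);
    u64 npages   = (page_off + dma_len + mr->page_size - 1) / mr->page_size;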
>
> > I think this is a valid mapping request, and I don't see that the
> > IOMMU mapping is incorrect.
>
> It should not create mappings that are so dangerous. There is really no
> reason to use the last page of IOVA space, the page that includes -1.

That is correct, but the APIs which deal with mapping still need to handle
this kind of request gracefully. Right?

>
> > > And if we have to tolerate these addresses then the code should be
> > > designed to avoid the overflow in the first place, ie 'end_dma_addr'
> > > should be changed to 'last_dma_addr = dma_addr + (dma_len - 1)',
> > > which does not overflow, and all the logic carefully organized so
> > > none of the math overflows.
> >
> > Making 'last_dma_addr = dma_addr + (dma_len - 1)' will have another
> > side effect.
>
> Yes, the patch would have to fix everything about the logic to work with
> a last and avoid overflowing maths.

Noted.
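
For the record, here is a minimal sketch of what an overflow-free inner
loop built around an inclusive last address could look like (hypothetical
only; the real patch would also need to rework the surrounding offset and
length bookkeeping):

    /* Sketch, NOT the actual patch. dma_len >= 1 is assumed, so
     * (dma_len - 1) cannot underflow; page_addr <= last_dma_addr holds
     * on every iteration, so the subtraction cannot wrap either.
     */
    u64 last_dma_addr = dma_addr + (dma_len - 1);
    u64 page_addr = dma_addr & ~(u64)(mr->page_size - 1);

    for (;;) {
            ret = set_page(mr, page_addr);
            if (unlikely(ret < 0))
                    break;
            if (last_dma_addr - page_addr < mr->page_size)
                    break; /* just mapped the page holding the last byte */
            page_addr += mr->page_size;
    }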

>
> > How about just doing the below? This was my initial thought for a fix,
> > but I am not sure which approach is best.
> >
> > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > index e54b3f1b730e..56d1f3b20e98 100644
> > --- a/drivers/infiniband/core/verbs.c
> > +++ b/drivers/infiniband/core/verbs.c
> > @@ -2709,7 +2709,7 @@ int ib_sg_to_pages(struct ib_mr *mr, struct scatterlist *sgl, int sg_nents,
> >                         prev_addr = page_addr;
> >  next_page:
> >                         page_addr += mr->page_size;
> > -               } while (page_addr < end_dma_addr);
> > +               } while (page_addr < (end_dma_addr - 1));
>
> This is now overflowing twice :(

With end_dma_addr already wrapped to zero, (end_dma_addr - 1) wraps a
second time to 0xffffffffffffffff, so the loop condition is still built
on overflowed math.
I thought about a better approach that avoids creating a regression, and
I found that driving the loop from sg_dma_len handles such cases
gracefully. How about the original patch?
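
Roughly, the idea is the following (a sketch of the intent only, assuming
the v1 patch drives the loop from the remaining sg_dma_len; variable names
are illustrative):

    /* Sketch: iterate on the bytes remaining in the SGE instead of
     * comparing 64-bit DMA addresses, so the loop condition never
     * depends on arithmetic that can wrap at the top of the address
     * space.
     */
    u64 page_addr = dma_addr & ~(u64)(mr->page_size - 1);
    u64 remaining = (dma_addr - page_addr) + dma_len;

    while (remaining) {
            ret = set_page(mr, page_addr);
            if (unlikely(ret < 0))
                    break;
            remaining -= min_t(u64, remaining, mr->page_size);
            page_addr += mr->page_size; /* may wrap only after the final page */
    }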

>
> Does this bug even still exist? eg does this revert "fix" it?
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=af3e9579ecfb

The above revert is part of my test. In my setup, iommu_dma_forcedac is
false.
Without the above revert, it may be possible that I would hit the issue
more frequently. Currently I need heavy IO of 1MB requests to hit this
issue; almost ~8GB is attempted in my test.
In total, 64 NVMe targets at a queue depth of 128 are sending 1MB IOs. It
looks like DMA mapping is first attempted from below 4GB, and whenever
that is exhausted and mapping starts in the above-4GB memory region, this
kind of IOVA mapping occurs.

Kashyap

>
> It makes me wonder if the use of -1 is why drivers started failing with
> this mode.
>
> Jason


