Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf: Require VM_PFNMAP vma for mmap

From: "Thomas Hellström (Intel)" <thomas_os@shipmail.org>
To: Daniel Vetter <daniel@ffwll.ch>
Cc: "Christian König" <christian.koenig@amd.com>,
	"Christian König" <ckoenig.leichtzumerken@gmail.com>,
	"Intel Graphics Development" <intel-gfx@lists.freedesktop.org>,
	"Matthew Wilcox" <willy@infradead.org>,
	"moderated list:DMA BUFFER SHARING FRAMEWORK"
	<linaro-mm-sig@lists.linaro.org>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	"John Stultz" <john.stultz@linaro.org>,
	"DRI Development" <dri-devel@lists.freedesktop.org>,
	"Daniel Vetter" <daniel.vetter@intel.com>,
	"Suren Baghdasaryan" <surenb@google.com>,
	"open list:DMA BUFFER SHARING FRAMEWORK"
	<linux-media@vger.kernel.org>
Subject: Re: [Linaro-mm-sig] [PATCH 1/2] dma-buf: Require VM_PFNMAP vma for mmap
Date: Thu, 11 Mar 2021 16:37:45 +0100	[thread overview]
Message-ID: <b5385bb5-9d1b-3ce6-ee1f-a669d0780253@shipmail.org> (raw)
In-Reply-To: <CAKMK7uEzGKUc27xdWTv7KPESsyg1kCYCmVxP3b-HrzNCNO5x7g@mail.gmail.com>

On 3/11/21 2:17 PM, Daniel Vetter wrote:
> On Thu, Mar 11, 2021 at 2:12 PM Thomas Hellström (Intel)
> <thomas_os@shipmail.org> wrote:
>> Hi!
>>
>> On 3/11/21 2:00 PM, Daniel Vetter wrote:
>>> On Thu, Mar 11, 2021 at 11:22:06AM +0100, Thomas Hellström (Intel) wrote:
>>>> On 3/1/21 3:09 PM, Daniel Vetter wrote:
>>>>> On Mon, Mar 1, 2021 at 11:17 AM Christian König
>>>>> <christian.koenig@amd.com> wrote:
>>>>>> Am 01.03.21 um 10:21 schrieb Thomas Hellström (Intel):
>>>>>>> On 3/1/21 10:05 AM, Daniel Vetter wrote:
>>>>>>>> On Mon, Mar 01, 2021 at 09:39:53AM +0100, Thomas Hellström (Intel)
>>>>>>>> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 3/1/21 9:28 AM, Daniel Vetter wrote:
>>>>>>>>>> On Sat, Feb 27, 2021 at 9:06 AM Thomas Hellström (Intel)
>>>>>>>>>> <thomas_os@shipmail.org> wrote:
>>>>>>>>>>> On 2/26/21 2:28 PM, Daniel Vetter wrote:
>>>>>>>>>>>> So I think it stops gup. But I haven't verified at all. Would be
>>>>>>>>>>>> good
>>>>>>>>>>>> if Christian can check this with some direct io to a buffer in
>>>>>>>>>>>> system
>>>>>>>>>>>> memory.
>>>>>>>>>>> Hmm,
>>>>>>>>>>>
>>>>>>>>>>> Docs (again vm_normal_page() say)
>>>>>>>>>>>
>>>>>>>>>>>        * VM_MIXEDMAP mappings can likewise contain memory with or
>>>>>>>>>>> without "struct
>>>>>>>>>>>        * page" backing, however the difference is that _all_ pages
>>>>>>>>>>> with a struct
>>>>>>>>>>>        * page (that is, those where pfn_valid is true) are refcounted
>>>>>>>>>>> and
>>>>>>>>>>> considered
>>>>>>>>>>>        * normal pages by the VM. The disadvantage is that pages are
>>>>>>>>>>> refcounted
>>>>>>>>>>>        * (which can be slower and simply not an option for some PFNMAP
>>>>>>>>>>> users). The
>>>>>>>>>>>        * advantage is that we don't have to follow the strict
>>>>>>>>>>> linearity rule of
>>>>>>>>>>>        * PFNMAP mappings in order to support COWable mappings.
>>>>>>>>>>>
>>>>>>>>>>> but it's true __vm_insert_mixed() ends up in the insert_pfn()
>>>>>>>>>>> path, so
>>>>>>>>>>> the above isn't really true, which makes me wonder if and in that
>>>>>>>>>>> case
>>>>>>>>>>> why there could any longer ever be a significant performance
>>>>>>>>>>> difference
>>>>>>>>>>> between MIXEDMAP and PFNMAP.
>>>>>>>>>> Yeah it's definitely confusing. I guess I'll hack up a patch and see
>>>>>>>>>> what sticks.
>>>>>>>>>>
>>>>>>>>>>> BTW regarding the TTM hugeptes, I don't think we ever landed that
>>>>>>>>>>> devmap
>>>>>>>>>>> hack, so they are (for the non-gup case) relying on
>>>>>>>>>>> vma_is_special_huge(). For the gup case, I think the bug is still
>>>>>>>>>>> there.
>>>>>>>>>> Maybe there's another devmap hack, but the ttm_vm_insert functions do
>>>>>>>>>> use PFN_DEV and all that. And I think that stops gup_fast from trying
>>>>>>>>>> to find the underlying page.
>>>>>>>>>> -Daniel
>>>>>>>>> Hmm perhaps it might, but I don't think so. The fix I tried out was
>>>>>>>>> to set
>>>>>>>>>
>>>>>>>>> PFN_DEV | PFN_MAP for huge PTEs which causes pfn_devmap() to be
>>>>>>>>> true, and
>>>>>>>>> then
>>>>>>>>>
>>>>>>>>> follow_devmap_pmd()->get_dev_pagemap() which returns NULL and
>>>>>>>>> gup_fast()
>>>>>>>>> backs off,
>>>>>>>>>
>>>>>>>>> in the end that would mean setting in stone that "if there is a huge
>>>>>>>>> devmap
>>>>>>>>> page table entry for which we haven't registered any devmap struct
>>>>>>>>> pages
>>>>>>>>> (get_dev_pagemap returns NULL), we should treat that as a "special"
>>>>>>>>> huge
>>>>>>>>> page table entry".
>>>>>>>>>
>>>>>>>>>     From what I can tell, all code calling get_dev_pagemap() already
>>>>>>>>> does that,
>>>>>>>>> it's just a question of getting it accepted and formalizing it.
>>>>>>>> Oh I thought that's already how it works, since I didn't spot anything
>>>>>>>> else that would block gup_fast from falling over. I guess really would
>>>>>>>> need some testcases to make sure direct i/o (that's the easiest to test)
>>>>>>>> fails like we expect.
>>>>>>> Yeah, IIRC the "| PFN_MAP" is the missing piece for TTM huge ptes.
>>>>>>> Otherwise pmd_devmap() will not return true and since there is no
>>>>>>> pmd_special() things break.
>>>>>> Is that maybe the issue we have seen with amdgpu and huge pages?
>>>>> Yeah, essentially when you have a hugepte inserted by ttm, and it
>>>>> happens to point at system memory, then gup will work on that. And
>>>>> create all kinds of havoc.
>>>>>
>>>>>> Apart from that I'm lost guys, that devmap and gup stuff is not
>>>>>> something I have a good knowledge of apart from a one mile high view.
>>>>> I'm not really better, hence would be good to do a testcase and see.
>>>>> This should provoke it:
>>>>> - allocate nicely aligned bo in system memory
>>>>> - mmap, again nicely aligned to 2M
>>>>> - do some direct io from a filesystem into that mmap, that should trigger gup
>>>>> - before the gup completes free the mmap and bo so that ttm recycles
>>>>> the pages, which should trip up on the elevated refcount. If you wait
>>>>> until the direct io is completely, then I think nothing bad can be
>>>>> observed.
>>>>>
>>>>> Ofc if your amdgpu+hugepte issue is something else, then maybe we have
>>>>> another issue.
>>>>>
>>>>> Also usual caveat: I'm not an mm hacker either, so might be completely wrong.
>>>>> -Daniel
>>>> So I did the following quick experiment on vmwgfx, and it turns out that
>>>> with it,
>>>> fast gup never succeeds. Without the "| PFN_MAP", it typically succeeds
>>>>
>>>> I should probably craft an RFC formalizing this.
>>> Yeah I think that would be good. Maybe even more formalized if we also
>>> switch over to VM_PFNMAP, since afaiui these pte flags here only stop the
>>> fast gup path. And slow gup can still peak through VM_MIXEDMAP. Or
>>> something like that.
>>>
>>> Otoh your description of when it only sometimes succeeds would indicate my
>>> understanding of VM_PFNMAP vs VM_MIXEDMAP is wrong here.
>> My understanding from reading the vmf_insert_mixed() code is that iff
>> the arch has pte_special(), VM_MIXEDMAP should be harmless. But that's
>> not consistent with the vm_normal_page() doc. For architectures without
>> pte_special, VM_PFNMAP must be used, and then we must also block COW
>> mappings.
>>
>> If we can get someone can commit to verify that the potential PAT WC
>> performance issue is gone with PFNMAP, I can put together a series with
>> that included.
> Iirc when I checked there's not much archs without pte_special, so I
> guess that's why we luck out. Hopefully.
>
>> As for existing userspace using COW TTM mappings, I once had a couple of
>> test cases to verify that it actually worked, in particular together
>> with huge PMDs and PUDs where breaking COW would imply splitting those,
>> but I can't think of anything else actually wanting to do that other
>> than by mistake.
> Yeah disallowing MAP_PRIVATE mappings would be another good thing to
> lock down. Really doesn't make much sense.
> -Daniel

Yes, we can't allow them with PFNMAP + a non-linear address space...

/Thomas

>> /Thomas
>>
>>
>>> Christian, what's your take?
>>> -Daniel
>>>
>>>> /Thomas
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> index 6dc96cf66744..72b6fb17c984 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_bo_vm.c
>>>> @@ -195,6 +195,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>>>> *vmf,
>>>>           pfn_t pfnt;
>>>>           struct ttm_tt *ttm = bo->ttm;
>>>>           bool write = vmf->flags & FAULT_FLAG_WRITE;
>>>> +       struct dev_pagemap *pagemap;
>>>>
>>>>           /* Fault should not cross bo boundary. */
>>>>           page_offset &= ~(fault_page_size - 1);
>>>> @@ -210,6 +211,17 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>>>> *vmf,
>>>>           if ((pfn & (fault_page_size - 1)) != 0)
>>>>                   goto out_fallback;
>>>>
>>>> +       /*
>>>> +        * Huge entries must be special, that is marking them as devmap
>>>> +        * with no backing device map range. If there is a backing
>>>> +        * range, Don't insert a huge entry.
>>>> +        */
>>>> +       pagemap = get_dev_pagemap(pfn, NULL);
>>>> +       if (pagemap) {
>>>> +               put_dev_pagemap(pagemap);
>>>> +               goto out_fallback;
>>>> +       }
>>>> +
>>>>           /* Check that memory is contiguous. */
>>>>           if (!bo->mem.bus.is_iomem) {
>>>>                   for (i = 1; i < fault_page_size; ++i) {
>>>> @@ -223,7 +235,7 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>>>> *vmf,
>>>>                   }
>>>>           }
>>>>
>>>> -       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV);
>>>> +       pfnt = __pfn_to_pfn_t(pfn, PFN_DEV | PFN_MAP);
>>>>           if (fault_page_size == (HPAGE_PMD_SIZE >> PAGE_SHIFT))
>>>>                   ret = vmf_insert_pfn_pmd_prot(vmf, pfnt, pgprot, write);
>>>>    #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> @@ -236,6 +248,21 @@ static vm_fault_t ttm_bo_vm_insert_huge(struct vm_fault
>>>> *vmf,
>>>>           if (ret != VM_FAULT_NOPAGE)
>>>>                   goto out_fallback;
>>>>
>>>> +#if 1
>>>> +       {
>>>> +               int npages;
>>>> +               struct page *page;
>>>> +
>>>> +               npages = get_user_pages_fast_only(vmf->address, 1, 0,
>>>> &page);
>>>> +               if (npages == 1) {
>>>> +                       DRM_WARN("Fast gup succeeded. Bad.\n");
>>>> +                       put_page(page);
>>>> +               } else {
>>>> +                       DRM_INFO("Fast gup failed. Good.\n");
>>>> +               }
>>>> +       }
>>>> +#endif
>>>> +
>>>>           return VM_FAULT_NOPAGE;
>>>>    out_fallback:
>>>>           count_vm_event(THP_FAULT_FALLBACK);
>>>>
>>>>
>>>>
>>>>
>>>>
>
>