dri-devel.lists.freedesktop.org archive mirror
* [RFC] drm/vgem: Don't use get_page in fault handler.
@ 2020-07-07  4:21 Lepton Wu
  2020-07-07  8:22 ` Chris Wilson
  2020-07-07  8:59 ` Daniel Vetter
  0 siblings, 2 replies; 8+ messages in thread
From: Lepton Wu @ 2020-07-07  4:21 UTC (permalink / raw)
  To: dri-devel; +Cc: christian.koenig, Lepton Wu

For pages which are allocated by ttm with transparent huge pages, the
tail pages have a reference count of zero. The current vgem code calls
get_page() on them, which triggers a BUG when release_pages() is later
called on those pages.

Here I attach minimal code to trigger this bug on a qemu VM which
enables virtio gpu (card1) and vgem (card0). Note that since upstream
virtio gpu has switched to drm gem and moved away from ttm, an older
kernel such as 5.4 is needed to reproduce it. But I guess the same
thing could happen for a real GPU if it uses a similar code path to
allocate pages from ttm. I am not sure about two things: first, do we
need to fix this at all? Will a real GPU hit this code path? Second,
assuming it needs to be fixed, should it be fixed on the ttm side or
the vgem side? If we remove "huge_flags &= ~__GFP_COMP" from
ttm_get_pages(), then get_page() in vgem works fine, but that line is
there to fix another bug:
https://bugs.freedesktop.org/show_bug.cgi?id=103138
It could also be "fixed" with this patch, but I am really not sure if
this is the way to go.
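
For reference, my understanding of why this blows up: without __GFP_COMP the
huge allocation is not a compound page, so only the first struct page of the
high-order block has a non-zero refcount, and the page helpers sanity-check
exactly that. Roughly (paraphrased from memory, the exact code differs
between kernel versions):

static inline void get_page(struct page *page)
{
	page = compound_head(page);	/* no-op here, the page is not compound */
	VM_BUG_ON_PAGE(page_ref_count(page) <= 0, page);
	page_ref_inc(page);
}

static inline int put_page_testzero(struct page *page)
{
	/* reached via release_pages() at munmap time */
	VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
	return page_ref_dec_and_test(page);
}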

Here is the code to reproduce this bug:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* DRM uapi headers; build with e.g. -I/usr/include/libdrm */
#include <drm.h>
#include <drm_mode.h>

#define WIDTH  1024u
#define HEIGHT 513u

static const unsigned int size = WIDTH * HEIGHT * 4;

static int work(int vfd, int dfd, int handle)
{
	int ret;
	/* export the virtio-gpu dumb buffer as a dma-buf fd */
	struct drm_prime_handle hf = { .handle = handle };
	ret = ioctl(dfd, DRM_IOCTL_PRIME_HANDLE_TO_FD, &hf);
	fprintf(stderr, "fd is %d\n", hf.fd);

	/* import that dma-buf into vgem */
	hf.flags = DRM_CLOEXEC | DRM_RDWR;
	ret = ioctl(vfd, DRM_IOCTL_PRIME_FD_TO_HANDLE, &hf);
	fprintf(stderr, "new handle is %d\n", hf.handle);

	/* mmap it through vgem; faulting the pages in triggers the BUG */
	struct drm_mode_map_dumb map = { .handle = hf.handle };
	ret = ioctl(vfd, DRM_IOCTL_MODE_MAP_DUMB, &map);
	fprintf(stderr, "need map at offset %llu\n",
		(unsigned long long)map.offset);
	unsigned char *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
				  MAP_SHARED, vfd, map.offset);
	if (ptr == MAP_FAILED)
		return -1;
	memset(ptr, 2, size);
	munmap(ptr, size);
	return ret;
}

int main(void)
{
	int vfd = open("/dev/dri/card0", O_RDWR); /* vgem */
	int dfd = open("/dev/dri/card1", O_RDWR); /* virtio gpu */
	int ret;
	struct drm_mode_create_dumb ct = {};

	ct.height = HEIGHT;
	ct.width = WIDTH;
	ct.bpp = 32;
	ret = ioctl(dfd, DRM_IOCTL_MODE_CREATE_DUMB, &ct);
	work(vfd, dfd, ct.handle);
	fprintf(stderr, "done\n");
	return ret;
}
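
To build the reproducer I use something like the following; the include path
for the drm uapi headers may differ on other distros:

gcc -O2 -I/usr/include/libdrm -o vgem_repro vgem_repro.c

Then I run it on the VM with both /dev/dri/card0 (vgem) and /dev/dri/card1
(virtio gpu) present.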

Signed-off-by: Lepton Wu <ytht.net@gmail.com>
---
 drivers/gpu/drm/vgem/vgem_drv.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/vgem/vgem_drv.c b/drivers/gpu/drm/vgem/vgem_drv.c
index ec1a8ebb6f1b..be3d97e29804 100644
--- a/drivers/gpu/drm/vgem/vgem_drv.c
+++ b/drivers/gpu/drm/vgem/vgem_drv.c
@@ -87,9 +87,8 @@ static vm_fault_t vgem_gem_fault(struct vm_fault *vmf)
 
 	mutex_lock(&obj->pages_lock);
 	if (obj->pages) {
-		get_page(obj->pages[page_offset]);
-		vmf->page = obj->pages[page_offset];
-		ret = 0;
+		ret = vmf_insert_pfn(vmf->vma, vmf->address,
+				     page_to_pfn(obj->pages[page_offset]));
 	}
 	mutex_unlock(&obj->pages_lock);
 	if (ret) {
@@ -263,7 +262,6 @@ static struct drm_ioctl_desc vgem_ioctls[] = {
 
 static int vgem_mmap(struct file *filp, struct vm_area_struct *vma)
 {
-	unsigned long flags = vma->vm_flags;
 	int ret;
 
 	ret = drm_gem_mmap(filp, vma);
@@ -273,7 +271,6 @@ static int vgem_mmap(struct file *filp, struct vm_area_struct *vma)
 	/* Keep the WC mmaping set by drm_gem_mmap() but our pages
 	 * are ordinary and not special.
 	 */
-	vma->vm_flags = flags | VM_DONTEXPAND | VM_DONTDUMP;
 	return 0;
 }
 
-- 
2.27.0.212.ge8ba1cc988-goog


* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07  4:21 [RFC] drm/vgem: Don't use get_page in fault handler Lepton Wu
@ 2020-07-07  8:22 ` Chris Wilson
  2020-07-07  8:59 ` Daniel Vetter
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Wilson @ 2020-07-07  8:22 UTC (permalink / raw)
  To: Lepton Wu, dri-devel; +Cc: christian.koenig, Lepton Wu

Quoting Lepton Wu (2020-07-07 05:21:00)
> [...]
> @@ -87,9 +87,8 @@ static vm_fault_t vgem_gem_fault(struct vm_fault *vmf)
>  
>         mutex_lock(&obj->pages_lock);
>         if (obj->pages) {
> -               get_page(obj->pages[page_offset]);
> -               vmf->page = obj->pages[page_offset];
> -               ret = 0;
> +               ret = vmf_insert_pfn(vmf->vma, vmf->address,
> +                                    page_to_pfn(obj->pages[page_offset]));

The danger here is that you store a pointer to a physical page in the
CPU page tables and never remove it should the vgem bo be unpinned and
the obj->pages[] released [and the physical pages returned to the system
for reuse].

So this would need judicious adding of
drm_vma_node_unmap(&bo->base.vma_node, bo->base.anon_inode->i_mapping);
and drm_gem_put_pages(); the kvfree() in vgem_gem_free_object() is also
suspect (for not being a drm_gem_put_pages()).
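
Something along these lines (an untested sketch, the helper name is made up)
wherever obj->pages[] gets dropped:

static void vgem_unmap_and_put_pages(struct drm_vgem_gem_object *bo)
{
	/* zap any userspace PTEs pointing at the pages before dropping them */
	drm_vma_node_unmap(&bo->base.vma_node,
			   bo->base.dev->anon_inode->i_mapping);
	drm_gem_put_pages(&bo->base, bo->pages, true, true);
	bo->pages = NULL;
}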
-Chris

* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07  4:21 [RFC] drm/vgem: Don't use get_page in fault handler Lepton Wu
  2020-07-07  8:22 ` Chris Wilson
@ 2020-07-07  8:59 ` Daniel Vetter
  2020-07-07 10:58   ` Christian König
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Vetter @ 2020-07-07  8:59 UTC (permalink / raw)
  To: Lepton Wu; +Cc: Christian König, dri-devel

On Tue, Jul 7, 2020 at 9:27 AM Lepton Wu <ytht.net@gmail.com> wrote:
>
> For pages which are allocated by ttm with transparent huge pages, the
> tail pages have a reference count of zero. The current vgem code calls
> get_page() on them, which triggers a BUG when release_pages() is later
> called on those pages.
>
> Here I attach minimal code to trigger this bug on a qemu VM which
> enables virtio gpu (card1) and vgem (card0). Note that since upstream
> virtio gpu has switched to drm gem and moved away from ttm, an older
> kernel such as 5.4 is needed to reproduce it. But I guess the same
> thing could happen for a real GPU if it uses a similar code path to
> allocate pages from ttm. I am not sure about two things: first, do we
> need to fix this at all? Will a real GPU hit this code path? Second,
> assuming it needs to be fixed, should it be fixed on the ttm side or
> the vgem side? If we remove "huge_flags &= ~__GFP_COMP" from
> ttm_get_pages(), then get_page() in vgem works fine, but that line is
> there to fix another bug:
> https://bugs.freedesktop.org/show_bug.cgi?id=103138
> It could also be "fixed" with this patch, but I am really not sure if
> this is the way to go.

For imported dma-buf objects, vgem should not handle this itself, but
forward it to the exporter through the dma_buf_mmap() machinery. We have
helpers for all of this now, probably just not wired up correctly here.
Trying to ensure that all combinations of mmap code across all drivers
work the same doesn't work.
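
Rough sketch of what I mean, untested and from memory (the function name is
made up, error handling omitted):

static int vgem_prime_mmap_imported(struct drm_gem_object *obj,
				    struct vm_area_struct *vma)
{
	/* undo the fake-offset lookup that drm_gem_mmap() did */
	vma->vm_pgoff -= drm_vma_node_start(&obj->vma_node);

	/* let the exporting driver set up the mapping instead of vgem */
	return dma_buf_mmap(obj->import_attach->dmabuf, vma, 0);
}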

Caveat: I might not be understanding exactly what's going on here :-)
-Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07  8:59 ` Daniel Vetter
@ 2020-07-07 10:58   ` Christian König
  2020-07-07 11:07     ` Chris Wilson
  0 siblings, 1 reply; 8+ messages in thread
From: Christian König @ 2020-07-07 10:58 UTC (permalink / raw)
  To: Daniel Vetter, Lepton Wu; +Cc: dri-devel

On 07.07.20 10:59, Daniel Vetter wrote:
> On Tue, Jul 7, 2020 at 9:27 AM Lepton Wu <ytht.net@gmail.com> wrote:
>> [...]
> For imported dma-buf objects, vgem should not handle this itself, but
> forward to the exporter through the dma_buf_mmap stuff. We have
> helpers for this all now, probably just not wired up correctly. Trying
> to ensure that all combinations of mmap code across all drivers work
> the same doesn't work.

Yes, Daniel is right, what vgem does here is fundamentally broken in many
ways.

First of all, nobody should ever call get_page() on the pages TTM has
allocated. Those pages come out of a black box and should not be touched
by anybody.

Second, a DMA-buf doesn't even need to have pages backing it in the
case of P2P.

And third, you mess up coherency with this massively, e.g. there are no
correct cache flushes any more.

Regards,
Christian.


* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07 10:58   ` Christian König
@ 2020-07-07 11:07     ` Chris Wilson
  2020-07-07 12:05       ` Thomas Hellström (Intel)
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Wilson @ 2020-07-07 11:07 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, Lepton Wu; +Cc: dri-devel

Quoting Christian König (2020-07-07 11:58:26)
> [...]
> 
> Yes, Daniel is right what vgem does here is fundamentally broken in many 
> ways.
> 
> First of all nobody should ever call get_page() on the pages TTM has 
> allocated. Those pages come out of a black box and should not be touched 
> by anybody.

It doesn't. It's only used on the pages vgem allocates (from shmemfs). So
I'm really confused as to how ttm gets involved here.
-Chris

* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07 11:07     ` Chris Wilson
@ 2020-07-07 12:05       ` Thomas Hellström (Intel)
  2020-07-07 12:15         ` Chris Wilson
  2020-07-07 13:55         ` Christian König
  0 siblings, 2 replies; 8+ messages in thread
From: Thomas Hellström (Intel) @ 2020-07-07 12:05 UTC (permalink / raw)
  To: Chris Wilson, Christian König, Daniel Vetter, Lepton Wu; +Cc: dri-devel


On 7/7/20 1:07 PM, Chris Wilson wrote:
> Quoting Christian König (2020-07-07 11:58:26)
>> [...]
>> First of all nobody should ever call get_page() on the pages TTM has
>> allocated. Those pages come out of a black box and should not be touched
>> by anybody.
> It doesn't. It's only used on the pages vgem allocates (from shmemfs). So
> I'm really confused as to how ttm gets involved here.
> -Chris

Sounds like vgem is allowing mmap of imported dma-bufs?

/Thomas


* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07 12:05       ` Thomas Hellström (Intel)
@ 2020-07-07 12:15         ` Chris Wilson
  2020-07-07 13:55         ` Christian König
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Wilson @ 2020-07-07 12:15 UTC (permalink / raw)
  To: Christian König, Daniel Vetter, Lepton Wu, Thomas Hellström
  Cc: dri-devel

Quoting Thomas Hellström (Intel) (2020-07-07 13:05:31)
> 
> On 7/7/20 1:07 PM, Chris Wilson wrote:
> > [...]
> > It doesn't. It's only used on the pages vgem allocates (from shmemfs). So
> > I'm really confused as to how ttm gets involved here.
> > -Chris
> 
> Sounds like vgem is allowing mmap of imported dma-bufs?

It has vgem_prime_mmap(): if (!obj->filp) return -ENODEV;

Ah, vgem_prime_import_sg_table is where the wires get crossed.
Oh my, and that even calls __vgem_gem_create() so it ends up with a
dummy obj->filp from drm_gem_object_init.
-Chris

* Re: [RFC] drm/vgem: Don't use get_page in fault handler.
  2020-07-07 12:05       ` Thomas Hellström (Intel)
  2020-07-07 12:15         ` Chris Wilson
@ 2020-07-07 13:55         ` Christian König
  1 sibling, 0 replies; 8+ messages in thread
From: Christian König @ 2020-07-07 13:55 UTC (permalink / raw)
  To: Thomas Hellström (Intel), Chris Wilson, Daniel Vetter, Lepton Wu
  Cc: dri-devel

On 07.07.20 14:05, Thomas Hellström (Intel) wrote:
>
> On 7/7/20 1:07 PM, Chris Wilson wrote:
>> [...]
>> It doesn't. It's only used on the pages vgem allocates (from 
>> shmemfs). So
>> I'm really confused as to how ttm gets involved here.
>> -Chris
>
> Sounds like vgem is allowing mmap of imported dma-bufs?

Yeah, agreed, this is most likely the underlying problem and should be 
fixed instead.

Christian.

