[PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
@ 2015-09-23 20:07 Chris Wilson
  2015-09-24 10:23 ` Tvrtko Ursulin
  2015-09-28 13:42 ` Daniel Vetter
  0 siblings, 2 replies; 25+ messages in thread
From: Chris Wilson @ 2015-09-23 20:07 UTC (permalink / raw)
  To: intel-gfx

If the client revokes the virtual address it asked to be mapped into GPU
space via userptr (by using anything along the lines of mmap, mprotect,
madvise, munmap, ftruncate etc) the mmu notifier sends a range
invalidate command to userptr. Upon receiving the invalidation signal
for the revoked range, we try to release the struct pages we pinned into
the GTT. However, this can fail if any of the GPU's VMA are pinned for
use by the hardware (i.e. despite the user's intention we cannot
relinquish the client's address range and keep uptodate with whatever is
placed in there). Currently we emit a few WARN so that we would notice
if this every occurred in the wild; it has. Sadly this means we need to
replace those WARNs with the proper SIGBUS to the offending clients
instead.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index f75d90118888..efb404b9fe69 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
 		was_interruptible = dev_priv->mm.interruptible;
 		dev_priv->mm.interruptible = false;
 
-		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
-			int ret = i915_vma_unbind(vma);
-			WARN_ON(ret && ret != -EIO);
+		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
+			i915_vma_unbind(vma);
+		if (i915_gem_object_put_pages(obj)) {
+			struct task_struct *p;
+
+			DRM_ERROR("Unable to revoke ownership by userptr of"
+				  " invalidated address range, sending SIGBUS"
+				  " to attached clients.\n");
+
+			rcu_read_lock();
+			for_each_process(p) {
+				siginfo_t info;
+
+				/* This doesn't capture everyone who has
+				 * the pages pinned behind a VMA as we
+				 * do not have that tracking information
+				 * available, it does however kill the
+				 * original process (and siblings) who
+				 * created the userptr and presumably tried
+				 * to reuse the address space despite having
+				 * pinned it (possibly indirectly) to the hw.
+				 * Arguably, we don't even want to kill the
+				 * other processes as they are not at fault,
+				 * likely to be a display server, and hopefully
+				 * will release the pages in due course once
+				 * the client is dead.
+				 */
+				if (p->mm != obj->userptr.mm->mm)
+					continue;
+
+				info.si_signo = SIGBUS;
+				info.si_errno = 0;
+				info.si_code = BUS_ADRERR;
+				info.si_addr = (void __user *)obj->userptr.ptr;
+				force_sig_info(SIGBUS, &info, p);
+			}
+			rcu_read_unlock();
 		}
-		WARN_ON(i915_gem_object_put_pages(obj));
 
 		dev_priv->mm.interruptible = was_interruptible;
 	}
-- 
2.5.3

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-23 20:07 [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS Chris Wilson
@ 2015-09-24 10:23 ` Tvrtko Ursulin
  2015-09-24 10:31   ` Chris Wilson
  2015-09-28 13:42 ` Daniel Vetter
  1 sibling, 1 reply; 25+ messages in thread
From: Tvrtko Ursulin @ 2015-09-24 10:23 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 09/23/2015 09:07 PM, Chris Wilson wrote:
> If the client revokes the virtual address it asked to be mapped into GPU
> space via userptr (by using anything along the lines of mmap, mprotect,
> madvise, munmap, ftruncate etc) the mmu notifier sends a range
> invalidate command to userptr. Upon receiving the invalidation signal
> for the revoked range, we try to release the struct pages we pinned into
> the GTT. However, this can fail if any of the GPU's VMA are pinned for
> use by the hardware (i.e. despite the user's intention we cannot
> relinquish the client's address range and keep uptodate with whatever is
> placed in there). Currently we emit a few WARN so that we would notice
> if this every occurred in the wild; it has. Sadly this means we need to
> replace those WARNs with the proper SIGBUS to the offending clients
> instead.

How does it happen? Frame buffer?

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
>   1 file changed, 37 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index f75d90118888..efb404b9fe69 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)

This line is a reminder the previous series still hasn't landed. I think 
it was all r-b-ed, with only my request to not rely on release_pages (or 
something) handle negative and zero page count.

>   		was_interruptible = dev_priv->mm.interruptible;
>   		dev_priv->mm.interruptible = false;
>
> -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> -			int ret = i915_vma_unbind(vma);
> -			WARN_ON(ret && ret != -EIO);
> +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> +			i915_vma_unbind(vma);
> +		if (i915_gem_object_put_pages(obj)) {
> +			struct task_struct *p;
> +
> +			DRM_ERROR("Unable to revoke ownership by userptr of"
> +				  " invalidated address range, sending SIGBUS"
> +				  " to attached clients.\n");
> +
> +			rcu_read_lock();
> +			for_each_process(p) {

I don't think this is safe this without holding the tasklist_lock.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-24 10:23 ` Tvrtko Ursulin
@ 2015-09-24 10:31   ` Chris Wilson
  2015-09-24 10:55     ` Tvrtko Ursulin
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2015-09-24 10:31 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Sep 24, 2015 at 11:23:48AM +0100, Tvrtko Ursulin wrote:
> 
> On 09/23/2015 09:07 PM, Chris Wilson wrote:
> >If the client revokes the virtual address it asked to be mapped into GPU
> >space via userptr (by using anything along the lines of mmap, mprotect,
> >madvise, munmap, ftruncate etc) the mmu notifier sends a range
> >invalidate command to userptr. Upon receiving the invalidation signal
> >for the revoked range, we try to release the struct pages we pinned into
> >the GTT. However, this can fail if any of the GPU's VMA are pinned for
> >use by the hardware (i.e. despite the user's intention we cannot
> >relinquish the client's address range and keep uptodate with whatever is
> >placed in there). Currently we emit a few WARN so that we would notice
> >if this every occurred in the wild; it has. Sadly this means we need to
> >replace those WARNs with the proper SIGBUS to the offending clients
> >instead.
> 
> How does it happen? Frame buffer?

Ignoring the issue of -EIO since patches to fix that path also haven't
landed, the primary cause is through binding the userptr to a scanout
(framebuffer). This is not recommended usage for userptr since the CPU
view is then incoherent, but not impossible.
 
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >Cc: Michał Winiarski <michal.winiarski@intel.com>
> >---
> >  drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
> >  1 file changed, 37 insertions(+), 4 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> >index f75d90118888..efb404b9fe69 100644
> >--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> >+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> >@@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
> 
> This line is a reminder the previous series still hasn't landed. I
> think it was all r-b-ed, with only my request to not rely on
> release_pages (or something) handle negative and zero page count.
> 
> >  		was_interruptible = dev_priv->mm.interruptible;
> >  		dev_priv->mm.interruptible = false;
> >
> >-		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> >-			int ret = i915_vma_unbind(vma);
> >-			WARN_ON(ret && ret != -EIO);
> >+		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> >+			i915_vma_unbind(vma);
> >+		if (i915_gem_object_put_pages(obj)) {
> >+			struct task_struct *p;
> >+
> >+			DRM_ERROR("Unable to revoke ownership by userptr of"
> >+				  " invalidated address range, sending SIGBUS"
> >+				  " to attached clients.\n");
> >+
> >+			rcu_read_lock();
> >+			for_each_process(p) {
> 
> I don't think this is safe this without holding the tasklist_lock.

Hmm, it's the only lock taken in the oom-killer for sending the signal.
The list will not change nor will tasks disappear whilst we hold the
read-lock so it seems sane.
-Chirs

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-24 10:31   ` Chris Wilson
@ 2015-09-24 10:55     ` Tvrtko Ursulin
  0 siblings, 0 replies; 25+ messages in thread
From: Tvrtko Ursulin @ 2015-09-24 10:55 UTC (permalink / raw)
  To: Chris Wilson, intel-gfx


On 09/24/2015 11:31 AM, Chris Wilson wrote:
> On Thu, Sep 24, 2015 at 11:23:48AM +0100, Tvrtko Ursulin wrote:
>>
>> On 09/23/2015 09:07 PM, Chris Wilson wrote:
>>> If the client revokes the virtual address it asked to be mapped into GPU
>>> space via userptr (by using anything along the lines of mmap, mprotect,
>>> madvise, munmap, ftruncate etc) the mmu notifier sends a range
>>> invalidate command to userptr. Upon receiving the invalidation signal
>>> for the revoked range, we try to release the struct pages we pinned into
>>> the GTT. However, this can fail if any of the GPU's VMA are pinned for
>>> use by the hardware (i.e. despite the user's intention we cannot
>>> relinquish the client's address range and keep uptodate with whatever is
>>> placed in there). Currently we emit a few WARN so that we would notice
>>> if this every occurred in the wild; it has. Sadly this means we need to
>>> replace those WARNs with the proper SIGBUS to the offending clients
>>> instead.
>>
>> How does it happen? Frame buffer?
>
> Ignoring the issue of -EIO since patches to fix that path also haven't
> landed, the primary cause is through binding the userptr to a scanout
> (framebuffer). This is not recommended usage for userptr since the CPU
> view is then incoherent, but not impossible.
>
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
>>>   1 file changed, 37 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
>>> index f75d90118888..efb404b9fe69 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
>>> @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
>>
>> This line is a reminder the previous series still hasn't landed. I
>> think it was all r-b-ed, with only my request to not rely on
>> release_pages (or something) handle negative and zero page count.
>>
>>>   		was_interruptible = dev_priv->mm.interruptible;
>>>   		dev_priv->mm.interruptible = false;
>>>
>>> -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
>>> -			int ret = i915_vma_unbind(vma);
>>> -			WARN_ON(ret && ret != -EIO);
>>> +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
>>> +			i915_vma_unbind(vma);
>>> +		if (i915_gem_object_put_pages(obj)) {
>>> +			struct task_struct *p;
>>> +
>>> +			DRM_ERROR("Unable to revoke ownership by userptr of"
>>> +				  " invalidated address range, sending SIGBUS"
>>> +				  " to attached clients.\n");
>>> +
>>> +			rcu_read_lock();
>>> +			for_each_process(p) {
>>
>> I don't think this is safe this without holding the tasklist_lock.
>
> Hmm, it's the only lock taken in the oom-killer for sending the signal.
> The list will not change nor will tasks disappear whilst we hold the
> read-lock so it seems sane.

Then I'll say hmm as well. Since I've now seen there is both in use, 
with and without holding the tasklist_lock.

I thought that with just rcu_read_lock, nothing prevents another CPU 
from obtaining the write tasklist_lock and mess about with it. But maybe 
we are talking about some complex locking scheme here? I don't know. Did 
not find any documentation on the tasklist_lock..

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-23 20:07 [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS Chris Wilson
  2015-09-24 10:23 ` Tvrtko Ursulin
@ 2015-09-28 13:42 ` Daniel Vetter
  2015-09-28 13:52   ` Chris Wilson
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2015-09-28 13:42 UTC (permalink / raw)
  To: Chris Wilson; +Cc: intel-gfx

On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
> If the client revokes the virtual address it asked to be mapped into GPU
> space via userptr (by using anything along the lines of mmap, mprotect,
> madvise, munmap, ftruncate etc) the mmu notifier sends a range
> invalidate command to userptr. Upon receiving the invalidation signal
> for the revoked range, we try to release the struct pages we pinned into
> the GTT. However, this can fail if any of the GPU's VMA are pinned for
> use by the hardware (i.e. despite the user's intention we cannot
> relinquish the client's address range and keep uptodate with whatever is
> placed in there). Currently we emit a few WARN so that we would notice
> if this every occurred in the wild; it has. Sadly this means we need to
> replace those WARNs with the proper SIGBUS to the offending clients
> instead.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
>  1 file changed, 37 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index f75d90118888..efb404b9fe69 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
>  		was_interruptible = dev_priv->mm.interruptible;
>  		dev_priv->mm.interruptible = false;
>  
> -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> -			int ret = i915_vma_unbind(vma);
> -			WARN_ON(ret && ret != -EIO);
> +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> +			i915_vma_unbind(vma);
> +		if (i915_gem_object_put_pages(obj)) {
> +			struct task_struct *p;
> +
> +			DRM_ERROR("Unable to revoke ownership by userptr of"
> +				  " invalidated address range, sending SIGBUS"
> +				  " to attached clients.\n");
> +
> +			rcu_read_lock();
> +			for_each_process(p) {
> +				siginfo_t info;
> +
> +				/* This doesn't capture everyone who has
> +				 * the pages pinned behind a VMA as we
> +				 * do not have that tracking information
> +				 * available, it does however kill the
> +				 * original process (and siblings) who
> +				 * created the userptr and presumably tried
> +				 * to reuse the address space despite having
> +				 * pinned it (possibly indirectly) to the hw.
> +				 * Arguably, we don't even want to kill the
> +				 * other processes as they are not at fault,
> +				 * likely to be a display server, and hopefully
> +				 * will release the pages in due course once
> +				 * the client is dead.
> +				 */
> +				if (p->mm != obj->userptr.mm->mm)
> +					continue;
> +
> +				info.si_signo = SIGBUS;
> +				info.si_errno = 0;
> +				info.si_code = BUS_ADRERR;
> +				info.si_addr = (void __user *)obj->userptr.ptr;
> +				force_sig_info(SIGBUS, &info, p);
> +			}
> +			rcu_read_unlock();

Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
any new users will hopefully get it, and abusing SIGBUS without the thread
actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
the most we can do here I think - I think userspace being able to do this
is just a fundamental property of userptr.
-Daniel

>  		}
> -		WARN_ON(i915_gem_object_put_pages(obj));
>  
>  		dev_priv->mm.interruptible = was_interruptible;
>  	}
> -- 
> 2.5.3
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-28 13:42 ` Daniel Vetter
@ 2015-09-28 13:52   ` Chris Wilson
  2015-09-28 14:14     ` Daniel Vetter
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2015-09-28 13:52 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
> On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
> > If the client revokes the virtual address it asked to be mapped into GPU
> > space via userptr (by using anything along the lines of mmap, mprotect,
> > madvise, munmap, ftruncate etc) the mmu notifier sends a range
> > invalidate command to userptr. Upon receiving the invalidation signal
> > for the revoked range, we try to release the struct pages we pinned into
> > the GTT. However, this can fail if any of the GPU's VMA are pinned for
> > use by the hardware (i.e. despite the user's intention we cannot
> > relinquish the client's address range and keep uptodate with whatever is
> > placed in there). Currently we emit a few WARN so that we would notice
> > if this every occurred in the wild; it has. Sadly this means we need to
> > replace those WARNs with the proper SIGBUS to the offending clients
> > instead.
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
> >  1 file changed, 37 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index f75d90118888..efb404b9fe69 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
> >  		was_interruptible = dev_priv->mm.interruptible;
> >  		dev_priv->mm.interruptible = false;
> >  
> > -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> > -			int ret = i915_vma_unbind(vma);
> > -			WARN_ON(ret && ret != -EIO);
> > +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> > +			i915_vma_unbind(vma);
> > +		if (i915_gem_object_put_pages(obj)) {
> > +			struct task_struct *p;
> > +
> > +			DRM_ERROR("Unable to revoke ownership by userptr of"
> > +				  " invalidated address range, sending SIGBUS"
> > +				  " to attached clients.\n");
> > +
> > +			rcu_read_lock();
> > +			for_each_process(p) {
> > +				siginfo_t info;
> > +
> > +				/* This doesn't capture everyone who has
> > +				 * the pages pinned behind a VMA as we
> > +				 * do not have that tracking information
> > +				 * available, it does however kill the
> > +				 * original process (and siblings) who
> > +				 * created the userptr and presumably tried
> > +				 * to reuse the address space despite having
> > +				 * pinned it (possibly indirectly) to the hw.
> > +				 * Arguably, we don't even want to kill the
> > +				 * other processes as they are not at fault,
> > +				 * likely to be a display server, and hopefully
> > +				 * will release the pages in due course once
> > +				 * the client is dead.
> > +				 */
> > +				if (p->mm != obj->userptr.mm->mm)
> > +					continue;
> > +
> > +				info.si_signo = SIGBUS;
> > +				info.si_errno = 0;
> > +				info.si_code = BUS_ADRERR;
> > +				info.si_addr = (void __user *)obj->userptr.ptr;
> > +				force_sig_info(SIGBUS, &info, p);
> > +			}
> > +			rcu_read_unlock();
> 
> Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
> any new users will hopefully get it, and abusing SIGBUS without the thread
> actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
> the most we can do here I think - I think userspace being able to do this
> is just a fundamental property of userptr.

It is not the bo that is at fault but the *client's* *address* *space*
that is being changed. It is equivalent to mmap on a truncated file i.e.
if the client tries to use its mmapping after it has truncated the file
it is scolded via SIGBUS.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-28 13:52   ` Chris Wilson
@ 2015-09-28 14:14     ` Daniel Vetter
  2015-10-08  9:45       ` Tvrtko Ursulin
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2015-09-28 14:14 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, intel-gfx

On Mon, Sep 28, 2015 at 02:52:30PM +0100, Chris Wilson wrote:
> On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
> > On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
> > > If the client revokes the virtual address it asked to be mapped into GPU
> > > space via userptr (by using anything along the lines of mmap, mprotect,
> > > madvise, munmap, ftruncate etc) the mmu notifier sends a range
> > > invalidate command to userptr. Upon receiving the invalidation signal
> > > for the revoked range, we try to release the struct pages we pinned into
> > > the GTT. However, this can fail if any of the GPU's VMA are pinned for
> > > use by the hardware (i.e. despite the user's intention we cannot
> > > relinquish the client's address range and keep uptodate with whatever is
> > > placed in there). Currently we emit a few WARN so that we would notice
> > > if this every occurred in the wild; it has. Sadly this means we need to
> > > replace those WARNs with the proper SIGBUS to the offending clients
> > > instead.
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > > ---
> > >  drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
> > >  1 file changed, 37 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index f75d90118888..efb404b9fe69 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
> > >  		was_interruptible = dev_priv->mm.interruptible;
> > >  		dev_priv->mm.interruptible = false;
> > >  
> > > -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> > > -			int ret = i915_vma_unbind(vma);
> > > -			WARN_ON(ret && ret != -EIO);
> > > +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> > > +			i915_vma_unbind(vma);
> > > +		if (i915_gem_object_put_pages(obj)) {
> > > +			struct task_struct *p;
> > > +
> > > +			DRM_ERROR("Unable to revoke ownership by userptr of"
> > > +				  " invalidated address range, sending SIGBUS"
> > > +				  " to attached clients.\n");
> > > +
> > > +			rcu_read_lock();
> > > +			for_each_process(p) {
> > > +				siginfo_t info;
> > > +
> > > +				/* This doesn't capture everyone who has
> > > +				 * the pages pinned behind a VMA as we
> > > +				 * do not have that tracking information
> > > +				 * available, it does however kill the
> > > +				 * original process (and siblings) who
> > > +				 * created the userptr and presumably tried
> > > +				 * to reuse the address space despite having
> > > +				 * pinned it (possibly indirectly) to the hw.
> > > +				 * Arguably, we don't even want to kill the
> > > +				 * other processes as they are not at fault,
> > > +				 * likely to be a display server, and hopefully
> > > +				 * will release the pages in due course once
> > > +				 * the client is dead.
> > > +				 */
> > > +				if (p->mm != obj->userptr.mm->mm)
> > > +					continue;
> > > +
> > > +				info.si_signo = SIGBUS;
> > > +				info.si_errno = 0;
> > > +				info.si_code = BUS_ADRERR;
> > > +				info.si_addr = (void __user *)obj->userptr.ptr;
> > > +				force_sig_info(SIGBUS, &info, p);
> > > +			}
> > > +			rcu_read_unlock();
> > 
> > Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
> > any new users will hopefully get it, and abusing SIGBUS without the thread
> > actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
> > the most we can do here I think - I think userspace being able to do this
> > is just a fundamental property of userptr.
> 
> It is not the bo that is at fault but the *client's* *address* *space*
> that is being changed. It is equivalent to mmap on a truncated file i.e.
> if the client tries to use its mmapping after it has truncated the file
> it is scolded via SIGBUS.

But existing SIGBUS is thread-bound and comes with the fault address
attached. This is just the gpu being a bit unhappy, so the SIGBUS comes
out of complete nowhere to smack the userspace thread. Any kind of SIGBUS
catcher userspace has for other reasons might be supremely surprised by
this and do stupid things. Hence I don't think throwing SIGBUS here is
correct behaviour. And there doesn't seem to be anything else suitable
really.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-09-28 14:14     ` Daniel Vetter
@ 2015-10-08  9:45       ` Tvrtko Ursulin
  2015-10-09  7:48         ` Daniel Vetter
  0 siblings, 1 reply; 25+ messages in thread
From: Tvrtko Ursulin @ 2015-10-08  9:45 UTC (permalink / raw)
  To: Daniel Vetter, Chris Wilson, intel-gfx


On 28/09/15 15:14, Daniel Vetter wrote:
> On Mon, Sep 28, 2015 at 02:52:30PM +0100, Chris Wilson wrote:
>> On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
>>> On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
>>>> If the client revokes the virtual address it asked to be mapped into GPU
>>>> space via userptr (by using anything along the lines of mmap, mprotect,
>>>> madvise, munmap, ftruncate etc) the mmu notifier sends a range
>>>> invalidate command to userptr. Upon receiving the invalidation signal
>>>> for the revoked range, we try to release the struct pages we pinned into
>>>> the GTT. However, this can fail if any of the GPU's VMA are pinned for
>>>> use by the hardware (i.e. despite the user's intention we cannot
>>>> relinquish the client's address range and keep uptodate with whatever is
>>>> placed in there). Currently we emit a few WARN so that we would notice
>>>> if this every occurred in the wild; it has. Sadly this means we need to
>>>> replace those WARNs with the proper SIGBUS to the offending clients
>>>> instead.
>>>>
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Cc: Michał Winiarski <michal.winiarski@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
>>>>   1 file changed, 37 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
>>>> index f75d90118888..efb404b9fe69 100644
>>>> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
>>>> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
>>>> @@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
>>>>   		was_interruptible = dev_priv->mm.interruptible;
>>>>   		dev_priv->mm.interruptible = false;
>>>>
>>>> -		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
>>>> -			int ret = i915_vma_unbind(vma);
>>>> -			WARN_ON(ret && ret != -EIO);
>>>> +		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
>>>> +			i915_vma_unbind(vma);
>>>> +		if (i915_gem_object_put_pages(obj)) {
>>>> +			struct task_struct *p;
>>>> +
>>>> +			DRM_ERROR("Unable to revoke ownership by userptr of"
>>>> +				  " invalidated address range, sending SIGBUS"
>>>> +				  " to attached clients.\n");
>>>> +
>>>> +			rcu_read_lock();
>>>> +			for_each_process(p) {
>>>> +				siginfo_t info;
>>>> +
>>>> +				/* This doesn't capture everyone who has
>>>> +				 * the pages pinned behind a VMA as we
>>>> +				 * do not have that tracking information
>>>> +				 * available, it does however kill the
>>>> +				 * original process (and siblings) who
>>>> +				 * created the userptr and presumably tried
>>>> +				 * to reuse the address space despite having
>>>> +				 * pinned it (possibly indirectly) to the hw.
>>>> +				 * Arguably, we don't even want to kill the
>>>> +				 * other processes as they are not at fault,
>>>> +				 * likely to be a display server, and hopefully
>>>> +				 * will release the pages in due course once
>>>> +				 * the client is dead.
>>>> +				 */
>>>> +				if (p->mm != obj->userptr.mm->mm)
>>>> +					continue;
>>>> +
>>>> +				info.si_signo = SIGBUS;
>>>> +				info.si_errno = 0;
>>>> +				info.si_code = BUS_ADRERR;
>>>> +				info.si_addr = (void __user *)obj->userptr.ptr;
>>>> +				force_sig_info(SIGBUS, &info, p);
>>>> +			}
>>>> +			rcu_read_unlock();
>>>
>>> Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
>>> any new users will hopefully get it, and abusing SIGBUS without the thread
>>> actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
>>> the most we can do here I think - I think userspace being able to do this
>>> is just a fundamental property of userptr.
>>
>> It is not the bo that is at fault but the *client's* *address* *space*
>> that is being changed. It is equivalent to mmap on a truncated file i.e.
>> if the client tries to use its mmapping after it has truncated the file
>> it is scolded via SIGBUS.
>
> But existing SIGBUS is thread-bound and comes with the fault address
> attached. This is just the gpu being a bit unhappy, so the SIGBUS comes
> out of complete nowhere to smack the userspace thread. Any kind of SIGBUS
> catcher userspace has for other reasons might be supremely surprised by
> this and do stupid things. Hence I don't think throwing SIGBUS here is
> correct behaviour. And there doesn't seem to be anything else suitable
> really.

Te offending address is provided with the signal as far as I can see.

I think it is fine to do this, even required since the alternative is 
for GPU to keep using random memory indefinitely and userspace never 
gets to know.

And I don't see any reason to keep the process running who did such an 
elementary and serious mistake.

Is the only concern that the process can catch it and not exit?

I am just not sure about the locking requirement for for_each_process 
since existing call sites give conflicting examples. I don't see how 
turning the preemption off can be safe without the tasklist lock but 
perhaps I am wrong, don't know.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-08  9:45       ` Tvrtko Ursulin
@ 2015-10-09  7:48         ` Daniel Vetter
  2015-10-09  8:40           ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2015-10-09  7:48 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> 
> On 28/09/15 15:14, Daniel Vetter wrote:
> >On Mon, Sep 28, 2015 at 02:52:30PM +0100, Chris Wilson wrote:
> >>On Mon, Sep 28, 2015 at 03:42:22PM +0200, Daniel Vetter wrote:
> >>>On Wed, Sep 23, 2015 at 09:07:24PM +0100, Chris Wilson wrote:
> >>>>If the client revokes the virtual address it asked to be mapped into GPU
> >>>>space via userptr (by using anything along the lines of mmap, mprotect,
> >>>>madvise, munmap, ftruncate etc) the mmu notifier sends a range
> >>>>invalidate command to userptr. Upon receiving the invalidation signal
> >>>>for the revoked range, we try to release the struct pages we pinned into
> >>>>the GTT. However, this can fail if any of the GPU's VMA are pinned for
> >>>>use by the hardware (i.e. despite the user's intention we cannot
> >>>>relinquish the client's address range and keep uptodate with whatever is
> >>>>placed in there). Currently we emit a few WARN so that we would notice
> >>>>if this every occurred in the wild; it has. Sadly this means we need to
> >>>>replace those WARNs with the proper SIGBUS to the offending clients
> >>>>instead.
> >>>>
> >>>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>Cc: Michał Winiarski <michal.winiarski@intel.com>
> >>>>---
> >>>>  drivers/gpu/drm/i915/i915_gem_userptr.c | 41 +++++++++++++++++++++++++++++----
> >>>>  1 file changed, 37 insertions(+), 4 deletions(-)
> >>>>
> >>>>diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> >>>>index f75d90118888..efb404b9fe69 100644
> >>>>--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> >>>>+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> >>>>@@ -81,11 +81,44 @@ static void __cancel_userptr__worker(struct work_struct *work)
> >>>>  		was_interruptible = dev_priv->mm.interruptible;
> >>>>  		dev_priv->mm.interruptible = false;
> >>>>
> >>>>-		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link) {
> >>>>-			int ret = i915_vma_unbind(vma);
> >>>>-			WARN_ON(ret && ret != -EIO);
> >>>>+		list_for_each_entry_safe(vma, tmp, &obj->vma_list, obj_link)
> >>>>+			i915_vma_unbind(vma);
> >>>>+		if (i915_gem_object_put_pages(obj)) {
> >>>>+			struct task_struct *p;
> >>>>+
> >>>>+			DRM_ERROR("Unable to revoke ownership by userptr of"
> >>>>+				  " invalidated address range, sending SIGBUS"
> >>>>+				  " to attached clients.\n");
> >>>>+
> >>>>+			rcu_read_lock();
> >>>>+			for_each_process(p) {
> >>>>+				siginfo_t info;
> >>>>+
> >>>>+				/* This doesn't capture everyone who has
> >>>>+				 * the pages pinned behind a VMA as we
> >>>>+				 * do not have that tracking information
> >>>>+				 * available, it does however kill the
> >>>>+				 * original process (and siblings) who
> >>>>+				 * created the userptr and presumably tried
> >>>>+				 * to reuse the address space despite having
> >>>>+				 * pinned it (possibly indirectly) to the hw.
> >>>>+				 * Arguably, we don't even want to kill the
> >>>>+				 * other processes as they are not at fault,
> >>>>+				 * likely to be a display server, and hopefully
> >>>>+				 * will release the pages in due course once
> >>>>+				 * the client is dead.
> >>>>+				 */
> >>>>+				if (p->mm != obj->userptr.mm->mm)
> >>>>+					continue;
> >>>>+
> >>>>+				info.si_signo = SIGBUS;
> >>>>+				info.si_errno = 0;
> >>>>+				info.si_code = BUS_ADRERR;
> >>>>+				info.si_addr = (void __user *)obj->userptr.ptr;
> >>>>+				force_sig_info(SIGBUS, &info, p);
> >>>>+			}
> >>>>+			rcu_read_unlock();
> >>>
> >>>Why do we need to send a SIGBUS? It won't tear down the offending gem bo,
> >>>any new users will hopefully get it, and abusing SIGBUS without the thread
> >>>actually doing a memory access is a bit surprising. DRM_DEBUG seems to be
> >>>the most we can do here I think - I think userspace being able to do this
> >>>is just a fundamental property of userptr.
> >>
> >>It is not the bo that is at fault but the *client's* *address* *space*
> >>that is being changed. It is equivalent to mmap on a truncated file i.e.
> >>if the client tries to use its mmapping after it has truncated the file
> >>it is scolded via SIGBUS.
> >
> >But existing SIGBUS is thread-bound and comes with the fault address
> >attached. This is just the gpu being a bit unhappy, so the SIGBUS comes
> >out of complete nowhere to smack the userspace thread. Any kind of SIGBUS
> >catcher userspace has for other reasons might be supremely surprised by
> >this and do stupid things. Hence I don't think throwing SIGBUS here is
> >correct behaviour. And there doesn't seem to be anything else suitable
> >really.
> 
> Te offending address is provided with the signal as far as I can see.
> 
> I think it is fine to do this, even required since the alternative is for
> GPU to keep using random memory indefinitely and userspace never gets to
> know.
> 
> And I don't see any reason to keep the process running who did such an
> elementary and serious mistake.
> 
> Is the only concern that the process can catch it and not exit?

The concern is that this isn't how SIG_SEGV works, it's a signal the
thread who made the invalid access gets directly. You never get a SIG_SEGV
for bad access someone else has made. So essentially it's new ABI.

And the other bit is that doing new ABI with signals is considered bad
taste - if you want to have something that gets to userspace directly then
create an fd which is polleable. And then, if userspace chooses so, it can
kill itself with a SIG_IO using fcntl.

But the goal here seems to just to be telling userspace that something bad
has happened with its gpu context, and we already have gem_reset_stats for
that. That would imo be the correct interface to provide this information
to userspace. After all after a gpu reset it's also all garbage and we
just let everything continue merilly without killing the process that
submitted the work. I don't see how a fault is any different.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09  7:48         ` Daniel Vetter
@ 2015-10-09  8:40           ` Chris Wilson
  2015-10-09  8:55             ` Daniel Vetter
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2015-10-09  8:40 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> The concern is that this isn't how SIG_SEGV works, it's a signal the
> thread who made the invalid access gets directly. You never get a SIG_SEGV
> for bad access someone else has made. So essentially it's new ABI.

SIGBUS. For which the answer is yes, you can and do get SIGBUS for
actions taken by other processes.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09  8:40           ` Chris Wilson
@ 2015-10-09  8:55             ` Daniel Vetter
  2015-10-09  8:59               ` Chris Wilson
  2015-10-09  9:03               ` Tvrtko Ursulin
  0 siblings, 2 replies; 25+ messages in thread
From: Daniel Vetter @ 2015-10-09  8:55 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, intel-gfx

On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > The concern is that this isn't how SIG_SEGV works, it's a signal the
> > thread who made the invalid access gets directly. You never get a SIG_SEGV
> > for bad access someone else has made. So essentially it's new ABI.
> 
> SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> actions taken by other processes.

Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
userspace wants SIGIO we just need to provide it with a pollable fd and
then it can use fcntl to make that happen. That's imo a much better api
than unconditionally throwing around signals. Also we already have the
reset stats ioctl to tell userspace that its gpu context is toats. If
anyone wants that to be pollable (or even send SIGIO) I think we should
extend that, with all the usual "needs userspace&igt" stuff on top.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09  8:55             ` Daniel Vetter
@ 2015-10-09  8:59               ` Chris Wilson
  2015-10-09  9:03               ` Tvrtko Ursulin
  1 sibling, 0 replies; 25+ messages in thread
From: Chris Wilson @ 2015-10-09  8:59 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Fri, Oct 09, 2015 at 10:55:26AM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > > On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > > The concern is that this isn't how SIG_SEGV works, it's a signal the
> > > thread who made the invalid access gets directly. You never get a SIG_SEGV
> > > for bad access someone else has made. So essentially it's new ABI.
> > 
> > SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > actions taken by other processes.
> 
> Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> userspace wants SIGIO we just need to provide it with a pollable fd and
> then it can use fcntl to make that happen. That's imo a much better api
> than unconditionally throwing around signals. Also we already have the
> reset stats ioctl to tell userspace that its gpu context is toats. If
> anyone wants that to be pollable (or even send SIGIO) I think we should
> extend that, with all the usual "needs userspace&igt" stuff on top.

It's not EIO handling, it is a SIGBUS in the canonical sense that the
backing storage for a mmap just vanished whilst you were using it.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09  8:55             ` Daniel Vetter
  2015-10-09  8:59               ` Chris Wilson
@ 2015-10-09  9:03               ` Tvrtko Ursulin
  2015-10-09 17:14                 ` Daniel Vetter
  1 sibling, 1 reply; 25+ messages in thread
From: Tvrtko Ursulin @ 2015-10-09  9:03 UTC (permalink / raw)
  To: Daniel Vetter, Chris Wilson, intel-gfx


On 09/10/15 09:55, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
>> On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
>>> On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
>>> The concern is that this isn't how SIG_SEGV works, it's a signal the
>>> thread who made the invalid access gets directly. You never get a SIG_SEGV
>>> for bad access someone else has made. So essentially it's new ABI.
>>
>> SIGBUS. For which the answer is yes, you can and do get SIGBUS for
>> actions taken by other processes.
>
> Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> userspace wants SIGIO we just need to provide it with a pollable fd and
> then it can use fcntl to make that happen. That's imo a much better api
> than unconditionally throwing around signals. Also we already have the
> reset stats ioctl to tell userspace that its gpu context is toats. If
> anyone wants that to be pollable (or even send SIGIO) I think we should
> extend that, with all the usual "needs userspace&igt" stuff on top.

I don't see that this notification can be optional. Process is confused 
about its memory map use so should die. :)

This is not a GPU error/hang - this is the process doing stupid things.

MMU notifiers do not support decision making otherwise we could say 
-ETXTBUSY or something on munmap, but we can't. Not even sure that it 
would help in all cases, would have to fail clone as well and who knows 
what.

Regards,

Tvrtko


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09  9:03               ` Tvrtko Ursulin
@ 2015-10-09 17:14                 ` Daniel Vetter
  2015-10-09 17:26                   ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2015-10-09 17:14 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> 
> On 09/10/15 09:55, Daniel Vetter wrote:
> >On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> >>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> >>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> >>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> >>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> >>>for bad access someone else has made. So essentially it's new ABI.
> >>
> >>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> >>actions taken by other processes.
> >
> >Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> >userspace wants SIGIO we just need to provide it with a pollable fd and
> >then it can use fcntl to make that happen. That's imo a much better api
> >than unconditionally throwing around signals. Also we already have the
> >reset stats ioctl to tell userspace that its gpu context is toats. If
> >anyone wants that to be pollable (or even send SIGIO) I think we should
> >extend that, with all the usual "needs userspace&igt" stuff on top.
> 
> I don't see that this notification can be optional. Process is confused
> about its memory map use so should die. :)
> 
> This is not a GPU error/hang - this is the process doing stupid things.
> 
> MMU notifiers do not support decision making otherwise we could say
> -ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> help in all cases, would have to fail clone as well and who knows what.

So what happens if the gpu just keeps using the memory? It'll all be
horribly undefined behaviour and eventually it'll die on an -EFAULT in
execbuf, but does anything else bad happen?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09 17:14                 ` Daniel Vetter
@ 2015-10-09 17:26                   ` Chris Wilson
  2015-10-09 18:33                     ` Dave Gordon
  2015-10-12  9:06                     ` Tvrtko Ursulin
  0 siblings, 2 replies; 25+ messages in thread
From: Chris Wilson @ 2015-10-09 17:26 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > 
> > On 09/10/15 09:55, Daniel Vetter wrote:
> > >On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > >>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > >>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > >>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > >>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > >>>for bad access someone else has made. So essentially it's new ABI.
> > >>
> > >>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > >>actions taken by other processes.
> > >
> > >Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > >userspace wants SIGIO we just need to provide it with a pollable fd and
> > >then it can use fcntl to make that happen. That's imo a much better api
> > >than unconditionally throwing around signals. Also we already have the
> > >reset stats ioctl to tell userspace that its gpu context is toats. If
> > >anyone wants that to be pollable (or even send SIGIO) I think we should
> > >extend that, with all the usual "needs userspace&igt" stuff on top.
> > 
> > I don't see that this notification can be optional. Process is confused
> > about its memory map use so should die. :)
> > 
> > This is not a GPU error/hang - this is the process doing stupid things.
> > 
> > MMU notifiers do not support decision making otherwise we could say
> > -ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > help in all cases, would have to fail clone as well and who knows what.
> 
> So what happens if the gpu just keeps using the memory? It'll all be
> horribly undefined behaviour and eventually it'll die on an -EFAULT in
> execbuf, but does anything else bad happen?

We don't see an EFAULT unless a miracle occurs, and the stale pages
continue to be read/written by other processes (as well as the client).
Horribly undefined behaviour with a misinformation leak.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09 17:26                   ` Chris Wilson
@ 2015-10-09 18:33                     ` Dave Gordon
  2015-10-12  9:06                     ` Tvrtko Ursulin
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Gordon @ 2015-10-09 18:33 UTC (permalink / raw)
  To: intel-gfx

On 09/10/15 18:26, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
>> On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 09/10/15 09:55, Daniel Vetter wrote:
>>>> On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
>>>>> On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
>>>>>> On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
>>>>>> The concern is that this isn't how SIG_SEGV works, it's a signal the
>>>>>> thread who made the invalid access gets directly. You never get a SIG_SEGV
>>>>>> for bad access someone else has made. So essentially it's new ABI.
>>>>>
>>>>> SIGBUS. For which the answer is yes, you can and do get SIGBUS for
>>>>> actions taken by other processes.
>>>>
>>>> Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
>>>> userspace wants SIGIO we just need to provide it with a pollable fd and
>>>> then it can use fcntl to make that happen. That's imo a much better api
>>>> than unconditionally throwing around signals. Also we already have the
>>>> reset stats ioctl to tell userspace that its gpu context is toats. If
>>>> anyone wants that to be pollable (or even send SIGIO) I think we should
>>>> extend that, with all the usual "needs userspace&igt" stuff on top.
>>>
>>> I don't see that this notification can be optional. Process is confused
>>> about its memory map use so should die. :)
>>>
>>> This is not a GPU error/hang - this is the process doing stupid things.
>>>
>>> MMU notifiers do not support decision making otherwise we could say
>>> -ETXTBUSY or something on munmap, but we can't. Not even sure that it would
>>> help in all cases, would have to fail clone as well and who knows what.
>>
>> So what happens if the gpu just keeps using the memory? It'll all be
>> horribly undefined behaviour and eventually it'll die on an -EFAULT in
>> execbuf, but does anything else bad happen?
>
> We don't see an EFAULT unless a miracle occurs, and the stale pages
> continue to be read/written by other processes (as well as the client).
> Horribly undefined behaviour with a misinformation leak.
> -Chris

I think SIGBUS would be a good notification. It's the sort of outcome 
you expect when a privileged thread on the CPU or any sort of DMA-master 
device incurs an access fault on physical memory or I/O mapped register 
space. One explanation I found suggests:

	Another reason where SIGBUS can generate is explained below:

	You are currently using a external I/O device by mapping the
	device memory mapping into the system memory (Memory mapped
	I/O). You have used it. And now, you have disconnected it
	gracefully. But, somehow your code is trying to use an
	previously used address still in your code. The result in this
	case will be an SIGBUS, the reason is BUS_ADRERR, "non-existent
	physical address".

See http://cquestion.blogspot.com/2008/03/sigbus-vs-sigsegv.html

In this case, (we assume that) the GPU is going to continue to access 
the "physical" (PPGTT?) address of the (virtual) memory that the process 
is trying to revoke its access to. And while it might make sense to 
remove a buffer from the CPU's mapping while the GPU was still accessing 
it, it really makes no sense to delete a GTT mapping that the GPU may 
still (asynchronously) be accessing. So either we have to kill the 
process's outstanding tasks on the GPU (context-specific reset?) or fail 
the unmap (and shoot the process for trying to sabotage the GPU?).

Or ... could we decouple the pages? Duplicate them as for copy-on-write, 
and give one copy to the user process and the other to the GPU?
Of course the actual content of the page might be indeterminate if the 
GPU were writing it while the CPU was taking a copy ... does this make 
any sense?

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-09 17:26                   ` Chris Wilson
  2015-10-09 18:33                     ` Dave Gordon
@ 2015-10-12  9:06                     ` Tvrtko Ursulin
  2015-10-12  9:31                       ` Chris Wilson
  1 sibling, 1 reply; 25+ messages in thread
From: Tvrtko Ursulin @ 2015-10-12  9:06 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, intel-gfx


On 09/10/15 18:26, Chris Wilson wrote:
> On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
>> On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 09/10/15 09:55, Daniel Vetter wrote:
>>>> On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
>>>>> On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
>>>>>> On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
>>>>>> The concern is that this isn't how SIG_SEGV works, it's a signal the
>>>>>> thread who made the invalid access gets directly. You never get a SIG_SEGV
>>>>>> for bad access someone else has made. So essentially it's new ABI.
>>>>>
>>>>> SIGBUS. For which the answer is yes, you can and do get SIGBUS for
>>>>> actions taken by other processes.
>>>>
>>>> Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
>>>> userspace wants SIGIO we just need to provide it with a pollable fd and
>>>> then it can use fcntl to make that happen. That's imo a much better api
>>>> than unconditionally throwing around signals. Also we already have the
>>>> reset stats ioctl to tell userspace that its gpu context is toats. If
>>>> anyone wants that to be pollable (or even send SIGIO) I think we should
>>>> extend that, with all the usual "needs userspace&igt" stuff on top.
>>>
>>> I don't see that this notification can be optional. Process is confused
>>> about its memory map use so should die. :)
>>>
>>> This is not a GPU error/hang - this is the process doing stupid things.
>>>
>>> MMU notifiers do not support decision making otherwise we could say
>>> -ETXTBUSY or something on munmap, but we can't. Not even sure that it would
>>> help in all cases, would have to fail clone as well and who knows what.
>>
>> So what happens if the gpu just keeps using the memory? It'll all be
>> horribly undefined behaviour and eventually it'll die on an -EFAULT in
>> execbuf, but does anything else bad happen?
>
> We don't see an EFAULT unless a miracle occurs, and the stale pages
> continue to be read/written by other processes (as well as the client).
> Horribly undefined behaviour with a misinformation leak.

What other processes? Pages will still be referenced so won't be reused 
so there is not information leak across unrelated processes. Unless you 
meant ones involved in object sharing?

But we could improve the revoke mechanism I suppose by marking the 
object and then revoking it at the next opportunity?

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-12  9:06                     ` Tvrtko Ursulin
@ 2015-10-12  9:31                       ` Chris Wilson
  2015-10-12 10:10                         ` Chris Wilson
  2015-10-13 11:26                         ` Daniel Vetter
  0 siblings, 2 replies; 25+ messages in thread
From: Chris Wilson @ 2015-10-12  9:31 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> 
> On 09/10/15 18:26, Chris Wilson wrote:
> >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> >>>
> >>>On 09/10/15 09:55, Daniel Vetter wrote:
> >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> >>>>>>for bad access someone else has made. So essentially it's new ABI.
> >>>>>
> >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> >>>>>actions taken by other processes.
> >>>>
> >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> >>>>then it can use fcntl to make that happen. That's imo a much better api
> >>>>than unconditionally throwing around signals. Also we already have the
> >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> >>>
> >>>I don't see that this notification can be optional. Process is confused
> >>>about its memory map use so should die. :)
> >>>
> >>>This is not a GPU error/hang - this is the process doing stupid things.
> >>>
> >>>MMU notifiers do not support decision making otherwise we could say
> >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> >>>help in all cases, would have to fail clone as well and who knows what.
> >>
> >>So what happens if the gpu just keeps using the memory? It'll all be
> >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> >>execbuf, but does anything else bad happen?
> >
> >We don't see an EFAULT unless a miracle occurs, and the stale pages
> >continue to be read/written by other processes (as well as the client).
> >Horribly undefined behaviour with a misinformation leak.
> 
> What other processes? Pages will still be referenced so won't be
> reused so there is not information leak across unrelated processes.
> Unless you meant ones involved in object sharing?

This client is trying to replace the userptr with a fresh set of pages.
The GPU and other processes continue to see the old pages i.e. old
information (misinformation :) leaks.
 
> But we could improve the revoke mechanism I suppose by marking the
> object and then revoking it at the next opportunity?

We basically need to clone the obj, move the pages and vma over and so
as the vma die the pages are unreferenced. All new use will be forced to
call gup and be fine.

Ok, that smells ok, I'll see how doable that is.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-12  9:31                       ` Chris Wilson
@ 2015-10-12 10:10                         ` Chris Wilson
  2015-10-12 12:59                           ` Tvrtko Ursulin
  2015-10-13 11:26                         ` Daniel Vetter
  1 sibling, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2015-10-12 10:10 UTC (permalink / raw)
  To: Tvrtko Ursulin, Daniel Vetter, intel-gfx

On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> We basically need to clone the obj, move the pages and vma over and so
> as the vma die the pages are unreferenced. All new use will be forced to
> call gup and be fine.
> 
> Ok, that smells ok, I'll see how doable that is.

Hmm. If we take the vma-centric driver as granted (i.e. using the vma as the
token when pinning, the vma holds fences etc), the tricky part if that we
don't hold a reference from the pinned vma to the object. pin_display to
the rescue!

Initial sketch:

static struct drm_i915_gem_object *steal_pages(struct drm_i915_gem_object *obj)
{
        struct drm_i915_gem_object *clone;
        struct i915_vma *vma;
        int i;

        BUG_ON(obj->stolen);

        clone = i915_gem_object_alloc(obj->base.dev);
        if (clone == NULL)
                return clone;

        drm_gem_private_object_init(obj->base.dev,
                                    &clone->base,
                                    obj->base.size);
        i915_gem_object_init(clone, obj->ops);

        list_splice_init(&obj->vma_list, &clone->vma_list);
        list_for_each_entry(vma, &clone->vma_list, obj_link)
                vma->obj = clone;

        if (obj->pin_display) {
                clone->pin_display = obj->pin_display;
                while (obj->pin_display--) {
                        drm_gem_object_reference(&clone->base);
                        drm_gem_object_unreference(&obj->base);
                }
        }

        clone->bind_count = obj->bind_count;
        obj->bind_count = 0;
        /* vma_ht / vma_hashed */

        for (i = 0; i < I915_NUM_RINGS; i++) {
                if (obj->last_read[i].request == NULL)
                        continue;

                clone->last_read[i].request = obj->last_read[i].request;
                list_replace_init(&obj->last_read[i].link,
                                  &clone->last_read[i].link);
                clone->flags |= 1 << (i + I915_BO_ACTIVE_SHIFT);
        }
        if (obj->last_write.request) {
                clone->last_write.request = obj->last_write.request;
                list_replace_init(&obj->last_write.link,
                                  &clone->last_write.link);
        }

        clone->dirty = obj->dirty;
        obj->dirty = false;

        clone->tiling_mode = obj->tiling_mode;
        clone->stride = obj->stride;

        clone->pin_display = obj->pin_display;
        obj->pin_display = 0;

        clone->madv = I915_MADV_DONTNEED;
        clone->pages = obj->pages;
        clone->pages_pin_count = obj->pages_pin_count;
        clone->get_page = obj->get_page;
        clone->vmapping = obj->vmapping;
        obj->pages = NULL;
        obj->pages_pin_count = 0;
        obj->vmapping = NULL;

        clone->bit_17 = obj->bit_17;
        obj->bit_17 = NULL;

        i915_gem_release_mmap(obj);

        if (I915_BO_IS_ACTIVE(clone))
                clone->active_reference = true;
        else
                drm_gem_object_unreference(&clone->base);

        return clone;
}

It does have the presumption that we have either an active reference or a
pinned reference.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-12 10:10                         ` Chris Wilson
@ 2015-10-12 12:59                           ` Tvrtko Ursulin
  0 siblings, 0 replies; 25+ messages in thread
From: Tvrtko Ursulin @ 2015-10-12 12:59 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, intel-gfx


On 12/10/15 11:10, Chris Wilson wrote:
> On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
>> We basically need to clone the obj, move the pages and vma over and so
>> as the vma die the pages are unreferenced. All new use will be forced to
>> call gup and be fine.
>>
>> Ok, that smells ok, I'll see how doable that is.
>
> Hmm. If we take the vma-centric driver as granted (i.e. using the vma as the
> token when pinning, the vma holds fences etc), the tricky part if that we
> don't hold a reference from the pinned vma to the object. pin_display to
> the rescue!
>
> Initial sketch:
>
> static struct drm_i915_gem_object *steal_pages(struct drm_i915_gem_object *obj)
> {
>          struct drm_i915_gem_object *clone;
>          struct i915_vma *vma;
>          int i;
>
>          BUG_ON(obj->stolen);
>
>          clone = i915_gem_object_alloc(obj->base.dev);
>          if (clone == NULL)
>                  return clone;
>
>          drm_gem_private_object_init(obj->base.dev,
>                                      &clone->base,
>                                      obj->base.size);
>          i915_gem_object_init(clone, obj->ops);
>
>          list_splice_init(&obj->vma_list, &clone->vma_list);
>          list_for_each_entry(vma, &clone->vma_list, obj_link)
>                  vma->obj = clone;
>
>          if (obj->pin_display) {
>                  clone->pin_display = obj->pin_display;
>                  while (obj->pin_display--) {
>                          drm_gem_object_reference(&clone->base);
>                          drm_gem_object_unreference(&obj->base);
>                  }
>          }
>
>          clone->bind_count = obj->bind_count;
>          obj->bind_count = 0;
>          /* vma_ht / vma_hashed */
>
>          for (i = 0; i < I915_NUM_RINGS; i++) {
>                  if (obj->last_read[i].request == NULL)
>                          continue;
>
>                  clone->last_read[i].request = obj->last_read[i].request;
>                  list_replace_init(&obj->last_read[i].link,
>                                    &clone->last_read[i].link);
>                  clone->flags |= 1 << (i + I915_BO_ACTIVE_SHIFT);
>          }
>          if (obj->last_write.request) {
>                  clone->last_write.request = obj->last_write.request;
>                  list_replace_init(&obj->last_write.link,
>                                    &clone->last_write.link);
>          }
>
>          clone->dirty = obj->dirty;
>          obj->dirty = false;
>
>          clone->tiling_mode = obj->tiling_mode;
>          clone->stride = obj->stride;
>
>          clone->pin_display = obj->pin_display;
>          obj->pin_display = 0;
>
>          clone->madv = I915_MADV_DONTNEED;
>          clone->pages = obj->pages;
>          clone->pages_pin_count = obj->pages_pin_count;
>          clone->get_page = obj->get_page;
>          clone->vmapping = obj->vmapping;
>          obj->pages = NULL;
>          obj->pages_pin_count = 0;
>          obj->vmapping = NULL;
>
>          clone->bit_17 = obj->bit_17;
>          obj->bit_17 = NULL;
>
>          i915_gem_release_mmap(obj);
>
>          if (I915_BO_IS_ACTIVE(clone))
>                  clone->active_reference = true;
>          else
>                  drm_gem_object_unreference(&clone->base);
>
>          return clone;
> }
>
> It does have the presumption that we have either an active reference or a
> pinned reference.

I get the general idea but the base is too far from the current code to 
properly evaluate. But in principle it sounds promising.

Regards,

Tvrtko





_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-12  9:31                       ` Chris Wilson
  2015-10-12 10:10                         ` Chris Wilson
@ 2015-10-13 11:26                         ` Daniel Vetter
  2015-10-13 11:44                           ` Chris Wilson
  1 sibling, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2015-10-13 11:26 UTC (permalink / raw)
  To: Chris Wilson, Tvrtko Ursulin, Daniel Vetter, intel-gfx

On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> > 
> > On 09/10/15 18:26, Chris Wilson wrote:
> > >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> > >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > >>>
> > >>>On 09/10/15 09:55, Daniel Vetter wrote:
> > >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > >>>>>>for bad access someone else has made. So essentially it's new ABI.
> > >>>>>
> > >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > >>>>>actions taken by other processes.
> > >>>>
> > >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> > >>>>then it can use fcntl to make that happen. That's imo a much better api
> > >>>>than unconditionally throwing around signals. Also we already have the
> > >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> > >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> > >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> > >>>
> > >>>I don't see that this notification can be optional. Process is confused
> > >>>about its memory map use so should die. :)
> > >>>
> > >>>This is not a GPU error/hang - this is the process doing stupid things.
> > >>>
> > >>>MMU notifiers do not support decision making otherwise we could say
> > >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > >>>help in all cases, would have to fail clone as well and who knows what.
> > >>
> > >>So what happens if the gpu just keeps using the memory? It'll all be
> > >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> > >>execbuf, but does anything else bad happen?
> > >
> > >We don't see an EFAULT unless a miracle occurs, and the stale pages
> > >continue to be read/written by other processes (as well as the client).
> > >Horribly undefined behaviour with a misinformation leak.
> > 
> > What other processes? Pages will still be referenced so won't be
> > reused so there is not information leak across unrelated processes.
> > Unless you meant ones involved in object sharing?
> 
> This client is trying to replace the userptr with a fresh set of pages.
> The GPU and other processes continue to see the old pages i.e. old
> information (misinformation :) leaks.

userptr explicitly doesn't support this. You need to tear down the
existing userptr object and then create a new one if you change the
mmap'ing. So that's really just a bug in userspace, and the question is
how do we tell userspace best that it's done something stupid.

My stance that the best one is to either kill any ctx using that object or
at least indicate there's trouble using reset stats. But sending a
SIGBUS/SIG_SEGV (which can only happen to the thread that does a memory
access, not any other thread that's accidentally in the same process
group) is imo abuse. Or we just need to make sure we do get the EFAULT on
the next execbuffer.

Or maybe it just doesn't matter, i.e. where is the userspace which a) does
silly stuff like this b) wants proper notification? Adding ABI just
because we can't isn't going to get merged.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-13 11:26                         ` Daniel Vetter
@ 2015-10-13 11:44                           ` Chris Wilson
  2015-10-13 12:23                             ` Daniel Vetter
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2015-10-13 11:44 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Tue, Oct 13, 2015 at 01:26:36PM +0200, Daniel Vetter wrote:
> On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> > On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> > > 
> > > On 09/10/15 18:26, Chris Wilson wrote:
> > > >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> > > >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > > >>>
> > > >>>On 09/10/15 09:55, Daniel Vetter wrote:
> > > >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > > >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > > >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > > >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > > >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > > >>>>>>for bad access someone else has made. So essentially it's new ABI.
> > > >>>>>
> > > >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > > >>>>>actions taken by other processes.
> > > >>>>
> > > >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > > >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> > > >>>>then it can use fcntl to make that happen. That's imo a much better api
> > > >>>>than unconditionally throwing around signals. Also we already have the
> > > >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> > > >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> > > >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> > > >>>
> > > >>>I don't see that this notification can be optional. Process is confused
> > > >>>about its memory map use so should die. :)
> > > >>>
> > > >>>This is not a GPU error/hang - this is the process doing stupid things.
> > > >>>
> > > >>>MMU notifiers do not support decision making otherwise we could say
> > > >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > > >>>help in all cases, would have to fail clone as well and who knows what.
> > > >>
> > > >>So what happens if the gpu just keeps using the memory? It'll all be
> > > >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> > > >>execbuf, but does anything else bad happen?
> > > >
> > > >We don't see an EFAULT unless a miracle occurs, and the stale pages
> > > >continue to be read/written by other processes (as well as the client).
> > > >Horribly undefined behaviour with a misinformation leak.
> > > 
> > > What other processes? Pages will still be referenced so won't be
> > > reused so there is not information leak across unrelated processes.
> > > Unless you meant ones involved in object sharing?
> > 
> > This client is trying to replace the userptr with a fresh set of pages.
> > The GPU and other processes continue to see the old pages i.e. old
> > information (misinformation :) leaks.
> 
> userptr explicitly doesn't support this. You need to tear down the
> existing userptr object and then create a new one if you change the
> mmap'ing. So that's really just a bug in userspace, and the question is
> how do we tell userspace best that it's done something stupid.

Pardon? Note this also affects munmap if you don't accept mremap (which
is not explicitly unsupported as it fits quite nicely within the
existing rules).

> My stance that the best one is to either kill any ctx using that object or
> at least indicate there's trouble using reset stats. But sending a
> SIGBUS/SIG_SEGV (which can only happen to the thread that does a memory
> access, not any other thread that's accidentally in the same process
> group) is imo abuse.

The signal is sent to everything that inherited the mm, not bound to the
single thread.

> Or we just need to make sure we do get the EFAULT on
> the next execbuffer.
> 
> Or maybe it just doesn't matter, i.e. where is the userspace which a) does
> silly stuff like this b) wants proper notification? Adding ABI just
> because we can't isn't going to get merged.

No client wants to be killed just because it does something stupid, it
is killed to protect the integrity of the system.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-13 11:44                           ` Chris Wilson
@ 2015-10-13 12:23                             ` Daniel Vetter
  2015-10-13 13:09                               ` Chris Wilson
  0 siblings, 1 reply; 25+ messages in thread
From: Daniel Vetter @ 2015-10-13 12:23 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, intel-gfx

On Tue, Oct 13, 2015 at 12:44:05PM +0100, Chris Wilson wrote:
> On Tue, Oct 13, 2015 at 01:26:36PM +0200, Daniel Vetter wrote:
> > On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> > > On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> > > > 
> > > > On 09/10/15 18:26, Chris Wilson wrote:
> > > > >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> > > > >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > > > >>>
> > > > >>>On 09/10/15 09:55, Daniel Vetter wrote:
> > > > >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > > > >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > > > >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > > > >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > > > >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > > > >>>>>>for bad access someone else has made. So essentially it's new ABI.
> > > > >>>>>
> > > > >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > > > >>>>>actions taken by other processes.
> > > > >>>>
> > > > >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > > > >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> > > > >>>>then it can use fcntl to make that happen. That's imo a much better api
> > > > >>>>than unconditionally throwing around signals. Also we already have the
> > > > >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> > > > >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> > > > >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> > > > >>>
> > > > >>>I don't see that this notification can be optional. Process is confused
> > > > >>>about its memory map use so should die. :)
> > > > >>>
> > > > >>>This is not a GPU error/hang - this is the process doing stupid things.
> > > > >>>
> > > > >>>MMU notifiers do not support decision making otherwise we could say
> > > > >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > > > >>>help in all cases, would have to fail clone as well and who knows what.
> > > > >>
> > > > >>So what happens if the gpu just keeps using the memory? It'll all be
> > > > >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> > > > >>execbuf, but does anything else bad happen?
> > > > >
> > > > >We don't see an EFAULT unless a miracle occurs, and the stale pages
> > > > >continue to be read/written by other processes (as well as the client).
> > > > >Horribly undefined behaviour with a misinformation leak.
> > > > 
> > > > What other processes? Pages will still be referenced so won't be
> > > > reused so there is not information leak across unrelated processes.
> > > > Unless you meant ones involved in object sharing?
> > > 
> > > This client is trying to replace the userptr with a fresh set of pages.
> > > The GPU and other processes continue to see the old pages i.e. old
> > > information (misinformation :) leaks.
> > 
> > userptr explicitly doesn't support this. You need to tear down the
> > existing userptr object and then create a new one if you change the
> > mmap'ing. So that's really just a bug in userspace, and the question is
> > how do we tell userspace best that it's done something stupid.
> 
> Pardon? Note this also affects munmap if you don't accept mremap (which
> is not explicitly unsupported as it fits quite nicely within the
> existing rules).
> 
> > My stance that the best one is to either kill any ctx using that object or
> > at least indicate there's trouble using reset stats. But sending a
> > SIGBUS/SIG_SEGV (which can only happen to the thread that does a memory
> > access, not any other thread that's accidentally in the same process
> > group) is imo abuse.
> 
> The signal is sent to everything that inherited the mm, not bound to the
> single thread.
> 
> > Or we just need to make sure we do get the EFAULT on
> > the next execbuffer.
> > 
> > Or maybe it just doesn't matter, i.e. where is the userspace which a) does
> > silly stuff like this b) wants proper notification? Adding ABI just
> > because we can't isn't going to get merged.
> 
> No client wants to be killed just because it does something stupid, it
> is killed to protect the integrity of the system.

But killing the client won't get rid of the ctx/objects, so won't really
solve all that much? Especially it won't get rid of the framebuffers,
which is the real trouble here it seems. That's because logind keeps a
duped copy of the fd for it's own purposes.

So the better fix would be to make sure we don't accidentally pin a
userptr object, i.e. adding a check in intel_pin_and_fence_fb for that
(plus igt testcase). I somehow missed in all this discussion that this is
about pinned objects ;-)
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-13 12:23                             ` Daniel Vetter
@ 2015-10-13 13:09                               ` Chris Wilson
  2015-10-13 13:39                                 ` Daniel Vetter
  0 siblings, 1 reply; 25+ messages in thread
From: Chris Wilson @ 2015-10-13 13:09 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Tue, Oct 13, 2015 at 02:23:57PM +0200, Daniel Vetter wrote:
> On Tue, Oct 13, 2015 at 12:44:05PM +0100, Chris Wilson wrote:
> > On Tue, Oct 13, 2015 at 01:26:36PM +0200, Daniel Vetter wrote:
> > > On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> > > > On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> > > > > 
> > > > > On 09/10/15 18:26, Chris Wilson wrote:
> > > > > >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> > > > > >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > > > > >>>
> > > > > >>>On 09/10/15 09:55, Daniel Vetter wrote:
> > > > > >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > > > > >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > > > > >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > > > > >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > > > > >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > > > > >>>>>>for bad access someone else has made. So essentially it's new ABI.
> > > > > >>>>>
> > > > > >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > > > > >>>>>actions taken by other processes.
> > > > > >>>>
> > > > > >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > > > > >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> > > > > >>>>then it can use fcntl to make that happen. That's imo a much better api
> > > > > >>>>than unconditionally throwing around signals. Also we already have the
> > > > > >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> > > > > >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> > > > > >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> > > > > >>>
> > > > > >>>I don't see that this notification can be optional. Process is confused
> > > > > >>>about its memory map use so should die. :)
> > > > > >>>
> > > > > >>>This is not a GPU error/hang - this is the process doing stupid things.
> > > > > >>>
> > > > > >>>MMU notifiers do not support decision making otherwise we could say
> > > > > >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > > > > >>>help in all cases, would have to fail clone as well and who knows what.
> > > > > >>
> > > > > >>So what happens if the gpu just keeps using the memory? It'll all be
> > > > > >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> > > > > >>execbuf, but does anything else bad happen?
> > > > > >
> > > > > >We don't see an EFAULT unless a miracle occurs, and the stale pages
> > > > > >continue to be read/written by other processes (as well as the client).
> > > > > >Horribly undefined behaviour with a misinformation leak.
> > > > > 
> > > > > What other processes? Pages will still be referenced so won't be
> > > > > reused so there is not information leak across unrelated processes.
> > > > > Unless you meant ones involved in object sharing?
> > > > 
> > > > This client is trying to replace the userptr with a fresh set of pages.
> > > > The GPU and other processes continue to see the old pages i.e. old
> > > > information (misinformation :) leaks.
> > > 
> > > userptr explicitly doesn't support this. You need to tear down the
> > > existing userptr object and then create a new one if you change the
> > > mmap'ing. So that's really just a bug in userspace, and the question is
> > > how do we tell userspace best that it's done something stupid.
> > 
> > Pardon? Note this also affects munmap if you don't accept mremap (which
> > is not explicitly unsupported as it fits quite nicely within the
> > existing rules).
> > 
> > > My stance that the best one is to either kill any ctx using that object or
> > > at least indicate there's trouble using reset stats. But sending a
> > > SIGBUS/SIG_SEGV (which can only happen to the thread that does a memory
> > > access, not any other thread that's accidentally in the same process
> > > group) is imo abuse.
> > 
> > The signal is sent to everything that inherited the mm, not bound to the
> > single thread.
> > 
> > > Or we just need to make sure we do get the EFAULT on
> > > the next execbuffer.
> > > 
> > > Or maybe it just doesn't matter, i.e. where is the userspace which a) does
> > > silly stuff like this b) wants proper notification? Adding ABI just
> > > because we can't isn't going to get merged.
> > 
> > No client wants to be killed just because it does something stupid, it
> > is killed to protect the integrity of the system.
> 
> But killing the client won't get rid of the ctx/objects, so won't really
> solve all that much? Especially it won't get rid of the framebuffers,
> which is the real trouble here it seems. That's because logind keeps a
> duped copy of the fd for it's own purposes.

I mentioned that in the commit message - but do note that what is pinned
is what the client has mmapped at that time. Before it is changed, the
client is killed and so we do not get into the situation where the
output is invalid (just old).

> So the better fix would be to make sure we don't accidentally pin a
> userptr object, i.e. adding a check in intel_pin_and_fence_fb for that
> (plus igt testcase). I somehow missed in all this discussion that this is
> about pinned objects ;-)

Preventing wrapping a userptr in a fb is one possibility.

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index b89131654a0e..0cc689d0ce0a 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -14116,6 +14116,9 @@ static int intel_user_framebuffer_create_handle(struct drm_framebuffer *fb,
        struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
        struct drm_i915_gem_object *obj = intel_fb->obj;
 
+       if (obj->userptr.mm)
+               return -EINVAL;
+
        return drm_gem_handle_create(file, &obj->base, handle);
 }

And we can return to this discussion the next time someone mentions
hitting the WARN. Or they complain about not being able to create fb.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS
  2015-10-13 13:09                               ` Chris Wilson
@ 2015-10-13 13:39                                 ` Daniel Vetter
  0 siblings, 0 replies; 25+ messages in thread
From: Daniel Vetter @ 2015-10-13 13:39 UTC (permalink / raw)
  To: Chris Wilson, Daniel Vetter, Tvrtko Ursulin, intel-gfx

On Tue, Oct 13, 2015 at 02:09:48PM +0100, Chris Wilson wrote:
> On Tue, Oct 13, 2015 at 02:23:57PM +0200, Daniel Vetter wrote:
> > On Tue, Oct 13, 2015 at 12:44:05PM +0100, Chris Wilson wrote:
> > > On Tue, Oct 13, 2015 at 01:26:36PM +0200, Daniel Vetter wrote:
> > > > On Mon, Oct 12, 2015 at 10:31:35AM +0100, Chris Wilson wrote:
> > > > > On Mon, Oct 12, 2015 at 10:06:23AM +0100, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > On 09/10/15 18:26, Chris Wilson wrote:
> > > > > > >On Fri, Oct 09, 2015 at 07:14:02PM +0200, Daniel Vetter wrote:
> > > > > > >>On Fri, Oct 09, 2015 at 10:03:14AM +0100, Tvrtko Ursulin wrote:
> > > > > > >>>
> > > > > > >>>On 09/10/15 09:55, Daniel Vetter wrote:
> > > > > > >>>>On Fri, Oct 09, 2015 at 09:40:53AM +0100, Chris Wilson wrote:
> > > > > > >>>>>On Fri, Oct 09, 2015 at 09:48:01AM +0200, Daniel Vetter wrote:
> > > > > > >>>>>>On Thu, Oct 08, 2015 at 10:45:47AM +0100, Tvrtko Ursulin wrote:
> > > > > > >>>>>>The concern is that this isn't how SIG_SEGV works, it's a signal the
> > > > > > >>>>>>thread who made the invalid access gets directly. You never get a SIG_SEGV
> > > > > > >>>>>>for bad access someone else has made. So essentially it's new ABI.
> > > > > > >>>>>
> > > > > > >>>>>SIGBUS. For which the answer is yes, you can and do get SIGBUS for
> > > > > > >>>>>actions taken by other processes.
> > > > > > >>>>
> > > > > > >>>>Oh right I always forget that SIGBUS aliases with SIGIO. Anyway if
> > > > > > >>>>userspace wants SIGIO we just need to provide it with a pollable fd and
> > > > > > >>>>then it can use fcntl to make that happen. That's imo a much better api
> > > > > > >>>>than unconditionally throwing around signals. Also we already have the
> > > > > > >>>>reset stats ioctl to tell userspace that its gpu context is toats. If
> > > > > > >>>>anyone wants that to be pollable (or even send SIGIO) I think we should
> > > > > > >>>>extend that, with all the usual "needs userspace&igt" stuff on top.
> > > > > > >>>
> > > > > > >>>I don't see that this notification can be optional. Process is confused
> > > > > > >>>about its memory map use so should die. :)
> > > > > > >>>
> > > > > > >>>This is not a GPU error/hang - this is the process doing stupid things.
> > > > > > >>>
> > > > > > >>>MMU notifiers do not support decision making otherwise we could say
> > > > > > >>>-ETXTBUSY or something on munmap, but we can't. Not even sure that it would
> > > > > > >>>help in all cases, would have to fail clone as well and who knows what.
> > > > > > >>
> > > > > > >>So what happens if the gpu just keeps using the memory? It'll all be
> > > > > > >>horribly undefined behaviour and eventually it'll die on an -EFAULT in
> > > > > > >>execbuf, but does anything else bad happen?
> > > > > > >
> > > > > > >We don't see an EFAULT unless a miracle occurs, and the stale pages
> > > > > > >continue to be read/written by other processes (as well as the client).
> > > > > > >Horribly undefined behaviour with a misinformation leak.
> > > > > > 
> > > > > > What other processes? Pages will still be referenced so won't be
> > > > > > reused so there is not information leak across unrelated processes.
> > > > > > Unless you meant ones involved in object sharing?
> > > > > 
> > > > > This client is trying to replace the userptr with a fresh set of pages.
> > > > > The GPU and other processes continue to see the old pages i.e. old
> > > > > information (misinformation :) leaks.
> > > > 
> > > > userptr explicitly doesn't support this. You need to tear down the
> > > > existing userptr object and then create a new one if you change the
> > > > mmap'ing. So that's really just a bug in userspace, and the question is
> > > > how do we tell userspace best that it's done something stupid.
> > > 
> > > Pardon? Note this also affects munmap if you don't accept mremap (which
> > > is not explicitly unsupported as it fits quite nicely within the
> > > existing rules).
> > > 
> > > > My stance that the best one is to either kill any ctx using that object or
> > > > at least indicate there's trouble using reset stats. But sending a
> > > > SIGBUS/SIG_SEGV (which can only happen to the thread that does a memory
> > > > access, not any other thread that's accidentally in the same process
> > > > group) is imo abuse.
> > > 
> > > The signal is sent to everything that inherited the mm, not bound to the
> > > single thread.
> > > 
> > > > Or we just need to make sure we do get the EFAULT on
> > > > the next execbuffer.
> > > > 
> > > > Or maybe it just doesn't matter, i.e. where is the userspace which a) does
> > > > silly stuff like this b) wants proper notification? Adding ABI just
> > > > because we can't isn't going to get merged.
> > > 
> > > No client wants to be killed just because it does something stupid, it
> > > is killed to protect the integrity of the system.
> > 
> > But killing the client won't get rid of the ctx/objects, so won't really
> > solve all that much? Especially it won't get rid of the framebuffers,
> > which is the real trouble here it seems. That's because logind keeps a
> > duped copy of the fd for it's own purposes.
> 
> I mentioned that in the commit message - but do note that what is pinned
> is what the client has mmapped at that time. Before it is changed, the
> client is killed and so we do not get into the situation where the
> output is invalid (just old).
> 
> > So the better fix would be to make sure we don't accidentally pin a
> > userptr object, i.e. adding a check in intel_pin_and_fence_fb for that
> > (plus igt testcase). I somehow missed in all this discussion that this is
> > about pinned objects ;-)
> 
> Preventing wrapping a userptr in a fb is one possibility.
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index b89131654a0e..0cc689d0ce0a 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -14116,6 +14116,9 @@ static int intel_user_framebuffer_create_handle(struct drm_framebuffer *fb,
>         struct intel_framebuffer *intel_fb = to_intel_framebuffer(fb);
>         struct drm_i915_gem_object *obj = intel_fb->obj;
>  
> +       if (obj->userptr.mm)
> +               return -EINVAL;
> +
>         return drm_gem_handle_create(file, &obj->base, handle);
>  }
> 
> And we can return to this discussion the next time someone mentions
> hitting the WARN. Or they complain about not being able to create fb.

Yeah I think I much prefer this stop-gap measure until we have a real
need. sob + igt + discussion summary and I'll apply this.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-10-13 13:37 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-23 20:07 [PATCH] drm/i915: Convert WARNs during userptr revoke to SIGBUS Chris Wilson
2015-09-24 10:23 ` Tvrtko Ursulin
2015-09-24 10:31   ` Chris Wilson
2015-09-24 10:55     ` Tvrtko Ursulin
2015-09-28 13:42 ` Daniel Vetter
2015-09-28 13:52   ` Chris Wilson
2015-09-28 14:14     ` Daniel Vetter
2015-10-08  9:45       ` Tvrtko Ursulin
2015-10-09  7:48         ` Daniel Vetter
2015-10-09  8:40           ` Chris Wilson
2015-10-09  8:55             ` Daniel Vetter
2015-10-09  8:59               ` Chris Wilson
2015-10-09  9:03               ` Tvrtko Ursulin
2015-10-09 17:14                 ` Daniel Vetter
2015-10-09 17:26                   ` Chris Wilson
2015-10-09 18:33                     ` Dave Gordon
2015-10-12  9:06                     ` Tvrtko Ursulin
2015-10-12  9:31                       ` Chris Wilson
2015-10-12 10:10                         ` Chris Wilson
2015-10-12 12:59                           ` Tvrtko Ursulin
2015-10-13 11:26                         ` Daniel Vetter
2015-10-13 11:44                           ` Chris Wilson
2015-10-13 12:23                             ` Daniel Vetter
2015-10-13 13:09                               ` Chris Wilson
2015-10-13 13:39                                 ` Daniel Vetter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.