* [RFC] GPU reset notification interface
@ 2012-07-17 22:16 Ian Romanick
  2012-07-18  1:57 ` Ian Romanick
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ian Romanick @ 2012-07-17 22:16 UTC (permalink / raw)
  To: intel-gfx

I'm getting ready to implement the reset notification part of 
GL_ARB_robustness in the i965 driver.  There are a bunch of quirky bits 
of the extension that are causing some grief in designing the kernel / 
user interface.  I think I've settled on an interface that should meet 
all of the requirements, but I want to bounce it off people before I 
start writing code.

Here's the list of requirements.

- Applications poll for reset status.

- Contexts that did not lose data or rendering should not receive a 
reset notification.  This isn't strictly a requirement of the spec, but 
it seems like a good practice.  Once an application receives a reset 
notification for a context, it is supposed to destroy that context and 
start over.

- If one context in an OpenGL share group receives a reset notification, 
all contexts in that share group must receive a reset notification.

- All contexts in a single GL share group will have the same fd.  This 
isn't a requirement so much as a simplifying assumption.  All contexts 
in a share group have to be in the same address space, so I will assume 
that means they're all controlled by one DRI driver instance with a 
single fd.

- The reset notification given to the application should try to assign 
guilt.  There are three values possible: unknown guilt, you're not 
guilty, or you are guilty.

- If there are multiple resets between polls, the application should get 
the "most guilty" answer.  In other words, if there are two resets and 
the context was guilty for one and not the other, the application should 
get the guilty notification.

- After the application polls the status, the status should revert to 
"no reset occurred."

- If the application polls the status and the reset operation is still 
in progress, it should continue to get the reset value until it is 
"safe" to begin issuing GL commands again.

At some point I'd like to extend this to give slightly finer grained 
mechanism so that a context could be told that everything after a 
particular GL sync (fence) operation was lost.  This could prevent some 
applications from having to destroy and rebuild their context.  This 
isn't a requirement, but it's an idea that I've been mulling.

Each time a reset occurs, an internal count is incremented.  This 
associates a unique integer, starting with 1, with each reset event. 
Each context affected by the reset will have the reset event ID stored 
in one of its three guilt levels.  An ioctl will be provided that returns 
the following data for all contexts associated with a particular fd.

In addition, it will return the index of any reset operation that is 
still in progress.

I think this should be sufficient information for user space to meet all 
of the requirements.  I had a conversation with Paul and Ken about this. 
  Paul was pretty happy with the idea.  Ken felt like too little policy 
was in the kernel, and the resulting interface was too heavy (I'm 
paraphrasing).

struct drm_context_reset_counts {
	__u32 ctx_id;

	/**
	 * Index of the most recent reset where this context was
	 * guilty.  Zero if none.
	 */
	__u32 guilty;

	/**
	 * Index of the most recent reset where this context was
	 * not guilty.  Zero if none.
	 */
	__u32 not_guilty;

	/**
	 * Index of the most recent reset where guilt was unknown.
	 * Zero if none.
	 */
	__u32 unknown_guilt;
};

struct drm_reset_counts {
	/** Index of the in-progress reset.  Zero if none. */
	unsigned reset_index_in_progress;

	/** Number of contexts. */
	__u32 num_contexts;

	struct drm_context_reset_counts contexts[0];
};
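
For illustration only, here is a minimal user-space sketch (not part of the
proposal) of how the counts above could satisfy the "most guilty answer" and
"revert after polling" requirements.  The helper and its last_seen
bookkeeping are hypothetical; only the structs are what is being proposed.

#include <stdint.h>

/* Stand-ins for the GL_ARB_robustness reset status values. */
enum reset_status { NO_RESET, RESET_UNKNOWN, RESET_INNOCENT, RESET_GUILTY };

/*
 * Derive the answer for one context from the counts above.  last_seen is
 * the highest reset index this GL context has already reported; keeping it
 * in the gl_context implements "revert to no reset after polling" without
 * any extra kernel state.
 */
static enum reset_status
reset_status_for_context(const struct drm_context_reset_counts *c,
                         uint32_t *last_seen)
{
	enum reset_status status = NO_RESET;
	uint32_t newest = *last_seen;

	/* The "most guilty" answer wins when several resets happened
	 * since the previous poll. */
	if (c->guilty > *last_seen)
		status = RESET_GUILTY;
	else if (c->not_guilty > *last_seen)
		status = RESET_INNOCENT;
	else if (c->unknown_guilt > *last_seen)
		status = RESET_UNKNOWN;

	/* Absorb every reset index we have now reported on. */
	if (c->guilty > newest)
		newest = c->guilty;
	if (c->not_guilty > newest)
		newest = c->not_guilty;
	if (c->unknown_guilt > newest)
		newest = c->unknown_guilt;
	*last_seen = newest;

	return status;
}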


* Re: [RFC] GPU reset notification interface
  2012-07-17 22:16 [RFC] GPU reset notification interface Ian Romanick
@ 2012-07-18  1:57 ` Ian Romanick
  2012-07-18  7:06 ` Chris Wilson
  2012-07-18  9:20 ` Daniel Vetter
  2 siblings, 0 replies; 10+ messages in thread
From: Ian Romanick @ 2012-07-18  1:57 UTC (permalink / raw)
  To: intel-gfx

On 07/17/2012 03:16 PM, Ian Romanick wrote:
> I'm getting ready to implement the reset notification part of
> GL_ARB_robustness in the i965 driver.  There are a bunch of quirky bits
> of the extension that are causing some grief in designing the kernel /
> user interface.  I think I've settled on an interface that should meet
> all of the requirements, but I want to bounce it off people before I
> start writing code.
>
> Here's the list of requirements.
>
> - Applications poll for reset status.
>
> - Contexts that did not lose data or rendering should not receive a
> reset notification.  This isn't strictly a requirement of the spec, but
> it seems like a good practice.  Once an application receives a reset
> notification for a context, it is supposed to destroy that context and
> start over.
>
> - If one context in an OpenGL share group receives a reset notification,
> all contexts in that share group must receive a reset notification.
>
> - All contexts in a single GL share group will have the same fd.  This
> isn't a requirement so much as a simplifying assumption.  All contexts
> in a share group have to be in the same address space, so I will assume
> that means they're all controlled by one DRI driver instance with a
> single fd.
>
> - The reset notification given to the application should try to assign
> guilt.  There are three values possible: unknown guilt, you're not
> guilty, or you are guilty.
>
> - If there are multiple resets between polls, the application should get
> the "most guilty" answer.  In other words, if there are two resets and
> the context was guilty for one and not the other, the application should
> get the guilty notification.
>
> - After the application polls the status, the status should revert to
> "no reset occurred."
>
> - If the application polls the status and the reset operation is still
> in progress, it should continue to get the reset value until it is
> "safe" to begin issuing GL commands again.
>
> At some point I'd like to extend this to give slightly finer grained
> mechanism so that a context could be told that everything after a
> particular GL sync (fence) operation was lost.  This could prevent some
> applications from having to destroy and rebuild their context.  This
> isn't a requirement, but it's an idea that I've been mulling.
>
> Each time a reset occurs, an internal count is incremented.  This
> associates a unique integer, starting with 1, with each reset event.
> Each context affected by the reset will have the reset event ID stored
> in one of its three guilt levels.  An ioctl will be provided that returns
> the following data for all contexts associated with a particular fd.
>
> In addition, it will return the index of any reset operation that is
> still in progress.
>
> I think this should be sufficient information for user space to meet all
> of the requirements.  I had a conversation with Paul and Ken about this.
>   Paul was pretty happy with the idea.  Ken felt like too little policy
> was in the kernel, and the resulting interface was too heavy (I'm
> paraphrasing).
>
> struct drm_context_reset_counts {

Some of the Radeon guys on #dri-devel already told me these are the 
wrong prefixes for something that's not a shared DRM interface.  I guess 
drm_i915_gem is the correct prefix?  It's been a long time since my last 
kernel work. :)

>      __u32 ctx_id;
>
>      /**
>           * Index of the most recent reset where this context was
>       * guilty.  Zero if none.
>           */
>      __u32 guilty;
>
>      /**
>           * Index of the most recent reset where this context was
>       * not guilty.  Zero if none.
>           */
>      __u32 not_guilty;
>
>      /**
>           * Index of the most recent reset where guilt was unknown.
>       * Zero if none.
>           */
>      __u32 unknown_guilt;
> };
>
> struct drm_reset_counts {
>      /** Index of the in-progress reset.  Zero if none. */
>      unsigned reset_index_in_progress;
>
>      /** Number of contexts. */
>      __u32 num_contexts;
>
>      struct drm_context_reset_counts contexts[0];
> };


* Re: [RFC] GPU reset notification interface
  2012-07-17 22:16 [RFC] GPU reset notification interface Ian Romanick
  2012-07-18  1:57 ` Ian Romanick
@ 2012-07-18  7:06 ` Chris Wilson
  2012-07-18  9:20 ` Daniel Vetter
  2 siblings, 0 replies; 10+ messages in thread
From: Chris Wilson @ 2012-07-18  7:06 UTC (permalink / raw)
  To: Ian Romanick, intel-gfx

On Tue, 17 Jul 2012 15:16:19 -0700, Ian Romanick <idr@freedesktop.org> wrote:
> I'm getting ready to implement the reset notification part of 
> GL_ARB_robustness in the i965 driver.  There are a bunch of quirky bits 
> of the extension that are causing some grief in designing the kernel / 
> user interface.  I think I've settled on an interface that should meet 
> all of the requirements, but I want to bounce it off people before I 
> start writing code.

What's the rationale for reset-in-progress? The actual delay between a
hang being noticed and the reset worker kicked off is going to be
very, very small -- in theory there is no window of opportunity for
another client to sneak in and perform more work.

If we replace 'index' with breadcrumb, I'm much happier. Then you
have

struct drm_i915_reset_counts {
	u32 num_resets;
	u32 last_reset; /* using the seqno breadcrumb */
	u32 flags; /* 0x1 if in progress */

	u32 num_contexts;
	struct drm_i915_context_reset_counts contexts[0];
};

struct drm_i915_context_reset_counts {
	u32 ctx_id; /* per-fd namespace */

	u32 num_resets;
	u32 last_reset; // seqno

	u32 num_guilty;
	u32 last_guilty; // seqno
/*
	u32 num_innocent;
	u32 last_innocent;

	u32 num_unproven;
	u32 last_unproven;
*/
};

So what do the terms mean as far as we are aware?

guilty - this context owned the batch we determine the GPU is executing

innocent - this context owns buffers that were on the active list at the
           time of execution and presumed trashed with the reset, i.e.
           the context itself was on the active list.

unproven - a hang occurred in the driver without any indication of a
           guilty party (e.g. a semaphore hang due to bad olr), and
           this context was on the active list.

If we only increment the reset counter for a context if it is active at
the time of reset, then
  num_resets = num_guilty + num_innocent + num_unproven;
and we can drop num_unproven. Since there is then no difference in
effect between unproven/innocence, we need only report the last time it
was found guilty and the total number of reset events involving this
context.

Similarly, the per-fd global counter is only incremented if one of its
contexts was active at the time of an event. The presumption being that
userspace does the broadcast across shared buffers/contexts to other
processes and updates the shared buffers as necessary.

We only keep 'monotonically' increasing state, so userspace is
responsible for computing deltas to catch events.
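
A sketch of that delta bookkeeping (the snapshot struct and helper are made
up here; the counter fields are the ones proposed above).  Userspace keeps a
private copy of the counters it last saw and compares on every poll:

#include <stdint.h>

struct reset_snapshot {
	uint32_t num_resets;
	uint32_t num_guilty;
};

/* Returns nonzero if the context suffered a reset since the previous poll;
 * *guilty says whether it was blamed at least once in that interval. */
static int
context_reset_since(const struct drm_i915_context_reset_counts *cur,
                    struct reset_snapshot *prev, int *guilty)
{
	int reset = cur->num_resets != prev->num_resets;

	*guilty = cur->num_guilty != prev->num_guilty;

	/* Remember what we have reported; the kernel never rewinds. */
	prev->num_resets = cur->num_resets;
	prev->num_guilty = cur->num_guilty;

	return reset;
}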

For the breadcrumb I'd just use the seqno. It matters not to
userspace that the event breadcrumb jumps by more than one between
events, as it only sees a compressed view anyway (i.e. there may be
external resets that it knows nothing about).

The largest uncertainty for me is the guilty/innocence/unknown
counters and the value of providing both innocent and unproven. Then
there is the question whether we track/report resets that are nothing to
do with the fd. And whether contexts or buffers is more meaningful. From
a tracking perspective, contexts wins both for the kernel, and
presumably for userspace. And given a queriable last_write_seqno for
userspace, it can easily convert a context reset seqno into a trash
list.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


* Re: [RFC] GPU reset notification interface
  2012-07-17 22:16 [RFC] GPU reset notification interface Ian Romanick
  2012-07-18  1:57 ` Ian Romanick
  2012-07-18  7:06 ` Chris Wilson
@ 2012-07-18  9:20 ` Daniel Vetter
  2012-07-18 16:23   ` Ian Romanick
  2 siblings, 1 reply; 10+ messages in thread
From: Daniel Vetter @ 2012-07-18  9:20 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx

On Tue, Jul 17, 2012 at 03:16:19PM -0700, Ian Romanick wrote:
> I'm getting ready to implement the reset notification part of
> GL_ARB_robustness in the i965 driver.  There are a bunch of quirky
> bits of the extension that are causing some grief in designing the
> kernel / user interface.  I think I've settled on an interface that
> should meet all of the requirements, but I want to bounce it off
> people before I start writing code.
> 
> Here's the list of requirements.
> 
> - Applications poll for reset status.
> 
> - Contexts that did not lose data or rendering should not receive a
> reset notification.  This isn't strictly a requirement of the spec,
> but it seems like a good practice.  Once an application receives a
> reset notification for a context, it is supposed to destroy that
> context and start over.
> 
> - If one context in an OpenGL share group receives a reset
> notification, all contexts in that share group must receive a reset
> notification.
> 
> - All contexts in a single GL share group will have the same fd.
> This isn't a requirement so much as a simplifying assumption.  All
> contexts in a share group have to be in the same address space, so I
> will assume that means they're all controlled by one DRI driver
> instance with a single fd.
> 
> - The reset notification given to the application should try to
> assign guilt.  There are three values possible: unknown guilt,
> you're not guilty, or you are guilty.
> 
> - If there are multiple resets between polls, the application should
> get the "most guilty" answer.  In other words, if there are two
> resets and the context was guilty for one and not the other, the
> application should get the guilty notification.
> 
> - After the application polls the status, the status should revert
> to "no reset occurred."
> 
> - If the application polls the status and the reset operation is
> still in progress, it should continue to get the reset value until
> it is "safe" to begin issuing GL commands again.
> 
> At some point I'd like to extend this to give slightly finer grained
> mechanism so that a context could be told that everything after a
> particular GL sync (fence) operation was lost.  This could prevent
> some applications from having to destroy and rebuild their context.
> This isn't a requirement, but it's an idea that I've been mulling.
> 
> Each time a reset occurs, an internal count is incremented.  This
> associates a unique integer, starting with 1, with each reset event.
> Each context affected by the reset will have the reset event ID
> stored in one of its three guilt levels.  An ioctl will be provided
> that returns the following data for all contexts associated with a
> particular fd.
> 
> In addition, it will return the index of any reset operation that is
> still in progress.
> 
> I think this should be sufficient information for user space to meet
> all of the requirements.  I had a conversation with Paul and Ken
> about this.  Paul was pretty happy with the idea.  Ken felt like too
> little policy was in the kernel, and the resulting interface was too
> heavy (I'm paraphrasing).

A few things:
- I agree with Chris that reset_in_progress should go, if userspace can
  sneak in and witness a reset event, we have a bug in the kernel. Since
  very recently, we actually have a few bugs less in that area ;-)

- The "all contexts in a share group need to receive a reset notification"
  wording is irking me a bit because we currently only track all the
  actively used things. So if another context in that share group isn't
  affected (i.e. doesn't even use any of the potentially corrupted bos),
  is the idea that the kernel grows the required tracking, or should
  userspace always ask the reset state for all contexts and do the math
  itself?

- The above essentially links in with how we blame the guilt and figure
  out who's affected. Especially when there's no hw context around, e.g.
  on the blitter or bsd rings, or when an object is used by another
  process and gets corrupted there. Does the spec make any guarantees for
  such a case? Or should we just blame an unknown_reset for all the
  contexts that belong to the fd that submitted the bad batch.

As an idea for the above two issues, what about the kernel also keeps a
per-fd reset statistics, which will also be returned with this ioctl?
We'd then restrict the meaning of the ctx fields to only mean "and the
context was actually active". Userspace could then wrap all the per-fd
hang reports into reset_unknown for arb_robustness, but I think this would
be a bit more flexible for other userspace.

struct drm_context_reset_counts {
	__u32 ctx_id;

	/**
	 * Index of the most recent reset where this context was
	 * guilty.  Zero if none.
	 */
	__u32 ctx_guilty;

	/**
	 * Index of the most recent reset where this context was active,
	 * not guilty.  Zero if none.
	 */
	__u32 ctx_not_guilty;

	/**
	 * Index of the most recent reset where this context was active,
	 * but guilt was unknown.  Zero if none.
	 */
	__u32 ctx_unknown_guilt;

	/**
	 * Index of the most recent reset where any batchbuffer submitted
	 * through this fd was guilty.  Zero if none.
	 */
	__u32 fd_guilty;

	/**
	 * Index of the most recent reset where any batchbuffer submitted
	 * through this fd was not guilty, but affected.  Zero if none.
	 */
	__u32 fd_not_guilty;

	/**
	 * Index of the most recent reset where any batchbuffer submitted
	 * through this fd was affected, but no guilt for the hang could
	 * be assigned.  Zero if none.
	 */
	__u32 fd_unknown_guilt;
};

The above could also be important to know when a hw context is toast and
should better not be used any more.
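
To make that split concrete, a hypothetical user-space helper (only the
field names come from the struct above, nothing else is part of the
proposal) that falls back from context-level to fd-level blame and reports
the latter as "unknown" for arb_robustness:

#include <stdint.h>

static int
context_affected(const struct drm_context_reset_counts *c,
                 uint32_t last_seen, int *guilt_unknown)
{
	*guilt_unknown = 0;

	/* Context-level blame: this hw context was actively used. */
	if (c->ctx_guilty > last_seen ||
	    c->ctx_not_guilty > last_seen ||
	    c->ctx_unknown_guilt > last_seen)
		return 1;

	/* fd-level blame only: something else submitted through this fd
	 * (and hence, per the simplifying assumption, something in this
	 * GL share group) was hit. */
	if (c->fd_guilty > last_seen ||
	    c->fd_not_guilty > last_seen ||
	    c->fd_unknown_guilt > last_seen) {
		*guilt_unknown = 1;
		return 1;
	}

	return 0;
}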

- I like Chris' suggestion of using our seqno breadcrumbs for this. If you
  also add a small extension to execbuf to return the seqno of the new
  batch, you should also have all the tools in place to implement your
  extension to notify userspace up to which glFence things have completed.
  Note though that returning the seqno in execbuffer is only correct once
  we've eliminated the flushing_list.

- Last but not least, how often is userspace expected to call this ioctl?
  Around once per batchbuffer submission or way more/less?

Cheers, Daniel
> 
> struct drm_context_reset_counts {
> 	__u32 ctx_id;
> 
> 	/**
>          * Index of the most recent reset where this context was
> 	 * guilty.  Zero if none.
>          */
> 	__u32 guilty;
> 
> 	/**
>          * Index of the most recent reset where this context was
> 	 * not guilty.  Zero if none.
>          */
> 	__u32 not_guilty;
> 
> 	/**
>          * Index of the most recent reset where guilt was unknown.
> 	 * Zero if none.
>          */
> 	__u32 unknown_guilt;
> };
> 
> struct drm_reset_counts {
> 	/** Index of the in-progress reset.  Zero if none. */
> 	unsigned reset_index_in_progress;
> 
> 	/** Number of contexts. */
> 	__u32 num_contexts;
> 
> 	struct drm_context_reset_counts contexts[0];
> };

-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


* Re: [RFC] GPU reset notification interface
  2012-07-18  9:20 ` Daniel Vetter
@ 2012-07-18 16:23   ` Ian Romanick
  2012-07-18 16:36     ` Adam Jackson
  2012-07-18 16:55     ` Daniel Vetter
  0 siblings, 2 replies; 10+ messages in thread
From: Ian Romanick @ 2012-07-18 16:23 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On 07/18/2012 02:20 AM, Daniel Vetter wrote:
> On Tue, Jul 17, 2012 at 03:16:19PM -0700, Ian Romanick wrote:
>> I'm getting ready to implement the reset notification part of
>> GL_ARB_robustness in the i965 driver.  There are a bunch of quirky
>> bits of the extension that are causing some grief in designing the
>> kernel / user interface.  I think I've settled on an interface that
>> should meet all of the requirements, but I want to bounce it off
>> people before I start writing code.
>>
>> Here's the list of requirements.
>>
>> - Applications poll for reset status.
>>
>> - Contexts that did not lose data or rendering should not receive a
>> reset notification.  This isn't strictly a requirement of the spec,
>> but it seems like a good practice.  Once an application receives a
>> reset notification for a context, it is supposed to destroy that
>> context and start over.
>>
>> - If one context in an OpenGL share group receives a reset
>> notification, all contexts in that share group must receive a reset
>> notification.
>>
>> - All contexts in a single GL share group will have the same fd.
>> This isn't a requirement so much as a simplifying assumption.  All
>> contexts in a share group have to be in the same address space, so I
>> will assume that means they're all controlled by one DRI driver
>> instance with a single fd.
>>
>> - The reset notification given to the application should try to
>> assign guilt.  There are three values possible: unknown guilt,
>> you're not guilty, or you are guilty.
>>
>> - If there are multiple resets between polls, the application should
>> get the "most guilty" answer.  In other words, if there are two
>> resets and the context was guilty for one and not the other, the
>> application should get the guilty notification.
>>
>> - After the application polls the status, the status should revert
>> to "no reset occurred."
>>
>> - If the application polls the status and the reset operation is
>> still in progress, it should continue to get the reset value until
>> it is "safe" to begin issuing GL commands again.
>>
>> At some point I'd like to extend this to give slightly finer grained
>> mechanism so that a context could be told that everything after a
>> particular GL sync (fence) operation was lost.  This could prevent
>> some applications from having to destroy and rebuild their context.
>> This isn't a requirement, but it's an idea that I've been mulling.
>>
>> Each time a reset occurs, an internal count is incremented.  This
>> associates a unique integer, starting with 1, with each reset event.
>> Each context affected by the reset will have the reset event ID
>> stored in one of its three guilt levels.  An ioctl will be provided
>> that returns the following data for all contexts associated with a
>> particular fd.
>>
>> In addition, it will return the index of any reset operation that is
>> still in progress.
>>
>> I think this should be sufficient information for user space to meet
>> all of the requirements.  I had a conversation with Paul and Ken
>> about this.  Paul was pretty happy with the idea.  Ken felt like too
>> little policy was in the kernel, and the resulting interface was too
>> heavy (I'm paraphrasing).
>
> A few things:
> - I agree with Chris that reset_in_progress should go, if userspace can
>    sneak in and witness a reset event, we have a bug in the kernel. Since
>    very recently, we actually have a few bugs less in that area ;-)

I'm operating under the assumption that, from user space's perspective, 
resets are not instantaneous.  If resets are instantaneous, that may 
change things.

I had envisioned two potential uses for reset_in_progress, but I've 
managed to talk myself out of both.

> - The "all contexts in a share group need to receive a reset notification"
>    wording is irking me a bit because we currently only track all the
>    actively used things. So if another context in that share group isn't
>    affected (i.e. doesn't even use any of the potentially corrupted bos),
>    is the idea that the kernel grows the required tracking, or should
>    userspace always ask the reset state for all contexts and do the math
>    itself?

There are a couple reasons that all contexts in a share group need to 
get the reset notification.  Consider an application with two contexts, 
A and B.  Context A is a worker context that does a bunch of 
render-to-texture operations, and context B will eventually consume 
those textures.  If context A receives a reset, context B, even if it 
hasn't done any rendering in five minutes, has lost data.

The kernel should never have any knowledge about GL share groups.  This 
is where my simplifying assumption (that all contexts in a share group 
share an address space and an fd) comes in handy.  If context A queries 
the reset status from the kernel first, it can reach over and poke the 
reset status of context B (in the gl_context structure).  Likewise, if 
context B queries the kernel first, it can see that another kernel 
context in its GL context share group got reset.
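
A sketch of that propagation (the struct and field names are hypothetical
stand-ins for whatever the DRI driver already tracks; the kernel sees none
of this):

struct gl_context_sketch {
	struct gl_context_sketch *next_in_share_group;
	int reset_pending;	/* not yet reported to the application */
};

/* Called by whichever context happens to query the kernel first and
 * learns that a reset hit its share group. */
static void
broadcast_reset(struct gl_context_sketch *share_group_head)
{
	struct gl_context_sketch *ctx;

	/* Every context in the share group lost data, even one that has
	 * not rendered in five minutes (e.g. the consumer of a
	 * render-to-texture result from the reset context). */
	for (ctx = share_group_head; ctx; ctx = ctx->next_in_share_group)
		ctx->reset_pending = 1;
}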

> - The above essentially links in with how we blame the guilt and figure
>    out who's affected. Especially when there's no hw context around, e.g.
>    on the blitter or bsd rings, or when an object is used by another
>    process and gets corrupted there. Does the spec make any guarantees for
>    such a case? Or should we just blame an unknown_reset for all the
>    contexts that belong to the fd that submitted the bad batch.

That sounds right.  If we can't assess innocence or guilt, we should say 
guilt is unknown.  There are some GL commands that generate blits, 
but I don't think there's anything that can get commands on the BSD 
ring.  That's just for media, right?

> As an idea for the above two issues, what about the kernel also keeps a
> per-fd reset statistics, which will also be returned with this ioctl?
> We'd then restrict the meaning of the ctx fields to only mean "and the
> context was actually active". Userspace could then wrap all the per-fd
> hang reports into reset_unknown for arb_robustness, but I think this would
> be a bit more flexible for other userspace.

Ah, right.  So the DDX or libva could detect resets that affect them. 
That's reasonable.

> struct drm_context_reset_counts {
> 	__u32 ctx_id;
>
> 	/**
>           * Index of the most recent reset where this context was
> 	 * guilty.  Zero if none.
>           */
> 	__u32 ctx_guilty;
>
> 	/**
>           * Index of the most recent reset where this context was active,
> 	 * not guilty.  Zero if none.
>           */
> 	__u32 ctx_not_guilty;
>
> 	/**
>           * Index of the most recent reset where this context was active,
> 	 * but guilt was unknown. Zero if none.
>           */
> 	__u32 ctx_unknown_guilt;
>
> 	/**
>           * Index of the most recent reset where any batchbuffer submitted
> 	 * through this fd was guilty.  Zero if none.
>           */
> 	__u32 fd_guilty;
>
> 	/**
>           * Index of the most recent reset where any batchbuffer submitted
> 	 * through this fd was not guilty, but affected.  Zero if none.
>           */
> 	__u32 fd_not_guilty;
>
> 	/**
>           * Index of the most recent reset where any batchbuffer submitted
> 	 * through this fd was affected, but no guilt for the hang could
> 	 * be assigned.  Zero if none.
>           */
> 	__u32 fd_unknown_guilt;

Since these three fields are per-fd, shouldn't they go in the proposed 
drm_reset_counts structure instead?  If we do that, it might be better 
to split into two queries.  One for the per-fd information, and one for 
the detailed per-context information.  If we expect the common case to 
be no-reset, user space could

> };
>
> The above could also be important to know when a hw context is toast and
> should better not be used any more.
>
> - I like Chris' suggestion of using our seqno breadcrumbs for this. If you
>    also add a small extension to execbuf to return the seqno of the new
>    batch, you should also have all the tools in place to implement your
>    extension to notify userspace up to which glFence things have completed.
>    Note though that returning the seqno in execbuffer is only correct once
>    we've eliminated the flushing_list.

It's possible, but I need to finish working out that idea (see below). 
I think the context only needs one seqno, not one per guiltiness level. 
  "This is the last seqno that was retired before this context lost some 
data."

That may still leave the context in a weird state.  Think about this 
timeline:

     Application      GPU
     draw A
                      submit A
     draw B
                      submit B
                      drawing A completes
                      reset (B is lost)
     draw C
                      submit C
                      drawing C completes
     query reset

If / when we implement this feature, the kernel may need to drop any 
work submitted between a reset and an ack of the reset.  Dunno.
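
For what it's worth, a sketch of how that single "last seqno retired before
the reset" value could be used, assuming a hypothetical per-bo
last_write_seqno; seqno wraparound is ignored for brevity:

#include <stdint.h>

struct bo_sketch {
	uint32_t last_write_seqno;	/* hypothetical, queried or cached */
	int contents_lost;
};

/* last_retired is "the last seqno that was retired before this context
 * lost some data", as the future query would return it. */
static void
mark_trashed(struct bo_sketch **bos, unsigned count, uint32_t last_retired)
{
	unsigned i;

	for (i = 0; i < count; i++) {
		/* Writes ordered after the last retired seqno may never
		 * have reached memory. */
		if (bos[i]->last_write_seqno > last_retired)
			bos[i]->contents_lost = 1;
	}
}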

> - Last but not least, how often is userspace expected to call this ioctl?
>    Around once per batchbuffer submission or way more/less?

I don't know.  This is a fairly new extension, and there are few users. 
  As far as I'm aware, only Chrome and Firefox use it.  I can find out 
some details from them.  My guess is somewhere between once per frame 
and once per draw call.

> Cheers, Daniel
>>
>> struct drm_context_reset_counts {
>> 	__u32 ctx_id;
>>
>> 	/**
>>           * Index of the most recent reset where this context was
>> 	 * guilty.  Zero if none.
>>           */
>> 	__u32 guilty;
>>
>> 	/**
>>           * Index of the most recent reset where this context was
>> 	 * not guilty.  Zero if none.
>>           */
>> 	__u32 not_guilty;
>>
>> 	/**
>>           * Index of the most recent reset where guilt was unknown.
>> 	 * Zero if none.
>>           */
>> 	__u32 unknown_guilt;
>> };
>>
>> struct drm_reset_counts {
>> 	/** Index of the in-progress reset.  Zero if none. */
>> 	unsigned reset_index_in_progress;
>>
>> 	/** Number of contexts. */
>> 	__u32 num_contexts;
>>
>> 	struct drm_context_reset_counts contexts[0];
>> };


* Re: [RFC] GPU reset notification interface
  2012-07-18 16:23   ` Ian Romanick
@ 2012-07-18 16:36     ` Adam Jackson
  2012-07-18 16:42       ` Daniel Vetter
  2012-07-18 16:55     ` Daniel Vetter
  1 sibling, 1 reply; 10+ messages in thread
From: Adam Jackson @ 2012-07-18 16:36 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx



On Wed, 2012-07-18 at 09:23 -0700, Ian Romanick wrote:
> On 07/18/2012 02:20 AM, Daniel Vetter wrote:
> > A few things:
> > - I agree with Chris that reset_in_progress should go, if userspace can
> >    sneak in and witness a reset event, we have a bug in the kernel. Since
> >    very recently, we actually have a few bugs less in that area ;-)
> 
> I'm operating under the assumption that, from user space's perspective, 
> resets are not instantaneous.  If resets are instantaneous, that may 
> change things.

Do they need to be instantaneous, or do they merely need to be
atomic-and-reasonably-quick?  Could just block new ioctl submission
until the reset completes.

- ajax



* Re: [RFC] GPU reset notification interface
  2012-07-18 16:36     ` Adam Jackson
@ 2012-07-18 16:42       ` Daniel Vetter
  0 siblings, 0 replies; 10+ messages in thread
From: Daniel Vetter @ 2012-07-18 16:42 UTC (permalink / raw)
  To: Adam Jackson; +Cc: intel-gfx

On Wed, Jul 18, 2012 at 6:36 PM, Adam Jackson <ajax@redhat.com> wrote:
> On Wed, 2012-07-18 at 09:23 -0700, Ian Romanick wrote:
>> On 07/18/2012 02:20 AM, Daniel Vetter wrote:
>> > A few things:
>> > - I agree with Chris that reset_in_progress should go, if userspace can
>> >    sneak in and witness a reset event, we have a bug in the kernel. Since
>> >    very recently, we actually have a few bugs less in that area ;-)
>>
>> I'm operating under the assumption that, from user space's perspective,
>> resets are not instantaneous.  If resets are instantaneous, that may
>> change things.
>
> Do they need to be instantaneous, or do they merely need to be
> atomic-and-reasonably-quick?  Could just block new ioctl submission
> until the reset completes.

That's actually what we already do, that "block ioctls until reset
completes thing" ;-)
-Daniel
-- 
Daniel Vetter
daniel.vetter@ffwll.ch - +41 (0) 79 364 57 48 - http://blog.ffwll.ch


* Re: [RFC] GPU reset notification interface
  2012-07-18 16:23   ` Ian Romanick
  2012-07-18 16:36     ` Adam Jackson
@ 2012-07-18 16:55     ` Daniel Vetter
  2012-07-19  3:47       ` Ben Widawsky
  2012-07-19 15:58       ` Ian Romanick
  1 sibling, 2 replies; 10+ messages in thread
From: Daniel Vetter @ 2012-07-18 16:55 UTC (permalink / raw)
  To: Ian Romanick; +Cc: intel-gfx

On Wed, Jul 18, 2012 at 09:23:46AM -0700, Ian Romanick wrote:
> On 07/18/2012 02:20 AM, Daniel Vetter wrote:
> >- The "all contexts in a share group need to receive a reset notification"
> >   wording is irking me a bit because we currently only track all the
> >   actively used things. So if another context in that share group isn't
> >   affected (i.e. doesn't even use any of the potentially corrupted bos),
> >   is the idea that the kernel grows the required tracking, or should
> >   userspace always ask the reset state for all contexts and do the math
> >   itself?
> 
> There are a couple reasons that all contexts in a share group need
> to get the reset notification.  Consider an application with two
> contexts, A and B.  Context A is a worker context that does a bunch
> of render-to-texture operations, and context B will eventually
> consume those textures.  If context A receives a reset, context B,
> even if it hasn't done any rendering in five minutes, has lost data.
> 
> The kernel should never have any knowledge about GL share groups.
> This is where my simplifying assumption (that all contexts in a
> share group share an address space and an fd) comes in handy.  If
> context A queries the reset status from the kernel first, it can
> reach over and poke the reset status of context B (in the gl_context
> structure).  Likewise, if context B queries the kernel first, it can
> see that another kernel context in its GL context share group got
> reset.
> 
> >- The above essentially links in with how we blame the guilt and figure
> >   out who's affected. Especially when there's no hw context around, e.g.
> >   on the blitter or bsd rings, or when an object is used by another
> >   process and gets corrupted there. Does the spec make any guarantees for
> >   such a case? Or should we just blame an unknown_reset for all the
> >   contexts that belong to the fd that submitted the bad batch.
> 
> That sounds right.  If we can't assess innocence or guilt, we should
> say guilt is unknown.  There are some GL commands that generate
> blits, but I don't think there's anything that can get commands on
> the BSD ring.  That's just for media, right?
> 
> >As an idea for the above two issues, what about the kernel also keeps a
> >per-fd reset statistics, which will also be returned with this ioctl?
> >We'd then restrict the meaning of the ctx fields to only mean "and the
> >context was actually active". Userspace could then wrap all the per-fd
> >hang reports into reset_unknown for arb_robustness, but I think this would
> >be a bit more flexible for other userspace.
> 
> Ah, right.  So the DDX or libva could detect resets that affect
> them. That's reasonable.
> 
> >struct drm_context_reset_counts {
> >	__u32 ctx_id;
> >
> >	/**
> >          * Index of the most recent reset where this context was
> >	 * guilty.  Zero if none.
> >          */
> >	__u32 ctx_guilty;
> >
> >	/**
> >          * Index of the most recent reset where this context was active,
> >	 * not guilty.  Zero if none.
> >          */
> >	__u32 ctx_not_guilty;
> >
> >	/**
> >          * Index of the most recent reset where this context was active,
> >	 * but guilt was unknown. Zero if none.
> >          */
> >	__u32 ctx_unknown_guilt;
> >
> >	/**
> >          * Index of the most recent reset where any batchbuffer submitted
> >	 * through this fd was guilty.  Zero if none.
> >          */
> >	__u32 fd_guilty;
> >
> >	/**
> >          * Index of the most recent reset where any batchbuffer submitted
> >	 * through this fd was not guilty, but affected.  Zero if none.
> >          */
> >	__u32 fd_not_guilty;
> >
> >	/**
> >          * Index of the most recent reset where any batchbuffer submitted
> >	 * through this fd was affected, but no guilt for the hang could
> >	 * be assigned.  Zero if none.
> >          */
> >	__u32 fd_unknown_guilt;
> 
> Since these three fields are per-fd, shouldn't they go in the
> proposed drm_reset_counts structure instead?  If we do that, it
> might be better to split into two queries.  One for the per-fd
> information, and one for the detailed per-context information.  If
> we expect the common case to be no-reset, user space could

Oh, I've missed a bit that your ioctl would read out the state for all
contexts on a given fd. My idea was that this is all the data that the
ioctl returns (and it would only fill out the ctx fields if you spec a
nonzero context). Userspace would then only need to check whether the
context has a new reset number somewhere (= context suffered from a gpu
reset while it was actively used) and then also check whether the fd
suffered from a reset (= something in that gl share group is potentially
corrupted), mapping it to either victim_reset or unknown_reset (depending
upon what exactly the spec wants).

Less complexity in userspace, and imo not too much fuss for the kernel to
do the book-keeping: Instead of only blaming contexts, we also blame
file_privs. That would be a generally useful feature to e.g. cut off
certain fds selectively from the gpu because they hang it too fast - i.e.
to avoid the dreaded global "hanging too fast" when developing drivers ;-)

> >};
> >
> >The above could also be important to know when a hw context is toast and
> >should better not be used any more.
> >
> >- I like Chris' suggestion of using our seqno breadcrumbs for this. If you
> >   also add a small extension to execbuf to return the seqno of the new
> >   batch, you should also have all the tools in place to implement your
> >   extension to notify userspace up to which glFence things have completed.
> >   Note though that returning the seqno in execbuffer is only correct once
> >   we've eliminated the flushing_list.
> 
> It's possible, but I need to finish working out that idea (see
> below). I think the context only needs one seqno, not one per
> guiltiness level.  "This is the last seqno that was retired before
> this context lost some data."

The idea was to use the existing seqno as the reset counter in your ioctl.
It might wrap around, but you need to emit quite a few commands to get to
that point ;-) So we could use that right away.

> That may still leave the context in a weird state.  Think about this
> timeline:
> 
>     Application      GPU
>     draw A
>                      submit A
>     draw B
>                      submit B
>                      drawing A completes
>                      reset (B is lost)
>     draw C
> 	             submit C
>                      drawing C completes
>     query reset
> 
> If / when we implement this feature, the kernel may need to drop any
> work submitted between a reset and an ack of the reset.  Dunno.

Well, we could simply add a robust_context flag, set at create time. If
such a context suffers from a gpu reset (while it's active) the kernel
could refuse to queue up more batches with -EIO. Userspace could then
inquire the kernel with your ioctl about what's going on. Similarly we
could add such a flag to the fd, to catch everyone else not using hw
contexts. There we need a way to clear that flag though, so that you can
keep on using all the bo textures that survived (whereas hw contexts are
pretty much worthless on their own after reset, so we could demand that
they be destroyed).
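
A kernel-side sketch of that idea; the flag name, the struct and the hook
point are all invented here, the only point being that a context which
opted in at create time gets -EIO from execbuffer once it has been reset:

#include <linux/errno.h>
#include <linux/types.h>

#define I915_CONTEXT_CREATE_ROBUST	(1 << 0)	/* hypothetical flag */

struct i915_hw_context_sketch {
	u32 flags;			/* from context create */
	bool banned_after_reset;	/* set by the reset handler */
};

/* Checked in the execbuffer path before queueing a new batch. */
static int check_robust_context(const struct i915_hw_context_sketch *ctx)
{
	if ((ctx->flags & I915_CONTEXT_CREATE_ROBUST) &&
	    ctx->banned_after_reset)
		return -EIO;	/* userspace then asks the reset ioctl why */

	return 0;
}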

> >- Last but not least, how often is userspace expected to call this ioctl?
> >   Around once per batchbuffer submission or way more/less?
> 
> I don't know.  This is a fairly new extension, and there are few
> users.  As far as I'm aware, only Chrome and Firefox use it.  I can
> find out some details from them.  My guess is somewhere between once
> per frame and once per draw call.

Cheers, Daniel
-- 
Daniel Vetter
Mail: daniel@ffwll.ch
Mobile: +41 (0)79 365 57 48


* Re: [RFC] GPU reset notification interface
  2012-07-18 16:55     ` Daniel Vetter
@ 2012-07-19  3:47       ` Ben Widawsky
  2012-07-19 15:58       ` Ian Romanick
  1 sibling, 0 replies; 10+ messages in thread
From: Ben Widawsky @ 2012-07-19  3:47 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On Wed, 18 Jul 2012 18:55:08 +0200
Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Jul 18, 2012 at 09:23:46AM -0700, Ian Romanick wrote:
> > On 07/18/2012 02:20 AM, Daniel Vetter wrote:
> > >- The "all contexts in a share group need to receive a reset notification"
> > >   wording is irking me a bit because we currently only track all the
> > >   actively used things. So if another context in that share group isn't
> > >   affected (i.e. doesn't even use any of the potentially corrupted bos),
> > >   is the idea that the kernel grows the required tracking, or should
> > >   userspace always ask the reset state for all contexts and do the math
> > >   itself?
> > 
> > There are a couple reasons that all contexts in a share group need
> > to get the reset notification.  Consider an application with two
> > contexts, A and B.  Context A is a worker context that does a bunch
> > of render-to-texture operations, and context B will eventually
> > consume those textures.  If context A receives a reset, context B,
> > even if it hasn't done any rendering in five minutes, has lost data.
> > 
> > The kernel should never have any knowledge about GL share groups.
> > This is where my simplifying assumption (that all contexts in a
> > share group share an address space and an fd) comes in handy.  If
> > context A queries the reset status from the kernel first, it can
> > reach over and poke the reset status of context B (in the gl_context
> > structure).  Likewise, if context B queries the kernel first, it can
> > see that another kernel context in its GL context share group got
> > reset.
> > 
> > >- The above essentially links in with how we blame the guilt and figure
> > >   out who's affected. Especially when there's no hw context around, e.g.
> > >   on the blitter or bsd rings, or when an object is used by another
> > >   process and gets corrupted there. Does the spec make any guarantees for
> > >   such a case? Or should we just blame an unknown_reset for all the
> > >   contexts that belong to the fd that submitted the bad batch.
> > 
> > That sounds right.  If we can't assess innocence or guilt, we should
> > say guilt is unknown.  There are some GL commands that generate
> > blits, but I don't think there's anything that can get commands on
> > the BSD ring.  That's just for media, right?
> > 
> > >As an idea for the above two issues, what about the kernel also keeps a
> > >per-fd reset statistics, which will also be returned with this ioctl?
> > >We'd then restrict the meaning of the ctx fields to only mean "and the
> > >context was actually active". Userspace could then wrap all the per-fd
> > >hang reports into reset_unknown for arb_robustness, but I think this would
> > >be a bit more flexible for other userspace.
> > 
> > Ah, right.  So the DDX or libva could detect resets that affect
> > them. That's reasonable.
> > 
> > >struct drm_context_reset_counts {
> > >	__u32 ctx_id;
> > >
> > >	/**
> > >          * Index of the most recent reset where this context was
> > >	 * guilty.  Zero if none.
> > >          */
> > >	__u32 ctx_guilty;
> > >
> > >	/**
> > >          * Index of the most recent reset where this context was active,
> > >	 * not guilty.  Zero if none.
> > >          */
> > >	__u32 ctx_not_guilty;
> > >
> > >	/**
> > >          * Index of the most recent reset where this context was active,
> > >	 * but guilt was unknown. Zero if none.
> > >          */
> > >	__u32 ctx_unknown_guilt;
> > >
> > >	/**
> > >          * Index of the most recent reset where any batchbuffer submitted
> > >	 * through this fd was guilty.  Zero if none.
> > >          */
> > >	__u32 fd_guilty;
> > >
> > >	/**
> > >          * Index of the most recent reset where any batchbuffer submitted
> > >	 * through this fd was not guilty, but affected.  Zero if none.
> > >          */
> > >	__u32 fd_not_guilty;
> > >
> > >	/**
> > >          * Index of the most recent reset where any batchbuffer submitted
> > >	 * through this fd was affected, but no guilt for the hang could
> > >	 * be assigned.  Zero if none.
> > >          */
> > >	__u32 fd_unknown_guilt;
> > 
> > Since these three fields are per-fd, shouldn't they go in the
> > proposed drm_reset_counts structure instead?  If we do that, it
> > might be better to split into two queries.  One for the per-fd
> > information, and one for the detailed per-context information.  If
> > we expect the common case to be no-reset, user space could
> 
> Oh, I've missed a bit that your ioctl would read out the state for all
> contexts on a given fd. My idea was that this is all the data that the
> ioctl returns (and it would only fill out the ctx fields if you spec a
> nonzero context). Userspace would then only need to check whether the
> context has a new reset number somewhere (= context suffered from a gpu
> reset while it was actively used) and then also check whether the fd
> suffered from a reset (= something in that gl share group is potentially
> corrupted), mapping it to either victim_reset or unknown_reset (depending
> upon what exactly the spec wants).
> 
> Less complexity in userspace, and imo not too much fuss for the kernel to
> do the book-keeping: Instead of only blaming contexts, we also blame
> file_privs. That would be a generally useful feature to e.g. cut off
> certain fds selectively from the gpu because they hang it too fast - i.e.
> to avoid the dreaded global "hanging too fast" when developing drivers ;-)

I don't think it's bad to know the fd, but I think it will help us
overall if we switch to a more abstract context notion, and try really
hard to blame that. The best way to do it is to enforce it in the API.
If a gpu client isn't using a context, then they get the shitty answer (I
don't know wtf happened) from this IOCTL.  A bit more on it below.

> 
> > >};
> > >
> > >The above could also be important to know when a hw context is toast and
> > >should better not be used any more.
> > >
> > >- I like Chris' suggestion of using our seqno breadcrumbs for this. If you
> > >   also add a small extension to execbuf to return the seqno of the new
> > >   batch, you should also have all the tools in place to implement your
> > >   extension to notify userspace up to which glFence things have completed.
> > >   Note though that returning the seqno in execbuffer is only correct once
> > >   we've eliminated the flushing_list.
> > 
> > It's possible, but I need to finish working out that idea (see
> > below). I think the context only needs one seqno, not one per
> > guiltiness level.  "This is the last seqno that was retired before
> > this context lost some data."
> 
> The idea was to use the existing seqno as the reset counter in your ioctl.
> It might wrap around, but you need to emit quite a few commands to get to
> that point ;-) So we could use that right away.
> 
> > That may still leave the context in a weird state.  Think about this
> > timeline:
> > 
> >     Application      GPU
> >     draw A
> >                      submit A
> >     draw B
> >                      submit B
> >                      drawing A completes
> >                      reset (B is lost)
> >     draw C
> > 	             submit C
> >                      drawing C completes
> >     query reset
> > 
> > If / when we implement this feature, the kernel may need to drop any
> > work submitted between a reset and an ack of the reset.  Dunno.
> 
> Well, we could simply add a robust_context flag, set at create time. If
> such a context suffers from a gpu reset (while it's active) the kernel
> could refuse to queue up more batches with -EIO. Userspace could then
> inquire the kernel with your ioctl about what's going on. Similarly we
> could add such a flag to the fd, to catch everyone else not using hw
> contexts. There we need a way to clear that flag though, so that you can
> keep on using all the bo textures that survived (whereas hw contexts are
> pretty much worthless on their own after reset, so we could demand that
> they be destroyed).

First off, I don't think hw contexts are worthless after a reset. The
unaffected contexts would still be plenty valuable. Even the
context of the hung context is still valid, just the state of the
hung batch won't be present.  I also think you need to stop considering
userspace not using HW contexts.  Not because it's awesome, but
because there is no downside to using contexts. (I would have loved to
have an abstract context for all rings, but you shot that down pretty
quickly earlier even though I think it was originally your idea). I
also think there is never a case where we want to prevent submissions
for an fd when we can simply prevent submissions for a context. The
chrome/firefox examples you mention below are a perfect example.

In short, I don't understand why we wouldn't want a context over an fd,
it's really not a lot more work to track.

> 
> > >- Last but not least, how often is userspace expected to call this ioctl?
> > >   Around once per batchbuffer submission or way more/less?
> > 
> > I don't know.  This is a fairly new extension, and there are few
> > users.  As far as I'm aware, only Chrome and Firefox use it.  I can
> > find out some details from them.  My guess is somewhere between once
> > per frame and once per draw call.
> 
> Cheers, Daniel

-- 
Ben Widawsky, Intel Open Source Technology Center


* Re: [RFC] GPU reset notification interface
  2012-07-18 16:55     ` Daniel Vetter
  2012-07-19  3:47       ` Ben Widawsky
@ 2012-07-19 15:58       ` Ian Romanick
  1 sibling, 0 replies; 10+ messages in thread
From: Ian Romanick @ 2012-07-19 15:58 UTC (permalink / raw)
  To: Daniel Vetter; +Cc: intel-gfx

On 07/18/2012 09:55 AM, Daniel Vetter wrote:
> On Wed, Jul 18, 2012 at 09:23:46AM -0700, Ian Romanick wrote:
>> On 07/18/2012 02:20 AM, Daniel Vetter wrote:
>>> - The "all contexts in a share group need to receive a reset notification"
>>>    wording is irking me a bit because we currently only track all the
>>>    actively used things. So if another context in that share group isn't
>>>    affected (i.e. doesn't even use any of the potentially corrupted bos),
>>>    is the idea that the kernel grows the required tracking, or should
>>>    userspace always ask the reset state for all contexts and do the math
>>>    itself?
>>
>> There are a couple reasons that all contexts in a share group need
>> to get the reset notification.  Consider an application with two
>> contexts, A and B.  Context A is a worker context that does a bunch
>> of render-to-texture operations, and context B will eventually
>> consume those textures.  If context A receives a reset, context B,
>> even if it hasn't done any rendering in five minutes, has lost data.
>>
>> The kernel should never have any knowledge about GL share groups.
>> This is where my simplifying assumption (that all contexts in a
>> share group share an address space and an fd) comes in handy.  If
>> context A queries the reset status from the kernel first, it can
>> reach over and poke the reset status of context B (in the gl_context
>> structure).  Likewise, if context B queries the kernel first, it can
>> see that another kernel context in its GL context share group got
>> reset.
>>
>>> - The above essentially links in with how we blame the guilt and figure
>>>    out who's affected. Especially when there's no hw context around, e.g.
>>>    on the blitter or bsd rings, or when an object is used by another
>>>    process and gets corrupted there. Does the spec make any guarantees for
>>>    such a case? Or should we just blame an unknown_reset for all the
>>>    contexts that belong to the fd that submitted the bad batch.
>>
>> That sounds right.  If we can't assess innocence or guilt, we should
>> say guilt is unknown.  There are some GL commands that generate
>> blits, but I don't think there's anything that can get commands on
>> the BSD ring.  That's just for media, right?
>>
>>> As an idea for the above two issues, what about the kernel also keeps a
>>> per-fd reset statistics, which will also be returned with this ioctl?
>>> We'd then restrict the meaning of the ctx fields to only mean "and the
>>> context was actually active". Userspace could then wrap all the per-fd
>>> hang reports into reset_unknown for arb_robustness, but I think this would
>>> be a bit more flexible for other userspace.
>>
>> Ah, right.  So the DDX or libva could detect resets that affect
>> them. That's reasonable.
>>
>>> struct drm_context_reset_counts {
>>> 	__u32 ctx_id;
>>>
>>> 	/**
>>>           * Index of the most recent reset where this context was
>>> 	 * guilty.  Zero if none.
>>>           */
>>> 	__u32 ctx_guilty;
>>>
>>> 	/**
>>>           * Index of the most recent reset where this context was active,
>>> 	 * not guilty.  Zero if none.
>>>           */
>>> 	__u32 ctx_not_guilty;
>>>
>>> 	/**
>>>           * Index of the most recent reset where this context was active,
>>> 	 * but guilt was unknown. Zero if none.
>>>           */
>>> 	__u32 ctx_unknown_guilt;
>>>
>>> 	/**
>>>           * Index of the most recent reset where any batchbuffer submitted
>>> 	 * through this fd was guilty.  Zero if none.
>>>           */
>>> 	__u32 fd_guilty;
>>>
>>> 	/**
>>>           * Index of the most recent reset where any batchbuffer submitted
>>> 	 * through this fd was not guilty, but affected.  Zero if none.
>>>           */
>>> 	__u32 fd_not_guilty;
>>>
>>> 	/**
>>>           * Index of the most recent reset where any batchbuffer submitted
>>> 	 * through this fd was affected, but no guilt for the hang could
>>> 	 * be assigned.  Zero if none.
>>>           */
>>> 	__u32 fd_unknown_guilt;
>>
>> Since these three fields are per-fd, shouldn't they go in the
>> proposed drm_reset_counts structure instead?  If we do that, it
>> might be better to split into two queries.  One for the per-fd
>> information, and one for the detailed per-context information.  If
>> we expect the common case to be no-reset, user space could
>
> Oh, I've missed a bit that your ioctl would read out the state for all
> contexts on a given fd. My idea was that this is all the data that the
> ioctl returns (and it would only fill out the ctx fields if you spec a
> nonzero context). Userspace would then only need to check whether the
> context has a new reset number somewhere (= context suffered from a gpu
> reset while it was actively used) and then also check whether the fd
> suffered from a reset (= something in that gl share group is potentially
> corrupted), mapping it to either victim_reset or unknown_reset (depending
> upon what exactly the spec wants).

Consider the case where a process has four contexts, A, B, X, and Y.  A 
and B are in a share group, and X and Y are in a share group not with A 
and B.  B sees a reset.  When the application queries the reset status 
on X or Y, it should not see a reset.  Knowing that there was a reset 
for some other context on the same fd isn't enough information.

> Less complexity in userspace, and imo not too much fuss for the kernel to
> do the book-keeping: Instead of only blaming contexts, we also blame
> file_privs. That would be a generally useful feature to e.g. cut off
> certain fds selectively from the gpu because they hang it too fast - i.e.
> to avoid the dreaded global "hanging too fast" when developing drivers ;-)
>
>>> };
>>>
>>> The above could also be important to know when a hw context is toast and
>>> should better not be used any more.
>>>
>>> - I like Chris' suggestion of using our seqno breadcrumbs for this. If you
>>>    also add a small extension to execbuf to return the seqno of the new
>>>    batch, you should also have all the tools in place to implement your
>>>    extension to notify userspace up to which glFence things have completed.
>>>    Note though that returning the seqno in execbuffer is only correct once
>>>    we've eliminated the flushing_list.
>>
>> It's possible, but I need to finish working out that idea (see
>> below). I think the context only needs one seqno, not one per
>> guiltiness level.  "This is the last seqno that was retired before
>> this context lost some data."
>
> The idea was to use the existing seqno as the reset counter in your ioctl.
> It might wrap around, but you need to emit quite a few commands to get to
> that point ;-) So we could use that right away.
>
>> That may still leave the context in a weird state.  Think about this
>> timeline:
>>
>>      Application      GPU
>>      draw A
>>                       submit A
>>      draw B
>>                       submit B
>>                       drawing A completes
>>                       reset (B is lost)
>>      draw C
>> 	             submit C
>>                       drawing C completes
>>      query reset
>>
>> If / when we implement this feature, the kernel may need to drop any
>> work submitted between a reset and an ack of the reset.  Dunno.
>
> Well, we could simply add a robust_context flag, set at create time. If
> such a context suffers from a gpu reset (while it's active) the kernel
> could refuse to queue up more batches with -EIO. Userspace could then

If we go that way, I think it would be better to just drop the requests. 
  OpenGL doesn't let me start generating errors after a reset, so I'd 
have to selectively ignore these errors from the kernel.  It sounds like 
it would add a bunch of annoying complexity to user space.

Of course, we're talking about step 38 here, and we're still on step 2.

> inquire the kernel with your ioctl about what's going on. Similarly we
> could add such a flag to the fd, to catch everyone else not using hw
> contexts. There we need a way to clear that flag though, so that you can
> keep on using all the bo textures that survived (whereas hw contexts are
> pretty much worthless on their own after reset, so we could demand that
> they be destroyed).
>
>>> - Last but not least, how often is userspace expected to call this ioctl?
>>>    Around once per batchbuffer submission or way more/less?
>>
>> I don't know.  This is a fairly new extension, and there are few
>> users.  As far as I'm aware, only Chrome and Firefox use it.  I can
>> find out some details from them.  My guess is somewhere between once
>> per frame and once per draw call.
>
> Cheers, Daniel


