On Fri, Oct 25, 2019 at 4:29 PM Chris Wilson <chris@chris-wilson.co.uk> wrote:
Quoting Jason Ekstrand (2019-10-25 19:22:04)
> On Thu, Oct 24, 2019 at 6:40 AM Chris Wilson <chris@chris-wilson.co.uk> wrote:
>
>     Our existing behaviour is to allow contexts and their GPU requests to
>     persist past the point of closure until the requests are complete. This
>     allows clients to operate in a 'fire-and-forget' manner where they can
>     set up a rendering pipeline and hand it over to the display server and
>     immediately exit. As the rendering pipeline is kept alive until
>     completion, the display server (or other consumer) can use the results
>     in the future and present them to the user.
>
>     However, not all clients want this persistent behaviour and would prefer
>     that the contexts are cleaned up immediately upon closure. This ensures
>     that when clients are run without hangchecking, any GPU hang is
>     terminated with the process and does not continue to hog resources.
>
>     By defining a context property to allow clients to control persistence
>     explicitly, we can remove the blanket advice to disable hangchecking
>     that seems to be far too prevalent.
>
>
> Just to be clear, when you say "disable hangchecking" do you mean disabling it
> for all processes via a kernel parameter at boot time or a sysfs entry or
> similar?  Or is there some mechanism whereby a context can request no hang
> checking?

They are being told to use the module parameter i915.enable_hangcheck=0
to globally disable hangchecking. This is what we are trying to wean
them off, and yet still allow indefinitely long kernels. The softer
hangcheck means that if you block scheduling or preemption of
higher-priority work, you are forcibly removed from the GPU. However, even
that is too much for some workloads, where they really do expect to
permanently hog the GPU. (All I can say is that they better be dedicated
systems because if you demand interactivity on top of disabling
preemption...)

Ok, thinking out loud here (no need for you to respond):  Why should we take this approach?  It seems like there are several other ways we could solve this:

 1. Have a per-context flag (that's what we did here)
 2. Have a per-execbuf flag for "don't allow this execbuf to outlive the process".
 3. Have a DRM_IOCTL_I915_KILL_CONTEXT which lets the client manually kill the context

Option 2 seems like a lot more work in i915, and it doesn't seem that advantageous.  Most drivers are going to want their batches either to outlive them or not; they aren't going to be making that decision on a per-batch basis.

Option 3 would work for some cases, but it doesn't let the kernel terminate work if the client is killed unexpectedly by, for instance, a segfault.  The client could try to install a crash handler, but calling a DRM ioctl from a crash handler sounds like a bad plan.  On the other hand, the client can just as easily implement option 3 by setting the new context flag and then calling GEM_CONTEXT_DESTROY.

With that, I think I'm convinced that a context param is the best way to do this.  We may even consider using it in Vulkan when running headless to let us kill stuff quicker.  We aren't seeing any long-running Vulkan compute workloads yet but they may be coming.

Acked-by: Jason Ekstrand <jason@jlekstrand.net>


One more question: Does this bit fully support being turned on and off, or is it set-once?  I ask because the way I'd likely go about doing this in Vulkan would be to set it on context create and then unset it the moment we see a buffer shared with the outside world.

--Jason