From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org,
	Lionel Landwerlin <lionel.g.landwerlin@linux.intel.com>
Subject: Re: [PATCH v4] drm/i915: Implement inter-engine read-read optimisations
Date: Fri, 27 Mar 2015 11:22:15 +0000	[thread overview]
Message-ID: <55153D67.6070501@linux.intel.com> (raw)
In-Reply-To: <20150326213140.GF18055@nuc-i3427.alporthouse.com>


On 03/26/2015 09:31 PM, Chris Wilson wrote:
> On Thu, Mar 26, 2015 at 05:43:33PM +0000, Tvrtko Ursulin wrote:
>>> -static int
>>> -i915_gem_object_wait_rendering__tail(struct drm_i915_gem_object *obj)
>>> -{
>>> -	if (!obj->active)
>>> -		return 0;
>>> -
>>> -	/* Manually manage the write flush as we may have not yet
>>> -	 * retired the buffer.
>>> -	 *
>>> -	 * Note that the last_write_req is always the earlier of
>>> -	 * the two (read/write) requests, so if we haved successfully waited,
>>> -	 * we know we have passed the last write.
>>> -	 */
>>> -	i915_gem_request_assign(&obj->last_write_req, NULL);
>>> -
>>> -	return 0;
>>> +	return __i915_wait_request(req, reset_counter,
>>> +				   interruptible, NULL, NULL);
>>>   }
>>
>> Net code comments -/+ for this patch needs improvement. :) Above you
>> deleted a chunk but below added nothing.
>
> I did. It's not added here as this code is buggy.

In a later version, or do you mean the few lines at i915_gem_object_sync?

>> I think something high level is needed to explain the active lists
>> and tracking etc.
>
> GEM object domain management and eviction.
>
>>>   /**
>>> @@ -1405,18 +1381,39 @@ static __must_check int
>>>   i915_gem_object_wait_rendering(struct drm_i915_gem_object *obj,
>>>   			       bool readonly)
>>>   {
>>> -	struct drm_i915_gem_request *req;
>>> -	int ret;
>>> +	int ret, i;
>>>
>>> -	req = readonly ? obj->last_write_req : obj->last_read_req;
>>> -	if (!req)
>>> +	if (!obj->active)
>>>   		return 0;
>>>
>>> -	ret = i915_wait_request(req);
>>> -	if (ret)
>>> -		return ret;
>>> +	if (readonly) {
>>> +		if (obj->last_write_req != NULL) {
>>> +			ret = i915_wait_request(obj->last_write_req);
>>> +			if (ret)
>>> +				return ret;
>>> +
>>> +			i = obj->last_write_req->ring->id;
>>> +			if (obj->last_read_req[i] == obj->last_write_req)
>>> +				i915_gem_object_retire__read(obj, i);
>>> +			else
>>> +				i915_gem_object_retire__write(obj);
>>
>> Above mentioned comments would especially help understand the
>> ordering rules here.
>>
>>> +		}
>>> +	} else {
>>> +		for (i = 0; i < I915_NUM_RINGS; i++) {
>>> +			if (obj->last_read_req[i] == NULL)
>>> +				continue;
>>> +
>>> +			ret = i915_wait_request(obj->last_read_req[i]);
>>> +			if (ret)
>>> +				return ret;
>>> +
>>> +			i915_gem_object_retire__read(obj, i);
>>> +		}
>>
>> I think with optimistic spinning this could end up doing num_rings
>> spins etc in sequence. Would it be worth smarting it up somehow?
>
> Hmm. Good point. Obviously the goal of the optimisation is to have more
> opportunities where we never have to wait at all. Writing a new
> i915_wait_requests() will add a lot of code, definitely something I want
> to postpone. But given the sequential wait, later ones are more likely
> to already be complete.

Just because of elapsed time, I suppose, not because of any conceptual 
correlation between execution time and rings. If we have three batches 
on three rings with execution times like:

0: ==
1: ====
2: ========

it will end up doing "spin a bit, wait a bit" three times.

That is probably not a likely pattern, though, so I defer to your 
opinion that it is OK to postpone smarting this up.
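To make the worst case concrete, here is a toy model of waiting on the 
per-ring requests in sequence (not i915 code; the function and the 
completion times are made up for illustration). Each wait advances a 
clock to that request's completion time, and a fresh spin/wait episode 
starts whenever we arrive at a request that has not yet completed:

```c
#include <assert.h>

#define NUM_RINGS 3

/* Toy model: completion_time[i] is when ring i's last request finishes.
 * total_wait ends up as max(t_i); spin_episodes counts how many separate
 * spin/wait cycles the sequential loop performs. */
struct wait_result {
	int total_wait;
	int spin_episodes;
};

struct wait_result wait_all_sequential(const int completion_time[NUM_RINGS])
{
	struct wait_result r = { 0, 0 };

	for (int i = 0; i < NUM_RINGS; i++) {
		/* Request still outstanding: spin a bit, wait a bit. */
		if (completion_time[i] > r.total_wait) {
			r.spin_episodes++;
			r.total_wait = completion_time[i];
		}
	}
	return r;
}
```

With completion times {2, 4, 8}, as in the diagram above, this gives 
three separate episodes; visiting the longest-running request first 
would collapse it to one, since the shorter ones complete in the 
meantime.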

>>> +		BUG_ON(obj->active);
>>
>> Better WARN and halt the driver indirectly if unavoidable.
>
> It's a debug leftover, it's gone.
>
>>> +	}
>>> +
>>> +	return 0;
>>>
>>> -	return i915_gem_object_wait_rendering__tail(obj);
>>>   }
>>>
>>>   /* A nonblocking variant of the above wait. This is a highly dangerous routine
>>> @@ -1427,37 +1424,71 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
>>>   					    struct drm_i915_file_private *file_priv,
>>>   					    bool readonly)
>>>   {
>>> -	struct drm_i915_gem_request *req;
>>>   	struct drm_device *dev = obj->base.dev;
>>>   	struct drm_i915_private *dev_priv = dev->dev_private;
>>> +	struct drm_i915_gem_request *requests[I915_NUM_RINGS];
>>>   	unsigned reset_counter;
>>> -	int ret;
>>> +	int ret, i, n = 0;
>>>
>>>   	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
>>>   	BUG_ON(!dev_priv->mm.interruptible);
>>>
>>> -	req = readonly ? obj->last_write_req : obj->last_read_req;
>>> -	if (!req)
>>> +	if (!obj->active)
>>>   		return 0;
>>>
>>>   	ret = i915_gem_check_wedge(&dev_priv->gpu_error, true);
>>>   	if (ret)
>>>   		return ret;
>>>
>>> -	ret = i915_gem_check_olr(req);
>>> -	if (ret)
>>> -		return ret;
>>> -
>>>   	reset_counter = atomic_read(&dev_priv->gpu_error.reset_counter);
>>> -	i915_gem_request_reference(req);
>>
>> Good place for a comment: Build an array of requests to be waited upon etc..
>
> Code comments need to explain why.

I suppose that means it is totally obvious what all this code does, 
since there are no comments? :D

Sometimes it helps to say what you are doing when a block of code is in 
question, and you can even add the why to the what. I think it helps the 
reviewer, and anyone trying to understand the code in the future.
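For example, a "what plus why" comment on the request-gathering loop 
from this patch might look something like this (a standalone sketch with 
made-up types, not the actual i915 structures):

```c
#include <stddef.h>

#define I915_NUM_RINGS 5

struct req { int id; };

/* Hypothetical stand-in for the object fields used in the patch. */
struct obj {
	struct req *last_read_req[I915_NUM_RINGS];
};

/*
 * Build a local array of the requests we must wait for (the "what"), so
 * that we can drop struct_mutex and wait without the object's request
 * pointers changing under us (the "why").
 */
int collect_requests(struct obj *o, struct req *requests[I915_NUM_RINGS])
{
	int n = 0;

	for (int i = 0; i < I915_NUM_RINGS; i++) {
		if (o->last_read_req[i] != NULL)
			requests[n++] = o->last_read_req[i];
	}
	return n;
}
```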

>>> +
>>> +	if (readonly) {
>>> +		struct drm_i915_gem_request *rq;
>>> +
>>> +		rq = obj->last_write_req;
>>> +		if (rq == NULL)
>>> +			return 0;
>>> +
>>> +		ret = i915_gem_check_olr(rq);
>>> +		if (ret)
>>> +			goto err;
>>> +
>>> +		requests[n++] = i915_gem_request_reference(rq);
>>> +	} else {
>>> +		for (i = 0; i < I915_NUM_RINGS; i++) {
>>> +			struct drm_i915_gem_request *rq;
>>> +
>>> +			rq = obj->last_read_req[i];
>>> +			if (rq == NULL)
>>> +				continue;
>>> +
>>> +			ret = i915_gem_check_olr(rq);
>>> +			if (ret)
>>> +				goto err;
>>> +
>>> +			requests[n++] = i915_gem_request_reference(rq);
>>> +		}
>>> +	}
>>
>> I wonder if merging the tracked requests to a single array, roughly
>> something like:
>>
>> 	obj->last_req[1 + NUM_RINGS]
>>
>> and:
>>
>> 	LAST_WRITE_REQ = 0
>> 	LAST_READ_REQ(ring) = 1 + ring
>>
>> Could make the above logic more compact and how would it look
>> elsewhere in the code? Definitely looks like it would work for the
>> above loop:
>>
>> 	for (i = LAST_WRITE_REQ; i <= LAST_TRACKED_REQUEST; i++) {
>> 		rq = obj->last_req[i];
>> 		if (rq == NULL && readonly)
>> 			return 0;
>> 		else if (rq == NULL)
>> 			continue;
>> 		...
>> 		requests[n++] = ...
>> 		if (readonly)
>> 			break;
>> 	}
>>
>> Not sure, might not be that great idea.
>
> Nah. I think it's only a win here. Elsewhere there is greater
> differentiation between read/write. Conceptually it is even
>
> 	struct {
> 		struct drm_i915_gem_request *request;
> 		struct list_head active_link;
> 	} read[I915_NUM_RINGS];
> 	struct drm_i915_gem_request *write_request, *fence_request;
>
>>>   	mutex_unlock(&dev->struct_mutex);
>>> -	ret = __i915_wait_request(req, reset_counter, true, NULL, file_priv);
>>> +	for (i = 0; ret == 0 && i < n; i++)
>>> +		ret = __i915_wait_request(requests[i], reset_counter, true,
>>> +					  NULL, file_priv);
>>
>> Another chance for more optimal "group" waiting.
>>
>>>   	mutex_lock(&dev->struct_mutex);
>>> -	i915_gem_request_unreference(req);
>>> -	if (ret)
>>> -		return ret;
>>>
>>> -	return i915_gem_object_wait_rendering__tail(obj);
>>> +err:
>>> +	for (i = 0; i < n; i++) {
>>> +		if (ret == 0) {
>>> +			int ring = requests[i]->ring->id;
>>> +			if (obj->last_read_req[ring] == requests[i])
>>> +				i915_gem_object_retire__read(obj, ring);
>>> +			if (obj->last_write_req == requests[i])
>>> +				i915_gem_object_retire__write(obj);
>>> +		}
>>
>> What if one wait succeeded and one failed, no need to update anything?
>
> Too much complication for an error path. This just aims to optimise the
> bookkeeping for an object after a wait.
>
>>> -static void
>>> -i915_gem_object_move_to_active(struct drm_i915_gem_object *obj,
>>> -			       struct intel_engine_cs *ring)
>>> +void i915_vma_move_to_active(struct i915_vma *vma,
>>> +			     struct intel_engine_cs *ring)
>>>   {
>>> -	struct drm_i915_gem_request *req;
>>> -	struct intel_engine_cs *old_ring;
>>> -
>>> -	BUG_ON(ring == NULL);
>>> -
>>> -	req = intel_ring_get_request(ring);
>>> -	old_ring = i915_gem_request_get_ring(obj->last_read_req);
>>> -
>>> -	if (old_ring != ring && obj->last_write_req) {
>>> -		/* Keep the request relative to the current ring */
>>> -		i915_gem_request_assign(&obj->last_write_req, req);
>>> -	}
>>> +	struct drm_i915_gem_object *obj = vma->obj;
>>>
>>>   	/* Add a reference if we're newly entering the active list. */
>>> -	if (!obj->active) {
>>> +	if (obj->last_read_req[ring->id] == NULL && obj->active++ == 0)
>>>   		drm_gem_object_reference(&obj->base);
>>> -		obj->active = 1;
>>> -	}
>>> +	BUG_ON(obj->active == 0 || obj->active > I915_NUM_RINGS);
>>
>> Maybe we need I915_WARN_AND_HALT() which would be a WARN() plus put
>> the driver in halted state?
>
> This just magically evaporates by moving over to an obj->active
> bitfield. You'll love v5, but hate v6 (which abuses a lots more internal
> knowledge of request management).
>
>>> +	if (to == NULL) {
>>> +		ret = i915_gem_object_wait_rendering(obj, readonly);
>>> +	} else if (readonly) {
>>> +		ret = __i915_gem_object_sync(obj, to,
>>> +					     obj->last_write_req);
>>> +	} else {
>>> +		for (i = 0; i < I915_NUM_RINGS; i++) {
>>> +			ret = __i915_gem_object_sync(obj, to,
>>> +						     obj->last_read_req[i]);
>>
>> Here I think is another opportunity to wait for all of them at once.
>> Via a __i915_gem_object_sync helper which would take an array, or a
>> ring/request mask. Not sure if it would be worth it though.
>
> No. This is more complicated still since we have the semaphores to also
> deal with. Definitely worth waiting for a testcase.
>
>>> -	args->busy = obj->active;
>>> -	if (obj->last_read_req) {
>>> -		struct intel_engine_cs *ring;
>>>   		BUILD_BUG_ON(I915_NUM_RINGS > 16);
>>> -		ring = i915_gem_request_get_ring(obj->last_read_req);
>>> -		args->busy |= intel_ring_flag(ring) << 16;
>>> +
>>> +		for (i = 0; i < I915_NUM_RINGS; i++) {
>>> +			if (obj->last_read_req[i] == NULL)
>>> +				continue;
>>> +
>>> +			args->busy |= 1 << (16 + i) | 1;
>>
>> Doesn't look like equivalent bits will be set? What is the "| 1" at
>> the end for?
>
> No. It's designed for. This is what userspaces expects and is required
> to help workaround #70764.
>
> I want to replace the (| 1) with
> (| intel_ring_flag(obj->last_write_request->ring)); but it exists because
> we couldn't be sure if any userspace depended on exactly (0, 1).

I don't get it: the value is exported to userspace and now it is 
different, so why is that OK?
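For reference, my reading of the encoding under discussion, as a 
standalone sketch (the function and array here are made up; the real 
code sets the bits inline): bit 0 is the historical 0/1 busy flag, and 
bits 16 and up form a per-ring read mask.

```c
#include <stdint.h>

#define I915_NUM_RINGS 5

/* Toy model of the busy-ioctl result: ring_busy[i] is nonzero when ring
 * i has an outstanding read on the object.  Bit 0 stays exactly 0 or 1
 * for old userspace; bits 16+ say which rings are reading. */
uint32_t encode_busy(const int ring_busy[I915_NUM_RINGS])
{
	uint32_t busy = 0;

	for (int i = 0; i < I915_NUM_RINGS; i++) {
		if (ring_busy[i])
			busy |= (1u << (16 + i)) | 1u;
	}
	return busy;
}
```

So a buffer with outstanding reads on rings 0 and 2 reports 0x50001, 
while userspace that only ever tests bit 0 still sees a plain busy flag.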

>>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>>> index 0efb19a9b9a5..1a3756e6bea4 100644
>>> --- a/drivers/gpu/drm/i915/intel_display.c
>>> +++ b/drivers/gpu/drm/i915/intel_display.c
>>> @@ -9674,7 +9674,7 @@ static bool use_mmio_flip(struct intel_engine_cs *ring,
>>>   	else if (i915.enable_execlists)
>>>   		return true;
>>>   	else
>>> -		return ring != i915_gem_request_get_ring(obj->last_read_req);
>>> +		return ring != i915_gem_request_get_ring(obj->last_write_req);
>>>   }
>>
>> Why this?
>
> Because that is correct.

OK I'll think about it.

Regards,

Tvrtko


_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx


Thread overview: 11+ messages
2015-03-16 15:45 [PATCH] drm/i915: Implement inter-engine read-read optimisations Chris Wilson
2015-03-17  9:54 ` [PATCH v2] " Chris Wilson
2015-03-17 10:11   ` [PATCH v3] " Chris Wilson
2015-03-19  8:59   ` [PATCH v4] " Chris Wilson
2015-03-26 16:09     ` Lionel Landwerlin
2015-03-26 16:09       ` [PATCH] drm/i915: Implement " Lionel Landwerlin
2015-03-26 16:12       ` Chris Wilson
2015-03-26 17:43     ` [PATCH v4] drm/i915: Implement " Tvrtko Ursulin
2015-03-26 21:31       ` Chris Wilson
2015-03-27 11:22         ` Tvrtko Ursulin [this message]
2015-03-27 11:38           ` Chris Wilson
