Re: [PATCH -fixes 5/5] drm/vmwgfx: Fix a buffer object eviction regression

From: Thomas Hellstrom <thellstrom@vmware.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: pv-drivers@vmware.com, linux-graphics-maintainer@vmware.com,
	dri-devel@lists.freedesktop.org
Subject: Re: [PATCH -fixes 5/5] drm/vmwgfx: Fix a buffer object eviction regression
Date: Thu, 13 Sep 2018 18:52:43 +0200	[thread overview]
Message-ID: <c575fa18-424c-bff7-48f2-9fea759fa627@vmware.com> (raw)
In-Reply-To: <20180913152811.GA11574@bombadil.infradead.org>

On 09/13/2018 05:28 PM, Matthew Wilcox wrote:
> On Thu, Sep 13, 2018 at 04:56:53PM +0200, Thomas Hellstrom wrote:
>> Hi,
>>
>> On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
>>> On Thu, Sep 13, 2018 at 01:58:37PM +0200, Thomas Hellstrom wrote:
>>>> Commit 4eb085e42fde ("drm/vmwgfx: Convert to new IDA API") indroduced
>>>> an incorrect return value from the function vmw_gmrid_man_get_node(),
>>>> when we run out if integer ids. Instead of returning 0 (meaning
>>>> non-fatal error) we forward the ida_simple_get error code -ENOSPC.
>>>> This causes TTM not to retry allocation after buffer eviction and
>>>> instead return -ENOSPC to user-space.
>>>>
>>>> Fix this by returning 0 when ida_simple_get() returns -ENOSPC.
>>> Thanks.  I got confused by the convoluted code that was there before ;-(
>>>
>>> I think this could be better though ... if ida_alloc() ever starts
>>> returning a different errno in the future, you'll hit the same problem,
>>> right?  So how about this ...
>>>
>>>    	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>> +	if (id == -ENOMEM)
>>> +		return -ENOMEM;
>>> +	if (id < 0)
>>> +		return 0;
>>>    	spin_lock(&gman->lock);
>>>
>>> But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
>>> some memory, you'll free up an ID, so the next time around you should
>>> be able to allocate an ID.  So shouldn't this function just have
>>> been doing this all along?
>>>
>>>    	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>> +	if (id < 0)
>>> +		return 0;
>>>
>> Non-fatal errors are errors that can be remedied by GPU buffer eviction, and
>> buffer eviction will free up IDA space, so basically we need to target only
>> the error code that indicates we've run out of IDA space.
> Yes, but the following situation can happen:
>
>   - Allocate 1024 IDs
>   - Run very low on memory
>   - Allocating ID 1025 will fail (very very unlikely)
>   - ida_alloc_max() returns -ENOMEM
>
> In this situation, we want ttm_mem_evict_first() to be called which will
> free up one of the 1024 existing IDs and then we can allocate that ID for
> our new node.
>
> I'm assuming we're analysing the behaviour of ttm_bo_mem_force_space()
> here.

Well, that's true, but that situation depends I guess very much on the 
radix tree implementation of IDA? Also I would expect the eviction paths 
to try to allocate more memory here and there, so to me the preferred 
option when -ENOMEM happens, is really to back off as soon as possible 
to avoid interfering with shrinker work going on etc.

>> If we're worried that ida_alloc_max() will change return value, I guess we
>> will have to increase the IDA space and detect the error ourselves:  error
>> if (id >= gman->max_gmr_ids)
> My point was that your solution (detect the one error which should be
> deemed as non-fatal) was not as robust as its inverse (detect the one
> error which the previous code deemed as fatal).  But I now believe no
> error from the IDA should be seen as fatal.

If you insist, I can test on -ENOMEM instead of -ENOSPC to mimic the 
pre-change behaviour. We should really focus on the IDA api changes 
here, and defer changing -ENOMEM to non-fatal to a follow-up patch if 
needed.

Thanks,

Thomas

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel