On Thu, May 28, 2020 at 2:12 PM Christian König wrote:
> Am 28.05.20 um 18:06 schrieb Marek Olšák:
>> On Thu, May 28, 2020 at 10:40 AM Christian König wrote:
>>> Am 28.05.20 um 12:06 schrieb Michel Dänzer:
>>>> On 2020-05-28 11:11 a.m., Christian König wrote:
>>>>> Well we still need implicit sync [...]
>>>>
>>>> Yeah, this isn't about "we don't want implicit sync", it's about
>>>> "amdgpu doesn't ensure later jobs fully see the effects of previous
>>>> implicitly synced jobs", requiring userspace to do pessimistic
>>>> flushing.
>>>
>>> Yes, exactly that.
>>>
>>> For the background: We also do this flushing for explicit syncs. And
>>> when this was implemented 2-3 years ago, we first did the flushing for
>>> implicit sync as well.
>>>
>>> That was immediately reverted and then implemented differently because
>>> it caused severe performance problems in some use cases.
>>>
>>> I'm not sure of the root cause of these performance problems. My
>>> assumption was always that we then insert too many pipeline syncs, but
>>> Marek doesn't seem to think it could be that.
>>>
>>> On the one hand I'm rather keen to remove the extra handling and just
>>> always use the explicit handling for everything, because it simplifies
>>> the kernel code quite a bit. On the other hand I don't want to run
>>> into this performance problem again.
>>>
>>> In addition to that, what the kernel does is a "full" pipeline sync,
>>> i.e. we busy-wait for the full hardware pipeline to drain. That might
>>> be overkill if you just want to do some flushing so that the next
>>> shader sees the stuff written, but I'm not an expert on that.
>>
>> Do we busy-wait on the CPU or in WAIT_REG_MEM?
>>
>> WAIT_REG_MEM is what UMDs do and should be faster.
>
> We use WAIT_REG_MEM to wait for an EOP fence value to reach memory.
>
> We use this for a couple of things, especially to make sure that the
> hardware is idle before changing VMID to page table associations.
>
> What about your idea of having an extra dw in the shared BOs indicating
> that they are flushed?
>
> As far as I understand it, an EOS or other event might be sufficient
> for the caches as well. And you could insert the WAIT_REG_MEM directly
> before the first draw using the texture and not before the whole IB.
>
> Could be that we can optimize this even more than what we do in the
> kernel.
>
> Christian.

Adding fences into BOs would be bad, because all UMDs would have to
handle them.

Is it possible to do this in the ring buffer:

    if (fence_signalled) {
        indirect_buffer(dependent_IB);
        indirect_buffer(other_IB);
    } else {
        indirect_buffer(other_IB);
        wait_reg_mem(fence);
        indirect_buffer(dependent_IB);
    }

Or we might have to wait for a hw scheduler.

Does the kernel sync when the driver fd is different, or when the
context is different?

Marek
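
For readers without the PM4 packet format in their head: WAIT_REG_MEM
makes the CP poll a register or memory dword until a comparison against
a reference value passes, so the wait happens on the GPU, not the CPU.
Below is a minimal sketch of how a UMD might emit it to wait for an EOP
fence value in memory, loosely modeled on radeonsi's si_cp_wait_mem();
the opcode and field layout follow my reading of the public headers, so
treat the constants as illustrative rather than authoritative:

    #include <stdint.h>

    /* PM4 type-3 packet header: 'count' is the number of body dwords - 1. */
    #define PKT3(op, count) \
        ((3u << 30) | (((count) & 0x3fffu) << 16) | (((op) & 0xffu) << 8))

    #define PKT3_WAIT_REG_MEM         0x3c
    #define WAIT_REG_MEM_FUNC_GEQUAL  5           /* pass once *addr >= ref */
    #define WAIT_REG_MEM_SPACE_MEM    (1u << 4)   /* poll memory, not a register */

    /* Make the CP spin on a fence dword in memory until it reaches
     * 'wait_value'. 'cs' is the command stream, '*ndw' the write cursor. */
    static void emit_wait_eop_fence(uint32_t *cs, unsigned *ndw,
                                    uint64_t fence_va, uint32_t wait_value)
    {
        cs[(*ndw)++] = PKT3(PKT3_WAIT_REG_MEM, 5);
        cs[(*ndw)++] = WAIT_REG_MEM_FUNC_GEQUAL | WAIT_REG_MEM_SPACE_MEM;
        cs[(*ndw)++] = (uint32_t)fence_va;          /* address low, dword-aligned */
        cs[(*ndw)++] = (uint32_t)(fence_va >> 32);  /* address high */
        cs[(*ndw)++] = wait_value;                  /* reference value */
        cs[(*ndw)++] = 0xffffffffu;                 /* compare mask */
        cs[(*ndw)++] = 4;                           /* poll interval */
    }

Because fence values increase monotonically, a greater-or-equal compare
also returns immediately when the fence has long since signalled, which
is what makes the GPU-side wait cheap in the already-flushed case.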
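
The if/else Marek sketches has a rough hardware analogue in the PM4
COND_EXEC packet, which the amdgpu kernel already uses for preemption
(see amdgpu_ring_init_cond_exec()). Assuming, as in that code, that
COND_EXEC executes the following N dwords only when the 32-bit predicate
at a given address is non-zero, and that N can be patched in after the
block has been emitted, a sketch of the mechanism could look like this;
emit_cond_exec(), patch_cond_exec() and pred_va are hypothetical names,
and the PKT3 macro is the one from the previous sketch:

    #define PKT3_COND_EXEC  0x22

    /* Execute the dwords following this packet only if *pred_va != 0;
     * otherwise the CP skips them. The skip length is unknown while
     * emitting, so it is patched afterwards, similar to what the kernel
     * does in amdgpu_ring_patch_cond_exec(). Returns the patch position. */
    static unsigned emit_cond_exec(uint32_t *cs, unsigned *ndw, uint64_t pred_va)
    {
        cs[(*ndw)++] = PKT3(PKT3_COND_EXEC, 3);
        cs[(*ndw)++] = (uint32_t)pred_va;           /* predicate address low */
        cs[(*ndw)++] = (uint32_t)(pred_va >> 32);   /* predicate address high */
        cs[(*ndw)++] = 0;                           /* reserved */
        cs[(*ndw)++] = 0;                           /* dword count, patched below */
        return *ndw;
    }

    static void patch_cond_exec(uint32_t *cs, unsigned patch_pos, unsigned ndw)
    {
        cs[patch_pos - 1] = ndw - patch_pos;        /* length of the skipped block */
    }

Two complementary predicate dwords, one for "fence signalled" and one
for "fence pending", would express the if/else above. But something
still has to write those predicates at the right moment, either the CPU
at submission time or a scheduler on the GPU, which is presumably why
Marek mentions possibly having to wait for a hw scheduler.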