On Thu, May 28, 2020 at 2:12 PM Christian König wrote:
> Am 28.05.20 um 18:06 schrieb Marek Olšák:
>> On Thu, May 28, 2020 at 10:40 AM Christian König wrote:
>>> Am 28.05.20 um 12:06 schrieb Michel Dänzer:
>>>> On 2020-05-28 11:11 a.m., Christian König wrote:
>>>>> Well we still need implicit sync [...]
>>>>
>>>> Yeah, this isn't about "we don't want implicit sync", it's about
>>>> "amdgpu doesn't ensure later jobs fully see the effects of previous
>>>> implicitly synced jobs", requiring userspace to do pessimistic
>>>> flushing.
>>>
>>> Yes, exactly that.
>>>
>>> For the background: We also do this flushing for explicit syncs. And
>>> when this was implemented 2-3 years ago, we first did the flushing for
>>> implicit sync as well.
>>>
>>> That was immediately reverted and then implemented differently because
>>> it caused severe performance problems in some use cases.
>>>
>>> I'm not sure of the root cause of these performance problems. My
>>> assumption was always that we then insert too many pipeline syncs, but
>>> Marek doesn't seem to think it could be that.
>>>
>>> On the one hand I'm rather keen to remove the extra handling and just
>>> always use the explicit handling for everything, because it simplifies
>>> the kernel code quite a bit. On the other hand I don't want to run
>>> into this performance problem again.
>>>
>>> In addition to that, what the kernel does is a "full" pipeline sync,
>>> i.e. we busy-wait for the full hardware pipeline to drain. That might
>>> be overkill if you just want to do some flushing so that the next
>>> shader sees the stuff written, but I'm not an expert on that.
>>
>> Do we busy-wait on the CPU or in WAIT_REG_MEM?
>>
>> WAIT_REG_MEM is what UMDs do and should be faster.
>
> We use WAIT_REG_MEM to wait for an EOP fence value to reach memory.
>
> We use this for a couple of things, especially to make sure that the
> hardware is idle before changing VMID to page table associations.
>
> What about your idea of having an extra dw in the shared BOs indicating
> that they are flushed?
>
> As far as I understand it, an EOS or other event might be sufficient
> for the caches as well. And you could insert the WAIT_REG_MEM directly
> before the first draw using the texture and not before the whole IB.
>
> Could be that we can optimize this even more than what we do in the
> kernel.
>
> Christian.

Adding fences into BOs would be bad, because all UMDs would have to
handle them.

Is it possible to do this in the ring buffer:

    if (fence_signalled) {
        indirect_buffer(dependent_IB);
        indirect_buffer(other_IB);
    } else {
        indirect_buffer(other_IB);
        wait_reg_mem(fence);
        indirect_buffer(dependent_IB);
    }

Or we might have to wait for a hw scheduler.

Does the kernel sync when the driver fd is different, or when the
context is different?

Marek
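
For readers without the PM4 packet format in their head: WAIT_REG_MEM
makes the CP poll a register or memory dword until a comparison against
a reference value passes, so the wait happens on the GPU, not the CPU.
Below is a minimal sketch of how a UMD might emit it to wait for an EOP
fence value in memory, loosely modeled on radeonsi's si_cp_wait_mem();
the opcode and field layout follow my reading of the public headers, so
treat the constants as illustrative rather than authoritative:

    #include <stdint.h>

    /* PM4 type-3 packet header: 'count' is the number of body dwords - 1. */
    #define PKT3(op, count) \
        ((3u << 30) | (((count) & 0x3fffu) << 16) | (((op) & 0xffu) << 8))

    #define PKT3_WAIT_REG_MEM         0x3c
    #define WAIT_REG_MEM_FUNC_GEQUAL  5           /* pass once *addr >= ref */
    #define WAIT_REG_MEM_SPACE_MEM    (1u << 4)   /* poll memory, not a register */

    /* Make the CP spin on a fence dword in memory until it reaches
     * 'wait_value'. 'cs' is the command stream, '*ndw' the write cursor. */
    static void emit_wait_eop_fence(uint32_t *cs, unsigned *ndw,
                                    uint64_t fence_va, uint32_t wait_value)
    {
        cs[(*ndw)++] = PKT3(PKT3_WAIT_REG_MEM, 5);
        cs[(*ndw)++] = WAIT_REG_MEM_FUNC_GEQUAL | WAIT_REG_MEM_SPACE_MEM;
        cs[(*ndw)++] = (uint32_t)fence_va;          /* address low, dword-aligned */
        cs[(*ndw)++] = (uint32_t)(fence_va >> 32);  /* address high */
        cs[(*ndw)++] = wait_value;                  /* reference value */
        cs[(*ndw)++] = 0xffffffffu;                 /* compare mask */
        cs[(*ndw)++] = 4;                           /* poll interval */
    }

Because fence values increase monotonically, a greater-or-equal compare
also returns immediately when the fence has long since signalled, which
is what makes the GPU-side wait cheap in the already-flushed case.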
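
The if/else Marek sketches has a rough hardware analogue in the PM4
COND_EXEC packet, which the amdgpu kernel already uses for preemption
(see amdgpu_ring_init_cond_exec()). Assuming, as in that code, that
COND_EXEC executes the following N dwords only when the 32-bit predicate
at a given address is non-zero, and that N can be patched in after the
block has been emitted, a sketch of the mechanism could look like this;
emit_cond_exec(), patch_cond_exec() and pred_va are hypothetical names,
and the PKT3 macro is the one from the previous sketch:

    #define PKT3_COND_EXEC  0x22

    /* Execute the dwords following this packet only if *pred_va != 0;
     * otherwise the CP skips them. The skip length is unknown while
     * emitting, so it is patched afterwards, similar to what the kernel
     * does in amdgpu_ring_patch_cond_exec(). Returns the patch position. */
    static unsigned emit_cond_exec(uint32_t *cs, unsigned *ndw, uint64_t pred_va)
    {
        cs[(*ndw)++] = PKT3(PKT3_COND_EXEC, 3);
        cs[(*ndw)++] = (uint32_t)pred_va;           /* predicate address low */
        cs[(*ndw)++] = (uint32_t)(pred_va >> 32);   /* predicate address high */
        cs[(*ndw)++] = 0;                           /* reserved */
        cs[(*ndw)++] = 0;                           /* dword count, patched below */
        return *ndw;
    }

    static void patch_cond_exec(uint32_t *cs, unsigned patch_pos, unsigned ndw)
    {
        cs[patch_pos - 1] = ndw - patch_pos;        /* length of the skipped block */
    }

Two complementary predicate dwords, one for "fence signalled" and one
for "fence pending", would express the if/else above. But something
still has to write those predicates at the right moment, either the CPU
at submission time or a scheduler on the GPU, which is presumably why
Marek mentions possibly having to wait for a hw scheduler.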