As long as we can figure out who touched a certain sync object last,
that would indeed work, yes.

Christian.

Am 14.06.21 um 19:10 schrieb Marek Olšák:
> The call to the hw scheduler has a limitation on the size of all
> parameters combined. I think we can only pass a 32-bit sequence number
> and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> else.
>
> The syncobj handle can be an element index in a global (per-GPU)
> syncobj table and it's read-only for all processes with the exception
> of the signal command. Syncobjs can either have per-VMID write access
> flags for the signal command (slow), or any process can write to any
> syncobj and rely only on the kernel checking the write log (fast).
>
> In any case, we can execute the memory write in the queue engine and
> only use the hw scheduler for logging, which would be perfect.
>
> Marek
>
> On Thu, Jun 10, 2021 at 12:33 PM Christian König wrote:
>
> Hi guys,
>
> maybe soften that a bit. Reading from the shared memory of the
> user fence is ok for everybody. What we need to take more care of
> is the writing side.
>
> So my current thinking is that we allow read-only access, but
> writing a new sequence value needs to go through the
> scheduler/kernel.
>
> So when the CPU wants to signal a timeline fence it needs to call
> an IOCTL. When the GPU wants to signal the timeline fence it needs
> to hand that off to the hardware scheduler.
>
> If we lock up, the kernel can check with the hardware who did the
> last write and what value was written.
>
> That, together with an IOCTL to give out sequence numbers for
> implicit sync to applications, should be sufficient for the kernel
> to track who is responsible if something bad happens.
>
> In other words, when the hardware says that the shader wrote stuff
> like 0xdeadbeef, 0x0 or 0xffffffff into memory, we kill the process
> that did it.
>
> If the hardware says that seq - 1 was written fine, but seq is
> missing, then the kernel blames whoever was supposed to write seq.
>
> Just piping the write through a privileged instance should be
> fine to make sure that we don't run into issues.
>
> Christian.
>
> Am 10.06.21 um 17:59 schrieb Marek Olšák:
>> Hi Daniel,
>>
>> We just talked about this whole topic internally and we came to the
>> conclusion that the hardware needs to understand sync object handles
>> and have high-level wait and signal operations in the command stream.
>> Sync objects will be backed by memory, but they won't be readable or
>> writable by processes directly. The hardware will log all accesses to
>> sync objects and will send the log to the kernel periodically. The
>> kernel will identify malicious behavior.
>>
>> Example of a hardware command stream:
>> ...
>> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
>> number is assigned by the kernel
>> Draw();
>> ImplicitSyncSignalWhenDone(syncObjHandle);
>> ...
>>
>> I'm afraid we have no other choice because of the TLB invalidation
>> overhead.
>>
>> Marek
>>
>>
>> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:
>>
>> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>> > > [SNIP]
>> > > > Yeah, we call this the lightweight and the heavyweight tlb
>> > > > flush.
>> > > >
>> > > > The lightweight one can be used when you are sure that you
>> > > > don't have any of the PTEs currently in flight in the 3D/DMA
>> > > > engine and you just need to invalidate the TLB.
>> > > >
>> > > > The heavyweight one must be used when you need to invalidate
>> > > > the TLB *AND* make sure that no concurrent operation moves new
>> > > > stuff into the TLB.
>> > > >
>> > > > The problem is that for this use case we have to use the
>> > > > heavyweight one.
>> > > Just for my own curiosity: So the lightweight flush is only for
>> > > in-between CS when you know access is idle? Or does that also
>> > > not work if userspace has a CS on a dma engine going at the same
>> > > time because the tlbs aren't isolated enough between engines?
>> >
>> > More or less correct, yes.
>> >
>> > The problem is that a lightweight flush only invalidates the TLB,
>> > but doesn't take care of entries which have been handed out to the
>> > different engines.
>> >
>> > In other words, what can happen is the following:
>> >
>> > 1. Shader asks the TLB to resolve address X.
>> > 2. TLB looks into its cache, can't find address X, and asks the
>> > walker to resolve it.
>> > 3. Walker comes back with the result for address X and the TLB
>> > puts it into its cache and hands it to the shader.
>> > 4. Shader starts doing some operation using the result for
>> > address X.
>> > 5. You send a lightweight TLB invalidate and the TLB throws away
>> > the cached values for address X.
>> > 6. Shader happily still uses whatever the TLB gave it in step 3
>> > to access address X.
>> >
>> > See it like the shader has its own 1-entry L0 TLB cache which is
>> > not affected by the lightweight flush.
>> >
>> > The heavyweight flush on the other hand sends out a broadcast
>> > signal to everybody and only comes back when we are sure that an
>> > address is not in use any more.
>>
>> Ah, makes sense. On Intel the shaders only operate in VA; everything
>> goes around as explicit async messages to IO blocks. So we don't
>> have this, and the only difference in tlb flushes is between a tlb
>> flush in the IB and an mmio one, which is independent of anything
>> currently being executed on an engine.
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
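
For illustration, here is a minimal C sketch of the packed scheduler
call and the global syncobj table Marek describes above. All names
(sched_signal_args, syncobj_entry) are invented for this sketch, not
actual amdgpu structures:

#include <stdint.h>

/* Hypothetical layout: the whole scheduler call carries little more
 * than a 32-bit sequence number and a ~16-bit per-GPU syncobj
 * handle. */
struct sched_signal_args {
        uint32_t seq;     /* sequence number assigned by the kernel */
        uint16_t handle;  /* index into the per-GPU syncobj table */
};

/* One global (per-GPU) table; read-only for processes, written only
 * through the signal path (hw scheduler or a kernel IOCTL). */
struct syncobj_entry {
        uint32_t value;      /* last sequence number written */
        uint32_t last_vmid;  /* last writer, taken from the hw log */
};

#define SYNCOBJ_TABLE_SIZE (1u << 16)
static struct syncobj_entry syncobj_table[SYNCOBJ_TABLE_SIZE];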
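
And a sketch of the blame heuristic Christian outlines for the lockup
case, assuming the hw write log reports the last writer and the kernel
tracks who was supposed to write each seq; again, the function and
parameter names are hypothetical:

#include <stdint.h>

/* Returns the VMID to blame after a lockup, or -1 if the fence state
 * looks consistent. Garbage values (0xdeadbeef, 0x0, 0xffffffff in
 * Christian's examples) point at whoever the hw log says wrote last;
 * a cleanly written seq - 1 with seq missing points at whoever was
 * supposed to write seq. */
static int blame_for_lockup(uint32_t fence_value, uint32_t expected_seq,
                            uint32_t last_writer_vmid,
                            uint32_t scheduled_writer_vmid)
{
        if (fence_value == 0xdeadbeef || fence_value == 0x0 ||
            fence_value == 0xffffffff)
                return (int)last_writer_vmid;      /* kill that process */

        if (fence_value == expected_seq - 1)
                return (int)scheduled_writer_vmid; /* seq never landed */

        return -1; /* nothing obviously wrong */
}

Both checks need nothing beyond the hw write log and the kernel's own
bookkeeping, which matches the idea of only piping writes through a
privileged instance.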