As long as we can figure out who touched a certain sync object last,
that would indeed work, yes.

Christian.

Am 14.06.21 um 19:10 schrieb Marek Olšák:
> The call to the hw scheduler has a limitation on the size of all
> parameters combined. I think we can only pass a 32-bit sequence number
> and a ~16-bit global (per-GPU) syncobj handle in one call and not much
> else.
>
> The syncobj handle can be an element index in a global (per-GPU)
> syncobj table and it's read-only for all processes with the exception
> of the signal command. Syncobjs can either have per-VMID write access
> flags for the signal command (slow), or any process can write to any
> syncobj and rely only on the kernel checking the write log (fast).
>
> In any case, we can execute the memory write in the queue engine and
> only use the hw scheduler for logging, which would be perfect.
>
> Marek
>
> On Thu, Jun 10, 2021 at 12:33 PM Christian König wrote:
>
> Hi guys,
>
> maybe soften that a bit. Reading from the shared memory of the
> user fence is ok for everybody. What we need to take more care of
> is the writing side.
>
> So my current thinking is that we allow read-only access, but
> writing a new sequence value needs to go through the
> scheduler/kernel.
>
> So when the CPU wants to signal a timeline fence it needs to call
> an IOCTL. When the GPU wants to signal the timeline fence it needs
> to hand that off to the hardware scheduler.
>
> If we lock up, the kernel can check with the hardware who did the
> last write and what value was written.
>
> That, together with an IOCTL to give out sequence numbers for
> implicit sync to applications, should be sufficient for the kernel
> to track who is responsible if something bad happens.
>
> In other words, when the hardware says that the shader wrote stuff
> like 0xdeadbeef, 0x0 or 0xffffffff into memory, we kill the process
> that did it.
>
> If the hardware says that seq - 1 was written fine, but seq is
> missing, then the kernel blames whoever was supposed to write seq.
>
> Just piping the write through a privileged instance should be
> fine to make sure that we don't run into issues.
>
> Christian.
>
> Am 10.06.21 um 17:59 schrieb Marek Olšák:
>> Hi Daniel,
>>
>> We just talked about this whole topic internally and we came to the
>> conclusion that the hardware needs to understand sync object handles
>> and have high-level wait and signal operations in the command stream.
>> Sync objects will be backed by memory, but they won't be readable or
>> writable by processes directly. The hardware will log all accesses to
>> sync objects and will send the log to the kernel periodically. The
>> kernel will identify malicious behavior.
>>
>> Example of a hardware command stream:
>> ...
>> ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence
>> number is assigned by the kernel
>> Draw();
>> ImplicitSyncSignalWhenDone(syncObjHandle);
>> ...
>>
>> I'm afraid we have no other choice because of the TLB invalidation
>> overhead.
>>
>> Marek
>>
>>
>> On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter wrote:
>>
>> On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
>> > Am 09.06.21 um 15:19 schrieb Daniel Vetter:
>> > > [SNIP]
>> > > > Yeah, we call this the lightweight and the heavyweight tlb
>> > > > flush.
>> > > >
>> > > > The lightweight one can be used when you are sure that you
>> > > > don't have any of the PTEs currently in flight in the 3D/DMA
>> > > > engine and you just need to invalidate the TLB.
>> > > >
>> > > > The heavyweight one must be used when you need to invalidate
>> > > > the TLB *AND* make sure that no concurrent operation moves new
>> > > > stuff into the TLB.
>> > > >
>> > > > The problem is that for this use case we have to use the
>> > > > heavyweight one.
>> > > Just for my own curiosity: So the lightweight flush is only for
>> > > in-between CS when you know access is idle? Or does that also
>> > > not work if userspace has a CS on a dma engine going at the same
>> > > time because the tlbs aren't isolated enough between engines?
>> >
>> > More or less correct, yes.
>> >
>> > The problem is that a lightweight flush only invalidates the TLB,
>> > but doesn't take care of entries which have been handed out to the
>> > different engines.
>> >
>> > In other words, what can happen is the following:
>> >
>> > 1. Shader asks the TLB to resolve address X.
>> > 2. TLB looks into its cache, can't find address X, and asks the
>> > walker to resolve it.
>> > 3. Walker comes back with the result for address X and the TLB
>> > puts it into its cache and hands it to the shader.
>> > 4. Shader starts doing some operation using the result for
>> > address X.
>> > 5. You send a lightweight TLB invalidate and the TLB throws away
>> > the cached values for address X.
>> > 6. Shader happily still uses whatever the TLB gave it in step 3
>> > to access address X.
>> >
>> > See it like the shader has its own 1-entry L0 TLB cache which is
>> > not affected by the lightweight flush.
>> >
>> > The heavyweight flush on the other hand sends out a broadcast
>> > signal to everybody and only comes back when we are sure that an
>> > address is not in use any more.
>>
>> Ah, makes sense. On Intel the shaders only operate in VA; everything
>> goes around as explicit async messages to IO blocks. So we don't
>> have this, and the only difference in tlb flushes is between a tlb
>> flush in the IB and an mmio one, which is independent of anything
>> currently being executed on an engine.
>> -Daniel
>> --
>> Daniel Vetter
>> Software Engineer, Intel Corporation
>> http://blog.ffwll.ch
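
For illustration, here is a minimal C sketch of the packed scheduler
call and the global syncobj table Marek describes above. All names
(sched_signal_args, syncobj_entry) are invented for this sketch, not
actual amdgpu structures:

#include <stdint.h>

/* Hypothetical layout: the whole scheduler call carries little more
 * than a 32-bit sequence number and a ~16-bit per-GPU syncobj
 * handle. */
struct sched_signal_args {
        uint32_t seq;     /* sequence number assigned by the kernel */
        uint16_t handle;  /* index into the per-GPU syncobj table */
};

/* One global (per-GPU) table; read-only for processes, written only
 * through the signal path (hw scheduler or a kernel IOCTL). */
struct syncobj_entry {
        uint32_t value;      /* last sequence number written */
        uint32_t last_vmid;  /* last writer, taken from the hw log */
};

#define SYNCOBJ_TABLE_SIZE (1u << 16)
static struct syncobj_entry syncobj_table[SYNCOBJ_TABLE_SIZE];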
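
And a sketch of the blame heuristic Christian outlines for the lockup
case, assuming the hw write log reports the last writer and the kernel
tracks who was supposed to write each seq; again, the function and
parameter names are hypothetical:

#include <stdint.h>

/* Returns the VMID to blame after a lockup, or -1 if the fence state
 * looks consistent. Garbage values (0xdeadbeef, 0x0, 0xffffffff in
 * Christian's examples) point at whoever the hw log says wrote last;
 * a cleanly written seq - 1 with seq missing points at whoever was
 * supposed to write seq. */
static int blame_for_lockup(uint32_t fence_value, uint32_t expected_seq,
                            uint32_t last_writer_vmid,
                            uint32_t scheduled_writer_vmid)
{
        if (fence_value == 0xdeadbeef || fence_value == 0x0 ||
            fence_value == 0xffffffff)
                return (int)last_writer_vmid;      /* kill that process */

        if (fence_value == expected_seq - 1)
                return (int)scheduled_writer_vmid; /* seq never landed */

        return -1; /* nothing obviously wrong */
}

Both checks need nothing beyond the hw write log and the kernel's own
bookkeeping, which matches the idea of only piping writes through a
privileged instance.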