* [RFC] Host1x/TegraDRM UAPI
@ 2020-06-23 12:09 Mikko Perttunen
  2020-06-24 20:55 ` Dmitry Osipenko
                   ` (10 more replies)
  0 siblings, 11 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-23 12:09 UTC (permalink / raw)
  To: Thierry Reding, Jon Hunter, Dmitry Osipenko, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

# Host1x/TegraDRM UAPI proposal

This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace 
the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in 
many ways.

I haven't written any implementation yet -- I'll do that once there is 
some agreement on the high-level design.

Current open items:

* The syncpoint UAPI allows userspace to create sync_file FDs with 
arbitrary syncpoint fences. dma_fence code currently seems to assume all 
fences will be signaled, which would not necessarily be the case with 
this interface.
* The GEM IOCTLs present in the old UAPI (GEM_CREATE, GEM_MMAP) are not 
included here. I'm not sure if they are still needed.

## Introduction to the hardware

Tegra Host1x is a hardware block providing the following capabilities:

* Syncpoints, a unified whole-system synchronization primitive, allowing 
synchronization of work between graphics, compute and multimedia 
engines, CPUs including cross-VM synchronization, and devices on the 
PCIe bus, without incurring CPU overhead.
* Channels, a command DMA mechanism allowing asynchronous programming of 
various engines, integrating with syncpoints.
* Hardware virtualization support for syncpoints and channels. (On 
Tegra186 and newer)

This proposal defines APIs for userspace access to syncpoints and 
channels. Kernel drivers can additionally use syncpoints and channels 
internally, providing other userspace interfaces (e.g. V4L2).

Syncpoint and channel interfaces are split into separate parts, as 
syncpoints are useful as a system synchronization primitive even without 
using the engine drivers provided through TegraDRM. For example, a 
computer vision pipeline consisting of video capture, CPU processing and 
GPU processing would not necessarily use the engines provided by 
TegraDRM. See the Example workflows section for more details.

## Syncpoint interface

Syncpoints are a set of 32-bit values providing the following operations:

* Atomically increment value by one
* Read current value
* Wait until the value reaches a specified threshold. For waiting, the 
32-bit value space is treated as modulo 2^32; e.g. if the current value 
is 0xffffffff, then value 0x0 is considered to be one increment in the 
future (see the comparison sketch below).
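
Since waits are defined modulo 2^32, "has the value reached the 
threshold" cannot be a plain `>=` comparison. A minimal sketch of the 
usual wrap-safe check (illustrative only, not part of the UAPI):

```
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if 'value' has reached 'threshold' in the modulo-2^32
 * sense: thresholds up to 2^31 - 1 increments ahead of 'value' are
 * treated as "not yet reached".
 */
static bool syncpoint_value_reached(uint32_t value, uint32_t threshold)
{
        return (int32_t)(value - threshold) >= 0;
}
```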

Each syncpoint is identified by a system-global ID in the range [0, 
number of syncpoints supported by the hardware). The entire system has 
read-only access to all syncpoints based on their ID.

Syncpoints are managed through the device node /dev/host1x provided by 
the gpu/host1x driver.

### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)

Allocates a free syncpoint, returning a file descriptor representing it. 
Only the owner of the file descriptor is allowed to mutate the value of 
the syncpoint.

```
struct host1x_ctrl_allocate_syncpoint {
        /**
         * @fd:
         *
         * [out] New file descriptor representing the allocated syncpoint.
         */
        __s32 fd;

        __u32 reserved[3];
};
```
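
For illustration, allocating a syncpoint and retrieving its 
system-global ID from userspace could look roughly as follows. The 
`HOST1X_IOCTL_*` request macros and the header name are placeholders, 
since this proposal does not define ioctl numbers yet:

```
#include <stdint.h>
#include <sys/ioctl.h>

#include "host1x_uapi.h"   /* hypothetical header carrying the structs above */

/* Allocate a syncpoint and return its FD; store the global ID in *id. */
static int allocate_syncpoint(int host1x_fd, uint32_t *id)
{
        struct host1x_ctrl_allocate_syncpoint alloc = { 0 };
        struct host1x_syncpoint_info info = { 0 };

        /* Allocate a free syncpoint; the returned FD owns the syncpoint. */
        if (ioctl(host1x_fd, HOST1X_IOCTL_ALLOCATE_SYNCPOINT, &alloc) < 0)
                return -1;

        /* Query the system-global ID, needed e.g. when building gathers. */
        if (ioctl(alloc.fd, HOST1X_IOCTL_SYNCPOINT_INFO, &info) < 0)
                return -1;

        *id = info.id;
        return alloc.fd;
}
```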

### IOCTL HOST1X_SYNCPOINT_INFO (on syncpoint file descriptor)

Allows retrieval of the system-global syncpoint ID corresponding to the syncpoint.

Use cases:

* Passing ID to other system components that identify syncpoints by ID
* Debugging and testing

```
struct host1x_syncpoint_info {
        /**
         * @id:
         *
         * [out] System-global ID of the syncpoint.
         */
        __u32 id;

        __u32 reserved[3];
};
```

### IOCTL HOST1X_SYNCPOINT_INCREMENT (on syncpoint file descriptor)

Allows incrementing of the syncpoint value.

Use cases:

* Signalling work completion when executing a pipeline step on the CPU
* Debugging and testing

```
struct host1x_syncpoint_increment {
        /**
         * @count:
         *
         * [in] Number of times to increment the syncpoint. The value can
         *   be observed at intermediate values, but each increment is atomic.
         */
        __u32 count;
};
```
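
As a sketch of the CPU-signalling use case, a pipeline step running on 
the CPU could mark one unit of work as complete like this (the ioctl 
number is again a placeholder):

```
#include <sys/ioctl.h>

#include "host1x_uapi.h"   /* hypothetical header, as above */

/* Signal completion of one unit of CPU work on the given syncpoint FD. */
static int signal_cpu_step_done(int syncpoint_fd)
{
        struct host1x_syncpoint_increment incr = {
                .count = 1,
        };

        return ioctl(syncpoint_fd, HOST1X_IOCTL_SYNCPOINT_INCREMENT, &incr);
}
```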

### IOCTL HOST1X_READ_SYNCPOINT (on /dev/host1x)

Read the value of a syncpoint based on its ID.

Use cases:

* Allows more fine-grained tracking of task progression for debugging 
purposes

```
struct host1x_ctrl_read_syncpoint {
        /**
         * @id:
         *
         * [in] ID of syncpoint to read.
         */
        __u32 id;

        /**
         * @value:
         *
         * [out] Value of the syncpoint.
         */
        __u32 value;
};
```

### IOCTL HOST1X_CREATE_FENCE (on /dev/host1x)

Creates a new SYNC_FILE fence file descriptor for the specified 
syncpoint ID and threshold.

Use cases:

* Creating a fence when receiving ID/threshold pair from another system 
component
* Creating a postfence when executing a pipeline step on the CPU
* Creating a postfence when executing a pipeline step controlled by 
userspace (e.g. GPU userspace submission)

```
struct host1x_ctrl_create_fence {
        /**
         * @id:
         *
         * [in] ID of the syncpoint for which to create a fence.
         */
        __u32 id;

        /**
         * @threshold:
         *
         * [in] Threshold value for fence.
         */
        __u32 threshold;

        /**
         * @fence_fd:
         *
         * [out] New sync_file FD corresponding to the ID and threshold.
         */
        __s32 fence_fd;

        __u32 reserved[1];
};
```
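
Since the returned FD is a regular sync_file, userspace can wait on it 
with poll()/ppoll() or hand it to any API that accepts sync_files. A 
sketch of creating a fence for a known ID/threshold pair and blocking on 
it (the ioctl number is a placeholder):

```
#include <poll.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include "host1x_uapi.h"   /* hypothetical header, as above */

/* Create a sync_file for (id, threshold) and block until it signals. */
static int wait_for_threshold(int host1x_fd, uint32_t id, uint32_t threshold)
{
        struct host1x_ctrl_create_fence args = {
                .id = id,
                .threshold = threshold,
        };
        struct pollfd pfd;
        int err;

        if (ioctl(host1x_fd, HOST1X_IOCTL_CREATE_FENCE, &args) < 0)
                return -1;

        /* A sync_file becomes readable once its fence has signaled. */
        pfd.fd = args.fence_fd;
        pfd.events = POLLIN;
        err = poll(&pfd, 1, -1);

        close(args.fence_fd);
        return err < 0 ? -1 : 0;
}
```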

### IOCTL HOST1X_GET_FENCE_INFO (on /dev/host1x)

Allows retrieval of the ID/threshold pairs corresponding to a SYNC_FILE 
fence or fence array.

Use cases:

* Debugging and testing
* Transmitting a fence to another system component requiring ID/threshold
* Getting ID/threshold for prefence when programming a pipeline step 
controlled by userspace (e.g. GPU userspace submission)

```
/* If set, corresponding fence is backed by Host1x syncpoints. */
#define HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE      (1 << 0)

struct host1x_ctrl_fence_info {
        /**
         * @flags:
         *
         * [out] HOST1X_CTRL_FENCE_INFO flags.
         */
        __u32 flags;

        /**
         * @id:
         *
         * [out] ID of the syncpoint corresponding to this fence.
         * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
         */
        __u32 id;

        /**
         * @threshold:
         *
         * [out] Signalling threshold of the fence.
         * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
         */
        __u32 threshold;

        __u32 reserved[1];
};

struct host1x_ctrl_get_fence_info {
        /**
         * @fence_fd:
         *
         * [in] Syncpoint-backed sync_file FD for which to retrieve info.
         */
        __s32 fence_fd;

        /**
         * @num_fences:
         *
         * [in] Size of `fence_info` array in elements.
         * [out] Number of fences held by the FD.
         */
        __u32 num_fences;

        /**
         * @fence_info:
         *
         * [in] Pointer to an array of 'struct host1x_ctrl_fence_info'
         *   where the info will be stored.
         */
        __u64 fence_info;

        __u32 reserved[1];
};
```
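
Because `num_fences` acts as the array capacity on input and the fence 
count on output, a natural usage pattern is to call the IOCTL twice: 
first to learn the count, then with a suitably sized array. A sketch of 
that pattern, assuming a zero-sized first query is permitted (the ioctl 
number is a placeholder):

```
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#include "host1x_uapi.h"   /* hypothetical header, as above */

/* Retrieve all (id, threshold) pairs backing a syncpoint sync_file. */
static struct host1x_ctrl_fence_info *get_fence_info(int host1x_fd,
                                                     int fence_fd,
                                                     uint32_t *count)
{
        struct host1x_ctrl_get_fence_info args = {
                .fence_fd = fence_fd,
                .num_fences = 0,     /* first pass: only query the count */
        };
        struct host1x_ctrl_fence_info *info;

        if (ioctl(host1x_fd, HOST1X_IOCTL_GET_FENCE_INFO, &args) < 0)
                return NULL;

        info = calloc(args.num_fences, sizeof(*info));
        if (!info)
                return NULL;

        /* Second pass: num_fences now holds the capacity we allocated. */
        args.fence_info = (uintptr_t)info;
        if (ioctl(host1x_fd, HOST1X_IOCTL_GET_FENCE_INFO, &args) < 0) {
                free(info);
                return NULL;
        }

        *count = args.num_fences;
        return info;
}
```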

## Channel interface

### DRM_TEGRA_OPEN_CHANNEL

```
struct drm_tegra_open_channel {
        /**
         * @host1x_class:
         *
         * [in] Host1x class (engine) the channel will target.
         */
        __u32 host1x_class;

        /**
         * @flags:
         *
         * [in] Flags. Currently none are specified.
         */
        __u32 flags;

        /**
         * @channel_id:
         *
         * [out] Process-specific identifier corresponding to the opened
         *   channel. Not the hardware channel ID.
         */
        __u32 channel_id;

        /**
         * @hardware_version:
         *
         * [out] Hardware version of the engine targeted by the channel.
         *   Userspace can use this to select appropriate programming
         *   sequences.
         */
        __u32 hardware_version;

        /**
         * @mode:
         *
         * [out] Mode the hardware is executing in. Some engines can be
         *   configured with different firmware supporting different
         *   functionality depending on the system configuration. This
         *   value allows userspace to detect if the engine is configured
         *   for the intended use case.
         */
        __u32 mode;

        __u32 reserved[3];
};
```
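
For illustration, opening a channel targeting a particular engine class 
could look like the sketch below. `DRM_IOCTL_TEGRA_OPEN_CHANNEL` and 
`HOST1X_CLASS_VIC` are placeholders, as this proposal does not define 
ioctl numbers or class IDs:

```
#include <stdint.h>
#include <sys/ioctl.h>

#include "tegra_drm_uapi.h"   /* hypothetical header carrying the structs above */

/* Open a channel targeting the VIC class and return its channel_id. */
static int open_vic_channel(int drm_fd, uint32_t *channel_id)
{
        struct drm_tegra_open_channel args = {
                .host1x_class = HOST1X_CLASS_VIC,   /* placeholder class ID */
        };

        if (ioctl(drm_fd, DRM_IOCTL_TEGRA_OPEN_CHANNEL, &args) < 0)
                return -1;

        *channel_id = args.channel_id;

        /*
         * args.hardware_version and args.mode can be inspected here to
         * pick programming sequences appropriate for this chip and
         * firmware configuration.
         */
        return 0;
}
```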

### DRM_TEGRA_CLOSE_CHANNEL

```
struct drm_tegra_close_channel {
        /**
         * @channel_id:
         *
         * [in] ID of the channel to close.
         */
        __u32 channel_id;

        __u32 reserved[3];
};
```

### DRM_TEGRA_CHANNEL_MAP

Make memory accessible by the engine while executing work on the channel.

```
#define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)

struct drm_tegra_channel_map {
        /*
         * [in] ID of the channel to map the memory to.
         */
        __u32 channel_id;

        /*
         * [in] GEM handle of the memory to map.
         */
        __u32 handle;

        /*
         * [in] Offset within the GEM object of the memory area to map.
         *
         * Must be aligned to 4 KiB.
         */
        __u64 offset;

        /*
         * [in] Length of the memory area to map, in bytes.
         *
         * Must be aligned to 4 KiB.
         */
        __u64 length;

        /*
         * [out] IOVA of the mapped memory. Userspace can use this IOVA
         *   directly to refer to the memory, skipping relocations.
         *   Only available if hardware memory isolation is enabled.
         *
         *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
         */
        __u64 iova;

        /*
         * [out] ID corresponding to the mapped memory, to be used for
         *   relocations or unmapping.
         */
        __u32 mapping_id;

        /*
         * [in] Flags.
         */
        __u32 flags;

        __u32 reserved[6];
};
```
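
A sketch of mapping part of a GEM object for engine access; the ioctl 
request macro is again a placeholder:

```
#include <stdint.h>
#include <sys/ioctl.h>

#include "tegra_drm_uapi.h"   /* hypothetical header, as above */

/*
 * Map 'length' bytes of a GEM object, starting at 'offset', for jobs on
 * 'channel_id'. Returns a mapping_id for relocations/unmapping and, if
 * hardware memory isolation is enabled, a directly usable IOVA.
 */
static int map_for_channel(int drm_fd, uint32_t channel_id, uint32_t handle,
                           uint64_t offset, uint64_t length,
                           uint32_t *mapping_id, uint64_t *iova)
{
        struct drm_tegra_channel_map args = {
                .channel_id = channel_id,
                .handle = handle,
                .offset = offset,     /* must be 4 KiB aligned */
                .length = length,     /* must be 4 KiB aligned */
                .flags = DRM_TEGRA_CHANNEL_MAP_READWRITE,
        };

        if (ioctl(drm_fd, DRM_IOCTL_TEGRA_CHANNEL_MAP, &args) < 0)
                return -1;

        *mapping_id = args.mapping_id;
        *iova = args.iova;   /* 0xffff_ffff_ffff_ffff if unavailable */
        return 0;
}
```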

### DRM_TEGRA_CHANNEL_UNMAP

Unmap previously mapped memory. Userspace shall do this only after it 
has determined the channel will no longer access the memory.

```
struct drm_tegra_channel_unmap {
         /*
          * [in] ID of the mapping to remove.
          */
         __u32 mapping_id;

         __u32 reserved[3];
};
```

### DRM_TEGRA_CHANNEL_SUBMIT

Submit a job to the engine/class targeted by the channel.

```
struct drm_tegra_submit_syncpt_incr {
         /*
          * [in] Syncpoint FD of the syncpoint that the job will
          *   increment.
          */
         __s32 syncpt_fd;

         /*
          * [in] Number of increments that the job will do.
          */
         __u32 num_incrs;

         /*
          * [out] Value the syncpoint will have once all increments have
          *   executed.
          */
         __u32 fence_value;

         __u32 reserved[1];
};

/* Sets paddr/IOVA bit 39 on T194 to enable MC swizzling */
#define DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR   (1<<0)

struct drm_tegra_submit_relocation {
         /* [in] Index of GATHER or GATHER_UPTR command in commands. */
         __u32 gather_command_index;

         /*
          * [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
          *   the address of which will be patched in.
          */
         __u32 mapping_id;

         /*
          * [in] Offset in the gather that will be patched.
          */
         __u64 gather_offset;

         /*
          * [in] Offset in target buffer whose paddr/IOVA will be written
          *   to the gather.
          */
         __u64 target_offset;

         /*
          * [in] Number of bits the resulting address will be logically
          *   shifted right before writing to gather.
          */
         __u32 shift;

         __u32 reserved[1];
};

/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
/* Command is a wait for syncpt fence completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
/* Command is a wait for SYNC_FILE FD completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
/* Command is a wait for DRM syncobj completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4

/*
 * Allow the driver to skip command execution if the engine
 * was not accessed by another channel between submissions.
 */
#define DRM_TEGRA_SUBMIT_CONTEXT_SETUP              (1<<0)

struct drm_tegra_submit_command {
        /* [in] Command type, one of DRM_TEGRA_SUBMIT_COMMAND_*. */
        __u16 type;

        /* [in] Flags for this command. */
        __u16 flags;

        union {
                struct {
                        /* GEM handle */
                        __u32 handle;

                        /*
                         * Offset into the GEM object in bytes.
                         * Must be aligned by 4.
                         */
                        __u64 offset;

                        /*
                         * Length of the gather in bytes.
                         * Must be aligned by 4.
                         */
                        __u64 length;
                } gather;

                struct {
                        __u32 reserved[1];

                        /*
                         * Pointer to the gather data.
                         * Must be aligned by 4 bytes.
                         */
                        __u64 base;

                        /*
                         * Length of the gather in bytes.
                         * Must be aligned by 4.
                         */
                        __u64 length;
                } gather_uptr;

                struct {
                        __u32 syncpt_id;
                        __u32 threshold;

                        __u32 reserved[1];
                } wait_syncpt;

                struct {
                        __s32 fd;
                } wait_sync_file;

                struct {
                        __u32 handle;
                } wait_syncobj;
        };
};


#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE      (1<<0)
#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ        (1<<1)

struct drm_tegra_channel_submit {
        /**
         * [in] ID of the channel to submit the job to.
         */
        __u32 channel_id;

        /**
         * [in] Flags, e.g. requesting post-fence creation
         *   (DRM_TEGRA_SUBMIT_CREATE_*).
         */
        __u32 flags;

        /**
         * [in] Timeout in microseconds after which the kernel may
         *   consider the job to have hung and may reap it and
         *   fast-forward its syncpoint increments.
         *
         *   The value may be capped by the kernel.
         */
        __u32 timeout;

        /**
         * [in] Number of elements in the `syncpt_incrs', `relocations'
         *   and `commands' arrays, respectively.
         */
        __u32 num_syncpt_incrs;
        __u32 num_relocations;
        __u32 num_commands;

        /**
         * [in] Userspace pointers to arrays of struct
         *   drm_tegra_submit_syncpt_incr, struct drm_tegra_submit_relocation
         *   and struct drm_tegra_submit_command, respectively.
         */
        __u64 syncpt_incrs;
        __u64 relocations;
        __u64 commands;

        /**
         * [out] SYNC_FILE FD if DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE was
         *   passed, syncobj handle if DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ
         *   was passed, or invalid if neither was passed. Passing both
         *   flags is an error.
         *
         * The created fence object is signaled when all syncpoint
         * increments specified in `syncpt_incrs' have executed.
         */
        __u32 post_fence;

        __u32 reserved[3];
};
```
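
To tie the pieces together, here is a rough sketch of submitting a job 
consisting of a single user-pointer gather that performs one syncpoint 
increment, requesting a sync_file post-fence. The ioctl request macro is 
a placeholder and error handling is minimal:

```
#include <stdint.h>
#include <sys/ioctl.h>

#include "tegra_drm_uapi.h"   /* hypothetical header, as above */

/*
 * Submit 'num_words' of opcodes from 'gather_words' on 'channel_id'.
 * The gather is expected to increment the syncpoint behind 'syncpoint_fd'
 * exactly once. Returns a sync_file FD for the post-fence, or -1 on error.
 */
static int submit_job(int drm_fd, uint32_t channel_id, int syncpoint_fd,
                      const uint32_t *gather_words, uint32_t num_words,
                      uint32_t *fence_value)
{
        struct drm_tegra_submit_syncpt_incr incr = {
                .syncpt_fd = syncpoint_fd,
                .num_incrs = 1,
        };
        struct drm_tegra_submit_command command = {
                .type = DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR,
                .gather_uptr = {
                        .base = (uintptr_t)gather_words,
                        .length = num_words * sizeof(uint32_t),
                },
        };
        struct drm_tegra_channel_submit submit = {
                .channel_id = channel_id,
                .flags = DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE,
                .timeout = 100000,   /* 100 ms; may be capped by the kernel */
                .num_syncpt_incrs = 1,
                .num_commands = 1,
                .syncpt_incrs = (uintptr_t)&incr,
                .commands = (uintptr_t)&command,
        };

        if (ioctl(drm_fd, DRM_IOCTL_TEGRA_CHANNEL_SUBMIT, &submit) < 0)
                return -1;

        /* Value the syncpoint reaches once the job has fully executed. */
        *fence_value = incr.fence_value;

        /* With CREATE_POST_SYNC_FILE, post_fence is a sync_file FD. */
        return (int)submit.post_fence;
}
```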

## Example workflows

### Image processing with TegraDRM/VIC

This example is a simple single-step operation using VIC through 
TegraDRM. Assume we have a dma-buf FD containing an image that we want 
to convert from YUV to RGB; this can occur, for example, as part of 
video decoding. A sketch of the generic buffer-allocation steps follows 
the step list below.

Syncpoint initialization

1. Allocate syncpoint (HOST1X_ALLOCATE_SYNCPOINT)
    1. This is used to track VIC submission completion.
2. Retrieve syncpoint ID (HOST1X_SYNCPOINT_INFO)
    1. The ID is required to program the increment as part of the 
submission.

Buffer allocation

3. Allocate memory for configuration buffers (DMA Heaps)
4. Import configuration buffer dma-buf as GEM object
5. Import input image dma-buf as GEM object

Channel initialization

6. Open VIC channel (DRM_TEGRA_OPEN_CHANNEL)
7. Map buffers for access by VIC (DRM_TEGRA_CHANNEL_MAP)
8. Create Host1x opcode buffer as userspace memory
    1. If buffer mapping returned an IOVA, that IOVA can be placed 
directly into the buffer. Otherwise, a relocation has to be passed as 
part of the submission
    2. The buffer should contain a syncpoint increment for the syncpoint 
allocated earlier.
9. Submit work, passing in the syncpoint file descriptor allocated in 
the beginning. The submit optionally returns a syncfd/syncobj that can 
be used to wait for submission completion.
    1. If more fine-grained syncpoint waiting is required, the 
'fence_value' out-parameter of 'drm_tegra_submit_syncpt_incr' can be 
used in conjunction with HOST1X_CREATE_FENCE to create specific fences.
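
Steps 3-5 above use generic kernel interfaces rather than anything 
Tegra-specific. A sketch of allocating a configuration buffer from a DMA 
heap and importing dma-bufs as GEM handles, using the standard 
/dev/dma_heap UAPI and the libdrm PRIME helper (the heap name and buffer 
size are illustrative):

```
#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <linux/dma-heap.h>   /* standard DMA heap UAPI */
#include <xf86drm.h>          /* drmPrimeFDToHandle() from libdrm */

/* Allocate 'len' bytes from the system DMA heap; returns a dma-buf FD. */
static int alloc_from_dma_heap(uint64_t len)
{
        struct dma_heap_allocation_data data = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd, err;

        heap_fd = open("/dev/dma_heap/system", O_RDWR);
        if (heap_fd < 0)
                return -1;

        err = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
        close(heap_fd);

        return err < 0 ? -1 : data.fd;
}

/* Import a dma-buf (e.g. the input image) as a GEM handle on drm_fd. */
static int import_dmabuf(int drm_fd, int dmabuf_fd, uint32_t *handle)
{
        return drmPrimeFDToHandle(drm_fd, dmabuf_fd, handle);
}
```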

### Camera-GPU-CPU pipeline without TegraDRM

This example shows a pipeline where image input from a camera is 
processed on the GPU, programmed from userspace, and then finally 
analyzed on the CPU. This kind of pipeline can occur, for example, as 
part of a computer vision application.

Syncpoint initialization

1. Camera V4L2 driver allocates a syncpoint internally within the kernel.
2. For CPU job tracking, allocate a syncpoint as in "Image processing 
with TegraDRM/VIC".
3. For GPU job tracking, the GPU kernel driver would allocate a 
syncpoint and assign it such that the GPU channel can access it.

Camera pipeline step

4. Allocate a dma-buf to store the captured image.
5. Trigger camera capture and store the resulting sync_file fd.

GPU pipeline step

6. Use HOST1X_GET_FENCE_INFO to extract the syncpoint ID/threshold 
pair(s) from the camera step's post-fence sync_file FD. If the sync_file 
FD is not backed by syncpoints, instead wait for it to signal by other 
means (e.g. by polling it).
7. Use HOST1X_CREATE_FENCE to create a postfence that is signaled when 
the GPU step is complete.
8. Program the GPU to
    1. Wait for the syncpoint thresholds extracted from the camera 
postfence, if we were able to do so.
    2. Execute image processing on GPU.
    3. Increment GPU's job tracking syncpoint, causing the GPU 
post-fence FD to get signaled.

CPU pipeline step

9. Wait for GPU's post-fence sync_file FD
10. Map the dma-buf containing the image and retrieve results.

In place of the GPU pipeline step, a similar workflow would apply for a 
step executed on the CPU.

--

thanks,
Mikko


* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
@ 2020-06-24 20:55 ` Dmitry Osipenko
  2020-06-25  9:17   ` Mikko Perttunen
  2020-06-24 22:33 ` Dmitry Osipenko
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-24 20:55 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
...
> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
> Not sure if they are still needed.

Hello Mikko!

A quick comment for starters. Switching away from the Tegra-specific
GEM IOCTLs to dma-buf heaps is such a radical change! But I like it!

Previously I was curious about whether we could have multiple CMA
regions (one shared/reusable and another private, for example) for the
Tegra DRM driver, and at a quick glance the dma-buf heaps could be a
nice solution that avoids re-inventing driver-specific things for that.

I'm instantly foreseeing these types of heaps:

1. Large reusable CMA heap.
2. A smaller non-reusable common CMA heap that could be used when
allocation from the reusable CMA fails. Or vice versa, when we want to
reduce the chance of blocking for a long time on allocation, for example.
3. Sparse (system) memory heap.

This is the first time I'm looking at the dma-buf heaps, and utilizing
them sounds like a very nice idea!

The https://lkml.org/lkml/2019/11/18/787 says that the system-contiguous
and carveout heaps were not added because they were not needed, but they
will be needed for the Tegra20 drivers and for the case where the IOMMU
is disabled. It shouldn't be difficult to add these types of heaps.

I'll continue to examine the dma-buf heaps in more detail.

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
  2020-06-24 20:55 ` Dmitry Osipenko
@ 2020-06-24 22:33 ` Dmitry Osipenko
  2020-06-25  9:27   ` Mikko Perttunen
  2020-06-24 23:11 ` [RFC] Host1x/TegraDRM UAPI Dmitry Osipenko
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-24 22:33 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> struct drm_tegra_submit_relocation {
>         /* [in] Index of GATHER or GATHER_UPTR command in commands. */
>         __u32 gather_command_index;
> 
>         /*
>          * [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
>          *   the address of which will be patched in.
>          */
>         __u32 mapping_id;
> 
>         /*
>          * [in] Offset in the gather that will be patched.
>          */
>         __u64 gather_offset;
> 
>         /*
>          * [in] Offset in target buffer whose paddr/IOVA will be written
>          *   to the gather.
>          */
>         __u64 target_offset;
> 
>         /*
>          * [in] Number of bits the resulting address will be logically
>          *   shifted right before writing to gather.
>          */
>         __u32 shift;
> 
>         __u32 reserved[1];
> };

We will also need a read/write direction flag here for the
DMA-reservation setup; it will be used for the implicit BO fencing by
the job's scheduler.

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
  2020-06-24 20:55 ` Dmitry Osipenko
  2020-06-24 22:33 ` Dmitry Osipenko
@ 2020-06-24 23:11 ` Dmitry Osipenko
  2020-06-25  9:16   ` Mikko Perttunen
  2020-06-24 23:18 ` Dmitry Osipenko
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-24 23:11 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> /* Command is an opcode gather from a GEM handle */
> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
> /* Command is an opcode gather from a user pointer */
> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1

I'm a bit dubious about whether we really need to retain the non-UPTR
variant. The memory-copying overhead is negligible because the
cmdstream's data is usually hot in the CPU's cache.

IIRC, most (if not all) of the modern DRM drivers use a userptr-only
variant for the cmdstream.

At least there is no real-world userspace example today that could
benefit from the non-UPTR variant.

I'm suggesting that we leave out the non-UPTR gather variant for now,
keeping it in mind as a potential future extension of the submission
UAPI. Any objections?

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (2 preceding siblings ...)
  2020-06-24 23:11 ` [RFC] Host1x/TegraDRM UAPI Dmitry Osipenko
@ 2020-06-24 23:18 ` Dmitry Osipenko
  2020-06-25  0:59   ` Dmitry Osipenko
  2020-06-24 23:23 ` Dmitry Osipenko
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-24 23:18 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> struct drm_tegra_channel_submit {
>         __u32 channel_id;
>         __u32 flags;
> 
>         /**
>          * [in] Timeout in microseconds after which the kernel may
>          *   consider the job to have hung and may reap it and
>          *   fast-forward its syncpoint increments.
>          *
>          *   The value may be capped by the kernel.
>          */
>         __u32 timeout;
> 
>         __u32 num_syncpt_incrs;
>         __u32 num_relocations;
>         __u32 num_commands;
> 
>         __u64 syncpt_incrs;
>         __u64 relocations;
>         __u64 commands;

Do we really need to retain the multi-gather support? The code bloat
(in both the userspace and kernel drivers) is very significant just for
preparing and patching the multi-buffer cmdstreams.

If userspace runs out of free space within the pushbuffer, then it
should simply reallocate a larger pushbuffer.

I'm suggesting that we should have a single gather per job, any objections?

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (3 preceding siblings ...)
  2020-06-24 23:18 ` Dmitry Osipenko
@ 2020-06-24 23:23 ` Dmitry Osipenko
  2020-06-25  9:19   ` Mikko Perttunen
  2020-06-25  0:47 ` Dmitry Osipenko
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-24 23:23 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> 
> struct drm_tegra_channel_submit {
>         __u32 channel_id;
>         __u32 flags;
> 
>         /**
>          * [in] Timeout in microseconds after which the kernel may
>          *   consider the job to have hung and may reap it and
>          *   fast-forward its syncpoint increments.
>          *
>          *   The value may be capped by the kernel.
>          */
>         __u32 timeout;

What about renaming this to timeout_us? For clarity.

>         __u32 num_syncpt_incrs;
>         __u32 num_relocations;
>         __u32 num_commands;
> 
>         __u64 syncpt_incrs;
>         __u64 relocations;
>         __u64 commands;

Let's also add a "_ptr" suffix to all userptr names, again for clarity.

It's always nice to have meaningful names :)

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (4 preceding siblings ...)
  2020-06-24 23:23 ` Dmitry Osipenko
@ 2020-06-25  0:47 ` Dmitry Osipenko
  2020-06-25  9:23   ` Mikko Perttunen
  2020-06-25 22:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map) Dmitry Osipenko
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-25  0:47 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> #define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE      (1<<0)
> #define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ        (1<<1)

The sync object shouldn't be created by the kernel driver and we
shouldn't need the sync-file FD.

Please consider this example of how syncobj may be used:

  1. Syncobj is created by userspace.

  2. Syncobj's handle (post_fence) is passed with the job to the kernel
driver.

  3. Userspace waits on syncobj for the job's submission.

  4. Kernel driver attaches job-completion dma-fence(s) to the syncobj
after starting to execute the job.

  5. Userspace waits on syncobj for the job's completion.

Syncobj is a superset of the sync-file fence:

  a) Syncobj doesn't have a backing file descriptor when it's created.
For example, why would you need an FD if you're not going to share the
fence with other processes? It's possible to get an FD out of a syncobj
later on, please see [1][2].

  b) Syncobj is designed to be used with GPU jobs. For example,
userspace passes a job to the GPU driver's scheduler and then waits
until the job has started to execute on hardware; this is already
supported by syncobj.

  c) It is possible to attach sync-file's fence to a syncobj [1][2][3]
and now:

  - Userspace may attach sync-file's fence to a syncobj.

  - Userspace may use this syncobj for the job's pre-fence.

  - Kernel driver will take the pre-fence out of the syncobj during
the job's submission and reset the syncobj's fence to NULL.

  - Job's scheduler will wait on this pre-fence before starting to
execute job.

  - Once the job has started to execute, the job's scheduler will attach
the job's completion-fence to the syncobj. Now the syncobj has a post-fence!

  d) It is possible to get sync-file FD out of syncobj [1][2][4].

[1]
https://elixir.bootlin.com/linux/v5.7.2/source/include/uapi/drm/drm.h#L730
[2]
https://elixir.bootlin.com/linux/v5.7.2/source/include/uapi/drm/drm.h#L933
[3]
https://elixir.bootlin.com/linux/v5.7.2/source/drivers/gpu/drm/drm_syncobj.c#L653
[4]
https://elixir.bootlin.com/linux/v5.7.2/source/drivers/gpu/drm/drm_syncobj.c#L674

===

So, sync object can carry both pre-fence and post-fence, and they could
be sync-file FDs!

The corollary is that we can get away by using a single syncobj handle
for the job's submission IOCTL.
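
A minimal sketch of this flow, using the existing libdrm syncobj
helpers; the submission step in the middle is of course still
hypothetical:

#include <stdint.h>
#include <xf86drm.h>   /* libdrm syncobj helpers */

/* Sketch: a single syncobj carries both the pre- and post-fence of a job. */
int submit_with_syncobj(int drm_fd, int prefence_sync_file_fd)
{
        uint32_t syncobj;
        int post_fence_fd;

        /* 1. Userspace creates the syncobj. */
        drmSyncobjCreate(drm_fd, 0, &syncobj);

        /* Optionally attach an existing sync_file as the job's pre-fence. */
        if (prefence_sync_file_fd >= 0)
                drmSyncobjImportSyncFile(drm_fd, syncobj, prefence_sync_file_fd);

        /*
         * 2. Pass 'syncobj' as the post_fence handle in the (hypothetical)
         * DRM_TEGRA_CHANNEL_SUBMIT call here; the kernel driver takes the
         * pre-fence out of it and later attaches the job's completion fence.
         */

        /*
         * 3-5. Wait for the completion fence, tolerating that it may not
         * have been attached yet...
         */
        drmSyncobjWait(drm_fd, &syncobj, 1, INT64_MAX,
                       DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT, NULL);

        /* ...or export it as a sync_file FD for sharing with other processes. */
        drmSyncobjExportSyncFile(drm_fd, syncobj, &post_fence_fd);

        return post_fence_fd;
}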

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-24 23:18 ` Dmitry Osipenko
@ 2020-06-25  0:59   ` Dmitry Osipenko
  0 siblings, 0 replies; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-25  0:59 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

25.06.2020 02:18, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
>> struct drm_tegra_channel_submit {
>>         __u32 channel_id;
>>         __u32 flags;
>>
>>         /**
>>          * [in] Timeout in microseconds after which the kernel may
>>          *   consider the job to have hung and may reap it and
>>          *   fast-forward its syncpoint increments.
>>          *
>>          *   The value may be capped by the kernel.
>>          */
>>         __u32 timeout;
>>
>>         __u32 num_syncpt_incrs;
>>         __u32 num_relocations;
>>         __u32 num_commands;
>>
>>         __u64 syncpt_incrs;
>>         __u64 relocations;
>>         __u64 commands;
> 
> Do we really need to retain the multi-gather support? The code-bloat
> (both userspace and kernel driver) is very significant just for
> preparing and patching of the multi-buffer cmdstreams.
> 
> If userspace runs out of a free space within the pushbuffer, then it
> should simply reallocate a larger pushbuffer.
> 
> I'm suggesting that we should have a single gather per-job, any objections?
> 

Oh, I just recalled that later Host1x versions do not allow switching
classes from a gather! Let me think a bit more about it...

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-24 23:11 ` [RFC] Host1x/TegraDRM UAPI Dmitry Osipenko
@ 2020-06-25  9:16   ` Mikko Perttunen
  2020-06-25 23:24     ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-25  9:16 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/25/20 2:11 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
>> /* Command is an opcode gather from a GEM handle */
>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>> /* Command is an opcode gather from a user pointer */
>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
> 
> I'm a bit dubious about whether we really need to retain the non-UPTR
> variant. The memory-copying overhead is negligible because cmdstream's
> data usually is hot in CPU's cache
> 
> IIRC, the most (if not all) of the modern DRM drivers drivers use the
> usrptr-only for the cmdstream.
> 
> At least there is no any real-world userspace example today that could
> benefit from a non-UPTR variant.
> 
> I'm suggesting to leave out the non-UPTR gather variant for now, keeping
> it in mind as a potential future extension of the submission UAPI. Any
> objections?
> 

Sure, we should be able to drop it. Downstream userspace is using it, 
but we should be able to fix that. I was thinking that we can directly 
map the user pages and point the gather to them without copying - that 
way we wouldn't need to make DMA allocations inside the driver for every 
submit. (On early Tegras we could just copy into the pushbuffer but that 
won't work for newer ones).

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-24 20:55 ` Dmitry Osipenko
@ 2020-06-25  9:17   ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-25  9:17 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/24/20 11:55 PM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
> ...
>> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
>> Not sure if they are still needed.
> 
> Hello Mikko!
> 
> A quick comment for the starter. Switching away from the Tegra-specific
> GEM IOCTLs to the dma-buf heaps is a such radical change! But I like it!
> 
> Previously I was curious about whether we could have multiple CMA
> regions (one shared/reusable and other private, for example) for the
> Tegra DRM driver and at a quick glance the dma-buf heaps could be a nice
> solution that avoids re-inventing a driver-specific things for that.
> 
> I'm instantly foreseeing these types of heaps:
> 
> 1. Large reusable CMA heap.
> 2. Smaller non-reusable common CMA that could be used when allocation
> from a reusable CMA fails. Or vice versa, when we want to reduce the
> chance to block for a long time on allocation, for example.
> 3. Sparse (system) memory heap.
> 
> It's the first time I'm looking at the dma-buf heaps and it sounds like
> a very nice idea to utilize them!
> 
> The https://lkml.org/lkml/2019/11/18/787 says that the system-contiguous
> and carveout heaps we not added because they were not needed, but they
> will be needed for the Tegra20 drivers and for the case where IOMMU is
> disabled. It shouldn't be difficult to add these types of heaps.
> 
> I'll continue to examine the dma-buf heaps in a more details.
> 

Sure, let's go with this for now. We can anyway add GEM IOCTLs later if 
they are needed, without any compatibility issues.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-24 23:23 ` Dmitry Osipenko
@ 2020-06-25  9:19   ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-25  9:19 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/25/20 2:23 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
>>
>> struct drm_tegra_channel_submit {
>>          __u32 channel_id;
>>          __u32 flags;
>>
>>          /**
>>           * [in] Timeout in microseconds after which the kernel may
>>           *   consider the job to have hung and may reap it and
>>           *   fast-forward its syncpoint increments.
>>           *
>>           *   The value may be capped by the kernel.
>>           */
>>          __u32 timeout;
> 
> What about to rename this to timeout_us? For clarity.
> 
>>          __u32 num_syncpt_incrs;
>>          __u32 num_relocations;
>>          __u32 num_commands;
>>
>>          __u64 syncpt_incrs;
>>          __u64 relocations;
>>          __u64 commands;
> 
> Let's also add "_ptr" postfix to all usrptr names, again for clarity.
> 
> It's always nice to have meaningful names :)
> 

Yep, good point. I'll fix this for next revision :)

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-25  0:47 ` Dmitry Osipenko
@ 2020-06-25  9:23   ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-25  9:23 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/25/20 3:47 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
>> #define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE      (1<<0)
>> #define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ        (1<<1)
> 
> The sync object shouldn't be created by the kernel driver and we
> shouldn't need the sync-file FD.
> 
> Please consider this example of how syncobj may be used:
> 
>    1. Syncobj is created by userspace.
> 
>    2. Syncobj's handle (post_fence) is passed with the job to the kernel
> driver.
> 
>    3. Userspace waits on syncobj for the job's submission.
> 
>    4. Kernel driver attaches job-completion dma-fence(s) to the syncobj
> after starting to execute the job.
> 
>    5. Userspace waits on syncobj for the job's completion.
> 
> Syncobj is a superset of the sync-file fence:
> 
>    a) Syncobj doesn't have a backing file descriptor when it's created.
> For example, why would you need an FD if you're not going to share fence
> with other processes. It's possible to get FD out of syncobj later on,
> please see [1][2].
> 
>    b) Syncobj is designed to be used with a GPU jobs. For example,
> userspace passes job to the GPU driver's scheduler and then waits until
> job has been started to execute on hardware, this is already supported
> by syncobj.
> 
>    c) It is possible to attach sync-file's fence to a syncobj [1][2][3]
> and now:
> 
>    - Userspace may attach sync-file's fence to a syncobj.
> 
>    - Userspace may use this syncobj for the job's pre-fence.
> 
>    - Kernel driver will take out the pre-fence from the syncobj during of
> the job's submission and reset the syncobj's fence to NULL.
> 
>    - Job's scheduler will wait on this pre-fence before starting to
> execute job.
> 
>    - Once the job is started to execute, the job's scheduler will attach
> the job's completion-fence to the syncobj. Now syncobj has a post-fence!
> 
>    d) It is possible to get sync-file FD out of syncobj [1][2][4].
> 
> [1]
> https://elixir.bootlin.com/linux/v5.7.2/source/include/uapi/drm/drm.h#L730
> [2]
> https://elixir.bootlin.com/linux/v5.7.2/source/include/uapi/drm/drm.h#L933
> [3]
> https://elixir.bootlin.com/linux/v5.7.2/source/drivers/gpu/drm/drm_syncobj.c#L653
> [4]
> https://elixir.bootlin.com/linux/v5.7.2/source/drivers/gpu/drm/drm_syncobj.c#L674
> 
> ===
> 
> So, sync object can carry both pre-fence and post-fence, and they could
> be sync-file FDs!
> 
> The corollary is that we can get away by using a single syncobj handle
> for the job's submission IOCTL.
> 

Ah, clearly I hadn't understood syncobjs properly :) Last time I spent 
significant time with this they didn't exist yet.. I'll check this out.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-24 22:33 ` Dmitry Osipenko
@ 2020-06-25  9:27   ` Mikko Perttunen
  2020-06-25 22:50     ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_relocation) Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-25  9:27 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/25/20 1:33 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
>> struct drm_tegra_submit_relocation {
>>          /* [in] Index of GATHER or GATHER_UPTR command in commands. */
>>          __u32 gather_command_index;
>>
>>          /*
>>           * [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
>>           *   the address of which will be patched in.
>>           */
>>          __u32 mapping_id;
>>
>>          /*
>>           * [in] Offset in the gather that will be patched.
>>           */
>>          __u64 gather_offset;
>>
>>          /*
>>           * [in] Offset in target buffer whose paddr/IOVA will be written
>>           *   to the gather.
>>           */
>>          __u64 target_offset;
>>
>>          /*
>>           * [in] Number of bits the resulting address will be logically
>>           *   shifted right before writing to gather.
>>           */
>>          __u32 shift;
>>
>>          __u32 reserved[1];
>> };
> 
> We will also need read/write direction flag here for the
> DMA-reservations set up, it will be used for the implicit BO fencing by
> the job's scheduler.
> 

Ideally on newer chips with context isolation, we no longer know what 
DMA-BUFs are being used by the job at submit time - they would just be 
pointers after being MAPped.

Do you know how other GPUs deal with DMA reservations - I expect 
separate MAP and SUBMIT steps would be pretty common? Do you have to 
pass the DMA-BUF to both steps (i.e. do IOMMU mapping during MAP, and 
manage reservations at SUBMIT)?

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (5 preceding siblings ...)
  2020-06-25  0:47 ` Dmitry Osipenko
@ 2020-06-25 22:47 ` Dmitry Osipenko
  2020-06-26  7:34   ` Thierry Reding
  2020-06-26 11:06 ` [RFC] Host1x/TegraDRM UAPI Karol Herbst
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-25 22:47 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> ### DRM_TEGRA_CHANNEL_MAP
> 
> Make memory accessible by the engine while executing work on the channel.
> 
> ```
> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
> 
> struct drm_tegra_channel_map {
>         /*
>          * [in] ID of the channel for which to map memory to.
>          */
>         __u32 channel_id;
>         /*
>          * [in] GEM handle of the memory to map.
>          */
>         __u32 handle;
> 
>         /*
>          * [in] Offset in GEM handle of the memory area to map.
>          *
>          * Must be aligned by 4K.
>          */
>         __u64 offset;

Could you please give a use-case example for this partial mapping?

I vaguely recall that maybe it was me who suggested this in the past...

I kinda see that this could be useful for a case where userspace
allocates a large chunk of memory and then performs sub-allocations in
the userspace driver. But do we have a real-world example for this right
now?

Please see more below.

>         /*
>          * [in] Length of memory area to map in bytes.
>          *
>          * Must be aligned by 4K.
>          */
>         __u64 length;
> 
>         /*
>          * [out] IOVA of mapped memory. Userspace can use this IOVA
>          *   directly to refer to the memory to skip using relocations.
>          *   Only available if hardware memory isolation is enabled.
>          *
>          *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>          */
>         __u64 iova;
> 
>         /*
>          * [out] ID corresponding to the mapped memory to be used for
>          *   relocations or unmapping.
>          */
>         __u32 mapping_id;
>         /*
>          * [in] Flags.
>          */
>         __u32 flags;
> 
>         __u32 reserved[6];
> };

It looks to me like we actually need a slightly different thing here.

This MAP IOCTL maps a portion of a GEM and then returns the mapping_id.
And I think we need something more flexible that will allow us to use
GEM handles for the relocation IDs, which should fit nicely with the
DMA-reservations.

What about an IOCTL that wraps GEM into another GEM? We could wrap a
portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP IOCTL.

It could be something like this:

### DRM_TEGRA_BO_WRAP

struct drm_tegra_wrap_bo {
	__u32 bo_handle_wrapped; // out
	__u32 bo_handle;         // in
	__u64 offset;
	__u64 length;
};

### DRM_TEGRA_CHANNEL_MAP

struct drm_tegra_channel_map {
        __u32 channels_mask;
	__u32 mapping_id;
        __u32 bo_handle;
        __u32 flags;
        __u64 iova;
};

===

This allows multiple mapping_ids to have the same backing GEM, so the
mapping_id could be resolved into a BO during the job's submission for
the DMA-reservation handling.

Also:

  1. Maybe the WRAP IOCTL could be a generic GEM IOCTL?

  2. I guess we could start easy without the WRAP IOCTL and
     add it later on once there will be a real-world need.

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_relocation)
  2020-06-25  9:27   ` Mikko Perttunen
@ 2020-06-25 22:50     ` Dmitry Osipenko
  2020-06-26  9:01       ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-25 22:50 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

25.06.2020 12:27, Mikko Perttunen wrote:
> On 6/25/20 1:33 AM, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>> struct drm_tegra_submit_relocation {
>>>          /* [in] Index of GATHER or GATHER_UPTR command in commands. */
>>>          __u32 gather_command_index;
>>>
>>>          /*
>>>           * [in] Mapping handle (obtained through CHANNEL_MAP) of the
>>> memory
>>>           *   the address of which will be patched in.
>>>           */
>>>          __u32 mapping_id;
>>>
>>>          /*
>>>           * [in] Offset in the gather that will be patched.
>>>           */
>>>          __u64 gather_offset;
>>>
>>>          /*
>>>           * [in] Offset in target buffer whose paddr/IOVA will be
>>> written
>>>           *   to the gather.
>>>           */
>>>          __u64 target_offset;
>>>
>>>          /*
>>>           * [in] Number of bits the resulting address will be logically
>>>           *   shifted right before writing to gather.
>>>           */
>>>          __u32 shift;
>>>
>>>          __u32 reserved[1];
>>> };
>>
>> We will also need read/write direction flag here for the
>> DMA-reservations set up, it will be used for the implicit BO fencing by
>> the job's scheduler.
>>
> 
> Ideally on newer chips with context isolation, we no longer know what
> DMA-BUFs are being used by the job at submit time - they would just be
> pointers after being MAPped.

The DMA-BUFs themselves shouldn't be needed, but GEMs should.

> Do you know how other GPUs deal with DMA reservations - I expect
> separate MAP and SUBMIT steps would be pretty common?

I can't instantly recall what other drivers do; it could be worthwhile
to take a closer look.

> Do you have to
> pass the DMA-BUF to both steps (i.e. do IOMMU mapping during MAP, and
> manage reservations at SUBMIT)?

Yes, this sounds good to me if DMA-BUF part is replaced with a GEM.

Please see my other reply regarding the MAP IOCTL where I'm suggesting
to back mapping IDs with a GEM.

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-25  9:16   ` Mikko Perttunen
@ 2020-06-25 23:24     ` Dmitry Osipenko
  2020-06-26  9:05       ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-25 23:24 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

25.06.2020 12:16, Mikko Perttunen wrote:
> On 6/25/20 2:11 AM, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>> /* Command is an opcode gather from a GEM handle */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>>> /* Command is an opcode gather from a user pointer */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
>>
>> I'm a bit dubious about whether we really need to retain the non-UPTR
>> variant. The memory-copying overhead is negligible because cmdstream's
>> data usually is hot in CPU's cache
>>
>> IIRC, the most (if not all) of the modern DRM drivers drivers use the
>> usrptr-only for the cmdstream.
>>
>> At least there is no any real-world userspace example today that could
>> benefit from a non-UPTR variant.
>>
>> I'm suggesting to leave out the non-UPTR gather variant for now, keeping
>> it in mind as a potential future extension of the submission UAPI. Any
>> objections?
>>
> 
> Sure, we should be able to drop it. Downstream userspace is using it,
> but we should be able to fix that. I was thinking that we can directly
> map the user pages and point the gather to them without copying - that
> way we wouldn't need to make DMA allocations inside the driver for every
> submit.

We will need to create a Host1x DMA pool and then the dynamic
allocations will be cheap. This is an implementation detail that we can
discuss separately.

We will need the UPTR variant anyway for the older Tegras because we
need to validate the cmdstreams, and it's much more efficient to copy
from a UPTR than from uncacheable memory.

The non-UPTR variant will be fine to add if you have a real-world
example that demonstrates a noticeable performance difference.

Previously, I thought that there would be some perf difference if the
GR3D shaders were moved into the "insert-opcode" gather, but it was
negligible once I implemented it, and it should be even more negligible
on modern hardware.

> (On early Tegras we could just copy into the pushbuffer but that
> won't work for newer ones).

Yes, we should copy data into a gather and then push it into channel's
pushbuffer. Just like it works with the current upstream driver.

Once all of the UAPI is settled, we'll also need to discuss the
pushbuffer implementation, because the current driver has some problems
with it.

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-25 22:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map) Dmitry Osipenko
@ 2020-06-26  7:34   ` Thierry Reding
  2020-06-26 16:35     ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Thierry Reding @ 2020-06-26  7:34 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Mikko Perttunen, David Airlie, gustavo, dri-devel, Jon Hunter,
	talho, bhuntsman, linux-tegra



On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen wrote:
> > ### DRM_TEGRA_CHANNEL_MAP
> > 
> > Make memory accessible by the engine while executing work on the channel.
> > 
> > ```
> > #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
> > 
> > struct drm_tegra_channel_map {
> >         /*
> >          * [in] ID of the channel for which to map memory to.
> >          */
> >         __u32 channel_id;
> >         /*
> >          * [in] GEM handle of the memory to map.
> >          */
> >         __u32 handle;
> > 
> >         /*
> >          * [in] Offset in GEM handle of the memory area to map.
> >          *
> >          * Must be aligned by 4K.
> >          */
> >         __u64 offset;
> 
> Could you please give a use-case example for this partial mapping?
> 
> I vaguely recalling that maybe it was me who suggested this in the past..
> 
> I kinda see that this could be useful for a case where userspace
> allocates a large chunk of memory and then performs sub-allocations in
> the userspace driver. But do we have a real-world example for this right
> now?

I think the main point about this IOCTL was to make mapping/unmapping
more efficient and avoid relocations for situations where we know it is
safe to do so.

The fact that this can be used to create partial mappings is mostly just
an added bonus, in my opinion. Doing this doesn't create much complexity
but in turn gives us a lot more flexibility.

A couple of places where I think this could be useful are OpenGL and
Vulkan, both of which support buffer suballocation. This has a couple of
advantages on modern GPUs where sometimes you want to use very large
allocation granularity, etc.

Now, I don't think that we'll see much of that in Tegra DRM directly,
although grate could certainly make use of this, I suspect. However, I
think for interoperation of dGPU and Tegra DRM (with VIC for post-
processing, or hopefully some of the other hardware acceleration
engines at some point), this might come in handy.

There are other possible use-cases within just Tegra DRM as well. We may
want to only partially map planar buffers for video post-processing, for
example. Or map each plane separately.

> Please see more below.
> 
> >         /*
> >          * [in] Length of memory area to map in bytes.
> >          *
> >          * Must be aligned by 4K.
> >          */
> >         __u64 length;
> > 
> >         /*
> >          * [out] IOVA of mapped memory. Userspace can use this IOVA
> >          *   directly to refer to the memory to skip using relocations.
> >          *   Only available if hardware memory isolation is enabled.
> >          *
> >          *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
> >          */
> >         __u64 iova;
> > 
> >         /*
> >          * [out] ID corresponding to the mapped memory to be used for
> >          *   relocations or unmapping.
> >          */
> >         __u32 mapping_id;
> >         /*
> >          * [in] Flags.
> >          */
> >         __u32 flags;
> > 
> >         __u32 reserved[6];
> > };
> 
> It looks to me that we actually need a bit different thing here.
> 
> This MAP IOCTL maps a portion of a GEM and then returns the mapping_id.
> And I think we need something more flexible that will allow us to use
> GEM handles for the relocation IDs, which should fit nicely with the
> DMA-reservations.
> 
> What about an IOCTL that wraps GEM into another GEM? We could wrap a
> portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP IOCTL.
> 
> It could be something like this:
> 
> ### DRM_TEGRA_BO_WRAP
> 
> struct drm_tegra_wrap_bo {
> 	__u32 bo_handle_wrapped; // out
> 	__u32 bo_handle;         // in
> 	__u64 offset;
> 	__u64 length;
> };
> 
> ### DRM_TEGRA_CHANNEL_MAP
> 
> struct drm_tegra_channel_map {
>         __u32 channels_mask;
> 	__u32 mapping_id;
>         __u32 bo_handle;
>         __u32 flags;
>         __u64 iova;
> };
> 
> ===
> 
> This allows multiple mapping_ids to have the same backing GEM, so the
> mapping_id could be resolved into a BO during the job's submission for
> the DMA-reservation handling.

That's pretty much what we have already above, isn't it? Just because we
call the field "mapping_id" doesn't mean that in the background we can't
create a GEM object and return its handle as "mapping_id".

One advantage of Mikko's proposal is that we have a single IOCTL rather
than two to create the mapping, making it a bit more lightweight.

Thierry


* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_relocation)
  2020-06-25 22:50     ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_relocation) Dmitry Osipenko
@ 2020-06-26  9:01       ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-26  9:01 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/26/20 1:50 AM, Dmitry Osipenko wrote:
> 25.06.2020 12:27, Mikko Perttunen wrote:
>> On 6/25/20 1:33 AM, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>> struct drm_tegra_submit_relocation {
>>>>           /* [in] Index of GATHER or GATHER_UPTR command in commands. */
>>>>           __u32 gather_command_index;
>>>>
>>>>           /*
>>>>            * [in] Mapping handle (obtained through CHANNEL_MAP) of the
>>>> memory
>>>>            *   the address of which will be patched in.
>>>>            */
>>>>           __u32 mapping_id;
>>>>
>>>>           /*
>>>>            * [in] Offset in the gather that will be patched.
>>>>            */
>>>>           __u64 gather_offset;
>>>>
>>>>           /*
>>>>            * [in] Offset in target buffer whose paddr/IOVA will be
>>>> written
>>>>            *   to the gather.
>>>>            */
>>>>           __u64 target_offset;
>>>>
>>>>           /*
>>>>            * [in] Number of bits the resulting address will be logically
>>>>            *   shifted right before writing to gather.
>>>>            */
>>>>           __u32 shift;
>>>>
>>>>           __u32 reserved[1];
>>>> };
>>>
>>> We will also need read/write direction flag here for the
>>> DMA-reservations set up, it will be used for the implicit BO fencing by
>>> the job's scheduler.
>>>
>>
>> Ideally on newer chips with context isolation, we no longer know what
>> DMA-BUFs are being used by the job at submit time - they would just be
>> pointers after being MAPped.
> 
> The DMA-BUFs themselves shouldn't be needed, but GEMs should.

Yes, I meant to say GEM instead of DMA-BUF.

> 
>> Do you know how other GPUs deal with DMA reservations - I expect
>> separate MAP and SUBMIT steps would be pretty common?
> 
> I can't instantly recall what other drivers do; it could be worthwhile to
> take a closer look.

I'll see if I can find some similar situations in other drivers.

Mikko

> 
>> Do you have to
>> pass the DMA-BUF to both steps (i.e. do IOMMU mapping during MAP, and
>> manage reservations at SUBMIT)?
> 
> Yes, this sounds good to me if DMA-BUF part is replaced with a GEM.
> 
> Please see my other reply regarding the MAP IOCTL where I'm suggesting
> to back mapping IDs with a GEM.
>

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-25 23:24     ` Dmitry Osipenko
@ 2020-06-26  9:05       ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-26  9:05 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/26/20 2:24 AM, Dmitry Osipenko wrote:
> 25.06.2020 12:16, Mikko Perttunen wrote:
>> On 6/25/20 2:11 AM, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>> /* Command is an opcode gather from a GEM handle */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>>>> /* Command is an opcode gather from a user pointer */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
>>>
>>> I'm a bit dubious about whether we really need to retain the non-UPTR
>>> variant. The memory-copying overhead is negligible because the cmdstream's
>>> data is usually hot in the CPU's cache.
>>>
>>> IIRC, most (if not all) of the modern DRM drivers use a usrptr-only
>>> variant for the cmdstream.
>>>
>>> At least there is no real-world userspace example today that could
>>> benefit from the non-UPTR variant.
>>>
>>> I'm suggesting that we leave out the non-UPTR gather variant for now,
>>> keeping it in mind as a potential future extension of the submission UAPI.
>>> Any objections?
>>>
>>
>> Sure, we should be able to drop it. Downstream userspace is using it,
>> but we should be able to fix that. I was thinking that we can directly
>> map the user pages and point the gather to them without copying - that
>> way we wouldn't need to make DMA allocations inside the driver for every
>> submit.
> 
> We will need to create a Host1x DMA pool and then the dynamic
> allocations will be cheap. This is an implementation detail that we can
> discuss separately.
> 
> We will need the UPTR variant anyway for the older Tegras because we need to
> validate the cmdstreams, and it's much more efficient to copy from a UPTR
> than from uncacheable memory.
> 
> The non-UPTR variant will be fine to add once there is a real-world
> example that demonstrates a noticeable performance difference.
> 
> Previously, I thought that there would be some perf difference if the GR3D
> shaders were moved into the "insert-opcode" gather, but it was negligible
> once I implemented it, and it should be even more negligible on modern
> hardware.
> 
>> (On early Tegras we could just copy into the pushbuffer but that
>> won't work for newer ones).
> 
> Yes, we should copy the data into a gather and then push it into the
> channel's pushbuffer, just like it works with the current upstream driver.
> 
> Once the UAPI is settled, we'll also need to discuss the pushbuffer's
> implementation because the current driver has some problems with it.
> 

True, for earlier Tegras we'll need to copy anyway. So let's just 
implement copying for now, while making sure that extending to directly 
mapping pages will be possible later (don't know why it wouldn't be), 
and implement direct mapping or GEM gathers later if needed.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (6 preceding siblings ...)
  2020-06-25 22:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map) Dmitry Osipenko
@ 2020-06-26 11:06 ` Karol Herbst
  2020-06-26 11:13   ` Mikko Perttunen
  2020-06-26 11:40   ` Thierry Reding
  2020-06-27 21:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_syncpt_incr) Dmitry Osipenko
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 50+ messages in thread
From: Karol Herbst @ 2020-06-26 11:06 UTC (permalink / raw)
  To: Mikko Perttunen
  Cc: David Airlie, gustavo, dri-devel, Jon Hunter, talho, bhuntsman,
	Thierry Reding, linux-tegra, Dmitry Osipenko

On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
>
> # Host1x/TegraDRM UAPI proposal
>
> This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
> the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
> many ways.
>
> I haven't written any implementation yet -- I'll do that once there is
> some agreement on the high-level design.
>
> Current open items:
>
> * The syncpoint UAPI allows userspace to create sync_file FDs with
> arbitrary syncpoint fences. dma_fence code currently seems to assume all
> fences will be signaled, which would not necessarily be the case with
> this interface.
> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
> Not sure if they are still needed.
>

Hi, as this wasn't addressed here (and sorry if I missed it): is there
an open source userspace making use of this UAPI? Because this is
something which needs to be seen before it can be included at all.

> ## Introduction to the hardware
>
> Tegra Host1x is a hardware block providing the following capabilities:
>
> * Syncpoints, a unified whole-system synchronization primitive, allowing
> synchronization of work between graphics, compute and multimedia
> engines, CPUs including cross-VM synchronization, and devices on the
> PCIe bus, without incurring CPU overhead.
> * Channels, a command DMA mechanism allowing asynchronous programming of
> various engines, integrating with syncpoints.
> * Hardware virtualization support for syncpoints and channels. (On
> Tegra186 and newer)
>
> This proposal defines APIs for userspace access to syncpoints and
> channels. Kernel drivers can additionally use syncpoints and channels
> internally, providing other userspace interfaces (e.g. V4L2).
>
> Syncpoint and channel interfaces are split into separate parts, as
> syncpoints are useful as a system synchronization primitive even without
> using the engine drivers provided through TegraDRM. For example, a
> computer vision pipeline consisting of video capture, CPU processing and
> GPU processing would not necessarily use the engines provided by
> TegraDRM. See the Example workflows section for more details.
>
> ## Syncpoint interface
>
> Syncpoints are a set of 32-bit values providing the following operations:
>
> * Atomically increment value by one
> * Read current value
> * Wait until value reaches specified threshold. For waiting, the 32-bit
> value space is treated modulo 2^32; e.g. if the current value is
> 0xffffffff, then value 0x0 is considered to be one increment in the future.
>
> Each syncpoint is identified by a system-global ID ranging between [0,
> number of syncpoints supported by hardware). The entire system has
> read-only access to all syncpoints based on their ID.
>
> Syncpoints are managed through the device node /dev/host1x provided by
> the gpu/host1x driver.
>
> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>
> Allocates a free syncpoint, returning a file descriptor representing it.
> Only the owner of the file descriptor is allowed to mutate the value of
> the syncpoint.
>
> ```
> struct host1x_ctrl_allocate_syncpoint {
>         /**
>          * @fd:
>          *
>          * [out] New file descriptor representing the allocated syncpoint.
>          */
>         __s32 fd;
>
>         __u32 reserved[3];
> };
> ```
>
> ### IOCTL HOST1X_SYNCPOINT_INFO (on syncpoint file descriptor)
>
> Allows retrieval of system-global syncpoint ID corresponding to syncpoint.
>
> Use cases:
>
> * Passing ID to other system components that identify syncpoints by ID
> * Debugging and testing
>
> ```
> struct host1x_syncpoint_info {
>         /**
>          * @id:
>          *
>          * [out] System-global ID of the syncpoint.
>          */
>         __u32 id;
>
>         __u32 reserved[3];
> };
> ```
>
> ### IOCTL HOST1X_SYNCPOINT_INCREMENT (on syncpoint file descriptor)
>
> Allows incrementing of the syncpoint value.
>
> Use cases:
>
> * Signalling work completion when executing a pipeline step on the CPU
> * Debugging and testing
>
> ```
> struct host1x_syncpoint_increment {
>         /**
>          * @count:
>          *
>          * [in] Number of times to increment syncpoint. Value can be
>          *   observed in in-between values, but increments are atomic.
>          */
>         __u32 count;
> };
> ```
>
> ### IOCTL HOST1X_READ_SYNCPOINT (on /dev/host1x)
>
> Read the value of a syncpoint based on its ID.
>
> Use cases:
>
> * Allows more fine-grained tracking of task progression for debugging
> purposes
>
> ```
> struct host1x_ctrl_read_syncpoint {
>         /**
>          * @id:
>          *
>          * [in] ID of syncpoint to read.
>          */
>         __u32 id;
>
>         /**
>          * @value:
>          *
>          * [out] Value of the syncpoint.
>          */
>         __u32 value;
> };
> ```
>
> ### IOCTL HOST1X_CREATE_FENCE (on /dev/host1x)
>
> Creates a new SYNC_FILE fence file descriptor for the specified
> syncpoint ID and threshold.
>
> Use cases:
>
> * Creating a fence when receiving ID/threshold pair from another system
> component
> * Creating a postfence when executing a pipeline step on the CPU
> * Creating a postfence when executing a pipeline step controlled by
> userspace (e.g. GPU userspace submission)
>
> ```
> struct host1x_ctrl_create_fence {
>         /**
>          * @id:
>          *
>          * [in] ID of the syncpoint for which to create a fence.
>          */
>         __u32 id;
>
>         /**
>          * @threshold:
>          *
>          * [in] Threshold value for fence.
>          */
>         __u32 threshold;
>
>         /**
>          * @fence_fd:
>          *
>          * [out] New sync_file FD corresponding to the ID and threshold.
>          */
>         __s32 fence_fd;
>
>         __u32 reserved[1];
> };
> ```
>
> ### IOCTL HOST1X_GET_FENCE_INFO (on /dev/host1x)
>
> Allows retrieval of the ID/threshold pairs corresponding to a SYNC_FILE
> fence or fence array.
>
> Use cases:
>
> * Debugging and testing
> * Transmitting a fence to another system component requiring ID/threshold
> * Getting ID/threshold for prefence when programming a pipeline step
> controlled by userspace (e.g. GPU userspace submission)
>
> ```
> /* If set, corresponding fence is backed by Host1x syncpoints. */
> #define HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE      (1 << 0)
>
> struct host1x_ctrl_fence_info {
>         /**
>          * @flags:
>          *
>          * [out] HOST1X_CTRL_FENCE_INFO flags.
>          */
>         __u32 flags;
>
>         /**
>          * @id:
>          *
>          * [out] ID of the syncpoint corresponding to this fence.
>          * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
>          */
>         __u32 id;
>
>         /**
>          * @threshold:
>          *
>          * [out] Signalling threshold of the fence.
>          * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
>          */
>         __u32 threshold;
>
>         __u32 reserved[1];
> };
>
> struct host1x_ctrl_get_fence_info {
>         /**
>          * @fence_fd:
>          *
>          * [in] Syncpoint-backed sync_file FD for which to retrieve info.
>          */
>         __s32 fence_fd;
>
>         /**
>          * @num_fences:
>          *
>          * [in] Size of `fence_info` array in elements.
>          * [out] Number of fences held by the FD.
>          */
>         __u32 num_fences;
>
>         /**
>          * @fence_info:
>          *
>          * [in] Pointer to array of 'struct host1x_ctrl_fence_info'
> where info will be stored.
>          */
>         __u64 fence_info;
>
>         __u32 reserved[1];
> };
> ```
>
> ## Channel interface
>
> ### DRM_TEGRA_OPEN_CHANNEL
>
> ```
> struct drm_tegra_open_channel {
>          /**
>            * @class:
>            *
>            * [in] Host1x class (engine) the channel will target.
>            */
>          __u32 host1x_class;
>
>          /**
>            * @flags:
>            *
>            * [in] Flags. Currently none are specified.
>            */
>          __u32 flags;
>
>          /**
>            * @channel_id:
>            *
>            * [out] Process-specific identifier corresponding to opened
>            *   channel. Not the hardware channel ID.
>            */
>          __u32 channel_id;
>
>          /**
>           * @hardware_version:
>           *
>           * [out] Hardware version of the engine targeted by the channel.
>           *   Userspace can use this to select appropriate programming
>           *   sequences.
>           */
>          __u32 hardware_version;
>
>          /**
>           * @mode:
>           *
>           * [out] Mode the hardware is executing in. Some engines can be
>           *   configured with different firmware supporting different
>           *   functionality depending on the system configuration. This
>           *   value allows userspace to detect if the engine is configured
>           *   for the intended use case.
>           */
>          __u32 mode;
>
>          __u32 reserved[3];
> };
> ```
>
> ### DRM_TEGRA_CLOSE_CHANNEL
>
> ```
> struct drm_tegra_close_channel {
>          /**
>            * @channel_id:
>            *
>            * [in] ID of channel to close.
>            */
>          __u32 channel_id;
>
>          __u32 reserved[3];
> };
> ```
>
> ### DRM_TEGRA_CHANNEL_MAP
>
> Make memory accessible by the engine while executing work on the channel.
>
> ```
> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
>
> struct drm_tegra_channel_map {
>          /*
>           * [in] ID of the channel for which to map memory to.
>           */
>          __u32 channel_id;
>
>          /*
>           * [in] GEM handle of the memory to map.
>           */
>          __u32 handle;
>
>          /*
>           * [in] Offset in GEM handle of the memory area to map.
>           *
>           * Must be aligned by 4K.
>           */
>          __u64 offset;
>
>          /*
>           * [in] Length of memory area to map in bytes.
>           *
>           * Must be aligned by 4K.
>           */
>          __u64 length;
>
>          /*
>           * [out] IOVA of mapped memory. Userspace can use this IOVA
>           *   directly to refer to the memory to skip using relocations.
>           *   Only available if hardware memory isolation is enabled.
>           *
>           *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>           */
>          __u64 iova;
>
>          /*
>           * [out] ID corresponding to the mapped memory to be used for
>           *   relocations or unmapping.
>           */
>          __u32 mapping_id;
>
>          /*
>           * [in] Flags.
>           */
>          __u32 flags;
>
>          __u32 reserved[6];
> };
> ```
>
> ### DRM_TEGRA_CHANNEL_UNMAP
>
> Unmap previously mapped memory. Userspace shall do this only after it
> has determined the channel will no longer access the memory.
>
> ```
> struct drm_tegra_channel_unmap {
>          /*
>           * [in] ID of the mapping to remove.
>           */
>          __u32 mapping_id;
>
>          __u32 reserved[3];
> };
> ```
>
> ### DRM_TEGRA_CHANNEL_SUBMIT
>
> Submit a job to the engine/class targeted by the channel.
>
> ```
> struct drm_tegra_submit_syncpt_incr {
>          /*
>           * [in] Syncpoint FD of the syncpoint that the job will
>           *   increment.
>           */
>          __s32 syncpt_fd;
>
>          /*
>           * [in] Number of increments that the job will do.
>           */
>          __u32 num_incrs;
>
>          /*
>           * [out] Value the syncpoint will have once all increments have
>           *   executed.
>           */
>          __u32 fence_value;
>
>          __u32 reserved[1];
> };
>
> /* Sets paddr/IOVA bit 39 on T194 to enable MC swizzling */
> #define DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR   (1<<0)
>
> struct drm_tegra_submit_relocation {
>          /* [in] Index of GATHER or GATHER_UPTR command in commands. */
>          __u32 gather_command_index;
>
>          /*
>           * [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
>           *   the address of which will be patched in.
>           */
>          __u32 mapping_id;
>
>          /*
>           * [in] Offset in the gather that will be patched.
>           */
>          __u64 gather_offset;
>
>          /*
>           * [in] Offset in target buffer whose paddr/IOVA will be written
>           *   to the gather.
>           */
>          __u64 target_offset;
>
>          /*
>           * [in] Number of bits the resulting address will be logically
>           *   shifted right before writing to gather.
>           */
>          __u32 shift;
>
>          __u32 reserved[1];
> };
>
> /* Command is an opcode gather from a GEM handle */
> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
> /* Command is an opcode gather from a user pointer */
> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
> /* Command is a wait for syncpt fence completion */
> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
> /* Command is a wait for SYNC_FILE FD completion */
> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
> /* Command is a wait for DRM syncobj completion */
> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4
>
> /*
>   * Allow driver to skip command execution if engine
>   * was not accessed by another channel between
>   * submissions.
>   */
> #define DRM_TEGRA_SUBMIT_CONTEXT_SETUP                        (1<<0)
>
> struct drm_tegra_submit_command {
>          __u16 type;
>          __u16 flags;
>
>          union {
>                  struct {
>                      /* GEM handle */
>                      __u32 handle;
>
>                      /*
>                       * Offset into GEM object in bytes.
>                       * Must be aligned by 4.
>                       */
>                      __u64 offset;
>
>                      /*
>                       * Length of gather in bytes.
>                       * Must be aligned by 4.
>                       */
>                      __u64 length;
>                  } gather;
>
>                  struct {
>                          __u32 reserved[1];
>
>                          /*
>                           * Pointer to gather data.
>                           * Must be aligned by 4 bytes.
>                           */
>                          __u64 base;
>
>                          /*
>                           * Length of gather in bytes.
>                           * Must be aligned by 4.
>                           */
>                          __u64 length;
>                  } gather_uptr;
>
>                  struct {
>                      __u32 syncpt_id;
>                      __u32 threshold;
>
>                      __u32 reserved[1];
>                  } wait_syncpt;
>
>                  struct {
>                          __s32 fd;
>                  } wait_sync_file;
>
>                  struct {
>                          __u32 handle;
>                  } wait_syncobj;
>          };
> };
>
>
> #define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE      (1<<0)
> #define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ        (1<<1)
>
> struct drm_tegra_channel_submit {
>          __u32 channel_id;
>          __u32 flags;
>
>          /**
>           * [in] Timeout in microseconds after which the kernel may
>           *   consider the job to have hung and may reap it and
>           *   fast-forward its syncpoint increments.
>           *
>           *   The value may be capped by the kernel.
>           */
>          __u32 timeout;
>
>          __u32 num_syncpt_incrs;
>          __u32 num_relocations;
>          __u32 num_commands;
>
>          __u64 syncpt_incrs;
>          __u64 relocations;
>          __u64 commands;
>
>          /**
>           * [out] Invalid, SYNC_FILE FD or syncobj handle, depending on
>           *   if DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE,
>           *   DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ, or neither are passed.
>           *   Passing both is an error.
>           *
>           * The created fence object is signaled when all syncpoint
>           * increments specified in `syncpt_incrs' have executed.
>           */
>          __u32 post_fence;
>
>          __u32 reserved[3];
> };
> ```
>
> ## Example workflows
>
> ### Image processing with TegraDRM/VIC
>
> This example is a simple single-step operation using VIC through
> TegraDRM. For example, assume we have a dma-buf fd with an image we want
> to convert from YUV to RGB. This can occur for example as part of video
> decoding.
>
> Syncpoint initialization
>
> 1. Allocate syncpoint (HOST1X_ALLOCATE_SYNCPOINT)
>     1. This is used to track VIC submission completion.
> 2. Retrieve syncpoint ID (HOST1X_SYNCPOINT_INFO)
>     1. The ID is required to program the increment as part of the
> submission.
>
> Buffer allocation
>
> 3. Allocate memory for configuration buffers (DMA Heaps)
> 4. Import configuration buffer dma-buf as GEM object
> 5. Import input image dma-buf as GEM object
>
> Channel initialization
>
> 6. Open VIC channel (DRM_TEGRA_OPEN_CHANNEL)
> 7. Map buffers for access by VIC (DRM_TEGRA_CHANNEL_MAP)
> 8. Create Host1x opcode buffer as userspace memory
>     1. If buffer mapping returned an IOVA, that IOVA can be placed
> directly into the buffer. Otherwise, a relocation has to be passed as
> part of the submission
>     2. The buffer should contain a syncpoint increment for the syncpoint
> allocated earlier.
> 9. Submit work, passing in the syncpoint file descriptor allocated in
> the beginning. The submit optionally returns a syncfd/syncobj that can
> be used to wait for submission completion.
>     1. If more fine-grained syncpoint waiting is required, the 'fence'
> out-parameter of 'drm_tegra_submit_syncpt_incr' can be used in
> conjunction with HOST1X_CREATE_FENCE to create specific fences.
>
> ### Camera-GPU-CPU pipeline without TegraDRM
>
> This example shows a pipeline with image input from a camera being
> processed using the GPU programmed from userspace, and then finally
> analyzed by CPU. This kind of usecase can occur e.g. as part of a
> computer vision usecase.
>
> Syncpoint initialization
>
> 1. Camera V4L2 driver allocates a syncpoint internally within the kernel.
> 2. For CPU job tracking, allocate a syncpoint as in "Image processing
> with TegraDRM/VIC".
> 3. For GPU job tracking, the GPU kernel driver would allocate a
> syncpoint and assign it such that the GPU channel can access it.
>
> Camera pipeline step
>
> 4. Allocate a dma-buf to store the captured image.
> 5. Trigger camera capture and store the resulting sync_file fd.
>
> GPU pipeline step
>
> 6. Use HOST1X_GET_FENCE_INFO to extract syncpoint ID/threshold pair(s)
> from camera step's post-fence sync_file FD. If the sync_file FD is not
> backed by syncpoints, wait for the sync_file FD to signal otherwise
> (e.g. through polling it).
> 7. Use HOST1X_CREATE_FENCE to create a postfence that is signaled when
> the GPU step is complete.
> 8. Program the GPU to
>     1. Wait for the syncpoint thresholds extracted from the camera
> postfence, if we were able to do so.
>     2. Execute image processing on GPU.
>     3. Increment GPU's job tracking syncpoint, causing the GPU
> post-fence FD to get signaled.
>
> CPU pipeline step
>
> 9. Wait for GPU's post-fence sync_file FD
> 10. Map the dma-buf containing the image and retrieve results.
>
> In place of the GPU pipeline step, a similar workflow would apply for a
> step executed on the CPU.
>
> --
>
> thanks,
> Mikko
>

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-26 11:06 ` [RFC] Host1x/TegraDRM UAPI Karol Herbst
@ 2020-06-26 11:13   ` Mikko Perttunen
  2020-06-26 11:16     ` Karol Herbst
  2020-06-26 11:40   ` Thierry Reding
  1 sibling, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-26 11:13 UTC (permalink / raw)
  To: Karol Herbst
  Cc: David Airlie, gustavo, dri-devel, Jon Hunter, talho, bhuntsman,
	Thierry Reding, linux-tegra, Dmitry Osipenko

On 6/26/20 2:06 PM, Karol Herbst wrote:
> On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
>>
>> # Host1x/TegraDRM UAPI proposal
>>
>> This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
>> the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
>> many ways.
>>
>> I haven't written any implementation yet -- I'll do that once there is
>> some agreement on the high-level design.
>>
>> Current open items:
>>
>> * The syncpoint UAPI allows userspace to create sync_file FDs with
>> arbitrary syncpoint fences. dma_fence code currently seems to assume all
>> fences will be signaled, which would not necessarily be the case with
>> this interface.
>> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
>> Not sure if they are still needed.
>>
> 
> Hi, as this wasn't addressed here (and sorry if I missed it): is there
> an open source userspace making use of this UAPI? Because this is
> something which needs to be seen before it can be included at all.
> 

Hi Karol,

not currently, but once we have hashed out the design a little bit (and 
I'm back from vacation), I'll work on open source userspace and 
converting existing code using the staging UAPI to this one.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-26 11:13   ` Mikko Perttunen
@ 2020-06-26 11:16     ` Karol Herbst
  0 siblings, 0 replies; 50+ messages in thread
From: Karol Herbst @ 2020-06-26 11:16 UTC (permalink / raw)
  To: Mikko Perttunen
  Cc: David Airlie, gustavo, dri-devel, Jon Hunter, talho, bhuntsman,
	Thierry Reding, linux-tegra, Dmitry Osipenko

On Fri, Jun 26, 2020 at 1:13 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
>
> On 6/26/20 2:06 PM, Karol Herbst wrote:
> > On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
> >>
> >> # Host1x/TegraDRM UAPI proposal
> >>
> >> This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
> >> the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
> >> many ways.
> >>
> >> I haven't written any implementation yet -- I'll do that once there is
> >> some agreement on the high-level design.
> >>
> >> Current open items:
> >>
> >> * The syncpoint UAPI allows userspace to create sync_file FDs with
> >> arbitrary syncpoint fences. dma_fence code currently seems to assume all
> >> fences will be signaled, which would not necessarily be the case with
> >> this interface.
> >> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
> >> Not sure if they are still needed.
> >>
> >
> > Hi, as this wasn't addressed here (and sorry if I missed it): is there
> > an open source userspace making use of this UAPI? Because this is
> > something which needs to be seen before it can be included at all.
> >
>
> Hi Karol,
>
> not currently, but once we have hashed out the design a little bit (and
> I'm back from vacation), I'll work on open source userspace and
> converting existing code using the staging UAPI to this one.
>
> Mikko
>

okay, cool, sounds good!


* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-26 11:06 ` [RFC] Host1x/TegraDRM UAPI Karol Herbst
  2020-06-26 11:13   ` Mikko Perttunen
@ 2020-06-26 11:40   ` Thierry Reding
  2020-06-26 13:38     ` Daniel Vetter
  1 sibling, 1 reply; 50+ messages in thread
From: Thierry Reding @ 2020-06-26 11:40 UTC (permalink / raw)
  To: Karol Herbst
  Cc: Mikko Perttunen, David Airlie, gustavo, dri-devel, Jon Hunter,
	talho, bhuntsman, linux-tegra, Dmitry Osipenko


On Fri, Jun 26, 2020 at 01:06:58PM +0200, Karol Herbst wrote:
> On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
> >
> > # Host1x/TegraDRM UAPI proposal
> >
> > This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
> > the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
> > many ways.
> >
> > I haven't written any implementation yet -- I'll do that once there is
> > some agreement on the high-level design.
> >
> > Current open items:
> >
> > * The syncpoint UAPI allows userspace to create sync_file FDs with
> > arbitrary syncpoint fences. dma_fence code currently seems to assume all
> > fences will be signaled, which would not necessarily be the case with
> > this interface.
> > * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
> > Not sure if they are still needed.
> >
> 
> Hi, as this wasn't addressed here (and sorry if I missed it): is there
> an open source userspace making use of this UAPI? Because this is
> something which needs to be seen before it can be included at all.

There's a set of commits that implement an earlier draft here:

    https://github.com/thierryreding/linux/tree/for-5.6/drm-tegra-destage-abi

and corresponding changes to libdrm to make use of that and implement
tests using the various generations of VIC here:

    https://cgit.freedesktop.org/~tagr/drm/log/

Before actually jumping to an implementation we wanted to go over the
design a bit more to avoid wasting a lot of work, like we've done in
the past. Turns out it isn't easy to get everyone to agree on a good
ABI. Who would've thought? =)

I expect that once the discussion around the ABI settles Mikko will
start on implementing this ABI in the kernel and we'll update the
libdrm patches to make use of the new ABI as well.

Thierry


* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-26 11:40   ` Thierry Reding
@ 2020-06-26 13:38     ` Daniel Vetter
  2020-06-26 13:59       ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Daniel Vetter @ 2020-06-26 13:38 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Mikko Perttunen, Karol Herbst, David Airlie, gustavo, dri-devel,
	Jon Hunter, talho, bhuntsman, linux-tegra, Dmitry Osipenko

On Fri, Jun 26, 2020 at 01:40:40PM +0200, Thierry Reding wrote:
> On Fri, Jun 26, 2020 at 01:06:58PM +0200, Karol Herbst wrote:
> > On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
> > >
> > > # Host1x/TegraDRM UAPI proposal
> > >
> > > This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
> > > the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
> > > many ways.
> > >
> > > I haven't written any implementation yet -- I'll do that once there is
> > > some agreement on the high-level design.
> > >
> > > Current open items:
> > >
> > > * The syncpoint UAPI allows userspace to create sync_file FDs with
> > > arbitrary syncpoint fences. dma_fence code currently seems to assume all
> > > fences will be signaled, which would not necessarily be the case with
> > > this interface.
> > > * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
> > > Not sure if they are still needed.
> > >
> > 
> > Hi, as this wasn't addressed here (and sorry if I missed it): is there
> > an open source userspace making use of this UAPI? Because this is
> > something which needs to be seen before it can be included at all.
> 
> There's a set of commits that implement an earlier draft here:
> 
>     https://github.com/thierryreding/linux/tree/for-5.6/drm-tegra-destage-abi
> 
> and corresponding changes to libdrm to make use of that and implement
> tests using the various generations of VIC here:
> 
>     https://cgit.freedesktop.org/~tagr/drm/log/
> 
> Before actually jumping to an implementation we wanted to go over the
> design a bit more to avoid wasting a lot of work, like we've done in
> the past. Turns out it isn't easy to get everyone to agree on a good
> ABI. Who would've thought? =)
> 
> I expect that once the discussion around the ABI settles Mikko will
> start on implementing this ABI in the kernel and we'll update the
> libdrm patches to make use of the new ABI as well.

Do we have more than libdrm and tests for this, like something somewhat
functional? Since tegradrm landed years ago we've refined the criteria a
bit in this regard; the latest version is explicit that it needs to be
something that's functional (for end-users in some form), not just
infrastructure and tests.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-26 13:38     ` Daniel Vetter
@ 2020-06-26 13:59       ` Dmitry Osipenko
  2020-06-30  9:09         ` Daniel Vetter
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-26 13:59 UTC (permalink / raw)
  To: Daniel Vetter, Thierry Reding
  Cc: Mikko Perttunen, Karol Herbst, David Airlie, gustavo, dri-devel,
	Jon Hunter, talho, bhuntsman, linux-tegra

26.06.2020 16:38, Daniel Vetter wrote:
> On Fri, Jun 26, 2020 at 01:40:40PM +0200, Thierry Reding wrote:
>> On Fri, Jun 26, 2020 at 01:06:58PM +0200, Karol Herbst wrote:
>>> On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
>>>>
>>>> # Host1x/TegraDRM UAPI proposal
>>>>
>>>> This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
>>>> the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
>>>> many ways.
>>>>
>>>> I haven't written any implementation yet -- I'll do that once there is
>>>> some agreement on the high-level design.
>>>>
>>>> Current open items:
>>>>
>>>> * The syncpoint UAPI allows userspace to create sync_file FDs with
>>>> arbitrary syncpoint fences. dma_fence code currently seems to assume all
>>>> fences will be signaled, which would not necessarily be the case with
>>>> this interface.
>>>> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
>>>> Not sure if they are still needed.
>>>>
>>>
>>> Hi, as this wasn't addressed here (and sorry if I missed it): is there
>>> an open source userspace making use of this UAPI? Because this is
>>> something which needs to be seen before it can be included at all.
>>
>> There's a set of commits that implement an earlier draft here:
>>
>>     https://github.com/thierryreding/linux/tree/for-5.6/drm-tegra-destage-abi
>>
>> and corresponding changes to libdrm to make use of that and implement
>> tests using the various generations of VIC here:
>>
>>     https://cgit.freedesktop.org/~tagr/drm/log/
>>
>> Before actually jumping to an implementation we wanted to go over the
>> design a bit more to avoid wasting a lot of work, like we've done in
>> the past. Turns out it isn't easy to get everyone to agree on a good
>> ABI. Who would've thought? =)
>>
>> I expect that once the discussion around the ABI settles Mikko will
>> start on implementing this ABI in the kernel and we'll update the
>> libdrm patches to make use of the new ABI as well.
> 
> Do we have more than libdrm and tests for this, like something somewhat
> functional? Since tegradrm landed years ago we've refined the criteria a
> bit in this regard; the latest version is explicit that it needs to be
> something that's functional (for end-users in some form), not just
> infrastructure and tests.

We have Opentegra [1] and VDPAU [2] userspace drivers for older Tegra
SoCs that have been using the staging UAPI for years now.

[1] https://github.com/grate-driver/xf86-video-opentegra
[2] https://github.com/grate-driver/libvdpau-tegra

The UAPI and kernel driver updates are badly needed for these drivers
because of the missing UAPI synchronization features and the performance
problems of the kernel driver.

So we already have some real-world userspace for the testing!

It's not the first, or even the second, time we're discussing the Tegra
DRM UAPI changes, but this time it feels like there is a good chance
that it will finally materialize, and I'm very happy about it :)

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-26  7:34   ` Thierry Reding
@ 2020-06-26 16:35     ` Dmitry Osipenko
  2020-06-28 11:16       ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-26 16:35 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Mikko Perttunen, David Airlie, gustavo, dri-devel, Jon Hunter,
	talho, bhuntsman, linux-tegra

26.06.2020 10:34, Thierry Reding wrote:
> On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>> ### DRM_TEGRA_CHANNEL_MAP
>>>
>>> Make memory accessible by the engine while executing work on the channel.
>>>
>>> ```
>>> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
>>>
>>> struct drm_tegra_channel_map {
>>>         /*
>>>          * [in] ID of the channel for which to map memory to.
>>>          */
>>>         __u32 channel_id;
>>>         /*
>>>          * [in] GEM handle of the memory to map.
>>>          */
>>>         __u32 handle;
>>>
>>>         /*
>>>          * [in] Offset in GEM handle of the memory area to map.
>>>          *
>>>          * Must be aligned by 4K.
>>>          */
>>>         __u64 offset;
>>
>> Could you please give a use-case example for this partial mapping?
>>
>> I vaguely recall that maybe it was me who suggested this in the past...
>>
>> I kinda see that this could be useful for a case where userspace
>> allocates a large chunk of memory and then performs sub-allocations in
>> the userspace driver. But do we have a real-world example for this right
>> now?
> 
> I think the main point about this IOCTL was to make mapping/unmapping
> more efficient and avoid relocations for situations where we know it is
> safe to do so.
> 
> The fact that this can be used to create partial mappings is mostly just
> an added bonus, in my opinion. Doing this doesn't create much complexity
> but in turn gives us a lot more flexibility.
> 
> A couple of places where I think this could be useful are OpenGL and
> Vulkan, both of which support buffer suballocation. This has a couple of
> advantages on modern GPUs where sometimes you want to use very large
> allocation granularity, etc.
> 
> Now, I don't think that we'll see much of that in Tegra DRM directly,
> although grate could certainly make use of this, I suspect. However, I
> think for interoperation of dGPU and Tegra DRM (with VIC for post-
> processing, or hopefully some of the other hardware acceleration
> engines at some point), this might come in handy.
> 
> There are other possible use-cases within just Tegra DRM as well. We may
> want to only partially map planar buffers for video post-processing, for
> example. Or map each plane separately.
> 
>> Please see more below.
>>
>>>         /*
>>>          * [in] Length of memory area to map in bytes.
>>>          *
>>>          * Must be aligned by 4K.
>>>          */
>>>         __u64 length;
>>>
>>>         /*
>>>          * [out] IOVA of mapped memory. Userspace can use this IOVA
>>>          *   directly to refer to the memory to skip using relocations.
>>>          *   Only available if hardware memory isolation is enabled.
>>>          *
>>>          *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>>>          */
>>>         __u64 iova;
>>>
>>>         /*
>>>          * [out] ID corresponding to the mapped memory to be used for
>>>          *   relocations or unmapping.
>>>          */
>>>         __u32 mapping_id;
>>>         /*
>>>          * [in] Flags.
>>>          */
>>>         __u32 flags;
>>>
>>>         __u32 reserved[6];
>>> };
>>
>> It looks to me that we actually need a bit different thing here.
>>
>> This MAP IOCTL maps a portion of a GEM and then returns the mapping_id.
>> And I think we need something more flexible that will allow us to use
>> GEM handles for the relocation IDs, which should fit nicely with the
>> DMA-reservations.
>>
>> What about an IOCTL that wraps GEM into another GEM? We could wrap a
>> portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP IOCTL.
>>
>> It could be something like this:
>>
>> ### DRM_TEGRA_BO_WRAP
>>
>> struct drm_tegra_wrap_bo {
>> 	__u32 bo_handle_wrapped; // out
>> 	__u32 bo_handle;         // in
>> 	__u64 offset;
>> 	__u64 length;
>> };
>>
>> ### DRM_TEGRA_CHANNEL_MAP
>>
>> struct drm_tegra_channel_map {
>>         __u32 channels_mask;
>> 	__u32 mapping_id;
>>         __u32 bo_handle;
>>         __u32 flags;
>>         __u64 iova;
>> };
>>
>> ===
>>
>> This allows multiple mapping_ids to have the same backing GEM, so the
>> mapping_id could be resolved into a BO during the job's submission for
>> the DMA-reservation handling.
> 
> That's pretty much what we have already above, isn't it? Just because we
> call the field "mapping_id" doesn't mean that in the background we can't
> create a GEM object and return its handle as "mapping_id".
> 
> One advantage of Mikko's proposal is that we have a single IOCTL rather
> than two to create the mapping, making it a bit more lightweight.

Thinking about it a bit more, I have now changed my mind.

There is no need to perform implicit fencing on each suballocation;
instead, explicit fencing should be used for the suballocations.

So, we will need to add per-relocation flags for the direction and for
explicit (or implicit) fencing. The direction flag will tell how the
fence should be attached to the BO's DMA reservation, while the fencing
flag will tell whether the job scheduler should wait for the BO's
reservation before executing the job on hardware. All of this will be
needed for a proper DRI implementation on older Tegras.

Actually, during my experiments with the UAPI, I added both of these
flags for the relocations [1], but have really only used the direction
flag so far, relying on implicit fencing.

[1]
https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L894
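
For illustration only, such per-relocation flags might look roughly like the following; the names and bit assignments are invented for this sketch and are not taken from the RFC or from [1]:

```
/* Illustrative sketch only: possible per-relocation flags for the access
 * direction and for explicit vs. implicit fencing. Bit 0 is already
 * DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR in the RFC; a new __u32 flags
 * field would have to replace the reserved word in
 * drm_tegra_submit_relocation. The actual flags used in [1] may differ.
 */
#define DRM_TEGRA_SUBMIT_RELOCATION_READ            (1 << 1)
#define DRM_TEGRA_SUBMIT_RELOCATION_WRITE           (1 << 2)
#define DRM_TEGRA_SUBMIT_RELOCATION_EXPLICIT_FENCE  (1 << 3)
```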

So, let's keep the current variant of this MAP IOCTL as-is.

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_syncpt_incr)
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (7 preceding siblings ...)
  2020-06-26 11:06 ` [RFC] Host1x/TegraDRM UAPI Karol Herbst
@ 2020-06-27 21:47 ` Dmitry Osipenko
  2020-06-28 11:10   ` Mikko Perttunen
  2020-06-27 22:32 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command) Dmitry Osipenko
  2020-06-27 23:27 ` [RFC] Host1x/TegraDRM UAPI (sync points) Dmitry Osipenko
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-27 21:47 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> struct drm_tegra_submit_syncpt_incr {
>         /*
>          * [in] Syncpoint FD of the syncpoint that the job will
>          *   increment.
>          */
>         __s32 syncpt_fd;
> 
>         /*
>          * [in] Number of increments that the job will do.
>          */
>         __u32 num_incrs;
> 
>         /*
>          * [out] Value the syncpoint will have once all increments have
>          *   executed.
>          */
>         __u32 fence_value;
> 
>         __u32 reserved[1];
> };

The job should be considered executed once the final sync point is
incremented.

Hence, there should be only one sync point to increment per job; why
would you ever need more than one?

Could you please explain what this submit_syncpt_incr is about?

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command)
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (8 preceding siblings ...)
  2020-06-27 21:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_syncpt_incr) Dmitry Osipenko
@ 2020-06-27 22:32 ` Dmitry Osipenko
  2020-06-28 10:28   ` Mikko Perttunen
  2020-06-27 23:27 ` [RFC] Host1x/TegraDRM UAPI (sync points) Dmitry Osipenko
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-27 22:32 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> /* Command is an opcode gather from a GEM handle */
> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
> /* Command is an opcode gather from a user pointer */
> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
> /* Command is a wait for syncpt fence completion */
> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
> /* Command is a wait for SYNC_FILE FD completion */
> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
> /* Command is a wait for DRM syncobj completion */
> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4
> 
> /*
>  * Allow driver to skip command execution if engine
>  * was not accessed by another channel between
>  * submissions.
>  */
> #define DRM_TEGRA_SUBMIT_CONTEXT_SETUP                        (1<<0)
> 
> struct drm_tegra_submit_command {
>         __u16 type;
>         __u16 flags;

Shouldn't the "packed" attribute be needed if non-32-bit-aligned fields
are used?

>         union {
>                 struct {
>                     /* GEM handle */
>                     __u32 handle;
> 
>                     /*
>                      * Offset into GEM object in bytes.
>                      * Must be aligned by 4.
>                      */
>                     __u64 offset;

64 bits for a gather offset is a bit too much; in most cases gathers are
under 4K.

u32 should be more than enough (maybe even u16 if the offset is given in
dword granularity).

>                     /*
>                      * Length of gather in bytes.
>                      * Must be aligned by 4.
>                      */
>                     __u64 length;

u32/16

>                 } gather;

>                 struct {
>                         __u32 reserved[1];
> 
>                         /*
>                          * Pointer to gather data.
>                          * Must be aligned by 4 bytes.
>                          */
>                         __u64 base;
>                         /*
>                          * Length of gather in bytes.
>                          * Must be aligned by 4.
>                          */
>                         __u64 length;
>                 } gather_uptr;

What about inlining the UPTR gather data and relocs into the
drm_tegra_submit_command[] buffer:

struct drm_tegra_submit_command {
	struct {
		u16 num_words;
		u16 num_relocs;

		gather_data[];
		drm_tegra_submit_relocation relocs[];
	} gather_uptr;
};

struct drm_tegra_channel_submit {
        __u32 num_syncpt_incrs;
        __u32 syncpt_idx;

        __u64 commands_ptr;
	__u32 commands_size;
...
};

struct drm_tegra_submit_command example[] = {
	cmd.gather_uptr{},
	...
	gather_data[],
	gather_relocs[],
	cmd.wait_syncpt{},
	...
};

This way we will have only a single copy_from_user() for the whole
cmdstream, which should be more efficient and nicer from both the
userspace and kernel perspectives with regard to forming and parsing the
commands.
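
As a rough illustration of that single copy, a kernel-side sketch could look like the following; the helper name and the size cap are made up, and only the commands_ptr/commands_size fields come from the struct sketched above:

```
/* Illustrative sketch only: pull the whole inlined command stream into
 * kernel memory with one copy_from_user(), so the parser can walk the
 * drm_tegra_submit_command entries and their inlined gather data and
 * relocations without touching userspace memory again.
 */
static void *tegra_copy_cmdstream(const struct drm_tegra_channel_submit *args)
{
        void *stream;

        if (!args->commands_size || args->commands_size > SZ_64K)
                return ERR_PTR(-EINVAL);        /* arbitrary sanity cap */

        stream = kvmalloc(args->commands_size, GFP_KERNEL);
        if (!stream)
                return ERR_PTR(-ENOMEM);

        if (copy_from_user(stream, u64_to_user_ptr(args->commands_ptr),
                           args->commands_size)) {
                kvfree(stream);
                return ERR_PTR(-EFAULT);
        }

        return stream;
}
```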

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
                   ` (9 preceding siblings ...)
  2020-06-27 22:32 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command) Dmitry Osipenko
@ 2020-06-27 23:27 ` Dmitry Osipenko
  2020-06-28  9:44   ` Mikko Perttunen
  10 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-27 23:27 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

23.06.2020 15:09, Mikko Perttunen wrote:
> 
> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
> 
> Allocates a free syncpoint, returning a file descriptor representing it.
> Only the owner of the file descriptor is allowed to mutate the value of
> the syncpoint.
> 
> ```
> struct host1x_ctrl_allocate_syncpoint {
>        /**
>         * @fd:
>         *
>         * [out] New file descriptor representing the allocated syncpoint.
>         */
>        __s32 fd;
> 
>        __u32 reserved[3];
> };

We need at least these basic things from the sync point API:

- An execution context shouldn't be able to tamper with the sync points of
other contexts.

- A sync point could be shared with other contexts for explicit fencing.

- Sync points should work reliably.

Some problems of the current Host1x driver, like it falling over if a
sync point value is out of sync, plus all the hang-job recovery labor,
could be easily reduced if sync point health were protected by extra UAPI
constraints.

So I think we may want the following:

1. We still need to assign a sync point ID to a DRM channel's
context. This sync point ID will be used for command stream forming,
like it is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.

2. An allocated sync point must have a clean hardware state.

3. Sync points should be properly refcounted. A job's sync point
shouldn't be re-used while the job is alive.

4. The job's sync point can't be re-used after the job's submission (UAPI
constraint!). Userspace must free the sync point and allocate a new one for
the next job submission. And now we:

  - Know that the job's sync point is always in a healthy state!

  - We're not limited by the number of physically available hardware sync
points! Allocation should block until a free sync point is available.

  - The logical number of the job's sync point increments matches the sync
point's hardware state, which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM's
context after job's submission, avoiding a need for an extra SYNCPT_PUT
IOCTL invocation to be done by userspace after the job's submission.
Could be a job's flag.

We could avoid the need for statically allocated sync points altogether
for patched cmdstreams! The sync point could be dynamically allocated by
the kernel driver at submission time, and the cmdstream would then be
patched with this sync point.

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-27 23:27 ` [RFC] Host1x/TegraDRM UAPI (sync points) Dmitry Osipenko
@ 2020-06-28  9:44   ` Mikko Perttunen
  2020-06-29  2:36     ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-28  9:44 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen пишет:
>>
>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>
>> Allocates a free syncpoint, returning a file descriptor representing it.
>> Only the owner of the file descriptor is allowed to mutate the value of
>> the syncpoint.
>>
>> ```
>> struct host1x_ctrl_allocate_syncpoint {
>>         /**
>>          * @fd:
>>          *
>>          * [out] New file descriptor representing the allocated syncpoint.
>>          */
>>         __s32 fd;
>>
>>         __u32 reserved[3];
>> };
> 
> We should need at least these basic things from the sync points API >
> - Execution context shouldn't be able to tamper sync points of the other
> contexts.

This is covered by this UAPI - when submitting, as part of the 
syncpt_incr struct you pass the syncpoint FD. This way the driver can 
check that the syncpoints used are correct, or program HW protection.

> 
> - Sync point could be shared with other contexts for explicit fencing.

Not sure what you specifically mean; you can get the ID out of the 
syncpoint fd and share the ID for read-only access. (Or the FD for 
read-write access)

> 
> - Sync points should work reliably.
> 
> Some problems of the current Host1x driver, like where it falls over if
> sync point value is out-of-sync + all the hang-job recovery labor could
> be easily reduced if sync point health is protected by extra UAPI
> constraints. >
> So I think we may want the following:
> 
> 1. We still should need to assign sync point ID to a DRM-channel's
> context. This sync point ID will be used for a commands stream forming,
> like it is done by the current staging UAPI.
> 
> So we should need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.
> 
> 2. Allocated sync point must have a clean hardware state.

What do you mean by clean hardware state?

> 
> 3. Sync points should be properly refcounted. Job's sync points
> shouldn't be re-used while job is alive.
> 
> 4. The job's sync point can't be re-used after job's submission (UAPI
> constraint!). Userspace must free sync point and allocate a new one for
> the next job submission. And now we:
> 
>    - Know that job's sync point is always in a healthy state!
> 
>    - We're not limited by a number of physically available hardware sync
> points! Allocation should block until free sync point is available.
> 
>    - The logical number of job's sync point increments matches the SP
> hardware state! Which is handy for a job's debugging.
> 
> Optionally, the job's sync point could be auto-removed from the DRM's
> context after job's submission, avoiding a need for an extra SYNCPT_PUT
> IOCTL invocation to be done by userspace after the job's submission.
> Could be a job's flag.

I think this would cause problems where, after a job completes but 
before the fence has been waited on, the syncpoint is already recycled 
(especially if the syncpoint is reset into some clean state).

I would prefer having a syncpoint for each userspace channel context 
(several of which could share a hardware channel if MLOCKing is not used).

In my experience it's then not difficult to pinpoint which job has 
failed, and if each userspace channel context uses a separate syncpoint, 
a hanging job wouldn't mess with other applications' jobs, either.

Mikko

> 
> We could avoid a need for a statically-allocated sync points at all for
> a patched cmdstreams! The sync point could be dynamically allocated at a
> job's submission time by the kernel driver and then cmdstream will be
> patched with this sync point.
> 

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command)
  2020-06-27 22:32 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command) Dmitry Osipenko
@ 2020-06-28 10:28   ` Mikko Perttunen
  2020-06-29  0:00     ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-28 10:28 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/28/20 1:32 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen пишет:
>> /* Command is an opcode gather from a GEM handle */
>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>> /* Command is an opcode gather from a user pointer */
>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
>> /* Command is a wait for syncpt fence completion */
>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
>> /* Command is a wait for SYNC_FILE FD completion */
>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
>> /* Command is a wait for DRM syncobj completion */
>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4
>>
>> /*
>>   * Allow driver to skip command execution if engine
>>   * was not accessed by another channel between
>>   * submissions.
>>   */
>> #define DRM_TEGRA_SUBMIT_CONTEXT_SETUP                        (1<<0)
>>
>> struct drm_tegra_submit_command {
>>          __u16 type;
>>          __u16 flags;
> 
> Shouldn't the "packed" attribute be needed if a non-32bit aligned fields
> are used?

I guess we should change these to u32 for consistency.

> 
>>          union {
>>                  struct {
>>                      /* GEM handle */
>>                      __u32 handle;
>>
>>                      /*
>>                       * Offset into GEM object in bytes.
>>                       * Must be aligned by 4.
>>                       */
>>                      __u64 offset;
> 
> 64bits for a gather offset is a bit too much, in most cases gathers are
> under 4K.
> 
> u32 should be more than enough (maybe even u16 if offset is given in a
> dword granularity).

I guess this can be changed to u32, though I don't think there is any 
particularly pressing reason not to use u64.

In any case, I think we concluded that we'll drop the GEM gather command 
for now.

> 
>>                      /*
>>                       * Length of gather in bytes.
>>                       * Must be aligned by 4.
>>                       */
>>                      __u64 length;
> 
> u32/16
> 
>>                  } gather;
> 
>>                  struct {
>>                          __u32 reserved[1];
>>
>>                          /*
>>                           * Pointer to gather data.
>>                           * Must be aligned by 4 bytes.
>>                           */
>>                          __u64 base;
>>                          /*
>>                           * Length of gather in bytes.
>>                           * Must be aligned by 4.
>>                           */
>>                          __u64 length;
>>                  } gather_uptr;
> 
> What about to inline the UPTR gather data and relocs into the
> drm_tegra_submit_command[] buffer:
> 
> struct drm_tegra_submit_command {
> 	struct {
> 		u16 num_words;
> 		u16 num_relocs;
> 
> 		gather_data[];
> 		drm_tegra_submit_relocation relocs[];
> 	} gather_uptr;
> };
> 
> struct drm_tegra_channel_submit {
>          __u32 num_syncpt_incrs;
>          __u32 syncpt_idx;
> 
>          __u64 commands_ptr;
> 	__u32 commands_size;
> ...
> };
> 
> struct drm_tegra_submit_command example[] = {
> 	cmd.gather_uptr{},
> 	...
> 	gather_data[],
> 	gather_relocs[],
> 	cmd.wait_syncpt{},
> 	...
> };
> 
> This way we will have only a single copy_from_user() for the whole
> cmdstream, which should be more efficient to do and nicer from both
> userspace and kernel perspectives in regards to forming and parsing the
> commands.
> 

I'm not sure I agree that it's nicer with regard to forming and parsing:
- Can you actually place multiple unsized arrays in a struct without 
pointers?
- gather_data's length is a multiple of 4, and since we have u64's in 
drm_tegra_submit_relocation, alignment needs to be taken care of 
manually, both when forming the stream and when the kernel validates it 
while parsing. Depending on the number of words in the gather, padding 
would need to be inserted. We could swap the two around, but it still 
feels more brittle.

Also, if we read the gather from a separate address, userspace doesn't 
necessarily need to know the length of the gather (and number of relocs) 
upfront, so it's easier to have a builder pattern without needing to 
copy things later.

If we allow direct page mappings for gathers later, a separate address 
would make things also a little bit easier. For direct page mappings 
userspace would need to keep the opcode data unchanged while the job is 
being executed, so userspace could keep an arena where the gathers are 
placed, and refer to segments of that, instead of having to keep the 
drm_tegra_submit_commands alive.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_syncpt_incr)
  2020-06-27 21:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_syncpt_incr) Dmitry Osipenko
@ 2020-06-28 11:10   ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-28 11:10 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel

On 6/28/20 12:47 AM, Dmitry Osipenko wrote:
> 23.06.2020 15:09, Mikko Perttunen пишет:
>> struct drm_tegra_submit_syncpt_incr {
>>          /*
>>           * [in] Syncpoint FD of the syncpoint that the job will
>>           *   increment.
>>           */
>>          __s32 syncpt_fd;
>>
>>          /*
>>           * [in] Number of increments that the job will do.
>>           */
>>          __u32 num_incrs;
>>
>>          /*
>>           * [out] Value the syncpoint will have once all increments have
>>           *   executed.
>>           */
>>          __u32 fence_value;
>>
>>          __u32 reserved[1];
>> };
> 
> The job should be considered executed once the final sync point is
> incremented.
> 
> Hence, there should be only one sync point per-job for increment, why
> would you ever need more than one?
> 
> Could you please explain what this submit_syncpt_incr is about?
> 

This tells the kernel which syncpoint will be incremented and how many 
times, both for job tracking and for verifying that the user has access 
to that syncpoint.

A second syncpoint is used for NVENC in slice encoding mode, where the 
engine can be programmed to count encoded slices by incrementing a 
syncpoint. I'll ask around to see if I can find some more details on this.
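
For illustration, a minimal userspace sketch of filling two of these 
increment descriptors for that case; the helper and variable names are 
made up, and how the array is attached to the submit struct is assumed:

```
#include <stdint.h>
#include <string.h>

/* Local mirror of the struct quoted above, for a self-contained example. */
struct drm_tegra_submit_syncpt_incr {
	int32_t  syncpt_fd;
	uint32_t num_incrs;
	uint32_t fence_value;
	uint32_t reserved[1];
};

/*
 * job_syncpt_fd / slice_syncpt_fd would come from two
 * HOST1X_ALLOCATE_SYNCPOINT calls; num_slices is however many slices
 * the NVENC job is programmed to produce.
 */
static void fill_incrs(struct drm_tegra_submit_syncpt_incr incrs[2],
		       int32_t job_syncpt_fd, int32_t slice_syncpt_fd,
		       uint32_t num_slices)
{
	memset(incrs, 0, 2 * sizeof(*incrs));

	incrs[0].syncpt_fd = job_syncpt_fd;    /* job-completion tracking,     */
	incrs[0].num_incrs = 1;                /* e.g. one increment at job end */

	incrs[1].syncpt_fd = slice_syncpt_fd;  /* incremented once per slice   */
	incrs[1].num_incrs = num_slices;
}
```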

Since the use case is somewhat niche, we could see if we can have a 
design with only one syncpoint, but extensible later if needed.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-26 16:35     ` Dmitry Osipenko
@ 2020-06-28 11:16       ` Mikko Perttunen
  2020-06-28 22:59         ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-28 11:16 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding
  Cc: David Airlie, gustavo, dri-devel, Jon Hunter, talho, bhuntsman,
	linux-tegra

On 6/26/20 7:35 PM, Dmitry Osipenko wrote:
> 26.06.2020 10:34, Thierry Reding пишет:
>> On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen пишет:
>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>
>>>> Make memory accessible by the engine while executing work on the channel.
>>>>
>>>> ```
>>>> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
>>>>
>>>> struct drm_tegra_channel_map {
>>>>          /*
>>>>           * [in] ID of the channel for which to map memory to.
>>>>           */
>>>>          __u32 channel_id;
>>>>          /*
>>>>           * [in] GEM handle of the memory to map.
>>>>           */
>>>>          __u32 handle;
>>>>
>>>>          /*
>>>>           * [in] Offset in GEM handle of the memory area to map.
>>>>           *
>>>>           * Must be aligned by 4K.
>>>>           */
>>>>          __u64 offset;
>>>
>>> Could you please give a use-case example for this partial mapping?
>>>
>>> I vaguely recalling that maybe it was me who suggested this in the past..
>>>
>>> I kinda see that this could be useful for a case where userspace
>>> allocates a large chunk of memory and then performs sub-allocations in
>>> the userspace driver. But do we have a real-world example for this right
>>> now?
>>
>> I think the main point about this IOCTL was to make mapping/unmapping
>> more efficient and avoid relocations for situations where we know it is
>> safe to do so.
>>
>> The fact that this can be used to create partial mappings is mostly just
>> an added bonus, in my opinion. Doing this doesn't create much complexity
>> but in turn gives us a lot more flexibility.
>>
>> A couple of places where I think this could be useful are OpenGL and
>> Vulkan, both of which support buffer suballocation. This has a couple of
>> advantages on modern GPUs where sometimes you want to use very large
>> allocation granularity, etc.
>>
>> Now, I don't think that we'll see much of that in Tegra DRM directly,
>> although grate could certainly make use of this, I suspect. However, I
>> think for interoperation of dGPU and Tegra DRM (with VIC for post-
>> processing, or hopefully some of the other hardware acceleration
>> engines at some point), this might come in handy.
>>
>> There are other possible use-cases within just Tegra DRM as well. We may
>> want to only partially map planar buffers for video post-processing, for
>> example. Or map each plane separately.
>>
>>> Please see more below.
>>>
>>>>          /*
>>>>           * [in] Length of memory area to map in bytes.
>>>>           *
>>>>           * Must be aligned by 4K.
>>>>           */
>>>>          __u64 length;
>>>>
>>>>          /*
>>>>           * [out] IOVA of mapped memory. Userspace can use this IOVA
>>>>           *   directly to refer to the memory to skip using relocations.
>>>>           *   Only available if hardware memory isolation is enabled.
>>>>           *
>>>>           *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>>>>           */
>>>>          __u64 iova;
>>>>
>>>>          /*
>>>>           * [out] ID corresponding to the mapped memory to be used for
>>>>           *   relocations or unmapping.
>>>>           */
>>>>          __u32 mapping_id;
>>>>          /*
>>>>           * [in] Flags.
>>>>           */
>>>>          __u32 flags;
>>>>
>>>>          __u32 reserved[6];
>>>> };
>>>
>>> It looks to me that we actually need a bit different thing here.
>>>
>>> This MAP IOCTL maps a portion of a GEM and then returns the mapping_id.
>>> And I think we need something more flexible that will allow us to use
>>> GEM handles for the relocation IDs, which should fit nicely with the
>>> DMA-reservations.
>>>
>>> What about an IOCTL that wraps GEM into another GEM? We could wrap a
>>> portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP IOCTL.
>>>
>>> It could be something like this:
>>>
>>> ### DRM_TEGRA_BO_WRAP
>>>
>>> struct drm_tegra_wrap_bo {
>>> 	__u32 bo_handle_wrapped; // out
>>> 	__u32 bo_handle;         // in
>>> 	__u64 offset;
>>> 	__u64 length;
>>> };
>>>
>>> ### DRM_TEGRA_CHANNEL_MAP
>>>
>>> struct drm_tegra_channel_map {
>>>          __u32 channels_mask;
>>> 	__u32 mapping_id;
>>>          __u32 bo_handle;
>>>          __u32 flags;
>>>          __u64 iova;
>>> };
>>>
>>> ===
>>>
>>> This allows multiple mapping_ids to have the same backing GEM, so the
>>> mapping_id could be resolved into a BO during of job's submission for
>>> the DMA-reservations handling.
>>
>> That's pretty much what we have already above, isn't it? Just because we
>> call the field "mapping_id" doesn't mean that in the background we can't
>> create a GEM object and return it's handle as "mapping_id".
>>
>> One advantage of Mikko's proposal is that we have a single IOCTL rather
>> than two to create the mapping, making it a bit more lightweight.
> 
> Thinking a bit more about it, I now changed my mind.
> 
> There is no need to perform implicit fencing on each suballocation,
> instead explicit fencing should be used for the suballocations.
> 
> So, we will need to add the relocation flags for the direction and
> explicit (or implicit) fencing per-relocation. The direction will tell
> how fence should be attached to the BO's DMA-reservation, while the
> fence-flag will tell whether job's scheduler should wait for the BO's
> reservation before executing job on hardware. This all will be needed
> for a proper DRI implementation on older Tegras.
> 
> Actually, during of my experiments with the UAPI, I added both these
> flags for the relocations [1], but really used only the direction flag
> so far, relying on the implicit fencing.
> 
> [1]
> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L894
> 
> So, let's keep the current variant of this MAP IOCTL as-is.
> 

Let me rephrase to make sure I understand:

For relocations, we should add flags for direction and fencing. This way 
at submit time we can do the proper fencing operations on the relocated 
BO's DMA reservation.

This sounds good to me, and I think it makes the "relocation" concept a 
bit more general than it is currently. I think we could rename it to 
something like "buffer_usage" (open to bikeshedding), and it can have 
fence-related flags and relocation-related flags. For newer Tegra chips 
we don't necessarily need to relocate, but we still may need to handle 
DMA reservations, so in these cases only the fencing flags would be set.
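
Just to make the idea concrete, one possible (purely illustrative, all 
names and fields made up) shape for such a "buffer_usage" entry:

```
/* Illustrative only; names/fields are placeholders, not agreed UAPI. */
#define DRM_TEGRA_BUFFER_USAGE_WRITE       (1 << 0) /* engine writes the BO (else read-only) */
#define DRM_TEGRA_BUFFER_USAGE_WAIT_FENCE  (1 << 1) /* wait on the BO's reservation first    */
#define DRM_TEGRA_BUFFER_USAGE_RELOC       (1 << 2) /* patch an address into the gather      */

struct drm_tegra_submit_buffer_usage {
        __u32 mapping_id;    /* from DRM_TEGRA_CHANNEL_MAP */
        __u32 flags;

        /* Only used when the RELOC flag is set. */
        __u64 gather_offset; /* where in the gather to patch    */
        __u64 target_offset; /* offset within the mapped memory */
};
```

On chips where no patching is needed, only the flag bits would be 
meaningful and the offsets could be ignored.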

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-28 11:16       ` Mikko Perttunen
@ 2020-06-28 22:59         ` Dmitry Osipenko
  2020-06-30 10:55           ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-28 22:59 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding
  Cc: Erik Faye-Lund, David Airlie, gustavo, dri-devel, Jon Hunter,
	talho, bhuntsman, linux-tegra

28.06.2020 14:16, Mikko Perttunen wrote:
> On 6/26/20 7:35 PM, Dmitry Osipenko wrote:
>> 26.06.2020 10:34, Thierry Reding пишет:
>>> On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
>>>> 23.06.2020 15:09, Mikko Perttunen пишет:
>>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>>
>>>>> Make memory accessible by the engine while executing work on the
>>>>> channel.
>>>>>
>>>>> ```
>>>>> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
>>>>>
>>>>> struct drm_tegra_channel_map {
>>>>>          /*
>>>>>           * [in] ID of the channel for which to map memory to.
>>>>>           */
>>>>>          __u32 channel_id;
>>>>>          /*
>>>>>           * [in] GEM handle of the memory to map.
>>>>>           */
>>>>>          __u32 handle;
>>>>>
>>>>>          /*
>>>>>           * [in] Offset in GEM handle of the memory area to map.
>>>>>           *
>>>>>           * Must be aligned by 4K.
>>>>>           */
>>>>>          __u64 offset;
>>>>
>>>> Could you please give a use-case example for this partial mapping?
>>>>
>>>> I vaguely recalling that maybe it was me who suggested this in the
>>>> past..
>>>>
>>>> I kinda see that this could be useful for a case where userspace
>>>> allocates a large chunk of memory and then performs sub-allocations in
>>>> the userspace driver. But do we have a real-world example for this
>>>> right
>>>> now?
>>>
>>> I think the main point about this IOCTL was to make mapping/unmapping
>>> more efficient and avoid relocations for situations where we know it is
>>> safe to do so.
>>>
>>> The fact that this can be used to create partial mappings is mostly just
>>> an added bonus, in my opinion. Doing this doesn't create much complexity
>>> but in turn gives us a lot more flexibility.
>>>
>>> A couple of places where I think this could be useful are OpenGL and
>>> Vulkan, both of which support buffer suballocation. This has a couple of
>>> advantages on modern GPUs where sometimes you want to use very large
>>> allocation granularity, etc.
>>>
>>> Now, I don't think that we'll see much of that in Tegra DRM directly,
>>> although grate could certainly make use of this, I suspect. However, I
>>> think for interoperation of dGPU and Tegra DRM (with VIC for post-
>>> processing, or hopefully some of the other hardware acceleration
>>> engines at some point), this might come in handy.
>>>
>>> There are other possible use-cases within just Tegra DRM as well. We may
>>> want to only partially map planar buffers for video post-processing, for
>>> example. Or map each plane separately.
>>>
>>>> Please see more below.
>>>>
>>>>>          /*
>>>>>           * [in] Length of memory area to map in bytes.
>>>>>           *
>>>>>           * Must be aligned by 4K.
>>>>>           */
>>>>>          __u64 length;
>>>>>
>>>>>          /*
>>>>>           * [out] IOVA of mapped memory. Userspace can use this IOVA
>>>>>           *   directly to refer to the memory to skip using
>>>>> relocations.
>>>>>           *   Only available if hardware memory isolation is enabled.
>>>>>           *
>>>>>           *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>>>>>           */
>>>>>          __u64 iova;
>>>>>
>>>>>          /*
>>>>>           * [out] ID corresponding to the mapped memory to be used for
>>>>>           *   relocations or unmapping.
>>>>>           */
>>>>>          __u32 mapping_id;
>>>>>          /*
>>>>>           * [in] Flags.
>>>>>           */
>>>>>          __u32 flags;
>>>>>
>>>>>          __u32 reserved[6];
>>>>> };
>>>>
>>>> It looks to me that we actually need a bit different thing here.
>>>>
>>>> This MAP IOCTL maps a portion of a GEM and then returns the mapping_id.
>>>> And I think we need something more flexible that will allow us to use
>>>> GEM handles for the relocation IDs, which should fit nicely with the
>>>> DMA-reservations.
>>>>
>>>> What about an IOCTL that wraps GEM into another GEM? We could wrap a
>>>> portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP
>>>> IOCTL.
>>>>
>>>> It could be something like this:
>>>>
>>>> ### DRM_TEGRA_BO_WRAP
>>>>
>>>> struct drm_tegra_wrap_bo {
>>>>     __u32 bo_handle_wrapped; // out
>>>>     __u32 bo_handle;         // in
>>>>     __u64 offset;
>>>>     __u64 length;
>>>> };
>>>>
>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>
>>>> struct drm_tegra_channel_map {
>>>>          __u32 channels_mask;
>>>>     __u32 mapping_id;
>>>>          __u32 bo_handle;
>>>>          __u32 flags;
>>>>          __u64 iova;
>>>> };
>>>>
>>>> ===
>>>>
>>>> This allows multiple mapping_ids to have the same backing GEM, so the
>>>> mapping_id could be resolved into a BO during of job's submission for
>>>> the DMA-reservations handling.
>>>
>>> That's pretty much what we have already above, isn't it? Just because we
>>> call the field "mapping_id" doesn't mean that in the background we can't
>>> create a GEM object and return it's handle as "mapping_id".
>>>
>>> One advantage of Mikko's proposal is that we have a single IOCTL rather
>>> than two to create the mapping, making it a bit more lightweight.
>>
>> Thinking a bit more about it, I now changed my mind.
>>
>> There is no need to perform implicit fencing on each suballocation,
>> instead explicit fencing should be used for the suballocations.
>>
>> So, we will need to add the relocation flags for the direction and
>> explicit (or implicit) fencing per-relocation. The direction will tell
>> how fence should be attached to the BO's DMA-reservation, while the
>> fence-flag will tell whether job's scheduler should wait for the BO's
>> reservation before executing job on hardware. This all will be needed
>> for a proper DRI implementation on older Tegras.
>>
>> Actually, during of my experiments with the UAPI, I added both these
>> flags for the relocations [1], but really used only the direction flag
>> so far, relying on the implicit fencing.
>>
>> [1]
>> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L894
>>
>>
>> So, let's keep the current variant of this MAP IOCTL as-is.
>>
> 
> Let me rephrase to make sure I understand:
> 
> For relocations, we should add flags for direction and fencing. This way
> at submit time we can do the proper fencing operations on the relocated
> BO's DMA reservation.

Correct

> This sounds good to me, and I think it makes the "relocation" concept a
> bit more general than it is currently. I think we could rename it to
> something like "buffer_usage" (open to bikeshedding), and it can have
> fence-related flags and relocation-related flags. For newer Tegra chips
> we don't necessarily need to relocate, but we still may need to handle
> DMA reservations, so in these cases only the fencing flags would be set.

The kernel driver already knows whether a relocation needs to be
patched, since MAP returns 0xffffffff when patching is needed, and thus
patching could be auto-skipped by the driver.

I don't think a special flag is required to tell whether relocation
needs to be done. The direction and fence flags should be enough for
now.

===

I'm curious what you think about another variant:

In the grate-kernel I'm using a BO table which contains a unique BO
entry for each of the job's relocations [1], and then indices into this
table plus target offsets are embedded into the cmdstream in place of
the memory addresses [2]. When the cmdstream is fully firewalled we're
reading the stream's data anyway, so it's quite convenient to embed the
BO table IDs and offsets into the cmdstream itself [3].

As a result:

 - Each of the job's BOs is resolved only once during submission.

 - The BO table is kept small if the cmdstream contains duplicated
relocations.

[1]
https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L892

[2]
https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L783

[3]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/patching.c#L43

Of course this limits the number of BOs per job. In the case of the
grate-kernel it's a maximum of 64 BOs per job + a maximum 64MB target
offset. I guess the number of BOs could be lowered to 32 per job, which
should be a bit more realistic, and then the maximum target offset would
be 128MB.

So, we could replace the BO table with a mapping table and have the
MAPPING_TABLE! :) And it doesn't matter whether the cmdstream is patched
or not; either way the MAPPING_TABLE will contain mapping ID + flags.
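
To illustrate the kind of in-stream encoding this is about (an example
split only, not necessarily what the grate-kernel uses bit-for-bit):
the 64 BOs + 64MB case above corresponds to packing a 6-bit table index
and a 26-bit byte offset into the 32-bit word that would otherwise hold
an address:

```
#include <stdint.h>

/* Example split only: 2^6 = 64 table entries, 2^26 bytes = 64MB offset. */
#define MAPPING_INDEX_BITS	6
#define MAPPING_OFFSET_BITS	26
#define MAPPING_OFFSET_MASK	((1u << MAPPING_OFFSET_BITS) - 1)

static inline uint32_t pack_reloc_word(uint32_t table_index, uint32_t offset)
{
	return (table_index << MAPPING_OFFSET_BITS) |
	       (offset & MAPPING_OFFSET_MASK);
}

static inline void unpack_reloc_word(uint32_t word,
				     uint32_t *table_index, uint32_t *offset)
{
	*table_index = word >> MAPPING_OFFSET_BITS;
	*offset = word & MAPPING_OFFSET_MASK;
}
```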

There are 3 possible variants of how a job could be processed, depending
on hardware generation and driver security configuration:

  1. The job is fully parsed and patched.
  2. The job is patched only (with relocations).
  3. The job is neither parsed nor patched.

My variant covers 1 and 3. I guess we could just ignore case 2 for now
and fall back to 1, for simplicity. It shouldn't be a problem to add
support for a RELOCS_TABLE in addition to the MAPPING_TABLE later on, if
it's really needed.

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command)
  2020-06-28 10:28   ` Mikko Perttunen
@ 2020-06-29  0:00     ` Dmitry Osipenko
  2020-06-30 10:40       ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-29  0:00 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

28.06.2020 13:28, Mikko Perttunen wrote:
> On 6/28/20 1:32 AM, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen пишет:
>>> /* Command is an opcode gather from a GEM handle */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>>> /* Command is an opcode gather from a user pointer */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
>>> /* Command is a wait for syncpt fence completion */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
>>> /* Command is a wait for SYNC_FILE FD completion */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
>>> /* Command is a wait for DRM syncobj completion */
>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4
>>>
>>> /*
>>>   * Allow driver to skip command execution if engine
>>>   * was not accessed by another channel between
>>>   * submissions.
>>>   */
>>> #define DRM_TEGRA_SUBMIT_CONTEXT_SETUP                        (1<<0)
>>>
>>> struct drm_tegra_submit_command {
>>>          __u16 type;
>>>          __u16 flags;
>>
>> Shouldn't the "packed" attribute be needed if a non-32bit aligned fields
>> are used?
> 
> I guess we should change these to u32 for consistency.
> 
>>
>>>          union {
>>>                  struct {
>>>                      /* GEM handle */
>>>                      __u32 handle;
>>>
>>>                      /*
>>>                       * Offset into GEM object in bytes.
>>>                       * Must be aligned by 4.
>>>                       */
>>>                      __u64 offset;
>>
>> 64bits for a gather offset is a bit too much, in most cases gathers are
>> under 4K.
>>
>> u32 should be more than enough (maybe even u16 if offset is given in a
>> dword granularity).
> 
> I guess this can be changed to u32, though I don't think there is any
> particularly pressing reason not to use u64.
> 
> In any case, I think we concluded that we'll drop the GEM gather command
> for now.

Sure, I commented on this just in case, for the future :)

>>
>>>                      /*
>>>                       * Length of gather in bytes.
>>>                       * Must be aligned by 4.
>>>                       */
>>>                      __u64 length;
>>
>> u32/16
>>
>>>                  } gather;
>>
>>>                  struct {
>>>                          __u32 reserved[1];
>>>
>>>                          /*
>>>                           * Pointer to gather data.
>>>                           * Must be aligned by 4 bytes.
>>>                           */
>>>                          __u64 base;
>>>                          /*
>>>                           * Length of gather in bytes.
>>>                           * Must be aligned by 4.
>>>                           */
>>>                          __u64 length;
>>>                  } gather_uptr;
>>
>> What about to inline the UPTR gather data and relocs into the
>> drm_tegra_submit_command[] buffer:
>>
>> struct drm_tegra_submit_command {
>>     struct {
>>         u16 num_words;
>>         u16 num_relocs;
>>
>>         gather_data[];
>>         drm_tegra_submit_relocation relocs[];
>>     } gather_uptr;
>> };
>>
>> struct drm_tegra_channel_submit {
>>          __u32 num_syncpt_incrs;
>>          __u32 syncpt_idx;
>>
>>          __u64 commands_ptr;
>>     __u32 commands_size;
>> ...
>> };
>>
>> struct drm_tegra_submit_command example[] = {
>>     cmd.gather_uptr{},
>>     ...
>>     gather_data[],
>>     gather_relocs[],
>>     cmd.wait_syncpt{},
>>     ...
>> };
>>
>> This way we will have only a single copy_from_user() for the whole
>> cmdstream, which should be more efficient to do and nicer from both
>> userspace and kernel perspectives in regards to forming and parsing the
>> commands.
>>
> 
> I'm not sure I agree it being nicer with regard to forming and parsing
> - Can you actually place multiple unsized arrays in a struct without
> pointers?

No :) But there are no unsized arrays here. The parser will read the
first command and then move the pointer to the next command based on the
size of the parsed command.
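
Roughly, the walk I have in mind would look like the sketch below (the
struct layout and sizes here are simplified placeholders, not the
proposed UAPI):

```
#include <stdint.h>
#include <stddef.h>
#include <errno.h>

/* Simplified stand-in for the proposed command header. */
struct submit_cmd_hdr {
	uint16_t type;
	uint16_t flags;
	uint16_t num_words;   /* gather_uptr: opcode words following the header  */
	uint16_t num_relocs;  /* gather_uptr: relocation entries after the words */
};

#define CMD_GATHER_UPTR	1
#define CMD_WAIT_SYNCPT	2
#define RELOC_SIZE	24    /* assumed size of one relocation entry */

static int parse_commands(const uint8_t *buf, size_t size)
{
	size_t pos = 0;

	while (pos + sizeof(struct submit_cmd_hdr) <= size) {
		/* Alignment handling is ignored here for brevity. */
		const struct submit_cmd_hdr *hdr = (const void *)(buf + pos);
		size_t len = sizeof(*hdr);

		switch (hdr->type) {
		case CMD_GATHER_UPTR:
			len += (size_t)hdr->num_words * 4;
			len += (size_t)hdr->num_relocs * RELOC_SIZE;
			break;
		case CMD_WAIT_SYNCPT:
			/* a real wait command would carry its own fixed fields */
			break;
		default:
			return -EINVAL;
		}

		if (pos + len > size)
			return -EINVAL;

		/* ... validate/copy the command here ... */

		pos += len;
	}

	return pos == size ? 0 : -EINVAL;
}
```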

> - gather_data's length is a multiple of 4, and so since we have u64's in
> drm_tegra_submit_relocation, alignment needs to be taken care of
> manually, both when forming and kernel needs to validate it while
> parsing. Depending on number of words in the gather, padding would need
> to be inserted. We could swap the two around but it still feels more
> brittle.

Yes, there will be unaligned reads on ARM64, but I don't think that's a
problem, is it?

> Also, if we read the gather from a separate address, userspace doesn't
> necessarily need to know the length of the gather (and number of relocs)
> upfront, so it's easier to have a builder pattern without needing to
> copy things later.

Usually there are 2-3 relocs per gather, so userspace could simply
maintain a fixed-size static buffer for the relocs (say 32 relocs).
Once a gather is sliced by userspace, the stashed relocs are appended
to the commands buffer and the next gather is formed.

Copying the relocs doesn't really cost anything for userspace; the
price is incomparable with the cost of the UPTR copying of each gather
for the kernel.

Well, the UPTR copying isn't expensive, but if there is a single copy
for the whole IOCTL, then it's even better!

> If we allow direct page mappings for gathers later, a separate address
> would make things also a little bit easier. For direct page mappings
> userspace would need to keep the opcode data unchanged while the job is
> being executed, so userspace could keep an arena where the gathers are
> placed, and refer to segments of that, instead of having to keep the
> drm_tegra_submit_commands alive.

Okay, what about having a single UPTR buffer for all gathers of a job?

struct drm_tegra_channel_submit {
	u64 commands_ptr;
	u64 gathers_ptr;

	u32 commands_size;
	u32 gathers_size;
	...
};

struct drm_tegra_submit_command {
	...
        union {
                struct {
                        /*
                         * Length of gather in bytes.
                         * Must be aligned by 4.
                         */
                        __u32 length;
                } gather_uptr;
	};
};

Now we have a single UPTR gather that can be sliced into sub-gathers
during the job's submission, and the whole data can also be copied at
once by the kernel driver for parsing purposes.

Userspace will have to preallocate a large-enough buffer upfront for the
gathers. If it runs out of space in the buffer, then it should
reallocate a larger buffer. Nice and simple :)
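
For example, the userspace side could be little more than a growable
arena of opcode words, something like this sketch (names and layout are
illustrative only):

```
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct gather_arena {
	uint32_t *words;   /* backing storage handed to gathers_ptr */
	size_t count;      /* words used so far                     */
	size_t capacity;   /* words allocated                       */
};

/* Append opcode words, growing the arena when it runs out of space. */
static int arena_push(struct gather_arena *a, const uint32_t *w, size_t n)
{
	if (a->count + n > a->capacity) {
		size_t cap = a->capacity ? a->capacity : 256;

		while (cap < a->count + n)
			cap *= 2;

		uint32_t *p = realloc(a->words, cap * sizeof(*p));
		if (!p)
			return -1;

		a->words = p;
		a->capacity = cap;
	}

	memcpy(a->words + a->count, w, n * sizeof(*w));
	a->count += n;
	return 0;
}
```

The submit would then point gathers_ptr at arena.words, set gathers_size
to arena.count * 4, and each gather_uptr command would only record the
length of its own slice.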

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-28  9:44   ` Mikko Perttunen
@ 2020-06-29  2:36     ` Dmitry Osipenko
  2020-06-29 10:27       ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-29  2:36 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

28.06.2020 12:44, Mikko Perttunen wrote:
> On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
>> 23.06.2020 15:09, Mikko Perttunen пишет:
>>>
>>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>>
>>> Allocates a free syncpoint, returning a file descriptor representing it.
>>> Only the owner of the file descriptor is allowed to mutate the value of
>>> the syncpoint.
>>>
>>> ```
>>> struct host1x_ctrl_allocate_syncpoint {
>>>         /**
>>>          * @fd:
>>>          *
>>>          * [out] New file descriptor representing the allocated
>>> syncpoint.
>>>          */
>>>         __s32 fd;
>>>
>>>         __u32 reserved[3];
>>> };
>>
>> We should need at least these basic things from the sync points API >
>> - Execution context shouldn't be able to tamper sync points of the other
>> contexts.
> 
> This is covered by this UAPI - when submitting, as part of the
> syncpt_incr struct you pass the syncpoint FD. This way the driver can
> check the syncpoints used are correct, or program HW protection.
> 
>>
>> - Sync point could be shared with other contexts for explicit fencing.
> 
> Not sure what you specifically mean; you can get the ID out of the
> syncpoint fd and share the ID for read-only access. (Or the FD for
> read-write access)

I enumerated the overall points that the UAPI should provide to us, just
for clarity. It's not that you haven't covered them; sorry for the
confusion! :)

Please see more comments below!

>>
>> - Sync points should work reliably.
>>
>> Some problems of the current Host1x driver, like where it falls over if
>> sync point value is out-of-sync + all the hang-job recovery labor could
>> be easily reduced if sync point health is protected by extra UAPI
>> constraints. >
>> So I think we may want the following:
>>
>> 1. We still should need to assign sync point ID to a DRM-channel's
>> context. This sync point ID will be used for a commands stream forming,
>> like it is done by the current staging UAPI.
>>
>> So we should need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but
>> improve it.

My point here is that the UAPI shouldn't allow incrementing the job's
sync point using the SYNCPOINT_INCREMENT IOCTL, which is another UAPI
constraint.

I'm suggesting that we should have two methods of sync point allocation:

1) A sync point that can be used only by a submitted job.

2) A sync point that can be incremented by the CPU.

The first method would allocate a raw sync point ID that is assigned to
the channel's context. This ID would be used for tracking the job's
completion. Perhaps this method could also optionally return a sync
point FD if you want another job to wait on this sync point.

We don't need a dedicated sync point FD for all kinds of jobs, do we?
For example, I don't see why a sync point FD would be needed in the case
of Opentegra jobs.

>> 2. Allocated sync point must have a clean hardware state.
> 
> What do you mean by clean hardware state?

I mean that the sync point should have a predictable state [1]; it
shouldn't accidentally fire during hardware programming, for example.

[1]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L132

For a submitted job, the job's sync point state could be reset at
submission time, for example like I did in the grate-kernel's
experimental driver [2].

[2]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel.c#L145

>>
>> 3. Sync points should be properly refcounted. Job's sync points
>> shouldn't be re-used while job is alive.
>>
>> 4. The job's sync point can't be re-used after job's submission (UAPI
>> constraint!). Userspace must free sync point and allocate a new one for
>> the next job submission. And now we:
>>
>>    - Know that job's sync point is always in a healthy state!
>>
>>    - We're not limited by a number of physically available hardware sync
>> points! Allocation should block until free sync point is available.
>>
>>    - The logical number of job's sync point increments matches the SP
>> hardware state! Which is handy for a job's debugging.
>>
>> Optionally, the job's sync point could be auto-removed from the DRM's
>> context after job's submission, avoiding a need for an extra SYNCPT_PUT
>> IOCTL invocation to be done by userspace after the job's submission.
>> Could be a job's flag.
> 
> I think this would cause problems where after a job completes but before
> the fence has been waited, the syncpoint is already recycled (especially
> if the syncpoint is reset into some clean state).

Exactly, good point! The dma-fence shouldn't be hardwired to the sync
point in order to avoid this situation :)

Please take a look at the fence implementation that I made for the
grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
to a sync point by host1x_fence_create(). Once the job is completed, the
host1x-fence is detached from the sync point [5][6] and the sync point
can be recycled safely!

[3]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/fence.c

[4]
https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L450

[5]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints_hw.c#L50

[6]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/scheduler.c#L133

Please try to take a closer look at the grate-driver's implementation if
you haven't yet. I think we should be able to reuse or improve some of
the ideas. That implementation isn't 100% complete; it doesn't cover
things like CPU-incremented or exported sync points, for example, but
the basics are there.

> I would prefer having a syncpoint for each userspace channel context
> (several of which could share a hardware channel if MLOCKing is not used).
> 
> In my experience it's then not difficult to pinpoint which job has
> failed, and if each userspace channel context uses a separate syncpoint,
> a hanging job wouldn't mess with other application's jobs, either.

I agree that there shouldn't be any problems with finding which job has
hung. The timed-out job is always the hung job, no? :)

Also, please take a look at the DRM scheduler. Once I started to wire up
the DRM scheduler support in the grate-driver, I realized that there is
no real need to try to recover the sync point's counter and so on, like
the current upstream host1x driver does for a hung job. When a job
hangs, the whole channel should be torn down and reset, the job's sync
point state reset, the client's HW engine reset, and all the pre-queued
jobs re-submitted. And the generic DRM job scheduler helps us with that!
It also has other neat features which I haven't tried yet, like job
priorities for example.

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-29  2:36     ` Dmitry Osipenko
@ 2020-06-29 10:27       ` Mikko Perttunen
  2020-06-29 19:28         ` Dmitry Osipenko
                           ` (2 more replies)
  0 siblings, 3 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-29 10:27 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

On 6/29/20 5:36 AM, Dmitry Osipenko wrote:
> 28.06.2020 12:44, Mikko Perttunen пишет:
>> On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen пишет:
>>>>
>>>> ### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
>>>>
>>>> Allocates a free syncpoint, returning a file descriptor representing it.
>>>> Only the owner of the file descriptor is allowed to mutate the value of
>>>> the syncpoint.
>>>>
>>>> ```
>>>> struct host1x_ctrl_allocate_syncpoint {
>>>>          /**
>>>>           * @fd:
>>>>           *
>>>>           * [out] New file descriptor representing the allocated
>>>> syncpoint.
>>>>           */
>>>>          __s32 fd;
>>>>
>>>>          __u32 reserved[3];
>>>> };
>>>
>>> We should need at least these basic things from the sync points API >
>>> - Execution context shouldn't be able to tamper sync points of the other
>>> contexts.
>>
>> This is covered by this UAPI - when submitting, as part of the
>> syncpt_incr struct you pass the syncpoint FD. This way the driver can
>> check the syncpoints used are correct, or program HW protection.
>>
>>>
>>> - Sync point could be shared with other contexts for explicit fencing.
>>
>> Not sure what you specifically mean; you can get the ID out of the
>> syncpoint fd and share the ID for read-only access. (Or the FD for
>> read-write access)
> 
> I enumerated the overall points that UAPI should provide to us, just for
> clarity. Not like you haven't covered any of them, sorry for the
> confusion! :)
> 
> Please see more comments below!

Ok, good :)

> 
>>>
>>> - Sync points should work reliably.
>>>
>>> Some problems of the current Host1x driver, like where it falls over if
>>> sync point value is out-of-sync + all the hang-job recovery labor could
>>> be easily reduced if sync point health is protected by extra UAPI
>>> constraints. >
>>> So I think we may want the following:
>>>
>>> 1. We still should need to assign sync point ID to a DRM-channel's
>>> context. This sync point ID will be used for a commands stream forming,
>>> like it is done by the current staging UAPI.
>>>
>>> So we should need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but
>>> improve it.
> 
> My point here is that the UAPI shouldn't be able to increment the job's
> sync point using SYNCPOINT_INCREMENT IOCTL, which is another UAPI
> constraint.
> 
> I'm suggesting that we should have two methods of sync point allocations:
> 
> 1) Sync point that could be used only by a submitted job.
> 
> 2) Sync point that could be incremented by CPU.
> 
> The first method will allocate a raw sync point ID that is assigned to
> the channel's context. This ID will be used for the job's completion
> tracking. Perhaps this method also could optionally return a sync point
> FD if you'd want to wait on this sync point by another job.
> 
> We don't need a dedicated sync point FD for all kinds of jobs, don't we?
> For example, I don't see why a sync point FD may be needed in a case of
> Opentegra jobs.

I think it's cleaner if we have just one way to allocate syncpoints, and 
then those syncpoints can be passed to different things depending on the 
situation.

If we want to protect direct incrementing while a job is submitted, we 
could have a locking API where an ongoing job can take a lock refcount 
in the syncpoint, and incrementing would return -EBUSY.
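
Roughly, as a sketch (all names here are made up and the lock/increment 
race is ignored for brevity):

```
#include <linux/atomic.h>
#include <linux/errno.h>

/* Hypothetical sketch only; not actual driver code. */
struct host1x_syncpt_alloc {
	atomic_t job_locks;	/* taken by in-flight jobs using this syncpoint */
	/* ... ID, value tracking, etc. ... */
};

/* Called when a job that increments this syncpoint is submitted/retired. */
static void host1x_syncpt_job_lock(struct host1x_syncpt_alloc *sp)
{
	atomic_inc(&sp->job_locks);
}

static void host1x_syncpt_job_unlock(struct host1x_syncpt_alloc *sp)
{
	atomic_dec(&sp->job_locks);
}

/* Would back a SYNCPOINT_INCREMENT IOCTL on the syncpoint FD. */
static int host1x_syncpt_cpu_incr(struct host1x_syncpt_alloc *sp)
{
	if (atomic_read(&sp->job_locks) > 0)
		return -EBUSY;	/* a submitted job currently owns the syncpoint */

	/* ... do the actual hardware increment here ... */
	return 0;
}
```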

> 
>>> 2. Allocated sync point must have a clean hardware state.
>>
>> What do you mean by clean hardware state?
> 
> I mean that sync point should have a predictable state [1], it shouldn't
> accidentally fire during of hardware programming for example.
> 
> [1]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L132
> 
> For a submitted job, the job's sync point state could be reset at a
> submission time, for example like I did it in the grate-kernel's
> experimental driver [2].
> 
> [2]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/channel.c#L145
> 
>>>
>>> 3. Sync points should be properly refcounted. Job's sync points
>>> shouldn't be re-used while job is alive.
>>>
>>> 4. The job's sync point can't be re-used after job's submission (UAPI
>>> constraint!). Userspace must free sync point and allocate a new one for
>>> the next job submission. And now we:
>>>
>>>     - Know that job's sync point is always in a healthy state!
>>>
>>>     - We're not limited by a number of physically available hardware sync
>>> points! Allocation should block until free sync point is available.
>>>
>>>     - The logical number of job's sync point increments matches the SP
>>> hardware state! Which is handy for a job's debugging.
>>>
>>> Optionally, the job's sync point could be auto-removed from the DRM's
>>> context after job's submission, avoiding a need for an extra SYNCPT_PUT
>>> IOCTL invocation to be done by userspace after the job's submission.
>>> Could be a job's flag.
>>
>> I think this would cause problems where after a job completes but before
>> the fence has been waited, the syncpoint is already recycled (especially
>> if the syncpoint is reset into some clean state).
> 
> Exactly, good point! The dma-fence shouldn't be hardwired to the sync
> point in order to avoid this situation :)
> 
> Please take a look at the fence implementation that I made for the
> grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
> to a sync point by host1x_fence_create(). Once job is completed, the
> host1x-fence is detached from the sync point [5][6] and sync point could
> be recycled safely!

What if the fence has been programmed as a prefence to another channel 
(that is getting delayed), or to the GPU, or some other accelerator like 
DLA, or maybe some other VM? Those don't know the dma_fence has been 
signaled, they can only rely on the syncpoint ID/threshold pair.

> 
> [3]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/fence.c
> 
> [4]
> https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L450
> 
> [5]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints_hw.c#L50
> 
> [6]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/scheduler.c#L133
> 
> Please try to take a closer look at the grate-driver's implementation if
> you haven't yet. I think we should be able to reuse or improve some of
> the ideas. That implementation isn't 100% complete, it doesn't cover
> things like CPU-incremented or exported sync points for example, but
> basics are there.

Sure, I'll take a look.

> 
>> I would prefer having a syncpoint for each userspace channel context
>> (several of which could share a hardware channel if MLOCKing is not used).
>>
>> In my experience it's then not difficult to pinpoint which job has
>> failed, and if each userspace channel context uses a separate syncpoint,
>> a hanging job wouldn't mess with other application's jobs, either.
> 
> I agree that there shouldn't be any problems with finding what job is
> hanged. The timed out job is always the hanged job, no? :)
> 
> Also, please take a look at the DRM scheduler. Once I started to wire up
> the DRM scheduler support in the grate-driver, I realized that there is
> no real need to try to recover sync point's counter and etc, like the
> current upstream host1x driver does for a hanged job. When job is
> hanged, the whole channel should be turned down and reset, the job's
> sync point state reset, client's HW engine reset, all the pre-queued
> jobs re-submitted. And the generic DRM job scheduler helps us with that!
> It also has other neat features which I haven't tried yet, like job
> priorities for example.
> 

Thanks, I'll take a look at this as well.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-29 10:27       ` Mikko Perttunen
@ 2020-06-29 19:28         ` Dmitry Osipenko
  2020-06-29 19:33         ` Dmitry Osipenko
  2020-06-29 19:42         ` Dmitry Osipenko
  2 siblings, 0 replies; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-29 19:28 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

29.06.2020 13:27, Mikko Perttunen wrote:
...
>> We don't need a dedicated sync point FD for all kinds of jobs, don't we?
>> For example, I don't see why a sync point FD may be needed in a case of
>> Opentegra jobs.
> 
> I think it's cleaner if we have just one way to allocate syncpoints, and
> then those syncpoints can be passed to different things depending on the
> situation.
> 
> If we want to protect direct incrementing while a job is submitted, we
> could have a locking API where an ongoing job can take a lock refcount
> in the syncpoint, and incrementing would return -EBUSY.

Okay, let's go with this for now.

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-29 10:27       ` Mikko Perttunen
  2020-06-29 19:28         ` Dmitry Osipenko
@ 2020-06-29 19:33         ` Dmitry Osipenko
  2020-06-29 19:42         ` Dmitry Osipenko
  2 siblings, 0 replies; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-29 19:33 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

29.06.2020 13:27, Mikko Perttunen wrote:
...
>>>> 3. Sync points should be properly refcounted. Job's sync points
>>>> shouldn't be re-used while job is alive.
>>>>
>>>> 4. The job's sync point can't be re-used after job's submission (UAPI
>>>> constraint!). Userspace must free sync point and allocate a new one for
>>>> the next job submission. And now we:
>>>>
>>>>     - Know that job's sync point is always in a healthy state!
>>>>
>>>>     - We're not limited by a number of physically available hardware
>>>> sync
>>>> points! Allocation should block until free sync point is available.

It also occurred to me that if allocation is blocking and there is a
need to allocate multiple sync points for a job, then the IOCTL should
be able to allocate multiple sync points at once. This will prevent an
interlock situation where two contexts could block on each other.
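
Just to illustrate what a batched allocation could look like (name and
layout here are made up, not a concrete proposal), the IOCTL would hand
back all of the requested FDs at once, or block/fail as a unit:

```
struct host1x_ctrl_allocate_syncpoints {
        /**
         * @fds_ptr:
         *
         * [in] Pointer to a userspace array of __s32 that receives the
         *   file descriptors of the allocated syncpoints.
         */
        __u64 fds_ptr;

        /**
         * @num_syncpts:
         *
         * [in] Number of syncpoints to allocate for this job.
         */
        __u32 num_syncpts;

        __u32 flags;

        __u32 reserved[2];
};
```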

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-29 10:27       ` Mikko Perttunen
  2020-06-29 19:28         ` Dmitry Osipenko
  2020-06-29 19:33         ` Dmitry Osipenko
@ 2020-06-29 19:42         ` Dmitry Osipenko
  2020-06-30 10:26           ` Mikko Perttunen
  2 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-29 19:42 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

29.06.2020 13:27, Mikko Perttunen wrote:
...
>>>> 4. The job's sync point can't be re-used after job's submission (UAPI
>>>> constraint!). Userspace must free sync point and allocate a new one for
>>>> the next job submission. And now we:
>>>>
>>>>     - Know that job's sync point is always in a healthy state!
>>>>
>>>>     - We're not limited by a number of physically available hardware
>>>> sync
>>>> points! Allocation should block until free sync point is available.
>>>>
>>>>     - The logical number of job's sync point increments matches the SP
>>>> hardware state! Which is handy for a job's debugging.
>>>>
>>>> Optionally, the job's sync point could be auto-removed from the DRM's
>>>> context after job's submission, avoiding a need for an extra SYNCPT_PUT
>>>> IOCTL invocation to be done by userspace after the job's submission.
>>>> Could be a job's flag.
>>>
>>> I think this would cause problems where after a job completes but before
>>> the fence has been waited, the syncpoint is already recycled (especially
>>> if the syncpoint is reset into some clean state).
>>
>> Exactly, good point! The dma-fence shouldn't be hardwired to the sync
>> point in order to avoid this situation :)
>>
>> Please take a look at the fence implementation that I made for the
>> grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached
>> to a sync point by host1x_fence_create(). Once job is completed, the
>> host1x-fence is detached from the sync point [5][6] and sync point could
>> be recycled safely!
> 
> What if the fence has been programmed as a prefence to another channel
> (that is getting delayed), or to the GPU, or some other accelerator like
> DLA, or maybe some other VM? Those don't know the dma_fence has been
> signaled, they can only rely on the syncpoint ID/threshold pair.

The job's explicit fence is always just a dma-fence; it's not tied to a
host1x-fence, and its signal should be waited for on the CPU.

If you want to make a job wait for a sync point in hardware, then you
should use the drm_tegra_submit_command wait command.

Again, please note that the DRM scheduler supports a job-submitted
fence! This dma-fence signals once the job has been pushed to hardware
for execution, so it shouldn't be a problem to maintain job ordering
for complex jobs without much hassle. We'll need to write some
userspace to check how it works in practice :) For now I have only had
experience with simple jobs.
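
(For reference, a minimal sketch of where such a "submitted" fence
comes from in the DRM scheduler; the surrounding driver job structure
is assumed:)

```
#include <drm/gpu_scheduler.h>

/*
 * Minimal sketch: the DRM scheduler's fence object carries both a
 * "scheduled" fence, which signals when the job is pushed to the
 * hardware, and a "finished" fence, which signals on completion.
 */
static struct dma_fence *job_submitted_fence(struct drm_sched_job *sched_job)
{
        return &sched_job->s_fence->scheduled;
}
```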

Secondly, I suppose neither the GPU nor the DLA can wait on a host1x
sync point, correct? Or are they integrated with the Host1x hardware?

Anyway, it shouldn't be difficult to resolve a dma-fence into a
host1x-fence, get the SP ID and maintain the SP's liveness. Please see
more below.

In the grate-driver I made all sync points refcounted, so a sync point
won't be recycled while it has active users [1][2][3] in the kernel (or
in userspace).

[1]
https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L428
[2]
https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L1206
[3]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syncpoints.c#L163

Now, grate-kernel isn't a 100% complete implementation, as I already
mentioned before. The host1x-fence doesn't have a reference to a sync
point, as you may see in the code; this is because userspace sync
points are not implemented in the grate-driver.

But nothing stops us from adding that SP reference, and then we could
simply do the following in the code:

/* Attach a new host1x-fence to a sync point. */
struct dma_fence *host1x_fence_create(syncpt, ...)
{
	...
	fence->sp = syncpt;
	...
	return &fence->base;
}

/* Detach the fence from its sync point once it has signaled. */
void host1x_syncpt_signal_fence(struct host1x_fence *fence)
{
	...
	fence->sp = NULL;
}

irqreturn_t host1x_hw_syncpt_isr()
{
	spin_lock(&host1x_syncpts_lock);
	...
	host1x_syncpt_signal_fence(sp->fence);
	...
	spin_unlock(&host1x_syncpts_lock);
}

void host1x_submit_job(job)
{
	unsigned long flags;
	...
	/*
	 * Take a reference on the prefence's sync point, if it is
	 * still attached (i.e. the fence hasn't signaled yet).
	 */
	spin_lock_irqsave(&host1x_syncpts_lock, flags);
	sp = host1x_syncpt_get(host1x_fence->sp);
	spin_unlock_irqrestore(&host1x_syncpts_lock, flags);
	...
	if (sp) {
		/* Make the engine wait for the threshold in hardware. */
		push(WAIT(sp->id, host1x_fence->threshold));
		job->sync_points = sp;
	}
}

void host1x_free_job(job)
{
	/* Drop the reference; the sync point may now be recycled. */
	host1x_syncpt_put(job->sync_points);
	...
}

For example: if you share a host1x-fence (dma-fence) with a GPU
context, then the fence's SP won't be released until the GPU's context
is done with the SP. The GPU's job will time out if the shared SP never
gets incremented, and that is a totally okay situation.

Does this answer your question?

===

I'm not familiar with the Host1x VMs, so please let me use my
imagination here:

In the case of a VM we could have a special VM-shared sync point type.
Userspace will allocate this special VM SP using ALLOCATE_SYNCPOINT and
this SP won't be used for a job(!). This is a case where a job needs to
increment multiple sync points: its own SP + the VM's SP. If the job
hangs, then there should be a way to tell the VM to release the SP and
try again next time with a freshly allocated SP. The shared SP should
stay alive as long as the VM uses it, so there should be a way for the
VM to tell that it's done with the SP.

Alternatively, we could add SP recovery (or whatever is needed) for the
VM, but this should be kept specific to T194+. Older Tegras shouldn't
ever need this complexity if I'm not missing anything.

Please provide detailed information about the VM's workflow if the
above doesn't sound good.

Perhaps we shouldn't focus on VM support for now, but we may want to
leave some room for potential future expansion if necessary.

* Re: [RFC] Host1x/TegraDRM UAPI
  2020-06-26 13:59       ` Dmitry Osipenko
@ 2020-06-30  9:09         ` Daniel Vetter
  0 siblings, 0 replies; 50+ messages in thread
From: Daniel Vetter @ 2020-06-30  9:09 UTC (permalink / raw)
  To: Dmitry Osipenko
  Cc: Mikko Perttunen, Karol Herbst, David Airlie, gustavo, dri-devel,
	Jon Hunter, talho, bhuntsman, Thierry Reding, linux-tegra

On Fri, Jun 26, 2020 at 04:59:45PM +0300, Dmitry Osipenko wrote:
> 26.06.2020 16:38, Daniel Vetter wrote:
> > On Fri, Jun 26, 2020 at 01:40:40PM +0200, Thierry Reding wrote:
> >> On Fri, Jun 26, 2020 at 01:06:58PM +0200, Karol Herbst wrote:
> >>> On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen <cyndis@kapsi.fi> wrote:
> >>>>
> >>>> # Host1x/TegraDRM UAPI proposal
> >>>>
> >>>> This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace
> >>>> the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in
> >>>> many ways.
> >>>>
> >>>> I haven't written any implementation yet -- I'll do that once there is
> >>>> some agreement on the high-level design.
> >>>>
> >>>> Current open items:
> >>>>
> >>>> * The syncpoint UAPI allows userspace to create sync_file FDs with
> >>>> arbitrary syncpoint fences. dma_fence code currently seems to assume all
> >>>> fences will be signaled, which would not necessarily be the case with
> >>>> this interface.
> >>>> * Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
> >>>> Not sure if they are still needed.
> >>>>
> >>>
> >>> Hi, as this wasn't addressed here (and sorry if I missed it): is there
> >>> an open source userspace making use of this UAPI? Because this is
> >>> something which needs to be seen before it can be included at all.
> >>
> >> There's a set of commits that implement an earlier draft here:
> >>
> >>     https://github.com/thierryreding/linux/tree/for-5.6/drm-tegra-destage-abi
> >>
> >> and corresponding changes to libdrm to make use of that and implement
> >> tests using the various generations of VIC here:
> >>
> >>     https://cgit.freedesktop.org/~tagr/drm/log/
> >>
> >> Before actually jumping to an implementation we wanted to go over the
> >> design a bit more to avoid wasting a lot of work, like we've done in
> >> the past. Turns out it isn't easy to get everyone to agree on a good
> >> ABI. Who would've thought? =)
> >>
> >> I expect that once the discussion around the ABI settles Mikko will
> >> start on implementing this ABI in the kernel and we'll update the
> >> libdrm patches to make use of the new ABI as well.
> > 
> > Do we have more than libdrm and tests for this, like something somewhat
> > functional? Since tegradrm landed years ago we've refined the criteria a
> > bit in this regard, latest version is explicit in that it needs to be
> > something that's functional (for end-users in some form), not just
> > infrastructure and tests.
> 
> We have Opentegra [1] and VDPAU [2] userspace drivers for older Tegra
> SoCs that have been using the staging UAPI for years now.
> 
> [1] https://github.com/grate-driver/xf86-video-opentegra
> [2] https://github.com/grate-driver/libvdpau-tegra
> 
> The UAPI and the kernel driver updates are very needed for these drivers
> because of the missing UAPI synchronization features and performance
> problems of the kernel driver.
> 
> So we already have some real-world userspace for the testing!

Awesome! Maybe for future rounds include the links for the vdpau driver. I
didn't know about that one at all. -opentegra is probably not so relevant
anymore (and I flat out forgot it exists) ... Including the userspace side
links is good so that forgetful people like me don't nag :-)
-Daniel


> It's not the first and not the second time we're discussing the Tegra
> DRM UAPI changes, but this time it feels like there is a good chance
> that it will finally materialize and I'm very happy about it :)

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-29 19:42         ` Dmitry Osipenko
@ 2020-06-30 10:26           ` Mikko Perttunen
  2020-07-01  0:22             ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-30 10:26 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
> 
> Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync
> point, correct? Or are they integrated with Host1x HW?
> 

They can access syncpoints directly. (That's what I alluded to in the 
"Introduction to the hardware" section :) all those things have hardware 
access to syncpoints)

 >
 > .. rest ..
 >

Let me try to summarize once more for my own understanding:

* When submitting a job, you would allocate new syncpoints for the job
* After submitting the job, those syncpoints are not usable anymore
* Postfences of that job would keep references to those syncpoints so 
they aren't freed and cleared before the fences have been released
* Once postfences have been released, syncpoints would be returned to 
the pool and reset to zero

The advantage of this would be that at any point in time, there would be 
a 1:1 correspondence between allocated syncpoints and jobs, so you could 
shuffle the jobs around channels or reorder them.

Please correct if I got that wrong :)

---

I have two concerns:

* A lot of churn on syncpoints - any time you submit a job you might not 
get a syncpoint for an indefinite time. If we allocate syncpoints 
up-front at least you know beforehand, and then you have the syncpoint 
as long as you need it.
* Plumbing the dma-fence/sync_file everywhere, and keeping it alive 
until waits on it have completed, is more work than just having the 
ID/threshold. This is probably mainly a problem for downstream, where 
updating code for this would be difficult. I know that's not a proper 
argument but I hope we can reach something that works for both worlds.

Here's a proposal in between:

* Keep syncpoint allocation and submission in jobs as in my original 
proposal
* Don't attempt to recover user channel contexts. What this means:
   * If we have a hardware channel per context (MLOCKing), just tear 
down the channel
   * Otherwise, we can just remove (either by patching or by full 
teardown/resubmit of the channel) all jobs submitted by the user channel 
context that submitted the hanging job. Jobs of other contexts would be 
undisturbed (though potentially delayed, which could be taken into 
account and timeouts adjusted)
* If this happens, we can set the removed jobs' post-fences to error 
status (see the sketch after this list) and the user will have to 
resubmit them.
* We should be able to keep the syncpoint refcounting based on fences.
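
A minimal sketch of the error-status part, assuming a hypothetical
struct tegra_drm_job that keeps a reference to its post-fence:

```
#include <linux/dma-fence.h>
#include <linux/errno.h>

/* Hypothetical job structure, for illustration only. */
struct tegra_drm_job {
        struct dma_fence *postfence;
        /* ... */
};

/*
 * Minimal sketch: mark the removed job's post-fence with an error
 * before signaling it, so waiters can see that the job was torn down
 * and must be resubmitted.
 */
static void tegra_drm_job_cancel(struct tegra_drm_job *job)
{
        dma_fence_set_error(job->postfence, -ECANCELED);
        dma_fence_signal(job->postfence);
        dma_fence_put(job->postfence);
        job->postfence = NULL;
}
```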

This can be made more fine-grained by not caring about the user channel 
context, but tearing down all jobs with the same syncpoint. I think the 
result would be that we can get either what you described (or how I 
understood it in the summary in the beginning of the message), or a more 
traditional syncpoint-per-userctx workflow, depending on how the 
userspace decides to allocate syncpoints.

If needed, the kernel can still do e.g. reordering (you mentioned job 
priorities) at syncpoint granularity, which, if the userspace followed 
the model you described, would be the same thing as job granularity.

(Maybe it would be more difficult with the current drm_scheduler; 
sorry, I haven't had the time yet to read up on that. I'm dealing with 
clearing up work stuff before summer vacation.)

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command)
  2020-06-29  0:00     ` Dmitry Osipenko
@ 2020-06-30 10:40       ` Mikko Perttunen
  0 siblings, 0 replies; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-30 10:40 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

On 6/29/20 3:00 AM, Dmitry Osipenko wrote:
> 28.06.2020 13:28, Mikko Perttunen wrote:
>> On 6/28/20 1:32 AM, Dmitry Osipenko wrote:
>>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>> /* Command is an opcode gather from a GEM handle */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER             0
>>>> /* Command is an opcode gather from a user pointer */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR        1
>>>> /* Command is a wait for syncpt fence completion */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT        2
>>>> /* Command is a wait for SYNC_FILE FD completion */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE     3
>>>> /* Command is a wait for DRM syncobj completion */
>>>> #define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ       4
>>>>
>>>> /*
>>>>    * Allow driver to skip command execution if engine
>>>>    * was not accessed by another channel between
>>>>    * submissions.
>>>>    */
>>>> #define DRM_TEGRA_SUBMIT_CONTEXT_SETUP                        (1<<0)
>>>>
>>>> struct drm_tegra_submit_command {
>>>>           __u16 type;
>>>>           __u16 flags;
>>>
>>> Shouldn't the "packed" attribute be needed if a non-32bit aligned fields
>>> are used?
>>
>> I guess we should change these to u32 for consistency.
>>
>>>
>>>>           union {
>>>>                   struct {
>>>>                       /* GEM handle */
>>>>                       __u32 handle;
>>>>
>>>>                       /*
>>>>                        * Offset into GEM object in bytes.
>>>>                        * Must be aligned by 4.
>>>>                        */
>>>>                       __u64 offset;
>>>
>>> 64bits for a gather offset is a bit too much, in most cases gathers are
>>> under 4K.
>>>
>>> u32 should be more than enough (maybe even u16 if offset is given in a
>>> dword granularity).
>>
>> I guess this can be changed to u32, though I don't think there is any
>> particularly pressing reason not to use u64.
>>
>> In any case, I think we concluded that we'll drop the GEM gather command
>> for now.
> 
> Sure, I commented this just in a case, for the future :)

Yep, OK :)

> 
>>>
>>>>                       /*
>>>>                        * Length of gather in bytes.
>>>>                        * Must be aligned by 4.
>>>>                        */
>>>>                       __u64 length;
>>>
>>> u32/16
>>>
>>>>                   } gather;
>>>
>>>>                   struct {
>>>>                           __u32 reserved[1];
>>>>
>>>>                           /*
>>>>                            * Pointer to gather data.
>>>>                            * Must be aligned by 4 bytes.
>>>>                            */
>>>>                           __u64 base;
>>>>                           /*
>>>>                            * Length of gather in bytes.
>>>>                            * Must be aligned by 4.
>>>>                            */
>>>>                           __u64 length;
>>>>                   } gather_uptr;
>>>
>>> What about to inline the UPTR gather data and relocs into the
>>> drm_tegra_submit_command[] buffer:
>>>
>>> struct drm_tegra_submit_command {
>>>      struct {
>>>          u16 num_words;
>>>          u16 num_relocs;
>>>
>>>          gather_data[];
>>>          drm_tegra_submit_relocation relocs[];
>>>      } gather_uptr;
>>> };
>>>
>>> struct drm_tegra_channel_submit {
>>>           __u32 num_syncpt_incrs;
>>>           __u32 syncpt_idx;
>>>
>>>           __u64 commands_ptr;
>>>      __u32 commands_size;
>>> ...
>>> };
>>>
>>> struct drm_tegra_submit_command example[] = {
>>>      cmd.gather_uptr{},
>>>      ...
>>>      gather_data[],
>>>      gather_relocs[],
>>>      cmd.wait_syncpt{},
>>>      ...
>>> };
>>>
>>> This way we will have only a single copy_from_user() for the whole
>>> cmdstream, which should be more efficient to do and nicer from both
>>> userspace and kernel perspectives in regards to forming and parsing the
>>> commands.
>>>
>>
>> I'm not sure I agree it being nicer with regard to forming and parsing
>> - Can you actually place multiple unsized arrays in a struct without
>> pointers?
> 
> No :) But there are no unsized arrays here. The parser will read first
> command and then move pointer to the next command based on size of the
> parsed command.
> 
>> - gather_data's length is a multiple of 4, and so since we have u64's in
>> drm_tegra_submit_relocation, alignment needs to be taken care of
>> manually, both when forming and kernel needs to validate it while
>> parsing. Depending on number of words in the gather, padding would need
>> to be inserted. We could swap the two around but it still feels more
>> brittle.
> 
> Yes, there will be unaligned reads on ARM64, but I don't think it's a
> problem. Isn't it?

It's not a big problem, but I think it would be less error-prone to have 
naturally aligned data.

> 
>> Also, if we read the gather from a separate address, userspace doesn't
>> necessarily need to know the length of the gather (and number of relocs)
>> upfront, so it's easier to have a builder pattern without needing to
>> copy things later.
> 
> Usually there are 2-3 relocs per gather, so userspace could simply
> maintain a fixed-sized static buffer for the relocs (say 32 relocs).
> Once gather is sliced by userspace, the stashed relocs will be appended
> to the comands buffer and next gather will be formed.
> 
> The relocs copying doesn't really costs anything for userspace, the
> price is incomparable with the cost of UPTR copying of each gather for
> the kernel.
> 
> Well, the UPTR copying isn't expensive, but if there is a single copy
> for the whole IOCTL, then it's even much better!
> 
>> If we allow direct page mappings for gathers later, a separate address
>> would make things also a little bit easier. For direct page mappings
>> userspace would need to keep the opcode data unchanged while the job is
>> being executed, so userspace could keep an arena where the gathers are
>> placed, and refer to segments of that, instead of having to keep the
>> drm_tegra_submit_commands alive.
> 
> Okay, what about to have a single UPTR buffer for all gathers of a job?
> 
> struct drm_tegra_channel_submit {
> 	u64 commands_ptr;
> 	u64 gathers_ptr;
> 
> 	u32 commands_size;
> 	u32 gathers_size;
> 	...
> };
> 
> struct drm_tegra_submit_command {
> 	...
>          union {
>                  struct {
>                          /*
>                           * Length of gather in bytes.
>                           * Must be aligned by 4.
>                           */
>                          __u32 length;
>                  } gather_uptr;
> 	};
> };
> 
> Now we have a single UPTR gather that could be sliced into sub-gathers
> during of job's submission and also the whole data could be copied at
> once by kernel driver for the parsing purposes.
> 
> The userspace will have to preallocate a large-enough buffer upfront for
> the gathers. If it runs out of space in the buffer, then it should
> reallocate a larger buffer. Nice and simple :)
> 

I think that would work fine for the use cases that I can think of.
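
To make the idea concrete, here is a rough userspace-side sketch of
that layout. Everything here is hypothetical: the IOCTL number, header
and exact struct fields follow the sketches in this thread rather than
a finalized UAPI.

```
#include <stdint.h>
#include <string.h>
#include <xf86drm.h>
#include "tegra_drm.h"   /* hypothetical header with the proposed UAPI */

/*
 * Hypothetical sketch of the "single UPTR buffer for all gathers"
 * layout: one commands array plus one shared gather buffer, each
 * copied from userspace once.
 */
static int submit_job(int drm_fd, uint32_t *gather_words, uint32_t gather_bytes)
{
        struct drm_tegra_submit_command cmds[1];
        struct drm_tegra_channel_submit submit;

        memset(cmds, 0, sizeof(cmds));
        memset(&submit, 0, sizeof(submit));

        /* One opcode gather, sliced out of the shared gather buffer. */
        cmds[0].type = DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR;
        cmds[0].gather_uptr.length = gather_bytes;

        submit.commands_ptr  = (uintptr_t)cmds;
        submit.commands_size = sizeof(cmds);
        submit.gathers_ptr   = (uintptr_t)gather_words; /* all gathers */
        submit.gathers_size  = gather_bytes;

        /* IOCTL name is hypothetical; the UAPI is not merged. */
        return drmIoctl(drm_fd, DRM_IOCTL_TEGRA_CHANNEL_SUBMIT, &submit);
}
```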

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-28 22:59         ` Dmitry Osipenko
@ 2020-06-30 10:55           ` Mikko Perttunen
  2020-06-30 19:48             ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-06-30 10:55 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding
  Cc: Erik Faye-Lund, David Airlie, gustavo, dri-devel, Jon Hunter,
	talho, bhuntsman, linux-tegra



On 6/29/20 1:59 AM, Dmitry Osipenko wrote:
> 28.06.2020 14:16, Mikko Perttunen wrote:
>> On 6/26/20 7:35 PM, Dmitry Osipenko wrote:
>>> 26.06.2020 10:34, Thierry Reding wrote:
>>>> On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
>>>>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>>>
>>>>>> Make memory accessible by the engine while executing work on the
>>>>>> channel.
>>>>>>
>>>>>> ```
>>>>>> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
>>>>>>
>>>>>> struct drm_tegra_channel_map {
>>>>>>           /*
>>>>>>            * [in] ID of the channel for which to map memory to.
>>>>>>            */
>>>>>>           __u32 channel_id;
>>>>>>           /*
>>>>>>            * [in] GEM handle of the memory to map.
>>>>>>            */
>>>>>>           __u32 handle;
>>>>>>
>>>>>>           /*
>>>>>>            * [in] Offset in GEM handle of the memory area to map.
>>>>>>            *
>>>>>>            * Must be aligned by 4K.
>>>>>>            */
>>>>>>           __u64 offset;
>>>>>
>>>>> Could you please give a use-case example for this partial mapping?
>>>>>
>>>>> I vaguely recalling that maybe it was me who suggested this in the
>>>>> past..
>>>>>
>>>>> I kinda see that this could be useful for a case where userspace
>>>>> allocates a large chunk of memory and then performs sub-allocations in
>>>>> the userspace driver. But do we have a real-world example for this
>>>>> right
>>>>> now?
>>>>
>>>> I think the main point about this IOCTL was to make mapping/unmapping
>>>> more efficient and avoid relocations for situations where we know it is
>>>> safe to do so.
>>>>
>>>> The fact that this can be used to create partial mappings is mostly just
>>>> an added bonus, in my opinion. Doing this doesn't create much complexity
>>>> but in turn gives us a lot more flexibility.
>>>>
>>>> A couple of places where I think this could be useful are OpenGL and
>>>> Vulkan, both of which support buffer suballocation. This has a couple of
>>>> advantages on modern GPUs where sometimes you want to use very large
>>>> allocation granularity, etc.
>>>>
>>>> Now, I don't think that we'll see much of that in Tegra DRM directly,
>>>> although grate could certainly make use of this, I suspect. However, I
>>>> think for interoperation of dGPU and Tegra DRM (with VIC for post-
>>>> processing, or hopefully some of the other hardware acceleration
>>>> engines at some point), this might come in handy.
>>>>
>>>> There are other possible use-cases within just Tegra DRM as well. We may
>>>> want to only partially map planar buffers for video post-processing, for
>>>> example. Or map each plane separately.
>>>>
>>>>> Please see more below.
>>>>>
>>>>>>           /*
>>>>>>            * [in] Length of memory area to map in bytes.
>>>>>>            *
>>>>>>            * Must be aligned by 4K.
>>>>>>            */
>>>>>>           __u64 length;
>>>>>>
>>>>>>           /*
>>>>>>            * [out] IOVA of mapped memory. Userspace can use this IOVA
>>>>>>            *   directly to refer to the memory to skip using
>>>>>> relocations.
>>>>>>            *   Only available if hardware memory isolation is enabled.
>>>>>>            *
>>>>>>            *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>>>>>>            */
>>>>>>           __u64 iova;
>>>>>>
>>>>>>           /*
>>>>>>            * [out] ID corresponding to the mapped memory to be used for
>>>>>>            *   relocations or unmapping.
>>>>>>            */
>>>>>>           __u32 mapping_id;
>>>>>>           /*
>>>>>>            * [in] Flags.
>>>>>>            */
>>>>>>           __u32 flags;
>>>>>>
>>>>>>           __u32 reserved[6];
>>>>>> };
>>>>>
>>>>> It looks to me that we actually need a bit different thing here.
>>>>>
>>>>> This MAP IOCTL maps a portion of a GEM and then returns the mapping_id.
>>>>> And I think we need something more flexible that will allow us to use
>>>>> GEM handles for the relocation IDs, which should fit nicely with the
>>>>> DMA-reservations.
>>>>>
>>>>> What about an IOCTL that wraps GEM into another GEM? We could wrap a
>>>>> portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP
>>>>> IOCTL.
>>>>>
>>>>> It could be something like this:
>>>>>
>>>>> ### DRM_TEGRA_BO_WRAP
>>>>>
>>>>> struct drm_tegra_wrap_bo {
>>>>>      __u32 bo_handle_wrapped; // out
>>>>>      __u32 bo_handle;         // in
>>>>>      __u64 offset;
>>>>>      __u64 length;
>>>>> };
>>>>>
>>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>>
>>>>> struct drm_tegra_channel_map {
>>>>>           __u32 channels_mask;
>>>>>      __u32 mapping_id;
>>>>>           __u32 bo_handle;
>>>>>           __u32 flags;
>>>>>           __u64 iova;
>>>>> };
>>>>>
>>>>> ===
>>>>>
>>>>> This allows multiple mapping_ids to have the same backing GEM, so the
>>>>> mapping_id could be resolved into a BO during of job's submission for
>>>>> the DMA-reservations handling.
>>>>
>>>> That's pretty much what we have already above, isn't it? Just because we
>>>> call the field "mapping_id" doesn't mean that in the background we can't
>>>> create a GEM object and return it's handle as "mapping_id".
>>>>
>>>> One advantage of Mikko's proposal is that we have a single IOCTL rather
>>>> than two to create the mapping, making it a bit more lightweight.
>>>
>>> Thinking a bit more about it, I now changed my mind.
>>>
>>> There is no need to perform implicit fencing on each suballocation,
>>> instead explicit fencing should be used for the suballocations.
>>>
>>> So, we will need to add the relocation flags for the direction and
>>> explicit (or implicit) fencing per-relocation. The direction will tell
>>> how fence should be attached to the BO's DMA-reservation, while the
>>> fence-flag will tell whether job's scheduler should wait for the BO's
>>> reservation before executing job on hardware. This all will be needed
>>> for a proper DRI implementation on older Tegras.
>>>
>>> Actually, during of my experiments with the UAPI, I added both these
>>> flags for the relocations [1], but really used only the direction flag
>>> so far, relying on the implicit fencing.
>>>
>>> [1]
>>> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L894
>>>
>>>
>>> So, let's keep the current variant of this MAP IOCTL as-is.
>>>
>>
>> Let me rephrase to make sure I understand:
>>
>> For relocations, we should add flags for direction and fencing. This way
>> at submit time we can do the proper fencing operations on the relocated
>> BO's DMA reservation.
> 
> Correct
> 
>> This sounds good to me, and I think it makes the "relocation" concept a
>> bit more general than it is currently. I think we could rename it to
>> something like "buffer_usage" (open to bikeshedding), and it can have
>> fence-related flags and relocation-related flags. For newer Tegra chips
>> we don't necessarily need to relocate, but we still may need to handle
>> DMA reservations, so in these cases only the fencing flags would be set.
> 
> Kernel driver already knows whether relocation needs to be patched since
> it returns 0xffffffff by the MAP if patching is needed, and thus,
> patching could be auto-skipped by the driver.
> 
> I don't think that a special flag is required for telling about whether
> relocation needs to be done. The direction and fence flags should be
> enough for now.

Yeah, I guess it's not really required.

> 
> ===
> 
> I'm curios what do you think about another variant:
> 
> In the grate-kernel I'm using a BO table which contains unique BO
> entries for each job's relocation [1] and then IDs of this table and
> target offsets are embedded into the cmdstream in-place of the memory
> addresses [2]. This way, when cmdstream is fully firewalled, we're
> reading the stream's data anyways, and thus, it's quite nice to embed
> the BO table IDs and offsets into cmdstream itself [3].
> 
> In a result:
> 
>   - Each job's BO is resolved only once during submission.
> 
>   - BO table is kept small if cmdstream contains duplicated relocations.
> 
> [1]
> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L892
> 
> [2]
> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L783
> 
> [3]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/patching.c#L43
> 
> Of course this limits the number of BOs per-job. In a case of
> grate-kernel it's max 64 BOs per-job + max 64MB for target offset. I
> guess the BOs number could be lowered to 32 per-job, which should be a
> bit more realistic, and then max target offset will be 128MB.
> 
> So, we could replace the BO table with a mapping table and have the
> MAPPING_TABLE! :) And it doesn't matter whether cmdstream patched or
> not, either way the MAPPING_TABLE will contain mapping ID + flags.
> 
> There are 3 possible variants of how job could be processed, depending
> on hardware generation and driver security configuration:
> 
>    1. Job is fully-parsed and patched.
>    2. Job is patched-only (with relocations).
>    3. Job isn't parsed, nor patched.
> 
> My variant covers 1 and 3. I guess we could just ignore the case 2 for
> now and fall back to 1, for simplicity. It shouldn't be a problem to add
> support for the RELOCS_TABLE in addition to MAPPING_TABLE later on, if
> will be really needed.
> 

I think the bitfield will get a bit tight once we add the 'shift' field 
and 'blocklinear' flag to it :) (which made me notice that apparently my 
original proposal has the define for 
DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR, but doesn't have a flags field 
in the relocation structure, oops!)
Both of those affect the relocation value.

FWIW, we should design option 1 to be functional for new chips as well, 
as we need to use it e.g. if IOMMU is disabled.

Mikko

* Re: [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map)
  2020-06-30 10:55           ` Mikko Perttunen
@ 2020-06-30 19:48             ` Dmitry Osipenko
  0 siblings, 0 replies; 50+ messages in thread
From: Dmitry Osipenko @ 2020-06-30 19:48 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding
  Cc: Erik Faye-Lund, David Airlie, gustavo, dri-devel, Jon Hunter,
	talho, bhuntsman, linux-tegra

30.06.2020 13:55, Mikko Perttunen wrote:
> 
> 
> On 6/29/20 1:59 AM, Dmitry Osipenko wrote:
>> 28.06.2020 14:16, Mikko Perttunen wrote:
>>> On 6/26/20 7:35 PM, Dmitry Osipenko wrote:
>>>> 26.06.2020 10:34, Thierry Reding wrote:
>>>>> On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
>>>>>> 23.06.2020 15:09, Mikko Perttunen wrote:
>>>>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>>>>
>>>>>>> Make memory accessible by the engine while executing work on the
>>>>>>> channel.
>>>>>>>
>>>>>>> ```
>>>>>>> #define DRM_TEGRA_CHANNEL_MAP_READWRITE        (1<<0)
>>>>>>>
>>>>>>> struct drm_tegra_channel_map {
>>>>>>>           /*
>>>>>>>            * [in] ID of the channel for which to map memory to.
>>>>>>>            */
>>>>>>>           __u32 channel_id;
>>>>>>>           /*
>>>>>>>            * [in] GEM handle of the memory to map.
>>>>>>>            */
>>>>>>>           __u32 handle;
>>>>>>>
>>>>>>>           /*
>>>>>>>            * [in] Offset in GEM handle of the memory area to map.
>>>>>>>            *
>>>>>>>            * Must be aligned by 4K.
>>>>>>>            */
>>>>>>>           __u64 offset;
>>>>>>
>>>>>> Could you please give a use-case example for this partial mapping?
>>>>>>
>>>>>> I vaguely recalling that maybe it was me who suggested this in the
>>>>>> past..
>>>>>>
>>>>>> I kinda see that this could be useful for a case where userspace
>>>>>> allocates a large chunk of memory and then performs
>>>>>> sub-allocations in
>>>>>> the userspace driver. But do we have a real-world example for this
>>>>>> right
>>>>>> now?
>>>>>
>>>>> I think the main point about this IOCTL was to make mapping/unmapping
>>>>> more efficient and avoid relocations for situations where we know
>>>>> it is
>>>>> safe to do so.
>>>>>
>>>>> The fact that this can be used to create partial mappings is mostly
>>>>> just
>>>>> an added bonus, in my opinion. Doing this doesn't create much
>>>>> complexity
>>>>> but in turn gives us a lot more flexibility.
>>>>>
>>>>> A couple of places where I think this could be useful are OpenGL and
>>>>> Vulkan, both of which support buffer suballocation. This has a
>>>>> couple of
>>>>> advantages on modern GPUs where sometimes you want to use very large
>>>>> allocation granularity, etc.
>>>>>
>>>>> Now, I don't think that we'll see much of that in Tegra DRM directly,
>>>>> although grate could certainly make use of this, I suspect. However, I
>>>>> think for interoperation of dGPU and Tegra DRM (with VIC for post-
>>>>> processing, or hopefully some of the other hardware acceleration
>>>>> engines at some point), this might come in handy.
>>>>>
>>>>> There are other possible use-cases within just Tegra DRM as well.
>>>>> We may
>>>>> want to only partially map planar buffers for video
>>>>> post-processing, for
>>>>> example. Or map each plane separately.
>>>>>
>>>>>> Please see more below.
>>>>>>
>>>>>>>           /*
>>>>>>>            * [in] Length of memory area to map in bytes.
>>>>>>>            *
>>>>>>>            * Must be aligned by 4K.
>>>>>>>            */
>>>>>>>           __u64 length;
>>>>>>>
>>>>>>>           /*
>>>>>>>            * [out] IOVA of mapped memory. Userspace can use this
>>>>>>> IOVA
>>>>>>>            *   directly to refer to the memory to skip using
>>>>>>> relocations.
>>>>>>>            *   Only available if hardware memory isolation is
>>>>>>> enabled.
>>>>>>>            *
>>>>>>>            *   Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>>>>>>>            */
>>>>>>>           __u64 iova;
>>>>>>>
>>>>>>>           /*
>>>>>>>            * [out] ID corresponding to the mapped memory to be
>>>>>>> used for
>>>>>>>            *   relocations or unmapping.
>>>>>>>            */
>>>>>>>           __u32 mapping_id;
>>>>>>>           /*
>>>>>>>            * [in] Flags.
>>>>>>>            */
>>>>>>>           __u32 flags;
>>>>>>>
>>>>>>>           __u32 reserved[6];
>>>>>>> };
>>>>>>
>>>>>> It looks to me that we actually need a bit different thing here.
>>>>>>
>>>>>> This MAP IOCTL maps a portion of a GEM and then returns the
>>>>>> mapping_id.
>>>>>> And I think we need something more flexible that will allow us to use
>>>>>> GEM handles for the relocation IDs, which should fit nicely with the
>>>>>> DMA-reservations.
>>>>>>
>>>>>> What about an IOCTL that wraps GEM into another GEM? We could wrap a
>>>>>> portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP
>>>>>> IOCTL.
>>>>>>
>>>>>> It could be something like this:
>>>>>>
>>>>>> ### DRM_TEGRA_BO_WRAP
>>>>>>
>>>>>> struct drm_tegra_wrap_bo {
>>>>>>      __u32 bo_handle_wrapped; // out
>>>>>>      __u32 bo_handle;         // in
>>>>>>      __u64 offset;
>>>>>>      __u64 length;
>>>>>> };
>>>>>>
>>>>>> ### DRM_TEGRA_CHANNEL_MAP
>>>>>>
>>>>>> struct drm_tegra_channel_map {
>>>>>>           __u32 channels_mask;
>>>>>>      __u32 mapping_id;
>>>>>>           __u32 bo_handle;
>>>>>>           __u32 flags;
>>>>>>           __u64 iova;
>>>>>> };
>>>>>>
>>>>>> ===
>>>>>>
>>>>>> This allows multiple mapping_ids to have the same backing GEM, so the
>>>>>> mapping_id could be resolved into a BO during of job's submission for
>>>>>> the DMA-reservations handling.
>>>>>
>>>>> That's pretty much what we have already above, isn't it? Just
>>>>> because we
>>>>> call the field "mapping_id" doesn't mean that in the background we
>>>>> can't
>>>>> create a GEM object and return it's handle as "mapping_id".
>>>>>
>>>>> One advantage of Mikko's proposal is that we have a single IOCTL
>>>>> rather
>>>>> than two to create the mapping, making it a bit more lightweight.
>>>>
>>>> Thinking a bit more about it, I now changed my mind.
>>>>
>>>> There is no need to perform implicit fencing on each suballocation,
>>>> instead explicit fencing should be used for the suballocations.
>>>>
>>>> So, we will need to add the relocation flags for the direction and
>>>> explicit (or implicit) fencing per-relocation. The direction will tell
>>>> how fence should be attached to the BO's DMA-reservation, while the
>>>> fence-flag will tell whether job's scheduler should wait for the BO's
>>>> reservation before executing job on hardware. This all will be needed
>>>> for a proper DRI implementation on older Tegras.
>>>>
>>>> Actually, during of my experiments with the UAPI, I added both these
>>>> flags for the relocations [1], but really used only the direction flag
>>>> so far, relying on the implicit fencing.
>>>>
>>>> [1]
>>>> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L894
>>>>
>>>>
>>>>
>>>> So, let's keep the current variant of this MAP IOCTL as-is.
>>>>
>>>
>>> Let me rephrase to make sure I understand:
>>>
>>> For relocations, we should add flags for direction and fencing. This way
>>> at submit time we can do the proper fencing operations on the relocated
>>> BO's DMA reservation.
>>
>> Correct
>>
>>> This sounds good to me, and I think it makes the "relocation" concept a
>>> bit more general than it is currently. I think we could rename it to
>>> something like "buffer_usage" (open to bikeshedding), and it can have
>>> fence-related flags and relocation-related flags. For newer Tegra chips
>>> we don't necessarily need to relocate, but we still may need to handle
>>> DMA reservations, so in these cases only the fencing flags would be set.
>>
>> Kernel driver already knows whether relocation needs to be patched since
>> it returns 0xffffffff by the MAP if patching is needed, and thus,
>> patching could be auto-skipped by the driver.
>>
>> I don't think that a special flag is required for telling about whether
>> relocation needs to be done. The direction and fence flags should be
>> enough for now.
> 
> Yeah, I guess it's not really required.
> 
>>
>> ===
>>
>> I'm curios what do you think about another variant:
>>
>> In the grate-kernel I'm using a BO table which contains unique BO
>> entries for each job's relocation [1] and then IDs of this table and
>> target offsets are embedded into the cmdstream in-place of the memory
>> addresses [2]. This way, when cmdstream is fully firewalled, we're
>> reading the stream's data anyways, and thus, it's quite nice to embed
>> the BO table IDs and offsets into cmdstream itself [3].
>>
>> In a result:
>>
>>   - Each job's BO is resolved only once during submission.
>>
>>   - BO table is kept small if cmdstream contains duplicated relocations.
>>
>> [1]
>> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L892
>>
>>
>> [2]
>> https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm.h#L783
>>
>>
>> [3]
>> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/patching.c#L43
>>
>>
>> Of course this limits the number of BOs per-job. In a case of
>> grate-kernel it's max 64 BOs per-job + max 64MB for target offset. I
>> guess the BOs number could be lowered to 32 per-job, which should be a
>> bit more realistic, and then max target offset will be 128MB.
>>
>> So, we could replace the BO table with a mapping table and have the
>> MAPPING_TABLE! :) And it doesn't matter whether cmdstream patched or
>> not, either way the MAPPING_TABLE will contain mapping ID + flags.
>>
>> There are 3 possible variants of how job could be processed, depending
>> on hardware generation and driver security configuration:
>>
>>    1. Job is fully-parsed and patched.
>>    2. Job is patched-only (with relocations).
>>    3. Job isn't parsed, nor patched.
>>
>> My variant covers 1 and 3. I guess we could just ignore the case 2 for
>> now and fall back to 1, for simplicity. It shouldn't be a problem to add
>> support for the RELOCS_TABLE in addition to MAPPING_TABLE later on, if
>> will be really needed.
>>
> 
> I think the bitfield will get a bit tight once we add the 'shift' field
> and 'blocklinear' flag to it :) (which made me notice that apparently my
> original proposal has the define for
> DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR, but doesn't have a flags field
> in the relocation structure, oops!)
> Both of those affect the relocation value.

The flags will be in the MAPPING_TABLE. If you have two mappings that
use different modifier flags, then they should be two different entries
in the table.

The MAPPING_TABLE is similar to a RELOCATION_TABLE, but it's a bit more
flexible since it allows us to avoid duplicated entries. For example,
if we have a job that copies one area of a mapping to another area of
the same mapping, then it will be a single entry in the table.
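
(A hypothetical entry could look roughly like this; the struct and
field names are invented for illustration and are not part of the
proposal:)

```
/*
 * Hypothetical sketch of a per-job mapping-table entry; the struct and
 * flag names are invented for illustration only.
 */
struct drm_tegra_submit_mapping {
        /* ID returned by DRM_TEGRA_CHANNEL_MAP */
        __u32 mapping_id;

        /*
         * Direction, fencing and modifier flags, e.g. read/write,
         * explicit vs. implicit fencing, blocklinear.
         */
        __u32 flags;
};
```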

IIUC, the shift is always a constant value for each address register,
correct? Hence, the shift value could be looked up from the kernel
driver's table of address registers for each patched relocation during
the patching process.

> FWIW, we should design option 1 to be functional for new chips as well,
> as we need to use it e.g. if IOMMU is disabled.

The firewall could also be useful for debugging purposes. It should
work universally for all hardware versions.

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-06-30 10:26           ` Mikko Perttunen
@ 2020-07-01  0:22             ` Dmitry Osipenko
  2020-07-02 12:10               ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-07-01  0:22 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

30.06.2020 13:26, Mikko Perttunen wrote:
> On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
>>
>> Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync
>> point, correct? Or are they integrated with Host1x HW?
>>
> 
> They can access syncpoints directly. (That's what I alluded to in the
> "Introduction to the hardware" section :) all those things have hardware
> access to syncpoints)

Should we CC all the Nouveau developers then, or is it a bit too early? :)

>>
>> .. rest ..
>>
> 
> Let me try to summarize once more for my own understanding:
> 
> * When submitting a job, you would allocate new syncpoints for the job

- Yes

> * After submitting the job, those syncpoints are not usable anymore

- Yes

Although, thinking a bit more about it, this needs to be relaxed.

How sync points are utilized should be a userspace agreement/policy.

For example, if we know that userspace will have multiple application
instances all using the Tegra DRM UAPI, like the Mesa or VDPAU drivers,
then that userspace should consider returning sync points into the pool
to share them with others. Meanwhile, something like the Opentegra Xorg
driver, which usually has a single instance, could keep sync points
pre-allocated.

The job's sync point counter will be reset to 0 by the kernel driver
during the submission process for each job, so we won't have the sync
point recovery problem.

> * Postfences of that job would keep references to those syncpoints so
> they aren't freed and cleared before the fences have been released

- No

I suggested that the fence shouldn't refcount the sync point and should
*only* have a reference to it; this reference will be invalidated once
the fence is signaled by the sync point reaching the threshold, or once
the sync point is released.

The sync point will have a reference to every active fence (waiting for
the signal) that is using this sync point, until the threshold is
reached.

So a fence can detach itself from the sync point, and the sync point
can detach all the fences from itself.

There will be more about this below; please see the example with a dead
process at the end of this email.

> * Once postfences have been released, syncpoints would be returned to
> the pool and reset to zero

- No

I'm suggesting that a sync point should be returned to the pool once
its usage refcount reaches 0. This means that both the userspace that
created this sync point and the executed job keep the sync point alive,
until it is closed by userspace and the job is completed.

> The advantage of this would be that at any point in time, there would be
> a 1:1 correspondence between allocated syncpoints and jobs; so you could
>  shuffle the jobs around channels or reorder them.

- Yes

> Please correct if I got that wrong :)
> 
> ---
> 
> I have two concerns:
> 
> * A lot of churn on syncpoints - any time you submit a job you might not
> get a syncpoint for an indefinite time. If we allocate syncpoints
> up-front at least you know beforehand, and then you have the syncpoint
> as long as you need it.

If you have a lot of active application instances all allocating sync
points, then inevitably the sync point pool will be exhausted.

But my proposal doesn't differ from yours in this regard, correct?

And maybe there is a nice solution; please see more below!

> * Plumbing the dma-fence/sync_file everywhere, and keeping it alive
> until waits on it have completed, is more work than just having the
> ID/threshold. This is probably mainly a problem for downstream, where
> updating code for this would be difficult. I know that's not a proper
> argument but I hope we can reach something that works for both worlds.

You could have the ID/threshold! :)

But you can't use the *job's* ID/threshold, because you won't know them
until the kernel driver's scheduler has *completed(!)* the job's
execution! The job may be re-pushed multiple times by the scheduler to
a recovered channel if a previous job hangs!

Now, you could allocate *two* sync points:

  1. For the job itself (the job's sync point).

  2. For userspace to wait on (the user's sync point).

The job will have to increment both of these sync points (an example of
multiple sync point usage), and you know the user's sync point
ID/threshold!

If the job times out, you *could* increment the user's sync point on
the CPU from userspace!

The counter of the user's sync point won't be touched by the kernel
driver if the job hangs!
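
(A pseudo-code sketch of that flow, purely for illustration; the helper
names are hypothetical wrappers and do not correspond to existing
UAPI:)

```
/*
 * Pseudo-code sketch of the two-syncpoint scheme; all helpers are
 * hypothetical userspace wrappers. The job is programmed to increment
 * both the job's and the user's sync point when it completes.
 */
submit_job(channel, job_sp_fd, user_sp_fd);

/* Userspace only ever waits on the user's sync point ... */
err = host1x_syncpt_wait(user_sp_id, user_threshold, TIMEOUT_MS);

/*
 * ... and if the job hung, userspace can unblock other waiters by
 * incrementing the user's sync point from the CPU itself.
 */
if (err == -ETIMEDOUT)
        host1x_syncpt_cpu_incr(user_sp_fd);
```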

> Here's a proposal in between:
> 
> * Keep syncpoint allocation and submission in jobs as in my original
> proposal

Yes, we could keep it.

But, as I suggested in my other email, we may want to extend the
allocation IOCTL to support multi-syncpoint allocation.

Secondly, if we want to have multi-syncpoint support for jobs, then we
may want to improve the SUBMIT IOCTL like this:

struct drm_tegra_channel_submit {
        __u32 num_usr_syncpt_incrs;
        __u64 usr_sync_points_ptr;

        __u32 num_job_syncpt_incrs;
        __u32 job_syncpt_handle;
};

If a job doesn't need to increment the user's sync points, then there
is no need to copy them from userspace; hence num_usr_syncpt_incrs
should be 0, i.e. one less copy_from_user() operation.
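
(For illustration, the entries behind usr_sync_points_ptr could look
like the sketch below; the struct name and fields are invented and not
part of the proposal:)

```
/*
 * Hypothetical layout of the array behind usr_sync_points_ptr;
 * invented for illustration only.
 */
struct drm_tegra_submit_usr_syncpt {
        /* Handle of the user's sync point to increment. */
        __u32 syncpt_handle;

        /* Number of times the job increments this sync point. */
        __u32 num_incrs;
};
```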

> * Don't attempt to recover user channel contexts. What this means:
>   * If we have a hardware channel per context (MLOCKing), just tear down
> the channel

!!!

Hmm, we actually should be able to have one sync point per channel for
job submission, similarly to what the current driver does!

I keep forgetting that waitbases exist!

Please read more below.

>   * Otherwise, we can just remove (either by patching or by full
> teardown/resubmit of the channel) all jobs submitted by the user channel
> context that submitted the hanging job. Jobs of other contexts would be
> undisturbed (though potentially delayed, which could be taken into
> account and timeouts adjusted)

The DRM scheduler itself has an assumption/requirement that when a
channel hangs, it must be fully reset. The hung job will be killed by
the scheduler (maybe dependent jobs will be killed too, but I don't
remember the details right now) and then the scheduler will re-submit
jobs to the recovered channel [1].

[1]
https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/scheduler.c#L206

Hence, if we could assign a sync point per channel, then during the
channel's recovery the channel's sync point would be reset as well!
Only the waitbases of the re-submitted jobs would differ!

It also means that userspace won't need to allocate a sync point for
each job!

So far it sounds great! I'll try to think more thoroughly about this.

> * If this happens, we can set removed jobs' post-fences to error status
> and user will have to resubmit them.
> * We should be able to keep the syncpoint refcounting based on fences.

The fence doesn't need the sync point itself; it only needs to get a
signal when the threshold is reached or when the sync point ceases to
exist.

Imagine:

  - Process A creates a sync point
  - Process A creates a dma-fence from this sync point
  - Process A exports the dma-fence to process B
  - Process A dies

What should happen to process B?

  - Should process B's dma-fence get an error signal when process A
dies?
  - Should process B get stuck waiting endlessly for the dma-fence?

This is one example of why I'm proposing that the fence shouldn't be
coupled tightly to a sync point.

> This can be made more fine-grained by not caring about the user channel
> context, but tearing down all jobs with the same syncpoint. I think the
> result would be that we can get either what you described (or how I
> understood it in the summary in the beginning of the message), or a more
> traditional syncpoint-per-userctx workflow, depending on how the
> userspace decides to allocate syncpoints.
> 
> If needed, the kernel can still do e.g. reordering (you mentioned job
> priorities) at syncpoint granularity, which, if the userspace followed
> the model you described, would be the same thing as job granularity.
> 
> (Maybe it would be more difficult with current drm_scheduler, sorry,
> haven't had the time yet to read up on that. Dealing with clearing work
> stuff up before summer vacation)

Please take your time! You will definitely need to take a closer look
at the DRM scheduler.

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-07-01  0:22             ` Dmitry Osipenko
@ 2020-07-02 12:10               ` Mikko Perttunen
  2020-07-07 11:06                 ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-07-02 12:10 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

On 7/1/20 3:22 AM, Dmitry Osipenko wrote:
> 30.06.2020 13:26, Mikko Perttunen wrote:
>> On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
>>>
>>> Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync
>>> point, correct? Or are they integrated with Host1x HW?
>>>
>>
>> They can access syncpoints directly. (That's what I alluded to in the
>> "Introduction to the hardware" section :) all those things have hardware
>> access to syncpoints)
> 
> Should we CC all the Nouveau developers then, or is it a bit too early? :)

I think we have a few other issues still to resolve before that :)

> 
>>>
>>> .. rest ..
>>>
>>
>> Let me try to summarize once more for my own understanding:
>>
>> * When submitting a job, you would allocate new syncpoints for the job
> 
> - Yes
> 
>> * After submitting the job, those syncpoints are not usable anymore
> 
> - Yes
> 
> Although, thinking a bit more about it, this needs to be relaxed.
> 
> It should be a userspace agreement/policy how to utilize sync points.
> 
> For example, if we know that userspace will have multiple application
> instances all using Tegra DRM UAPI, like a mesa or VDPAU drivers, then
> this userspace should consider to return sync points into the pool for
> sharing them with others. While something like an Opentegra Xorg driver,
> which usually has a single instance, could keep sync points pre-allocated.
> 
> The job's sync point counter will be reset to 0 by the kernel driver
> during of the submission process for each job, so we won't have the sync
> point recovery problem.
> 
>> * Postfences of that job would keep references to those syncpoints so
>> they aren't freed and cleared before the fences have been released
> 
> - No
> 
> I suggested that fence shouldn't refcount the sync point and *only* have
> a reference to it, this reference will be invalidated once fence is
> signaled by sync point reaching the threshold or once sync point is
> released.
> 
> The sync point will have a reference to every active fence (waiting for
> the signal) that is using this sync point until the threshold is reached.
> 
> So fence could detach itself from the sync point + sync point could
> detach all the fences from itself.
> 
> There will be more about this below, please see example with a dead
> process in the end of this email.
> 
>> * Once postfences have been released, syncpoints would be returned to
>> the pool and reset to zero
> 
> - No
> 
> I'm suggesting that sync point should be returned to the pool once its
> usage refcount reaches 0. This means that both userspace that created
> this sync point + the executed job will both keep the sync point alive
> until it is closed by userspace + job is completed.
> 
>> The advantage of this would be that at any point in time, there would be
>> a 1:1 correspondence between allocated syncpoints and jobs; so you could
>>   shuffle the jobs around channels or reorder them.
> 
> - Yes
> 
>> Please correct if I got that wrong :)
>>
>> ---
>>
>> I have two concerns:
>>
>> * A lot of churn on syncpoints - any time you submit a job you might not
>> get a syncpoint for an indefinite time. If we allocate syncpoints
>> up-front at least you know beforehand, and then you have the syncpoint
>> as long as you need it.
> 
> If you'll have a lot of active application instances all allocating sync
> points, then inevitably the sync points pool will be exhausted.
> 
> But my proposal doesn't differ from yours in this regards, correct?
> 
> And maybe there is a nice solution, please see more below!
> 
>> * Plumbing the dma-fence/sync_file everywhere, and keeping it alive
>> until waits on it have completed, is more work than just having the
>> ID/threshold. This is probably mainly a problem for downstream, where
>> updating code for this would be difficult. I know that's not a proper
>> argument but I hope we can reach something that works for both worlds.
> 
> You could have ID/threshold! :)
> 
> But, you can't use the *job's* ID/threshold because you won't know them
> until kernel driver scheduler will *complete(!)* the job's execution!
> The job may be re-pushed multiple times by the scheduler to recovered
> channel if a previous jobs hang!
> 
> Now, you could allocate *two* sync points:
> 
>    1. For the job itself (job's sync point).
> 
>    2. For the userspace to wait (user's sync point).
> 
> The job will have to increment both these sync points (example of
> multiple sync points usage) and you know the user's sync point ID/threshold!
> 
> If job times out, you *could* increment the user's sync point on CPU
> from userspace!
> 
> The counter of the user's sync point won't be touched by the kernel
> driver if job hangs!

Ok, so we would have two kinds of syncpoints for the job: one for
kernel job tracking, and one that userspace can manipulate as it wants
to.

Could we handle the job-tracking syncpoint completely inside the
kernel, i.e. allocate it in the kernel during job submission, and add
an increment for it at the end of the job (with condition OP_DONE)? For
MLOCKing, the kernel already needs to insert a SYNCPT_INCR(OP_DONE) +
WAIT + MLOCK_RELEASE sequence at the end of each job.
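
(A pseudo-code sketch of that kernel-appended job tail; the push
helpers, constant and structure names are hypothetical, and the exact
opcodes/methods are hardware-specific and not spelled out here:)

```
/*
 * Pseudo-code sketch of the kernel-managed job tail described above.
 * The push_*() helpers, HOST1X_COND_OP_DONE and struct tegra_drm_job
 * are hypothetical names used for illustration only.
 */
static void tegra_drm_job_append_tail(struct tegra_drm_job *job)
{
        /*
         * The engine increments the kernel's job-tracking syncpoint
         * once all of the job's work has retired (OP_DONE condition).
         */
        push_syncpt_incr(job, job->job_syncpt_id, HOST1X_COND_OP_DONE);

        /*
         * Wait for that increment to land before giving up the MLOCK,
         * so the engine cannot be handed to another channel mid-job.
         */
        push_syncpt_wait(job, job->job_syncpt_id, job->job_syncpt_threshold);
        push_mlock_release(job, job->channel->mlock);
}
```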

> 
>> Here's a proposal in between:
>>
>> * Keep syncpoint allocation and submission in jobs as in my original
>> proposal
> 
> Yes, we could keep it.
> 
> But, as I suggested in my other email, we may want to extend the
> allocation IOCTL for the multi-syncpoint allocation support.
> 
> Secondly, if we'll want to have the multi-syncpoint support for the job,
> then we may want improve the SUBMIT IOCTL like this:
> 
> struct drm_tegra_channel_submit {
>          __u32 num_usr_syncpt_incrs;   /* entries in the user sync point array */
>          __u64 usr_sync_points_ptr;    /* userspace pointer to that array */
> 
>          __u32 num_job_syncpt_incrs;   /* increments of the job-tracking sync point */
>          __u32 job_syncpt_handle;      /* handle of the job-tracking sync point */
> };
> 
> If a job doesn't need to increment the user's sync points, then there is
> no need to copy them from userspace, hence num_usr_syncpt_incrs should be
> 0, i.e. one less copy_from_user() operation.
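> 
> A rough userspace-side sketch of filling this in could look like the
> following (drm_tegra_syncpt_incr and the *_syncpt handle variables are
> purely hypothetical here, just to illustrate the two halves):
> 
> struct drm_tegra_syncpt_incr {           /* hypothetical descriptor */
>         __u32 syncpt_handle;
>         __u32 num_incrs;
> };
> 
> struct drm_tegra_syncpt_incr user_incrs[] = {
>         { .syncpt_handle = user_syncpt, .num_incrs = 1 },
> };
> 
> struct drm_tegra_channel_submit submit = {
>         .num_usr_syncpt_incrs = 1,
>         .usr_sync_points_ptr  = (__u64)(uintptr_t)user_incrs,
>         .num_job_syncpt_incrs = 1,
>         .job_syncpt_handle    = job_syncpt,
> };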
> 
>> * Don't attempt to recover user channel contexts. What this means:
>>    * If we have a hardware channel per context (MLOCKing), just tear down
>> the channel
> 
> !!!
> 
> Hmm, we actually should be able to have one sync point per channel for
> job submission, similarly to what the current driver does!
> 
> I keep forgetting that waitbases exist!

Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs 
anyway, can't we just recalculate wait thresholds at that time?

Maybe a more detailed sequence list or diagram of what happens during 
submission and recovery would be useful.

> 
> Please read more below.
> 
>>    * Otherwise, we can just remove (either by patching or by full
>> teardown/resubmit of the channel) all jobs submitted by the user channel
>> context that submitted the hanging job. Jobs of other contexts would be
>> undisturbed (though potentially delayed, which could be taken into
>> account and timeouts adjusted)
> 
> The DRM scheduler itself has an assumption/requirement that when a channel
> hangs, it must be fully reset. The hung job will be killed by the
> scheduler (maybe dependent jobs will be killed too, but I don't remember
> the details right now) and then the scheduler will re-submit the remaining
> jobs to the recovered channel [1].
> 
> [1]
> https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi/scheduler.c#L206
> 
> Hence, if we could assign a sync point per channel, then during the
> channel's recovery the channel's sync point will be reset as well! Only
> the waitbases of the re-submitted jobs will differ!
> 
> It also means that userspace won't need to allocate a sync point for each job!
> 
> So far it sounds great! I'll try to think more thoroughly about this.
> 
>> * If this happens, we can set removed jobs' post-fences to error status
>> and user will have to resubmit them.
>> * We should be able to keep the syncpoint refcounting based on fences.
> 
> The fence doesn't need the sync point itself; it only needs to get a
> signal when the threshold is reached or when the sync point is destroyed.
> 
> Imagine:
> 
>    - Process A creates sync point
>    - Process A creates dma-fence from this sync point
>    - Process A exports dma-fence to process B
>    - Process A dies
> 
> What should happen to process B?
> 
>    - Should the dma-fence of process B get an error signal when process A
> dies?
>    - Should process B get stuck waiting endlessly for the dma-fence?
> 
> This is one example of why I'm proposing that fence shouldn't be coupled
> tightly to a sync point.

As a baseline, we should consider process B to get stuck endlessly 
(until a timeout of its choosing) for the fence. In this case it is 
avoidable, but if the ID/threshold pair is exported out of the fence and 
is waited for otherwise, it is unavoidable. I.e. once the ID/threshold 
are exported out of a fence, the waiter can only see the fence being 
signaled by the threshold being reached, not by the syncpoint getting freed.
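To be clear, "until a timeout of its choosing" is already expressible with 
plain sync_file semantics -- the waiter can e.g. poll() the fence FD with 
a timeout. A minimal sketch (error handling elided; fence_fd is whatever 
FD the fence was exported as):

```
#include <poll.h>
#include <stdbool.h>

/*
 * Wait for a sync_file/fence FD to signal, bounded by a timeout chosen
 * by the waiter. Returns true if the fence signaled, false on timeout.
 */
static bool wait_fence_fd(int fence_fd, int timeout_ms)
{
        struct pollfd pfd = {
                .fd     = fence_fd,
                .events = POLLIN,   /* sync_file FDs become readable when signaled */
        };

        return poll(&pfd, 1, timeout_ms) == 1;
}
```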

> 
>> This can be made more fine-grained by not caring about the user channel
>> context, but tearing down all jobs with the same syncpoint. I think the
>> result would be that we can get either what you described (or how I
>> understood it in the summary in the beginning of the message), or a more
>> traditional syncpoint-per-userctx workflow, depending on how the
>> userspace decides to allocate syncpoints.
>>
>> If needed, the kernel can still do e.g. reordering (you mentioned job
>> priorities) at syncpoint granularity, which, if the userspace followed
>> the model you described, would be the same thing as job granularity.
>>
>> (Maybe it would be more difficult with the current drm_scheduler, sorry,
>> haven't had the time yet to read up on that. Busy clearing up work
>> stuff before summer vacation.)
> 
> Please take your time! You will definitely need to take a closer look at
> the DRM scheduler.
> 

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-07-02 12:10               ` Mikko Perttunen
@ 2020-07-07 11:06                 ` Dmitry Osipenko
  2020-07-08 10:06                   ` Mikko Perttunen
  0 siblings, 1 reply; 50+ messages in thread
From: Dmitry Osipenko @ 2020-07-07 11:06 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

02.07.2020 15:10, Mikko Perttunen wrote:
> Ok, so we would have two kinds of syncpoints for the job; one
> for kernel job tracking; and one that userspace can
> manipulate as it wants to.
> 
> Could we handle the job tracking syncpoint completely inside the kernel,
> i.e. allocate it in kernel during job submission, and add an increment
> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
> MLOCK_RELEASE sequence at the end of each job.

If the sync point is allocated within the kernel, then we'll always need to
patch all of the job's sync point increments with the ID of the allocated sync
point, regardless of whether the firewall is enabled or not.
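
I.e. something along these lines would have to run for every single job
(the parsed-command layout here is a simplified placeholder, not the real
host1x opcode format):

```
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified placeholder for a decoded command word. */
struct parsed_cmd {
        bool     is_syncpt_incr;
        uint32_t syncpt_id;     /* ID field that has to be rewritten */
};

/*
 * Rewrite every sync point increment in the job's command stream to use
 * the sync point that the kernel allocated for this job.
 */
static void patch_job_syncpt(struct parsed_cmd *cmds, size_t num_cmds,
                             uint32_t kernel_syncpt_id)
{
        for (size_t i = 0; i < num_cmds; i++)
                if (cmds[i].is_syncpt_incr)
                        cmds[i].syncpt_id = kernel_syncpt_id;
}
```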

Secondly, I'm now recalling that only one sync point could be assigned
to a channel at a time on newer Tegras which support sync point
protection. So it sounds like we don't really have variants other than
to allocate one sync point per channel for the jobs usage if we want to
be able to push multiple jobs into channel's pushbuffer, correct?

...
>> Hmm, we actually should be able to have one sync point per channel for
>> job submission, similarly to what the current driver does!
>>
>> I keep forgetting that waitbases exist!
> 
> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
> anyway, can't we just recalculate wait thresholds at that time?

Yes, thresholds could be recalculated and the job re-formed at push time
then.

It also means that jobs should always be formed only at push time if a
wait command is used in the cmdstream, since the waits always need to be
patched because we won't know the thresholds until the scheduler actually
runs the job.
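
Roughly like this (placeholder structures again, just to show where the
push-time patching happens):

```
#include <stddef.h>
#include <stdint.h>

/* Placeholder job description; only the fields needed for this sketch. */
struct job {
        uint32_t  num_incrs;        /* sync point increments the job performs */
        uint32_t *wait_thresholds;  /* intra-stream wait thresholds, relative to job start */
        size_t    num_waits;
};

/*
 * Only at push time do we know the current value of the channel's job
 * sync point, so the absolute wait thresholds are computed and patched
 * right before the job enters the pushbuffer. Returns the job's own
 * completion threshold.
 */
static uint32_t form_job_at_push_time(struct job *job, uint32_t syncpt_value)
{
        for (size_t i = 0; i < job->num_waits; i++)
                job->wait_thresholds[i] += syncpt_value;  /* relative -> absolute */

        return syncpt_value + job->num_incrs;
}
```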

> Maybe a more detailed sequence list or diagram of what happens during
> submission and recovery would be useful.

The textual form + code is already good enough for me. A diagram would be
nice to have, although it may take a bit too much effort to create and
maintain. But I don't mind at all if you want to make one :)

...
>>> * We should be able to keep the syncpoint refcounting based on fences.
>>
>> The fence doesn't need the sync point itself; it only needs to get a
>> signal when the threshold is reached or when the sync point is destroyed.
>>
>> Imagine:
>>
>>    - Process A creates sync point
>>    - Process A creates dma-fence from this sync point
>>    - Process A exports dma-fence to process B
>>    - Process A dies
>>
>> What should happen to process B?
>>
>>    - Should the dma-fence of process B get an error signal when process A
>> dies?
>>    - Should process B get stuck waiting endlessly for the dma-fence?
>>
>> This is one example of why I'm proposing that fence shouldn't be coupled
>> tightly to a sync point.
> 
> As a baseline, we should consider process B to get stuck endlessly
> (until a timeout of its choosing) for the fence. In this case it is
> avoidable, but if the ID/threshold pair is exported out of the fence and
> is waited for otherwise, it is unavoidable. I.e. once the ID/threshold
> are exported out of a fence, the waiter can only see the fence being
> signaled by the threshold being reached, not by the syncpoint getting
> freed.

This is correct. If the sync point's FD is exported, or once a sync point is
resolved from a dma-fence, then the sync point will stay alive until the
last reference to it is dropped. I.e. if process A dies *after* process B
has started to wait on the sync point, then process B will get stuck.
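
In kernel terms this is plain refcounting, roughly (a sketch only, not
the actual host1x structures):

```
#include <linux/kref.h>
#include <linux/slab.h>

/* Sketch only -- not the actual host1x structures. */
struct host1x_syncpt_ref {
        struct kref kref;
        unsigned int id;
};

static void syncpt_release(struct kref *kref)
{
        struct host1x_syncpt_ref *sp =
                container_of(kref, struct host1x_syncpt_ref, kref);

        /* here the hardware sync point would go back to the pool */
        kfree(sp);
}

/* Each exported FD and each resolved dma-fence takes one reference... */
static void syncpt_get(struct host1x_syncpt_ref *sp)
{
        kref_get(&sp->kref);
}

/* ...and the sync point is freed only when the last reference drops. */
static void syncpt_put(struct host1x_syncpt_ref *sp)
{
        kref_put(&sp->kref, syncpt_release);
}
```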

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-07-07 11:06                 ` Dmitry Osipenko
@ 2020-07-08 10:06                   ` Mikko Perttunen
  2020-07-09  9:28                     ` Dmitry Osipenko
  0 siblings, 1 reply; 50+ messages in thread
From: Mikko Perttunen @ 2020-07-08 10:06 UTC (permalink / raw)
  To: Dmitry Osipenko, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
> 02.07.2020 15:10, Mikko Perttunen wrote:
>> Ok, so we would have two kinds of syncpoints for the job; one
>> for kernel job tracking; and one that userspace can
>> manipulate as it wants to.
>>
>> Could we handle the job tracking syncpoint completely inside the kernel,
>> i.e. allocate it in kernel during job submission, and add an increment
>> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
>> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
>> MLOCK_RELEASE sequence at the end of each job.
> 
> If the sync point is allocated within the kernel, then we'll always need to
> patch all of the job's sync point increments with the ID of the allocated sync
> point, regardless of whether the firewall is enabled or not.

The idea was that the job tracking increment would also be added to the 
pushbuffer in the kernel, so gathers would only have increments for the 
"user syncpoints", if any. I think this should work for THI-based 
engines (e.g. VIC), you probably have better information about 
GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to 
prevent the gathers from incrementing these job tracking syncpoints.
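
To illustrate the split (informal mnemonics only, not real opcodes):

```
/*
 * Per-job pushbuffer contents as built by the kernel:
 *
 *   GATHER(user cmdbuf)                <- user increments touch only "user syncpoints"
 *   ...
 *   GATHER(user cmdbuf)
 *   SYNCPT_INCR(job_syncpt, OP_DONE)   <- job tracking, appended by the kernel
 *
 * With CHANNEL/APPID protection enabled, job_syncpt can only be
 * incremented from the pushbuffer/kernel side, so the gathers cannot
 * bump it even if they try.
 */
```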

> 
> Secondly, I'm now recalling that only one sync point could be assigned
> to a channel at a time on newer Tegras which support sync point
> protection. So it sounds like we don't really have variants other than
> to allocate one sync point per channel for the jobs usage if we want to
> be able to push multiple jobs into channel's pushbuffer, correct?
> 

The other way around; each syncpoint can be assigned to one channel. One 
channel may have multiple syncpoints.

> ...
>>> Hmm, we actually should be able to have one sync point per channel for
>>> job submission, similarly to what the current driver does!
>>>
>>> I keep forgetting that waitbases exist!
>>
>> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
>> anyway, can't we just recalculate wait thresholds at that time?
> 
> Yes, thresholds could be recalculated and the job re-formed at push time
> then.
> 
> It also means that jobs should always be formed only at push time if a
> wait command is used in the cmdstream, since the waits always need to be
> patched because we won't know the thresholds until the scheduler actually
> runs the job.

Could we keep the job tracking syncpoints entirely within the kernel, 
and have all wait commands and other userspace activity use the user 
syncpoints? Then kernel job tracking and userspace activity would be 
separate from each other.

Alternatively, if we know that jobs can only be removed from the middle 
of pushbuffers, and not added, we could replace any removed jobs with 
syncpoint increments in the pushbuffer and any thresholds would stay intact.
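
Roughly like this (placeholder encoding; the point is only that the
removed job's slot still performs the same number of increments, so later
thresholds remain valid):

```
#include <stddef.h>
#include <stdint.h>

/* Placeholder encoders -- not the real host1x opcode formats. */
static uint32_t op_syncpt_incr_imm(uint32_t id) { return (1u << 28) | id; }
static uint32_t op_nop(void)                    { return 0; }

/*
 * Overwrite a cancelled job's pushbuffer slot with plain sync point
 * increments followed by NOP padding. The slot still increments the
 * syncpoint job_num_incrs times, so the wait thresholds of all later
 * jobs stay intact.
 */
static void nop_out_job(uint32_t *slot, size_t slot_words,
                        uint32_t job_syncpt, uint32_t job_num_incrs)
{
        size_t i = 0;

        while (i < job_num_incrs && i < slot_words)
                slot[i++] = op_syncpt_incr_imm(job_syncpt);

        while (i < slot_words)
                slot[i++] = op_nop();
}
```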

> 
>> Maybe a more detailed sequence list or diagram of what happens during
>> submission and recovery would be useful.
> 
> The textual form + code is already good enough for me. A diagram would be
> nice to have, although it may take a bit too much effort to create and
> maintain. But I don't mind at all if you want to make one :)
> 
> ...
>>>> * We should be able to keep the syncpoint refcounting based on fences.
>>>
>>> The fence doesn't need the sync point itself; it only needs to get a
>>> signal when the threshold is reached or when the sync point is destroyed.
>>>
>>> Imagine:
>>>
>>>     - Process A creates sync point
>>>     - Process A creates dma-fence from this sync point
>>>     - Process A exports dma-fence to process B
>>>     - Process A dies
>>>
>>> What should happen to process B?
>>>
>>>     - Should the dma-fence of process B get an error signal when process A
>>> dies?
>>>     - Should process B get stuck waiting endlessly for the dma-fence?
>>>
>>> This is one example of why I'm proposing that fence shouldn't be coupled
>>> tightly to a sync point.
>>
>> As a baseline, we should consider process B to get stuck endlessly
>> (until a timeout of its choosing) for the fence. In this case it is
>> avoidable, but if the ID/threshold pair is exported out of the fence and
>> is waited for otherwise, it is unavoidable. I.e. once the ID/threshold
>> are exported out of a fence, the waiter can only see the fence being
>> signaled by the threshold being reached, not by the syncpoint getting
>> freed.
> 
> This is correct. If the sync point's FD is exported, or once a sync point is
> resolved from a dma-fence, then the sync point will stay alive until the
> last reference to it is dropped. I.e. if process A dies *after* process B
> has started to wait on the sync point, then process B will get stuck.
> 

* Re: [RFC] Host1x/TegraDRM UAPI (sync points)
  2020-07-08 10:06                   ` Mikko Perttunen
@ 2020-07-09  9:28                     ` Dmitry Osipenko
  0 siblings, 0 replies; 50+ messages in thread
From: Dmitry Osipenko @ 2020-07-09  9:28 UTC (permalink / raw)
  To: Mikko Perttunen, Thierry Reding, Jon Hunter, David Airlie,
	Daniel Vetter, sumit.semwal, gustavo
  Cc: linux-tegra, talho, bhuntsman, dri-devel, Erik Faye-Lund

08.07.2020 13:06, Mikko Perttunen wrote:
> On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
>> 02.07.2020 15:10, Mikko Perttunen wrote:
>>> Ok, so we would have two kinds of syncpoints for the job; one
>>> for kernel job tracking; and one that userspace can
>>> manipulate as it wants to.
>>>
>>> Could we handle the job tracking syncpoint completely inside the kernel,
>>> i.e. allocate it in kernel during job submission, and add an increment
>>> for it at the end of the job (with condition OP_DONE)? For MLOCKing, the
>>> kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT +
>>> MLOCK_RELEASE sequence at the end of each job.
>>
>> If the sync point is allocated within the kernel, then we'll always need to
>> patch all of the job's sync point increments with the ID of the allocated sync
>> point, regardless of whether the firewall is enabled or not.
> 
> The idea was that the job tracking increment would also be added to the
> pushbuffer in the kernel, so gathers would only have increments for the
> "user syncpoints", if any. I think this should work for THI-based
> engines (e.g. VIC), you probably have better information about
> GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to
> prevent the gathers from incrementing these job tracking syncpoints.

Could you please clarify what THI is?

>> Secondly, I'm now recalling that only one sync point could be assigned
>> to a channel at a time on newer Tegras which support sync point
>> protection. So it sounds like we don't really have variants other than
>> to allocate one sync point per channel for the jobs usage if we want to
>> be able to push multiple jobs into channel's pushbuffer, correct?
>>
> 
> The other way around; each syncpoint can be assigned to one channel. One
> channel may have multiple syncpoints.

Okay! So we actually could use a single sync point per-job for user
increments + job tracking, correct?

>> ...
>>>> Hmm, we actually should be able to have one sync point per channel
>>>> for job submission, similarly to what the current driver does!
>>>>
>>>> I keep forgetting that waitbases exist!
>>>
>>> Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs
>>> anyway, can't we just recalculate wait thresholds at that time?
>>
>> Yes, thresholds could be recalculated and the job re-formed at push time
>> then.
>>
>> It also means that jobs should always be formed only at push time if a
>> wait command is used in the cmdstream, since the waits always need to be
>> patched because we won't know the thresholds until the scheduler actually
>> runs the job.
> 
> Could we keep the job tracking syncpoints entirely within the kernel,
> and have all wait commands and other userspace activity use the user
> syncpoints? Then kernel job tracking and userspace activity would be
> separate from each other.

I think this should work, but it's not clear to me what benefits this
brings us if we could re-use the same user-allocated sync point for
both user increments and kernel job tracking.

Is the idea to protect against userspace incrementing the sync point too
many times? I.e. a job could be treated as completed before CDMA is
actually done with it.

> Alternatively, if we know that jobs can only be removed from the middle
> of pushbuffers, and not added, we could replace any removed jobs with
> syncpoint increments in the pushbuffer and any thresholds would stay
> intact.

It's a bit unlikely that jobs could ever be removed from the middle of the
hardware queue by the DRM scheduler.

Anyway, it would be nicer to shoot down everything and restart with a
clean slate when necessary, as the scheduler is already supposed to do.


Thread overview: 50+ messages
2020-06-23 12:09 [RFC] Host1x/TegraDRM UAPI Mikko Perttunen
2020-06-24 20:55 ` Dmitry Osipenko
2020-06-25  9:17   ` Mikko Perttunen
2020-06-24 22:33 ` Dmitry Osipenko
2020-06-25  9:27   ` Mikko Perttunen
2020-06-25 22:50     ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_relocation) Dmitry Osipenko
2020-06-26  9:01       ` Mikko Perttunen
2020-06-24 23:11 ` [RFC] Host1x/TegraDRM UAPI Dmitry Osipenko
2020-06-25  9:16   ` Mikko Perttunen
2020-06-25 23:24     ` Dmitry Osipenko
2020-06-26  9:05       ` Mikko Perttunen
2020-06-24 23:18 ` Dmitry Osipenko
2020-06-25  0:59   ` Dmitry Osipenko
2020-06-24 23:23 ` Dmitry Osipenko
2020-06-25  9:19   ` Mikko Perttunen
2020-06-25  0:47 ` Dmitry Osipenko
2020-06-25  9:23   ` Mikko Perttunen
2020-06-25 22:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_channel_map) Dmitry Osipenko
2020-06-26  7:34   ` Thierry Reding
2020-06-26 16:35     ` Dmitry Osipenko
2020-06-28 11:16       ` Mikko Perttunen
2020-06-28 22:59         ` Dmitry Osipenko
2020-06-30 10:55           ` Mikko Perttunen
2020-06-30 19:48             ` Dmitry Osipenko
2020-06-26 11:06 ` [RFC] Host1x/TegraDRM UAPI Karol Herbst
2020-06-26 11:13   ` Mikko Perttunen
2020-06-26 11:16     ` Karol Herbst
2020-06-26 11:40   ` Thierry Reding
2020-06-26 13:38     ` Daniel Vetter
2020-06-26 13:59       ` Dmitry Osipenko
2020-06-30  9:09         ` Daniel Vetter
2020-06-27 21:47 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_syncpt_incr) Dmitry Osipenko
2020-06-28 11:10   ` Mikko Perttunen
2020-06-27 22:32 ` [RFC] Host1x/TegraDRM UAPI (drm_tegra_submit_command) Dmitry Osipenko
2020-06-28 10:28   ` Mikko Perttunen
2020-06-29  0:00     ` Dmitry Osipenko
2020-06-30 10:40       ` Mikko Perttunen
2020-06-27 23:27 ` [RFC] Host1x/TegraDRM UAPI (sync points) Dmitry Osipenko
2020-06-28  9:44   ` Mikko Perttunen
2020-06-29  2:36     ` Dmitry Osipenko
2020-06-29 10:27       ` Mikko Perttunen
2020-06-29 19:28         ` Dmitry Osipenko
2020-06-29 19:33         ` Dmitry Osipenko
2020-06-29 19:42         ` Dmitry Osipenko
2020-06-30 10:26           ` Mikko Perttunen
2020-07-01  0:22             ` Dmitry Osipenko
2020-07-02 12:10               ` Mikko Perttunen
2020-07-07 11:06                 ` Dmitry Osipenko
2020-07-08 10:06                   ` Mikko Perttunen
2020-07-09  9:28                     ` Dmitry Osipenko
